POST · 04 CLUSTER / AI AGENTS FIELD NOTES · 8 MIN READ APRIL 2026

What makes an agent actually production-ready.

Before we ship any agent, we grade it against the same eleven checkpoints. Most demo-ware agents pass three of them. Anything that passes fewer than seven fails a real engagement inside thirty days. Here's the list.

01 · Idempotence

Running the agent twice on the same input produces the same external side effects as running it once. If you re-process a batch after a crash, you don't send the same email twice, create duplicate CRM rows, or bill a customer twice. Every external action keys off a deterministic ID stored in your database before the call fires.
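
A minimal sketch of that pattern, assuming a SQLite table as the database and a hypothetical `send` callable standing in for the external action. The `actions` table, `action_id`, and `run_once` are illustrative names, not part of any particular framework.

```python
import hashlib
import sqlite3

def action_id(kind: str, payload: str) -> str:
    """Deterministic ID: the same action and payload always hash to the same key."""
    return hashlib.sha256(f"{kind}:{payload}".encode()).hexdigest()

def run_once(db: sqlite3.Connection, kind: str, payload: str, send) -> bool:
    """Store the action ID first; fire the external call only if the ID was new."""
    key = action_id(kind, payload)
    try:
        with db:  # commits on success, rolls back on error
            db.execute("INSERT INTO actions (id) VALUES (?)", (key,))
    except sqlite3.IntegrityError:
        return False  # ID already recorded, so this is a replay: skip the call
    send(payload)
    return True

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE actions (id TEXT PRIMARY KEY)")
sent = []
first = run_once(db, "email", "welcome:user-42", sent.append)
second = run_once(db, "email", "welcome:user-42", sent.append)  # replay after a crash
```

Recording the ID before the call gives at-most-once behavior: if `send` itself fails, the action is marked but not delivered, so a fuller version would also track a status column and reconcile unfinished rows.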

02 · Observability

You can look at any single past run and see: inputs, prompt, model response, tool calls, tool responses, final output, cost. No "it worked a minute ago" debugging. Structured logs beat token traces; both beat nothing.
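
One way to get there is a single structured record per run, written as one JSON line. The sketch below is an assumption about shape, not a standard: `RunRecord` and its field names are hypothetical and simply mirror the list above.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class RunRecord:
    run_id: str
    inputs: dict
    prompt: str
    model_response: str = ""
    tool_calls: list = field(default_factory=list)  # each item: name, args, response
    final_output: str = ""
    cost_usd: float = 0.0

    def to_log_line(self) -> str:
        """Serialize the whole run as one JSON line for a structured log sink."""
        return json.dumps(asdict(self), sort_keys=True)

record = RunRecord(run_id="run-001", inputs={"lead_id": 7}, prompt="Summarize lead 7")
record.tool_calls.append(
    {"name": "crm_lookup", "args": {"lead_id": 7}, "response": {"status": "open"}}
)
record.model_response = "Lead 7 is open."
record.final_output = "Lead 7 is open."
record.cost_usd = 0.0021
line = record.to_log_line()
```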

03 · Explicit scope

The agent has a written one-paragraph job description that an engineer can show a lawyer. It says what the agent does, what it reads, what it writes, what it will refuse. If you can't produce that paragraph, the agent will over-reach in month two.

04 · Human-in-the-loop gate

Any action with an irreversible external consequence — emails, CRM updates that delete data, charges, messages to customers — passes through a review surface at least until the agent has produced a few hundred approved outputs. The gate is explicit, not implied. "Nobody is watching the dashboard" is not a gate.
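
A sketch of an explicit gate, under stated assumptions: the action names in `IRREVERSIBLE` are hypothetical, and the threshold of 200 mirrors the "first 200 runs" label in the checkpoint figure and should be treated as tunable.

```python
from dataclasses import dataclass, field

IRREVERSIBLE = {"send_email", "delete_crm_record", "charge_card", "message_customer"}
APPROVALS_REQUIRED = 200  # assumed threshold; tune per engagement

@dataclass
class ReviewGate:
    approved_count: int = 0
    pending: list = field(default_factory=list)

    def submit(self, action: str, payload: dict, execute) -> str:
        """Queue irreversible actions for review until enough have been approved."""
        if action in IRREVERSIBLE and self.approved_count < APPROVALS_REQUIRED:
            self.pending.append((action, payload, execute))
            return "pending_review"
        execute(payload)
        return "executed"

    def approve_next(self) -> None:
        """A human reviewer approves the oldest pending action, which then runs."""
        action, payload, execute = self.pending.pop(0)
        execute(payload)
        self.approved_count += 1
```

Nothing irreversible runs from `submit` while the gate is active; the only path to execution is a person calling `approve_next`.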

FIG. 01 / THE 11-CHECKPOINT GRADE · FULL PASS BEFORE SHIP
Checkpoints 01 to 11 are each graded STANDARD, except 04 · Human gate: NON-NEGOTIABLE · FIRST 200 RUNS.

05 · Timeout and retry

Every external call has a timeout and a retry policy. No unbounded hangs. Temporal, Inngest, or a plain state machine — pick one. "Hope it comes back" is not a strategy.
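
For the plain-state-machine end of that spectrum, a bounded retry loop is enough to illustrate the idea. This is a sketch, not a replacement for a workflow engine: `call_with_retry` is a hypothetical helper, and `fn` is assumed to pass the timeout through to whatever client it wraps.

```python
import time

def call_with_retry(fn, *, attempts=3, timeout_s=5.0, base_delay_s=0.5, sleep=time.sleep):
    """Call fn(timeout_s); retry on timeout or connection errors with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn(timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    # Bounded: after the last attempt we fail loudly instead of hanging.
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting.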

06 · Rate control

The agent respects rate limits of every downstream tool, with a ceiling on its own throughput. A misconfigured loop that sends 10,000 requests per minute to a partner API will get you banned before lunch. Bucket your sends; back off on 429s.
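
Bucketing and backoff can be sketched with a token bucket plus a delay function for 429 responses. The class and function names, the rates, and the 60-second cap are illustrative assumptions; real limits come from each downstream tool's documentation.

```python
import time
from typing import Optional

class TokenBucket:
    """Allow at most `capacity` sends in a burst, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def backoff_delay(attempt: int, retry_after_s: Optional[float] = None,
                  base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Delay after a 429: use the server's Retry-After if given, else capped exponential."""
    if retry_after_s is not None:
        return retry_after_s
    return min(cap_s, base_s * 2 ** attempt)
```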

07 · Secrets hygiene

API keys, tokens, and customer data never appear in prompts or logs. If your prompt template includes a literal API key, your logging pipeline is now a credential exfil. We've seen this twice in client audits.
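
The primary fix is to keep secrets out of prompt templates entirely. As a backstop, a redacting log filter can catch what slips through. The sketch below uses Python's standard `logging.Filter`; the two regular expressions are assumed key shapes for illustration, not a complete list.

```python
import logging
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{16,}"),        # assumed API key shape
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]+"),  # bearer tokens copied from headers
]

def redact(text: str) -> str:
    """Replace anything matching a known secret pattern with a marker."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

class RedactingFilter(logging.Filter):
    """Scrub the fully formatted message before any handler sees it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = None  # message is already formatted
        return True

logger = logging.getLogger("agent")
logger.addFilter(RedactingFilter())
```

Pattern matching only catches secrets whose shape you anticipated, which is why it is a second line of defense and not the checkpoint itself.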

08 · Cost ceiling

A daily maximum on inference spend. A monthly maximum on third-party API spend. Alerts before ceiling, hard stops at ceiling. The number of agents that have run away with $10k of tokens before anyone noticed is higher than anyone admits.
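
The alert-then-hard-stop behavior can be sketched as a small cap object checked before every paid call. `SpendCap`, `BudgetExceeded`, and the 80% alert threshold are hypothetical choices for illustration; resetting the counter daily or monthly is left to a scheduler.

```python
from dataclasses import dataclass

class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the ceiling."""

@dataclass
class SpendCap:
    limit_usd: float
    alert_fraction: float = 0.8  # assumed: alert at 80% of the ceiling
    spent_usd: float = 0.0

    def charge(self, amount_usd: float) -> str:
        """Record spend; refuse (hard stop) if it would exceed the limit."""
        if self.spent_usd + amount_usd > self.limit_usd:
            raise BudgetExceeded(
                f"charge of {amount_usd} would exceed cap of {self.limit_usd}"
            )
        self.spent_usd += amount_usd
        if self.spent_usd >= self.alert_fraction * self.limit_usd:
            return "alert"
        return "ok"

daily_inference = SpendCap(limit_usd=10.0)
```

One instance per agent per window (daily inference, monthly third-party API) keeps the two ceilings independent.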

09 · Graceful degradation

The model is down. The third-party API is down. What does the agent do? "Error" is not an answer. It should return a structured failure, queue the work for retry, and surface an explicit notice to the human: "agent paused, will resume in 10 minutes." Nothing silently stops.
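
Those three behaviors fit in one wrapper around the model call. This is a sketch under assumptions: `run_step`, the in-memory `retry_queue`, and the result keys are hypothetical, and a production version would use a durable queue.

```python
from collections import deque

retry_queue = deque()  # stand-in for a durable queue

def run_step(task, call_model, retry_in_s=600):
    """Run one step; on an outage, return a structured failure and queue the task."""
    try:
        return {"status": "ok", "output": call_model(task)}
    except (TimeoutError, ConnectionError) as exc:
        retry_queue.append(task)  # the work is kept, not dropped
        minutes = retry_in_s // 60
        return {
            "status": "paused",
            "reason": type(exc).__name__,
            "retry_in_s": retry_in_s,
            "notice": f"agent paused, will resume in {minutes} minutes",
        }
```

The caller always gets a dictionary with a `status`, so the operator-facing surface can show the notice instead of a stack trace.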

10 · Kill switch

One operator, usually the founder or the ops lead, can stop the agent right now, without deploying code. A database flag, a feature flag, an environment variable. The moment you doubt what it's doing, you need to be able to flip it off in ten seconds.
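
The database-flag variant can be sketched as a check before every unit of work. The `agent_flags` table and the function names are assumptions for illustration; the important property is that the flag is re-read on each iteration and a missing row counts as off.

```python
import sqlite3

def agent_enabled(db: sqlite3.Connection, agent: str) -> bool:
    """Read the flag fresh each time; an unknown agent fails closed."""
    row = db.execute(
        "SELECT enabled FROM agent_flags WHERE agent = ?", (agent,)
    ).fetchone()
    return bool(row and row[0])

def process(db: sqlite3.Connection, agent: str, tasks, handle) -> int:
    """Handle tasks one at a time, stopping as soon as the flag is off."""
    done = 0
    for task in tasks:
        if not agent_enabled(db, agent):
            break  # operator flipped the switch; stop before the next action
        handle(task)
        done += 1
    return done
```

Flipping the switch is then a single `UPDATE agent_flags SET enabled = 0 WHERE agent = ...`, with no deploy involved.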

11 · Dated corpus

Any knowledge the agent relies on is dated — the date the claim was pulled, the date it was last verified. Stale data is worse than no data because the agent confidently states wrong things. We have seen agents tell prospects about competitors that acquired them six months ago.
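
In data terms, that means every fact carries both dates and the agent filters on the verification date before using it. A minimal sketch: `Fact`, `usable`, and the 90-day freshness window are assumptions, and the right window depends on how fast the underlying domain changes.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class Fact:
    claim: str
    pulled_on: date    # when the claim was first collected
    verified_on: date  # when it was last confirmed to still be true

def usable(fact: Fact, today: date, max_age: timedelta = timedelta(days=90)) -> bool:
    """A fact is usable only if it was verified within the freshness window."""
    return today - fact.verified_on <= max_age

today = date(2026, 4, 1)
fresh = Fact("Vendor A offers an EU region", date(2025, 11, 1), date(2026, 3, 1))
stale = Fact("Vendor B is independent", date(2025, 6, 1), date(2025, 9, 1))
corpus = [f for f in (fresh, stale) if usable(f, today)]
```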

Why eleven, not five

The first five look like software engineering hygiene. They are. The last six look specific to agents, and they are the ones that burn clients who thought "agents are just code." The probability of catastrophe is small with any one checkpoint missing, meaningful with any two, and embarrassing with any three. These eleven are the gaps that have actually caught us since we started shipping agents.

How we use the list

Before any client-facing ship, one of us runs the eleven checkpoints top-to-bottom with the author of the agent sitting next to us. Unchecked items either get fixed or get explicitly deferred with a risk note and a kill-switch path. No checkpoint gets silently skipped. Every one has a one-paragraph answer logged to the engagement's readme.
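
The "fixed or explicitly deferred" rule is simple enough to check mechanically. The sketch below is one hypothetical encoding: the checkpoint keys, the `status` values, and the `risk_note` and `kill_switch_path` fields are illustrative names, not a format we prescribe.

```python
CHECKPOINTS = [
    "idempotence", "observability", "explicit_scope", "human_gate",
    "timeout_retry", "rate_control", "secrets_hygiene", "cost_ceiling",
    "graceful_degrade", "kill_switch", "dated_corpus",
]

def blockers(answers: dict) -> list:
    """Return checkpoints that block a ship: skipped, failed, or deferred without a note."""
    blocked = []
    for name in CHECKPOINTS:
        entry = answers.get(name)
        if entry is None:
            blocked.append(name)  # silently skipped
        elif entry["status"] == "deferred" and not (
            entry.get("risk_note") and entry.get("kill_switch_path")
        ):
            blocked.append(name)  # deferred, but not explicitly
        elif entry["status"] not in ("pass", "deferred"):
            blocked.append(name)  # failed or unknown status
    return blocked
```

An empty list means every checkpoint has an answer on record, which is the condition for shipping.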

This is the difference between an agent that runs in demo and an agent that runs on a Sunday at 3am while everyone is asleep. The latter is what founders are actually paying for.

Most of the "agent failures" we've been called in to diagnose are not model failures. They are missing checkpoints. The model is usually fine; the engineering around it is not. If your agent makes it through these eleven, ship it. If it doesn't, do not.
