What makes an agent actually production-ready.
We grade every agent before we ship it against the same eleven checkpoints. Most demo-ware agents pass three of them. Anything below seven fails a real engagement inside thirty days. Here's the list.
01 · Idempotence
Running the agent twice on the same input produces the same side effects as running it once. If you re-process a batch after a crash, you don't send the same email twice, create duplicate CRM rows, or bill a customer twice. Every external action keys off a deterministic ID stored in your database before the call fires.
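A minimal sketch of that pattern, assuming a SQLite table named `actions` and a hypothetical `send` callable (neither comes from a real client system): the ID is derived from the action itself and written before the call fires, so a re-run skips work that was already claimed.

```python
import hashlib
import sqlite3

def action_id(kind: str, payload: str) -> str:
    """Deterministic ID: the same kind and payload always hash to the same key."""
    return hashlib.sha256(f"{kind}:{payload}".encode()).hexdigest()

def run_once(db: sqlite3.Connection, kind: str, payload: str, send) -> bool:
    """Record the action ID before calling `send`; skip if it is already recorded."""
    key = action_id(kind, payload)
    try:
        with db:  # commits on success, rolls back on error
            db.execute("INSERT INTO actions (id) VALUES (?)", (key,))
    except sqlite3.IntegrityError:
        return False  # a previous run already claimed this action
    send(payload)
    return True

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE actions (id TEXT PRIMARY KEY)")

sent = []
first = run_once(db, "email", "welcome:user-42", sent.append)
second = run_once(db, "email", "welcome:user-42", sent.append)  # replay after a "crash"
```

This version is at-most-once: if `send` fails after the ID is stored, the action stays claimed. A fuller version would likely track a status column (pending, done, failed) so failed actions can be retried deliberately.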
02 · Observability
You can look at any single past run and see: inputs, prompt, model response, tool calls, tool responses, final output, cost. No "it worked a minute ago" debugging. Structured logs beat token traces; both beat nothing.
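One way to make a run inspectable, sketched with hypothetical field names that mirror the list above: write one structured record per run as a single JSON line.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ToolCall:
    name: str
    arguments: dict
    response: str

@dataclass
class RunRecord:
    run_id: str
    inputs: dict
    prompt: str
    model_response: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_output: str = ""
    cost_usd: float = 0.0

def to_log_line(record: RunRecord) -> str:
    """Serialize one run as a single JSON line that can be stored and queried later."""
    return json.dumps(asdict(record), sort_keys=True)

record = RunRecord(
    run_id="run-001",
    inputs={"lead_id": 17},
    prompt="Summarize this lead.",
    model_response="Calling crm_lookup.",
    tool_calls=[ToolCall("crm_lookup", {"lead_id": 17}, "found")],
    final_output="Lead 17 is active.",
    cost_usd=0.0031,
)
line = to_log_line(record)
```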
03 · Explicit scope
The agent has a written one-paragraph job description that an engineer can show a lawyer. It says what the agent does, what it reads, what it writes, what it will refuse. If you can't produce that paragraph, the agent will over-reach in month two.
04 · Human-in-the-loop gate
Any action with an irreversible external consequence (emails, CRM updates that delete data, charges, messages to customers) passes through a review surface, at least until the agent has produced a few hundred approved outputs. The gate is explicit, not implied. A dashboard nobody is watching is not a gate.
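A rough sketch of an explicit gate, assuming a hypothetical set of action names and a threshold of 300 standing in for "a few hundred": irreversible actions are held in a pending queue until a human approves them.

```python
from dataclasses import dataclass, field

# Hypothetical action names; a real system would define its own list.
IRREVERSIBLE = {"send_email", "delete_crm_record", "charge_card", "message_customer"}

@dataclass
class ReviewGate:
    required_approvals: int = 300  # assumption standing in for "a few hundred"
    approved_count: int = 0
    pending: list = field(default_factory=list)

    def submit(self, action: str, payload, execute) -> str:
        """Hold irreversible actions for review until enough outputs were approved."""
        if action in IRREVERSIBLE and self.approved_count < self.required_approvals:
            self.pending.append((action, payload, execute))
            return "pending_review"
        execute(payload)
        return "executed"

    def approve(self, index: int = 0) -> None:
        """A human approves one pending action; only then does it run."""
        _action, payload, execute = self.pending.pop(index)
        execute(payload)
        self.approved_count += 1
```

Nothing irreversible runs unless `approve` is called, so an unattended queue grows instead of firing.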
FIG. 01 / THE 11-CHECKPOINT GRADE · FULL PASS BEFORE SHIP
05 · Timeout and retry
Every external call has a timeout and a retry policy. No unbounded hangs. Temporal, Inngest, or a plain state machine — pick one. "Hope it comes back" is not a strategy.
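If you go the plain route, the core is small. A sketch assuming the wrapped callable accepts a `timeout` keyword (as many HTTP clients do); the function and parameter names here are illustrative, not from any particular library.

```python
import time

def call_with_retry(fn, *, timeout_s=5.0, max_attempts=3, base_delay_s=0.5,
                    sleep=time.sleep):
    """Call fn(timeout=...) with a bounded number of attempts and exponential backoff."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(timeout=timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < max_attempts:
                # Wait 0.5s, 1s, 2s, ... between attempts.
                sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The `sleep` parameter is injectable so the policy can be tested without waiting.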
06 · Rate control
The agent respects rate limits of every downstream tool, with a ceiling on its own throughput. A misconfigured loop that sends 10,000 requests per minute to a partner API will get you banned before lunch. Bucket your sends; back off on 429s.
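A minimal token-bucket sketch for the throughput ceiling, plus a backoff helper for 429 responses. The class and function names are our own, and the helper assumes `Retry-After` arrives as a number of seconds (the header can also carry an HTTP date, which this does not handle).

```python
import time

class TokenBucket:
    """Allow at most `rate_per_s` sends per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_s: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def backoff_delay(attempt: int, retry_after=None, base_s=1.0, cap_s=60.0) -> float:
    """Delay after a 429: honor Retry-After (in seconds) if given, else back off exponentially."""
    if retry_after is not None:
        return min(float(retry_after), cap_s)
    return min(base_s * 2 ** attempt, cap_s)
```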
07 · Secrets hygiene
API keys, tokens, and customer data never appear in prompts or logs. If your prompt template includes a literal API key, your logging pipeline is now a credential exfiltration channel. We've seen this twice in client audits.
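A last line of defense is to redact before anything is written to a log. This is a sketch with two illustrative patterns (the `sk-` key shape is an assumption, not a complete catalogue of credential formats); the real fix is still to keep secrets out of prompt templates in the first place.

```python
import re

# Illustrative patterns only; extend for the credential formats you actually use.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{16,}"),            # assumed API-key shape
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._\-]+"),  # bearer tokens in headers
]

def redact(text: str) -> str:
    """Replace anything that looks like a credential before the text is logged."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```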
08 · Cost ceiling
A daily maximum on inference spend. A monthly maximum on third-party API spend. Alerts before ceiling, hard stops at ceiling. The number of agents that have run away with $10k of tokens before anyone noticed is higher than anyone admits.
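The alert-then-stop behavior can be sketched as a small guard that every spend passes through. The names and the 80% default alert threshold are assumptions, and resetting the counter each day or month is left out for brevity.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push spend past the hard ceiling."""

class SpendCeiling:
    def __init__(self, limit_usd: float, alert_fraction: float = 0.8, on_alert=print):
        self.limit_usd = limit_usd
        self.alert_fraction = alert_fraction
        self.on_alert = on_alert
        self.spent = 0.0
        self.alerted = False

    def charge(self, amount_usd: float) -> None:
        """Record spend; alert once near the ceiling, refuse anything past it."""
        if self.spent + amount_usd > self.limit_usd:
            raise BudgetExceeded(
                f"charge of ${amount_usd:.2f} would exceed ${self.limit_usd:.2f}"
            )
        self.spent += amount_usd
        if not self.alerted and self.spent >= self.alert_fraction * self.limit_usd:
            self.alerted = True
            self.on_alert(f"spend at ${self.spent:.2f} of ${self.limit_usd:.2f}")
```

Call `charge` before the inference or API call is made, so the hard stop prevents the spend instead of reporting it afterwards.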
Cost ceiling dashboard: real-time spend across agents, with per-agent daily and monthly caps.
09 · Graceful degradation
The model is down. The third-party is down. What does the agent do? "Error" is not an answer. It should return a structured failure, queue the work for retry, and surface to the human an explicit notice: "agent paused, will resume in 10 minutes." Nothing silently stops.
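A sketch of the structured-failure path, assuming an in-memory queue and a hypothetical `call_model` callable that raises `ConnectionError` when the provider is unreachable:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Failure:
    reason: str
    retry_in_s: int
    notice: str  # the explicit message surfaced to the human

retry_queue = deque()  # stand-in for a durable queue

def run_step(task, call_model, retry_in_s: int = 600):
    """Return the model result, or a structured Failure with the task queued for retry."""
    try:
        return call_model(task)
    except ConnectionError as exc:
        retry_queue.append(task)
        return Failure(
            reason=str(exc),
            retry_in_s=retry_in_s,
            notice=f"agent paused, will resume in {retry_in_s // 60} minutes",
        )
```

In production the queue would need to survive a restart; the in-memory `deque` here only shows the shape.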
10 · Kill switch
One operator, usually the founder or the ops lead, can stop the agent right now, without deploying code. A database flag, a feature flag, an environment variable. The moment you doubt what it's doing, you need to be able to flip it off in ten seconds.
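The database-flag variant, sketched against a hypothetical `agent_flags` table: the agent re-reads the flag before every action, and a missing row counts as "stopped" so the switch fails closed.

```python
import sqlite3

def agent_enabled(db: sqlite3.Connection, agent: str) -> bool:
    """Read the flag fresh each time; no row means the agent is off."""
    row = db.execute(
        "SELECT enabled FROM agent_flags WHERE agent = ?", (agent,)
    ).fetchone()
    return bool(row and row[0])

def run_batch(db: sqlite3.Connection, agent: str, tasks, handle) -> list:
    """Process tasks one by one, checking the kill switch before each."""
    done = []
    for task in tasks:
        if not agent_enabled(db, agent):
            break  # operator flipped the switch: stop before the next action
        handle(task)
        done.append(task)
    return done
```

Flipping it is then a single `UPDATE agent_flags SET enabled = 0 WHERE agent = ...`, with no deploy involved.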
11 · Dated corpus
Any knowledge the agent relies on is dated: the date the claim was pulled and the date it was last verified. Stale data is worse than no data because the agent confidently states wrong things. We have seen agents tell prospects about "competitors" that had acquired those same prospects six months earlier.
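In code, "dated" can be as simple as two date fields on every claim and a filter in front of the agent. A sketch with hypothetical names and an assumed 90-day freshness window:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Claim:
    text: str
    pulled_on: date
    last_verified_on: date

def is_stale(claim: Claim, today: date, max_age: timedelta = timedelta(days=90)) -> bool:
    """A claim is stale when its last verification is older than max_age."""
    return today - claim.last_verified_on > max_age

def usable_claims(claims: list[Claim], today: date) -> list[Claim]:
    """Only fresh claims are handed to the agent; stale ones go back for re-verification."""
    return [c for c in claims if not is_stale(c, today)]
```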
Why eleven, not five
The first five look like software engineering hygiene. They are. The last six look specific to agents, and they are the ones that burn clients who thought "agents are just code." The odds of catastrophe are small with any one checkpoint missing, meaningful with any two, and embarrassing with any three. These eleven are the ones that have actually caught us since we started shipping agents.
How we use the list
Before any client-facing ship, one of us runs the eleven checkpoints top-to-bottom with the author of the agent sitting next to us. Unchecked items either get fixed or get explicitly deferred with a risk note and a kill-switch path. No checkpoint gets silently skipped. Every one has a one-paragraph answer logged to the engagement's readme.
This is the difference between an agent that runs in demo and an agent that runs on a Sunday at 3am while everyone is asleep. The latter is what founders are actually paying for.
Most of the "agent failures" we've been called in to diagnose are not model failures. They are missing checkpoints. The model is usually fine; the engineering around it is not. If your agent makes it through these eleven, ship it. If it doesn't, do not.
We'll audit your agents against the list.
Thirty-minute intake. We grade what you've shipped and tell you honestly what's production-ready and what isn't.
- Free
- No deck
- Written scope in 48 hrs