Designing agentic systems that don't burn money
Agents fail loudly when they crash. They fail expensively when they don't.
Agentic systems have a special failure mode: they don't fall over, they just keep going. A bad loop in a web service throws a 500. A bad loop in an agent racks up tool calls, model calls, and the occasional storage bill while everything looks green.
The three budgets every agent needs
- Token budget: cumulative input + output tokens across the whole task.
- Tool budget: number and cost of tool invocations per task.
- Wall-clock budget: maximum elapsed time per task. Agents that take hours to fail are agents that take hours to debug.
These should be enforced by the runtime, not by the model. Asking the model nicely to limit itself is asking a fish to count.
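As a rough illustration of runtime-side enforcement, here is a minimal Python sketch. The class name `TaskBudget`, the exception `BudgetExceeded`, and every default limit are hypothetical and chosen only for the example; the idea is that the agent loop charges the budget after each model or tool call and the runtime stops the task when a limit is crossed.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised by the runtime when a task goes over one of its budgets."""


@dataclass
class TaskBudget:
    # Limits below are illustrative placeholders, not recommendations.
    max_tokens: int = 200_000
    max_tool_calls: int = 50
    max_tool_cost_usd: float = 2.00
    max_seconds: float = 600.0

    # Running totals for the current task.
    tokens_used: int = 0
    tool_calls: int = 0
    tool_cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge_model_call(self, input_tokens: int, output_tokens: int) -> None:
        """Record a model call, then enforce all budgets."""
        self.tokens_used += input_tokens + output_tokens
        self.check()

    def charge_tool_call(self, cost_usd: float = 0.0) -> None:
        """Record a tool invocation and its cost, then enforce all budgets."""
        self.tool_calls += 1
        self.tool_cost_usd += cost_usd
        self.check()

    def check(self) -> None:
        """Raise BudgetExceeded if any of the three budgets is breached."""
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"tokens: {self.tokens_used} > {self.max_tokens}")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool calls: {self.tool_calls} > {self.max_tool_calls}")
        if self.tool_cost_usd > self.max_tool_cost_usd:
            raise BudgetExceeded(f"tool cost: ${self.tool_cost_usd:.2f}")
        elapsed = time.monotonic() - self.started_at
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"wall clock: {elapsed:.0f}s > {self.max_seconds:.0f}s")
```

In this sketch the agent loop would catch `BudgetExceeded` at the top level and end the task with a clear status, so the model never has to be trusted to count its own spend.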
Trace structure that pays off
If your trace UI only shows the prompts and the final answer, you're flying blind on cost. Capture, per step:
- Cumulative tokens and tool cost so far.
- What the agent decided to do, and why (the planner's output).
- What it observed (the tool result, summarized to a fixed length).
- Whether a budget was approached or breached.
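One possible shape for such a per-step record, sketched in Python. The field names, the 500-character observation cap, and the 80% "approaching" threshold are all assumptions made for illustration, and plain truncation stands in for whatever summarization a real system would use.

```python
import json
from dataclasses import dataclass, asdict

OBSERVATION_MAX_CHARS = 500  # assumed fixed length for stored observations


def summarize(text: str, limit: int = OBSERVATION_MAX_CHARS) -> str:
    """Cap an observation at a fixed length (simple truncation as a stand-in)."""
    return text if len(text) <= limit else text[: limit - 3] + "..."


def budget_status(used: float, limit: float, warn_ratio: float = 0.8) -> str:
    """Classify budget usage as 'ok', 'approaching', or 'breached'."""
    if used > limit:
        return "breached"
    if used >= warn_ratio * limit:
        return "approaching"
    return "ok"


@dataclass
class TraceStep:
    step: int
    cumulative_tokens: int           # tokens spent so far in the task
    cumulative_tool_cost_usd: float  # tool spend so far in the task
    decision: str                    # what the agent decided to do
    rationale: str                   # why (the planner's output)
    observation: str                 # tool result, capped to a fixed length
    budget_status: str               # "ok", "approaching", or "breached"


# Hypothetical example of recording one step as a JSON line.
step = TraceStep(
    step=3,
    cumulative_tokens=41_200,
    cumulative_tool_cost_usd=0.12,
    decision="call search tool",
    rationale="need the current price before answering",
    observation=summarize("raw tool output " * 100),
    budget_status=budget_status(used=41_200, limit=50_000),
)
trace_line = json.dumps(asdict(step))
```

Writing one JSON line per step keeps the trace easy to replay later and makes cost visible at the step where it was incurred.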
Eval the trajectory, not just the answer
Traditional evals score the final output. Agentic evals also score the path: did the agent take the cheapest reasonable trajectory, did it call the right tools, did it avoid the dead-end loops we've already seen? This is where most teams cut their agent costs in half — not in the model, in the trajectory.
# anatomy of an agent eval
1. Replay real traces, masking the model output
2. Score trajectory cost, tool selection, loop avoidance
3. Score final-answer quality with a rubric, not a single grade
4. Track per-trajectory $ as a first-class metric
5. Gate deploys when trajectory-cost regresses
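The trajectory-scoring and gating steps might be sketched as follows in Python. Everything here is a hypothetical simplification: the `Trajectory` type, the "same tool repeated more than three times in a row" loop heuristic, and the 10% cost tolerance are assumptions for the example, and rubric scoring of the final answer is left out.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trajectory:
    tools_called: list[str]  # tool names in call order, from a replayed trace
    cost_usd: float          # total model + tool spend for the trajectory


def tool_selection_score(traj: Trajectory, expected_tools: set[str]) -> float:
    """Fraction of tool calls that hit an expected tool (1.0 if no calls)."""
    if not traj.tools_called:
        return 1.0
    hits = sum(1 for t in traj.tools_called if t in expected_tools)
    return hits / len(traj.tools_called)


def has_loop(traj: Trajectory, max_repeats: int = 3) -> bool:
    """Flag the same tool being called more than max_repeats times in a row."""
    run = 0
    prev = None
    for tool in traj.tools_called:
        run = run + 1 if tool == prev else 1
        prev = tool
        if run > max_repeats:
            return True
    return False


def cost_regressed(baseline: list[float], candidate: list[float],
                   tolerance: float = 0.10) -> bool:
    """True if mean per-trajectory cost rose by more than the tolerance."""
    return mean(candidate) > mean(baseline) * (1 + tolerance)
```

A deploy gate in this style would compute `cost_regressed` over the same replayed tasks for the current and candidate builds, and block the release when it returns `True`.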
Done well, this looks suspiciously like site reliability engineering (SRE). That's not a coincidence: production agents are systems software, and we're going to have to start treating them like it.