For engineers running AI in production
LLM cost optimization: token budgeting, caching, and routing
Your LLM bill is driven by multi-turn context growth, not the per-token rate. A multi-turn agent re-sends a growing context every turn, so input tokens — not output, not the headline price — dominate the spend.
The three highest-leverage moves: cache the large static prefix so repeated turns bill at a fraction of the input rate, route cheap sub-tasks down the model ladder (Opus to Sonnet to Haiku), and cap turns with explicit done-conditions so a runaway loop can't quietly run up the meter.
Why input tokens dominate
The first instinct is to compare per-token prices and pick the cheapest model. That's the wrong frame for an agent. The cost shape of a multi-turn loop is what matters: on each turn the model gets the entire conversation so far — system prompt, tool definitions, retrieved context, and every prior message — re-sent as input. As the run grows, the input side of every call grows with it. Output is usually a small fraction by comparison.
So the bill is dominated by the same large prefix being re-read turn after turn. For reference, current Claude rates per 1M tokens (input / output):
| Model | Input (per 1M) | Output (per 1M) |
|---|---|---|
| Haiku 4.5 | $1 | $5 |
| Sonnet 4.6 | $3 | $15 |
| Opus 4.8 | $5 | $25 |
Notice the output rate is the larger number per model — but in a multi-turn agent you re-send far more input than you generate output, so the input column is where the money goes. Optimize the thing that scales with every turn.
The levers
1. Prompt caching
Most of what you re-send each turn is static: the system prompt, the tool definitions, and any fixed context. Cache that prefix once and repeated turns read it from cache instead of re-billing it at the full input rate. Cache reads bill at roughly 0.1x the input rate; the trade is that the initial cache write costs about 1.25x the input rate. With a stable prefix re-read across many turns, that one-time premium pays for itself quickly. This is the single biggest lever for a long-running agent because it attacks the exact cost — the re-sent prefix — that dominates the bill.
2. Model routing
Not every sub-task needs your strongest model. Route work down the ladder — Opus to Sonnet to Haiku — and send cheap or simple steps (classification, extraction, routing decisions, short formatting) to a smaller, cheaper model. Reserve the expensive model for the steps that actually need its reasoning. The price spread above makes this material: Haiku input is one-fifth of Opus input.
3. Turn and tool-call caps with done-conditions
An agent that doesn't know when it's finished will keep looping, and every loop re-sends the growing context. Set an explicit maximum on turns and tool calls, and define a clear done-condition so the agent stops the moment the goal is met. This is the guardrail that prevents the worst-case bill, not the average one.
4. Context management and compaction
Dead context is pure overhead: you pay to re-send it every turn for no benefit. Prune resolved tool outputs, summarize or compact older turns, and keep only what the next step needs. Smaller carried context means a smaller input side on every subsequent call.
5. Batch where latency allows
For work that isn't latency-sensitive — overnight processing, bulk classification, offline evals — batch the requests. You trade immediacy for a lower effective rate, and a large share of production LLM work doesn't actually need a real-time response.
6. Instrument per-route cost
You can't cut what you can't see. Attribute token usage and cost to each route, endpoint, or agent so the hotspot is obvious. Per-route cost turns "the bill went up" into "this one flow is 80% of spend" — which is what tells you where to apply the levers above.
The levers at a glance
| Lever | What it does | Impact |
|---|---|---|
| Prompt caching | Cache reads bill at ~0.1x input (write ~1.25x) for the static prefix | Large |
| Model routing | Sends cheap sub-tasks to a smaller model (Opus to Sonnet to Haiku) | Large |
| Turn / tool-call caps | Stops runaway loops with explicit done-conditions | Large (worst-case) |
| Context compaction | Drops dead context so it isn't re-sent each turn | Situational |
| Batching | Lower effective rate where latency isn't required | Situational |
| Per-route cost instrumentation | Surfaces the hotspot so you know where to act | Enabling |
- Anthropic — prompt caching and pricing docs (platform.claude.com).
- Course material: AI Architect Academy Track B (cost modeling and routing).
Prices and caching mechanics change; verify against current provider docs before implementing. Corrections: hello@aiarch.dev.
Learn to model and control LLM cost as a first-class skill.
AI Architect Academy teaches cost modeling, prompt caching, and model routing as production skills — across Anthropic, AWS, and Cloudflare.
Get notified when new tracks ship.