Why is my LLM agent so expensive?

Because a multi-turn agent re-sends a growing context every turn, so input tokens dominate the bill. The same large prefix - system prompt, tool definitions, retrieved context, and prior messages - is re-read on every call, and an uncontrolled loop multiplies that cost. The per-token rate is rarely the real driver.

Does prompt caching reduce cost?

Yes, significantly for a stable prefix. Cache reads bill at roughly 0.1x the input rate, against a one-time cache write of about 1.25x the input rate. When a large static prefix is re-read across many turns, the write premium pays for itself quickly, which makes caching the single biggest lever for a long-running agent.

How do I cut LLM costs in production?

Cache the static prefix, route cheap sub-tasks down the model ladder (Opus to Sonnet to Haiku), cap turns and tool calls with explicit done-conditions, compact dead context, batch latency-tolerant work, and instrument per-route cost so you can see the hotspot. Bound the loop first - uncontrolled loops, not inference, are what bite the bill.

For engineers running AI in production

LLM cost optimization: token budgeting, caching, and routing

By Wibo · Amsterdam Published 26 Jun 2026 ~7 min read

Short answer

Your LLM bill is driven by multi-turn context growth, not the per-token rate. A multi-turn agent re-sends a growing context every turn, so input tokens — not output, not the headline price — dominate the spend.

The three highest-leverage moves: cache the large static prefix so repeated turns bill at a fraction of the input rate, route cheap sub-tasks down the model ladder (Opus to Sonnet to Haiku), and cap turns with explicit done-conditions so a runaway loop can't quietly run up the meter.

Why input tokens dominate

The first instinct is to compare per-token prices and pick the cheapest model. That's the wrong frame for an agent. The cost shape of a multi-turn loop is what matters: on each turn the model gets the entire conversation so far — system prompt, tool definitions, retrieved context, and every prior message — re-sent as input. As the run grows, the input side of every call grows with it. Output is usually a small fraction by comparison.

So the bill is dominated by the same large prefix being re-read turn after turn. For reference, current Claude rates per 1M tokens (input / output):

Model	Input (per 1M)	Output (per 1M)
Haiku 4.5	$1	$5
Sonnet 4.6	$3	$15
Opus 4.8	$5	$25

Notice the output rate is the larger number per model — but in a multi-turn agent you re-send far more input than you generate output, so the input column is where the money goes. Optimize the thing that scales with every turn.

The levers

1. Prompt caching

Most of what you re-send each turn is static: the system prompt, the tool definitions, and any fixed context. Cache that prefix once and repeated turns read it from cache instead of re-billing it at the full input rate. Cache reads bill at roughly 0.1x the input rate; the trade is that the initial cache write costs about 1.25x the input rate. With a stable prefix re-read across many turns, that one-time premium pays for itself quickly. This is the single biggest lever for a long-running agent because it attacks the exact cost — the re-sent prefix — that dominates the bill.

2. Model routing

Not every sub-task needs your strongest model. Route work down the ladder — Opus to Sonnet to Haiku — and send cheap or simple steps (classification, extraction, routing decisions, short formatting) to a smaller, cheaper model. Reserve the expensive model for the steps that actually need its reasoning. The price spread above makes this material: Haiku input is one-fifth of Opus input.

3. Turn and tool-call caps with done-conditions

An agent that doesn't know when it's finished will keep looping, and every loop re-sends the growing context. Set an explicit maximum on turns and tool calls, and define a clear done-condition so the agent stops the moment the goal is met. This is the guardrail that prevents the worst-case bill, not the average one.

4. Context management and compaction

Dead context is pure overhead: you pay to re-send it every turn for no benefit. Prune resolved tool outputs, summarize or compact older turns, and keep only what the next step needs. Smaller carried context means a smaller input side on every subsequent call.

5. Batch where latency allows

For work that isn't latency-sensitive — overnight processing, bulk classification, offline evals — batch the requests. You trade immediacy for a lower effective rate, and a large share of production LLM work doesn't actually need a real-time response.

6. Instrument per-route cost

You can't cut what you can't see. Attribute token usage and cost to each route, endpoint, or agent so the hotspot is obvious. Per-route cost turns "the bill went up" into "this one flow is 80% of spend" — which is what tells you where to apply the levers above.

The levers at a glance

Lever	What it does	Impact
Prompt caching	Cache reads bill at ~0.1x input (write ~1.25x) for the static prefix	Large
Model routing	Sends cheap sub-tasks to a smaller model (Opus to Sonnet to Haiku)	Large
Turn / tool-call caps	Stops runaway loops with explicit done-conditions	Large (worst-case)
Context compaction	Drops dead context so it isn't re-sent each turn	Situational
Batching	Lower effective rate where latency isn't required	Situational
Per-route cost instrumentation	Surfaces the hotspot so you know where to act	Enabling

What actually bites the bill

Inference is usually cheap. The thing that runs up an LLM bill is an uncontrolled loop: an agent that re-sends a growing context turn after turn with no cap and no done-condition. Bound the loop first, cache the prefix second, route work down third. The per-token rate is the last thing to worry about.

Sources & provenance

Anthropic — prompt caching and pricing docs (platform.claude.com).
Course material: AI Architect Academy Track B (cost modeling and routing).

Prices and caching mechanics change; verify against current provider docs before implementing. Corrections: hello@aiarch.dev.

Learn to model and control LLM cost as a first-class skill.

AI Architect Academy teaches cost modeling, prompt caching, and model routing as production skills — across Anthropic, AWS, and Cloudflare.

Browse the curriculum → Try a sample lesson