Directing requests across models
LLM routing: sending each request to the right model
LLM routing is sending each request to the most suitable model rather than one fixed model: a cheap, fast model for easy requests and a stronger, costlier one for hard ones. An LLM router is the component that makes that decision per request — by cost, latency, capability, or availability — so you cut spend without dropping quality on the requests that need a frontier model. Routing is one function; an AI gateway wraps a router with caching, observability, and cost control.
This page is about routing as a concept — directing traffic across models and providers. For the proxy that contains a router, see AI gateway; for choosing among Claude's own tiers, see Claude model selection. Below: what a router does, why route, the strategies, build vs buy, and the pitfalls.
What an LLM router does
An LLM router is a decision function: given a request, it picks which model or provider should handle it, then forwards the call. That is the whole job — pick the destination. Everything around the call (the unified API, caching, logging, budgets) belongs to the gateway the router sits inside, not to the router itself.
The decision is made on one or more signals:
- Cost — route easy requests to a cheaper model and reserve the expensive one for requests that need it.
- Capability — match the request to a model that can actually do it (long context, vision, code, tool use).
- Latency — send latency-sensitive requests to a faster model, batchable ones to a slower, cheaper one.
- Availability — route around a provider that is erroring, rate-limiting, or down (this is the fallback case).
A router can route across providers (Anthropic vs another vendor) or within one provider's tiers (Opus vs Sonnet vs Haiku). The within-provider case is a model-selection decision dressed as routing — see how to choose between Claude models.
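As a minimal sketch of that decision function, the example below filters candidates by availability, capability, and latency, then picks on cost. The model names, prices, latencies, and the `hard` flag are all hypothetical, chosen only to make the signals concrete; a real router would get them from health checks, a model catalogue, and some upstream difficulty signal.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Model:
    name: str                     # hypothetical model id
    cost_per_mtok: float          # illustrative price, not a real quote
    p50_latency_ms: int           # illustrative typical latency
    capabilities: frozenset = frozenset()
    healthy: bool = True          # a real system would set this from health checks

@dataclass(frozen=True)
class Request:
    needs: frozenset = frozenset()        # e.g. frozenset({"vision"})
    max_latency_ms: Optional[int] = None  # None means no latency constraint
    hard: bool = False                    # assumed to come from an upstream signal

def route(request: Request, models: list) -> Model:
    """Pick a destination: filter on availability, capability, latency; then cost."""
    candidates = [
        m for m in models
        if m.healthy
        and request.needs <= m.capabilities
        and (request.max_latency_ms is None
             or m.p50_latency_ms <= request.max_latency_ms)
    ]
    if not candidates:
        raise LookupError("no model satisfies this request")
    # Crude proxy: treat the priciest candidate as the strongest one.
    pick = max if request.hard else min
    return pick(candidates, key=lambda m: m.cost_per_mtok)

MODELS = [
    Model("small-fast", 1.0, 300, frozenset({"tool_use"})),
    Model("large-strong", 15.0, 1200, frozenset({"tool_use", "vision"})),
]

print(route(Request(), MODELS).name)
print(route(Request(needs=frozenset({"vision"})), MODELS).name)
```

Using price as a stand-in for strength is a simplification; it keeps the sketch short but would need replacing with measured quality in practice.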
Why route: cost vs quality
The reason to route is that most production traffic is not uniformly hard. A classify-or-route step, a short reply, a formatting pass — these don't need a frontier model, but a single-model setup pays frontier prices for all of them. Routing the easy majority to a small model and escalating only the hard minority is the highest-leverage cost lever after caching.
The published numbers are concrete: LMSYS's open-source RouteLLM reports cutting cost by over 85% on the MT-Bench evaluation while retaining about 95% of GPT-4-level quality, by routing simpler queries to a weaker, cheaper model and only the hard ones to the strong model. The exact figures are benchmark-specific, but the shape generalises — routing trades a small, bounded quality loss for a large cost reduction.
Quality is the constraint, not an afterthought. A router only helps if the cheap model genuinely handles the requests sent to it; route a hard request to a weak model and you have saved money by shipping a worse answer. That is why routing pairs with evals — see LLM cost optimization for where routing fits among caching, batching, and prompt trimming.
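The cost side of the trade-off is simple arithmetic, sketched below. The prices are made-up units, and the calculation deliberately ignores the router's own overhead and any quality loss, so treat the result as an upper bound on savings rather than a forecast.

```python
def blended_cost(cheap_share: float, cheap_price: float, strong_price: float) -> float:
    """Average price per request when `cheap_share` of traffic goes to the cheap model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price + (1.0 - cheap_share) * strong_price

def savings(cheap_share: float, cheap_price: float, strong_price: float) -> float:
    """Fractional saving versus sending every request to the strong model."""
    return 1.0 - blended_cost(cheap_share, cheap_price, strong_price) / strong_price

# Hypothetical prices: the cheap model costs 1 unit per request, the strong one 20.
# Routing 80% of traffic to the cheap model gives 0.8*1 + 0.2*20 = 4.8 units.
print(f"{savings(0.8, 1.0, 20.0):.0%}")  # prints 76%
```

The saving depends on two things only: how much traffic the cheap model can take and how large the price gap is. Both are worth measuring on your own workload before committing to a router.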
Routing strategies
Routers differ mainly in how they decide. Four strategies cover most real systems, from a hard-coded rule to a trained model. They are not exclusive — production routers often combine a cheap rule with a cascade fallback.
| Strategy | How it decides | Best for | Main trade-off |
|---|---|---|---|
| Static rules | Hard-coded rules on request type, prompt length, or user tier pick the model. | Predictable workloads with clear request classes. | Brittle — rules drift from reality and miss novel requests. |
| Classifier / semantic routing | A small classifier or embedding-similarity match predicts the right model from the prompt's meaning, with no generation step. | Routing by intent or complexity at scale, in milliseconds. | Needs labelled data or an utterance corpus, and a routing model to maintain. |
| Cascade / fallback | Try the cheap model first; escalate to a stronger one if confidence or quality is low, or on a provider error. | Maximising cheap-model coverage; resilience to outages. | The escalated case pays for two calls and adds latency. |
| Cost-aware / threshold | Route on a cost-quality threshold tuned on preference data (e.g. a target win-rate against the strong model). | Hitting a chosen quality bar at the lowest spend. | The threshold needs tuning, and routing quality depends on how well the tuning benchmark fits your traffic. |
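The cascade row in the table can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: each model call is assumed to return an answer together with a confidence score in [0, 1], and how that score is produced (a verifier, log-probabilities, a grader model) is left open. The two `fake_*` functions are stubs standing in for real provider calls.

```python
from typing import Callable, Tuple

# Assumed shape of a model call: prompt in, (answer, confidence) out.
ModelCall = Callable[[str], Tuple[str, float]]

def cascade(prompt: str, cheap: ModelCall, strong: ModelCall,
            min_confidence: float = 0.7) -> Tuple[str, str]:
    """Try the cheap model first; escalate on low confidence or on an error.

    Returns (answer, tier), where tier records which model answered.
    """
    try:
        answer, confidence = cheap(prompt)
        if confidence >= min_confidence:
            return answer, "cheap"
    except Exception:
        # Provider error, timeout, or rate limit: fall through to the strong model.
        pass
    answer, _ = strong(prompt)
    return answer, "strong"

# Stubs for illustration only: the cheap stub is "confident" on short prompts.
def fake_cheap(prompt: str) -> Tuple[str, float]:
    return "short answer", 0.9 if len(prompt) < 40 else 0.3

def fake_strong(prompt: str) -> Tuple[str, float]:
    return "careful answer", 0.95

print(cascade("hi", fake_cheap, fake_strong))
```

The escalated path makes two calls, which is the trade-off the table names. The broad `except` is kept for brevity; production code would catch specific provider errors.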
Semantic routing deserves a note because it is fast and cheap: rather than asking an LLM to choose, it embeds the request and matches it against a set of pre-defined routes by vector similarity, deciding in milliseconds with no extra generation. The open-source semantic-router library (Aurelio Labs, MIT) is the common reference implementation. Cost-aware routing is what RouteLLM and managed routers like NotDiamond automate — a router trained to predict, per request, whether the cheap model will be good enough.
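To make the embed-and-match idea concrete, here is a toy sketch that does not use the semantic-router library or its API. The `embed` function is a bag-of-words stand-in for a real embedding model, and the route names, example utterances, and similarity threshold are all hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical routes: each destination is described by example utterances.
ROUTES = {
    "small-model": ["what time is it", "say hello", "translate this word"],
    "strong-model": ["prove this theorem step by step",
                     "refactor this large codebase",
                     "debug this race condition"],
}
ROUTE_VECTORS = {name: [embed(u) for u in utts] for name, utts in ROUTES.items()}

def semantic_route(prompt: str, default: str = "strong-model",
                   min_similarity: float = 0.3) -> str:
    """Return the route whose closest utterance best matches the prompt."""
    query = embed(prompt)
    best_name, best_score = default, 0.0
    for name, vectors in ROUTE_VECTORS.items():
        score = max(cosine(query, v) for v in vectors)
        if score > best_score:
            best_name, best_score = name, score
    # Below the threshold nothing matched well, so fall back to the default.
    return best_name if best_score >= min_similarity else default

print(semantic_route("please say hello"))
print(semantic_route("debug this race condition in my code"))
```

Defaulting to the stronger model when no route matches is a design choice: an unrecognised request costs more, but it is not silently sent to a model that may not handle it.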
Build vs buy
You can build a router: at its simplest it is an if-statement on request type that selects a model id. For one team with two clear request classes, that static rule is the right, simple call. Don't train a model to do a job a rule already does.
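A static rule of that kind can be as small as the sketch below. The model ids and request types are placeholders, not real identifiers.

```python
# Hypothetical model ids and request types, for illustration only.
CHEAP_MODEL = "small-model"
STRONG_MODEL = "strong-model"
EASY_TYPES = {"classify", "format", "short_reply"}

def pick_model(request_type: str) -> str:
    """Static rule: known-easy request types go cheap, everything else goes strong."""
    if request_type in EASY_TYPES:
        return CHEAP_MODEL
    return STRONG_MODEL

print(pick_model("classify"))       # prints small-model
print(pick_model("legal_analysis")) # prints strong-model
```

Unknown request types fall through to the strong model, so a new class of request costs more until someone adds it to the rule, rather than getting a weaker answer unnoticed.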
Buy (or adopt open source) when the decision gets harder than a rule can express: routing by prompt complexity, hitting a measured cost-quality target, or routing across many providers. That capability already exists. Open-source options include RouteLLM (LMSYS, a framework that drops in as an OpenAI-compatible client) and semantic-router for embedding-based routing; managed options include NotDiamond (a router that predicts the best model per query) and the auto-routers built into aggregators like OpenRouter and proxies like LiteLLM. Most teams that "buy" routing get it bundled inside a gateway rather than as a standalone product.
Either way, keep the router behind a thin internal interface — the same isolation an architect applies to the whole gateway — so swapping a hand-rolled rule for a trained router later is a one-file change, not a rewrite.
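One way to express that thin interface, sketched here with a `typing.Protocol`: application code depends on a single `choose` method, and the implementation behind it can change. The class names, model ids, and the scoring function are hypothetical; `ScoredRouter` is only a placeholder for wherever a trained router would plug in.

```python
from typing import Callable, Protocol

class Router(Protocol):
    """The only surface application code is allowed to depend on."""
    def choose(self, prompt: str) -> str: ...

class RuleRouter:
    """Hand-rolled rule: short prompts go to the cheap model."""
    def __init__(self, max_cheap_chars: int = 200) -> None:
        self.max_cheap_chars = max_cheap_chars

    def choose(self, prompt: str) -> str:
        return "small-model" if len(prompt) <= self.max_cheap_chars else "strong-model"

class ScoredRouter:
    """Placeholder for a trained router: wraps any difficulty score in [0, 1]."""
    def __init__(self, score: Callable[[str], float], threshold: float = 0.5) -> None:
        self.score = score
        self.threshold = threshold

    def choose(self, prompt: str) -> str:
        return "strong-model" if self.score(prompt) >= self.threshold else "small-model"

def handle(prompt: str, router: Router) -> str:
    """Application code: knows that a model was chosen, not how."""
    model = router.choose(prompt)
    return f"dispatch to {model}"

print(handle("hi", RuleRouter()))
print(handle("hi", ScoredRouter(lambda p: 0.9)))
```

Swapping `RuleRouter` for `ScoredRouter` changes one constructor call; `handle` and everything downstream stay the same.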
Trade-offs and pitfalls
Routing is not free, and a router introduces failure modes a single-model setup doesn't have:
- Added latency. Every routing decision happens before the real call. A static rule costs nothing; an LLM-based router adds a whole extra model call. Semantic or classifier routing keeps this to milliseconds — choose the cheapest decision method that is accurate enough.
- Routing errors. The router can send a hard request to the weak model (a quality miss) or an easy one to the strong model (a cost miss). The first is worse — it ships a bad answer silently — so tune the threshold to fail toward the stronger model when unsure.
- A second thing to evaluate. The router itself needs evals: route accuracy and the realised cost-quality trade-off, measured on your traffic, not a public benchmark whose distribution may not match yours.
- Drift. Rules and trained routers both decay as traffic, prompts, and model line-ups change. A router pinned to last quarter's models or request mix quietly routes worse over time.
- Hidden complexity. Routing adds a moving part. If a single capable model comfortably fits the budget, the simplest correct system has no router at all — resist routing as a default.
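The two kinds of routing error in the list above can be counted directly once you have traffic labelled with the tier each request actually needed. The sketch below assumes such labels exist; the sample prompts, the labels, and the length-based router are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

@dataclass
class RoutingReport:
    accuracy: float      # share of requests sent to the tier they needed
    quality_misses: int  # hard requests sent to the cheap model
    cost_misses: int     # easy requests sent to the strong model

def evaluate_router(routes_to_strong: Callable[[str], bool],
                    labelled: Iterable[Tuple[str, bool]]) -> RoutingReport:
    """Score a router on (prompt, needs_strong) pairs from your own traffic."""
    total = quality_misses = cost_misses = 0
    for prompt, needs_strong in labelled:
        total += 1
        chose_strong = routes_to_strong(prompt)
        if needs_strong and not chose_strong:
            quality_misses += 1
        elif chose_strong and not needs_strong:
            cost_misses += 1
    if total == 0:
        raise ValueError("need at least one labelled request")
    correct = total - quality_misses - cost_misses
    return RoutingReport(correct / total, quality_misses, cost_misses)

# Invented sample: (prompt, needs_strong).
SAMPLE = [
    ("hi", False),
    ("summarise this", False),
    ("prove the four colour theorem", True),
    ("fix bug", True),
    ("please format this long list nicely", False),
]

def length_router(prompt: str) -> bool:
    """Naive router: anything longer than 20 characters goes to the strong model."""
    return len(prompt) > 20

print(evaluate_router(length_router, SAMPLE))
```

Reporting the two miss types separately matters because they are not equally bad: a quality miss ships a worse answer, while a cost miss only spends more.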
Frequently asked questions
What is an LLM router?
An LLM router is the component that decides which model or provider should handle a given request, then forwards the call. It picks the destination by cost, latency, capability, or availability. Routing is one function; a full AI gateway contains a router but also adds caching, observability, and cost control around the call.
What is LLM routing?
LLM routing is the practice of sending each request to the most suitable model instead of using one fixed model for everything — a cheap, fast model for easy requests and a stronger, costlier one for hard ones. The goal is to cut cost and latency without losing quality on the requests that genuinely need a frontier model.
Why use an LLM router?
Because production traffic is not uniformly hard. Routing the easy majority of requests to a small, cheap model and escalating only the hard minority to a frontier model can cut cost substantially — published frameworks report cost reductions over 85% on some benchmarks while keeping around 95% of top-model quality — provided the cheap model genuinely handles what it is sent.
What are the routing strategies?
Four recur: static rules (hard-coded on request type or length), classifier or semantic routing (a small model or embedding match predicts the right model), cascade or fallback (try the cheap model first, escalate on low confidence or error), and cost-aware or threshold routing (route on a tuned cost-quality target). Real systems often combine a cheap rule with a cascade fallback.
What is semantic routing?
Semantic routing decides where a request goes by embedding it and matching it against pre-defined routes by vector similarity, rather than asking an LLM to choose. It is fast and cheap — decisions in milliseconds with no extra generation — and is the approach behind libraries like the open-source semantic-router. It suits routing by intent or topic at scale.
Should you build or buy an LLM router?
Build when a static rule on request type captures the decision — for one team with clear request classes that is the simplest correct choice. Buy or adopt open source (RouteLLM, semantic-router, NotDiamond, or the auto-router inside a gateway) when you need complexity-based routing, a measured cost-quality target, or routing across many providers. Keep the router behind a thin interface so the choice stays a one-file swap.
- RouteLLM (LMSYS; open-source LLM-routing framework, cost reduction over 85% on MT-Bench at ~95% of GPT-4 quality, OpenAI-compatible drop-in): lmsys.org/blog/2024-07-01-routellm/ and github.com/lm-sys/RouteLLM.
- semantic-router (embedding-similarity routing with no LLM generation step, MIT-licensed): github.com/aurelio-labs/semantic-router.
- NotDiamond (managed router that predicts the best model per query across quality, cost, and latency): docs.notdiamond.ai/docs/what-is-model-routing.
- OpenRouter Auto Router and LiteLLM auto routing (routing bundled inside aggregators / proxies): openrouter.ai/docs/guides/routing and docs.litellm.ai/docs/proxy/auto_routing.
Tool behaviour and benchmark figures verified against the sources above on 26 Jun 2026; numbers are benchmark-specific and tools move fast — confirm against live docs before building. Corrections: hello@aiarch.dev.