The decision: knowledge vs behaviour
RAG vs fine-tuning: which to use, and when
Use RAG when the model lacks knowledge; fine-tune when it lacks the right behaviour. RAG injects facts at query time by retrieving them and putting them in the context window, so answers stay current and traceable. Fine-tuning bakes a pattern — tone, format, a narrow skill — into the model's weights through further training. They solve different problems, so the honest default is: start with prompting, add RAG for facts, and fine-tune only when you need consistent behaviour that prompting can't hold.
This page is the decision — which approach for which problem. For how retrieval itself works end to end, see the retrieval-augmented generation guide.
What RAG does: knowledge at query time
Retrieval-augmented generation leaves the model unchanged. At query time it fetches relevant documents from an external store — usually via embeddings and a vector search — and places them in the prompt, so the model answers from supplied facts instead of its parametric memory. The knowledge lives outside the weights, which is exactly what makes it powerful: you can add, update, or delete a document and the next answer reflects it immediately, with no retraining.
Because retrieval knows which documents informed a response, RAG supports citations natively — provenance is a built-in property, not an afterthought. That is why it is the right tool when knowledge changes often, when answers must be traceable, and when you want a system you can maintain by editing source data. The deep dive lives in the RAG explainer; when the agent itself decides what and when to retrieve, that is agentic RAG.
What fine-tuning does: behaviour in the weights
Fine-tuning continues training a base model on your examples, adjusting its parameters so a desired pattern becomes the default. It is how you get reliable structure (always return this JSON shape), a consistent voice (our brand tone, every time), or a narrow specialised skill the base model hedges on. The behaviour is now in the weights, so it needs no instructions in the prompt and no documents retrieved at run time.
The cost is that the knowledge is frozen at training time. To change what a fine-tuned model knows you retrain it — and research on commercial fine-tuning shows it is an unreliable way to inject facts in the first place: models struggle to reliably recall knowledge taught only through fine-tuning, and can degrade on facts they previously knew. Fine-tune to change how the model behaves, not to teach it what is true.
RAG vs fine-tuning: the trade-offs
Most of the decision comes down to a handful of axes. Prompt engineering belongs in the comparison too: it is the cheapest lever and the one to exhaust first.
| Dimension | Prompt engineering | RAG | Fine-tuning |
|---|---|---|---|
| Solves | Instruction & format gaps | Missing or changing knowledge | Persistent behaviour, tone, narrow skill |
| Knowledge freshness | Only what you paste in | Live — edit the source | Frozen at training time |
| Provenance / citations | None inherent | Native (you know the source) | None — opaque weights |
| Cost to start | Lowest | Moderate (index + retrieval) | Highest (training + data prep) |
| Effort / skill | Prompt iteration | Retrieval & data plumbing | Curated training set, ML eval loop |
| Update speed | Instant | Instant (re-index a doc) | A retraining cycle |
| Runtime cost | Extra prompt tokens | Retrieval + larger context | Shorter prompts; can be cheaper per call |
Note the rows aren't symmetric — RAG and fine-tuning win on different axes because they target different failures. That is the whole point: this is not a contest for one winner.
When to use which
Work the cheap levers first, then reach for the one that matches the failure you actually have:
- Start with prompt engineering. Clear instructions, examples, and a good system prompt close more gaps than people expect — at near-zero cost and zero infrastructure. Do this before anything else.
- Reach for RAG when the gap is knowledge. The model doesn't know your docs, your pricing, this week's policy, or anything past its cutoff — and you need current, citable answers. RAG is the default for grounding.
- Reach for fine-tuning when the gap is behaviour. The model knows enough but won't hold a format, a tone, or a specialised pattern reliably through prompting alone — and that behaviour is stable enough to be worth baking in.
A quick test: ask whether your problem is "facts the model doesn't have" or "behaviour the model won't exhibit." Facts point to RAG; behaviour points to fine-tuning. If prompting already fixes it, you need neither.
Why most systems combine them
In production the question is rarely either/or. A common pattern is a lightly fine-tuned model — adapted for domain vocabulary, tone, or output structure — sitting behind a RAG layer that supplies current, company-specific facts. Fine-tuning handles how the system responds; retrieval handles what it knows. Each covers the other's weakness: RAG keeps knowledge fresh and citable, fine-tuning keeps behaviour consistent.
The architect's job is to sequence the investment: prove the use case with prompting, ground it with RAG, and only fine-tune once you have evidence that a behavioural gap is real and stable. Layer cost matters here — retrieval adds tokens and latency, fine-tuning adds training and maintenance — so weigh both against the value. See LLM cost optimization for the spend side of that decision.
Frequently asked questions
What is the difference between RAG and fine-tuning?
RAG retrieves relevant documents at query time and puts them in the prompt, so the model answers from external knowledge without any change to the model itself. Fine-tuning continues training the model on examples so a pattern — a tone, a format, a skill — becomes part of its weights. RAG adds knowledge; fine-tuning changes behaviour.
When should you use RAG instead of fine-tuning?
Use RAG when the problem is knowledge the model lacks or knowledge that changes — your documents, current data, anything past the training cutoff — and especially when answers must be traceable to a source. RAG lets you update by editing data instead of retraining, which makes it the right choice for fast-moving or citation-critical use cases.
When should you fine-tune?
Fine-tune when the model knows enough but won't behave consistently through prompting alone: it should always return a specific structure, hold a particular tone, or perform a narrow specialised task reliably. Fine-tuning is for stable behaviour worth baking into the weights — not for facts that change.
Can you use RAG and fine-tuning together?
Yes, and most mature systems do. A typical pattern is a fine-tuned model adapted for tone, domain language, or output format, layered with RAG that supplies current, company-specific facts. Fine-tuning governs how the system responds; retrieval governs what it knows, so the two are complementary rather than competing.
Which is cheaper, RAG or fine-tuning?
RAG is usually cheaper to start and to maintain: there is no training run and updates are just edits to the source data, though it adds retrieval and larger prompts at run time. Fine-tuning has higher upfront cost in data preparation and training, but can shorten prompts and lower per-call cost once in place. Prompt engineering is cheaper than both — try it first.
Should you fine-tune to add knowledge?
Generally no. Fine-tuning is an unreliable way to teach facts — studies of commercial fine-tuning APIs find models struggle to recall knowledge taught this way and can lose facts they already knew. For knowledge, use RAG, which keeps facts current, editable, and citable. Reserve fine-tuning for behaviour.
- Knowledge-vs-behaviour framing and the prompting-first ordering synthesized from AI Architect Academy's curriculum (Track B, retrieval and adaptation) and practitioner guidance from Heavybit and Elastic Search Labs.
- Fine-tuning is an unreliable knowledge-injection mechanism: FineTuneBench (arXiv:2411.05059) — commercial fine-tuning APIs struggle to reliably infuse new knowledge into LLMs.
- Hybrid (fine-tune for behaviour + RAG for facts) as the common production pattern: practitioner write-ups above; cost framing cross-referenced with our own LLM cost optimization notes.
Vendor fine-tuning capabilities and pricing change; treat this as a decision framework, not a guaranteed spec. Corrections: hello@aiarch.dev.
Learn to make these calls by building the system that needs them.
AI Architect Academy teaches retrieval, adaptation, and the cost trade-offs between them as first-class skills — on a platform that is itself a production RAG system built across Anthropic, AWS, and Cloudflare. The build is the curriculum.
Free sample — no signup · every claim cited · cancel anytime
Or get notified when new tracks ship.