AI Architect Academy

The grounding layer

Retrieval-augmented generation (RAG): what it is and how it works

Short answer

Retrieval-augmented generation (RAG) is a pattern that grounds a language model's answer in documents fetched at query time: you retrieve relevant text from an external store and pass it to the model as context, so it answers from your data instead of only its training weights. It is the standard fix for the three things a bare model cannot do — cite a source, know a private fact, or stay current.

This is the general RAG explainer. For retrieval the agent decides to run, see agentic RAG; for the store underneath it, vector database comparison; for the build-or-buy call against tuning the model itself, RAG vs fine-tuning.

What is retrieval-augmented generation?

RAG is the practice of giving a model the facts it needs at the moment you ask, rather than hoping it memorised them in training. The term comes from a 2020 paper by Lewis et al., which paired a retriever over an external corpus with a generator and showed the combination beat a model relying on its parameters alone on knowledge-intensive tasks. The shape it introduced — retrieve, then generate over what you retrieved — is what every RAG system since has elaborated.

The mental model is simple: a language model is a closed book that is fluent but fixed. Its knowledge stops at its training cut-off, it cannot see your private data, and it cannot tell you where an answer came from. RAG turns it into an open-book exam. You hand the model the relevant pages alongside the question, and it answers from those pages — which means the answer can be fresh, private, and attributable.

Why use RAG: grounding, freshness, and private data

RAG earns its place by solving four specific problems, and it is worth naming them because they are the only reasons to take on the extra machinery:

  • Grounding (against hallucination). A model asked something outside its knowledge will often produce a confident, wrong answer. Supplying the actual source text constrains it to answer from evidence, which is the single biggest reduction in fabricated facts you can make without changing the model.
  • Freshness. Training data has a cut-off; the world does not. Retrieval reads from a store you keep current, so the model can answer about yesterday's release or this morning's ticket without retraining.
  • Private and proprietary data. Your internal docs, contracts, and tickets were never in the model's training set. RAG is how a general model answers questions about data only your organisation holds — without sending that data into a training run.
  • Provenance. Because each retrieved chunk has a known source, the answer can cite it. An answer you can trace is one you can ship in a regulated or high-stakes setting; an unattributable one is not.

The honest framing is that RAG buys grounding at the cost of a retrieval pipeline you now have to build, keep fresh, and evaluate. When a model's built-in knowledge already covers the question, that pipeline is overhead. Reach for RAG when the answer must come from data the model does not, and should not, hold in its weights.

How RAG works: the pipeline

A RAG system has two halves that run on different clocks. Indexing happens offline, ahead of time, and is refreshed as the source data changes. Retrieval and generation happen per request, in the time it takes to answer. The stages below are the whole pipeline, end to end.

Stage | What happens | When it runs
Ingest | Collect source documents and normalise them to clean text (strip markup, parse PDFs, pull metadata). | Offline
Chunk | Split each document into passages small enough to retrieve precisely but large enough to stay coherent. | Offline
Embed | Turn each chunk into a vector with an embedding model, so semantic similarity becomes distance in vector space. | Offline
Index | Store the vectors, original text, and metadata in a vector index built for nearest-neighbour search. | Offline
Retrieve | Embed the incoming question, find the nearest chunks (often re-ranked, sometimes with keyword search as hybrid retrieval). | Per request
Augment | Assemble the retrieved chunks and the question into a prompt, carrying each chunk's source through for citation. | Per request
Generate | The model writes the answer from the supplied context, grounded in the chunks rather than its parametric memory. | Per request
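To make the stages concrete, here is a minimal sketch of the pipeline in plain Python. Everything in it is an assumption for illustration: the two-document corpus is made up, the "embedding" is a bag-of-words count vector standing in for a real embedding model, the index is a list searched by brute force, and the generate step is left out because it is just a call to whichever model you use.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: source id -> text. A real ingest step reads files.
DOCS = {
    "handbook.md": "Refunds are issued within 14 days of purchase. "
                   "Contact support to start a refund.",
    "release-notes.md": "Version 2.3 adds single sign-on. "
                        "Version 2.2 fixed the export bug.",
}

def embed(text):
    """Stand-in embedding (assumption): word counts, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_index(docs):
    """Offline: ingest, chunk (one sentence per chunk), embed, index."""
    index = []
    for source, text in docs.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                index.append({"source": source, "text": sentence,
                              "vector": embed(sentence)})
    return index

def retrieve(index, question, k=2):
    """Per request: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda c: cosine(q, c["vector"]), reverse=True)
    return ranked[:k]

def augment(question, chunks):
    """Per request: build the prompt, carrying each chunk's source through."""
    context = "\n".join(f"[{i}] ({c['source']}) {c['text']}"
                        for i, c in enumerate(chunks, 1))
    return ("Answer using only the sources below and cite them by number.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}")

index = build_index(DOCS)
question = "How many days do I have to get a refund?"
top = retrieve(index, question)
prompt = augment(question, top)
print(prompt)  # this string is what you would send to the model
```

The split between `build_index` and the two per-request functions mirrors the two clocks: the index is built once and refreshed, while `retrieve` and `augment` run on every question.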

Two stages quietly decide whether the whole thing works. Chunking sets the granularity of what can be retrieved — chunks too large drown the signal, too small lose the context. And carrying provenance through the augment step is the part teams skip, then regret, because without it no answer can be cited and nothing can be evaluated.
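One common way to soften the too-large / too-small trade-off is a sliding window with overlap, so a sentence cut at one boundary still appears whole in the neighbouring chunk. The sketch below assumes word-based windows and made-up `size` and `overlap` defaults; real chunkers often split on headings, sentences, or tokens instead. Note that each chunk keeps its source and position, which is the provenance the augment step needs later.

```python
def chunk_words(text, source, size=40, overlap=10):
    """Split text into windows of `size` words; consecutive windows share
    `overlap` words. Sizes are illustrative defaults, not recommendations."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append({
            "source": source,        # provenance: which document
            "start_word": start,     # provenance: where in the document
            "text": " ".join(window),
        })
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Hypothetical 100-word document made of placeholder words w0..w99.
text = " ".join(f"w{i}" for i in range(100))
chunks = chunk_words(text, source="policy.txt", size=40, overlap=10)
for c in chunks:
    print(c["source"], c["start_word"], len(c["text"].split()))
```

A fixed word window will still split tables and clauses badly; it is a baseline to measure against, not a finished strategy.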

RAG architecture

Architecturally, RAG is three components: an embedding model, a vector store, and the retriever/generator that orchestrates each request. None of them is exotic; the engineering is in keeping them honest.

  • The embedding model turns both documents and queries into vectors in the same space. The same model must embed both sides, or similarity is meaningless. Its quality sets the ceiling on retrieval.
  • The vector store / index holds the embedded chunks and serves nearest-neighbour search at query time. The realistic options — pgvector, Pinecone, Weaviate, Qdrant, Cloudflare Vectorize, and managed services like AWS Bedrock Knowledge Bases or Cloudflare AI Search — are compared in the vector database comparison. Pick on data gravity and operations, not a leaderboard.
  • The retriever / generator orchestrates a request: embed the question, query the index, optionally re-rank, build the augmented prompt, and call the model.
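The vector store's core job can be sketched in a few lines. `TinyVectorStore` below is a hypothetical in-memory stand-in, not the API of any product named above: it does exact brute-force cosine search, where real indexes typically use approximate nearest-neighbour structures. The dimension check is only a weak guard for the "same model on both sides" rule; two different models with the same dimension would pass it and still give meaningless scores.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory vector index with exact cosine search."""

    def __init__(self, dim):
        self.dim = dim
        self.rows = []  # each row: (unit vector, text, metadata)

    @staticmethod
    def _normalise(vector):
        length = math.sqrt(sum(x * x for x in vector))
        if length == 0:
            raise ValueError("cannot use a zero vector")
        return [x / length for x in vector]

    def _check(self, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, vector, text, metadata):
        """Store a chunk's vector alongside its text and source metadata."""
        self._check(vector)
        self.rows.append((self._normalise(vector), text, metadata))

    def search(self, query_vector, k=3):
        """Return the k rows with the highest cosine similarity to the query."""
        self._check(query_vector)
        q = self._normalise(query_vector)
        scored = [(sum(a * b for a, b in zip(q, v)), text, meta)
                  for v, text, meta in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]

# Hand-written 3-d vectors stand in for real embeddings (assumption).
store = TinyVectorStore(dim=3)
store.add([1.0, 0.0, 0.0], "Refund policy", {"source": "handbook.md"})
store.add([0.0, 1.0, 0.0], "Release notes", {"source": "release-notes.md"})
store.add([0.9, 0.1, 0.0], "Refund FAQ", {"source": "faq.md"})

hits = store.search([0.8, 0.2, 0.0], k=2)
for score, text, meta in hits:
    print(f"{score:.3f}  {text}  ({meta['source']})")
```

Because vectors are normalised on the way in, the dot product in `search` is the cosine similarity, and metadata travels with every hit so the caller can cite it.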

That is the baseline, single-pass architecture: retrieve once, generate once. It is enough for a great deal of question-answering, and you should not reach past it without a reason. The reason, when it comes, is usually that one fixed retrieval is not enough — which is where the agent takes over, covered in agentic RAG. For where retrieval sits among the other components of an agent system, see agentic AI architecture.

Limits and failure modes

RAG moves the failure surface rather than removing it. Knowing where it breaks is the difference between a demo and a system you can operate:

  • Retrieval misses. If the right chunk is not retrieved, the model cannot use it — and a fluent answer built on the wrong chunks looks just as confident as a correct one. Most RAG quality problems are retrieval problems, not generation problems.
  • Bad chunking. Split a table mid-row or a clause from its condition and the retrieved passage is technically relevant but unusable. Chunk strategy is the most under-rated knob in the pipeline.
  • Stale or noisy sources. RAG is only as fresh and as clean as its index. A store that is not re-indexed, or full of contradictory documents, grounds the model in the wrong facts — confidently.
  • Ungrounded generation. Even with good chunks, a model can drift off them and assert things the sources do not support. This is why groundedness, not just retrieval recall, has to be measured.
  • Context and cost limits. You cannot stuff unlimited chunks into the prompt; more context means more tokens, more latency, and eventually degraded attention. Retrieval has to be precise, not just generous.
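The context limit in the last point is usually enforced with an explicit budget. The sketch below is one simple way to do it, under stated assumptions: chunks arrive already ranked best-first, and token cost is approximated by a whitespace word count, which is not how any real tokenizer counts. Swap in your model's tokenizer through the hypothetical `count_tokens` parameter.

```python
def pack_context(ranked_chunks, max_tokens,
                 count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-ranked chunks that fit the budget.

    A chunk that would overflow is skipped, and smaller lower-ranked
    chunks may still be added after it.
    """
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk["text"])
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; try the next one
        kept.append(chunk)
        used += cost
    return kept, used

# Hypothetical ranked results, best first.
ranked = [
    {"id": "a", "text": "one two three four five"},
    {"id": "b", "text": "one two three four"},
    {"id": "c", "text": "one two three"},
]
kept, used = pack_context(ranked, max_tokens=8)
print([c["id"] for c in kept], used)
```

Skipping an oversized chunk rather than stopping at it favours filling the budget; stopping at the first overflow instead would favour strict rank order. Which is better depends on how much you trust the ranking.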

Because the same question can take a different retrieval path and produce a different answer, you pin RAG quality with an eval harness — retrieval recall and precision, plus groundedness — rather than manual spot-checks. The full treatment is in how to evaluate an LLM agent.
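The retrieval metrics are simple enough to sketch directly. The chunk ids and relevance labels below are made up, and `supported_fraction` is only a crude word-overlap proxy for groundedness with an arbitrary threshold; production harnesses typically use a model or an entailment check as the judge.

```python
import re

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of the relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Share of the top k results that are relevant."""
    relevant = set(relevant_ids)
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for cid in top if cid in relevant) / len(top)

def supported_fraction(answer, chunk_texts, threshold=0.6):
    """Crude groundedness proxy (assumption): share of answer sentences
    whose words mostly appear somewhere in the retrieved chunks."""
    vocab = set(re.findall(r"[a-z0-9]+", " ".join(chunk_texts).lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        if words and sum(w in vocab for w in words) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)

# Hypothetical eval case: what the retriever returned vs. hand-labelled truth.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = ["c2", "c4", "c5"]
print(recall_at_k(retrieved, relevant, k=4))
print(precision_at_k(retrieved, relevant, k=4))

chunks = ["Refunds are issued within 14 days of purchase."]
answer = "Refunds are issued within 14 days. Shipping is free worldwide."
print(supported_fraction(answer, chunks))
```

Run over a fixed set of labelled questions, these numbers give you a baseline to compare against whenever chunking, the embedding model, or the prompt changes.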

RAG, agents, and fine-tuning

RAG is one of three ways to give a model knowledge or behaviour it lacks, and they are not rivals so much as different tools:

  • RAG injects knowledge at query time. Best when the facts change, are private, or must be cited. The model stays general; the data stays external and current.
  • Fine-tuning changes the model's weights to bake in a skill, format, or domain style. Best for how the model behaves, not what it currently knows — and it does not solve freshness or provenance. The decision between them gets its own page: RAG vs fine-tuning. In practice many systems do both.
  • Agentic RAG is the next step up from baseline RAG: instead of retrieving once as a fixed pre-step before generation, the agent decides when to retrieve, rewrites the query, grades the results, and searches again if they fall short. Retrieval becomes a tool the agent calls, not a pre-step. That pattern is covered in agentic RAG.

The progression is worth holding in your head: a bare model, then RAG to ground it, then an agent to control the retrieval. Each step adds capability and cost. Start at the simplest one that answers your questions, and move up only when it stops being enough.
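The difference between single-pass and agentic retrieval is easiest to see as a loop. The sketch below is illustrative only: in a real agent the sufficiency check and the query rewrite would be model calls, whereas here `keyword_search`, `has_refund_fact`, and `expand_query` are hypothetical stubs over a two-chunk corpus so the control flow can run on its own.

```python
def agentic_retrieve(question, search, is_sufficient, rewrite, max_rounds=3):
    """Search, grade the results, and rewrite the query until the results
    are judged sufficient or the round budget runs out."""
    query, chunks, history = question, [], []
    for _ in range(max_rounds):
        chunks = search(query)
        history.append((query, len(chunks)))
        if is_sufficient(question, chunks):
            return {"chunks": chunks, "sufficient": True, "history": history}
        query = rewrite(question, query, chunks)
    return {"chunks": chunks, "sufficient": False, "history": history}

# --- Hypothetical stubs standing in for a retriever and two model calls ---
CORPUS = [
    {"source": "billing.md",
     "text": "Invoices are emailed on the first day of each month."},
    {"source": "handbook.md",
     "text": "Refunds are issued within 14 days of purchase."},
]

def keyword_search(query):
    """Return chunks sharing at least one word with the query."""
    terms = set(query.lower().split())
    return [c for c in CORPUS
            if terms & set(c["text"].lower().rstrip(".").split())]

def has_refund_fact(question, chunks):
    """Stub grader: a real agent would ask the model if the chunks suffice."""
    return any("refund" in c["text"].lower() for c in chunks)

def expand_query(question, last_query, chunks):
    """Stub rewriter: a real agent would have the model rephrase the query."""
    return last_query + " refunds"

result = agentic_retrieve("money back window", keyword_search,
                          has_refund_fact, expand_query)
for query, hit_count in result["history"]:
    print(f"{query!r} -> {hit_count} chunk(s)")
```

Each extra round is another search plus, in a real system, at least one more model turn, which is the added cost the progression above refers to. The `max_rounds` cap keeps that cost bounded.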

Frequently asked questions

What is retrieval-augmented generation?

Retrieval-augmented generation (RAG) is a pattern that fetches relevant text from an external store at query time and passes it to a language model as context, so the model answers from that supplied data rather than only its training weights. It grounds answers in real sources, which lets them be current, private, and citable. The term was introduced by Lewis et al. in 2020, pairing a retriever over a corpus with a generator.

Why use RAG?

RAG solves four problems a bare model cannot: it reduces hallucination by grounding answers in supplied text, it keeps answers fresh by reading from a store you update instead of retraining, it lets a general model answer about private data that was never in its training set, and it gives provenance because each retrieved chunk has a known source the answer can cite. If the model's built-in knowledge already covers the question, RAG is overhead you do not need.

How does RAG work?

RAG runs in two phases. Offline, it ingests documents, splits them into chunks, embeds each chunk into a vector, and stores them in a vector index. Per request, it embeds the question, retrieves the nearest chunks by similarity (often re-ranked or combined with keyword search), augments the prompt with those chunks and their sources, and the model generates an answer grounded in the supplied text. Indexing happens ahead of time; retrieval and generation happen when you ask.
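Combining vector results with keyword results needs some way to merge two ranked lists. This page does not prescribe one; the sketch below uses reciprocal rank fusion, a widely used method that scores each chunk by the sum of 1 / (k + rank) across the lists it appears in. The chunk ids are made up, and k = 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first rankings of chunk ids into one.

    Each appearance of a chunk adds 1 / (k + rank) to its score, so chunks
    ranked well by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by id so the output is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Hypothetical results from two retrievers for the same question.
vector_ranking = ["c3", "c1", "c8"]
keyword_ranking = ["c1", "c5", "c3"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Because it uses ranks rather than raw scores, this approach sidesteps the problem that cosine similarities and keyword scores are on different scales.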

What is the difference between RAG and fine-tuning?

RAG injects knowledge at query time by retrieving external documents, so it is the tool for facts that change, are private, or must be cited; the model stays general and the data stays current. Fine-tuning changes the model's weights to bake in a skill, format, or domain style, so it is the tool for how the model behaves, not what it currently knows. Fine-tuning does not solve freshness or provenance. Many production systems use both. The full decision is in the RAG vs fine-tuning guide.

What are the limitations of RAG?

RAG moves the failure surface rather than removing it. The common failure modes are retrieval misses (the right chunk is never fetched), bad chunking (passages split so they are relevant but unusable), stale or noisy sources (the index grounds the model in wrong facts), ungrounded generation (the model drifts off the supplied chunks), and context and cost limits (you cannot pass unlimited chunks). Most RAG quality problems are retrieval problems, which is why you measure retrieval recall and groundedness with an eval harness.

What is agentic RAG?

Agentic RAG is RAG where an agent controls retrieval rather than a fixed pipeline. Instead of retrieving once before generating, the agent decides whether to search, writes or rewrites the query, judges whether the results are sufficient, and can search again or switch source before answering. Retrieval becomes a tool the agent calls inside its loop. It buys recall and robustness on multi-step questions at the cost of extra model turns; baseline single-pass RAG is the cheaper default.

Sources & provenance
  • The term and the retrieve-then-generate shape originate in Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," NeurIPS 2020 (arXiv:2005.11401) — read the paper for the original retriever/generator mechanism; this page summarises the pattern, not its benchmarks.
  • Pipeline, architecture, and failure-mode framing synthesised from AI Architect Academy's curriculum (Track B, retrieval and agentic systems) and the platform's own build (docs/DESIGN.md, src/lib/rag.ts).
  • Vector-store and managed-retrieval options reflect current products — pgvector, Pinecone, Weaviate, Qdrant, Cloudflare Vectorize / AI Search, AWS Bedrock Knowledge Bases — compared in the vector database comparison; verify exact API shapes against each vendor's live docs before building.

Vendor products and API shapes change; treat any mapping as a design template, not a guaranteed signature. Corrections: hello@aiarch.dev.
