AI Architect Academy

Agentic RAG: making retrieval a decision the agent controls

Short answer

Agentic RAG is retrieval-augmented generation where the agent decides when to retrieve, what to query, and whether the results are good enough — sometimes searching again — instead of running a fixed retrieve-then-generate pipeline. Retrieval stops being a pre-step bolted onto the prompt and becomes a tool the orchestrator can call, evaluate, and re-call until the context is sufficient to answer.

This page covers the retrieval layer in depth. For how retrieval fits alongside the other five components of an agent system, see agentic AI architecture, which links here for the retrieval detail.

The short version

Classic RAG is a straight line: take the user's question, embed it, search a vector index, stuff the top chunks into the prompt, generate. It runs once, the same way, every time. That is enough for a lot of question-answering, and you should not reach past it without a reason.

Agentic RAG turns that line into a loop with a decision at the top of it. The agent treats retrieval as one of its tools. It chooses whether to search at all, writes its own query (often rewriting the user's), inspects what came back, and decides whether to answer, search again with a better query, or try a different source. Retrieval becomes controlled rather than scripted.

Agentic RAG vs classic RAG

The difference is not the vector store or the embeddings — both architectures use those. The difference is who is in charge of retrieval: a fixed pipeline, or the model.

| Dimension | Classic RAG | Agentic RAG |
| --- | --- | --- |
| When to retrieve | Always, once, before generation | The agent decides: maybe zero, maybe several times |
| What to query | The user's question, embedded as-is | A query the agent writes or rewrites for the index |
| Result sufficiency | Assumed: top-k goes straight to the prompt | Judged: the agent grades relevance and can retry |
| Sources | Usually one index | Can route across indexes or tools (search, SQL, API) |
| Control flow | Linear pipeline, runs identically each time | A loop with a stop condition |
| Cost / latency | Low and predictable | Higher and variable: extra model turns per search |
| Best for | Direct lookups over one corpus | Multi-step questions, mixed sources, recall that needs checking |

Read the table as a cost curve, not a verdict. Agentic RAG buys recall and robustness with extra model turns. If classic RAG answers your questions, the loop is overhead you do not need.

How a classic RAG pipeline works (the baseline)

You cannot reason about the agentic version without the baseline it is built on. A RAG pipeline has two phases.

Indexing (offline): split source documents into chunks, embed each chunk into a vector with an embedding model, and store those vectors — plus the original text and metadata — in a vector index. This runs ahead of time and is refreshed as the source changes.

Retrieval and generation (per request): embed the incoming question, find the nearest chunks by vector similarity (often re-ranked, sometimes combined with keyword search as hybrid retrieval), and pass those chunks to the model as grounding context alongside the question. The model answers from the supplied text rather than from its parametric memory.

Everything load-bearing about RAG lives in this baseline: chunking strategy, embedding quality, the index, re-ranking, and — the part teams skip — carrying each chunk's source through to the answer so it can be cited. Agentic RAG does not replace any of this. It wraps a decision-maker around the retrieval step.
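The two phases above can be sketched in a few lines. This is a toy illustration, not the platform's implementation: the bag-of-words `embed` function stands in for a real embedding model, the index is a plain list rather than a vector database, and every name here (`Chunk`, `build_index`, `retrieve`, `build_prompt`) is hypothetical. The one thing it does faithfully is carry each chunk's source identifier through to the prompt.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    source_id: str   # provenance: document id plus chunk number
    text: str
    vector: Counter  # toy "embedding": token counts


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(docs: dict[str, str], chunk_size: int = 8) -> list[Chunk]:
    """Offline phase: split each document into fixed-size word chunks and embed them."""
    index = []
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(0, len(words), chunk_size):
            piece = " ".join(words[start:start + chunk_size])
            index.append(Chunk(f"{doc_id}#{start // chunk_size}", piece, embed(piece)))
    return index


def retrieve(index: list[Chunk], question: str, k: int = 2) -> list[Chunk]:
    """Per-request phase: embed the question and return the k nearest chunks."""
    q = embed(question)
    return sorted(index, key=lambda c: cosine(q, c.vector), reverse=True)[:k]


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Pass the chunks to the model as grounding context, labelled with their sources."""
    context = "\n".join(f"[{c.source_id}] {c.text}" for c in chunks)
    return f"Answer using only the sources below and cite their ids.\n{context}\n\nQuestion: {question}"


docs = {
    "refunds": "Refunds are issued within 14 days of purchase when the item is unused.",
    "shipping": "Standard shipping takes five business days within the country.",
}
index = build_index(docs)
top = retrieve(index, "When are refunds issued?")
prompt = build_prompt("When are refunds issued?", top)
```

In a real pipeline the chunker, the embedding model, and the index would each be replaced by something stronger; the shape of the two phases stays the same.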

What makes RAG agentic

Three capabilities move a pipeline from classic to agentic. None of them is exotic; together they change who is in control.

  • Retrieval on demand. The agent decides whether a question even needs a search. A greeting or a follow-up it can already answer from context should not trigger a vector lookup. The model gates retrieval instead of the pipeline forcing it.
  • Query formulation. The user's wording is often a poor search query. An agent rewrites it — expanding acronyms, splitting a compound question into sub-questions, or phrasing it the way the corpus is written. This is the well-established query rewriting step, now driven by the model rather than a fixed template.
  • Result grading and iteration. The agent inspects what came back and judges whether it is relevant and sufficient. If not, it reformulates and searches again, or switches source. This document-grading-and-retry shape is what published patterns like Self-RAG (the generator critiques its own retrieval) and Corrective RAG (an evaluator scores the retrieved set and triggers a fallback) formalise.
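The three capabilities in the list above can be pictured as three small functions. In an agentic system each of these judgments is made by the model; the sketch below replaces them with simple heuristics purely so the example runs on its own. The glossary, the small-talk list, the overlap threshold, and all function names are assumptions made for illustration, and the grading rule is far cruder than the mechanisms in Self-RAG or Corrective RAG.

```python
import re

# Hypothetical glossary and small-talk list, for illustration only.
ACRONYMS = {"sla": "service level agreement", "pto": "paid time off"}
SMALL_TALK = {"hi", "hello", "thanks", "thank you", "bye"}


def needs_retrieval(message: str) -> bool:
    """Retrieval on demand: skip the index for greetings and similar small talk."""
    return message.strip().lower().rstrip("!.?") not in SMALL_TALK


def rewrite_query(question: str) -> list[str]:
    """Query formulation: split a compound question and expand known acronyms."""
    parts = [p.strip(" ?.") for p in re.split(r"\band\b|;", question) if p.strip(" ?.")]
    rewritten = []
    for part in parts:
        words = [ACRONYMS.get(w.lower(), w) for w in part.split()]
        rewritten.append(" ".join(words))
    return rewritten


def grade(query: str, chunks: list[str], min_overlap: float = 0.5) -> bool:
    """Result grading: treat results as sufficient if enough query terms appear in them."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    found = set(re.findall(r"[a-z0-9]+", " ".join(chunks).lower()))
    return bool(terms) and len(terms & found) / len(terms) >= min_overlap


sub_queries = rewrite_query("What is the SLA for support and how much PTO do new hires get?")
```

When `grade` returns false, the agent's next move is to reformulate and search again or switch source, which is the iteration step described in the last bullet.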

Frameworks expose these as first-class building blocks. LangGraph documents an "agentic RAG" pattern where the retriever is a tool the agent calls, grades, and re-queries; LlamaIndex offers router and sub-question query engines that let the model choose an index or decompose a question. Treat these as confirmation that the shape is standard, not as a requirement — the capabilities matter more than any one library.

Architecture: retrieval as a tool the agent calls

The cleanest way to think about agentic RAG is to stop thinking of retrieval as a phase and start thinking of it as a tool in the agent's action layer. The orchestrator runs its normal bounded loop; one of the tools it can invoke is search(query).

A single iteration looks like this: the model receives the goal, decides a search is warranted, emits a search tool call with a query it wrote, the orchestrator runs the retrieval and returns the chunks, and the model reads them and either answers — citing sources — or calls search again with a refined query. The same turn-and-tool-call budget that bounds any agent loop bounds the number of retrievals, which is what stops a hard question from triggering an unbounded search spiral.
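A minimal sketch of that iteration, assuming the model is represented by a `decide` callback that returns either a search action or a final answer. The `Action` type, `run_agent`, the one-entry corpus, and the scripted `decide` policy are all hypothetical stand-ins; a real orchestrator would call a model and a real retriever at those two points. The part worth noticing is the turn budget, which caps how many searches a hard question can trigger.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str                     # "search" or "answer"
    text: str                     # the query, or the final answer
    citations: list[str] = field(default_factory=list)


def run_agent(goal, decide, search, max_turns=3):
    """Bounded loop: each turn the model either searches or answers."""
    observations = []             # (query, results) pairs the model can read
    for _ in range(max_turns):
        action = decide(goal, observations)
        if action.kind == "answer":
            return action
        observations.append((action.text, search(action.text)))
    # Budget exhausted: stop instead of searching forever.
    return Action("answer", "I could not find enough support to answer.")


# Toy corpus and substring search, standing in for a real retriever.
CORPUS = {"hr-policy#3": "Employees accrue paid time off at 1.5 days per month."}


def search(query):
    q = query.lower()
    return [(sid, text) for sid, text in CORPUS.items() if q in text.lower()]


def decide(goal, observations):
    """Scripted stand-in for the model: try the user's wording, then a rewrite."""
    if not observations:
        return Action("search", "vacation days")
    _, results = observations[-1]
    if results:
        sid, text = results[0]
        return Action("answer", text, [sid])
    return Action("search", "paid time off")


result = run_agent("How many vacation days do I get?", decide, search)
starved = run_agent("How many vacation days do I get?", decide, search, max_turns=1)
```

Here the first query finds nothing, the rewritten query hits, and the third turn answers with a citation. With a budget of one turn the same question ends in the fallback answer rather than an open-ended search.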

This is why agentic RAG belongs to the same architecture as everything else an agent does, not to a separate "RAG system." Retrieval is in the tools/action layer; the decision to use it is in the orchestrator; the bound on it is in the operational plane. For how those layers fit together, and where retrieval sits among them, see agentic AI architecture; for the loop that bounds the tool calls, see agentic AI design patterns. If you want the search tool behind a standard interface, the Model Context Protocol (MCP) is a common way to expose it, so that the same retrieval tool can be reused across agents.

Choosing the knowledge and vector layer

Agentic RAG still rests on a vector index, and the index choice is mostly orthogonal to whether retrieval is agentic — both architectures need it. Keep this decision boring and reversible. The realistic options in 2026:

  • pgvector — a Postgres extension. The "use the database you already have" answer; strong default when your data and ops already live in Postgres.
  • Pinecone — a fully managed, serverless vector database; you trade control for not running it.
  • Weaviate / Qdrant — vector databases with first-class hybrid search; Qdrant is open-source-first with a managed tier, Weaviate leans into hybrid retrieval.
  • Cloudflare Vectorize — an edge-native vector store; on this stack it pairs with Cloudflare AI Search (the managed RAG product formerly called AutoRAG) and Workers AI embeddings.
  • Managed RAG services — AWS Bedrock Knowledge Bases (Retrieve / RetrieveAndGenerate APIs) and Cloudflare AI Search handle chunk-embed-store-retrieve for you, and a Bedrock agent can query a knowledge base during orchestration. They shorten the path to a baseline pipeline; you still own the agentic decision layer on top.

Pick on data gravity, hybrid-search needs, and operational appetite — not on a benchmark leaderboard. The vector layer is the easiest piece to swap later; the retrieval logic and provenance handling are the parts worth getting right.
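One way to keep the vector layer swappable is to have the retrieval logic depend on a narrow interface rather than on a vendor client. The sketch below is an assumption about how that could look, not any vendor's API: the `VectorStore` protocol, its two methods, and the `InMemoryStore` implementation are hypothetical, and an adapter for pgvector, Pinecone, or another store would implement the same two methods.

```python
import math
from typing import Protocol


class VectorStore(Protocol):
    """The only surface the retrieval logic is allowed to depend on."""

    def upsert(self, item_id: str, vector: list[float], metadata: dict) -> None: ...

    def query(self, vector: list[float], k: int) -> list[tuple[str, float, dict]]: ...


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Toy implementation; a vendor adapter would expose the same two methods."""

    def __init__(self) -> None:
        self._items: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, item_id: str, vector: list[float], metadata: dict) -> None:
        self._items[item_id] = (vector, metadata)

    def query(self, vector: list[float], k: int) -> list[tuple[str, float, dict]]:
        scored = [(item_id, _cosine(vector, v), meta)
                  for item_id, (v, meta) in self._items.items()]
        return sorted(scored, key=lambda row: row[1], reverse=True)[:k]


def top_sources(store: VectorStore, query_vector: list[float], k: int = 1) -> list[str]:
    """Retrieval logic written against the interface, not a specific backend."""
    return [meta["source"] for _, _, meta in store.query(query_vector, k)]


store = InMemoryStore()
store.upsert("a", [1.0, 0.0], {"source": "refunds.md"})
store.upsert("b", [0.0, 1.0], {"source": "shipping.md"})
sources = top_sources(store, [0.9, 0.1])
```

Because `top_sources` only sees the protocol, changing the backend later means writing a new adapter, not rewriting the retrieval or provenance code.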

Provenance is non-negotiable. Carry each chunk's source identifier from the index through retrieval into the answer, so every claim can be traced back. An answer you cannot attribute is one you cannot ship — and the same discipline is what makes evaluation possible.
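A small check makes the provenance rule enforceable. The sketch assumes a convention in which the answer cites chunks inline as `[source_id]`; that bracket format and the `check_citations` helper are illustrative assumptions, not a standard. It flags answers that cite nothing, and answers that cite an identifier which was never retrieved.

```python
import re


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Compare the ids cited in an answer against the ids that were actually retrieved."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return {
        "cited": sorted(cited),
        "unknown": sorted(cited - retrieved_ids),   # cited but never retrieved
        "shippable": bool(cited) and cited <= retrieved_ids,
    }


retrieved = {"refunds#0", "refunds#1"}
good = check_citations("Refunds are issued within 14 days [refunds#0].", retrieved)
invented = check_citations("Refunds cost nothing [pricing#2].", retrieved)
uncited = check_citations("Refunds are issued within 14 days.", retrieved)
```

This only verifies that citations point at retrieved chunks. Whether the cited chunk actually supports the claim is a groundedness question and needs a separate check.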

Evaluating agentic RAG

Agentic RAG has more to evaluate than classic RAG because both the retrieval and the agent's decisions can fail. Split it:

  • Retrieval quality — did the right chunks come back? Measure with recall and precision over a labelled set of question-to-relevant-chunk pairs. This is the classic-RAG metric, and it still applies per search.
  • Groundedness / faithfulness — is the answer actually supported by the retrieved chunks, or did the model wander off them? This is where citations earn their keep: an answer that names its sources is one you can check.
  • Agentic decisions — the new surface. Did the agent retrieve when it should have, skip when it should have, write a good query, and stop at the right time? These are decisions, so you evaluate them with a held-out set of scenarios and an LLM-as-judge, not a single accuracy number.
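For the retrieval-quality bullet, precision and recall at k can be computed directly from a labelled set. The sketch below assumes labels are stored as a mapping from question to the set of relevant chunk ids and that the retriever returns a ranked list of ids; the function names and the sample data are made up for illustration.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall over the top k retrieved chunk ids for one question."""
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(labelled: dict[str, set[str]], retriever, k: int) -> tuple[float, float]:
    """Mean precision and recall at k across the labelled questions."""
    scores = [precision_recall_at_k(retriever(q), relevant, k) for q, relevant in labelled.items()]
    mean_precision = sum(p for p, _ in scores) / len(scores)
    mean_recall = sum(r for _, r in scores) / len(scores)
    return mean_precision, mean_recall


# Made-up labels and a canned retriever, standing in for a real index.
labelled = {"q1": {"a", "b"}, "q2": {"c"}}
canned = {"q1": ["a", "x", "b"], "q2": ["c", "y"]}
mean_precision, mean_recall = evaluate(labelled, canned.__getitem__, k=2)
```

In an agentic system this runs per search, so a question that triggers three searches produces three sets of scores rather than one.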

Because the same question can take a different retrieval path on different runs, you pin quality with an eval harness rather than manual spot-checks, the same discipline used for any non-deterministic agent. The full treatment is in how to evaluate an LLM agent; the short rule is that you cannot tune what you do not measure, and agentic RAG adds the agent's decisions to retrieval quality and groundedness as things to measure.
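A harness for the decision layer can start very small. The sketch below assumes the agent can be called as a function that returns the list of search queries it issued, and that each scenario is labelled with whether a search was expected and how many searches are acceptable. The `Scenario` type, `decision_pass_rate`, and the stub agent are hypothetical. It covers only the retrieve-or-skip and stopping decisions; judging whether a query was well written would still need a rubric or an LLM-as-judge.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    message: str
    should_retrieve: bool
    max_searches: int = 3


def decision_pass_rate(scenarios: list[Scenario], agent, runs: int = 5) -> dict[str, float]:
    """Run each scenario several times and report the share of runs with correct decisions."""
    report = {}
    for scenario in scenarios:
        passed = 0
        for _ in range(runs):
            searches = agent(scenario.message)
            retrieved_ok = bool(searches) == scenario.should_retrieve
            stopped_ok = len(searches) <= scenario.max_searches
            passed += retrieved_ok and stopped_ok
        report[scenario.message] = passed / runs
    return report


def stub_agent(message: str) -> list[str]:
    """Deterministic stand-in for an agent: it only searches when it sees a question mark."""
    return [message] if message.endswith("?") else []


scenarios = [
    Scenario("hello", should_retrieve=False),
    Scenario("What is the refund window?", should_retrieve=True),
    Scenario("Summarise the refund policy", should_retrieve=True),
]
report = decision_pass_rate(scenarios, stub_agent)
```

Repeating each scenario matters with a real agent, where the pass rate for a scenario can land between 0 and 1; a rate that drifts after a prompt or model change is the signal the harness exists to catch.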

Frequently asked questions

What is agentic RAG?

Agentic RAG is retrieval-augmented generation where an agent controls retrieval rather than a fixed pipeline. The agent decides when to search, writes or rewrites the query, judges whether the results are sufficient, and can search again before answering. Retrieval becomes a tool the agent calls inside its loop, instead of a single retrieve-then-generate step run the same way every time.

How is agentic RAG different from RAG?

Classic RAG runs one fixed sequence: embed the question, search the index, put the top chunks in the prompt, generate. Agentic RAG wraps a decision around that step — the model chooses whether to retrieve, what to query, and whether the results are good enough, sometimes iterating. Same vector index and embeddings; the difference is that the agent, not the pipeline, is in charge of retrieval.

What is a RAG pipeline?

A RAG pipeline has two phases. Offline indexing splits documents into chunks, embeds each into a vector, and stores them in a vector index. Per request, it embeds the question, retrieves the nearest chunks by similarity (often re-ranked or combined with keyword search), and passes them to the model as grounding context so it answers from supplied text rather than parametric memory. Agentic RAG builds a decision layer on top of this baseline.

When should you use agentic RAG instead of classic RAG?

Use classic RAG when questions are direct lookups over a single corpus — it is cheaper, faster, and predictable. Reach for agentic RAG when questions are multi-step, span multiple sources, or when retrieval quality is shaky enough that the agent needs to grade results and retry. The agentic loop buys recall and robustness with extra model turns; if the baseline already answers your questions, that cost is overhead you do not need.

What vector database should you use for RAG?

Choose on data gravity and operations, not benchmarks. pgvector is the strong default if you already run Postgres. Pinecone is fully managed and serverless. Weaviate and Qdrant offer first-class hybrid search. Cloudflare Vectorize is edge-native and pairs with Cloudflare AI Search. Managed RAG services like AWS Bedrock Knowledge Bases handle indexing and retrieval for you. The vector layer is the easiest piece to swap later, so keep the choice reversible.

How do you evaluate agentic RAG?

Evaluate three things. Retrieval quality with recall and precision over labelled question-to-chunk pairs. Groundedness — whether the answer is actually supported by the retrieved chunks, which citations make checkable. And the agent's decisions — whether it retrieved, queried, and stopped well — using a held-out set of scenarios with an LLM-as-judge. Because runs are non-deterministic, you pin quality with an eval harness rather than manual spot-checks.

Sources & provenance
  • RAG pipeline and agentic-retrieval framing synthesized from AI Architect Academy's curriculum (Track B, agentic systems and retrieval) and the platform's own build (docs/DESIGN.md, src/lib/rag.ts).
  • Agentic-RAG capabilities cross-checked against framework docs: LangGraph's agentic RAG tutorial (retriever-as-tool, document grading, query rewrite) and LlamaIndex's router / sub-question query engines.
  • Named patterns referenced conceptually: Self-RAG (Asai et al.) and Corrective RAG / CRAG (arXiv 2401.15884) — read the papers for the exact mechanisms; this page summarises the shape, not benchmarks.
  • Managed retrieval and vector primitives reflect current AWS (Bedrock Knowledge Bases — Retrieve / RetrieveAndGenerate) and Cloudflare (AI Search, Vectorize, Workers AI embeddings) products — verify exact API shapes against each vendor's live docs before building.

Vendor products and API shapes change; treat the mapping as a design template, not a guaranteed signature. Corrections: hello@aiarch.dev.
