AI & Cloud Infrastructure

Agentic RAG Patterns That Beat Classic Retrieval

By Technspire Team · March 3, 2026

Classic retrieval-augmented generation solves the most common question shape: "here is my question, fetch relevant docs, answer using them." It hits a ceiling on questions that require multiple lookups, query refinement, or reasoning about whether the retrieval worked. Agentic RAG, which treats retrieval as a tool inside an agent loop, routinely outperforms classic RAG on exactly those questions.

What Classic RAG Does and Where It Hits a Ceiling

Classic RAG is linear: user query → embedding → top-k similarity retrieval → context window → generation. It is simple, fast, and sufficient for question shapes that match a single lookup. It fails predictably on:

  • Multi-hop questions. "Which customers bought X and also churned in Q3?" needs two retrievals with an intermediate reasoning step.
  • Queries where the user's wording diverges from the corpus wording. Vector similarity is only as good as the embedding's training; domain-specific vocabulary often confuses it.
  • Ambiguous questions. "What did we ship last sprint?" has no single good retrieval.
  • Questions that need date or entity filtering. Pure similarity retrieval does not respect structured constraints.

Pattern 1. Retrieval as a Tool

Instead of pre-fetching, let the model decide when and what to retrieve. Expose search as a tool the agent can call with its own synthesised query. The model learns to reshape the question into a search query, examine results, and retrieve again if needed.

const tools = [{
  name: 'search_docs',
  description: 'Search the internal documentation. Returns the top 5 chunks with source.',
  input_schema: {
    type: 'object',
    properties: {
      query: { type: 'string' },
      since: { type: 'string', format: 'date', description: 'Optional date filter' },
      tags: { type: 'array', items: { type: 'string' } },
    },
    required: ['query'],
  },
}];

// The agent now chooses when to search and with what query.
// It may search, see the results are weak, and search again with a refined query.
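The loop around that tool can be sketched as follows. This is a minimal illustration, not a real SDK: `callModel` and `searchDocs` are hypothetical injected functions standing in for the LLM client and the search backend.

```typescript
type ToolCall = { name: 'search_docs'; input: { query: string; since?: string; tags?: string[] } };
type ModelTurn =
  | { type: 'tool_call'; call: ToolCall }
  | { type: 'answer'; text: string };

interface AgentDeps {
  callModel: (transcript: string[]) => Promise<ModelTurn>; // hypothetical LLM call
  searchDocs: (input: ToolCall['input']) => Promise<string[]>; // hypothetical search backend
}

async function agenticSearch(question: string, deps: AgentDeps, maxTurns = 5): Promise<string> {
  const transcript = [`user: ${question}`];
  for (let turn = 0; turn < maxTurns; turn++) {
    const next = await deps.callModel(transcript);
    if (next.type === 'answer') return next.text;
    // The model chose to search: run the tool, append results, loop again.
    const chunks = await deps.searchDocs(next.call.input);
    transcript.push(`tool(search_docs): ${chunks.join('\n---\n')}`);
  }
  return 'Could not answer within the turn budget.';
}
```

The turn budget matters in practice: without it, a model that keeps refining weak queries can loop indefinitely.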

Pattern 2. Query Decomposition

For multi-hop questions, have the agent decompose the question into sub-questions, retrieve for each, and compose the final answer. The decomposition can be explicit (first call a decompose tool) or emergent (the agent reasons about sub-questions inline and makes multiple retrieval calls).
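The explicit variant can be sketched as below. `decompose`, `retrieve`, and `compose` are hypothetical LLM and search calls injected as parameters; the names are ours, not a library API.

```typescript
interface DecompDeps {
  decompose: (question: string) => Promise<string[]>; // LLM: question → sub-questions
  retrieve: (query: string) => Promise<string[]>;     // search: sub-question → chunks
  compose: (question: string, evidence: Map<string, string[]>) => Promise<string>; // LLM: final answer
}

async function multiHopAnswer(question: string, deps: DecompDeps): Promise<string> {
  const subQuestions = await deps.decompose(question);
  const evidence = new Map<string, string[]>();
  for (const sub of subQuestions) {
    evidence.set(sub, await deps.retrieve(sub)); // one retrieval per hop
  }
  // Compose over all gathered evidence, keyed by sub-question.
  return deps.compose(question, evidence);
}
```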

Pattern 3. HyDE (Hypothetical Document Embeddings)

When the user's query language differs sharply from the corpus language, generate a hypothetical answer first and embed that for retrieval. The hypothetical is usually wrong in detail, but it shares vocabulary with real answers, which makes embedding-based search much sharper.
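A minimal HyDE sketch, with `generate`, `embed`, and `searchByVector` as hypothetical stand-ins for an LLM call, an embedding model, and a vector index:

```typescript
async function hydeRetrieve(
  query: string,
  generate: (prompt: string) => Promise<string>,        // LLM: write a plausible answer
  embed: (text: string) => Promise<number[]>,           // embedding model
  searchByVector: (v: number[], k: number) => Promise<string[]>, // vector index lookup
  k = 5,
): Promise<string[]> {
  // The hypothetical answer shares vocabulary with real corpus documents,
  // even where its facts are wrong — that vocabulary is what we embed.
  const hypothetical = await generate(`Write a short plausible answer to: ${query}`);
  const vector = await embed(hypothetical);
  return searchByVector(vector, k);
}
```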

Pattern 4. In-Loop Reranking

Agentic retrieval can run a reranker inside the loop. Fetch twenty candidates, let a cross-encoder score them, feed the top five to the model. Classic RAG can do this too, but the agent-in-the-loop version lets the model request a fresh rerank with different criteria if the initial top-five do not answer the question.
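The fetch-then-rerank step can be sketched like this, with `crossEncoderScore` as a hypothetical stand-in for a real reranker call:

```typescript
interface Candidate { text: string; source: string }

async function rerankTopK(
  query: string,
  candidates: Candidate[],
  crossEncoderScore: (query: string, text: string) => Promise<number>,
  k = 5,
): Promise<Candidate[]> {
  // Score every candidate against the query, in parallel.
  const scored = await Promise.all(
    candidates.map(async c => ({ c, score: await crossEncoderScore(query, c.text) })),
  );
  // Highest-scoring first; keep only what fits the context budget.
  return scored.sort((a, b) => b.score - a.score).slice(0, k).map(s => s.c);
}
```

In the agentic version, the model can call this again with different criteria (say, scoring for recency instead of topical match) when the first top-k fails to answer the question.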

Pattern 5. Self-Correction

After generating an answer, let the agent check the answer against the retrieved context (a short LLM call: "is this answer fully supported by these sources?"). If not, it can retrieve more or flag uncertainty. This turns "confident wrong answers" into "honest partial answers", a significant UX improvement for many B2B search interfaces.

// Self-correction check: called by the agent after drafting an answer
async function supportsAnswer(answer: string, sources: Chunk[]): Promise<'yes' | 'partial' | 'no'> {
  const res = await llm.run({
    system: 'Reply with only: yes, partial, or no.',
    messages: [{
      role: 'user',
      content: `Does the following answer stay within the provided sources?\n\nAnswer:\n${answer}\n\nSources:\n${sources.map(c => c.text).join('\n---\n')}`,
    }],
  });
  return res.text.trim().toLowerCase() as 'yes' | 'partial' | 'no';
}

When Classic RAG Still Wins

  • Latency-sensitive product features. Agent loops add latency; classic RAG returns in a single round-trip.
  • Cost-constrained workloads. Agents make multiple LLM calls per query; classic RAG makes one.
  • Simple FAQ-style questions. The upside of the agentic approach is small when the question is linear.

The Production Pattern

In practice, the strongest production systems route queries: simple ones go through classic RAG, complex ones enter the agent loop. A tiny classifier model at the front decides the route. The result is low latency for the 80% of easy queries and high accuracy for the 20% that need real search intelligence, without paying the agent cost on every query.
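A routing sketch, with the classifier and both pipelines injected as hypothetical functions (in production the classifier might be a small fine-tuned model; here it is just a parameter):

```typescript
type Route = 'classic' | 'agentic';

async function answerQuery(
  query: string,
  classify: (q: string) => Promise<Route>,      // cheap front-line classifier
  classicRag: (q: string) => Promise<string>,   // single-round-trip pipeline
  agentLoop: (q: string) => Promise<string>,    // multi-turn agent loop
): Promise<string> {
  const route = await classify(query);
  return route === 'classic' ? classicRag(query) : agentLoop(query);
}
```

A useful property of this shape: the classifier is swappable, so you can start with a keyword heuristic and replace it with a learned model once you have routing data.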