Sharad Raj

AI Research & Engineering

Sharad Raj profile photo

Agentic RAG Workflows: Building Tool-Using Agents

March 15, 2026 3 min read Sharad Raj

Retrieval-augmented generation (RAG) has become a cornerstone of modern AI systems, enabling language models to access external knowledge bases and provide more accurate, contextual responses. However, static RAG systems have limitations when dealing with complex queries that require multiple steps or tool interactions.

The Problem with Static RAG

Traditional RAG systems follow a simple pattern: retrieve relevant documents, augment the prompt, and generate a response. This approach works well for straightforward queries but struggles with:

  • Multi-step reasoning tasks
  • Queries requiring tool interactions
  • Dynamic document selection based on intermediate results
  • Complex information synthesis across multiple sources

A typical static RAG flow is easy to sketch as a linear pipeline:

Diagram: Retrieve, Augment, Generate
Static RAG: retrieve relevant chunks, add them to the prompt, then generate an answer.

Shorter positional form (same order: src, alt, caption):

Retrieve → augment → generate
Optional caption via three quoted args.

You can also use plain markdown — paths are relative to the site static/ folder (file static/images/photo.jpg/images/photo.jpg):

Same pipeline as a simple markdown image

Introducing Agentic RAG

A simple retrieval score is often written inline as \( s(q, d) = \langle \mathbf{e}_q, \mathbf{e}_d \rangle \), while a block layout can show a softmax over passages:

\[ P(d \mid q) = \frac{\exp\bigl(\alpha\, s(q, d)\bigr)}{\sum_{d' \in \mathcal{D}} \exp\bigl(\alpha\, s(q, d')\bigr)} \]

You can use $$ ... $$ as well:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V $$

Agentic RAG systems address these limitations by combining the power of language models with autonomous decision-making capabilities. Instead of a fixed retrieval-generation pipeline, agentic systems can:

  • Decide when and what to retrieve
  • Use multiple tools in sequence
  • Adapt their strategy based on intermediate results
  • Handle complex, multi-turn reasoning

Implementation Considerations

Building production-grade agentic RAG systems requires careful attention to:

  • Latency: Minimize round-trips and optimize retrieval operations
  • Cost: Manage API calls and token usage efficiently
  • Reliability: Implement fallback strategies and error handling
  • Observability: Track agent decisions and reasoning paths

Example: retrieval-then-answer loop

A minimal pattern is: embed the query, fetch chunks, pack them into the prompt, then call the model. In practice you add tools, retries, and tracing around this core.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
from typing import Sequence

def retrieve(query: str, top_k: int = 5) -> Sequence[str]:
    """Return text chunks from your vector store / search index."""
    ...

def build_prompt(query: str, chunks: Sequence[str]) -> str:
    context = "\n\n".join(chunks)
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

def answer(query: str, llm) -> str:
    chunks = retrieve(query)
    prompt = build_prompt(query, chunks)
    return llm.complete(prompt)

You can gate retrieval with a cheap classifier, cache embeddings for hot queries, or expose retrieve as a tool so an agent decides when to call it:

1
2
3
4
tools = [
  {"name": "search_docs", "description": "Semantic search over the knowledge base."},
  {"name": "calculator", "description": "Evaluate arithmetic expressions."},
]

For local iteration, a one-liner smoke test is often enough:

1
hugo server --bind 127.0.0.1 --port 1313

The future of AI systems lies in combining the reasoning capabilities of language models with the flexibility of agentic architectures and the accuracy of retrieval-augmented generation.

Tags