RAG

Learn what RAG (Retrieval-Augmented Generation) means in AI and machine learning, with examples and related concepts.

Definition

RAG stands for Retrieval-Augmented Generation — a technique that improves LLM responses by first retrieving relevant documents from an external knowledge base, then feeding them to the model as context.

Think of it this way: instead of relying solely on what the model memorized during training, RAG lets the model “look things up” before answering. This is exactly how Perplexity AI works — it searches the web, retrieves relevant pages, and uses them to generate a cited answer.

RAG solves two critical LLM problems: hallucination (making things up) and knowledge cutoff (not knowing about recent events).

How It Works

A RAG pipeline has two phases:

User Query: "What are Claude's latest pricing changes?"

Phase 1 — RETRIEVE
  ├─ Convert query to vector embedding
  ├─ Search vector database for similar documents
  └─ Return top-k relevant chunks

Phase 2 — GENERATE
  ├─ Combine: system prompt + retrieved chunks + user query
  ├─ Send to LLM
  └─ Model generates answer grounded in retrieved context

The retrieval step typically uses embeddings and a vector database to find semantically similar content — not just keyword matches.
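That semantic-similarity step can be sketched with cosine similarity over toy embeddings. Everything here is illustrative: the 3-dimensional vectors and document names are made up (real embedding models produce hundreds or thousands of dimensions), but the ranking logic is the same one a vector database applies at scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy document embeddings (in practice, produced by an embedding model)
docs = {
    "pricing": [0.9, 0.1, 0.2],
    "context": [0.1, 0.8, 0.3],
    "release": [0.2, 0.3, 0.9],
}

# Hypothetical embedding of the query "How much does it cost?"
query_embedding = [0.85, 0.15, 0.25]

# Rank documents by similarity to the query and keep the top-k
ranked = sorted(
    docs,
    key=lambda d: cosine_similarity(query_embedding, docs[d]),
    reverse=True,
)
print(ranked[:2])  # → ['pricing', 'release']
```

Note that the query shares no keywords with the "pricing" document; it ranks first purely because its embedding points in a similar direction, which is exactly why RAG retrieval beats keyword matching.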

Why It Matters

- Grounded answers: responses are based on retrieved documents, which reduces hallucination.
- Fresh knowledge: the model can answer questions about events after its training cutoff.
- Cheap updates: adding a document to the knowledge base updates what the system knows, with no retraining required.

Example

# Minimal RAG pipeline using ChromaDB + Claude
import chromadb
from anthropic import Anthropic

# 1. Set up vector store and add documents
chroma = chromadb.Client()
collection = chroma.create_collection("docs")

collection.add(
    documents=[
        "Claude Opus 4.6 is Anthropic's most capable model, released in 2026.",
        "Claude Sonnet 4.6 costs $3/M input tokens and $15/M output tokens.",
        "Claude supports up to 200K tokens of context in a single request.",
    ],
    ids=["doc1", "doc2", "doc3"]
)

# 2. Retrieve relevant documents
query = "How much does Claude Sonnet cost?"
results = collection.query(query_texts=[query], n_results=2)

# 3. Generate answer with retrieved context
client = Anthropic()
context = "\n".join(results["documents"][0])

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": f"Based on this context:\n{context}\n\nAnswer: {query}"
    }]
)

print(response.content[0].text)
# → "Claude Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens."

RAG vs Fine-tuning

|                    | RAG                    | Fine-tuning            |
|--------------------|------------------------|------------------------|
| Cost               | Low (no training)      | High (GPU hours)       |
| Speed to deploy    | Hours                  | Days to weeks          |
| Knowledge updates  | Add new docs instantly | Retrain required       |
| Best for           | Factual Q&A, search    | Style/behavior changes |
| Hallucination risk | Low (grounded)         | Medium                 |
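The "add new docs instantly" row is worth seeing in code. Below is a deliberately tiny sketch using keyword overlap as the relevance score (a real pipeline would use embeddings and a vector database, as in the example above); the store contents and function names are made up for illustration:

```python
def retrieve(store, query, k=1):
    """Return the k documents sharing the most words with the query."""
    q_words = set(query.lower().split())
    return sorted(
        store,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:k]

store = ["Claude supports up to 200K tokens of context in a single request."]
query = "What does Claude Sonnet cost per million tokens?"

# Only the context-window doc exists, so it is the best available match
print(retrieve(store, query))

# "Updating knowledge" is a single append: no GPU hours, no retraining
store.append("Claude Sonnet 4.6 costs $3/M input tokens and $15/M output tokens.")
print(retrieve(store, query))  # the new pricing doc now ranks first
```

With fine-tuning, incorporating that one new fact would mean assembling training data and running a training job; with RAG it is an insert into the document store.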

Key Takeaways

- RAG retrieves relevant documents first, then generates an answer with them as context.
- It reduces hallucination and works around the model's training knowledge cutoff.
- Retrieval uses embeddings and a vector database for semantic matching, not just keywords.
- Prefer RAG for factual Q&A and frequently changing knowledge; prefer fine-tuning for style and behavior changes.


Part of the DeepRaft Glossary — AI and ML terms explained for developers.