RAG
Learn what RAG (Retrieval-Augmented Generation) means in AI and machine learning, with examples and related concepts.
Definition
RAG stands for Retrieval-Augmented Generation — a technique that improves LLM responses by first retrieving relevant documents from an external knowledge base, then feeding them to the model as context.
Think of it this way: instead of relying solely on what the model memorized during training, RAG lets the model “look things up” before answering. This is exactly how Perplexity AI works — it searches the web, retrieves relevant pages, and uses them to generate a cited answer.
RAG solves two critical LLM problems: hallucination (making things up) and knowledge cutoff (not knowing about recent events).
How It Works
A RAG pipeline has two phases:
User Query: "What are Claude's latest pricing changes?"
Phase 1 — RETRIEVE
├─ Convert query to vector embedding
├─ Search vector database for similar documents
└─ Return top-k relevant chunks
Phase 2 — GENERATE
├─ Combine: system prompt + retrieved chunks + user query
├─ Send to LLM
└─ Model generates answer grounded in retrieved context
The retrieval step typically uses embeddings and a vector database to find semantically similar content — not just keyword matches.
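To see why semantic search beats keyword matching, here is a toy sketch of the similarity ranking a vector database performs internally. The 3-dimensional "embeddings" below are made up for illustration (real embedding models produce vectors with hundreds or thousands of dimensions); the cosine-similarity formula itself is the standard one.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their magnitudes. 1.0 = same direction, 0.0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: documents about pricing land near each other
# in vector space even when they share no keywords with the query.
docs = {
    "Claude pricing per token":    [0.90, 0.10, 0.20],
    "How much does Claude cost":   [0.85, 0.15, 0.25],
    "Claude context window size":  [0.10, 0.90, 0.30],
}
query_vec = [0.88, 0.12, 0.20]  # pretend embedding of "What is the price?"

# Rank documents by similarity to the query, highest first
ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
print(ranked)
```

Notice that the context-window document ranks last even though it contains the keyword "Claude": what matters is proximity in embedding space, not word overlap.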
Why It Matters
- Accuracy — The model answers based on actual documents, not memorized (possibly outdated) knowledge
- Verifiability — You can trace every answer back to its source document
- Freshness — New documents can be added to the knowledge base without retraining the model
- Cost — Much cheaper than fine-tuning for most knowledge-grounding use cases
- Privacy — Company data stays in the retrieval layer, never enters model training
Example
```python
# Minimal RAG pipeline using ChromaDB + Claude
import chromadb
from anthropic import Anthropic

# 1. Set up vector store and add documents
chroma = chromadb.Client()
collection = chroma.create_collection("docs")
collection.add(
    documents=[
        "Claude Opus 4.6 is Anthropic's most capable model, released in 2026.",
        "Claude Sonnet 4.6 costs $3/M input tokens and $15/M output tokens.",
        "Claude supports up to 200K tokens of context in a single request.",
    ],
    ids=["doc1", "doc2", "doc3"],
)

# 2. Retrieve relevant documents
query = "How much does Claude Sonnet cost?"
results = collection.query(query_texts=[query], n_results=2)

# 3. Generate answer with retrieved context
client = Anthropic()
context = "\n".join(results["documents"][0])
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": f"Based on this context:\n{context}\n\nAnswer: {query}"
    }]
)
print(response.content[0].text)
# → "Claude Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens."
```
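The example above indexes three short documents whole, but real pipelines first split long documents into the "chunks" the retrieval phase returns. A minimal fixed-size chunker with overlap might look like this (the sizes are illustrative; production systems often split on sentence or section boundaries instead):

```python
def chunk_text(text, size=500, overlap=100):
    # Slide a fixed-size window across the text. Overlapping the windows
    # keeps a sentence that straddles a boundary intact in at least one chunk.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# Tiny demo on a short string so the windows are easy to see
print(chunk_text("abcdefghij", size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Chunk size is a real tuning knob: chunks that are too small lose context, while chunks that are too large dilute the embedding and waste the model's context window.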
RAG vs Fine-tuning
| | RAG | Fine-tuning |
|---|---|---|
| Cost | Low (no training) | High (GPU hours) |
| Speed to deploy | Hours | Days to weeks |
| Knowledge updates | Add new docs instantly | Retrain required |
| Best for | Factual Q&A, search | Style/behavior changes |
| Hallucination risk | Low (grounded) | Medium |
Key Takeaways
- RAG = retrieve first, then generate — the model answers based on actual documents
- It’s the standard approach for building Q&A systems over private data
- Perplexity AI is essentially a sophisticated RAG system over the entire web
- Quality depends heavily on the retrieval step — bad retrieval = bad answers
- For most use cases, RAG is cheaper and more effective than fine-tuning
Part of the DeepRaft Glossary — AI and ML terms explained for developers.