RAG

Learn what RAG (Retrieval-Augmented Generation) means in AI and machine learning, with examples and related concepts.

Definition

RAG stands for Retrieval-Augmented Generation — a technique that improves LLM responses by first retrieving relevant documents from an external knowledge base, then feeding them to the model as context.

Think of it this way: instead of relying solely on what the model memorized during training, RAG lets the model “look things up” before answering. This is exactly how Perplexity AI works — it searches the web, retrieves relevant pages, and uses them to generate a cited answer.

RAG solves two critical LLM problems: hallucination (making things up) and knowledge cutoff (not knowing about recent events).

How It Works

A RAG pipeline has two phases:

User Query: "What are Claude's latest pricing changes?"

Phase 1 — RETRIEVE
  ├─ Convert query to vector embedding
  ├─ Search vector database for similar documents
  └─ Return top-k relevant chunks

Phase 2 — GENERATE
  ├─ Combine: system prompt + retrieved chunks + user query
  ├─ Send to LLM
  └─ Model generates answer grounded in retrieved context

The retrieval step typically uses embeddings and a vector database to find semantically similar content — not just keyword matches.
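That semantic-similarity step can be sketched with cosine similarity over toy embeddings. Everything here is illustrative: the 3-dimensional vectors and document names are made up (real embedding models produce hundreds or thousands of dimensions), but the ranking logic is the same one a vector database applies at scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy document embeddings (in practice, produced by an embedding model)
docs = {
    "pricing": [0.9, 0.1, 0.2],
    "context": [0.1, 0.8, 0.3],
    "release": [0.2, 0.3, 0.9],
}

# Hypothetical embedding of the query "How much does it cost?"
query_embedding = [0.85, 0.15, 0.25]

# Rank documents by similarity to the query and keep the top-k
ranked = sorted(
    docs,
    key=lambda d: cosine_similarity(query_embedding, docs[d]),
    reverse=True,
)
print(ranked[:2])  # → ['pricing', 'release']
```

Note that the query shares no keywords with the "pricing" document; it ranks first purely because its embedding points in a similar direction, which is exactly why RAG retrieval beats keyword matching.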

Why It Matters

- Grounded answers: responses are based on retrieved documents, which reduces hallucination.
- Fresh knowledge: the model can answer questions about events after its training cutoff.
- Cheap updates: adding a document to the knowledge base updates what the system knows, with no retraining required.

Example

# Minimal RAG pipeline using ChromaDB + Claude
import chromadb
from anthropic import Anthropic

# 1. Set up vector store and add documents
chroma = chromadb.Client()
collection = chroma.create_collection("docs")

collection.add(
    documents=[
        "Claude Opus 4.6 is Anthropic's most capable model, released in 2026.",
        "Claude Sonnet 4.6 costs $3/M input tokens and $15/M output tokens.",
        "Claude supports up to 200K tokens of context in a single request.",
    ],
    ids=["doc1", "doc2", "doc3"]
)

# 2. Retrieve relevant documents
query = "How much does Claude Sonnet cost?"
results = collection.query(query_texts=[query], n_results=2)

# 3. Generate answer with retrieved context
client = Anthropic()
context = "\n".join(results["documents"][0])

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": f"Based on this context:\n{context}\n\nAnswer: {query}"
    }]
)

print(response.content[0].text)
# → "Claude Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens."

RAG vs Fine-tuning

|                    | RAG                    | Fine-tuning            |
|--------------------|------------------------|------------------------|
| Cost               | Low (no training)      | High (GPU hours)       |
| Speed to deploy    | Hours                  | Days to weeks          |
| Knowledge updates  | Add new docs instantly | Retrain required       |
| Best for           | Factual Q&A, search    | Style/behavior changes |
| Hallucination risk | Low (grounded)         | Medium                 |
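The "add new docs instantly" row is worth seeing in code. Below is a deliberately tiny sketch using keyword overlap as the relevance score (a real pipeline would use embeddings and a vector database, as in the example above); the store contents and function names are made up for illustration:

```python
def retrieve(store, query, k=1):
    """Return the k documents sharing the most words with the query."""
    q_words = set(query.lower().split())
    return sorted(
        store,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:k]

store = ["Claude supports up to 200K tokens of context in a single request."]
query = "What does Claude Sonnet cost per million tokens?"

# Only the context-window doc exists, so it is the best available match
print(retrieve(store, query))

# "Updating knowledge" is a single append: no GPU hours, no retraining
store.append("Claude Sonnet 4.6 costs $3/M input tokens and $15/M output tokens.")
print(retrieve(store, query))  # the new pricing doc now ranks first
```

With fine-tuning, incorporating that one new fact would mean assembling training data and running a training job; with RAG it is an insert into the document store.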

Key Takeaways

- RAG retrieves relevant documents first, then generates an answer with them as context.
- It reduces hallucination and works around the model's training knowledge cutoff.
- Retrieval uses embeddings and a vector database for semantic matching, not just keywords.
- Prefer RAG for factual Q&A and frequently changing knowledge; prefer fine-tuning for style and behavior changes.


Part of the DeepRaft Glossary — AI and ML terms explained for developers.