
Context Window

Learn what Context Window means in AI and machine learning, with examples and related concepts.

Definition

The context window is the maximum amount of text (measured in tokens) that an LLM can process in a single request, including both your input and the model's output.

Think of it as the model’s working memory. Everything the model needs to “know” for a conversation — the system prompt, chat history, uploaded documents, and its own response — must fit inside this window. Once you exceed it, the oldest content gets dropped or the request fails.
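A back-of-the-envelope budget check makes this concrete. The 200K limit and the ~4 characters-per-token heuristic below are illustrative assumptions, not exact tokenizer counts:

```python
# Rough context-budget check. CONTEXT_WINDOW and the chars-per-token
# heuristic are illustrative assumptions; real tokenizers vary.
CONTEXT_WINDOW = 200_000

def estimate_tokens(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_window(prompt: str, max_output_tokens: int) -> bool:
    """Input tokens plus the reserved output budget must fit the window."""
    return estimate_tokens(prompt) + max_output_tokens <= CONTEXT_WINDOW

print(fits_in_window("Review this file.", 4_096))  # small prompt: True
print(fits_in_window("x" * 1_000_000, 4_096))      # ~250K tokens: False
```

Note that the output budget counts against the same window: a request whose input nearly fills the window leaves almost no room for the response.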

Context windows have grown dramatically: the original GPT-3 had just 2K tokens (about 1,500 words). Today, Claude offers 200K tokens and Gemini 1.5 Pro supports up to 2M tokens. This growth makes it practical to analyze entire codebases, books, and long conversations in a single request.

How It Works

┌───────────────────────────────────────────────────┐
│            Context Window (200K tokens)           │
│                                                   │
│  ┌───────────────┐                                │
│  │ System Prompt │  ~500 tokens                   │
│  └───────────────┘                                │
│  ┌──────────────────────────────┐                 │
│  │ Conversation History         │  ~5,000 tokens  │
│  │ (previous messages)          │                 │
│  └──────────────────────────────┘                 │
│  ┌──────────────────────────────────────┐         │
│  │ Retrieved Documents / Files          │ ~50,000 │
│  │ (RAG context, uploaded PDFs, code)   │ tokens  │
│  └──────────────────────────────────────┘         │
│  ┌──────────────┐                                 │
│  │ User's Query │  ~200 tokens                    │
│  └──────────────┘                                 │
│  ┌──────────────────────────────┐                 │
│  │ Model's Response             │  ~4,000 tokens  │
│  │ (generated output)           │                 │
│  └──────────────────────────────┘                 │
│                                                   │
│  Remaining capacity: ~140,300 tokens              │
└───────────────────────────────────────────────────┘

The model processes all tokens in the context window simultaneously using attention mechanisms. This means it can reference any part of the context when generating a response — but longer contexts use more compute and increase latency.
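The quadratic cost behind that latency note can be made concrete. This is a deliberately simplified count (one layer, one head, ignoring optimizations such as sliding-window or kernel-fused attention):

```python
# Self-attention compares every token with every other token, so the
# number of pairwise query-key scores grows quadratically with context.
def attention_pairs(n_tokens: int) -> int:
    """Pairwise attention scores computed per layer: n^2 (simplified)."""
    return n_tokens * n_tokens

for n in (4_000, 128_000, 200_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} pairwise scores per layer")
```

Going from 4K to 200K tokens is a 50x longer context but a 2,500x increase in pairwise scores, which is why long-context requests cost more and respond slower.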

Why It Matters

Current Context Window Sizes (as of 2026)

| Model | Context Window | Approx. Words |
|---|---|---|
| Claude (Anthropic) | 200K tokens | ~150,000 |
| Gemini 1.5 Pro (Google) | 2M tokens | ~1,500,000 |
| GPT-4o (OpenAI) | 128K tokens | ~96,000 |
| Llama 3.1 (Meta) | 128K tokens | ~96,000 |
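The word estimates above use the common rule of thumb that one English token is roughly 0.75 words. A quick sketch of that conversion (a heuristic, not a tokenizer guarantee):

```python
def tokens_to_words(tokens: int) -> int:
    """Rough conversion: ~0.75 English words per token (heuristic only)."""
    return int(tokens * 0.75)

print(f"{tokens_to_words(200_000):,}")    # 150,000
print(f"{tokens_to_words(2_000_000):,}")  # 1,500,000
```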

Example

from anthropic import Anthropic

client = Anthropic()

# Read an entire file and analyze it within the context window
with open("large_codebase.py", "r") as f:
    code = f.read()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": f"Review this code for security vulnerabilities:\n\n{code}"
    }]
)

# Check how much of the context window we used
print(f"Input tokens used: {response.usage.input_tokens:,}")
print(f"Remaining capacity: {200_000 - response.usage.input_tokens:,} tokens")

# Managing context in a multi-turn conversation
conversation = []

def chat(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=conversation
    )

    assistant_msg = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_msg})

    # Warn if approaching context limit
    total_tokens = response.usage.input_tokens + response.usage.output_tokens
    if total_tokens > 150_000:
        print(f"Warning: {total_tokens:,} tokens used — consider summarizing history")

    return assistant_msg
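A common follow-up to that warning is a simple sliding-window trim that drops the oldest messages. This sketch assumes the same `conversation` list shape used above; `max_messages` is an arbitrary illustrative cutoff:

```python
def trim_history(conversation: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep only the most recent messages (a simple sliding window).

    After slicing, also drop any leading assistant messages so the
    retained history still starts with a "user" turn.
    """
    if len(conversation) <= max_messages:
        return conversation
    trimmed = conversation[-max_messages:]
    # Ensure the first retained message is from the user.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed.pop(0)
    return trimmed

# Example: 30 alternating user/assistant messages trimmed to the newest 20.
history = [{"role": "user" if i % 2 == 0 else "assistant",
            "content": f"message {i}"} for i in range(30)]
print(len(trim_history(history)))  # 20
```

Sliding windows lose old information outright; an alternative is to ask the model to summarize the dropped turns and keep the summary as a single message, trading tokens for fidelity.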

Key Takeaways

- The context window caps the total tokens in a single request: system prompt, conversation history, retrieved documents, your query, and the model's response all share it.
- Exceed the limit and the oldest content gets dropped or the request fails, so long conversations need trimming or summarization.
- Windows have grown from a few thousand tokens to 200K (Claude) and 2M (Gemini 1.5 Pro), enabling whole-codebase and long-document workflows.
- Longer contexts increase compute cost and latency, so include only what the model actually needs.
- Monitor usage via the API's reported token counts and leave headroom for the model's output.

Part of the DeepRaft Glossary — AI and ML terms explained for developers.