
Context Window

Learn what Context Window means in AI and machine learning, with examples and related concepts.

Definition

The context window is the maximum amount of text (measured in tokens) that an LLM can process in a single request, including both your input and the model's output.

Think of it as the model’s working memory. Everything the model needs to “know” for a conversation — the system prompt, chat history, uploaded documents, and its own response — must fit inside this window. Once you exceed it, the oldest content gets dropped or the request fails.
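A back-of-the-envelope budget check makes this concrete. The 200K limit and the ~4 characters-per-token heuristic below are illustrative assumptions, not exact tokenizer counts:

```python
# Rough context-budget check. CONTEXT_WINDOW and the chars-per-token
# heuristic are illustrative assumptions; real tokenizers vary.
CONTEXT_WINDOW = 200_000

def estimate_tokens(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_window(prompt: str, max_output_tokens: int) -> bool:
    """Input tokens plus the reserved output budget must fit the window."""
    return estimate_tokens(prompt) + max_output_tokens <= CONTEXT_WINDOW

print(fits_in_window("Review this file.", 4_096))  # small prompt: True
print(fits_in_window("x" * 1_000_000, 4_096))      # ~250K tokens: False
```

Note that the output budget counts against the same window: a request whose input nearly fills the window leaves almost no room for the response.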

Context windows have grown dramatically: the original GPT-3 had just 2K tokens (about 1,500 words). Today, Claude offers 200K tokens and Gemini 1.5 Pro supports up to 2M tokens. This growth makes it practical to analyze entire codebases, books, and long conversations in a single request.

How It Works

┌───────────────────────────────────────────────────┐
│            Context Window (200K tokens)           │
│                                                   │
│  ┌───────────────┐                                │
│  │ System Prompt │  ~500 tokens                   │
│  └───────────────┘                                │
│  ┌──────────────────────────────┐                 │
│  │ Conversation History         │  ~5,000 tokens  │
│  │ (previous messages)          │                 │
│  └──────────────────────────────┘                 │
│  ┌──────────────────────────────────────┐         │
│  │ Retrieved Documents / Files          │ ~50,000 │
│  │ (RAG context, uploaded PDFs, code)   │ tokens  │
│  └──────────────────────────────────────┘         │
│  ┌──────────────┐                                 │
│  │ User's Query │  ~200 tokens                    │
│  └──────────────┘                                 │
│  ┌──────────────────────────────┐                 │
│  │ Model's Response             │  ~4,000 tokens  │
│  │ (generated output)           │                 │
│  └──────────────────────────────┘                 │
│                                                   │
│  Remaining capacity: ~140,300 tokens              │
└───────────────────────────────────────────────────┘

The model processes all tokens in the context window simultaneously using attention mechanisms. This means it can reference any part of the context when generating a response — but longer contexts use more compute and increase latency.
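The quadratic cost behind that latency note can be made concrete. This is a deliberately simplified count (one layer, one head, ignoring optimizations such as sliding-window or kernel-fused attention):

```python
# Self-attention compares every token with every other token, so the
# number of pairwise query-key scores grows quadratically with context.
def attention_pairs(n_tokens: int) -> int:
    """Pairwise attention scores computed per layer: n^2 (simplified)."""
    return n_tokens * n_tokens

for n in (4_000, 128_000, 200_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} pairwise scores per layer")
```

Going from 4K to 200K tokens is a 50x longer context but a 2,500x increase in pairwise scores, which is why long-context requests cost more and respond slower.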

Why It Matters

Current Context Window Sizes (as of 2026)

| Model | Context Window | Approx. Words |
|---|---|---|
| Claude (Anthropic) | 200K tokens | ~150,000 |
| Gemini 1.5 Pro (Google) | 2M tokens | ~1,500,000 |
| GPT-4o (OpenAI) | 128K tokens | ~96,000 |
| Llama 3.1 (Meta) | 128K tokens | ~96,000 |
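The word estimates above use the common rule of thumb that one English token is roughly 0.75 words. A quick sketch of that conversion (a heuristic, not a tokenizer guarantee):

```python
def tokens_to_words(tokens: int) -> int:
    """Rough conversion: ~0.75 English words per token (heuristic only)."""
    return int(tokens * 0.75)

print(f"{tokens_to_words(200_000):,}")    # 150,000
print(f"{tokens_to_words(2_000_000):,}")  # 1,500,000
```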

Example

from anthropic import Anthropic

client = Anthropic()

# Read an entire file and analyze it within the context window
with open("large_codebase.py", "r") as f:
    code = f.read()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": f"Review this code for security vulnerabilities:\n\n{code}"
    }]
)

# Check how much of the context window we used
print(f"Input tokens used: {response.usage.input_tokens:,}")
print(f"Remaining capacity: {200_000 - response.usage.input_tokens:,} tokens")

# Managing context in a multi-turn conversation
conversation = []

def chat(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=conversation
    )

    assistant_msg = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_msg})

    # Warn if approaching context limit
    total_tokens = response.usage.input_tokens + response.usage.output_tokens
    if total_tokens > 150_000:
        print(f"Warning: {total_tokens:,} tokens used — consider summarizing history")

    return assistant_msg
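A common follow-up to that warning is a simple sliding-window trim that drops the oldest messages. This sketch assumes the same `conversation` list shape used above; `max_messages` is an arbitrary illustrative cutoff:

```python
def trim_history(conversation: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep only the most recent messages (a simple sliding window).

    After slicing, also drop any leading assistant messages so the
    retained history still starts with a "user" turn.
    """
    if len(conversation) <= max_messages:
        return conversation
    trimmed = conversation[-max_messages:]
    # Ensure the first retained message is from the user.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed.pop(0)
    return trimmed

# Example: 30 alternating user/assistant messages trimmed to the newest 20.
history = [{"role": "user" if i % 2 == 0 else "assistant",
            "content": f"message {i}"} for i in range(30)]
print(len(trim_history(history)))  # 20
```

Sliding windows lose old information outright; an alternative is to ask the model to summarize the dropped turns and keep the summary as a single message, trading tokens for fidelity.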

Key Takeaways

- The context window caps the total tokens in a single request: system prompt, conversation history, retrieved documents, your query, and the model's response all share it.
- Exceed the limit and the oldest content gets dropped or the request fails, so long conversations need trimming or summarization.
- Windows have grown from a few thousand tokens to 200K (Claude) and 2M (Gemini 1.5 Pro), enabling whole-codebase and long-document workflows.
- Longer contexts increase compute cost and latency, so include only what the model actually needs.
- Monitor usage via the API's reported token counts and leave headroom for the model's output.

Part of the DeepRaft Glossary — AI and ML terms explained for developers.