
Top-p Sampling

Learn what Top-p Sampling (Nucleus Sampling) means in AI and machine learning, with examples and related concepts.

Definition

Top-p sampling (also called nucleus sampling) is a text generation strategy where the model only considers the smallest set of tokens whose cumulative probability adds up to p. Everything outside that set is ignored.

For example, with top_p=0.9, the model looks at the most likely tokens until their combined probability reaches 90%, then samples only from those. The remaining 10% of the probability mass — spread across thousands of unlikely tokens — is discarded.

This is smarter than picking a fixed number of top tokens (top-k sampling) because the number of “reasonable” next tokens varies by context. After “The capital of France is”, only 1-2 tokens make sense. After “My favorite color is”, many tokens are plausible.

How It Works

Model output probabilities for next token:
  "Paris"     → 70%  ─┐
  "the"       → 12%   │ cumulative: 82%
  "Lyon"      →  5%   │ cumulative: 87%
  "a"         →  4%   │ cumulative: 91% ← exceeds p=0.9, stop here
  "located"   →  2%  ─┘ (excluded)
  "Berlin"    →  1%    (excluded)
  ... thousands more   (excluded)

With top_p=0.9:
  → Sample from {"Paris": 70%, "the": 12%, "Lyon": 5%, "a": 4%}
  → Probabilities are renormalized to sum to 100%

With top_p=0.5:
  → Only "Paris" qualifies (70% > 50%)
  → Always picks "Paris" — behaves like greedy decoding here

The dynamic cutoff is what makes top-p powerful. When the model is confident, the nucleus is small (1-2 tokens). When the model is uncertain, the nucleus expands to include more options.
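The walkthrough above can be reproduced in a few lines of plain Python. This is a minimal sketch of the filtering step only (not any particular library's implementation), using the probabilities from the example:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches p, then renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = {}, 0.0
    for token, prob in ranked:
        nucleus[token] = prob
        cumulative += prob
        if cumulative >= p:   # the token that crosses the threshold is included
            break
    total = sum(nucleus.values())
    return {tok: pr / total for tok, pr in nucleus.items()}

probs = {"Paris": 0.70, "the": 0.12, "Lyon": 0.05,
         "a": 0.04, "located": 0.02, "Berlin": 0.01}

print(top_p_filter(probs, 0.9))  # nucleus: Paris, the, Lyon, a
print(top_p_filter(probs, 0.5))  # nucleus: Paris only
```

With p=0.9 the cumulative sum crosses 90% at "a", so four tokens survive; with p=0.5 the very first token already crosses the threshold, reproducing the greedy-like behavior described above.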

Why It Matters

Top-p is one of the main dials for controlling output randomness. Set it too high and the model wanders into implausible tokens; set it too low and output becomes repetitive and deterministic. The right value depends on the task, as the settings below suggest.

Common Settings

Use Case           top_p        Effect
Code generation    0.1 - 0.3    Very focused, predictable
Factual Q&A        0.5 - 0.7    Accurate but natural
General chat       0.8 - 0.95   Good balance of quality and variety
Creative writing   0.95 - 1.0   Maximum diversity

Example

from anthropic import Anthropic

client = Anthropic()

prompt = "Suggest three names for an AI startup."

# Focused — top_p=0.3 (only the most likely tokens)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    top_p=0.3,
    messages=[{"role": "user", "content": prompt}]
)
print("Focused (top_p=0.3):", response.content[0].text)

# Diverse — top_p=0.95 (wide range of tokens)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    top_p=0.95,
    messages=[{"role": "user", "content": prompt}]
)
print("Diverse (top_p=0.95):", response.content[0].text)

Top-p vs Top-k vs Temperature

Parameter      What It Does                                              Adaptive?
Temperature    Reshapes the entire probability distribution              No — affects all tokens equally
Top-p          Removes tokens below a cumulative probability threshold   Yes — nucleus size varies by context
Top-k          Keeps only the k most likely tokens                       No — always keeps exactly k tokens
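The adaptivity difference is easy to see by counting how many tokens survive the cutoff on a confident versus an uncertain distribution. A small self-contained sketch (the two distributions are made up for illustration):

```python
def nucleus_size(probs, p):
    """Number of tokens top-p keeps: the smallest prefix of the
    sorted distribution whose cumulative probability reaches p."""
    cumulative, size = 0.0, 0
    for prob in sorted(probs, reverse=True):
        cumulative += prob
        size += 1
        if cumulative >= p:
            break
    return size

confident = [0.92, 0.04, 0.02, 0.01, 0.01]                      # model is sure
uncertain = [0.18, 0.16, 0.15, 0.14, 0.13, 0.12, 0.07, 0.05]    # many plausible options

print(nucleus_size(confident, 0.9))  # 1
print(nucleus_size(uncertain, 0.9))  # 7
# Top-k with a fixed k=5 would keep 5 tokens in both cases,
# too many for the first distribution and too few for the second.
```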

Best practice: adjust temperature first, then use top-p for fine-tuning. Avoid changing both aggressively at the same time.
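In a typical decoding loop the two parameters act in a fixed order: temperature reshapes the logits first, then top-p truncates the resulting distribution. A minimal sketch of that pipeline in plain Python, with made-up logits (real implementations work on full vocabularies with tensors, but the order of operations is the same):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=0.9):
    """Temperature-scaled softmax followed by top-p (nucleus) truncation."""
    # 1. Temperature reshapes the whole distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_l = max(scaled.values())
    exps = {tok: math.exp(l - max_l) for tok, l in scaled.items()}  # stable softmax
    z = sum(exps.values())
    probs = {tok: e / z for tok, e in exps.items()}

    # 2. Top-p keeps the smallest high-probability set reaching the threshold.
    nucleus, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # 3. Sample from the renormalized nucleus.
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights, k=1)[0]

logits = {"Paris": 5.0, "the": 3.2, "Lyon": 2.3, "Berlin": 0.5}
print(sample_next_token(logits, temperature=0.7, top_p=0.9))
```

Lowering the temperature here sharpens the distribution before the cutoff is applied, which in turn shrinks the nucleus: the two settings interact, which is why changing both aggressively at once makes behavior hard to predict.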

Key Takeaways

- Top-p samples only from the smallest set of tokens whose cumulative probability reaches p, renormalized to sum to 1.
- The cutoff is dynamic: the nucleus shrinks when the model is confident and grows when it is uncertain, unlike top-k's fixed cutoff.
- Use low values (0.1 - 0.3) for focused tasks like code generation and high values (0.9+) for creative work.
- Tune temperature first, then top-p; avoid large changes to both at once.


Part of the DeepRaft Glossary — AI and ML terms explained for developers.