Top-p Sampling
Learn what Top-p Sampling (Nucleus Sampling) means in AI and machine learning, with examples and related concepts.
Definition
Top-p sampling (also called nucleus sampling) is a text generation strategy where the model considers only the smallest set of tokens whose cumulative probability reaches at least p. Everything outside that set is ignored.
For example, with top_p=0.9, the model looks at the most likely tokens until their combined probability reaches 90%, then samples only from those. The remaining 10% of the probability mass — spread across thousands of unlikely tokens — is discarded.
This is smarter than picking a fixed number of top tokens (top-k sampling) because the number of “reasonable” next tokens varies by context. After “The capital of France is”, only 1-2 tokens make sense. After “My favorite color is”, many tokens are plausible.
How It Works
Suppose the model outputs these probabilities for the next token:

```
"Paris"   → 70%  ─┐
"the"     → 12%   │  cumulative: 82%
"Lyon"    →  5%   │  cumulative: 87%
"a"       →  4%   │  cumulative: 91%  ← exceeds p=0.9, stop here
"located" →  2%  ─┘  (excluded)
"Berlin"  →  1%      (excluded)
... thousands more   (excluded)
```

With `top_p=0.9`:

- Sample from `{"Paris": 70%, "the": 12%, "Lyon": 5%, "a": 4%}`
- Probabilities are renormalized to sum to 100%

With `top_p=0.5`:

- Only "Paris" qualifies (70% > 50%)
- The model always picks "Paris", so it behaves like greedy decoding here
The dynamic cutoff is what makes top-p powerful. When the model is confident, the nucleus is small (1-2 tokens). When the model is uncertain, the nucleus expands to include more options.
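The selection-and-renormalization steps above can be sketched in plain Python. This is a minimal illustration over a toy token-to-probability dictionary, not a production decoder (which would operate on logit tensors); the tail of thousands of unlikely tokens is omitted since it would be discarded anyway:

```python
import random

def top_p_sample(probs, p=0.9):
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    # Rank tokens from most to least likely
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # include the token that crosses the threshold, then stop
            break
    # Renormalize the kept probabilities so they sum to 1
    total = sum(prob for _, prob in nucleus)
    tokens = [t for t, _ in nucleus]
    weights = [prob / total for _, prob in nucleus]
    return random.choices(tokens, weights=weights)[0]

# Toy distribution from the diagram above (tail tokens omitted)
probs = {"Paris": 0.70, "the": 0.12, "Lyon": 0.05, "a": 0.04,
         "located": 0.02, "Berlin": 0.01}

# With p=0.9 the nucleus is {"Paris", "the", "Lyon", "a"} (cumulative 91%)
print(top_p_sample(probs, p=0.9))
# With p=0.5 only "Paris" qualifies, so the result is always "Paris"
print(top_p_sample(probs, p=0.5))
```

Note that the crossing token ("a" here) is included, so the kept set is the smallest one whose cumulative probability reaches p.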
Why It Matters
- Better than top-k — Top-k always picks from a fixed number of tokens (say, 40), regardless of whether 2 or 200 tokens are reasonable. Top-p adapts.
- Controls diversity — Lower p = more focused output. Higher p = more diverse output.
- Pairs with temperature — Temperature reshapes the probability distribution, then top-p cuts off the tail. Together they give fine-grained control.
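To see how the two knobs interact, the sketch below applies temperature first (reshaping hypothetical logits into probabilities) and then measures how many tokens survive the top-p cutoff. The logit values are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities; lower temperature sharpens the peak."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """How many of the most likely tokens fall inside the top-p nucleus."""
    cumulative, count = 0.0, 0
    for prob in sorted(probs, reverse=True):
        cumulative += prob
        count += 1
        if cumulative >= p:
            break
    return count

logits = [5.0, 3.0, 2.0, 1.0, 0.5]  # hypothetical next-token scores
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: nucleus keeps {nucleus_size(probs, p=0.9)} token(s)")
# The nucleus grows with temperature: 1, 2, then 3 tokens at t=0.5, 1.0, 1.5
```

Raising the temperature flattens the distribution, so the same top-p threshold keeps more tokens. This is why changing both aggressively at once makes the output hard to predict.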
Common Settings
| Use Case | top_p | Effect |
|---|---|---|
| Code generation | 0.1 - 0.3 | Very focused, predictable |
| Factual Q&A | 0.5 - 0.7 | Accurate but natural |
| General chat | 0.8 - 0.95 | Good balance of quality and variety |
| Creative writing | 0.95 - 1.0 | Maximum diversity |
Example
```python
from anthropic import Anthropic

client = Anthropic()
prompt = "Suggest three names for an AI startup."

# Focused: top_p=0.3 (only the most likely tokens)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    top_p=0.3,
    messages=[{"role": "user", "content": prompt}],
)
print("Focused (top_p=0.3):", response.content[0].text)

# Diverse: top_p=0.95 (wide range of tokens)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    top_p=0.95,
    messages=[{"role": "user", "content": prompt}],
)
print("Diverse (top_p=0.95):", response.content[0].text)
```
Top-p vs Top-k vs Temperature
| Parameter | What It Does | Adaptive? |
|---|---|---|
| Temperature | Reshapes the entire probability distribution | No — affects all tokens equally |
| Top-p | Removes tokens below a cumulative probability threshold | Yes — nucleus size varies by context |
| Top-k | Keeps only the k most likely tokens | No — always picks exactly k tokens |
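The adaptivity difference in the table can be demonstrated on two toy distributions, one confident and one flat. The numbers are illustrative, not from a real model:

```python
def top_p_keep(probs, p):
    """Number of tokens top-p keeps: smallest prefix reaching cumulative probability p."""
    kept, cumulative = 0, 0.0
    for prob in sorted(probs, reverse=True):
        kept += 1
        cumulative += prob
        if cumulative >= p:
            break
    return kept

confident = [0.95, 0.02, 0.01, 0.01, 0.005, 0.005]       # e.g. after "The capital of France is"
uncertain = [0.22, 0.18, 0.15, 0.14, 0.12, 0.11, 0.08]   # e.g. after "My favorite color is"

k = 4  # top-k would keep exactly k tokens in both cases
print("confident:", top_p_keep(confident, 0.9), "token(s) vs fixed", k)  # top-p keeps 1
print("uncertain:", top_p_keep(uncertain, 0.9), "token(s) vs fixed", k)  # top-p keeps 6
```

Top-k with k=4 would keep three near-useless tokens in the confident case and cut off two plausible ones in the uncertain case; top-p sizes the candidate set to match the distribution.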
Best practice: adjust temperature first, then use top-p for fine-tuning. Avoid changing both aggressively at the same time.
Key Takeaways
- Top-p sampling keeps only the most likely tokens that sum to probability p, discarding the rest
- It adapts dynamically — fewer tokens when the model is confident, more when it’s uncertain
- Common default is `top_p=0.9` or `top_p=0.95` for general use
- Use lower top-p (0.1-0.5) for deterministic tasks, higher (0.9-1.0) for creative tasks
- Adjust temperature first for broad control, then fine-tune with top-p if needed
Part of the DeepRaft Glossary — AI and ML terms explained for developers.