Top-p Sampling
Learn what Top-p Sampling (Nucleus Sampling) means in AI and machine learning, with examples and related concepts.
Definition
Top-p sampling (also called nucleus sampling) is a text generation strategy where the model considers only the smallest set of tokens whose cumulative probability reaches at least p. Everything outside that set is ignored.
For example, with top_p=0.9, the model looks at the most likely tokens until their combined probability reaches 90%, then samples only from those. The remaining 10% of the probability mass — spread across thousands of unlikely tokens — is discarded.
This is smarter than picking a fixed number of top tokens (top-k sampling) because the number of “reasonable” next tokens varies by context. After “The capital of France is”, only 1-2 tokens make sense. After “My favorite color is”, many tokens are plausible.
How It Works
Suppose the model outputs these probabilities for the next token:

```
"Paris"   → 70%  ─┐
"the"     → 12%   │  cumulative: 82%
"Lyon"    →  5%   │  cumulative: 87%
"a"       →  4%   │  cumulative: 91%  ← exceeds p=0.9, stop here
"located" →  2%  ─┘  (excluded)
"Berlin"  →  1%      (excluded)
... thousands more   (excluded)
```

With `top_p=0.9`:

- Sample from `{"Paris": 70%, "the": 12%, "Lyon": 5%, "a": 4%}`
- Probabilities are renormalized to sum to 100%

With `top_p=0.5`:

- Only "Paris" qualifies (70% > 50%)
- The model always picks "Paris", so it behaves like greedy decoding here
The dynamic cutoff is what makes top-p powerful. When the model is confident, the nucleus is small (1-2 tokens). When the model is uncertain, the nucleus expands to include more options.
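The selection-and-renormalization steps above can be sketched in plain Python. This is a minimal illustration over a toy token-to-probability dictionary, not a production decoder (which would operate on logit tensors); the tail of thousands of unlikely tokens is omitted since it would be discarded anyway:

```python
import random

def top_p_sample(probs, p=0.9):
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    # Rank tokens from most to least likely
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # include the token that crosses the threshold, then stop
            break
    # Renormalize the kept probabilities so they sum to 1
    total = sum(prob for _, prob in nucleus)
    tokens = [t for t, _ in nucleus]
    weights = [prob / total for _, prob in nucleus]
    return random.choices(tokens, weights=weights)[0]

# Toy distribution from the diagram above (tail tokens omitted)
probs = {"Paris": 0.70, "the": 0.12, "Lyon": 0.05, "a": 0.04,
         "located": 0.02, "Berlin": 0.01}

# With p=0.9 the nucleus is {"Paris", "the", "Lyon", "a"} (cumulative 91%)
print(top_p_sample(probs, p=0.9))
# With p=0.5 only "Paris" qualifies, so the result is always "Paris"
print(top_p_sample(probs, p=0.5))
```

Note that the crossing token ("a" here) is included, so the kept set is the smallest one whose cumulative probability reaches p.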
Why It Matters
- Better than top-k — Top-k always picks from a fixed number of tokens (say, 40), regardless of whether 2 or 200 tokens are reasonable. Top-p adapts.
- Controls diversity — Lower p = more focused output. Higher p = more diverse output.
- Pairs with temperature — Temperature reshapes the probability distribution, then top-p cuts off the tail. Together they give fine-grained control.
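To see how the two knobs interact, the sketch below applies temperature first (reshaping hypothetical logits into probabilities) and then measures how many tokens survive the top-p cutoff. The logit values are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities; lower temperature sharpens the peak."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """How many of the most likely tokens fall inside the top-p nucleus."""
    cumulative, count = 0.0, 0
    for prob in sorted(probs, reverse=True):
        cumulative += prob
        count += 1
        if cumulative >= p:
            break
    return count

logits = [5.0, 3.0, 2.0, 1.0, 0.5]  # hypothetical next-token scores
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: nucleus keeps {nucleus_size(probs, p=0.9)} token(s)")
# The nucleus grows with temperature: 1, 2, then 3 tokens at t=0.5, 1.0, 1.5
```

Raising the temperature flattens the distribution, so the same top-p threshold keeps more tokens. This is why changing both aggressively at once makes the output hard to predict.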
Common Settings
| Use Case | top_p | Effect |
|---|---|---|
| Code generation | 0.1 - 0.3 | Very focused, predictable |
| Factual Q&A | 0.5 - 0.7 | Accurate but natural |
| General chat | 0.8 - 0.95 | Good balance of quality and variety |
| Creative writing | 0.95 - 1.0 | Maximum diversity |
Example
```python
from anthropic import Anthropic

client = Anthropic()
prompt = "Suggest three names for an AI startup."

# Focused: top_p=0.3 (only the most likely tokens)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    top_p=0.3,
    messages=[{"role": "user", "content": prompt}],
)
print("Focused (top_p=0.3):", response.content[0].text)

# Diverse: top_p=0.95 (wide range of tokens)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    top_p=0.95,
    messages=[{"role": "user", "content": prompt}],
)
print("Diverse (top_p=0.95):", response.content[0].text)
```
Top-p vs Top-k vs Temperature
| Parameter | What It Does | Adaptive? |
|---|---|---|
| Temperature | Reshapes the entire probability distribution | No — affects all tokens equally |
| Top-p | Removes tokens below a cumulative probability threshold | Yes — nucleus size varies by context |
| Top-k | Keeps only the k most likely tokens | No — always picks exactly k tokens |
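The adaptivity difference in the table can be demonstrated on two toy distributions, one confident and one flat. The numbers are illustrative, not from a real model:

```python
def top_p_keep(probs, p):
    """Number of tokens top-p keeps: smallest prefix reaching cumulative probability p."""
    kept, cumulative = 0, 0.0
    for prob in sorted(probs, reverse=True):
        kept += 1
        cumulative += prob
        if cumulative >= p:
            break
    return kept

confident = [0.95, 0.02, 0.01, 0.01, 0.005, 0.005]       # e.g. after "The capital of France is"
uncertain = [0.22, 0.18, 0.15, 0.14, 0.12, 0.11, 0.08]   # e.g. after "My favorite color is"

k = 4  # top-k would keep exactly k tokens in both cases
print("confident:", top_p_keep(confident, 0.9), "token(s) vs fixed", k)  # top-p keeps 1
print("uncertain:", top_p_keep(uncertain, 0.9), "token(s) vs fixed", k)  # top-p keeps 6
```

Top-k with k=4 would keep three near-useless tokens in the confident case and cut off two plausible ones in the uncertain case; top-p sizes the candidate set to match the distribution.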
Best practice: adjust temperature first, then use top-p for fine-tuning. Avoid changing both aggressively at the same time.
Key Takeaways
- Top-p sampling keeps only the most likely tokens that sum to probability p, discarding the rest
- It adapts dynamically — fewer tokens when the model is confident, more when it’s uncertain
- Common default is `top_p=0.9` or `top_p=0.95` for general use
- Use lower top-p (0.1-0.5) for deterministic tasks, higher (0.9-1.0) for creative tasks
- Adjust temperature first for broad control, then fine-tune with top-p if needed
Part of the DeepRaft Glossary — AI and ML terms explained for developers.