Tokenization
Learn what Tokenization means in AI and machine learning, with examples and related concepts.
Definition
Tokenization is the process of breaking text into smaller units called tokens that an LLM can process. Tokens aren’t always full words — they can be word fragments, individual characters, or even punctuation marks.
Every interaction with an AI model starts with tokenization. When you type “I love programming” into ChatGPT or Claude, the model doesn’t see words — it sees something like ["I", " love", " program", "ming"]. The model processes, understands, and generates text entirely through these tokens.
Tokenization matters because it directly affects what you pay (API pricing is per-token), how much text fits in a context window, and even how well the model handles different languages. English typically averages about 1-1.3 tokens per word, while languages underrepresented in tokenizer training data, like Japanese or Korean, may need 2-3x as many tokens for equivalent text.
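That per-word ratio gives a quick back-of-the-envelope estimate before you reach for a real tokenizer. The sketch below is a rough heuristic, not an exact count (the 1.3 multiplier is the approximate English average mentioned above); always use an actual tokenizer for billing-sensitive work.

```python
def rough_token_estimate(text: str) -> int:
    """Rough heuristic: English runs ~1.3 tokens per word.
    Use a real tokenizer (e.g. tiktoken) for exact counts."""
    return round(len(text.split()) * 1.3)

print(rough_token_estimate("I love programming"))  # → 4
```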
How It Works
Most modern LLMs use BPE (Byte-Pair Encoding) or a variant of it. The idea is to build a vocabulary of common character sequences:
Step 1: Start with individual characters
"programming" → ["p", "r", "o", "g", "r", "a", "m", "m", "i", "n", "g"]
Step 2: Merge the most frequent pairs repeatedly
"p" + "r" → "pr"
"o" + "g" → "og"
"pr" + "og" → "prog"
"r" + "a" → "ra"
"m" + "m" → "mm"
"i" + "ng" → "ing"
Step 3: Final tokens
"programming" → ["prog", "ra", "mm", "ing"]
Common words like “the” or “and” become single tokens. Rare words get split into pieces. This gives BPE an efficient balance — a vocabulary of 50K-200K tokens can represent any text.
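The merge mechanic in the steps above can be sketched in a few lines. This is a toy illustration, not a production tokenizer: real BPE training learns the merge order from pair frequencies across a large corpus, whereas here we apply the merge sequence from the example directly.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair — how BPE training picks each merge."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("programming")
# Apply the merge sequence from the steps above:
for pair in [("p", "r"), ("o", "g"), ("pr", "og"), ("r", "a"),
             ("m", "m"), ("n", "g"), ("i", "ng")]:
    tokens = merge_pair(tokens, pair)
print(tokens)  # → ['prog', 'ra', 'mm', 'ing']
```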
| Model | Tokenizer | Vocab Size |
|---|---|---|
| GPT-4o | o200k_base | ~200K |
| Claude | Custom BPE | ~100K |
| Llama 3 | SentencePiece BPE | 128K |
Why It Matters
- Cost — API pricing is per-token. Efficient prompts = lower bills. A 1,000-word English document is roughly 1,300 tokens.
- Context limits — Claude supports 200K tokens (~150K words). GPT-4o supports 128K. If your tokenizer is inefficient, you burn context on encoding overhead.
- Multilingual fairness — English-centric tokenizers make non-English text 2-3x more expensive per word. This is an active area of improvement.
- Code handling — Modern tokenizers treat code specially. Common patterns like `def`, `return`, or `import` often become single tokens.
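Since pricing is per-token, a small helper makes the cost math explicit. The prices below are hypothetical placeholders, not any provider's actual rates — check current pricing pages before budgeting.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Estimate API cost in dollars from token counts and per-million-token prices."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Hypothetical rates ($ per million tokens) — substitute your provider's real pricing.
cost = estimate_cost(1_300, 500, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"${cost:.4f}")  # ~1,000-word input document plus a 500-token reply
```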
Example
```python
# Count tokens using tiktoken (OpenAI's tokenizer)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "Retrieval-Augmented Generation improves LLM accuracy."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Example split (exact boundaries depend on the tokenizer version):
# ['Ret', 'riev', 'al', '-', 'Aug', 'mented', ' Generation', ' improves', ' LLM', ' accuracy', '.']
```
```python
# Anthropic's token counting — the API returns token usage in every response
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    messages=[{"role": "user", "content": "Explain tokenization in one sentence."}],
)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
```
Key Takeaways
- Tokenization converts text into numerical tokens that LLMs can process — it’s the first and last step of every AI interaction
- Most models use BPE, which splits text into common subword pieces (50K-200K vocabulary)
- Token count determines both cost and context window usage — roughly 1 token per English word
- Non-English languages and code may tokenize differently — check token counts before assuming costs
- You can inspect tokenization with tools like `tiktoken` (OpenAI) or the Anthropic API’s usage stats
Part of the DeepRaft Glossary — AI and ML terms explained for developers.