Tokenization

Learn what Tokenization means in AI and machine learning, with examples and related concepts.

Definition

Tokenization is the process of breaking text into smaller units called tokens that an LLM can process. Tokens aren’t always full words — they can be word fragments, individual characters, or even punctuation marks.

Every interaction with an AI model starts with tokenization. When you type “I love programming” into ChatGPT or Claude, the model doesn’t see words — it sees something like ["I", " love", " program", "ming"]. The model processes, understands, and generates text entirely through these tokens.

Tokenization matters because it directly affects what you pay (API pricing is per-token), how much text fits in a context window, and even how well the model handles different languages. English typically averages roughly 1 to 1.3 tokens per word, while languages such as Japanese or Korean are often less token-efficient, sometimes needing multiple tokens for a single character.
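Because both pricing and context budgeting are per-token, it is useful to estimate counts before calling an API. A minimal sketch using the common rule of thumb of roughly 4 characters per token for English text (the heuristic and the price below are placeholders, not the real tokenizer or a real rate):

```python
# Back-of-the-envelope token estimate: ~4 characters per token is a
# common rule of thumb for English. Real counts require the model's
# actual tokenizer; this is only a quick approximation.

def estimate_tokens(text: str) -> int:
    """Approximate token count for English text (~4 chars per token)."""
    return max(1, len(text) // 4)

def estimate_cost(text: str, price_per_million_tokens: float) -> float:
    """Rough input cost in dollars at a given per-million-token price."""
    return estimate_tokens(text) * price_per_million_tokens / 1_000_000

prompt = "I love programming"   # 18 characters
print(estimate_tokens(prompt))  # → 4

# Hypothetical rate of $3 per million input tokens
print(f"{estimate_cost(prompt * 1000, 3.0):.4f}")  # → 0.0135
```

The heuristic breaks down for non-English text and for code, which is exactly why the tiktoken example later in this article uses the model's real tokenizer.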

How It Works

Most modern LLMs use BPE (Byte-Pair Encoding) or a variant of it. The idea is to build a vocabulary of common character sequences:

Step 1: Start with individual characters
  "programming" → ["p", "r", "o", "g", "r", "a", "m", "m", "i", "n", "g"]

Step 2: Merge the most frequent pairs repeatedly
  "p" + "r" → "pr"
  "o" + "g" → "og"
  "pr" + "og" → "prog"
  "r" + "a" → "ra"
  "m" + "m" → "mm"
  "n" + "g" → "ng"
  "i" + "ng" → "ing"

Step 3: Final tokens
  "programming" → ["prog", "ra", "mm", "ing"]

Common words like “the” or “and” become single tokens. Rare words get split into pieces. This gives BPE an efficient balance — a vocabulary of 50K-100K tokens can represent any text.
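The merge loop described above can be sketched in a few lines. This is a toy illustration of BPE training, not a real tokenizer: production implementations (tiktoken, SentencePiece) operate on bytes and train over massive corpora, and the tiny corpus and merge count here are invented for the example:

```python
# Toy BPE training sketch: repeatedly merge the most frequent adjacent
# pair of symbols across a (word -> frequency) corpus.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)   # fuse the pair into a single symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Invented mini-corpus: each word starts as a tuple of characters
words = {tuple("programming"): 5, tuple("program"): 3, tuple("gram"): 2}

for _ in range(6):  # perform 6 merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)

print(words)
# After 6 merges, "program" and "gram" have become single tokens, and
# "programming" tokenizes as ('program', 'm', 'i', 'n', 'g')
```

Running more merges would eventually collapse "programming" into one token too; real tokenizers simply stop once the vocabulary reaches its target size.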

Model     Tokenizer           Vocab Size
GPT-4o    o200k_base          ~200K
Claude    Custom BPE          ~100K
Llama 3   SentencePiece BPE   128K

Example

# Count tokens using tiktoken (OpenAI's tokenizer)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

text = "Retrieval-Augmented Generation improves LLM accuracy."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# → something like ['Ret', 'riev', 'al', '-', 'Aug', 'mented', ' Generation',
#   ' improves', ' LLM', ' accuracy', '.'] (exact splits vary by tokenizer version)

# Anthropic's token counting
from anthropic import Anthropic

client = Anthropic()

# The API returns token usage in every response
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    messages=[{"role": "user", "content": "Explain tokenization in one sentence."}]
)

print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")

Key Takeaways

- Tokens are the units LLMs actually process; they are often word fragments rather than whole words.
- Token counts drive API cost and determine how much text fits in a context window.
- Most modern LLMs use BPE, which builds its vocabulary by repeatedly merging the most frequent symbol pairs.
- Token efficiency varies by language: English usually needs fewer tokens per word than Japanese or Korean.
- Count tokens with tiktoken for OpenAI models, or read the usage field in Anthropic API responses.

Part of the DeepRaft Glossary — AI and ML terms explained for developers.