Tokenization
Learn what Tokenization means in AI and machine learning, with examples and related concepts.
Definition
Tokenization is the process of breaking text into smaller units called tokens that an LLM can process. Tokens aren’t always full words — they can be word fragments, individual characters, or even punctuation marks.
Every interaction with an AI model starts with tokenization. When you type “I love programming” into ChatGPT or Claude, the model doesn’t see words — it sees something like ["I", " love", " program", "ming"]. The model processes, understands, and generates text entirely through these tokens.
Tokenization matters because it directly affects what you pay (API pricing is per-token), how much text fits in a context window, and even how well the model handles different languages. English typically averages about 1-1.3 tokens per word, while languages underrepresented in tokenizer training data, like Japanese or Korean, may need 2-3x as many tokens for equivalent text.
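That per-word ratio gives a quick back-of-the-envelope estimate before you reach for a real tokenizer. The sketch below is a rough heuristic, not an exact count (the 1.3 multiplier is the approximate English average mentioned above); always use an actual tokenizer for billing-sensitive work.

```python
def rough_token_estimate(text: str) -> int:
    """Rough heuristic: English runs ~1.3 tokens per word.
    Use a real tokenizer (e.g. tiktoken) for exact counts."""
    return round(len(text.split()) * 1.3)

print(rough_token_estimate("I love programming"))  # → 4
```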
How It Works
Most modern LLMs use BPE (Byte-Pair Encoding) or a variant of it. The idea is to build a vocabulary of common character sequences:
Step 1: Start with individual characters
"programming" → ["p", "r", "o", "g", "r", "a", "m", "m", "i", "n", "g"]
Step 2: Merge the most frequent pairs repeatedly
"p" + "r" → "pr"
"o" + "g" → "og"
"pr" + "og" → "prog"
"r" + "a" → "ra"
"m" + "m" → "mm"
"i" + "ng" → "ing"
Step 3: Final tokens
"programming" → ["prog", "ra", "mm", "ing"]
Common words like “the” or “and” become single tokens. Rare words get split into pieces. This gives BPE an efficient balance — a vocabulary of 50K-200K tokens can represent any text.
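The merge mechanic in the steps above can be sketched in a few lines. This is a toy illustration, not a production tokenizer: real BPE training learns the merge order from pair frequencies across a large corpus, whereas here we apply the merge sequence from the example directly.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair — how BPE training picks each merge."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("programming")
# Apply the merge sequence from the steps above:
for pair in [("p", "r"), ("o", "g"), ("pr", "og"), ("r", "a"),
             ("m", "m"), ("n", "g"), ("i", "ng")]:
    tokens = merge_pair(tokens, pair)
print(tokens)  # → ['prog', 'ra', 'mm', 'ing']
```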
| Model | Tokenizer | Vocab Size |
|---|---|---|
| GPT-4o | o200k_base | ~200K |
| Claude | Custom BPE | ~100K |
| Llama 3 | SentencePiece BPE | 128K |
Why It Matters
- Cost — API pricing is per-token. Efficient prompts = lower bills. A 1,000-word English document is roughly 1,300 tokens.
- Context limits — Claude supports 200K tokens (~150K words). GPT-4o supports 128K. If your tokenizer is inefficient, you burn context on encoding overhead.
- Multilingual fairness — English-centric tokenizers make non-English text 2-3x more expensive per word. This is an active area of improvement.
- Code handling — Modern tokenizers treat code specially. Common patterns like `def`, `return`, or `import` often become single tokens.
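Since pricing is per-token, a small helper makes the cost math explicit. The prices below are hypothetical placeholders, not any provider's actual rates — check current pricing pages before budgeting.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Estimate API cost in dollars from token counts and per-million-token prices."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Hypothetical rates ($ per million tokens) — substitute your provider's real pricing.
cost = estimate_cost(1_300, 500, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"${cost:.4f}")  # ~1,000-word input document plus a 500-token reply
```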
Example
```python
# Count tokens using tiktoken (OpenAI's tokenizer)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "Retrieval-Augmented Generation improves LLM accuracy."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Example split (exact boundaries depend on the tokenizer version):
# ['Ret', 'riev', 'al', '-', 'Aug', 'mented', ' Generation', ' improves', ' LLM', ' accuracy', '.']
```
```python
# Anthropic's token counting — the API returns token usage in every response
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    messages=[{"role": "user", "content": "Explain tokenization in one sentence."}],
)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
```
Key Takeaways
- Tokenization converts text into numerical tokens that LLMs can process — it’s the first and last step of every AI interaction
- Most models use BPE, which splits text into common subword pieces (50K-200K vocabulary)
- Token count determines both cost and context window usage — roughly 1 token per English word
- Non-English languages and code may tokenize differently — check token counts before assuming costs
- You can inspect tokenization with tools like `tiktoken` (OpenAI) or the Anthropic API’s usage stats
Part of the DeepRaft Glossary — AI and ML terms explained for developers.