1. The Core Problem
Neural networks need numbers as input. But text is a sequence of characters, words, and symbols. How do we bridge this gap?
Three Approaches
| Approach | Example | Problem |
|---|---|---|
| Character-level | "cat" → [c=3, a=1, t=20] | Sequences too long, hard to learn patterns |
| Word-level | "cat" → [cat=892] | Vocabulary explodes (plurals, tenses, languages) |
| Subword ⭐ | "unbelievable" → [un, believ, able] | Best of both: whole tokens for common words, pieces for rare ones |
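The trade-off can be made concrete with a toy comparison (the subword split below is illustrative, not from a real tokenizer):

```python
# Illustrative splits only; the subword pieces are made up for the example.
word = "unbelievable"

char_tokens = list(word)                   # one token per character
word_tokens = [word]                       # one token per word
subword_tokens = ["un", "believ", "able"]  # reusable pieces

print(len(char_tokens), len(word_tokens), len(subword_tokens))  # 12 1 3
```

Character-level gives long sequences; word-level needs a separate entry for every surface form; subword keeps both the vocabulary and the sequence length manageable.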
2. Byte Pair Encoding (BPE)
The dominant tokenization algorithm — used by GPT, Claude, Llama, and most modern LLMs.
Core Idea
Start with individual characters, then iteratively merge the most frequent adjacent pairs.
Step-by-Step Example
Training corpus: "low low low lower lowest"
Initial representation (characters + end-of-word token </w>):
l o w </w> (freq: 3)
l o w e r </w> (freq: 1)
l o w e s t </w> (freq: 1)
Iteration 1 — Count pairs, merge most frequent:
| Pair | Frequency |
|---|---|
| (l, o) | 5 ← most frequent (tied with (o, w); the first-seen pair wins) |
| (o, w) | 5 |
| (w, </w>) | 3 |
| (w, e) | 2 |
| (e, r) | 1 |
Merge (l, o) → "lo":
lo w </w> (freq: 3)
lo w e r </w> (freq: 1)
lo w e s t </w> (freq: 1)
Iteration 2 — Merge (lo, w) → "low":
low </w> (freq: 3)
low e r </w> (freq: 1)
low e s t </w> (freq: 1)
Iteration 3 — Merge (low, </w>) → "low</w>":
low</w> (freq: 3) ← standalone "low"
low e r </w> (freq: 1)
low e s t </w> (freq: 1)
Repeat for thousands of merges over billions of words until the target vocabulary size (typically 50,000–100,000 tokens) is reached.
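The training loop above can be sketched in a few lines. This is a minimal, unoptimized version of the classic BPE algorithm, with symbol lists and frequencies taken from the worked example:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace each occurrence of `pair` with its concatenation."""
    merged = []
    for symbols, freq in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append((out, freq))
    return merged

# The corpus "low low low lower lowest" as (symbols, frequency) pairs
corpus = [
    (["l", "o", "w", "</w>"], 3),
    (["l", "o", "w", "e", "r", "</w>"], 1),
    (["l", "o", "w", "e", "s", "t", "</w>"], 1),
]

merges = []
for _ in range(3):
    counts = get_pair_counts(corpus)
    best = max(counts, key=counts.get)  # ties go to the first-seen pair
    corpus = merge_pair(corpus, best)
    merges.append(best)

print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', '</w>')]
```

Running it reproduces the three iterations above; a production trainer adds byte-level pre-tokenization and efficient pair-count updates, but the merge rule is the same.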
3. The BPE Vocabulary
After training, we get a hierarchical vocabulary:
| Level | Examples | Frequency |
|---|---|---|
| Common words | "the", "is", "and" | Very high |
| Common subwords | "ing", "un", "pre" | High |
| Rare subwords | "tion", "able" | Medium |
| Single characters | "z", "x", "q" | Low (fallback for anything unseen) |
4. Tokenizing Real Text
Example Tokenizations (GPT-4 style)
| Text | Tokens | Count |
|---|---|---|
"Hello world" | ["Hello", " world"] | 2 |
"unbelievable" | ["un", "belie", "vable"] | 3 |
"ChatGPT" | ["Chat", "G", "PT"] | 3 |
"Python" | ["Python"] | 1 |
"Anthropic" | ["Anthrop", "ic"] | 2 |
"1234567" | ["1", "234", "567"] | 3 |
Note: In GPT-style BPE, the space is attached to the following word: " world" (with its leading space) is a single token.
5. Special Tokens
Modern LLMs use special tokens to structure conversations:
| Token | Purpose | Example |
|---|---|---|
| <|system|> | System prompt start | Instructions to model |
| <|user|> | User turn start | Human message |
| <|assistant|> | Assistant turn start | Model response |
| <|end|> | Turn end | Marks completion |
| <|pad|> | Padding | Batch alignment |
| <|unk|> | Unknown token | Unseen characters |
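Assembling a conversation then amounts to string formatting with these tokens. The template below is hypothetical; every model family defines its own exact chat format:

```python
# Hypothetical chat template using the special tokens above; real models
# each specify their own exact layout and token names.
def format_chat(system, turns):
    parts = [f"<|system|>{system}<|end|>"]
    for role, text in turns:
        parts.append(f"<|{role}|>{text}<|end|>")
    parts.append("<|assistant|>")  # cue the model to generate its reply
    return "".join(parts)

prompt = format_chat("You are helpful.", [("user", "Hi!")])
print(prompt)
# <|system|>You are helpful.<|end|><|user|>Hi!<|end|><|assistant|>
```

Because these strings map to single reserved token IDs, the model can never confuse user text that *looks* like a control token with a real turn boundary.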
6. Token IDs
Every token maps to a unique integer ID:
Example vocabulary slice:
| ID | Token | Notes |
|---|---|---|
| 0 | <|endoftext|> | Special |
| 1 | the | Most common |
| 2 | of | |
| 3 | and | |
| ... | ... | |
| 892 | " playing" | Leading space is part of the token |
| 1203 | ##ing | Suffix (## is WordPiece-style notation) |
| 50256 | <|pad|> | Special |
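Encoding and decoding are dictionary lookups in opposite directions (toy vocabulary slice; real IDs come from the trained tokenizer):

```python
# Toy vocabulary slice; real IDs come from the trained tokenizer file.
vocab = {"<|endoftext|>": 0, "the": 1, "of": 2, "and": 3}
id_to_token = {i: t for t, i in vocab.items()}

ids = [vocab[t] for t in ["the", "and", "of"]]
print(ids)                            # [1, 3, 2]
print([id_to_token[i] for i in ids])  # ['the', 'and', 'of']
```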
7. Context Window and Token Limits
Every model has a maximum number of tokens it can process at once:
| Model | Context Window | Approx. Words |
|---|---|---|
| GPT-3.5 | 4,096 tokens | ~3,000 words |
| GPT-4 Turbo | 128,000 tokens | ~96,000 words |
| Claude 3.5 | 200,000 tokens | ~150,000 words |
| Gemini 1.5 | 1,000,000 tokens | ~750,000 words |
Token Budget in a Conversation
When the limit is hit → earliest messages are dropped → model "forgets" old context.
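A minimal sketch of this truncation, assuming a crude character-based token estimate (~4 characters per token, an assumption for illustration):

```python
def trim_history(messages, max_tokens, count_tokens):
    """Drop the oldest messages until the conversation fits the window."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # forget the earliest message
    return kept

# Crude estimate: ~1 token per 4 characters (assumption, English-only)
est = lambda m: max(1, len(m) // 4)
msgs = ["hello there friend", "second message here", "latest question?"]
print(trim_history(msgs, 9, est))
# ['second message here', 'latest question?']
```

Real systems use the actual tokenizer for counting, and often summarize old turns instead of dropping them outright.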
8. Tokens → Words Conversion
Rule of thumb for English:
| Tokens | Words | Real-world equivalent |
|---|---|---|
| 100 | ~75 | Short paragraph |
| 1,000 | ~750 | 1.5 pages |
| 10,000 | ~7,500 | 15 pages |
| 100,000 | ~75,000 | Short novel |
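As code, the rule of thumb is a single multiplication:

```python
# Rule of thumb for English text: 1 token ≈ 0.75 words.
def tokens_to_words(n_tokens):
    return round(n_tokens * 0.75)

def words_to_tokens(n_words):
    return round(n_words / 0.75)

print(tokens_to_words(100_000))  # 75000 -> roughly a short novel
print(words_to_tokens(750))      # 1000
```

The ratio holds only for typical English prose; code, numbers, and non-English text consume noticeably more tokens per word.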
9. Why Tokenization Causes Weird Failures
Problem 1 — Letter Counting
"How many R's in STRAWBERRY?"
Tokenized: ["STR", "AW", "BER", "RY"]
The model sees tokens, not individual letters. It must infer character content from token representations → often fails.
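A quick illustration of the mismatch: the count is trivial at the character level, but the model only ever receives the token sequence:

```python
# Illustrative token split for STRAWBERRY
tokens = ["STR", "AW", "BER", "RY"]

# Counting letters requires reconstructing the character sequence,
# which the model never observes directly:
n_r = "".join(tokens).count("R")
print(n_r)  # 3
```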
Problem 2 — Arithmetic with Large Numbers
"1,234,567 × 89"
Tokenized: ["1", ",", "234", ",", "567", " ×", " 89"]
Math must be performed across arbitrary token boundaries → unreliable without tools.
Problem 3 — Language Inequality
The tokenizer was trained primarily on English → non-English text is tokenized less efficiently:
| Language | Text | Tokens | Efficiency |
|---|---|---|---|
| English | "Hello, how are you?" | 6 | 1× |
| Korean | "안녕하세요, 어떻게 지내세요?" | 14 | 0.43× |
| Arabic | "مرحباً، كيف حالك؟" | 12 | 0.5× |
Non-English text uses more tokens → costs more, uses context window faster.
10. SentencePiece vs BPE
| | BPE (GPT-style) | SentencePiece |
|---|---|---|
| Space handling | Space attached to next token: " hello" | Special underscore: "▁hello" |
| Language support | English-centric | More language-agnostic |
| Used by | GPT, Claude | LLaMA, T5, mT5 |
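The space-handling difference can be sketched as follows (simplified; SentencePiece also performs its own segmentation and supports both BPE and Unigram models):

```python
# GPT-style BPE keeps the space inside the following token; SentencePiece
# replaces each space with the visible marker "▁" (U+2581).
text = "hello world"

bpe_style = ["hello", " world"]             # leading space on " world"
sp_style = ["▁" + w for w in text.split()]  # marker on every word start
print(sp_style)  # ['▁hello', '▁world']
```

Making the space an ordinary symbol is what lets SentencePiece work on languages that do not use spaces between words at all.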
11. Complete Pipeline: Text to Prediction
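A toy end-to-end sketch of the stages (all names, IDs, and vectors below are illustrative, not from a real model):

```python
# Toy stand-ins for each stage of the pipeline.
text = "the cat"

# 1. Tokenize: text -> subword tokens (space attached to the next word)
tokens = ["the", " cat"]

# 2. Vocabulary lookup: tokens -> integer IDs
vocab = {"the": 1, " cat": 1996}
ids = [vocab[t] for t in tokens]

# 3. Embedding lookup: each ID selects one row of the embedding matrix
embedding_matrix = {1: [0.1, 0.3], 1996: [0.7, 0.2]}  # toy 2-d vectors
vectors = [embedding_matrix[i] for i in ids]

# 4. The transformer maps these vectors to logits over the vocabulary;
#    softmax gives next-token probabilities, and the chosen ID is
#    decoded back to text via the inverse vocabulary.
print(ids)  # [1, 1996]
```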
12. Quick Reference
| Concept | Formula / Value |
|---|---|
| BPE merge rule | Merge most frequent adjacent pair |
| Token ≈ words | 1 token ≈ 0.75 English words |
| Vocabulary size | Typically 50,000–100,000 tokens |
| Embedding lookup | Token ID → one row of the embedding matrix |
| Context window | ~4K to 1M tokens (modern models) |