1. The Core Problem
Neural networks need numbers as input. But text is a sequence of characters, words, and symbols. How do we bridge this gap?
Three Approaches
| Approach | Example | Problem |
|---|---|---|
| Character-level | "cat" → [c=3, a=1, t=20] | Sequences too long, hard to learn patterns |
| Word-level | "cat" → [cat=892] | Vocabulary explodes (plurals, tenses, languages) |
| Subword ⭐ | "unbelievable" → [un, believ, able] | Best of both: whole tokens for common words, pieces for rare ones |
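The trade-off can be made concrete with a toy comparison (the subword split below is illustrative, not from a real tokenizer):

```python
# Illustrative splits only; the subword pieces are made up for the example.
word = "unbelievable"

char_tokens = list(word)                   # one token per character
word_tokens = [word]                       # one token per word
subword_tokens = ["un", "believ", "able"]  # reusable pieces

print(len(char_tokens), len(word_tokens), len(subword_tokens))  # 12 1 3
```

Character-level gives long sequences; word-level needs a separate entry for every surface form; subword keeps both the vocabulary and the sequence length manageable.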
2. Byte Pair Encoding (BPE)
The dominant tokenization algorithm — used by GPT, Claude, Llama, and most modern LLMs.
Core Idea
Start with individual characters, then iteratively merge the most frequent adjacent pairs.
Step-by-Step Example
Training corpus: "low low low lower lowest"
Initial representation (characters + end-of-word token </w>):
l o w </w> (freq: 3)
l o w e r </w> (freq: 1)
l o w e s t </w> (freq: 1)
Iteration 1 — Count pairs, merge most frequent:
| Pair | Frequency |
|---|---|
| (l, o) | 5 ← most frequent (tied with (o, w); the first-seen pair wins) |
| (o, w) | 5 |
| (w, </w>) | 3 |
| (w, e) | 2 |
| (e, r) | 1 |
Merge (l, o) → "lo":
lo w </w> (freq: 3)
lo w e r </w> (freq: 1)
lo w e s t </w> (freq: 1)
Iteration 2 — Merge (lo, w) → "low":
low </w> (freq: 3)
low e r </w> (freq: 1)
low e s t </w> (freq: 1)
Iteration 3 — Merge (low, </w>) → "low</w>":
low</w> (freq: 3) ← standalone "low"
low e r </w> (freq: 1)
low e s t </w> (freq: 1)
Repeat for thousands of merges over billions of words until the target vocabulary size (typically 50,000–100,000 tokens) is reached.
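The training loop above can be sketched in a few lines. This is a minimal, unoptimized version of the classic BPE algorithm, with symbol lists and frequencies taken from the worked example:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace each occurrence of `pair` with its concatenation."""
    merged = []
    for symbols, freq in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append((out, freq))
    return merged

# The corpus "low low low lower lowest" as (symbols, frequency) pairs
corpus = [
    (["l", "o", "w", "</w>"], 3),
    (["l", "o", "w", "e", "r", "</w>"], 1),
    (["l", "o", "w", "e", "s", "t", "</w>"], 1),
]

merges = []
for _ in range(3):
    counts = get_pair_counts(corpus)
    best = max(counts, key=counts.get)  # ties go to the first-seen pair
    corpus = merge_pair(corpus, best)
    merges.append(best)

print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', '</w>')]
```

Running it reproduces the three iterations above; a production trainer adds byte-level pre-tokenization and efficient pair-count updates, but the merge rule is the same.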
3. The BPE Vocabulary
After training, we get a hierarchical vocabulary:
| Level | Examples | Frequency |
|---|---|---|
| Common words | "the", "is", "and" | Very high |
| Common subwords | "ing", "un", "pre" | High |
| Rare subwords | "tion", "able" | Medium |
| Single characters | "z", "x", "q" | Low (fallback for anything unseen) |
4. Tokenizing Real Text
Example Tokenizations (GPT-4 style)
| Text | Tokens | Count |
|---|---|---|
"Hello world" | ["Hello", " world"] | 2 |
"unbelievable" | ["un", "belie", "vable"] | 3 |
"ChatGPT" | ["Chat", "G", "PT"] | 3 |
"Python" | ["Python"] | 1 |
"Anthropic" | ["Anthrop", "ic"] | 2 |
"1234567" | ["1", "234", "567"] | 3 |
Note: In GPT-style BPE, the space is attached to the following word: " world" (with its leading space) is a single token.
5. Special Tokens
Modern LLMs use special tokens to structure conversations:
| Token | Purpose | Example |
|---|---|---|
| <|system|> | System prompt start | Instructions to model |
| <|user|> | User turn start | Human message |
| <|assistant|> | Assistant turn start | Model response |
| <|end|> | Turn end | Marks completion |
| <|pad|> | Padding | Batch alignment |
| <|unk|> | Unknown token | Unseen characters |
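Assembling a conversation then amounts to string formatting with these tokens. The template below is hypothetical; every model family defines its own exact chat format:

```python
# Hypothetical chat template using the special tokens above; real models
# each specify their own exact layout and token names.
def format_chat(system, turns):
    parts = [f"<|system|>{system}<|end|>"]
    for role, text in turns:
        parts.append(f"<|{role}|>{text}<|end|>")
    parts.append("<|assistant|>")  # cue the model to generate its reply
    return "".join(parts)

prompt = format_chat("You are helpful.", [("user", "Hi!")])
print(prompt)
# <|system|>You are helpful.<|end|><|user|>Hi!<|end|><|assistant|>
```

Because these strings map to single reserved token IDs, the model can never confuse user text that *looks* like a control token with a real turn boundary.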
6. Token IDs
Every token maps to a unique integer ID:
Example vocabulary slice:
| ID | Token | Notes |
|---|---|---|
| 0 | <|endoftext|> | Special |
| 1 | the | Most common |
| 2 | of | |
| 3 | and | |
| ... | ... | |
| 892 | " playing" | Leading space is part of the token |
| 1203 | ##ing | Suffix (## is WordPiece-style notation) |
| 50256 | <|pad|> | Special |
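Encoding and decoding are dictionary lookups in opposite directions (toy vocabulary slice; real IDs come from the trained tokenizer):

```python
# Toy vocabulary slice; real IDs come from the trained tokenizer file.
vocab = {"<|endoftext|>": 0, "the": 1, "of": 2, "and": 3}
id_to_token = {i: t for t, i in vocab.items()}

ids = [vocab[t] for t in ["the", "and", "of"]]
print(ids)                            # [1, 3, 2]
print([id_to_token[i] for i in ids])  # ['the', 'and', 'of']
```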
7. Context Window and Token Limits
Every model has a maximum number of tokens it can process at once:
| Model | Context Window | Approx. Words |
|---|---|---|
| GPT-3.5 | 4,096 tokens | ~3,000 words |
| GPT-4 Turbo | 128,000 tokens | ~96,000 words |
| Claude 3.5 | 200,000 tokens | ~150,000 words |
| Gemini 1.5 | 1,000,000 tokens | ~750,000 words |
Token Budget in a Conversation
When the limit is hit → earliest messages are dropped → model "forgets" old context.
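A minimal sketch of this truncation, assuming a crude character-based token estimate (~4 characters per token, an assumption for illustration):

```python
def trim_history(messages, max_tokens, count_tokens):
    """Drop the oldest messages until the conversation fits the window."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # forget the earliest message
    return kept

# Crude estimate: ~1 token per 4 characters (assumption, English-only)
est = lambda m: max(1, len(m) // 4)
msgs = ["hello there friend", "second message here", "latest question?"]
print(trim_history(msgs, 9, est))
# ['second message here', 'latest question?']
```

Real systems use the actual tokenizer for counting, and often summarize old turns instead of dropping them outright.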
8. Tokens → Words Conversion
Rule of thumb for English:
| Tokens | Words | Real-world equivalent |
|---|---|---|
| 100 | ~75 | Short paragraph |
| 1,000 | ~750 | 1.5 pages |
| 10,000 | ~7,500 | 15 pages |
| 100,000 | ~75,000 | Short novel |
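As code, the rule of thumb is a single multiplication:

```python
# Rule of thumb for English text: 1 token ≈ 0.75 words.
def tokens_to_words(n_tokens):
    return round(n_tokens * 0.75)

def words_to_tokens(n_words):
    return round(n_words / 0.75)

print(tokens_to_words(100_000))  # 75000 -> roughly a short novel
print(words_to_tokens(750))      # 1000
```

The ratio holds only for typical English prose; code, numbers, and non-English text consume noticeably more tokens per word.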
9. Why Tokenization Causes Weird Failures
Problem 1 — Letter Counting
"How many R's in STRAWBERRY?"
Tokenized: ["STR", "AW", "BER", "RY"]
The model sees tokens, not individual letters. It must infer character content from token representations → often fails.
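A quick illustration of the mismatch: the count is trivial at the character level, but the model only ever receives the token sequence:

```python
# Illustrative token split for STRAWBERRY
tokens = ["STR", "AW", "BER", "RY"]

# Counting letters requires reconstructing the character sequence,
# which the model never observes directly:
n_r = "".join(tokens).count("R")
print(n_r)  # 3
```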
Problem 2 — Arithmetic with Large Numbers
"1,234,567 × 89"
Tokenized: ["1", ",", "234", ",", "567", " ×", " 89"]
Math must be performed across arbitrary token boundaries → unreliable without tools.
Problem 3 — Language Inequality
The tokenizer was trained primarily on English → non-English text is tokenized less efficiently:
| Language | Text | Tokens | Efficiency |
|---|---|---|---|
| English | "Hello, how are you?" | 6 | 1× |
| Korean | "안녕하세요, 어떻게 지내세요?" | 14 | 0.43× |
| Arabic | "مرحباً، كيف حالك؟" | 12 | 0.5× |
Non-English text uses more tokens → costs more, uses context window faster.
10. SentencePiece vs BPE
| | BPE (GPT-style) | SentencePiece |
|---|---|---|
| Space handling | Space attached to next token: " hello" | Special underscore: "▁hello" |
| Language support | English-centric | More language-agnostic |
| Used by | GPT, Claude | LLaMA, T5, mT5 |
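The space-handling difference can be sketched as follows (simplified; SentencePiece also performs its own segmentation and supports both BPE and Unigram models):

```python
# GPT-style BPE keeps the space inside the following token; SentencePiece
# replaces each space with the visible marker "▁" (U+2581).
text = "hello world"

bpe_style = ["hello", " world"]             # leading space on " world"
sp_style = ["▁" + w for w in text.split()]  # marker on every word start
print(sp_style)  # ['▁hello', '▁world']
```

Making the space an ordinary symbol is what lets SentencePiece work on languages that do not use spaces between words at all.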
11. Complete Pipeline: Text to Prediction
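A toy end-to-end sketch of the stages (all names, IDs, and vectors below are illustrative, not from a real model):

```python
# Toy stand-ins for each stage of the pipeline.
text = "the cat"

# 1. Tokenize: text -> subword tokens (space attached to the next word)
tokens = ["the", " cat"]

# 2. Vocabulary lookup: tokens -> integer IDs
vocab = {"the": 1, " cat": 1996}
ids = [vocab[t] for t in tokens]

# 3. Embedding lookup: each ID selects one row of the embedding matrix
embedding_matrix = {1: [0.1, 0.3], 1996: [0.7, 0.2]}  # toy 2-d vectors
vectors = [embedding_matrix[i] for i in ids]

# 4. The transformer maps these vectors to logits over the vocabulary;
#    softmax gives next-token probabilities, and the chosen ID is
#    decoded back to text via the inverse vocabulary.
print(ids)  # [1, 1996]
```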
12. Quick Reference
| Concept | Formula / Value |
|---|---|
| BPE merge rule | Merge most frequent adjacent pair |
| Token ≈ words | 1 token ≈ 0.75 English words |
| Vocabulary size | Typically 50,000–100,000 tokens |
| Embedding lookup | Token ID → one row of the embedding matrix |
| Context window | ~4K to 1M tokens (modern models) |