ai · 2026-03-20 · 6 min read

Tokenization

How machines actually read text — and why it's weirder than you think.

1. The Core Problem

Neural networks need numbers as input. But text is a sequence of characters, words, and symbols. How do we bridge this gap?

Three Approaches

| Approach | Example | Trade-off |
|---|---|---|
| Character-level | "cat" → [c=3, a=1, t=20] | Sequences too long; hard to learn patterns |
| Word-level | "cat" → [cat=892] | Vocabulary explodes (plurals, tenses, languages) |
| Subword | "cats" → [cat=892, s=15] | Best of both worlds |

2. Byte Pair Encoding (BPE)

The dominant tokenization algorithm — used by GPT, Claude, Llama, and most modern LLMs.

Core Idea

Start with individual characters, then iteratively merge the most frequent adjacent pairs.

Step-by-Step Example

Training corpus: "low low low lower lowest"

Initial representation (characters + end-of-word token </w>):

l o w </w>       (freq: 3)
l o w e r </w>   (freq: 1)
l o w e s t </w> (freq: 1)

Iteration 1 — Count pairs, merge most frequent:

| Pair | Frequency |
|---|---|
| (l, o) | 5 ← most frequent |
| (o, w) | 5 |
| (w, e) | 2 |
| (e, r) | 1 |

Merge (l, o) → "lo":

lo w </w>       (freq: 3)
lo w e r </w>   (freq: 1)
lo w e s t </w> (freq: 1)

Iteration 2 — Merge (lo, w) → "low":

low </w>       (freq: 3)
low e r </w>   (freq: 1)
low e s t </w> (freq: 1)

Iteration 3 — Merge (low, </w>) → "low</w>":

low</w>            (freq: 3) ← standalone "low"
low e r </w>       (freq: 1)
low e s t </w>     (freq: 1)

Repeat thousands of times on billions of words until the target vocabulary size (typically 50,000–100,000 tokens) is reached.
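The steps above can be sketched in a few lines. This is a toy trainer, not a production tokenizer: it omits pre-tokenization, byte-level fallback, and efficient pair caching, but it reproduces the merge sequence from the worked example.

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    """Toy BPE trainer: corpus_words maps word -> frequency."""
    # Represent each word as a tuple of symbols, plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_train({"low": 3, "lower": 1, "lowest": 1}, num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', '</w>')]
```

The three learned merges match Iterations 1–3 above, and standalone "low" ends up as the single token `low</w>` with frequency 3.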


3. The BPE Vocabulary

After training, we get a hierarchical vocabulary:

| Level | Examples | Frequency |
|---|---|---|
| Common words | "the", "is", "and" | Very high |
| Common subwords | "ing", "un", "pre" | High |
| Rare subwords | "tion", "able" | Medium |
| Single characters | "z", "x", "q" | Fallback |

4. Tokenizing Real Text

Example Tokenizations (GPT-4 style)

| Text | Tokens | Count |
|---|---|---|
| "Hello world" | ["Hello", " world"] | 2 |
| "unbelievable" | ["un", "belie", "vable"] | 3 |
| "ChatGPT" | ["Chat", "G", "PT"] | 3 |
| "Python" | ["Python"] | 1 |
| "Anthropic" | ["Anthrop", "ic"] | 2 |
| "1234567" | ["1", "234", "567"] | 3 |

Note: The space is attached to the following word in BPE. " world" (with leading space) is a single token.
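The space-attachment behaviour is easy to see with a toy longest-match tokenizer. A real BPE tokenizer applies its learned merge rules rather than greedy longest-match, and the vocabulary below is made up for illustration, but the mechanics of a leading-space token are the same.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer over a fixed vocabulary.
    Real BPE applies learned merges instead, but also attaches
    the leading space to the following token."""
    tokens = []
    i = 0
    while i < len(text):
        # Find the longest vocab entry matching at position i.
        match = None
        for token in vocab:
            if text.startswith(token, i) and (match is None or len(token) > len(match)):
                match = token
        if match is None:          # fall back to a single character
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

# Hypothetical vocabulary entries; note " world" includes the space.
vocab = {"Hello", " world", " wor", "wor", "ld", "un", "belie", "vable"}
print(greedy_tokenize("Hello world", vocab))   # ['Hello', ' world']
print(greedy_tokenize("unbelievable", vocab))  # ['un', 'belie', 'vable']
```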


5. Special Tokens

Modern LLMs use special tokens to structure conversations:

| Token | Purpose | Example |
|---|---|---|
| <\|system\|> | System prompt start | Instructions to the model |
| <\|user\|> | User turn start | Human message |
| <\|assistant\|> | Assistant turn start | Model response |
| <\|end\|> | Turn end | Marks completion |
| <\|pad\|> | Padding | Batch alignment |
| <\|unk\|> | Unknown token | Unseen characters |
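A chat template stitches these markers around each turn before tokenization. The sketch below uses the illustrative token names from the table; every real model family defines its own set and exact layout.

```python
def render_chat(system, turns):
    """Assemble a conversation into one string using special markers.
    Token names (<|system|>, <|end|>, ...) are illustrative."""
    parts = [f"<|system|>{system}<|end|>"]
    for role, text in turns:
        parts.append(f"<|{role}|>{text}<|end|>")
    parts.append("<|assistant|>")   # left open for the model to complete
    return "".join(parts)

prompt = render_chat("Be concise.", [("user", "Hi!"), ("assistant", "Hello!"), ("user", "Bye")])
print(prompt)
# <|system|>Be concise.<|end|><|user|>Hi!<|end|><|assistant|>Hello!<|end|><|user|>Bye<|end|><|assistant|>
```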

6. Token IDs and Embeddings

Every token maps to a unique integer ID:

$$\text{Token} \xrightarrow{\text{lookup}} \text{ID} \xrightarrow{\text{embedding}} \mathbf{e} \in \mathbb{R}^d$$

Example vocabulary slice:

| ID | Token | Notes |
|---|---|---|
| 0 | <\|endoftext\|> | Special |
| 1 | the | Most common |
| 2 | of | |
| 3 | and | |
| ... | ... | |
| 892 | " playing" | Note leading space |
| 1203 | ing | Common suffix |
| 50256 | <\|pad\|> | Special |
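The lookup chain is two dictionary/array indexes. A minimal sketch, with random numbers standing in for learned embedding weights and a tiny dimension `d_model = 4` for readability:

```python
import random

random.seed(0)
d_model = 4            # tiny embedding dimension for illustration
vocab = {"<|endoftext|>": 0, "the": 1, "of": 2, "and": 3, " playing": 892}

# Embedding matrix E: one d-dimensional vector per token ID.
# In a trained model these are learned; here they are random stand-ins.
E = {tid: [random.uniform(-1, 1) for _ in range(d_model)] for tid in vocab.values()}

def embed(token):
    tid = vocab[token]          # Token --lookup--> ID
    return E[tid]               # ID --embedding--> e in R^d

e = embed(" playing")
print(len(e))  # 4
```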

7. Context Window and Token Limits

Every model has a maximum number of tokens it can process at once:

| Model | Context Window | Approx. Words |
|---|---|---|
| GPT-3.5 | 4,096 tokens | ~3,000 words |
| GPT-4 | 128,000 tokens | ~96,000 words |
| Claude 3.5 | 200,000 tokens | ~150,000 words |
| Gemini 1.5 | 1,000,000 tokens | ~750,000 words |

Token Budget in a Conversation

$$\underbrace{500}_{\text{system prompt}} + \underbrace{200}_{\text{user msg}} + \underbrace{800}_{\text{response}} + \underbrace{2500}_{\text{history}} = 4000 \text{ tokens used}$$

When the limit is hit → earliest messages are dropped → model "forgets" old context.
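That drop-the-oldest strategy can be sketched directly. The word-count token estimator here is a deliberate simplification (real systems count actual tokens):

```python
def trim_history(messages, max_tokens, count_tokens):
    """Drop the earliest messages until the conversation fits the window."""
    messages = list(messages)
    while messages and sum(count_tokens(m) for m in messages) > max_tokens:
        messages.pop(0)          # the model "forgets" the oldest turn
    return messages

# Crude token counter: ~1 token per word (real tokenizers differ).
count = lambda m: len(m.split())

history = ["a " * 500, "b " * 300, "c " * 200]       # 500 + 300 + 200 "tokens"
kept = trim_history(history, max_tokens=600, count_tokens=count)
print(len(kept))  # 2 -- the 500-word opener was dropped
```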


8. Tokens → Words Conversion

Rule of thumb for English:

$$1 \text{ token} \approx 0.75 \text{ words}$$

| Tokens | Words | Real-world equivalent |
|---|---|---|
| 100 | ~75 | Short paragraph |
| 1,000 | ~750 | 1.5 pages |
| 10,000 | ~7,500 | 15 pages |
| 100,000 | ~75,000 | Short novel |
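As a back-of-envelope helper (remember this is a rule of thumb for English; other languages and code skew heavily):

```python
TOKENS_PER_WORD = 1 / 0.75   # ~1.33 tokens per English word (rule of thumb)

def tokens_to_words(tokens: int) -> int:
    """Estimate English word count from a token count."""
    return round(tokens * 0.75)

def words_to_tokens(words: int) -> int:
    """Estimate token count from an English word count."""
    return round(words * TOKENS_PER_WORD)

print(tokens_to_words(1000))   # 750
print(words_to_tokens(75000))  # 100000
```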

9. Why Tokenization Causes Weird Failures

Problem 1 — Letter Counting

"How many R's in STRAWBERRY?"

Tokenized: ["STR", "AW", "BER", "RY"]

The model sees tokens, not individual letters. It must infer character content from token representations → often fails.

$$P(\text{correct count} \mid \text{token representations}) \ll 1$$

Problem 2 — Arithmetic with Large Numbers

"1,234,567 × 89"
Tokenized: ["1", ",", "234", ",", "567", " ×", " 89"]

Math must be performed across arbitrary token boundaries → unreliable without tools.
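The "1234567" → ["1", "234", "567"] split above corresponds to right-aligned groups of up to three digits. A toy chunker reproducing that grouping (actual tokenizers differ in their digit-grouping rules, and some group left-to-right):

```python
def chunk_digits(number_str):
    """Split a digit string into groups of up to three, right-aligned,
    mimicking the example above. Real tokenizers vary in grouping rules."""
    out = []
    i = len(number_str)
    while i > 0:
        out.append(number_str[max(0, i - 3):i])
        i -= 3
    return out[::-1]

print(chunk_digits("1234567"))  # ['1', '234', '567']
```

A model doing long multiplication must re-align digits across these arbitrary boundaries, which is why tool use (a calculator) is far more reliable.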

Problem 3 — Language Inequality

The tokenizer was trained primarily on English → non-English text is tokenized less efficiently:

| Language | Text | Tokens | Efficiency |
|---|---|---|---|
| English | "Hello, how are you?" | 6 | 1× |
| Korean | "안녕하세요, 어떻게 지내세요?" | 14 | 0.43× |
| Arabic | "مرحباً، كيف حالك؟" | 12 | 0.5× |

Non-English text uses more tokens → costs more, uses context window faster.
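The efficiency column is just the ratio of token counts against the English baseline:

```python
def token_efficiency(baseline_tokens, other_tokens):
    """Relative efficiency vs the English baseline (1.0 = equally efficient)."""
    return baseline_tokens / other_tokens

print(round(token_efficiency(6, 14), 2))  # 0.43 (Korean)
print(round(token_efficiency(6, 12), 2))  # 0.5  (Arabic)
```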


10. SentencePiece vs BPE

| | BPE (GPT-style) | SentencePiece |
|---|---|---|
| Space handling | Space attached to next token: " hello" | Underscore marker: "▁hello" |
| Language support | English-centric | More language-agnostic |
| Used by | GPT, Claude | Llama, T5, mT5 |
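The "▁" (U+2581) trick makes spaces part of the symbol stream, so the original text is exactly recoverable from the pieces. A simplified sketch of just the space handling (real SentencePiece also normalizes Unicode and learns the segmentation):

```python
def sp_pretokenize(text):
    """SentencePiece-style: replace spaces with U+2581 '▁' so the space
    travels with the token and detokenization is lossless."""
    return "▁" + text.replace(" ", "▁")   # leading marker on the first word too

def sp_detokenize(pieces):
    """Invert the marking: join pieces, turn '▁' back into spaces."""
    return "".join(pieces).replace("▁", " ").lstrip()

marked = sp_pretokenize("hello world")
print(marked)  # ▁hello▁world
print(sp_detokenize(["▁hello", "▁wor", "ld"]))  # hello world
```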

11. Complete Pipeline: Text to Prediction

Text → Tokenizer → Token IDs → Embedding lookup → Transformer layers → Logits → Next-token prediction
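A minimal sketch of the data flow, with stub components standing in for the real tokenizer and model (none of these stubs resemble actual model internals; they only show how the stages connect):

```python
def next_token_pipeline(text, tokenize, embed, transformer, unembed):
    """End-to-end sketch: text -> token IDs -> embeddings -> logits -> next token."""
    ids = tokenize(text)                     # text -> token IDs
    vectors = [embed(i) for i in ids]        # IDs -> embedding vectors
    hidden = transformer(vectors)            # contextualize the sequence
    logits = unembed(hidden)                 # score every vocabulary entry
    return max(range(len(logits)), key=logits.__getitem__)  # greedy pick

# Stub components just to exercise the flow (not a real model).
pred = next_token_pipeline(
    "the cat",
    tokenize=lambda t: [1, 7],
    embed=lambda i: [float(i)] * 4,
    transformer=lambda vs: vs[-1],
    unembed=lambda h: [0.1, 0.9, 0.3],
)
print(pred)  # 1 -- the highest-scoring vocabulary ID
```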

12. Quick Reference

| Concept | Formula / Value |
|---|---|
| BPE merge rule | Merge the most frequent adjacent pair |
| Token ≈ words | $1 \text{ token} \approx 0.75$ words |
| Vocabulary size | $V \approx 50{,}000\text{–}100{,}000$ |
| Embedding lookup | $\mathbf{e}_i = \mathbf{E}[i]$ |
| Context window | 128k–1M tokens (modern models) |

13. The Full AI Backbone — Complete Summary

Tokenization is the first link in the chain: text becomes tokens, tokens become IDs, IDs become embeddings, and everything downstream — attention, prediction, generation — operates on those embeddings.
Filed under: ai
