1. The Problem with Sequential Models
Before Transformers, sequences were processed with RNNs — one token at a time:
$h_t = f(h_{t-1}, x_t)$
Critical Failures
Long-range dependency problem:
"The cat that sat on the mat was hungry"
By the time the RNN reaches "hungry", memory of "cat" has faded through 6+ steps of hidden state compression.
Cannot parallelize:
- Must process $x_1 \to x_2 \to x_3 \to \dots$ sequentially
- GPUs excel at parallel computation → RNNs waste GPU power
- Training is painfully slow
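The recurrence can be sketched in a few lines of NumPy (toy dimensions and random weights, purely for illustration). Note the loop: step $t$ cannot start until step $t-1$ finishes, which is exactly what blocks parallelism.

```python
import numpy as np

# Toy RNN cell: h_t = tanh(W_h h_{t-1} + W_x x_t). Dimensions are made up.
rng = np.random.default_rng(0)
d_hidden, d_in, seq_len = 8, 4, 6
W_h = rng.normal(size=(d_hidden, d_hidden)) * 0.1
W_x = rng.normal(size=(d_hidden, d_in)) * 0.1
xs = rng.normal(size=(seq_len, d_in))     # the input sequence x_1 ... x_6

h = np.zeros(d_hidden)
for x_t in xs:                            # inherently sequential dependency
    h = np.tanh(W_h @ h + W_x @ x_t)      # h_t depends on h_{t-1}
print(h.shape)
```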
2. The Attention Intuition
Real-Life Analogy 👀
Reading: "The trophy didn't fit in the suitcase because it was too big"
To understand "it", your brain simultaneously looks at ALL words:
| Word | Attention Weight |
|---|---|
| trophy | 85% ← "it" = trophy! |
| suitcase | 10% |
| fit | 3% |
| because | 2% |
Your brain doesn't read left-to-right — it attends to all words at once.
YouTube Search Analogy 🔍
| Attention Component | Analogy | Role |
|---|---|---|
| Query (Q) | What you type in search | What this word is looking for |
| Key (K) | Video titles | What each word advertises |
| Value (V) | Actual video content | What each word provides |
| Attention score | Search relevance | Q⋅K similarity |
3. Scaled Dot-Product Attention
The Formula
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
This is the single most important formula in modern AI.
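The formula translates almost line-for-line into NumPy. A minimal sketch (function names and shapes are my own, not from any particular library):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_q, n_k) similarity matrix
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted sum of value vectors

# Tiny shape check with random matrices
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 2)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 2)
```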
4. Step-by-Step Example
Sentence: "Cat eats fish" (3 tokens, embedding dim = 4)
Token embeddings:
$x_{\text{cat}} = [1, 0, 1, 0]$
$x_{\text{eats}} = [0, 1, 0, 1]$
$x_{\text{fish}} = [1, 1, 0, 0]$
Step 1: Compute Q, K, V
For each token, project with learned weight matrices $W_Q, W_K, W_V$:
$Q = XW_Q,\quad K = XW_K,\quad V = XW_V$
Simplified for "cat":
$Q_{\text{cat}} = [0.8, 0.2],\quad K_{\text{cat}} = [0.9, 0.1],\quad V_{\text{cat}} = [1.0, 0.0]$
Step 2: Attention Scores (Dot Products)
$\text{score}(\text{cat} \to \text{cat}) = Q_{\text{cat}} \cdot K_{\text{cat}} = (0.8)(0.9) + (0.2)(0.1) = 0.74$
$\text{score}(\text{cat} \to \text{eats}) = Q_{\text{cat}} \cdot K_{\text{eats}} = 0.38$
$\text{score}(\text{cat} \to \text{fish}) = Q_{\text{cat}} \cdot K_{\text{fish}} = 0.50$
Step 3: Scale
Divide by $\sqrt{d_k} = \sqrt{2} \approx 1.41$. Without this scaling, large dot products push the softmax into saturated regions where gradients vanish:
$\text{scaled scores} = \left[\tfrac{0.74}{1.41},\ \tfrac{0.38}{1.41},\ \tfrac{0.50}{1.41}\right] = [0.52, 0.27, 0.35]$
Step 4: Softmax → Attention Weights
$\alpha = \mathrm{softmax}([0.52, 0.27, 0.35]) = [0.38, 0.30, 0.32]$
Step 5: Weighted Sum of Values
$\text{output}_{\text{cat}} = 0.38 \cdot V_{\text{cat}} + 0.30 \cdot V_{\text{eats}} + 0.32 \cdot V_{\text{fish}}$
$= 0.38\,[1.0, 0.0] + 0.30\,[0.0, 1.0] + 0.32\,[0.5, 0.5] = [0.54, 0.46]$
"Cat" now contains information from all tokens, weighted by relevance!
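The five steps can be reproduced numerically in a few lines of NumPy. The raw scores and value vectors are taken straight from the worked example above:

```python
import numpy as np

# Step 2: raw scores for "cat" attending to (cat, eats, fish)
scores = np.array([0.74, 0.38, 0.50])

# Step 3: scale by sqrt(d_k) with d_k = 2
scaled = scores / np.sqrt(2)

# Step 4: softmax -> attention weights
weights = np.exp(scaled) / np.exp(scaled).sum()

# Step 5: weighted sum of value vectors V_cat, V_eats, V_fish
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
out = weights @ V
print(weights)  # ≈ [0.38, 0.30, 0.32]
print(out)      # ≈ [0.54, 0.46]
```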
5. Attention Flow Diagram
*(Diagram not rendered: embeddings are projected to Q, K, V; Q·K scores are scaled and softmaxed; the resulting weights combine the values into the output.)*
6. Multi-Head Attention
One attention head = one perspective. Language has many relationship types simultaneously:
| Head | What it learns |
|---|---|
| Head 1 | Syntactic structure (subject-verb) |
| Head 2 | Semantic meaning (word sense) |
| Head 3 | Coreference (pronoun → noun) |
| Head 4 | Long-range dependencies |
Formula
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$
$\text{head}_i = \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$
Where $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ projects the concatenated heads back to the model dimension.
GPT-3 uses 96 attention heads per layer with $d_{\text{model}} = 12288$!
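A minimal NumPy sketch of the two formulas above, with toy dimensions chosen only for illustration (real models fold all heads into batched matrix multiplies):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) per head; concat all outputs, project with W_O."""
    outs = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ W_O

# Toy setup: n=5 tokens, d_model=8, h=2 heads, d_k = d_v = 4
rng = np.random.default_rng(0)
n, d_model, h, d_k = 5, 8, 2, 4
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))   # (h*d_v, d_model), as in the formula
print(multi_head(X, heads, W_O).shape)      # back to (n, d_model)
```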
7. The Transformer Block
Each Transformer block contains:
- Multi-head self-attention
- Add & LayerNorm (residual connection)
- Position-wise feed-forward network (FFN)
- Add & LayerNorm (residual connection)
Residual (Skip) Connections
$x_{l+1} = \mathrm{LayerNorm}(x_l + \mathrm{Sublayer}(x_l))$
The original input is added back after each sublayer. This creates gradient "highways" and prevents vanishing gradients in deep networks.
Feed-Forward Network
$\mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)W_2 + b_2$
Typical expansion: $d_{ff} = 4 \times d_{\text{model}}$
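A sketch of the FFN sublayer wired up with its residual connection, following the post-LayerNorm arrangement $x_{l+1} = \mathrm{LayerNorm}(x_l + \mathrm{Sublayer}(x_l))$ (some modern models apply LayerNorm before the sublayer instead; dimensions here are toy values):

```python
import numpy as np

def gelu(x):
    # Common tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn_sublayer(x, W1, b1, W2, b2):
    """LayerNorm(x + FFN(x)) with FFN(x) = GELU(xW1 + b1)W2 + b2."""
    return layer_norm(x + (gelu(x @ W1 + b1) @ W2 + b2))

rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 32, 5                  # d_ff = 4 * d_model expansion
x = rng.normal(size=(n, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.1; b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1; b2 = np.zeros(d_model)
out = ffn_sublayer(x, W1, b1, W2, b2)
print(out.shape)  # (5, 8)
```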
8. Encoder vs Decoder vs Both
| Architecture | Attention Type | Task | Examples |
|---|---|---|---|
| Encoder only | Bidirectional (sees all) | Understanding | BERT, RoBERTa |
| Decoder only | Causal (left-to-right) | Generation | GPT, Claude |
| Enc-Dec | Both | Seq-to-seq | T5, BART |
Causal (Masked) Attention
In decoder models, future tokens are masked so each position can only attend to past tokens:
$\text{mask}_{ij} = \begin{cases} 0 & i \geq j \\ -\infty & i < j \end{cases}$
$\mathrm{Attention} = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$
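The mask is just a lower-triangular pattern of zeros with $-\infty$ above the diagonal; adding it before the softmax zeroes out attention to future positions. A small NumPy sketch:

```python
import numpy as np

def causal_mask(n):
    """M[i, j] = 0 where j <= i (self and past), -inf where j > i (future)."""
    return np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)

M = causal_mask(3)
scores = np.zeros((3, 3)) + M                       # pretend all raw scores are equal
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights)
# Row 0 attends only to position 0; row 2 attends to positions 0, 1, 2 uniformly.
```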
9. GPT Text Generation
*(Diagram not rendered: the prompt runs through the Transformer, next-token probabilities come out, a token is chosen and appended, and the loop repeats.)*
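Autoregressive generation is a simple loop around the model. A minimal sketch of greedy decoding, where `logits_fn` stands in for a trained Transformer (the toy version here just predicts the next integer):

```python
import numpy as np

def generate(logits_fn, prompt_ids, n_new, eos_id=None):
    """Greedy decoding: repeatedly feed the whole sequence back in and
    append the highest-probability next token."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        next_id = int(np.argmax(logits_fn(ids)))  # pick the most likely token
        ids.append(next_id)
        if next_id == eos_id:                     # stop early at end-of-sequence
            break
    return ids

# Dummy "model" over a 10-token vocabulary: always predicts (last token + 1) % 10
toy_logits = lambda ids: np.eye(10)[(ids[-1] + 1) % 10]
print(generate(toy_logits, [3], n_new=4))  # [3, 4, 5, 6, 7]
```

Real systems usually sample from the softmax distribution (with temperature, top-k, or top-p) instead of always taking the argmax.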
10. Why Transformers Beat Everything
| Property | RNN | Transformer |
|---|---|---|
| Long-range dependencies | ❌ Forgets | ✅ Direct attention |
| Parallelizable | ❌ Sequential | ✅ All at once |
| Training speed | ❌ Slow | ✅ GPU-friendly |
| Scales with compute | ❌ Plateaus | ✅ Gets better |
| Context window | ❌ Limited | ✅ 100k+ tokens |
11. Timeline
*(Timeline diagram not rendered.)*
12. Quick Reference
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
$\mathrm{MultiHead} = \mathrm{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$
$x_{l+1} = \mathrm{LayerNorm}(x_l + \mathrm{Sublayer}(x_l))$