ai · 2025-12-27 · 5 min read

Attention Mechanism & Transformers

The architecture that changed everything. How attention replaced recurrence and scaled to GPT.

1. The Problem with Sequential Models

Before Transformers, sequences were processed with RNNs — one token at a time:

$$h_t = f(h_{t-1}, x_t)$$

Critical Failures

Long-range dependency problem:

"The cat that sat on the mat was hungry"

By the time the RNN reaches "hungry", memory of "cat" has faded through 6+ steps of hidden state compression.

Cannot parallelize:

  • Must process $x_1 \to x_2 \to x_3 \to \ldots$ sequentially
  • GPUs excel at parallel computation → RNNs waste GPU power
  • Training is painfully slow
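The sequential bottleneck is easy to see in code. A minimal numpy sketch of the recurrence above — the tanh cell and all dimensions here are illustrative choices, not any particular RNN library:

```python
import numpy as np

# Sketch of h_t = f(h_{t-1}, x_t) with a tanh cell.
rng = np.random.default_rng(0)
d_in, d_h, seq_len = 4, 8, 6
W_x = rng.normal(size=(d_in, d_h))     # input-to-hidden weights
W_h = rng.normal(size=(d_h, d_h))      # hidden-to-hidden weights
xs = rng.normal(size=(seq_len, d_in))  # the input sequence x_1 ... x_6

h = np.zeros(d_h)
for x_t in xs:                         # strictly sequential: h_t needs h_{t-1}
    h = np.tanh(x_t @ W_x + h @ W_h)
print(h.shape)                         # one hidden state carries the whole history
```

Each iteration depends on the previous one, so the loop cannot be parallelized across time steps — exactly the problem the table in section 10 summarizes.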

2. The Attention Intuition

Real-Life Analogy 👀

Reading: "The trophy didn't fit in the suitcase because it was too big"

To understand "it", your brain simultaneously looks at ALL words:

| Word | Attention Weight |
| --- | --- |
| trophy | 85% ← "it" = trophy! |
| suitcase | 10% |
| fit | 3% |
| because | 2% |

Your brain doesn't read left-to-right — it attends to all words at once.

YouTube Search Analogy 🔍

| Attention Component | Analogy | Role |
| --- | --- | --- |
| Query (Q) | What you type in search | What this word is looking for |
| Key (K) | Video titles | What each word advertises |
| Value (V) | Actual video content | What each word provides |
| Attention score | Search relevance | $Q \cdot K$ similarity |

3. Scaled Dot-Product Attention

The Formula

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

This is the single most important formula in modern AI.
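The whole formula fits in a few lines of numpy. A minimal sketch — the random matrices stand in for learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Three query tokens attending over three key/value tokens, d_k = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per query token
```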


4. Step-by-Step Example

Sentence: "Cat eats fish" (3 tokens, embedding dim = 4)

Token embeddings:

$$\mathbf{x}_\text{cat} = [1, 0, 1, 0], \quad \mathbf{x}_\text{eats} = [0, 1, 0, 1], \quad \mathbf{x}_\text{fish} = [1, 1, 0, 0]$$

Step 1: Compute Q, K, V

For each token, project with learned weight matrices $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V$:

$$\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^V$$

Simplified for "cat":

$$Q_\text{cat} = [0.8, 0.2], \quad K_\text{cat} = [0.9, 0.1], \quad V_\text{cat} = [1.0, 0.0]$$

Step 2: Attention Scores (Dot Products)

$$\text{score}(\text{cat} \to \text{cat}) = Q_\text{cat} \cdot K_\text{cat} = (0.8)(0.9) + (0.2)(0.1) = 0.74$$

$$\text{score}(\text{cat} \to \text{eats}) = Q_\text{cat} \cdot K_\text{eats} = 0.38$$

$$\text{score}(\text{cat} \to \text{fish}) = Q_\text{cat} \cdot K_\text{fish} = 0.50$$

Step 3: Scale

Divide by $\sqrt{d_k} = \sqrt{2} \approx 1.41$ to prevent gradient issues with large dot products:

$$\text{scaled scores} = \left[\frac{0.74}{1.41}, \frac{0.38}{1.41}, \frac{0.50}{1.41}\right] = [0.52, 0.27, 0.35]$$

Step 4: Softmax → Attention Weights

$$\alpha = \text{softmax}([0.52, 0.27, 0.35]) = [0.38, 0.30, 0.32]$$

Step 5: Weighted Sum of Values

$$\text{output}_\text{cat} = 0.38 \cdot V_\text{cat} + 0.30 \cdot V_\text{eats} + 0.32 \cdot V_\text{fish}$$

$$= 0.38 \cdot [1.0, 0.0] + 0.30 \cdot [0.0, 1.0] + 0.32 \cdot [0.5, 0.5] = [0.54, 0.46]$$

"Cat" now contains information from all tokens, weighted by relevance!


5. Attention Flow Diagram

*(Diagram not rendered.)*

6. Multi-Head Attention

One attention head = one perspective. Language has many relationship types simultaneously:

| Head | What it learns |
| --- | --- |
| Head 1 | Syntactic structure (subject-verb) |
| Head 2 | Semantic meaning (word sense) |
| Head 3 | Coreference (pronoun → noun) |
| Head 4 | Long-range dependencies |

Formula

$$\text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O$$

$$\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$

Where $\mathbf{W}^O \in \mathbb{R}^{hd_v \times d_{model}}$ projects the concatenated heads back to the model dimension.

GPT-3 uses 96 attention heads per layer with $d_{model} = 12288$!
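The two formulas above can be sketched directly: run the per-head projections, apply attention in each head, concatenate, and project back with $\mathbf{W}^O$. The random (untrained) weights and tiny dimensions here are illustrative only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, d_model, rng):
    """Multi-head self-attention with random stand-in weights."""
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Per-head projections W_i^Q, W_i^K, W_i^V: d_model -> d_k
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V)       # head_i = Attention(...)
    W_o = rng.normal(size=(n_heads * d_k, d_model))  # output projection W^O
    return np.concatenate(heads, axis=-1) @ W_o      # Concat(...) W^O

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                    # 5 tokens, d_model = 16
out = multi_head_attention(X, n_heads=4, d_model=16, rng=rng)
print(out.shape)  # (5, 16): back to the model dimension
```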


7. The Transformer Block

Each Transformer block contains:

  • Multi-head self-attention + residual connection + LayerNorm
  • Feed-forward network + residual connection + LayerNorm

Residual (Skip) Connections

$$\mathbf{x}^{l+1} = \text{LayerNorm}(\mathbf{x}^l + \text{Sublayer}(\mathbf{x}^l))$$

The original input is added back after each sublayer. This creates gradient "highways" and prevents vanishing gradients in deep networks.

Feed-Forward Network

$$\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$$

Typical expansion: $d_{ff} = 4 \times d_{model}$
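Putting the two sublayers together gives a full block. A sketch under simplifying assumptions: LayerNorm without learned scale/shift, the tanh approximation of GELU, and an identity function standing in for the attention sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean / unit variance
    # (learned scale and shift omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, attn, W1, b1, W2, b2):
    # Post-LN layout from the text: LayerNorm(x + Sublayer(x))
    x = layer_norm(x + attn(x))                       # attention + residual
    x = layer_norm(x + gelu(x @ W1 + b1) @ W2 + b2)   # FFN + residual
    return x

rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 4 * 8, 5        # d_ff = 4 * d_model, as in the text
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(n, d_model))
identity_attn = lambda t: t           # stand-in for multi-head attention
y = transformer_block(x, identity_attn, W1, b1, W2, b2)
print(y.shape)  # (5, 8): shape preserved, so blocks can be stacked
```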


8. Encoder vs Decoder vs Both

| Architecture | Attention Type | Task | Examples |
| --- | --- | --- | --- |
| Encoder only | Bidirectional (sees all) | Understanding | BERT, RoBERTa |
| Decoder only | Causal (left-to-right) | Generation | GPT, Claude |
| Encoder-decoder | Both | Seq-to-seq | T5, BART |

Causal (Masked) Attention

In decoder models, future tokens are masked so each position can only attend to past tokens:

$$\text{mask}_{ij} = \begin{cases} 0 & i \geq j \\ -\infty & i < j \end{cases}$$

$$\text{Attention} = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T + \mathbf{M}}{\sqrt{d_k}}\right)\mathbf{V}$$
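The mask is just a lower-triangular pattern added to the scores: $-\infty$ entries become zero after the softmax, so no weight ever lands on a future token. A minimal sketch with random stand-in projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    # M[i, j] = 0 where j <= i (self and past), -inf where j > i (future)
    M = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)
    weights = softmax((Q @ K.T + M) / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = causal_attention(Q, K, V)
print(np.round(w, 2))  # upper triangle is all zeros: no peeking at the future
```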


9. GPT Text Generation

*(Diagram not rendered.)*
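The generation procedure itself is a simple loop: predict next-token logits, append the chosen token, and feed the grown sequence back in. A greedy-decoding sketch — the toy "model" and 10-token vocabulary are illustrative stand-ins for a trained Transformer stack:

```python
import numpy as np

def generate(logits_fn, prompt_ids, n_new):
    """Greedy autoregressive decoding: feed the growing sequence back each step."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = logits_fn(ids)             # logits over the whole vocabulary
        ids.append(int(np.argmax(logits)))  # greedy: take the most likely token
    return ids

# Toy stand-in "model": always assigns the highest logit to the token
# after the last one (mod 10). A real GPT runs its Transformer layers here.
toy_model = lambda ids: np.eye(10)[(ids[-1] + 1) % 10]
print(generate(toy_model, [0], n_new=5))  # [0, 1, 2, 3, 4, 5]
```

Real systems usually sample from the softmax of the logits (with temperature, top-k, or top-p) instead of always taking the argmax.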

10. Why Transformers Beat Everything

| Property | RNN | Transformer |
| --- | --- | --- |
| Long-range dependencies | ❌ Forgets | ✅ Direct attention |
| Parallelizable | ❌ Sequential | ✅ All at once |
| Training speed | ❌ Slow | ✅ GPU-friendly |
| Scales with compute | ❌ Plateaus | ✅ Gets better |
| Context window | ❌ Limited | ✅ 100k+ tokens |

11. Timeline

*(Diagram not rendered.)*

12. Quick Reference

$$\boxed{\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{QK}^T}{\sqrt{d_k}}\right)\mathbf{V}}$$

$$\boxed{\text{MultiHead} = \text{Concat}(\text{head}_1,\ldots,\text{head}_h)\mathbf{W}^O}$$

$$\boxed{\mathbf{x}^{l+1} = \text{LayerNorm}(\mathbf{x}^l + \text{Sublayer}(\mathbf{x}^l))}$$

Filed under: ai
