ai · 2025-12-27 · 5 min read

Attention Mechanism & Transformers

The architecture that changed everything. How attention replaced recurrence and scaled to GPT.

1. The Problem with Sequential Models

Before Transformers, sequences were processed with RNNs — one token at a time:

$$h_t = f(h_{t-1}, x_t)$$

Critical Failures

Long-range dependency problem:

"The cat that sat on the mat was hungry"

By the time the RNN reaches "hungry", memory of "cat" has faded through 6+ steps of hidden state compression.

Cannot parallelize:

  • Must process $x_1 \to x_2 \to x_3 \to \ldots$ sequentially
  • GPUs excel at parallel computation → RNNs waste GPU power
  • Training is painfully slow
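The sequential bottleneck is easy to see in code. A minimal numpy sketch of the recurrence above — the tanh cell and all dimensions here are illustrative choices, not any particular RNN library:

```python
import numpy as np

# Sketch of h_t = f(h_{t-1}, x_t) with a tanh cell.
rng = np.random.default_rng(0)
d_in, d_h, seq_len = 4, 8, 6
W_x = rng.normal(size=(d_in, d_h))     # input-to-hidden weights
W_h = rng.normal(size=(d_h, d_h))      # hidden-to-hidden weights
xs = rng.normal(size=(seq_len, d_in))  # the input sequence x_1 ... x_6

h = np.zeros(d_h)
for x_t in xs:                         # strictly sequential: h_t needs h_{t-1}
    h = np.tanh(x_t @ W_x + h @ W_h)
print(h.shape)                         # one hidden state carries the whole history
```

Each iteration depends on the previous one, so the loop cannot be parallelized across time steps — exactly the problem the table in section 10 summarizes.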

2. The Attention Intuition

Real-Life Analogy 👀

Reading: "The trophy didn't fit in the suitcase because it was too big"

To understand "it", your brain simultaneously looks at ALL words:

| Word | Attention Weight |
| --- | --- |
| trophy | 85% ← "it" = trophy! |
| suitcase | 10% |
| fit | 3% |
| because | 2% |

Your brain doesn't read left-to-right — it attends to all words at once.

YouTube Search Analogy 🔍

| Attention Component | Analogy | Role |
| --- | --- | --- |
| Query (Q) | What you type in search | What this word is looking for |
| Key (K) | Video titles | What each word advertises |
| Value (V) | Actual video content | What each word provides |
| Attention score | Search relevance | $Q \cdot K$ similarity |

3. Scaled Dot-Product Attention

The Formula

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

This is the single most important formula in modern AI.
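The whole formula fits in a few lines of numpy. A minimal sketch — the random matrices stand in for learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Three query tokens attending over three key/value tokens, d_k = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per query token
```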


4. Step-by-Step Example

Sentence: "Cat eats fish" (3 tokens, embedding dim = 4)

Token embeddings:

$$\mathbf{x}_\text{cat} = [1, 0, 1, 0], \quad \mathbf{x}_\text{eats} = [0, 1, 0, 1], \quad \mathbf{x}_\text{fish} = [1, 1, 0, 0]$$

Step 1: Compute Q, K, V

For each token, project with learned weight matrices $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V$:

$$\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^V$$

Simplified for "cat":

$$Q_\text{cat} = [0.8, 0.2], \quad K_\text{cat} = [0.9, 0.1], \quad V_\text{cat} = [1.0, 0.0]$$

Step 2: Attention Scores (Dot Products)

$$\text{score}(\text{cat} \to \text{cat}) = Q_\text{cat} \cdot K_\text{cat} = (0.8)(0.9) + (0.2)(0.1) = 0.74$$

$$\text{score}(\text{cat} \to \text{eats}) = Q_\text{cat} \cdot K_\text{eats} = 0.38$$

$$\text{score}(\text{cat} \to \text{fish}) = Q_\text{cat} \cdot K_\text{fish} = 0.50$$

Step 3: Scale

Divide by $\sqrt{d_k} = \sqrt{2} \approx 1.41$ to prevent gradient issues with large dot products:

$$\text{scaled scores} = \left[\frac{0.74}{1.41}, \frac{0.38}{1.41}, \frac{0.50}{1.41}\right] = [0.52, 0.27, 0.35]$$

Step 4: Softmax → Attention Weights

$$\alpha = \text{softmax}([0.52, 0.27, 0.35]) = [0.38, 0.30, 0.32]$$

Step 5: Weighted Sum of Values

$$\text{output}_\text{cat} = 0.38 \cdot V_\text{cat} + 0.30 \cdot V_\text{eats} + 0.32 \cdot V_\text{fish}$$

$$= 0.38 \cdot [1.0, 0.0] + 0.30 \cdot [0.0, 1.0] + 0.32 \cdot [0.5, 0.5] = [0.54, 0.46]$$

"Cat" now contains information from all tokens, weighted by relevance!


5. Attention Flow Diagram

*(Diagram not rendered.)*

6. Multi-Head Attention

One attention head = one perspective. Language has many relationship types simultaneously:

| Head | What it learns |
| --- | --- |
| Head 1 | Syntactic structure (subject-verb) |
| Head 2 | Semantic meaning (word sense) |
| Head 3 | Coreference (pronoun → noun) |
| Head 4 | Long-range dependencies |

Formula

$$\text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O$$

$$\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$

Where $\mathbf{W}^O \in \mathbb{R}^{hd_v \times d_{model}}$ projects the concatenated heads back to the model dimension.

GPT-3 uses 96 attention heads per layer with $d_{model} = 12288$!
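The two formulas above can be sketched directly: run the per-head projections, apply attention in each head, concatenate, and project back with $\mathbf{W}^O$. The random (untrained) weights and tiny dimensions here are illustrative only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, d_model, rng):
    """Multi-head self-attention with random stand-in weights."""
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Per-head projections W_i^Q, W_i^K, W_i^V: d_model -> d_k
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V)       # head_i = Attention(...)
    W_o = rng.normal(size=(n_heads * d_k, d_model))  # output projection W^O
    return np.concatenate(heads, axis=-1) @ W_o      # Concat(...) W^O

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                    # 5 tokens, d_model = 16
out = multi_head_attention(X, n_heads=4, d_model=16, rng=rng)
print(out.shape)  # (5, 16): back to the model dimension
```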


7. The Transformer Block

Each Transformer block contains:

  • Multi-head self-attention + residual connection + LayerNorm
  • Feed-forward network + residual connection + LayerNorm

Residual (Skip) Connections

$$\mathbf{x}^{l+1} = \text{LayerNorm}(\mathbf{x}^l + \text{Sublayer}(\mathbf{x}^l))$$

The original input is added back after each sublayer. This creates gradient "highways" and prevents vanishing gradients in deep networks.

Feed-Forward Network

$$\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$$

Typical expansion: $d_{ff} = 4 \times d_{model}$
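Putting the two sublayers together gives a full block. A sketch under simplifying assumptions: LayerNorm without learned scale/shift, the tanh approximation of GELU, and an identity function standing in for the attention sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean / unit variance
    # (learned scale and shift omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, attn, W1, b1, W2, b2):
    # Post-LN layout from the text: LayerNorm(x + Sublayer(x))
    x = layer_norm(x + attn(x))                       # attention + residual
    x = layer_norm(x + gelu(x @ W1 + b1) @ W2 + b2)   # FFN + residual
    return x

rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 4 * 8, 5        # d_ff = 4 * d_model, as in the text
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(n, d_model))
identity_attn = lambda t: t           # stand-in for multi-head attention
y = transformer_block(x, identity_attn, W1, b1, W2, b2)
print(y.shape)  # (5, 8): shape preserved, so blocks can be stacked
```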


8. Encoder vs Decoder vs Both

| Architecture | Attention Type | Task | Examples |
| --- | --- | --- | --- |
| Encoder only | Bidirectional (sees all) | Understanding | BERT, RoBERTa |
| Decoder only | Causal (left-to-right) | Generation | GPT, Claude |
| Encoder-decoder | Both | Seq-to-seq | T5, BART |

Causal (Masked) Attention

In decoder models, future tokens are masked so each position can only attend to past tokens:

$$\text{mask}_{ij} = \begin{cases} 0 & i \geq j \\ -\infty & i < j \end{cases}$$

$$\text{Attention} = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T + \mathbf{M}}{\sqrt{d_k}}\right)\mathbf{V}$$
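The mask is just a lower-triangular pattern added to the scores: $-\infty$ entries become zero after the softmax, so no weight ever lands on a future token. A minimal sketch with random stand-in projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    # M[i, j] = 0 where j <= i (self and past), -inf where j > i (future)
    M = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)
    weights = softmax((Q @ K.T + M) / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = causal_attention(Q, K, V)
print(np.round(w, 2))  # upper triangle is all zeros: no peeking at the future
```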


9. GPT Text Generation

*(Diagram not rendered.)*
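The generation procedure itself is a simple loop: predict next-token logits, append the chosen token, and feed the grown sequence back in. A greedy-decoding sketch — the toy "model" and 10-token vocabulary are illustrative stand-ins for a trained Transformer stack:

```python
import numpy as np

def generate(logits_fn, prompt_ids, n_new):
    """Greedy autoregressive decoding: feed the growing sequence back each step."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = logits_fn(ids)             # logits over the whole vocabulary
        ids.append(int(np.argmax(logits)))  # greedy: take the most likely token
    return ids

# Toy stand-in "model": always assigns the highest logit to the token
# after the last one (mod 10). A real GPT runs its Transformer layers here.
toy_model = lambda ids: np.eye(10)[(ids[-1] + 1) % 10]
print(generate(toy_model, [0], n_new=5))  # [0, 1, 2, 3, 4, 5]
```

Real systems usually sample from the softmax of the logits (with temperature, top-k, or top-p) instead of always taking the argmax.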

10. Why Transformers Beat Everything

| Property | RNN | Transformer |
| --- | --- | --- |
| Long-range dependencies | ❌ Forgets | ✅ Direct attention |
| Parallelizable | ❌ Sequential | ✅ All at once |
| Training speed | ❌ Slow | ✅ GPU-friendly |
| Scales with compute | ❌ Plateaus | ✅ Gets better |
| Context window | ❌ Limited | ✅ 100k+ tokens |

11. Timeline

*(Diagram not rendered.)*

12. Quick Reference

$$\boxed{\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{QK}^T}{\sqrt{d_k}}\right)\mathbf{V}}$$

$$\boxed{\text{MultiHead} = \text{Concat}(\text{head}_1,\ldots,\text{head}_h)\mathbf{W}^O}$$

$$\boxed{\mathbf{x}^{l+1} = \text{LayerNorm}(\mathbf{x}^l + \text{Sublayer}(\mathbf{x}^l))}$$

Filed under: ai
