1. The Problem
Neural networks only process numbers. But real-world data includes words, categories, and IDs:
"cat", "king", "Paris", UserID=4521, ProductID=89
How do we represent these as meaningful numbers?
2. Naive Approach — One-Hot Encoding
Assign each item a unique position in a sparse binary vector:
| Word | Vector |
|---|---|
| cat | [1,0,0,0,0,0] |
| dog | [0,1,0,0,0,0] |
| king | [0,0,1,0,0,0] |
| queen | [0,0,0,1,0,0] |
Problems
1. Dimensionality explosion:
   - English vocabulary ≈ 170,000 words
   - Each word = a vector of 170,000 numbers, mostly zeros
2. No notion of similarity:
   - d("cat", "dog") = d("cat", "Tokyo") (identical distances!)
One-hot vectors are orthogonal — they capture no semantic relationships.
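A quick sketch of the distance problem, using a made-up four-word vocabulary: every pair of distinct one-hot vectors is exactly √2 apart, so "cat" is no closer to "dog" than to "Tokyo".

```python
import math

vocab = ["cat", "dog", "king", "Tokyo"]
# One-hot: a 1 at the word's index, 0 everywhere else.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_cat_dog = euclidean(one_hot["cat"], one_hot["dog"])
d_cat_tokyo = euclidean(one_hot["cat"], one_hot["Tokyo"])
# Both distances are sqrt(2): one-hot encoding treats every pair
# of different words as equally unrelated.
```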
3. Dense Embeddings — The Solution
Map each item to a small, dense vector of real numbers:
cat → e_cat = [0.2, −0.4, 0.7, 0.1, −0.3]
dog → e_dog = [0.3, −0.5, 0.6, 0.2, −0.2]
king → e_king = [0.8, 0.4, −0.1, 0.9, 0.5]
Similar items have similar vectors. These values are learned from data via backpropagation.
Real-Life Analogy 🗺️
Think of embeddings as coordinates on a semantic map:
ROYALTY
↑
queen · · king
│
FEMALE ─────────┼──────── MALE
│
woman · · man
↓
COMMON
Coordinates (embedding values) encode meaning geometrically.
4. Embedding Lookup
An embedding layer is just a lookup table (matrix E):
e_i = E[i] ∈ ℝ^d
Where:
- i — token index
- E ∈ ℝ^(V×d) — embedding matrix
- V — vocabulary size
- d — embedding dimension
Example: V = 10,000 words, d = 300 dimensions
E ∈ ℝ^(10000×300) (3 million learned parameters)
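In numpy, the "layer" really is just a row lookup; a minimal sketch with an arbitrary random initialization:

```python
import numpy as np

V, d = 10_000, 300                    # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(0, 0.02, size=(V, d))  # the lookup table: 3 million parameters

def embed(token_id: int) -> np.ndarray:
    # e_i = E[i]: no computation, just indexing row i of the matrix.
    return E[token_id]

e_42 = embed(42)  # a dense 300-dimensional vector for token 42
```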
5. The Word2Vec Magic
Trained on billions of text sentences, embeddings learn analogical relationships:
e_king − e_man + e_woman ≈ e_queen
Vector Arithmetic Example
e_king = [0.8, 0.4, −0.1, 0.9]
e_man = [0.7, 0.3, −0.2, 0.1]
e_woman = [0.6, 0.5, −0.3, 0.8]
e_king − e_man + e_woman = [0.7, 0.6, −0.2, 1.6] ≈ e_queen ✅
The network learned gender and royalty as geometric directions — with no human annotation!
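The arithmetic above can be checked coordinate-by-coordinate with the same toy vectors:

```python
e_king = [0.8, 0.4, -0.1, 0.9]
e_man = [0.7, 0.3, -0.2, 0.1]
e_woman = [0.6, 0.5, -0.3, 0.8]

# king - man + woman, computed per coordinate
result = [k - m + w for k, m, w in zip(e_king, e_man, e_woman)]
# result ≈ [0.7, 0.6, -0.2, 1.6], up to floating-point rounding
```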
6. Measuring Similarity — Cosine Similarity
cos(a, b) = (a · b) / (∥a∥ ⋅ ∥b∥)
Example
e_cat = [1, 0, 1], e_dog = [1, 1, 0], e_Tokyo = [0, 1, 0]
cat vs dog:
cos(e_cat, e_dog) = ((1)(1) + (0)(1) + (1)(0)) / (√2 ⋅ √2) = 1/2 = 0.5 (similar ✅)
cat vs Tokyo:
cos(e_cat, e_Tokyo) = ((1)(0) + (0)(1) + (1)(0)) / (√2 ⋅ 1) = 0 (unrelated)
| Score | Meaning |
|---|---|
| 1.0 | Identical direction (same meaning) |
| 0.0 | Perpendicular (unrelated) |
| −1.0 | Opposite directions |
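A small helper implementing the formula, with toy 3-dimensional vectors chosen so that cat/dog come out similar and cat/Tokyo come out unrelated:

```python
import math

def cosine(a, b):
    # cos(a, b) = (a · b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

e_cat, e_dog, e_tokyo = [1, 0, 1], [1, 1, 0], [0, 1, 0]
sim_cat_dog = cosine(e_cat, e_dog)      # 0.5: shared direction, related
sim_cat_tokyo = cosine(e_cat, e_tokyo)  # 0.0: orthogonal, unrelated
```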
7. How Embeddings Are Learned
Method 1: Word2Vec — Skip-Gram
Task: Given center word, predict surrounding context words.
P(context | center) = exp(e_context · e_center) / Σ_w exp(e_w · e_center)
Training sentence: "The cat sat on the mat"
| Center | Context (window=2) |
|---|---|
| "sat" | "The", "cat", "on", "the" |
Embeddings adjust until the model predicts context words well. By learning context, vectors capture meaning.
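Generating the (center, context) training pairs from the table above is a few lines of Python:

```python
sentence = "The cat sat on the mat".split()
window = 2

# Enumerate (center, context) pairs: each word is paired with every
# word at most `window` positions away.
pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# For center "sat", the contexts are "The", "cat", "on", "the".
```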
Method 2: End-to-End Learning
Embeddings are the first layer of the network: random initially, updated by backprop like any other weight. Gradients from the task loss flow back into the embedding matrix, so the vectors gradually arrange themselves to make the task easier.
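A minimal numpy sketch of this, assuming a plain SGD step and a made-up upstream gradient: note that only the row that was actually looked up receives an update.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4
E = rng.normal(0, 0.1, size=(V, d))  # randomly initialized embedding layer

token = 3
e = E[token]             # forward pass: plain row lookup
grad_e = np.ones(d)      # pretend gradient arriving from the layers above
lr = 0.1

E_before = E.copy()
E[token] -= lr * grad_e  # backprop: only the looked-up row is updated
```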
8. Positional Embeddings
Word order matters: "Dog bites man" ≠ "Man bites dog"
But token embeddings alone have no position information. We add positional encodings:
efinal=eword+eposition
Sinusoidal positional encoding (original Transformer):
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Where pos is the position and i is the dimension index.
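A small pure-Python sketch of the sinusoidal formula, interleaving the sin/cos pairs exactly as the indices 2i and 2i+1 suggest:

```python
import math

def positional_encoding(pos: int, d: int) -> list[float]:
    # PE(pos, 2i) = sin(pos / 10000^(2i/d))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    pe = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe.append(math.sin(angle))  # even index 2i
        pe.append(math.cos(angle))  # odd index 2i+1
    return pe

pe0 = positional_encoding(0, 8)  # position 0: alternating sin(0)=0, cos(0)=1
```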
9. Embeddings Everywhere
The same lookup-table trick works far beyond words: user IDs, product IDs, and other categorical items (as in recommender systems like Netflix's) can all be mapped to learned dense vectors in exactly the same way.
10. Embedding Dimensions
| System | Dimension d | Notes |
|---|---|---|
| Word2Vec | 50–300 | Classic NLP |
| BERT | 768 | Contextual |
| GPT-3 | 12,288 | Large LLM |
| Claude 3 | ~8,192+ | Modern LLM |
| Netflix recs | 32–256 | Collaborative filtering |
Too few dimensions → can't capture meaning. Too many → slow, overfits.
11. Full Tokenization → Embedding Pipeline
text → tokenizer → token IDs → embedding lookup e_i = E[i] → add positional encoding → model layers
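The whole pipeline fits in a short sketch. The vocabulary and whitespace tokenizer below are made up for illustration, and a learned positional table is used in place of the sinusoidal formula for brevity:

```python
import numpy as np

# Hypothetical toy setup: three-word vocabulary, tiny dimensions.
vocab = {"the": 0, "cat": 1, "sat": 2}
V, d, max_len = len(vocab), 4, 16
rng = np.random.default_rng(0)
E = rng.normal(0, 0.1, size=(V, d))        # token embedding table
P = rng.normal(0, 0.1, size=(max_len, d))  # learned positional embeddings

def encode(text: str) -> np.ndarray:
    ids = [vocab[w] for w in text.lower().split()]  # tokenize -> token IDs
    return E[ids] + P[:len(ids)]                    # e_final = e_word + e_position

X = encode("the cat sat")  # shape (3, d): one row per token, ready for the model
```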
12. Quick Reference
e_i = E[i], E ∈ ℝ^(V×d)
cos(a, b) = (a · b) / (∥a∥ ⋅ ∥b∥)
e_king − e_man + e_woman ≈ e_queen