1. The Problem
Neural networks only process numbers. But real-world data includes words, categories, and IDs:
"cat", "king", "Paris", UserID=4521, ProductID=89
How do we represent these as meaningful numbers?
2. Naive Approach — One-Hot Encoding
Assign each item a unique position in a sparse binary vector:
| Word | Vector |
|---|---|
| cat | [1,0,0,0,0,0] |
| dog | [0,1,0,0,0,0] |
| king | [0,0,1,0,0,0] |
| queen | [0,0,0,1,0,0] |
Problems
1. Dimensionality explosion:
   - English vocabulary ≈ 170,000 words
   - Each word = a vector of 170,000 numbers, mostly zeros
2. No notion of similarity:
   - d("cat", "dog") = d("cat", "Tokyo") (identical distances!)
One-hot vectors are orthogonal — they capture no semantic relationships.
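A quick sketch of the distance problem, using a made-up four-word vocabulary: every pair of distinct one-hot vectors is exactly √2 apart, so "cat" is no closer to "dog" than to "Tokyo".

```python
import math

vocab = ["cat", "dog", "king", "Tokyo"]
# One-hot: a 1 at the word's index, 0 everywhere else.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_cat_dog = euclidean(one_hot["cat"], one_hot["dog"])
d_cat_tokyo = euclidean(one_hot["cat"], one_hot["Tokyo"])
# Both distances are sqrt(2): one-hot encoding treats every pair
# of different words as equally unrelated.
```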
3. Dense Embeddings — The Solution
Map each item to a small, dense vector of real numbers:
cat → e_cat = [0.2, −0.4, 0.7, 0.1, −0.3]
dog → e_dog = [0.3, −0.5, 0.6, 0.2, −0.2]
king → e_king = [0.8, 0.4, −0.1, 0.9, 0.5]
Similar items have similar vectors. These values are learned from data via backpropagation.
Real-Life Analogy 🗺️
Think of embeddings as coordinates on a semantic map:
ROYALTY
↑
queen · · king
│
FEMALE ─────────┼──────── MALE
│
woman · · man
↓
COMMON
Coordinates (embedding values) encode meaning geometrically.
4. Embedding Lookup
An embedding layer is just a lookup table (matrix E):
e_i = E[i] ∈ ℝ^d
Where:
- i — token index
- E ∈ ℝ^(V×d) — embedding matrix
- V — vocabulary size
- d — embedding dimension
Example: V = 10,000 words, d = 300 dimensions
E ∈ ℝ^(10000×300) (3 million learned parameters)
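In numpy, the "layer" really is just a row lookup; a minimal sketch with an arbitrary random initialization:

```python
import numpy as np

V, d = 10_000, 300                    # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(0, 0.02, size=(V, d))  # the lookup table: 3 million parameters

def embed(token_id: int) -> np.ndarray:
    # e_i = E[i]: no computation, just indexing row i of the matrix.
    return E[token_id]

e_42 = embed(42)  # a dense 300-dimensional vector for token 42
```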
5. The Word2Vec Magic
Trained on billions of text sentences, embeddings learn analogical relationships:
e_king − e_man + e_woman ≈ e_queen
Vector Arithmetic Example
e_king = [0.8, 0.4, −0.1, 0.9]
e_man = [0.7, 0.3, −0.2, 0.1]
e_woman = [0.6, 0.5, −0.3, 0.8]
e_king − e_man + e_woman = [0.7, 0.6, −0.2, 1.6] ≈ e_queen ✅
The network learned gender and royalty as geometric directions — with no human annotation!
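The arithmetic above can be checked coordinate-by-coordinate with the same toy vectors:

```python
e_king = [0.8, 0.4, -0.1, 0.9]
e_man = [0.7, 0.3, -0.2, 0.1]
e_woman = [0.6, 0.5, -0.3, 0.8]

# king - man + woman, computed per coordinate
result = [k - m + w for k, m, w in zip(e_king, e_man, e_woman)]
# result ≈ [0.7, 0.6, -0.2, 1.6], up to floating-point rounding
```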
6. Measuring Similarity — Cosine Similarity
cos(a, b) = (a · b) / (∥a∥ ⋅ ∥b∥)
Example
e_cat = [1, 0, 1], e_dog = [1, 1, 0], e_Tokyo = [0, 1, 0]
cat vs dog:
cos(e_cat, e_dog) = ((1)(1) + (0)(1) + (1)(0)) / (√2 ⋅ √2) = 1/2 = 0.5 (similar ✅)
cat vs Tokyo:
cos(e_cat, e_Tokyo) = ((1)(0) + (0)(1) + (1)(0)) / (√2 ⋅ 1) = 0 (unrelated)
| Score | Meaning |
|---|---|
| 1.0 | Identical direction (same meaning) |
| 0.0 | Perpendicular (unrelated) |
| −1.0 | Opposite directions |
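A small helper implementing the formula, with toy 3-dimensional vectors chosen so that cat/dog come out similar and cat/Tokyo come out unrelated:

```python
import math

def cosine(a, b):
    # cos(a, b) = (a · b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

e_cat, e_dog, e_tokyo = [1, 0, 1], [1, 1, 0], [0, 1, 0]
sim_cat_dog = cosine(e_cat, e_dog)      # 0.5: shared direction, related
sim_cat_tokyo = cosine(e_cat, e_tokyo)  # 0.0: orthogonal, unrelated
```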
7. How Embeddings Are Learned
Method 1: Word2Vec — Skip-Gram
Task: Given center word, predict surrounding context words.
P(context | center) = exp(e_context · e_center) / Σ_w exp(e_w · e_center)
Training sentence: "The cat sat on the mat"
| Center | Context (window=2) |
|---|---|
| "sat" | "The", "cat", "on", "the" |
Embeddings adjust until the model predicts context words well. By learning context, vectors capture meaning.
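Generating the (center, context) training pairs from the table above is a few lines of Python:

```python
sentence = "The cat sat on the mat".split()
window = 2

# Enumerate (center, context) pairs: each word is paired with every
# word at most `window` positions away.
pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# For center "sat", the contexts are "The", "cat", "on", "the".
```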
Method 2: End-to-End Learning
Embeddings are the first layer of the network: random initially, updated by backprop like any other weight. Gradients from the task loss flow back into the embedding matrix, so the vectors gradually arrange themselves to make the task easier.
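A minimal numpy sketch of this, assuming a plain SGD step and a made-up upstream gradient: note that only the row that was actually looked up receives an update.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4
E = rng.normal(0, 0.1, size=(V, d))  # randomly initialized embedding layer

token = 3
e = E[token]             # forward pass: plain row lookup
grad_e = np.ones(d)      # pretend gradient arriving from the layers above
lr = 0.1

E_before = E.copy()
E[token] -= lr * grad_e  # backprop: only the looked-up row is updated
```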
8. Positional Embeddings
Word order matters: "Dog bites man" ≠ "Man bites dog"
But token embeddings alone have no position information. We add positional encodings:
efinal=eword+eposition
Sinusoidal positional encoding (original Transformer):
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Where pos is the position and i is the dimension index.
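A small pure-Python sketch of the sinusoidal formula, interleaving the sin/cos pairs exactly as the indices 2i and 2i+1 suggest:

```python
import math

def positional_encoding(pos: int, d: int) -> list[float]:
    # PE(pos, 2i) = sin(pos / 10000^(2i/d))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    pe = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe.append(math.sin(angle))  # even index 2i
        pe.append(math.cos(angle))  # odd index 2i+1
    return pe

pe0 = positional_encoding(0, 8)  # position 0: alternating sin(0)=0, cos(0)=1
```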
9. Embeddings Everywhere
The same lookup-table trick works far beyond words: user IDs, product IDs, and other categorical items (as in recommender systems like Netflix's) can all be mapped to learned dense vectors in exactly the same way.
10. Embedding Dimensions
| System | Dimension d | Notes |
|---|---|---|
| Word2Vec | 50–300 | Classic NLP |
| BERT | 768 | Contextual |
| GPT-3 | 12,288 | Large LLM |
| Claude 3 | ~8,192+ | Modern LLM |
| Netflix recs | 32–256 | Collaborative filtering |
Too few dimensions → can't capture meaning. Too many → slow, overfits.
11. Full Tokenization → Embedding Pipeline
text → tokenizer → token IDs → embedding lookup e_i = E[i] → add positional encoding → model layers
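The whole pipeline fits in a short sketch. The vocabulary and whitespace tokenizer below are made up for illustration, and a learned positional table is used in place of the sinusoidal formula for brevity:

```python
import numpy as np

# Hypothetical toy setup: three-word vocabulary, tiny dimensions.
vocab = {"the": 0, "cat": 1, "sat": 2}
V, d, max_len = len(vocab), 4, 16
rng = np.random.default_rng(0)
E = rng.normal(0, 0.1, size=(V, d))        # token embedding table
P = rng.normal(0, 0.1, size=(max_len, d))  # learned positional embeddings

def encode(text: str) -> np.ndarray:
    ids = [vocab[w] for w in text.lower().split()]  # tokenize -> token IDs
    return E[ids] + P[:len(ids)]                    # e_final = e_word + e_position

X = encode("the cat sat")  # shape (3, d): one row per token, ready for the model
```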
12. Quick Reference
e_i = E[i], E ∈ ℝ^(V×d)
cos(a, b) = (a · b) / (∥a∥ ⋅ ∥b∥)
e_king − e_man + e_woman ≈ e_queen