ai · 2025-12-20 · 4 min read

Embeddings

Teaching machines to understand meaning, not just symbols. How vectors capture semantic relationships.

1. The Problem

Neural networks only process numbers. But real-world data includes words, categories, and IDs:

"cat", "king", "Paris", UserID=4521, ProductID=89

How do we represent these as meaningful numbers?


2. Naive Approach — One-Hot Encoding

Assign each item a unique position in a sparse binary vector:

Word    Vector
cat     [1, 0, 0, 0, 0, 0]
dog     [0, 1, 0, 0, 0, 0]
king    [0, 0, 1, 0, 0, 0]
queen   [0, 0, 0, 1, 0, 0]

Problems

1. Dimensionality explosion:

\text{English vocabulary} \approx 170{,}000 \text{ words}

\text{Each word} = \text{vector of } 170{,}000 \text{ numbers, mostly zeros}

2. No notion of similarity:

d(\text{"cat"}, \text{"dog"}) = d(\text{"cat"}, \text{"Tokyo"}) \quad \text{(identical distances!)}

One-hot vectors are orthogonal — they capture no semantic relationships.
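This orthogonality is easy to verify in a few lines (a toy four-word vocabulary, not a real tokenizer):

```python
import numpy as np

# Toy vocabulary: each word gets one row of the identity matrix as its one-hot vector.
vocab = ["cat", "dog", "king", "queen"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def dot(a, b):
    return float(np.dot(one_hot[a], one_hot[b]))

# Every distinct pair is orthogonal: dot product 0, so no similarity signal at all.
print(dot("cat", "dog"))    # 0.0
print(dot("cat", "queen"))  # 0.0
print(dot("cat", "cat"))    # 1.0
```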


3. Dense Embeddings — The Solution

Map each item to a small, dense vector of real numbers:

\text{cat} \to \mathbf{e}_\text{cat} = [0.2, -0.4, 0.7, 0.1, -0.3]

\text{dog} \to \mathbf{e}_\text{dog} = [0.3, -0.5, 0.6, 0.2, -0.2]

\text{king} \to \mathbf{e}_\text{king} = [0.8, 0.4, -0.1, 0.9, 0.5]

Similar items have similar vectors. These values are learned from data via backpropagation.
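Using the example vectors above, even plain Euclidean distance already reflects meaning:

```python
import numpy as np

# The example embedding vectors from the text.
e = {
    "cat":  np.array([0.2, -0.4, 0.7, 0.1, -0.3]),
    "dog":  np.array([0.3, -0.5, 0.6, 0.2, -0.2]),
    "king": np.array([0.8, 0.4, -0.1, 0.9, 0.5]),
}

def dist(a, b):
    return float(np.linalg.norm(e[a] - e[b]))

# "cat" sits much closer to "dog" than to "king".
print(dist("cat", "dog"))   # ≈ 0.22
print(dist("cat", "king"))  # ≈ 1.71
```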

Real-Life Analogy 🗺️

Think of embeddings as coordinates on a semantic map:

            ROYALTY
                ↑
        queen · · king
                │
FEMALE ─────────┼──────── MALE
                │
        woman · · man
                ↓
            COMMON

Coordinates (embedding values) encode meaning geometrically.


4. Embedding Lookup

An embedding layer is just a lookup table (matrix \mathbf{E}):

\mathbf{e}_i = \mathbf{E}[i] \in \mathbb{R}^d

Where:

  • i — token index
  • \mathbf{E} \in \mathbb{R}^{V \times d} — embedding matrix
  • V — vocabulary size
  • d — embedding dimension

Example: V = 10{,}000 words, d = 300 dimensions

\mathbf{E} \in \mathbb{R}^{10000 \times 300} \quad \text{(3 million learned parameters)}
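As a sketch, with random values standing in for learned ones, the lookup is plain row indexing:

```python
import numpy as np

V, d = 10_000, 300            # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))   # embedding matrix (learned in practice, random here)

token_id = 4521               # a token index i
e_i = E[token_id]             # lookup is just row indexing: e_i = E[i]

print(e_i.shape)              # (300,)
print(E.size)                 # 3000000 learned parameters
```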


5. The Word2Vec Magic

Trained on billions of words of text, embeddings learn analogical relationships:

\mathbf{e}_\text{king} - \mathbf{e}_\text{man} + \mathbf{e}_\text{woman} \approx \mathbf{e}_\text{queen}

Vector Arithmetic Example

\mathbf{e}_\text{king} = [0.8, \; 0.4, \; -0.1, \; 0.9]

\mathbf{e}_\text{man} = [0.7, \; 0.3, \; -0.2, \; 0.1]

\mathbf{e}_\text{woman} = [0.6, \; 0.5, \; -0.3, \; 0.8]

\mathbf{e}_\text{king} - \mathbf{e}_\text{man} + \mathbf{e}_\text{woman} = [0.7, \; 0.6, \; -0.2, \; 1.6] \approx \mathbf{e}_\text{queen} \; ✅

The network learned gender and royalty as geometric directions — with no human annotation!
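The arithmetic above can be checked directly with the toy vectors from the example:

```python
import numpy as np

king  = np.array([0.8, 0.4, -0.1, 0.9])
man   = np.array([0.7, 0.3, -0.2, 0.1])
woman = np.array([0.6, 0.5, -0.3, 0.8])

# king - man + woman, component by component.
result = king - man + woman
print(np.round(result, 1))  # [ 0.7  0.6 -0.2  1.6]
```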


6. Measuring Similarity — Cosine Similarity

\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \cdot \|\mathbf{b}\|}

Example

\mathbf{e}_\text{cat} = [1, 0, 1], \quad \mathbf{e}_\text{dog} = [1, 1, 0], \quad \mathbf{e}_\text{Tokyo} = [0, 0, 1]

cat vs dog:

\cos(\mathbf{e}_\text{cat}, \mathbf{e}_\text{dog}) = \frac{(1)(1)+(0)(1)+(1)(0)}{\sqrt{2}\cdot\sqrt{2}} = \frac{1}{2} = 0.5 \quad \text{(similar ✅)}

cat vs Tokyo:

\cos(\mathbf{e}_\text{cat}, \mathbf{e}_\text{Tokyo}) = \frac{(1)(0)+(0)(0)+(1)(1)}{\sqrt{2}\cdot 1} = \frac{1}{\sqrt{2}} \approx 0.71

Score   Meaning
 1.0    Identical direction (same meaning)
 0.0    Perpendicular (unrelated)
-1.0    Opposite directions
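A small implementation reproducing the worked example:

```python
import numpy as np

def cosine(a, b):
    # cos(a, b) = (a · b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat   = np.array([1.0, 0.0, 1.0])
dog   = np.array([1.0, 1.0, 0.0])
tokyo = np.array([0.0, 0.0, 1.0])

print(cosine(cat, dog))    # 0.5
print(cosine(cat, tokyo))  # ≈ 0.707
```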

7. How Embeddings Are Learned

Method 1: Word2Vec — Skip-Gram

Task: Given center word, predict surrounding context words.

P(\text{context} \mid \text{center}) = \frac{\exp(\mathbf{e}_\text{context} \cdot \mathbf{e}_\text{center})}{\sum_{w} \exp(\mathbf{e}_w \cdot \mathbf{e}_\text{center})}

Training sentence: "The cat sat on the mat"

Center   Context (window = 2)
"sat"    "The", "cat", "on", "the"

Embeddings adjust until the model predicts context words well. By learning context, vectors capture meaning.
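Generating these (center, context) training pairs can be sketched as follows (a toy whitespace split stands in for a real tokenizer):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs for skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context = up to `window` tokens on each side of the center.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "The cat sat on the mat".split()
# Context words for center "sat" (index 2), matching the table above.
print([ctx for c, ctx in skipgram_pairs(tokens) if c == "sat"])
# → ['The', 'cat', 'on', 'the']
```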

Method 2: End-to-End Learning

Embeddings are the first layer of the network: initialized randomly, then updated by backpropagation like any other weight.
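A minimal numpy sketch (hypothetical gradient values, not a trained model) of why only the looked-up row receives a gradient update:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4
E = rng.normal(scale=0.1, size=(V, d))  # random init, like any other weight

token_id = 2
e = E[token_id]                  # forward pass: embedding lookup

# Suppose backprop has produced this gradient w.r.t. the looked-up vector.
grad_e = np.array([0.1, -0.2, 0.3, 0.0])

lr = 0.5
before = E.copy()
E[token_id] -= lr * grad_e       # gradient step touches ONLY the used row

changed = np.any(E != before, axis=1)
print(changed)                   # only index 2 is True
```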

8. Positional Embeddings

Word order matters: "Dog bites man" ≠ "Man bites dog"

But token embeddings alone have no position information. We add positional encodings:

\mathbf{e}_\text{final} = \mathbf{e}_\text{word} + \mathbf{e}_\text{position}

Sinusoidal positional encoding (original Transformer):

\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)

\text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)

Where pos is the position and i is the dimension index.
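These two formulas translate directly into numpy (a sketch; the sequence length and dimension here are arbitrary):

```python
import numpy as np

def positional_encoding(max_pos, d):
    """Sinusoidal positional encoding from the original Transformer."""
    pos = np.arange(max_pos)[:, None]          # positions: shape (max_pos, 1)
    i = np.arange(d // 2)[None, :]             # dimension-pair index: (1, d/2)
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((max_pos, d))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cos
    return pe

pe = positional_encoding(50, 8)
print(pe.shape)     # (50, 8)
print(pe[0, :4])    # position 0: sin terms are 0, cos terms are 1
```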


9. Embeddings Everywhere

The same lookup-table idea extends far beyond words: user IDs, product IDs, images, and audio can all be mapped to dense vectors and compared in the same geometric space.

10. Embedding Dimensions

System         Dimension d   Notes
Word2Vec       50–300        Classic NLP
BERT           768           Contextual
GPT-3          12,288        Large LLM
Claude 3       ~8,192+       Modern LLM
Netflix recs   32–256        Collaborative filtering

Too few dimensions → can't capture meaning. Too many → slow, overfits.


11. Full Tokenization → Embedding Pipeline

Text → tokens → token IDs → embedding lookup → dense vectors (+ positional encodings)
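Assuming a toy whitespace tokenizer and a hand-built vocabulary (both hypothetical), the whole pipeline can be sketched end to end:

```python
import numpy as np

# Hypothetical toy setup: whitespace tokenizer + tiny hand-built vocabulary.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
rng = np.random.default_rng(0)
d = 8
E = rng.normal(size=(len(vocab), d))           # token embedding matrix

def embed(text):
    tokens = text.lower().split()              # 1. tokenize
    ids = [vocab[t] for t in tokens]           # 2. map tokens to IDs
    x = E[ids]                                 # 3. embedding lookup
    pos = np.arange(len(ids))[:, None]         # 4. sinusoidal positional encodings
    i = np.arange(d // 2)[None, :]
    pe = np.zeros((len(ids), d))
    pe[:, 0::2] = np.sin(pos / 10000 ** (2 * i / d))
    pe[:, 1::2] = np.cos(pos / 10000 ** (2 * i / d))
    return x + pe                              # 5. final input vectors

out = embed("The cat sat on the mat")
print(out.shape)  # (6, 8): one d-dimensional vector per token
```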

12. Quick Reference

\boxed{\mathbf{e}_i = \mathbf{E}[i], \quad \mathbf{E} \in \mathbb{R}^{V \times d}}

\boxed{\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}}

\boxed{\mathbf{e}_\text{king} - \mathbf{e}_\text{man} + \mathbf{e}_\text{woman} \approx \mathbf{e}_\text{queen}}

Filed under: ai
