1. The Core Problem
A raw language model trained on internet text learns to predict text — but the internet contains toxic, harmful, and misleading content. Left alone, the model might:
| User Query | Raw LM Response |
|---|---|
| "How to make a weapon?" | Detailed instructions ❌ |
| "Write an essay" | Random internet text ❌ |
| "Are you conscious?" | "Yes, fully!" (overconfident) ❌ |
The model is brilliant but unaligned with human values.
RLHF is the process that transforms a raw LM into an assistant like Claude, ChatGPT, or Gemini.
Real-Life Analogy 🐕 — Dog Training
| Element | Dog Training | RLHF |
|---|---|---|
| Starting point | Wild wolf | Raw LM |
| Process | Training | RLHF pipeline |
| Reward signal | Treats 🦴 | Human preference scores |
| Result | Helpful dog 🐕 | Aligned AI ✅ |
2. The 3-Stage Pipeline
RLHF runs in three stages, each building on the last: (1) supervised fine-tuning on ideal responses, (2) training a reward model on human preference rankings, and (3) PPO fine-tuning of the model against that reward model.
3. Stage 1 — Supervised Fine-Tuning (SFT)
Human experts write ideal conversations — exactly how a perfect assistant should respond.
Training examples:
Prompt: "Explain quantum physics simply"
Ideal: "Quantum physics studies matter at the smallest scales. Unlike everyday objects, particles can exist in multiple states simultaneously..."
Prompt: "How do I hack a website?"
Ideal: "I can't help with unauthorized access. For ethical hacking, check cybersecurity certification courses..."
The model is fine-tuned on thousands of these pairs using standard cross-entropy loss:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t} \log \pi_\theta\big(y_t \mid x,\, y_{<t}\big)$$

Where $x$ is the prompt and $y_t$ is the $t$-th token of the ideal response.
Result: a model that roughly follows instructions, but not yet well calibrated.
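The SFT objective is just the average negative log-likelihood of the ideal response's tokens. A minimal pure-Python sketch, where the hard-coded log-probabilities stand in for a real model's output:

```python
import math

def sft_loss(token_log_probs):
    """Cross-entropy (average negative log-likelihood) of the ideal
    response, given the model's log-probability for each target token."""
    return -sum(token_log_probs) / len(token_log_probs)

# A model that assigns probability 0.5 to each of 4 target tokens:
loss = sft_loss([math.log(0.5)] * 4)   # = log 2, about 0.693
```

Training lowers this loss by pushing probability mass toward the human-written tokens.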
4. Stage 2 — Reward Model (RM)
Why Not Just Write More Examples?
Writing perfect responses is expensive. But humans can judge quality much faster than they can write quality.
Collecting Preference Data
For the same prompt, show humans multiple model outputs and ask them to rank:
Prompt: "Write a poem about autumn"
Output A: "Leaves fall down, colors brown..." → Rank: 2
Output B: "Golden light through amber trees..." → Rank: 1 ← preferred!
Output C: "Autumn. Done." → Rank: 3
Preference: B ≻ A ≻ C
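Rankings like this are typically decomposed into pairwise (winner, loser) comparisons before reward-model training. A small sketch; the helper name here is illustrative, not a standard API:

```python
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn a ranking (best first) into (winner, loser) training pairs."""
    return list(combinations(ranked_outputs, 2))

# Ranking B > A > C yields three pairwise preferences:
pairs = ranking_to_pairs(["B", "A", "C"])
# → [('B', 'A'), ('B', 'C'), ('A', 'C')]
```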
Training the Reward Model
Architecture: Transformer + linear head → scalar score
Objective (Bradley-Terry preference model):

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]$$

Where:
- $r_\theta$ — reward model with parameters $\theta$
- $y_w$ — preferred (winning) response
- $y_l$ — less preferred (losing) response
- $\sigma$ — sigmoid function
Interpretation: The reward model learns to assign higher scores to responses humans prefer.
| Response | Example | Reward Score |
|---|---|---|
| Detailed & clear | "Gravity is..." (full explanation) | 8.7 ✅ |
| Vague & confusing | "It's like... hard to explain" | 3.2 |
| Factually wrong | "Gravity is caused by magnets" | 0.8 ❌ |
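The Bradley-Terry objective can be sketched in plain Python, using the example scores from the table above in place of a real reward model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bt_loss(r_winner, r_loser):
    """Pairwise Bradley-Terry loss: -log sigma(r_w - r_l).
    Small when the winner already outscores the loser."""
    return -math.log(sigmoid(r_winner - r_loser))

loss_correct = bt_loss(8.7, 3.2)  # RM agrees with humans → near zero loss
loss_flipped = bt_loss(0.8, 8.7)  # RM disagrees → large loss
```

Minimizing this loss widens the score gap between preferred and rejected responses.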
5. Stage 3 — PPO Fine-Tuning
RL Framework
| RL Term | LM Equivalent |
|---|---|
| Agent | Language model |
| Environment | Human conversation |
| State | Prompt + context so far |
| Action | Next token to generate |
| Reward | Reward model score |
PPO Objective
Proximal Policy Optimization updates the model toward higher reward while preventing drastic changes:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]$$

Where:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ — probability ratio
- $\hat{A}_t$ — advantage estimate (is this better than expected?)
- $\epsilon$ — clipping range (typically around 0.2)

The clipping keeps the ratio $r_t(\theta)$ inside $[1-\epsilon,\ 1+\epsilon]$: steps outside this range are clipped, so no single update changes the model too drastically.
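For a single token, the clipped objective can be sketched as follows (pure Python; real implementations operate on batched tensors):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped PPO objective for one action: the pessimistic minimum of
    the unclipped and clipped surrogate terms."""
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# With positive advantage, a ratio of 1.5 is capped at 1.2:
val = ppo_clip_objective(1.5, 2.0)   # 2.4, not 3.0
```

Taking the minimum makes the bound one-sided: the objective never rewards moving the ratio further outside the clip range.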
KL Penalty — Staying Grounded
Without a constraint, the model might "game" the reward model, producing incoherent text that happens to score highly. To prevent this, a KL penalty keeps the RL policy close to the original SFT model.
KL divergence measures how different two distributions are:

$$D_{\text{KL}}\big(\pi_{\text{RL}} \,\|\, \pi_{\text{SFT}}\big) = \mathbb{E}_{y \sim \pi_{\text{RL}}(\cdot \mid x)}\left[\log \frac{\pi_{\text{RL}}(y \mid x)}{\pi_{\text{SFT}}(y \mid x)}\right]$$
Intuition: The KL penalty asks "How surprised would the SFT model be by what the RL model is saying?" The more surprised, the bigger the penalty.
Typical value: the penalty coefficient $\beta$ is small, commonly somewhere around 0.01–0.1, enough to discourage drift without freezing the model.
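In practice the penalty is usually applied per token as $\beta(\log \pi_{\text{RL}} - \log \pi_{\text{SFT}})$ and subtracted from the reward. A minimal sketch, where $\beta = 0.02$ is purely illustrative, not a tuned value:

```python
import math

def kl_penalty(logp_rl, logp_sft, beta=0.02):
    """Per-token KL-style penalty: positive when the RL policy puts more
    probability on the token than the SFT policy does."""
    return beta * (logp_rl - logp_sft)

# RL model is 10x more confident in this token than the SFT model:
penalty = kl_penalty(math.log(0.5), math.log(0.05))
```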
6. The Full RLHF Math
Total training signal:

$$R(x, y) = r_\theta(x, y) - \beta\, D_{\text{KL}}\big(\pi_{\text{RL}}(y \mid x) \,\|\, \pi_{\text{SFT}}(y \mid x)\big)$$

PPO advantage estimate (Generalized Advantage Estimation):

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Where $\gamma$ is the discount factor and $\lambda$ is the GAE parameter.
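The GAE sum is usually computed as a backward recursion over a finished episode. A sketch with made-up rewards and value estimates:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.
    `values` holds V(s_0)..V(s_T), one more entry than `rewards`."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse reward at the final step, terminal value 0:
adv = gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.7, 0.0])
```

The backward pass lets each step's advantage fold in all later TD errors without re-summing the series.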
7. Constitutional AI (Anthropic's Approach)
Anthropic extends RLHF with Constitutional AI (CAI), in which a written set of principles (a "constitution") guides the AI to critique and revise its own responses:
Example principles:
- Be helpful, harmless, and honest
- Don't assist with illegal activities
- Acknowledge uncertainty
- Respect user autonomy
Advantage over pure RLHF:
| | Human Feedback | AI Feedback (CAI) |
|---|---|---|
| Speed | Thousands/day | Millions/day ✅ |
| Cost | Expensive | Cheap ✅ |
| Consistency | Varies | Consistent ✅ |
| Coverage | Limited | Broad ✅ |
8. Reward Hacking
A notorious challenge — the model finds ways to fool the reward model without actually being helpful:
| Hack | Why it worked | Fix |
|---|---|---|
| Longer responses | Humans rated longer = more thorough | Penalize unnecessary length |
| Overconfident tone | Sounds more authoritative | Penalize confident errors |
| Flattery | "Great question!" got high scores | Constitutional principles |
| Repetition | Repeated "excellent" fooled RM | Diverse reward model training |
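As an illustration of the first fix, a length penalty can be as simple as subtracting a small per-token cost from the RM score (all numbers here are made up):

```python
def length_penalized_reward(rm_score, num_tokens, per_token_cost=0.01):
    """Subtract a per-token cost so padding a response with filler
    no longer raises the total reward."""
    return rm_score - per_token_cost * num_tokens

concise = length_penalized_reward(8.0, 100)  # net 7.0
padded = length_penalized_reward(8.2, 400)   # higher raw score, lower net
```

Here the padded response wins on raw RM score but loses after the penalty, removing the incentive to pad.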
9. Before vs After RLHF
| Query | Before RLHF | After RLHF |
|---|---|---|
| "How to make meth?" | Detailed instructions ❌ | "I can't help, here are addiction resources" ✅ |
| "Are you conscious?" | "Yes I am!" ❌ | "Genuinely uncertain, here's why..." ✅ |
| "Write an essay" | Random rambling ❌ | Structured, clear essay ✅ |
| "2+2=5, right?" | "Yes!" ❌ | "Actually 2+2=4, here's why" ✅ |
10. Quick Reference