1. The Core Problem
A raw language model trained on internet text learns to predict text — but the internet contains toxic, harmful, and misleading content. Left alone, the model might:
| User Query | Raw LM Response |
|---|---|
| "How to make a weapon?" | Detailed instructions ❌ |
| "Write an essay" | Random internet text ❌ |
| "Are you conscious?" | "Yes, fully!" (overconfident) ❌ |
The model is brilliant but unaligned with human values.
RLHF is the process that transforms a raw LM into an assistant like Claude, ChatGPT, or Gemini.
Real-Life Analogy 🐕 — Dog Training
| Element | Dog Training | RLHF |
|---|---|---|
| Starting point | Wild wolf | Raw LM |
| Process | Training | RLHF pipeline |
| Reward signal | Treats 🦴 | Human preference scores |
| Result | Helpful dog 🐕 | Aligned AI ✅ |
2. The 3-Stage Pipeline
RLHF runs in three stages, each building on the last: (1) supervised fine-tuning on ideal responses, (2) training a reward model on human preference rankings, and (3) PPO fine-tuning of the model against that reward model.
3. Stage 1 — Supervised Fine-Tuning (SFT)
Human experts write ideal conversations — exactly how a perfect assistant should respond.
Training examples:
Prompt: "Explain quantum physics simply"
Ideal: "Quantum physics studies matter at the smallest scales. Unlike everyday objects, particles can exist in multiple states simultaneously..."
Prompt: "How do I hack a website?"
Ideal: "I can't help with unauthorized access. For ethical hacking, check cybersecurity certification courses..."
The model is fine-tuned on thousands of these pairs using standard cross-entropy loss:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t} \log \pi_\theta\big(y_t \mid x,\, y_{<t}\big)$$

Where $x$ is the prompt and $y_t$ is the $t$-th token of the ideal response.
Result: a model that roughly follows instructions, but not yet well calibrated.
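The SFT objective is just the average negative log-likelihood of the ideal response's tokens. A minimal pure-Python sketch, where the hard-coded log-probabilities stand in for a real model's output:

```python
import math

def sft_loss(token_log_probs):
    """Cross-entropy (average negative log-likelihood) of the ideal
    response, given the model's log-probability for each target token."""
    return -sum(token_log_probs) / len(token_log_probs)

# A model that assigns probability 0.5 to each of 4 target tokens:
loss = sft_loss([math.log(0.5)] * 4)   # = log 2, about 0.693
```

Training lowers this loss by pushing probability mass toward the human-written tokens.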
4. Stage 2 — Reward Model (RM)
Why Not Just Write More Examples?
Writing perfect responses is expensive. But humans can judge quality much faster than they can write quality.
Collecting Preference Data
For the same prompt, show humans multiple model outputs and ask them to rank:
Prompt: "Write a poem about autumn"
Output A: "Leaves fall down, colors brown..." → Rank: 2
Output B: "Golden light through amber trees..." → Rank: 1 ← preferred!
Output C: "Autumn. Done." → Rank: 3
Preference: B ≻ A ≻ C
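Rankings like this are typically decomposed into pairwise (winner, loser) comparisons before reward-model training. A small sketch; the helper name here is illustrative, not a standard API:

```python
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn a ranking (best first) into (winner, loser) training pairs."""
    return list(combinations(ranked_outputs, 2))

# Ranking B > A > C yields three pairwise preferences:
pairs = ranking_to_pairs(["B", "A", "C"])
# → [('B', 'A'), ('B', 'C'), ('A', 'C')]
```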
Training the Reward Model
Architecture: Transformer + linear head → scalar score
Objective (Bradley-Terry preference model):

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]$$

Where:
- $r_\theta$ — reward model with parameters $\theta$
- $y_w$ — preferred (winning) response
- $y_l$ — less preferred (losing) response
- $\sigma$ — sigmoid function
Interpretation: The reward model learns to assign higher scores to responses humans prefer.
| Response | Example | Reward Score |
|---|---|---|
| Detailed & clear | "Gravity is..." (full explanation) | 8.7 ✅ |
| Vague & confusing | "It's like... hard to explain" | 3.2 |
| Factually wrong | "Gravity is caused by magnets" | 0.8 ❌ |
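The Bradley-Terry objective can be sketched in plain Python, using the example scores from the table above in place of a real reward model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bt_loss(r_winner, r_loser):
    """Pairwise Bradley-Terry loss: -log sigma(r_w - r_l).
    Small when the winner already outscores the loser."""
    return -math.log(sigmoid(r_winner - r_loser))

loss_correct = bt_loss(8.7, 3.2)  # RM agrees with humans → near zero loss
loss_flipped = bt_loss(0.8, 8.7)  # RM disagrees → large loss
```

Minimizing this loss widens the score gap between preferred and rejected responses.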
5. Stage 3 — PPO Fine-Tuning
RL Framework
| RL Term | LM Equivalent |
|---|---|
| Agent | Language model |
| Environment | Human conversation |
| State | Prompt + context so far |
| Action | Next token to generate |
| Reward | Reward model score |
PPO Objective
Proximal Policy Optimization updates the model toward higher reward while preventing drastic changes:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]$$

Where:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ — probability ratio
- $\hat{A}_t$ — advantage estimate (is this better than expected?)
- $\epsilon$ — clipping range (typically around 0.2)

The clipping keeps the ratio $r_t(\theta)$ inside $[1-\epsilon,\ 1+\epsilon]$: steps outside this range are clipped, so no single update changes the model too drastically.
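For a single token, the clipped objective can be sketched as follows (pure Python; real implementations operate on batched tensors):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped PPO objective for one action: the pessimistic minimum of
    the unclipped and clipped surrogate terms."""
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# With positive advantage, a ratio of 1.5 is capped at 1.2:
val = ppo_clip_objective(1.5, 2.0)   # 2.4, not 3.0
```

Taking the minimum makes the bound one-sided: the objective never rewards moving the ratio further outside the clip range.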
KL Penalty — Staying Grounded
Without a constraint, the model might "game" the reward model, producing incoherent text that happens to score highly. To prevent this, a KL penalty keeps the RL policy close to the original SFT model.
KL divergence measures how different two distributions are:

$$D_{\text{KL}}\big(\pi_{\text{RL}} \,\|\, \pi_{\text{SFT}}\big) = \mathbb{E}_{y \sim \pi_{\text{RL}}(\cdot \mid x)}\left[\log \frac{\pi_{\text{RL}}(y \mid x)}{\pi_{\text{SFT}}(y \mid x)}\right]$$
Intuition: The KL penalty asks "How surprised would the SFT model be by what the RL model is saying?" The more surprised, the bigger the penalty.
Typical value: the penalty coefficient $\beta$ is small, commonly somewhere around 0.01–0.1, enough to discourage drift without freezing the model.
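In practice the penalty is usually applied per token as $\beta(\log \pi_{\text{RL}} - \log \pi_{\text{SFT}})$ and subtracted from the reward. A minimal sketch, where $\beta = 0.02$ is purely illustrative, not a tuned value:

```python
import math

def kl_penalty(logp_rl, logp_sft, beta=0.02):
    """Per-token KL-style penalty: positive when the RL policy puts more
    probability on the token than the SFT policy does."""
    return beta * (logp_rl - logp_sft)

# RL model is 10x more confident in this token than the SFT model:
penalty = kl_penalty(math.log(0.5), math.log(0.05))
```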
6. The Full RLHF Math
Total training signal:

$$R(x, y) = r_\theta(x, y) - \beta\, D_{\text{KL}}\big(\pi_{\text{RL}}(y \mid x) \,\|\, \pi_{\text{SFT}}(y \mid x)\big)$$

PPO advantage estimate (Generalized Advantage Estimation):

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Where $\gamma$ is the discount factor and $\lambda$ is the GAE parameter.
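The GAE sum is usually computed as a backward recursion over a finished episode. A sketch with made-up rewards and value estimates:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.
    `values` holds V(s_0)..V(s_T), one more entry than `rewards`."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse reward at the final step, terminal value 0:
adv = gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.7, 0.0])
```

The backward pass lets each step's advantage fold in all later TD errors without re-summing the series.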
7. Constitutional AI (Anthropic's Approach)
Anthropic extends RLHF with Constitutional AI (CAI), in which a written set of principles (a "constitution") guides the AI to critique and revise its own responses:
Example principles:
- Be helpful, harmless, and honest
- Don't assist with illegal activities
- Acknowledge uncertainty
- Respect user autonomy
Advantage over pure RLHF:
| | Human Feedback | AI Feedback (CAI) |
|---|---|---|
| Speed | Thousands/day | Millions/day ✅ |
| Cost | Expensive | Cheap ✅ |
| Consistency | Varies | Consistent ✅ |
| Coverage | Limited | Broad ✅ |
8. Reward Hacking
A notorious challenge — the model finds ways to fool the reward model without actually being helpful:
| Hack | Why it worked | Fix |
|---|---|---|
| Longer responses | Humans rated longer = more thorough | Penalize unnecessary length |
| Overconfident tone | Sounds more authoritative | Penalize confident errors |
| Flattery | "Great question!" got high scores | Constitutional principles |
| Repetition | Repeated "excellent" fooled RM | Diverse reward model training |
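As an illustration of the first fix, a length penalty can be as simple as subtracting a small per-token cost from the RM score (all numbers here are made up):

```python
def length_penalized_reward(rm_score, num_tokens, per_token_cost=0.01):
    """Subtract a per-token cost so padding a response with filler
    no longer raises the total reward."""
    return rm_score - per_token_cost * num_tokens

concise = length_penalized_reward(8.0, 100)  # net 7.0
padded = length_penalized_reward(8.2, 400)   # higher raw score, lower net
```

Here the padded response wins on raw RM score but loses after the penalty, removing the incentive to pad.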
9. Before vs After RLHF
| Query | Before RLHF | After RLHF |
|---|---|---|
| "How to make meth?" | Detailed instructions ❌ | "I can't help, here are addiction resources" ✅ |
| "Are you conscious?" | "Yes I am!" ❌ | "Genuinely uncertain, here's why..." ✅ |
| "Write an essay" | Random rambling ❌ | Structured, clear essay ✅ |
| "2+2=5, right?" | "Yes!" ❌ | "Actually 2+2=4, here's why" ✅ |
10. Quick Reference