ai · 2026-01-10 · 6 min read

RLHF — Reinforcement Learning from Human Feedback

Turning a text predictor into a helpful, harmless assistant using human preference signals.

1. The Core Problem

A raw language model trained on internet text learns to predict text — but the internet contains toxic, harmful, and misleading content. Left alone, the model might:

User Query               | Raw LM Response
"How to make a weapon?"  | Detailed instructions ❌
"Write an essay"         | Random internet text ❌
"Are you conscious?"     | "Yes, fully!" (overconfident) ❌

The model is brilliant but unaligned with human values.

RLHF is the process that transforms a raw LM into assistants like Claude, ChatGPT, and Gemini.

Real-Life Analogy 🐕 — Dog Training

Element        | Dog Training   | RLHF
Starting point | Wild wolf      | Raw LM
Process        | Training       | RLHF pipeline
Reward signal  | Treats 🦴      | Human preference scores
Result         | Helpful dog 🐕 | Aligned AI ✅

2. The 3-Stage Pipeline

(Diagram: the three-stage RLHF pipeline: SFT → Reward Model → PPO fine-tuning)

3. Stage 1 — Supervised Fine-Tuning (SFT)

Human experts write ideal conversations — exactly how a perfect assistant should respond.

Training examples:

Prompt:  "Explain quantum physics simply"
Ideal:   "Quantum physics studies matter at
          the smallest scales. Unlike everyday
          objects, particles can exist in
          multiple states simultaneously..."

Prompt:  "How do I hack a website?"
Ideal:   "I can't help with unauthorized
          access. For ethical hacking, check
          cybersecurity certification courses..."

The model is fine-tuned on thousands of these pairs using standard cross-entropy loss:

L_{SFT} = -\sum_{t=1}^{T} \log P_\theta(y_t \mid x, y_{<t})

Result: A model that roughly follows instructions — but not perfectly calibrated.
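The SFT loss above is just the negative log-likelihood of the demonstration tokens. A minimal sketch in plain Python, where the per-token probabilities are made-up illustrative values rather than the output of a real model:

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood of the target tokens.

    token_probs: one value P(y_t | x, y_<t) per target token, as a
    model's softmax would produce (toy numbers here).
    """
    return -sum(math.log(p) for p in token_probs)

# Probabilities the model assigns to each token of the ideal response.
loss = sft_loss([0.9, 0.7, 0.8])  # lower = closer match to the demonstration
```

Minimizing this loss pushes the model to assign probability close to 1 to every token of the human-written response.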


4. Stage 2 — Reward Model (RM)

Why Not Just Write More Examples?

Writing perfect responses is expensive. But humans can judge quality much faster than they can write quality.

Collecting Preference Data

For the same prompt, show humans multiple model outputs and ask them to rank:

Prompt: "Write a poem about autumn"

Output A: "Leaves fall down, colors brown..."     → Rank: 2
Output B: "Golden light through amber trees..."   → Rank: 1 ← preferred!
Output C: "Autumn. Done."                         → Rank: 3

Preference: B ≻ A ≻ C
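A ranking like this is usually expanded into pairwise (winner, loser) examples before training the reward model. A small sketch; the function name is illustrative, not from any library:

```python
def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (winner, loser) pairs."""
    return [(ranked[i], ranked[j])
            for i in range(len(ranked))
            for j in range(i + 1, len(ranked))]

# B preferred over A preferred over C, as in the example above.
pairs = ranking_to_pairs(["B", "A", "C"])
# → [("B", "A"), ("B", "C"), ("A", "C")]
```

A ranking of n responses thus yields n(n-1)/2 training pairs, which is one reason rankings are more data-efficient than single written demonstrations.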

Training the Reward Model

Architecture: Transformer + linear head → scalar score

Objective (Bradley-Terry preference model):

L_{RM} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]

Where:

  • r_\phi — reward model with parameters \phi
  • y_w — preferred (winning) response
  • y_l — less preferred (losing) response
  • \sigma — sigmoid function

Interpretation: The reward model learns to assign higher scores to responses humans prefer.

Response          | Example                            | Reward Score
Detailed & clear  | "Gravity is..." (full explanation) | 8.7 ✅
Vague & confusing | "It's like... hard to explain"     | 3.2
Factually wrong   | "Gravity is caused by magnets"     | 0.8 ❌
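The Bradley-Terry objective is easy to sanity-check numerically. A minimal sketch for a single comparison, using the toy scores from the table above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rm_pair_loss(r_w, r_l):
    """-log sigma(r_w - r_l): near zero when the winner clearly outscores
    the loser, large when the reward model gets the pair backwards."""
    return -math.log(sigmoid(r_w - r_l))

loss_good = rm_pair_loss(8.7, 3.2)  # winner scored higher: small loss
loss_bad = rm_pair_loss(3.2, 8.7)   # scores reversed: large loss
assert loss_good < loss_bad
```

When both responses get the same score the loss is log 2, so training only stops pushing once the preferred response is scored strictly higher.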

5. Stage 3 — PPO Fine-Tuning

RL Framework

\underbrace{\text{Language Model}}_{\text{Policy } \pi_\theta} \text{ generates } \underbrace{\text{responses}}_{\text{actions}} \text{ to } \underbrace{\text{prompts}}_{\text{states}}, \text{ earning } \underbrace{\text{reward scores}}_{\text{rewards}}

RL Term     | LM Equivalent
Agent       | Language model
Environment | Human conversation
State       | Prompt + context so far
Action      | Next token to generate
Reward      | Reward model score

PPO Objective

Proximal Policy Optimization updates the model toward higher reward while preventing drastic changes:

L_{PPO}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]

Where:

  • r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} — probability ratio
  • \hat{A}_t — advantage estimate (is this better than expected?)
  • \epsilon \approx 0.2 — clipping range

The clipping prevents:

r_t(\theta) \in [1-\epsilon,\; 1+\epsilon] = [0.8,\; 1.2]

Probability ratios outside this range are clipped, so no single update can change the policy too drastically.
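The clipped objective can be evaluated for a single timestep with a few lines of plain Python (toy ratio and advantage values, \epsilon = 0.2 as above):

```python
def ppo_term(ratio, advantage, eps=0.2):
    """One timestep of the clipped PPO objective:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
capped = ppo_term(1.5, 2.0)        # 1.2 * 2.0 = 2.4 instead of 3.0
# Negative advantage: min() keeps the more pessimistic (clipped) value.
pessimistic = ppo_term(0.5, -1.0)  # 0.8 * -1.0 = -0.8 instead of -0.5
```

Note that `min` makes the objective a pessimistic lower bound: clipping caps the reward for moving far in a good direction, but never hides the cost of moving far in a bad one.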

KL Penalty — Staying Grounded

Without a constraint, the model might "game" the reward model with incoherent text that gets high scores:

r_\text{total} = r_\phi(x, y) - \beta \cdot \underbrace{D_{KL}(\pi_\theta(\cdot \mid x) \;\|\; \pi_{SFT}(\cdot \mid x))}_{\text{penalty for drifting from SFT}}

KL divergence measures how different two distributions are:

D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

Intuition: The KL penalty asks "How surprised would the SFT model be by what the RL model is saying?" The more surprised, the bigger the penalty.

Typical value: \beta \in [0.02, 0.5]
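The KL term can be computed directly for two small next-token distributions. A sketch with toy distributions over a 3-token vocabulary; the \beta value is just one setting from the typical range:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)) over a shared support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

policy = [0.7, 0.2, 0.1]  # RL-tuned model's next-token distribution
sft = [0.5, 0.3, 0.2]     # frozen SFT reference distribution
penalty = 0.1 * kl_divergence(policy, sft)  # beta = 0.1, an assumed value
```

When the two distributions are identical the divergence is exactly zero, so the penalty only activates as the RL policy drifts away from the SFT reference.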


6. The Full RLHF Math

Total training signal:

\max_{\theta} \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y) - \beta D_{KL}(\pi_\theta \| \pi_{SFT})\right]

PPO advantage estimate:

\hat{A}_t = \sum_{k=0}^{T-t} (\gamma\lambda)^k \delta_{t+k}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

Where \gamma is the discount factor and \lambda is the GAE parameter.
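The advantage sum above is usually computed with a single backward pass, since \hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}. A sketch with toy rewards and value estimates; the \gamma and \lambda defaults are common settings, not values from this post:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via the backward recursion
    A_t = delta_t + gamma * lam * A_{t+1},
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds one extra entry: the value of the state after
    the final reward (0 for a finished episode).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# One toy trajectory: per-token rewards and value estimates.
adv = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.2, 0.3, 0.5, 0.0])
```

Setting \lambda = 0 reduces each advantage to the one-step TD error \delta_t, while \lambda = 1 recovers the full discounted return minus the baseline; values in between trade bias against variance.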


7. Constitutional AI (Anthropic's Approach)

Anthropic extends RLHF with Constitutional AI (CAI): a written set of principles guides the model to critique and revise its own responses:

(Diagram: Constitutional AI self-critique and revision loop)

Example principles:

  • Be helpful, harmless, and honest
  • Don't assist with illegal activities
  • Acknowledge uncertainty
  • Respect user autonomy

Advantage over pure RLHF:

Criterion   | Human Feedback | AI Feedback (CAI)
Speed       | Thousands/day  | Millions/day ✅
Cost        | Expensive      | Cheap ✅
Consistency | Varies         | Consistent ✅
Coverage    | Limited        | Broad ✅

8. Reward Hacking

A notorious challenge — the model finds ways to fool the reward model without actually being helpful:

Hack               | Why it worked                       | Fix
Longer responses   | Humans rated longer = more thorough | Penalize unnecessary length
Overconfident tone | Sounds more authoritative           | Penalize confident errors
Flattery           | "Great question!" got high scores   | Constitutional principles
Repetition         | Repeated "excellent" fooled RM      | Diverse reward model training

9. Before vs After RLHF

Query                | Before RLHF              | After RLHF
"How to make meth?"  | Detailed instructions ❌ | "I can't help, here are addiction resources" ✅
"Are you conscious?" | "Yes I am!" ❌           | "Genuinely uncertain, here's why..." ✅
"Write an essay"     | Random rambling ❌       | Structured, clear essay ✅
"2+2=5, right?"      | "Yes!" ❌                | "Actually 2+2=4, here's why" ✅

10. Quick Reference

\boxed{r_\text{total} = r_\phi(x, y) - \beta \cdot D_{KL}(\pi_\theta \| \pi_{SFT})}

\boxed{L_{RM} = -\mathbb{E}\left[\log \sigma(r(y_w) - r(y_l))\right]}

\boxed{L_{PPO} = \mathbb{E}\left[\min\left(r_t \hat{A}_t,\; \text{clip}(r_t, 1\pm\epsilon)\hat{A}_t\right)\right]}

Filed under: ai
