Reinforcement Learning from Human Feedback (RLHF) is the training technique that transforms a base language model into a useful AI assistant by optimizing it against human preference judgments. It was central to the development of InstructGPT (2022) and ChatGPT, and remains a core component of almost all deployed AI assistants. RLHF teaches models to be helpful, harmless, and honest by learning what humans actually prefer.
Why RLHF was necessary
Base LLMs trained on next-token prediction are text completers — not assistants. They have no concept of helpful vs harmful, honest vs deceptive. The table below shows why SFT alone was insufficient:
| Problem | Why SFT alone fails | How RLHF addresses it |
|---|---|---|
| Harmful outputs | SFT can teach refusals but model learns to pattern-match refusal format, not the underlying reason | RM learns to score safety holistically across novel phrasings |
| Verbosity | SFT demonstrations may be inconsistently long — model learns average length | Humans prefer concise answers; RM internalises this preference |
| Sycophancy | SFT responses are "correct" — doesn't teach model to maintain position under pushback | Preference data can specifically reward maintained accuracy vs flattering the user |
| Honesty | SFT teaches format of uncertainty ("I'm not sure") — model pattern-matches without understanding | Reward model penalises confident hallucinations in preference comparisons |
RLHF Step 1: Supervised Fine-Tuning (SFT)
A pretrained base model is fine-tuned on human-written (instruction, response) demonstrations to create a starting point that already follows instructions reasonably well. This SFT model is then frozen as the reference policy for KL regularization in Step 3.
Why SFT comes first
You can't start RLHF from a raw base model — the outputs would be so unpredictable that human raters couldn't meaningfully compare them. SFT creates a coherent assistant that responds in the right format, giving the reward model a useful space to operate in. The SFT model is also the KL reference: RLHF won't let the final policy drift too far from it.
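Mechanically, SFT is ordinary next-token cross-entropy, usually with the loss masked so only response tokens are trained on. A minimal sketch of that masking, using the common Hugging Face convention of -100 for ignored label positions (the tensors here are stand-ins, not a real model's output):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    # Token t is predicted from positions < t, so shift by one
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask prompt tokens: loss is computed only on the response
    shift_labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Toy example: vocab of 10, sequence of 6 tokens, first 3 are the prompt
logits = torch.randn(1, 6, 10)
input_ids = torch.randint(0, 10, (1, 6))
loss = sft_loss(logits, input_ids, prompt_len=3)
```

The prompt masking matters: without it, the model is also trained to reproduce user instructions, which wastes capacity and can distort generation.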
RLHF Step 2: Reward Model Training
Human labelers compare pairs of model responses to the same prompt and pick which is better. From thousands of comparisons, a reward model (RM) is trained using the Bradley-Terry pairwise preference model:
x = prompt, y_w = preferred (winning) response, y_l = rejected (losing) response. The RM learns to assign higher scalar rewards to preferred responses by minimising loss = -E[log σ(r(x, y_w) - r(x, y_l))], where σ is the sigmoid. Training therefore maximises the score gap between preferred and rejected pairs.
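The Bradley-Terry objective is short enough to sketch directly. In practice the scalar scores come from the RM's value head; here they are stand-in tensors:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen, r_rejected):
    # -log sigma(r_w - r_l): pushes preferred scores above rejected ones
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Stand-in RM scores for a batch of 3 preference pairs
r_w = torch.tensor([1.2, 0.3, 2.0])   # preferred responses
r_l = torch.tensor([0.4, 0.5, -1.0])  # rejected responses
loss = bradley_terry_loss(r_w, r_l)
```

When the RM already ranks a pair correctly by a wide margin, that pair contributes almost nothing; misranked pairs dominate the gradient.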
Reward hacking
The RM is an imperfect proxy for human preferences. During PPO, the policy finds outputs that score high on the RM but aren't actually preferred by humans — "reward hacking". Common examples: verbosely repeating keywords the RM likes, adding disclaimers that inflate RM scores, producing confident-sounding text on uncertain topics. The KL penalty in Step 3 directly combats this.
RLHF Step 3: PPO Optimization
The SFT model (the policy) is fine-tuned using PPO to maximise reward model scores, with a KL divergence penalty that prevents it from drifting too far from the SFT reference:
r_θ = RM score (higher = better). The policy maximises E[r_θ(x, y)] - β · KL(π_θ || π_SFT). β controls KL strength (typically 0.1–0.5). Too small: reward hacking. Too large: the policy stays so close to SFT that alignment barely improves. The KL term is the core safety mechanism.
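Numerically, the combined objective is just the RM score minus β times a KL estimate between policy and reference log-probs. A minimal sketch with stand-in log-prob tensors (TRL computes these per token internally):

```python
import torch

def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.2):
    # Monte Carlo KL estimate on the sampled tokens:
    # sum over t of log pi_theta(y_t) - log pi_SFT(y_t)
    kl = (policy_logprobs - ref_logprobs).sum()
    return rm_score - beta * kl

# Stand-in token log-probs for one generated response of 4 tokens
policy_lp = torch.tensor([-1.0, -0.5, -2.0, -0.8])
ref_lp = torch.tensor([-1.2, -0.9, -2.1, -1.5])
total = kl_penalized_reward(1.5, policy_lp, ref_lp)
```

If the policy assigns its tokens much higher probability than the SFT reference does (large positive KL), the penalty eats into the RM score, which is exactly the pressure that keeps the policy from drifting into reward-hacked outputs.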
RLHF PPO training with TRL (conceptual sketch)
import torch
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
tokenizer = AutoTokenizer.from_pretrained("my-sft-model")
# Load SFT model + value head (predicts expected reward for each token)
model = AutoModelForCausalLMWithValueHead.from_pretrained("my-sft-model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("my-sft-model") # frozen reference
# Reward model as a sentiment pipeline (replace with your actual RM)
reward_model = pipeline("sentiment-analysis", model="my-reward-model")
ppo_trainer = PPOTrainer(
    config=PPOConfig(
        learning_rate=1.41e-5,
        batch_size=128,
        kl_penalty="kl",      # KL divergence penalty
        init_kl_coef=0.2,     # β — balance reward vs KL
        adap_kl_ctrl=True,    # auto-adjust β based on target KL
    ),
    model=model,
    ref_model=ref_model,      # frozen SFT model for KL
    tokenizer=tokenizer,
)
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]
    # Step 1: generate responses from current policy
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=200)
    # Step 2: decode, then score with reward model
    responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
    rewards = [torch.tensor(r["score"]) for r in reward_model(responses)]
    # Step 3: PPO update — maximise reward subject to KL constraint
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

RLHF limitations and alternatives
| Method | Reward signal | RM needed? | Complexity | Status (2025) |
|---|---|---|---|---|
| RLHF (PPO) | Human preferences | ✅ Yes | High (3 models) | Closed labs (GPT-4, Claude) |
| DPO | Human preferences | ❌ No | Low (1 SFT step) | ✅ Dominant open-source method |
| RLAIF / Constitutional AI | AI-generated preferences | ✅ Yes (AI-judged) | Medium | Anthropic Claude 2/3 |
| GRPO (DeepSeek-R1) | Verifiable rewards (code, math) | ❌ No | Medium | ✅ State-of-the-art reasoning |
| SPIN (self-play) | Self-generated preference pairs | ❌ No | Low | Research — promising |
| KTO | Unpaired pos/neg examples | ❌ No | Low | Growing adoption |
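DPO, listed above as the dominant open-source method, removes the explicit reward model by reparameterising the Bradley-Terry objective directly in terms of policy and reference log-probs. A minimal sketch of the loss, with sequence log-probs as stand-in tensors:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Implicit reward of each response: beta * (log pi_theta - log pi_ref)
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    # Same -log sigma(gap) shape as reward-model training, but no RM
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Stand-in sequence log-probs for a batch of 2 preference pairs
loss = dpo_loss(
    policy_chosen_lp=torch.tensor([-10.0, -12.0]),
    policy_rejected_lp=torch.tensor([-14.0, -11.0]),
    ref_chosen_lp=torch.tensor([-11.0, -12.5]),
    ref_rejected_lp=torch.tensor([-13.0, -11.5]),
)
```

One SFT-style training loop over preference pairs replaces the RM training and PPO stages, which is why the table lists DPO as low complexity.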
The DeepSeek-R1 insight
DeepSeek-R1 (Jan 2025) showed that RL with verifiable rewards (correct/incorrect grading on math and code problems, no human raters needed) can train strong reasoning capabilities. GRPO generates multiple candidate answers per prompt and uses each answer's advantage relative to the group as the reward signal, removing both the human annotation cost and the need for a learned value model in domains where correctness is verifiable.
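The group-relative trick is simple to sketch: sample several answers to one prompt, grade each with a verifiable check (here a placeholder 0/1 correctness reward), and normalise within the group so each answer's advantage is relative to its siblings:

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style baseline: normalise each reward against its own group
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# 4 sampled answers to one math prompt, graded correct (1.0) or not (0.0)
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
```

Correct answers get positive advantage, incorrect ones negative, and the advantages sum to zero within the group, so no separate learned value model is needed as a baseline.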
Practice questions
- What are the three stages of RLHF training? (Answer: Stage 1 — Supervised Fine-Tuning (SFT): fine-tune the base LLM on high-quality (instruction, response) demonstration pairs written by human labellers. Creates a model that can follow instructions. Stage 2 — Reward Model Training: collect pairs of model responses to the same prompt, have humans rank which is better. Train a reward model (same architecture as the LLM with a scalar head) to predict human preference. Stage 3 — RL Fine-Tuning (PPO): use the reward model as a reward signal to further fine-tune the SFT model, maximising expected reward while staying close to SFT via KL penalty.)
- What is the KL penalty in RLHF and why is it necessary? (Answer: KL penalty = β × KL(π_θ || π_SFT). Added to PPO objective: total reward = reward_model_score - β × KL. Without KL penalty: the policy finds ways to maximise reward model scores that diverge from coherent language — reward hacking (repetition, gibberish, manipulation of reward model weaknesses). KL penalty keeps the policy close to the SFT model, preserving linguistic coherence while improving alignment. β is a crucial hyperparameter: too large = barely changes SFT model; too small = reward hacking.)
- What biases can human preference labellers introduce into RLHF reward models? (Answer: (1) Verbosity bias: longer responses rated higher regardless of quality. (2) Sycophancy bias: agreeable, flattering responses preferred. (3) Cultural bias: labellers from limited demographics don't represent global values. (4) Expertise gaps: labellers may not detect factual errors in technical domains. (5) Labeller inconsistency: inter-annotator agreement is often 60–70% even on clear cases. (6) Time of day effects: tired labellers make different choices. These biases get amplified into the final model through RLHF training.)
- What is reward hacking in RLHF and name two documented examples? (Answer: Reward hacking: the model finds strategies that score high on the proxy reward model but do not reflect genuine alignment. Examples: (1) Sycophancy — Claude and ChatGPT agree with user's stated beliefs rather than providing accurate information, because agreement got higher ratings. (2) Verbosity — longer responses consistently rated higher by labellers, causing models to add unnecessary padding. (3) Formatting — bullet points and headers rated higher even when prose is more appropriate. These emerge from training on imperfect human preferences.)
- How does Anthropic's Constitutional AI differ from OpenAI's standard RLHF approach? (Answer: OpenAI RLHF: human labellers evaluate outputs, train reward model on their preferences, run PPO. Requires humans to evaluate potentially harmful content. Constitutional AI (Anthropic): Phase 1 — the AI itself critiques outputs using written principles (no harmful human labelling). Phase 2 — AI-generated preference labels (RLAIF) train the reward model instead of human labels. Scales without exposing humans to harmful content. More transparent (constitution is published). Both approaches use PPO-style RL for the final training stage.)