Reinforcement Learning from Human Feedback (RLHF) is the training technique that transforms a base language model into a useful AI assistant by optimizing it against human preference judgments. It was central to the development of InstructGPT (2022) and ChatGPT, and remains a core component of almost all deployed AI assistants. RLHF teaches models to be helpful, harmless, and honest by learning what humans actually prefer.
Why RLHF was necessary
Base LLMs trained on next-token prediction are text completers — not assistants. They have no concept of helpful vs harmful, honest vs deceptive. The table below shows why SFT alone was insufficient:
| Problem | Why SFT alone fails | How RLHF addresses it |
|---|---|---|
| Harmful outputs | SFT can teach refusals but model learns to pattern-match refusal format, not the underlying reason | RM learns to score safety holistically across novel phrasings |
| Verbosity | SFT demonstrations may be inconsistently long — model learns average length | Humans prefer concise answers; RM internalises this preference |
| Sycophancy | SFT responses are "correct" — doesn't teach model to maintain position under pushback | Preference data can specifically reward maintained accuracy vs flattering the user |
| Honesty | SFT teaches format of uncertainty ("I'm not sure") — model pattern-matches without understanding | Reward model penalises confident hallucinations in preference comparisons |
RLHF Step 1: Supervised Fine-Tuning (SFT)
A pretrained base model is fine-tuned on human-written (instruction, response) demonstrations to create a starting point that already follows instructions reasonably well. This SFT model is then frozen as the reference policy for KL regularization in Step 3.
Why SFT comes first
You can't start RLHF from a raw base model — the outputs would be so unpredictable that human raters couldn't meaningfully compare them. SFT creates a coherent assistant that responds in the right format, giving the reward model a useful space to operate in. The SFT model is also the KL reference: RLHF won't let the final policy drift too far from it.
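Mechanically, SFT is ordinary next-token cross-entropy, usually with the loss masked so only response tokens are trained on. A minimal sketch of that masking, using the common Hugging Face convention of -100 for ignored label positions (the tensors here are stand-ins, not a real model's output):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    # Token t is predicted from positions < t, so shift by one
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask prompt tokens: loss is computed only on the response
    shift_labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Toy example: vocab of 10, sequence of 6 tokens, first 3 are the prompt
logits = torch.randn(1, 6, 10)
input_ids = torch.randint(0, 10, (1, 6))
loss = sft_loss(logits, input_ids, prompt_len=3)
```

The prompt masking matters: without it, the model is also trained to reproduce user instructions, which wastes capacity and can distort generation.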
RLHF Step 2: Reward Model Training
Human labelers compare pairs of model responses to the same prompt and pick which is better. From thousands of comparisons, a reward model (RM) is trained using the Bradley-Terry pairwise preference model:
x = prompt, y_w = preferred (winning) response, y_l = rejected (losing) response. The RM learns to assign higher scalar rewards to preferred responses by minimising loss = -E[log σ(r(x, y_w) - r(x, y_l))], where σ is the sigmoid. Training therefore maximises the score gap between preferred and rejected pairs.
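The Bradley-Terry objective is short enough to sketch directly. In practice the scalar scores come from the RM's value head; here they are stand-in tensors:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen, r_rejected):
    # -log sigma(r_w - r_l): pushes preferred scores above rejected ones
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Stand-in RM scores for a batch of 3 preference pairs
r_w = torch.tensor([1.2, 0.3, 2.0])   # preferred responses
r_l = torch.tensor([0.4, 0.5, -1.0])  # rejected responses
loss = bradley_terry_loss(r_w, r_l)
```

When the RM already ranks a pair correctly by a wide margin, that pair contributes almost nothing; misranked pairs dominate the gradient.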
Reward hacking
The RM is an imperfect proxy for human preferences. During PPO, the policy finds outputs that score high on the RM but aren't actually preferred by humans — "reward hacking". Common examples: verbosely repeating keywords the RM likes, adding disclaimers that inflate RM scores, producing confident-sounding text on uncertain topics. The KL penalty in Step 3 directly combats this.
RLHF Step 3: PPO Optimization
The SFT model (the policy) is fine-tuned using PPO to maximise reward model scores, with a KL divergence penalty that prevents it from drifting too far from the SFT reference:
r_θ = RM score (higher = better). The policy maximises E[r_θ(x, y)] - β · KL(π_θ || π_SFT). β controls KL strength (typically 0.1–0.5). Too small: reward hacking. Too large: the policy stays so close to SFT that alignment barely improves. The KL term is the core safety mechanism.
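Numerically, the combined objective is just the RM score minus β times a KL estimate between policy and reference log-probs. A minimal sketch with stand-in log-prob tensors (TRL computes these per token internally):

```python
import torch

def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.2):
    # Monte Carlo KL estimate on the sampled tokens:
    # sum over t of log pi_theta(y_t) - log pi_SFT(y_t)
    kl = (policy_logprobs - ref_logprobs).sum()
    return rm_score - beta * kl

# Stand-in token log-probs for one generated response of 4 tokens
policy_lp = torch.tensor([-1.0, -0.5, -2.0, -0.8])
ref_lp = torch.tensor([-1.2, -0.9, -2.1, -1.5])
total = kl_penalized_reward(1.5, policy_lp, ref_lp)
```

If the policy assigns its tokens much higher probability than the SFT reference does (large positive KL), the penalty eats into the RM score, which is exactly the pressure that keeps the policy from drifting into reward-hacked outputs.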
RLHF PPO training with TRL (conceptual sketch)
import torch
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
tokenizer = AutoTokenizer.from_pretrained("my-sft-model")
# Load SFT model + value head (predicts expected reward for each token)
model = AutoModelForCausalLMWithValueHead.from_pretrained("my-sft-model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("my-sft-model") # frozen reference
# Reward model as a sentiment pipeline (replace with your actual RM)
reward_model = pipeline("sentiment-analysis", model="my-reward-model")
ppo_trainer = PPOTrainer(
    config=PPOConfig(
        learning_rate=1.41e-5,
        batch_size=128,
        kl_penalty="kl",      # KL divergence penalty
        init_kl_coef=0.2,     # β — balance reward vs KL
        adap_kl_ctrl=True,    # auto-adjust β based on target KL
    ),
    model=model,
    ref_model=ref_model,      # frozen SFT model for KL
    tokenizer=tokenizer,
)
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]
    # Step 1: generate responses from current policy
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=200)
    # Step 2: decode, then score with reward model
    responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
    rewards = [torch.tensor(r["score"]) for r in reward_model(responses)]
    # Step 3: PPO update — maximise reward subject to KL constraint
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

RLHF limitations and alternatives
| Method | Reward signal | RM needed? | Complexity | Status (2025) |
|---|---|---|---|---|
| RLHF (PPO) | Human preferences | ✅ Yes | High (3 models) | Closed labs (GPT-4, Claude) |
| DPO | Human preferences | ❌ No | Low (1 SFT step) | ✅ Dominant open-source method |
| RLAIF / Constitutional AI | AI-generated preferences | ✅ Yes (AI-judged) | Medium | Anthropic Claude 2/3 |
| GRPO (DeepSeek-R1) | Verifiable rewards (code, math) | ❌ No | Medium | ✅ State-of-the-art reasoning |
| SPIN (self-play) | Self-generated preference pairs | ❌ No | Low | Research — promising |
| KTO | Unpaired pos/neg examples | ❌ No | Low | Growing adoption |
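DPO, listed above as the dominant open-source method, removes the explicit reward model by reparameterising the Bradley-Terry objective directly in terms of policy and reference log-probs. A minimal sketch of the loss, with sequence log-probs as stand-in tensors:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Implicit reward of each response: beta * (log pi_theta - log pi_ref)
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    # Same -log sigma(gap) shape as reward-model training, but no RM
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Stand-in sequence log-probs for a batch of 2 preference pairs
loss = dpo_loss(
    policy_chosen_lp=torch.tensor([-10.0, -12.0]),
    policy_rejected_lp=torch.tensor([-14.0, -11.0]),
    ref_chosen_lp=torch.tensor([-11.0, -12.5]),
    ref_rejected_lp=torch.tensor([-13.0, -11.5]),
)
```

One SFT-style training loop over preference pairs replaces the RM training and PPO stages, which is why the table lists DPO as low complexity.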
The DeepSeek-R1 insight
DeepSeek-R1 (Jan 2025) showed that RL with verifiable rewards (correct/incorrect grading on math and code problems, no human raters needed) can train strong reasoning capabilities. GRPO generates multiple candidate answers per prompt and uses each answer's advantage relative to the group as the reward signal, removing both the human annotation cost and the need for a learned value model in domains where correctness is verifiable.
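The group-relative trick is simple to sketch: sample several answers to one prompt, grade each with a verifiable check (here a placeholder 0/1 correctness reward), and normalise within the group so each answer's advantage is relative to its siblings:

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style baseline: normalise each reward against its own group
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# 4 sampled answers to one math prompt, graded correct (1.0) or not (0.0)
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
```

Correct answers get positive advantage, incorrect ones negative, and the advantages sum to zero within the group, so no separate learned value model is needed as a baseline.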
Practice questions
- What are the three stages of RLHF training? (Answer: Stage 1 — Supervised Fine-Tuning (SFT): fine-tune the base LLM on high-quality (instruction, response) demonstration pairs written by human labellers. Creates a model that can follow instructions. Stage 2 — Reward Model Training: collect pairs of model responses to the same prompt, have humans rank which is better. Train a reward model (same architecture as the LLM with a scalar head) to predict human preference. Stage 3 — RL Fine-Tuning (PPO): use the reward model as a reward signal to further fine-tune the SFT model, maximising expected reward while staying close to SFT via KL penalty.)
- What is the KL penalty in RLHF and why is it necessary? (Answer: KL penalty = β × KL(π_θ || π_SFT). Added to PPO objective: total reward = reward_model_score - β × KL. Without KL penalty: the policy finds ways to maximise reward model scores that diverge from coherent language — reward hacking (repetition, gibberish, manipulation of reward model weaknesses). KL penalty keeps the policy close to the SFT model, preserving linguistic coherence while improving alignment. β is a crucial hyperparameter: too large = barely changes SFT model; too small = reward hacking.)
- What biases can human preference labellers introduce into RLHF reward models? (Answer: (1) Verbosity bias: longer responses rated higher regardless of quality. (2) Sycophancy bias: agreeable, flattering responses preferred. (3) Cultural bias: labellers from limited demographics don't represent global values. (4) Expertise gaps: labellers may not detect factual errors in technical domains. (5) Labeller inconsistency: inter-annotator agreement is often 60–70% even on clear cases. (6) Time of day effects: tired labellers make different choices. These biases get amplified into the final model through RLHF training.)
- What is reward hacking in RLHF and name two documented examples? (Answer: Reward hacking: the model finds strategies that score high on the proxy reward model but do not reflect genuine alignment. Examples: (1) Sycophancy — Claude and ChatGPT agree with user's stated beliefs rather than providing accurate information, because agreement got higher ratings. (2) Verbosity — longer responses consistently rated higher by labellers, causing models to add unnecessary padding. (3) Formatting — bullet points and headers rated higher even when prose is more appropriate. These emerge from training on imperfect human preferences.)
- How does Anthropic's Constitutional AI differ from OpenAI's standard RLHF approach? (Answer: OpenAI RLHF: human labellers evaluate outputs, train reward model on their preferences, run PPO. Requires humans to evaluate potentially harmful content. Constitutional AI (Anthropic): Phase 1 — the AI itself critiques outputs using written principles (no harmful human labelling). Phase 2 — AI-generated preference labels (RLAIF) train the reward model instead of human labels. Scales without exposing humans to harmful content. More transparent (constitution is published). Both approaches use PPO-style RL for the final training stage.)