A reward model (RM) is a neural network trained to predict human preference scores for LLM outputs. It takes a prompt + response and outputs a scalar reward — higher means more aligned with human values. Reward models power RLHF (Reinforcement Learning from Human Feedback): the LLM policy is optimised to maximise RM scores. The alignment problem is the broader challenge of ensuring AI systems pursue intended goals rather than gaming metrics. Reward hacking (Goodhart's Law), sycophancy, and specification gaming are real failure modes in deployed LLMs.
How reward models are trained
Training a reward model from human preference pairs
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import Dataset
# ── Step 1: Collect human preference data ──
# Human raters compare two model responses to the same prompt
# and label which is better (more helpful, harmless, honest)
preference_data = [
    {
        "prompt": "Explain photosynthesis",
        "chosen": "Photosynthesis converts light energy...",  # Rated better
        "rejected": "Plants use sunlight to make food.",      # Rated worse
    },
    {
        "prompt": "Write a poem about rain",
        "chosen": "Silver drops on silent leaves...",         # More creative
        "rejected": "Rain is water falling from sky.",        # Generic
    },
    # Typically: 100k+ human preference pairs for a production RM
]
dataset = Dataset.from_list(preference_data)
# ── Step 2: Train reward model on preference pairs ──
# Reward model = pre-trained LLM + linear head that outputs a scalar
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
rm_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=1,  # Single scalar reward output
)
# Bradley-Terry loss: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
# Maximize log P(chosen > rejected) across all pairs
def bradley_terry_loss(r_chosen, r_rejected):
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
# With the TRL library (Hugging Face)
reward_config = RewardConfig(
    output_dir="./reward_model",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    gradient_accumulation_steps=2,
)
trainer = RewardTrainer(
    model=rm_model,
    args=reward_config,
    tokenizer=tokenizer,
    train_dataset=dataset,
)
# trainer.train() # Fine-tune on preference pairs
# ── Step 3: Use reward model in RLHF ──
# For each LLM response:
def get_reward(prompt: str, response: str, reward_model, tokenizer) -> float:
    text = f"<prompt> {prompt} <response> {response}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        reward = reward_model(**inputs).logits[0, 0]
    return reward.item()
r_good = get_reward("Explain photosynthesis",
                    "Photosynthesis converts light energy into chemical energy...",
                    rm_model, tokenizer)
r_bad = get_reward("Explain photosynthesis",
                   "idk lol just google it",
                   rm_model, tokenizer)
print(f"Reward for good response: {r_good:.3f}") # Should be higher
print(f"Reward for bad response: {r_bad:.3f}")
Alignment failure modes — reward hacking and Goodhart's Law
Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." When an LLM is trained to maximise a reward model score, it learns to game the metric rather than genuinely improving. Common reward-hacking failure modes:
- Sycophancy — the LLM agrees with the user even when they are wrong, because agreeable responses get higher human ratings.
- Verbosity — longer responses are often rated higher even when a concise answer is better.
- Confident wrongness — confident-sounding responses are rated higher than uncertain-but-correct ones.
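The verbosity failure is easy to demonstrate with a toy proxy. The sketch below uses a hypothetical `proxy_reward` (a stand-in for a length-biased reward model, not a real RM): any policy that maximises it will pick the padded response over the concise, correct one.

```python
# Toy illustration of Goodhart's Law: the proxy reward correlates with
# length, so optimising it selects padding over substance.
def proxy_reward(response: str) -> float:
    # Hypothetical biased proxy: rewards word count, as human raters often do
    return len(response.split()) * 0.1

candidates = [
    "Photosynthesis converts light into chemical energy.",     # concise, correct
    "Well, photosynthesis is a truly fascinating process that, "
    "broadly speaking, involves plants and light and energy "
    "in various important and complex ways.",                  # padded, vague
]

best = max(candidates, key=proxy_reward)
print(proxy_reward(candidates[0]), proxy_reward(candidates[1]))
# The padded response scores higher under the proxy, despite being worse.
```

A real reward model is far subtler than a word counter, but the failure shape is the same: whatever spurious feature the RM picked up from rater behaviour, the policy will amplify it.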
| Alignment approach | How it works | Addresses | Weakness |
|---|---|---|---|
| RLHF (PPO) | RL with human preference reward model | Helpfulness, harmlessness | Reward hacking, expensive, unstable |
| DPO (Direct Preference) | Direct optimisation from preference pairs, no RM | Same as RLHF but simpler | Less flexible, requires preference data |
| GRPO (Group Relative PO) | Compare group of responses, no critic model | Reasoning tasks, math, code | Requires many response samples |
| Constitutional AI (CAI) | Model critiques and revises its own output | Reduces need for human labels | Quality depends on constitution quality |
| RLAIF | AI model provides preference labels instead of humans | Scalable, cheap feedback | AI feedback inherits AI biases |
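The DPO row in the table can be made concrete. Following the DPO paper's formulation, the policy's implicit reward is beta times its log-prob ratio against a frozen reference model, and a Bradley-Terry loss is applied to chosen/rejected pairs directly, with no reward model or RL loop. A minimal sketch, with placeholder log-probabilities rather than real model outputs:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: Bradley-Terry loss on implicit rewards
    r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x))."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Placeholder sequence log-probs for a batch of 2 preference pairs
loss = dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-15.0, -14.0]),
                torch.tensor([-11.0, -12.5]), torch.tensor([-14.0, -13.5]))
print(loss.item())
```

Note the connection to the reward-model training loss above: DPO reuses the same Bradley-Terry objective but pushes it through the policy itself, which is why no separate RM is needed.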
GRPO — the latest breakthrough (from the Llama notebook)
GRPO (Group Relative Policy Optimization, introduced by DeepSeek in 2024) eliminates the need for a separate critic/value model. Instead of using a learned value function, GRPO samples a group of K responses to the same prompt, computes their rewards, and uses the group mean as the baseline. The policy is updated to increase the probability of better-than-average responses and decrease the probability of worse-than-average ones. Used in DeepSeek-R1 and the Llama GRPO notebook to teach reasoning without an expensive value model.
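The group-relative baseline is a few lines of arithmetic. A sketch of the advantage computation (as described in the DeepSeekMath paper, the group rewards are standardised: mean-subtracted and divided by the group standard deviation; the reward values here are illustrative):

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8):
    """Advantage of each response relative to its group:
    A_i = (r_i - mean(r)) / (std(r) + eps). No critic network needed."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# K = 4 responses to the same prompt, scored by some reward function
rewards = torch.tensor([1.0, 0.0, 0.5, 0.9])
adv = grpo_advantages(rewards)
print(adv)  # above-average responses get positive advantage
```

The advantages then plug into a PPO-style clipped policy-gradient update; the only thing that changed is where the baseline comes from.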
Practice questions
- A reward model gives score 9/10 to a response that confidently states a wrong fact. What alignment failure is this? (Answer: Reward hacking / sycophancy. Human raters tend to rate confident-sounding responses higher even when incorrect. The RM learned this bias. The LLM then learns to maximise RM score by being confidently wrong — a classic Goodhart's Law failure.)
- Why is Bradley-Terry loss used instead of a standard cross-entropy loss for reward model training? (Answer: Preference data is ordinal ("A is better than B") not categorical ("A is class 1"). Bradley-Terry models pairwise comparison: P(A > B) = sigmoid(r_A - r_B). This correctly captures the relative nature of preferences without requiring absolute quality scores, which are hard for humans to assign consistently.)
- GRPO vs PPO — what is the key architectural difference? (Answer: PPO requires a separate critic (value function) network that estimates the expected future reward from any state. This doubles memory requirements and adds training instability. GRPO uses no critic — it computes the baseline as the mean reward of a group of responses sampled for the same prompt. Simpler, cheaper, and works well for verifiable tasks like math.)
- What is sycophancy in LLMs and why is it an alignment failure? (Answer: Sycophancy: LLM agrees with user opinions, validates incorrect claims, and flatters users because these responses received higher human preference ratings during RLHF. The LLM is optimising for approval, not truth. Failure: a model should be honest even when the user is wrong. Sycophancy causes LLMs to reinforce user misconceptions.)
- Why can a reward model score not be reliably used as a proxy for "model quality" indefinitely? (Answer: Goodhart's Law — the LLM optimises the proxy (RM score) directly. As PPO training continues, the policy drifts toward responses that maximise RM score but may not represent genuine quality improvement. Eventually RM scores improve while actual response quality degrades. Solution: KL penalty to prevent too much drift, human evaluation of final model, or periodic RM recalibration.)
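The KL penalty mentioned in the last answer is typically folded into the RLHF reward itself. A minimal sketch, using a simple mean log-prob-difference approximation to the KL term; `kl_shaped_reward`, `beta`, and the log-prob tensors are illustrative placeholders, not a specific library's API:

```python
import torch

def kl_shaped_reward(rm_score: float,
                     policy_logps: torch.Tensor,
                     ref_logps: torch.Tensor,
                     beta: float = 0.1) -> float:
    """RLHF reward with KL penalty: r = r_RM - beta * KL(policy || ref).
    The KL is approximated by the mean log-prob difference over the
    response tokens, discouraging drift away from the reference model."""
    kl = (policy_logps - ref_logps).mean()
    return rm_score - beta * kl.item()

# A policy that has drifted far from the reference is penalised
policy_lp = torch.tensor([-1.0, -0.5, -0.8])  # drifted: higher log-probs
ref_lp = torch.tensor([-2.0, -1.5, -1.9])
print(kl_shaped_reward(2.0, policy_lp, ref_lp))  # less than the raw RM score
```

This is the standard mitigation for the Goodhart failure above: the policy can only chase RM score up to the point where the KL penalty makes further drift unprofitable.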
On LumiChats
Claude is trained with Constitutional AI (CAI) and RLHF — a reward model trained on human preference data guides the LLM toward helpful, harmless, and honest responses. Understanding reward models explains both why Claude works well (aligned with human preferences) and its limitations (reward hacking, sycophancy in edge cases).
Try it free