Reinforcement learning is an ML paradigm where an agent learns to make decisions by interacting with an environment, receiving rewards for good actions and penalties for bad ones. The agent's goal is to learn a policy — a mapping from states to actions — that maximizes cumulative long-term reward. RL powers game-playing AI, robotics, and the RLHF technique used to align LLMs.
Core concepts: agent, environment, reward
Reinforcement learning is formalized as a Markov Decision Process (MDP). The agent's goal is to learn a policy π(a|s) that maximizes the expected sum of discounted future rewards:
Return G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^∞ γ^k r_{t+k+1}: total discounted future reward from timestep t. γ ∈ [0,1] is the discount factor — γ close to 1 values future rewards highly, γ close to 0 makes the agent focus on immediate rewards.
State value function V^π(s) = E_π[G_t | s_t = s]: expected return when starting in state s and following policy π. The optimal policy maximizes V for every state.
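For intuition, the return can be computed numerically by discounting backwards through a reward sequence (the rewards and γ below are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t by folding backwards: G = r + γ·G_next."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Three small step penalties, then a +1 goal reward
print(discounted_return([-0.01, -0.01, -0.01, 1.0], gamma=0.9))  # ≈ 0.7019
```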
| Component | Description | Example (chess AI) |
|---|---|---|
| State s | Current observable situation | Board position (8×8 grid, piece locations) |
| Action a | Choice made by the agent | Move a pawn from e2 to e4 |
| Reward r | Scalar feedback signal | +1 win, −1 loss, 0 draw |
| Policy π(a|s) | Maps states to action probabilities | Probability of each legal move given board |
| Episode | Complete sequence: start → terminal | One chess game from opening to checkmate |
| Discount γ | How much future rewards are valued | 0.99 (care about winning, not just next move) |
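The components in the table interact in a standard loop. A minimal sketch of that loop follows — the random environment and uniform policy here are placeholders, not a real chess engine:

```python
import random

def env_step(state, action):
    """Toy environment: random reward, episode ends with 10% chance."""
    reward = random.choice([-1, 0, 1])
    done = random.random() < 0.1
    return state + 1, reward, done

def policy(state, n_actions=4):
    """Placeholder policy π(a|s): uniform random over actions."""
    return random.randrange(n_actions)

state, total_reward, done = 0, 0, False
while not done:                              # one episode
    action = policy(state)                   # sample a ~ π(a|s)
    state, reward, done = env_step(state, action)
    total_reward += reward                   # accumulate return (γ = 1 here)
```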
Value functions and Q-learning
Q-learning learns the action-value function Q(s,a) — the expected return from taking action a in state s. The Bellman equation provides a recursive definition that enables iterative learning:
Bellman optimality equation for Q*: Q*(s,a) = E[r + γ max_{a'} Q*(s',a')]. The optimal Q value for (s,a) equals the immediate reward r plus the discounted value of the best action in the next state s'.
Q-learning update rule: Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') − Q(s,a)], where α is the learning rate. The term in brackets is the TD error (temporal difference error) — how wrong our current Q estimate is.
Q-learning from scratch on a simple grid world
```python
import numpy as np

# Grid world: 4×4 states, 4 actions (up/down/left/right), goal at (3,3)
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))  # Initialize Q table

def step(s, a):
    """Simplified grid world step function."""
    row, col = s // 4, s % 4
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    dr, dc = moves[a]
    new_row = max(0, min(3, row + dr))
    new_col = max(0, min(3, col + dc))
    s_next = new_row * 4 + new_col
    reward = 1.0 if s_next == 15 else -0.01  # +1 at goal, -0.01 otherwise
    done = (s_next == 15)
    return s_next, reward, done

# Training loop
lr, gamma, epsilon = 0.1, 0.99, 0.3
for episode in range(5000):
    s = 0  # start at (0,0)
    for _ in range(100):
        # ε-greedy action selection
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)  # explore
        else:
            a = np.argmax(Q[s])  # exploit
        s_next, r, done = step(s, a)
        # Q-learning update (Bellman equation)
        td_error = r + gamma * np.max(Q[s_next]) * (1 - done) - Q[s, a]
        Q[s, a] += lr * td_error
        s = s_next
        if done:
            break

print("Learned Q-values at start state (0):", Q[0].round(3))
print("Optimal action:", ['up', 'down', 'left', 'right'][np.argmax(Q[0])])
```

Policy gradient methods
Policy gradient methods directly optimize the policy parameters θ by gradient ascent on expected return. The REINFORCE policy gradient theorem gives the gradient of J(θ):
Policy gradient theorem (Williams, 1992): ∇_θ J(θ) = E[Σ_t G_t ∇_θ log π_θ(a_t|s_t)] — increase the log-probability of actions that led to high returns, decrease it for low-return actions. G_t is the return from timestep t.
PPO (Proximal Policy Optimization) clipped objective: L^CLIP(θ) = E[min(r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t)], where r_t(θ) = π_θ(a_t|s_t)/π_θ_old(a_t|s_t) is the probability ratio and Â_t is the advantage estimate. Clipping to [1−ε, 1+ε] prevents large destabilizing updates.
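The clipped objective is a few lines in PyTorch. A minimal sketch — the log-probabilities and advantages below are made-up tensors, not outputs of a trained model:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (negated so it can be minimized)."""
    ratio = torch.exp(logp_new - logp_old)          # r_t(θ)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)  # clip to [1-ε, 1+ε]
    return -torch.min(ratio * advantages, clipped * advantages).mean()

logp_old = torch.tensor([-1.0, -2.0, -0.5])
logp_new = torch.tensor([-0.8, -2.5, -0.4])
adv = torch.tensor([1.0, -0.5, 2.0])
loss = ppo_clip_loss(logp_new, logp_old, adv)
```

Taking the element-wise `min` means the clipped term only ever removes incentive to push the ratio further from 1, never adds it.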
| Algorithm | Key idea | Used in |
|---|---|---|
| REINFORCE | Monte Carlo policy gradient — update after full episodes | Simple environments, theoretical baseline |
| Actor-Critic (A2C/A3C) | Policy net (actor) + value net (critic) reduces variance | Atari, continuous control |
| PPO | Clip probability ratio to prevent large updates | ChatGPT RLHF, robotics, game AI |
| SAC | Maximize reward + entropy (explore more) | Continuous control, robotics |
| GRPO | Group relative policy optimization — no critic needed | DeepSeek-R1 reasoning |
RL landmarks and applications
Key milestones that show what RL can achieve at scale:
| Milestone | Year | Key contribution |
|---|---|---|
| DQN (DeepMind) | 2015 | Deep Q-network achieves human-level Atari performance on 49 games from raw pixels |
| AlphaGo | 2016 | Defeats world Go champion Lee Sedol; Go has ~10^170 board states |
| AlphaZero | 2017 | Surpasses all prior Chess/Go/Shogi engines using only self-play, zero human knowledge |
| OpenAI Five | 2019 | Defeats Dota 2 world champions with 45,000 years of self-play experience |
| InstructGPT / ChatGPT | 2022 | RLHF + PPO aligns GPT-3 with human preferences — transforms text completion into assistants |
| DeepSeek-R1 | 2025 | Pure RL (GRPO) teaches chain-of-thought reasoning without supervised data |
Compute required for AlphaZero
AlphaZero trained for 9 hours on 5,000 TPUs for chess, playing 44 million games against itself. Each game generates training data: (board state, move probabilities, game outcome). The final policy is the result of 700,000 gradient steps.
RL for LLM alignment (RLHF)
RLHF applies RL to steer LLM behavior toward human preferences. The policy is the LLM, actions are tokens, and the reward comes from a human-trained reward model. The combined training objective penalizes drift from the SFT model via a KL divergence term:
Total RLHF reward: R(x,y) = r_φ(x,y) − β·D_KL(π_θ(y|x) ‖ π_SFT(y|x)) — the reward model score minus β × KL divergence from the SFT model. β controls how much the model is allowed to deviate from safe SFT behavior. Without the KL term, the model reward-hacks by generating nonsense text that tricks the reward model.
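In practice the KL term is estimated from per-token log-probs of the sampled response. A sketch, assuming those log-probs are already available from the policy and SFT models (the tensors below are placeholders):

```python
import torch

def rlhf_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus β × (Monte Carlo estimate of) KL from SFT.

    logp_policy / logp_sft: per-token log-probs of the sampled response
    under each model; their difference summed over tokens estimates
    the sequence-level KL divergence.
    """
    kl_estimate = (logp_policy - logp_sft).sum()
    return rm_score - beta * kl_estimate

logp_policy = torch.tensor([-0.5, -1.0, -0.2])   # placeholder values
logp_sft = torch.tensor([-0.7, -1.1, -0.6])
r = rlhf_reward(rm_score=2.0, logp_policy=logp_policy, logp_sft=logp_sft)
```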
RLHF reward model training (simplified Bradley-Terry preference model)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Reward model: LLM backbone + linear head outputting scalar reward."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone  # e.g. GPT-2
        self.reward_head = nn.Linear(backbone.config.n_embd, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids).last_hidden_state[:, -1, :]
        return self.reward_head(hidden).squeeze(-1)  # scalar reward

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """
    Bradley-Terry loss: maximize gap between chosen and rejected rewards.
    Equivalent to binary cross-entropy on (chosen, rejected) pairs.
    """
    r_chosen = reward_model(chosen_ids)      # should be higher
    r_rejected = reward_model(rejected_ids)  # should be lower
    # Minimize the NLL of the preference: -log σ(r_chosen - r_rejected)
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    return loss

# Training: for each (prompt, chosen_response, rejected_response) triplet,
# compute preference_loss and backpropagate.
```

RLHF → DPO evolution
RLHF requires training three separate models (SFT → reward model → policy with PPO). DPO (Rafailov et al., 2023) showed you can skip the reward model entirely by directly optimizing the policy from preferences — reducing RLHF's 3-stage pipeline to a single fine-tuning step. Most modern open-source model training now uses DPO or its variants (ORPO, SimPO).
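DPO's loss is the same Bradley-Terry form, but applied to implicit rewards β·log(π_θ/π_ref) instead of a learned reward model. A sketch, assuming sequence-level log-probs of each response under the policy and a frozen reference model (the example tensors are made up):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: Bradley-Terry loss on implicit rewards β·log(π_θ/π_ref)."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Policy already prefers the chosen response relative to the reference
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-14.0]),
                torch.tensor([-12.0]), torch.tensor([-12.0]))
```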
Practice questions
- What is the Bellman equation and why is it fundamental to RL? (Answer: Bellman equation: V(s) = max_a [R(s,a) + γ Σ P(s'|s,a) V(s')]. The value of a state = maximum expected immediate reward + discounted future value, averaged over transition probabilities. Key insight: the optimal value function satisfies this recursive equation. Value iteration: repeatedly apply the Bellman update until convergence — finds optimal policy without learning a model. Q-learning: learns Q(s,a) values using a sample-based Bellman update. The Bellman equation is the bridge between MDP theory and practical RL algorithms.)
- What is the exploration-exploitation trade-off and how does ε-greedy address it? (Answer: Exploitation: choose the action with highest estimated reward (greedy). May get stuck in local optima. Exploration: try other actions to discover better strategies. Wastes short-term reward. ε-greedy: with probability ε explore (random action), with probability 1-ε exploit (best known action). Common schedule: ε=1.0 at start (pure exploration), decay to ε=0.01 over training (mostly exploit). Trade-off: too much exploration → slow learning; too little → local optima. UCB (Upper Confidence Bound) exploration is theoretically optimal for bandits.)
- What is the difference between on-policy and off-policy learning? (Answer: On-policy: learns the value of the policy currently being used for exploration. SARSA: updates Q(s,a) using the action actually taken by the current policy. The behaviour policy = target policy. Off-policy: learns the optimal policy while following a different (exploration) policy. Q-learning: updates Q(s,a) using max_a'Q(s',a') — the optimal action, even if a different action was taken. Off-policy enables: learning from historical data (logged actions), using a safer exploration policy while learning an optimal one.)
- Why is training a Deep Q-Network (DQN) unstable without experience replay and target networks? (Answer: Instability sources: (1) Correlated samples: consecutive transitions (s_t, a_t, r_t, s_{t+1}) are highly correlated — violates the IID assumption of gradient descent. Experience replay: store transitions in a buffer, sample random mini-batches — breaks correlations. (2) Moving target problem: the TD target max_a' Q(s',a') is computed with the same network being updated — the network chases a moving target. Separate target network: a frozen copy updated every N steps. These two stabilisations (Mnih et al. 2015) made Atari DQN training feasible.)
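The two DQN fixes above can be sketched in a few lines — the tiny linear nets here stand in for real Q-networks, and the transitions are random placeholders:

```python
import random
from collections import deque

import torch
import torch.nn as nn

buffer = deque(maxlen=10_000)            # experience replay buffer

q_net = nn.Linear(4, 2)                  # stand-in for a real Q-network
target_net = nn.Linear(4, 2)
target_net.load_state_dict(q_net.state_dict())  # frozen copy of q_net

# Store transitions, then sample IID mini-batches
for _ in range(64):
    s, a = torch.randn(4), random.randrange(2)
    r, s_next = random.random(), torch.randn(4)
    buffer.append((s, a, r, s_next))
batch = random.sample(list(buffer), 32)  # breaks temporal correlation

# TD target uses the frozen target network, not the online q_net
s, a, r, s_next = batch[0]
with torch.no_grad():
    td_target = r + 0.99 * target_net(s_next).max()
```

Periodically the target network is refreshed with `load_state_dict` again, so the TD target only moves every N steps instead of every gradient step.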
- What is policy gradient and how does REINFORCE work? (Answer: Policy gradient: directly optimise the policy π_θ (a neural network) by gradient ascent on expected return J(θ) = E[Σ r_t]. REINFORCE: ∇J(θ) = E[G_t ∇log π_θ(a_t|s_t)] where G_t = Σγ^k r_{t+k} (discounted return). In practice: sample a full episode, compute G_t at each step, update θ ← θ + α G_t ∇log π_θ(a_t|s_t). Intuition: increase log probability of actions that led to high returns, decrease for low returns. High variance (needs many episodes). Actor-Critic methods (PPO, A3C) reduce variance.)