Reinforcement Learning (RL) trains an agent to make sequential decisions in an environment to maximise cumulative reward. Unlike supervised learning (labelled examples) or unsupervised learning (patterns), RL learns from the consequences of its own actions. Core components: Agent (learner/decision maker), Environment (world the agent interacts with), State (current situation), Action (what the agent does), Reward (feedback signal), Policy (strategy for choosing actions), and Value Function (expected future rewards). Q-Learning is the foundational model-free RL algorithm. Deep RL (DQN, PPO, A3C) powers AlphaGo, ChatGPT RLHF, and game-playing AI.
Real-life analogy: Training a dog
Training a dog to sit: the dog (agent) tries different behaviours (actions) in the room (environment). When it sits, you give a treat (positive reward). When it jumps, you say 'no' (negative reward). The dog learns to sit to maximise treats. It does not need labelled examples — it discovers the optimal policy through trial, error, and reward signals. RL is generalised dog training for any sequential decision-making problem.
Core RL components and Markov Decision Process
| Component | Symbol | Definition | Example (chess) |
|---|---|---|---|
| Agent | - | The learner/decision maker | Chess program |
| Environment | - | Everything outside the agent | Chess board + opponent |
| State | s ∈ S | Current situation of the environment | Current board position (all piece locations) |
| Action | a ∈ A(s) | What the agent does in state s | A legal chess move |
| Reward | r | Scalar feedback after each action | +1 win, -1 lose, 0 draw |
| Policy | π(a|s) | Strategy: probability of action a in state s | Which move to play in each position |
| Value function | V(s) | Expected future reward from state s | How good is this position? |
| Q-function | Q(s,a) | Expected reward for taking action a in state s | How good is this specific move? |
| Discount factor | γ ∈ [0,1] | Weight of future vs immediate rewards | 0.99: future rewards almost as valuable |
Return G_t: discounted sum of future rewards from time t. γ (gamma) controls time preference: γ=0 → myopic (only immediate reward). γ=1 → far-sighted (future rewards equally valuable). Bellman equation: V(s) = E[r + γV(s')] — value of a state = immediate reward + discounted value of next state.
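The discount factor's effect on the return is easy to see numerically. A minimal sketch, using a hypothetical four-step reward sequence, that folds the recursion G_t = r_t + γ·G_{t+1} from the end of the episode:

```python
def discounted_return(rewards, gamma):
    """Discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    G = 0.0
    for r in reversed(rewards):  # fold backwards: G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
    return G

rewards = [1.0, 0.0, 0.0, 10.0]  # hypothetical episode: big reward at the end
print(discounted_return(rewards, gamma=0.0))  # myopic: only the first reward, 1.0
print(discounted_return(rewards, gamma=1.0))  # far-sighted: plain sum, 11.0
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```

With γ=0 the delayed reward of 10 is invisible to the agent; with γ=0.9 it still dominates the return, just slightly shrunk.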
Q-Learning — the foundational RL algorithm
Q-Learning from scratch on FrozenLake environment
```python
import numpy as np
import gymnasium as gym

# FrozenLake: 4x4 grid, agent must reach goal (G) without falling in holes (H)
# States: 0-15 (16 grid positions), Actions: 0=Left, 1=Down, 2=Right, 3=Up
env = gym.make('FrozenLake-v1', is_slippery=False)

# Q-TABLE: Q[state, action] = expected future reward
n_states = env.observation_space.n   # 16
n_actions = env.action_space.n       # 4
Q = np.zeros((n_states, n_actions))

# Hyperparameters
alpha = 0.8        # Learning rate: how quickly Q values are updated
gamma = 0.95       # Discount factor: weight of future rewards
epsilon = 1.0      # Exploration rate: probability of a random action
eps_decay = 0.995  # Epsilon decay per episode
eps_min = 0.01     # Minimum exploration rate
n_episodes = 2000

rewards_history = []
for episode in range(n_episodes):
    state, _ = env.reset()
    total_reward = 0
    for step in range(100):  # Max 100 steps per episode
        # EXPLORATION vs EXPLOITATION (epsilon-greedy)
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore: random action
        else:
            action = np.argmax(Q[state])        # Exploit: best known action

        # Take action, observe next state and reward
        next_state, reward, terminated, truncated, _ = env.step(action)

        # Q-LEARNING UPDATE (Bellman equation)
        # Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
        total_reward += reward
        if terminated or truncated:
            break

    # Decay epsilon (less exploration over time as Q converges)
    epsilon = max(eps_min, epsilon * eps_decay)
    rewards_history.append(total_reward)
    if (episode + 1) % 500 == 0:
        avg_reward = np.mean(rewards_history[-100:])
        print(f"Episode {episode+1}: Avg reward (last 100) = {avg_reward:.3f}, ε = {epsilon:.3f}")

print("Learned Q-table (first 5 states):")
print(Q[:5].round(3))
print("Learned policy:",
      [['Left', 'Down', 'Right', 'Up'][np.argmax(Q[i])] for i in range(16)])
```
Exploration vs Exploitation and modern deep RL
Exploration-Exploitation Dilemma: the agent must balance exploiting its best-known action against exploring alternatives that might turn out to be better. Pure exploitation gets stuck in local optima; pure exploration never settles on a good policy. Common solutions: ε-greedy (act randomly with probability ε), Upper Confidence Bound (UCB), Thompson Sampling.
| Algorithm | Key idea | Use case |
|---|---|---|
| Q-Learning | Tabular Q(s,a) — works for small discrete spaces | Grid worlds, simple games |
| DQN (Deep Q-Network) | Neural network approximates Q(s,a) | Atari games (DeepMind 2013) |
| Policy Gradient (REINFORCE) | Directly optimise policy π(a|s) via gradient ascent | Continuous action spaces |
| Actor-Critic (A2C/A3C) | Separate policy (actor) and value (critic) networks | Robotics, continuous control |
| PPO (Proximal Policy Optimisation) | Policy gradient with clipped objective — stable training | ChatGPT RLHF, games, robotics |
| AlphaGo / AlphaZero | Monte Carlo Tree Search + deep RL self-play | Chess, Go, board games |
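Policy-gradient methods from the table differ from Q-Learning in that they never estimate values: they push the policy's parameters directly toward actions that earned reward. A minimal REINFORCE sketch on a two-armed bandit with a softmax policy (arm payout probabilities are hypothetical; a running-average baseline is used to reduce gradient variance):

```python
import numpy as np

rng = np.random.default_rng(1)
probs_true = np.array([0.3, 0.9])  # hypothetical chance each arm pays reward 1
theta = np.zeros(2)                # policy logits
baseline = 0.0                     # running-average reward baseline
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(3000):
    pi = softmax(theta)
    a = int(rng.choice(2, p=pi))            # sample action from the policy
    r = float(rng.random() < probs_true[a])  # observe reward
    advantage = r - baseline                 # centre the reward signal
    baseline += 0.05 * (r - baseline)
    grad_log = -pi                           # ∇_theta log π(a) for softmax:
    grad_log[a] += 1.0                       # one_hot(a) - π
    theta += lr * advantage * grad_log       # REINFORCE gradient ascent step

print(softmax(theta).round(3))  # policy concentrates on the better arm
```

The same score-function update, with a neural network producing the logits and a learned critic as the baseline, is the core of A2C and (with a clipped surrogate objective) PPO.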
RLHF — how ChatGPT/Claude are trained
Reinforcement Learning from Human Feedback (RLHF): (1) Supervised fine-tuning on high-quality demonstrations. (2) Train a reward model — humans rank multiple AI responses, reward model learns to predict human preference. (3) PPO optimises the LLM policy to maximise the reward model's score, subject to a KL constraint that prevents the policy from drifting too far from the SFT model. This is why LLMs follow instructions and are "helpful, harmless, and honest."
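Step (2) typically uses a Bradley-Terry style pairwise loss: the reward model should score the human-preferred response above the rejected one. A minimal sketch on toy scalar scores (the values are hypothetical, and real reward models compute these scores with a neural network):

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Small when the model already ranks the human-preferred response higher;
    large when the ranking is inverted.
    """
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(round(preference_loss(2.0, -1.0), 4))  # correct ranking -> small loss
print(round(preference_loss(-1.0, 2.0), 4))  # inverted ranking -> large loss
```

Minimising this loss over many human-ranked pairs is what turns raw rankings into the scalar reward signal that PPO then maximises in step (3).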
Practice questions
- Discount factor γ=0 vs γ=1 — what is the agent optimising in each case? (Answer: γ=0: myopic — only cares about immediate reward, ignores all future rewards. γ=1: far-sighted — treats all future rewards equally to immediate reward (no discounting). In practice: γ=0.95-0.99 for most tasks. γ<1 ensures the sum of infinite rewards is finite.)
- What is the Exploration-Exploitation dilemma? Give an example. (Answer: The agent must choose between exploiting the best known strategy (eating at your favourite restaurant) vs exploring new options that might be better (trying a new restaurant). Too much exploitation: never discovers better options. Too much exploration: wastes time on poor options. ε-greedy: explore randomly with probability ε.)
- Q-Learning update equation: Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]. What is the term in brackets? (Answer: The TD error (Temporal Difference error) — the difference between the current Q estimate and the Bellman target (r + γ max Q(s',a')). If TD error > 0: current Q is too low, increase it. TD error < 0: current Q is too high, decrease it.)
- Why does DQN use a separate target network for the Bellman target? (Answer: Without a separate target network, both the Q network and the target (max Q(s',a')) change at every update step — like chasing a moving target. This causes instability and divergence. The target network is frozen for C steps then updated, making the target more stable during training.)
- What is the difference between a model-free and model-based RL algorithm? (Answer: Model-free (Q-Learning, PPO): learns directly from experience without modelling the environment dynamics P(s'|s,a). Model-based (Dyna, AlphaZero): learns a model of environment dynamics, then uses it to plan or generate simulated experience. Model-based is more sample-efficient but requires accurate models.)
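The TD-error question above can be checked with plain arithmetic. A sketch on hypothetical toy numbers, following the update rule exactly:

```python
# TD error from the Q-Learning update: delta = r + gamma * max_a' Q(s',a') - Q(s,a)
gamma, alpha = 0.9, 0.5
Q_sa, Q_next_max, r = 2.0, 4.0, 1.0   # hypothetical current estimates and reward

delta = r + gamma * Q_next_max - Q_sa  # 1 + 3.6 - 2 = 2.6 > 0: estimate too low
Q_sa_new = Q_sa + alpha * delta        # 2 + 0.5 * 2.6 = 3.3, moved toward target
print(delta, Q_sa_new)
```

A positive TD error pulls the estimate up toward the Bellman target; a negative one pulls it down, with α controlling the step size.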
On LumiChats
ChatGPT uses PPO-based RLHF as a final training stage, and Claude is aligned with closely related preference-based methods. Understanding Q-Learning and policy gradients directly explains why LLMs follow instructions: they are policy networks trained to maximise a human preference reward signal.
Try it free