Reinforcement Learning (RL) trains an agent to make sequential decisions in an environment to maximise cumulative reward. Unlike supervised learning (labelled examples) or unsupervised learning (patterns), RL learns from the consequences of its own actions. Core components: Agent (learner/decision maker), Environment (world the agent interacts with), State (current situation), Action (what the agent does), Reward (feedback signal), Policy (strategy for choosing actions), and Value Function (expected future rewards). Q-Learning is the foundational model-free RL algorithm. Deep RL (DQN, PPO, A3C) powers AlphaGo, ChatGPT RLHF, and game-playing AI.
Real-life analogy: Training a dog
Training a dog to sit: the dog (agent) tries different behaviours (actions) in the room (environment). When it sits, you give a treat (positive reward). When it jumps, you say 'no' (negative reward). The dog learns to sit to maximise treats. It does not need labelled examples — it discovers the optimal policy through trial, error, and reward signals. RL is generalised dog training for any sequential decision-making problem.
Core RL components and Markov Decision Process
| Component | Symbol | Definition | Example (chess) |
|---|---|---|---|
| Agent | - | The learner/decision maker | Chess program |
| Environment | - | Everything outside the agent | Chess board + opponent |
| State | s ∈ S | Current situation of the environment | Current board position (all piece locations) |
| Action | a ∈ A(s) | What the agent does in state s | A legal chess move |
| Reward | r | Scalar feedback after each action | +1 win, -1 lose, 0 draw |
| Policy | π(a|s) | Strategy: probability of action a in state s | Which move to play in each position |
| Value function | V(s) | Expected future reward from state s | How good is this position? |
| Q-function | Q(s,a) | Expected reward for taking action a in state s | How good is this specific move? |
| Discount factor | γ ∈ [0,1] | Weight of future vs immediate rewards | 0.99: future rewards almost as valuable |
Return G_t: discounted sum of future rewards from time t. γ (gamma) controls time preference: γ=0 → myopic (only immediate reward). γ=1 → far-sighted (future rewards equally valuable). Bellman equation: V(s) = E[r + γV(s')] — value of a state = immediate reward + discounted value of next state.
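The discount factor's effect on the return is easy to see numerically. A minimal sketch, using a hypothetical four-step reward sequence, that folds the recursion G_t = r_t + γ·G_{t+1} from the end of the episode:

```python
def discounted_return(rewards, gamma):
    """Discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    G = 0.0
    for r in reversed(rewards):  # fold backwards: G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
    return G

rewards = [1.0, 0.0, 0.0, 10.0]  # hypothetical episode: big reward at the end
print(discounted_return(rewards, gamma=0.0))  # myopic: only the first reward, 1.0
print(discounted_return(rewards, gamma=1.0))  # far-sighted: plain sum, 11.0
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```

With γ=0 the delayed reward of 10 is invisible to the agent; with γ=0.9 it still dominates the return, just slightly shrunk.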
Q-Learning — the foundational RL algorithm
Q-Learning from scratch on FrozenLake environment
```python
import numpy as np
import gymnasium as gym

# FrozenLake: 4x4 grid, agent must reach goal (G) without falling in holes (H)
# States: 0-15 (16 grid positions), Actions: 0=Left, 1=Down, 2=Right, 3=Up
env = gym.make('FrozenLake-v1', is_slippery=False)

# Q-TABLE: Q[state, action] = expected future reward
n_states = env.observation_space.n   # 16
n_actions = env.action_space.n       # 4
Q = np.zeros((n_states, n_actions))

# Hyperparameters
alpha = 0.8        # Learning rate: how quickly Q values are updated
gamma = 0.95       # Discount factor: weight of future rewards
epsilon = 1.0      # Exploration rate: probability of a random action
eps_decay = 0.995  # Epsilon decay per episode
eps_min = 0.01     # Minimum exploration rate
n_episodes = 2000

rewards_history = []
for episode in range(n_episodes):
    state, _ = env.reset()
    total_reward = 0
    for step in range(100):  # Max 100 steps per episode
        # EXPLORATION vs EXPLOITATION (epsilon-greedy)
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore: random action
        else:
            action = np.argmax(Q[state])        # Exploit: best known action

        # Take action, observe next state and reward
        next_state, reward, terminated, truncated, _ = env.step(action)

        # Q-LEARNING UPDATE (Bellman equation)
        # Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
        total_reward += reward
        if terminated or truncated:
            break

    # Decay epsilon (less exploration over time as Q converges)
    epsilon = max(eps_min, epsilon * eps_decay)
    rewards_history.append(total_reward)
    if (episode + 1) % 500 == 0:
        avg_reward = np.mean(rewards_history[-100:])
        print(f"Episode {episode+1}: Avg reward (last 100) = {avg_reward:.3f}, ε = {epsilon:.3f}")

print("Learned Q-table (first 5 states):")
print(Q[:5].round(3))
print("Learned policy:",
      [['Left', 'Down', 'Right', 'Up'][np.argmax(Q[i])] for i in range(16)])
```
Exploration vs Exploitation and modern deep RL
Exploration-Exploitation Dilemma: the agent must balance exploiting its best-known action against exploring alternatives that might turn out to be better. Pure exploitation gets stuck in local optima; pure exploration never settles on a good policy. Common solutions: ε-greedy (act randomly with probability ε), Upper Confidence Bound (UCB), Thompson Sampling.
| Algorithm | Key idea | Use case |
|---|---|---|
| Q-Learning | Tabular Q(s,a) — works for small discrete spaces | Grid worlds, simple games |
| DQN (Deep Q-Network) | Neural network approximates Q(s,a) | Atari games (DeepMind 2013) |
| Policy Gradient (REINFORCE) | Directly optimise policy π(a|s) via gradient ascent | Continuous action spaces |
| Actor-Critic (A2C/A3C) | Separate policy (actor) and value (critic) networks | Robotics, continuous control |
| PPO (Proximal Policy Optimisation) | Policy gradient with clipped objective — stable training | ChatGPT RLHF, games, robotics |
| AlphaGo / AlphaZero | Monte Carlo Tree Search + deep RL self-play | Chess, Go, board games |
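Policy-gradient methods from the table differ from Q-Learning in that they never estimate values: they push the policy's parameters directly toward actions that earned reward. A minimal REINFORCE sketch on a two-armed bandit with a softmax policy (arm payout probabilities are hypothetical; a running-average baseline is used to reduce gradient variance):

```python
import numpy as np

rng = np.random.default_rng(1)
probs_true = np.array([0.3, 0.9])  # hypothetical chance each arm pays reward 1
theta = np.zeros(2)                # policy logits
baseline = 0.0                     # running-average reward baseline
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(3000):
    pi = softmax(theta)
    a = int(rng.choice(2, p=pi))            # sample action from the policy
    r = float(rng.random() < probs_true[a])  # observe reward
    advantage = r - baseline                 # centre the reward signal
    baseline += 0.05 * (r - baseline)
    grad_log = -pi                           # ∇_theta log π(a) for softmax:
    grad_log[a] += 1.0                       # one_hot(a) - π
    theta += lr * advantage * grad_log       # REINFORCE gradient ascent step

print(softmax(theta).round(3))  # policy concentrates on the better arm
```

The same score-function update, with a neural network producing the logits and a learned critic as the baseline, is the core of A2C and (with a clipped surrogate objective) PPO.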
RLHF — how ChatGPT/Claude are trained
Reinforcement Learning from Human Feedback (RLHF): (1) Supervised fine-tuning on high-quality demonstrations. (2) Train a reward model — humans rank multiple AI responses, reward model learns to predict human preference. (3) PPO optimises the LLM policy to maximise the reward model's score, subject to a KL constraint that prevents the policy from drifting too far from the SFT model. This is why LLMs follow instructions and are "helpful, harmless, and honest."
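Step (2) typically uses a Bradley-Terry style pairwise loss: the reward model should score the human-preferred response above the rejected one. A minimal sketch on toy scalar scores (the values are hypothetical, and real reward models compute these scores with a neural network):

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Small when the model already ranks the human-preferred response higher;
    large when the ranking is inverted.
    """
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(round(preference_loss(2.0, -1.0), 4))  # correct ranking -> small loss
print(round(preference_loss(-1.0, 2.0), 4))  # inverted ranking -> large loss
```

Minimising this loss over many human-ranked pairs is what turns raw rankings into the scalar reward signal that PPO then maximises in step (3).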
Practice questions
- Discount factor γ=0 vs γ=1 — what is the agent optimising in each case? (Answer: γ=0: myopic — only cares about immediate reward, ignores all future rewards. γ=1: far-sighted — treats all future rewards equally to immediate reward (no discounting). In practice: γ=0.95-0.99 for most tasks. γ<1 ensures the sum of infinite rewards is finite.)
- What is the Exploration-Exploitation dilemma? Give an example. (Answer: The agent must choose between exploiting the best known strategy (eating at your favourite restaurant) vs exploring new options that might be better (trying a new restaurant). Too much exploitation: never discovers better options. Too much exploration: wastes time on poor options. ε-greedy: explore randomly with probability ε.)
- Q-Learning update equation: Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]. What is the term in brackets? (Answer: The TD error (Temporal Difference error) — the difference between the current Q estimate and the Bellman target (r + γ max Q(s',a')). If TD error > 0: current Q is too low, increase it. TD error < 0: current Q is too high, decrease it.)
- Why does DQN use a separate target network for the Bellman target? (Answer: Without a separate target network, both the Q network and the target (max Q(s',a')) change at every update step — like chasing a moving target. This causes instability and divergence. The target network is frozen for C steps then updated, making the target more stable during training.)
- What is the difference between a model-free and model-based RL algorithm? (Answer: Model-free (Q-Learning, PPO): learns directly from experience without modelling the environment dynamics P(s'|s,a). Model-based (Dyna, AlphaZero): learns a model of environment dynamics, then uses it to plan or generate simulated experience. Model-based is more sample-efficient but requires accurate models.)
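The TD-error question above can be checked with plain arithmetic. A sketch on hypothetical toy numbers, following the update rule exactly:

```python
# TD error from the Q-Learning update: delta = r + gamma * max_a' Q(s',a') - Q(s,a)
gamma, alpha = 0.9, 0.5
Q_sa, Q_next_max, r = 2.0, 4.0, 1.0   # hypothetical current estimates and reward

delta = r + gamma * Q_next_max - Q_sa  # 1 + 3.6 - 2 = 2.6 > 0: estimate too low
Q_sa_new = Q_sa + alpha * delta        # 2 + 0.5 * 2.6 = 3.3, moved toward target
print(delta, Q_sa_new)
```

A positive TD error pulls the estimate up toward the Bellman target; a negative one pulls it down, with α controlling the step size.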
On LumiChats
ChatGPT uses PPO-based RLHF as a final training stage, and Claude is aligned with closely related preference-based methods. Understanding Q-Learning and policy gradients directly explains why LLMs follow instructions: they are policy networks trained to maximise a human preference reward signal.
Try it free