
Gradient Descent

How AI learns by minimizing mistakes.


Definition

Gradient descent is the core optimization algorithm used to train neural networks. It works by iteratively adjusting the model's parameters (weights) in the direction that most reduces the loss function — the measure of how wrong the model's predictions are. By following the gradient (the slope) of the loss downhill, the model gradually learns to make better predictions.

The intuition: walking downhill

Imagine you're blindfolded on a hilly landscape, trying to reach the lowest point. You can feel the slope under your feet. Gradient descent is exactly this: at each step, feel which direction is steepest downhill, and take a step that way.

The 'hills' are defined by the loss function — a mathematical surface over the parameter space. The lowest point is where the model makes the fewest mistakes. The gradient tells us which direction increases the loss most steeply — so we move in the opposite direction.

Gradient

The gradient ∇L(θ) is a vector pointing in the direction of steepest increase of the loss. Moving in the negative gradient direction is steepest descent.
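To make this concrete, here is a small sketch (the 2D loss L(θ) = θ₀² + 3·θ₁² is a made-up example) comparing the analytic gradient with a finite-difference estimate of the same slope:

```python
import numpy as np

def loss(theta):
    # Hypothetical 2D loss surface: L(theta) = theta_0**2 + 3 * theta_1**2
    return theta[0]**2 + 3 * theta[1]**2

def analytic_grad(theta):
    # Gradient (2*theta_0, 6*theta_1): direction of steepest increase
    return np.array([2 * theta[0], 6 * theta[1]])

def numeric_grad(theta, h=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * h)
    return g

theta = np.array([1.0, -2.0])
print(analytic_grad(theta))   # [  2. -12.]
print(numeric_grad(theta))    # matches the analytic gradient to ~6 decimals
```

The finite-difference check is also a standard way to test hand-written gradient code.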

The update rule

At each training step, we update each parameter θ by subtracting a fraction of the gradient:

θ ← θ − η ∇L(θ)

The gradient descent update rule: η (eta) is the learning rate, ∇L(θ) the gradient of the loss.

The learning rate η controls step size. Too large → overshoot the minimum and diverge. Too small → slow convergence. The learning rate is one of the most important hyperparameters to choose in training.
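As a sketch of this trade-off, take the toy loss f(x) = x²: the update x ← x − η·2x multiplies x by (1 − 2η) each step, so any η with |1 − 2η| ≥ 1 diverges (the specific values below are illustrative):

```python
def run(lr, steps=10, x=10.0):
    """Run gradient descent on f(x) = x**2 starting from x = 10."""
    for _ in range(steps):
        x = x - lr * (2 * x)   # gradient of x**2 is 2x
    return x

print(run(0.1))    # 1.0737... (converging)
print(run(0.01))   # 8.1707... (converging, but slowly)
print(run(1.1))    # 61.917..., and growing each step (diverging: |1 - 2*1.1| > 1)
```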

Vanilla gradient descent from scratch

import numpy as np

# Simple 1D example: minimize f(x) = x² (minimum at x=0)
def loss(x):
    return x ** 2

def gradient(x):
    return 2 * x   # derivative of x²

x = 10.0          # start far from minimum
lr = 0.1          # learning rate

print(f"Start: x={x:.4f}, loss={loss(x):.4f}")

for step in range(20):
    grad = gradient(x)
    x = x - lr * grad          # gradient descent update
    if step % 4 == 3:
        print(f"Step {step+1:2d}: x={x:.6f}, loss={loss(x):.6f}")

# Start: x=10.0000, loss=100.0000
# Step  4: x=4.096000, loss=16.777216
# Step  8: x=1.677722, loss=2.814750
# Step 12: x=0.687195, loss=0.472237
# Step 16: x=0.281475, loss=0.079228
# Step 20: x=0.115292, loss=0.013292  ← steadily approaching the minimum at x=0

Stochastic and Mini-batch Gradient Descent

Computing the gradient over the entire dataset (Batch GD) is too expensive for modern LLMs trained on trillions of tokens. In practice, we use mini-batches:

Variant              Batch size        Gradient quality   Speed
Batch GD             Full dataset      Exact              Very slow — one update per full pass
Stochastic GD (SGD)  1 sample          Noisy              Fast — but very noisy updates
Mini-batch GD        32–4096 samples   Good estimate      Best of both — used in all modern LLMs

Mini-batch GD is the standard. LLaMA 3 was trained with a batch size of ~4 million tokens per step. The noise in mini-batch gradients actually helps escape shallow local minima — an unexpected benefit.
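A minimal mini-batch SGD loop might look like the following sketch (the synthetic linear-regression data and hyperparameters are illustrative, not from any real training run):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 1 + noise
X = rng.uniform(-1, 1, size=(1000, 1))
y = 3 * X[:, 0] + 1 + 0.1 * rng.normal(size=1000)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32

for epoch in range(50):
    perm = rng.permutation(len(X))          # reshuffle each epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        xb, yb = X[idx, 0], y[idx]
        err = (w * xb + b) - yb
        # Gradients of mean squared error over this mini-batch only
        grad_w = 2 * np.mean(err * xb)
        grad_b = 2 * np.mean(err)
        w -= lr * grad_w
        b -= lr * grad_b

print(w, b)   # close to the true values 3 and 1
```

Each update sees only 32 of the 1000 samples, so its gradient is a noisy but cheap estimate of the full-batch gradient.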

Modern optimizers: Adam and beyond

Vanilla gradient descent is rarely used in practice. Modern optimizers add adaptive learning rates and momentum:

m_t = β₁ m_{t-1} + (1 − β₁) g_t — first moment (momentum): exponential moving average of gradients

v_t = β₂ v_{t-1} + (1 − β₂) g_t² — second moment: exponential moving average of squared gradients

θ_t = θ_{t-1} − η · m̂_t / (√v̂_t + ε) — update: learning rate scaled per-dimension by gradient variance, using the bias-corrected estimates m̂_t = m_t / (1 − β₁ᵗ) and v̂_t = v_t / (1 − β₂ᵗ)

Adam optimizer from scratch (simplified)

import numpy as np

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """
    One step of the Adam optimizer.

    params: dict of parameter arrays
    grads:  dict of gradient arrays (same keys)
    m, v:   first and second moment estimates (updated in place)
    t:      current timestep (for bias correction)
    """
    for key in params:
        g = grads[key]

        # Update biased moment estimates
        m[key] = beta1 * m[key] + (1 - beta1) * g
        v[key] = beta2 * v[key] + (1 - beta2) * g**2

        # Bias correction
        m_hat = m[key] / (1 - beta1**t)
        v_hat = v[key] / (1 - beta2**t)

        # Parameter update
        params[key] -= lr * m_hat / (np.sqrt(v_hat) + eps)

    return params, m, v

# Adam defaults: lr=1e-3, β1=0.9, β2=0.999, ε=1e-8
# Used in virtually every modern LLM training run
# (AdamW adds weight decay for regularization)

AdamW

Most LLMs use AdamW (Adam + Weight Decay), which decouples the weight decay from the gradient update. This provides better regularization and is standard in LLaMA, GPT, and Claude training.
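The decoupling can be sketched as follows: with classic L2 regularization the decay term is folded into the gradient before the moment estimates (and so gets rescaled by Adam's normalization), whereas AdamW applies it directly to the weights after the Adam step. This adamw_step is an illustrative simplification, not any framework's exact implementation:

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step: the usual Adam update plus decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)                            # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Adam step
    theta = theta - lr * weight_decay * theta             # decay applied to weights directly
    return theta, m, v

theta, m, v = np.ones(3), np.zeros(3), np.zeros(3)
theta, m, v = adamw_step(theta, np.zeros(3), m, v, t=1)
# Even with a zero gradient, the weights shrink slightly, in proportion to
# their magnitude; coupled L2 would lose that proportionality because Adam
# normalizes the (decay-carrying) gradient per-dimension.
```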

Learning rate schedules

A fixed learning rate is rarely optimal. Modern training uses schedules — the learning rate changes over training:

  • Warmup — start with a very small lr and linearly increase over the first 1-5% of training steps. Prevents instability when parameters are randomly initialized.
  • Cosine annealing — decrease lr following a cosine curve to near-zero by the end of training. Widely used: GPT-3, LLaMA, Claude.
  • Linear decay — simpler, similar results to cosine for many tasks.
  • Cyclic LR — oscillate between min and max lr, allowing the model to escape local minima periodically.

Cosine learning rate schedule with warmup (common in LLM training)

import numpy as np

def cosine_lr_schedule(step, total_steps, lr_max=3e-4, lr_min=3e-5, warmup_steps=2000):
    """
    Linear warmup → cosine decay.
    Used in GPT-3, LLaMA, and most modern LLM training runs.
    """
    if step < warmup_steps:
        # Linear warmup
        return lr_max * (step / warmup_steps)
    else:
        # Cosine annealing
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        cosine_decay = 0.5 * (1 + np.cos(np.pi * progress))
        return lr_min + (lr_max - lr_min) * cosine_decay

# Visualize the schedule
total = 100_000
steps = np.arange(total)
lrs   = [cosine_lr_schedule(s, total) for s in steps]

# At step 0:          lr = 0.000000 (warmup start)
# At step 2000:       lr = 0.000300 (warmup end / peak)
# At step 50000:      lr = 0.000169 (roughly halfway through decay)
# At step 100000:     lr = 0.000030 (final lr)

Practice questions

  1. What is the difference between convex and non-convex loss landscapes and why does it matter for gradient descent? (Answer: Convex: only one global minimum — gradient descent always converges to the optimal solution. Linear and logistic regression are convex. Non-convex: multiple local minima, saddle points, and flat regions. Neural networks are highly non-convex. Key insight from deep learning practice: most local minima in neural networks are approximately as good as the global minimum — the dangerous failure modes are saddle points and very flat regions where gradients vanish. SGD's noise actually helps navigate past saddle points.)
  2. What is momentum in gradient descent and what is the intuition for the exponential moving average? (Answer: Standard GD: θ ← θ - α∇L. Momentum: v ← β v + (1-β)∇L; θ ← θ - α v. The velocity v is an exponential moving average of past gradients — effectively smoothing the gradient direction. Intuition: a ball rolling down a hill builds up speed (velocity) in the consistent downhill direction, allowing it to overcome small bumps and navigate narrow ravines more smoothly than a ball that stops at each step. β=0.9 means 90% of previous velocity is kept.)
  3. What does it mean for gradient descent to converge and what are the convergence criteria? (Answer: Convergence: the loss function reaches (approximately) a minimum — subsequent gradient steps produce negligible change. Practical criteria: (1) Gradient norm ||∇L|| < ε (e.g., 10⁻⁴). (2) Loss change |L_t - L_{t-1}| < ε for N consecutive steps. (3) Validation metric stops improving for patience P epochs (early stopping). (4) Fixed iteration limit reached. For deep learning, convergence to a local minimum is sufficient — global optimum is generally unachievable and usually unnecessary.)
  4. What is the learning rate warm-up schedule and why is it critical for training large transformers? (Answer: Warm-up: LR increases linearly from 0 to target_LR over the first W steps. At training start, parameters are randomly initialised — gradient estimates are noisy and the loss landscape is unfamiliar. Large LR immediately causes destructive updates. After warm-up, the model has a better initial estimate of the loss landscape, gradients are more reliable, and larger steps are safe. BERT uses 10,000 warm-up steps; GPT-3 uses 375M token warm-up. Without warm-up, large transformer training frequently diverges in the first few thousand steps.)
  5. What is the saddle point problem and how does SGD's noise help escape it? (Answer: Saddle points: gradient is zero but it's not a minimum — flat in some directions, downward in others. In high dimensions, saddle points are much more common than local minima. Batch GD would get stuck (gradient = 0, no update). SGD noise: each mini-batch gives a noisy gradient estimate. At saddle points, noise perturbs the gradient in random directions — some perturbations point in downhill escape directions. SGD's noise is therefore beneficial at saddle points. Adam's adaptive learning rates also help by giving larger steps in flat directions.)
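The exponential-moving-average intuition from question 2 can be checked numerically. In this made-up sketch, the velocity v tracks a stream of noisy gradient samples whose true mean is 5:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9
v = 0.0

# Stream of noisy gradient samples with true mean 5
grads = 5.0 + rng.normal(scale=3.0, size=500)

for g in grads:
    v = beta * v + (1 - beta) * g   # momentum: exponential moving average

print(v)   # close to 5: the EMA has filtered out most of the noise
```

Individual samples swing by ±3 or more, yet v settles near the true mean, which is exactly the smoothing that lets momentum roll through small bumps in the loss surface.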
