
Learning Rate, SGD, Adam & Gradient Descent Variants

The mechanics of training — how fast you learn and which path you take to the minimum.


Definition

The learning rate controls how large a step gradient descent takes at each iteration. Too large, and training oscillates or diverges; too small, and it converges painfully slowly. SGD (Stochastic Gradient Descent) updates parameters using one example at a time; mini-batch SGD uses small batches. Momentum accumulates past gradients for smoother updates. Adam combines adaptive per-parameter learning rates with momentum and is the default optimiser for most deep learning. Choosing and tuning the optimiser is among the most impactful decisions in model training.
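The effect of the learning rate is easy to see on a toy one-dimensional problem. This is a minimal sketch (the quadratic, step count, and learning rates are illustrative choices, not from a real model):

```python
# Minimise f(w) = (w - 3)^2 with plain gradient descent.
# The gradient is f'(w) = 2 * (w - 3).
def descend(lr, steps=50, w0=0.0):
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)
        w = w - lr * grad        # the core update: step against the gradient
    return w

print(descend(lr=0.1))   # converges close to the minimum at w = 3
print(descend(lr=1.1))   # too large: each step overshoots and the iterates blow up
```

With lr=0.1 the iterates contract towards w = 3; with lr=1.1 each step overshoots by more than the previous error, so the sequence diverges.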

Real-life analogy: Walking down a mountain blindfolded

Imagine descending a mountain blindfolded, feeling only the slope under your feet. Gradient descent: always step in the steepest downhill direction. Learning rate: how large each step is. Too large = you might step over the valley into another hill. Too small = you take forever. Momentum: you build speed in consistent downhill directions, avoiding zig-zagging in narrow valleys. Adam: you automatically adjust your step size per direction — tiny steps in steep areas, larger steps in flat areas.

SGD and mini-batch gradient descent

| Variant | Batch size | Updates per epoch | Noise | Memory | Best for |
|---|---|---|---|---|---|
| Batch GD | All n examples | 1 | None (exact gradient) | High (needs full dataset) | Small datasets, convex problems |
| Stochastic GD (SGD) | 1 example | n | Very high | O(1) | Online learning, huge datasets |
| Mini-batch GD | 32–512 examples | n / batch_size | Moderate (beneficial) | Low | Standard deep learning (best balance) |
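The "updates per epoch" column can be checked with a from-scratch mini-batch loop. A sketch on synthetic data (the dataset, dimensions, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

def epoch_updates(batch_size, lr=0.1):
    """Run one epoch of mini-batch GD on MSE loss; return (update count, weights)."""
    w = np.zeros(3)
    idx = rng.permutation(len(X))
    updates = 0
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # MSE gradient on the batch
        w -= lr * grad
        updates += 1
    return updates, w

print(epoch_updates(batch_size=1000)[0])   # batch GD: 1 update per epoch
print(epoch_updates(batch_size=32)[0])     # mini-batch: ceil(1000/32) = 32 updates
print(epoch_updates(batch_size=1)[0])      # SGD: 1000 updates per epoch
```

Smaller batches mean more (noisier) updates per pass over the data, which is exactly the trade-off the table summarises.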

SGD variants comparison with PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Simple linear regression problem
torch.manual_seed(42)
X = torch.randn(1000, 5)
y = X @ torch.tensor([2., -1., 0.5, 3., -2.]) + 0.1 * torch.randn(1000)

model_sgd   = nn.Linear(5, 1)
model_adam  = nn.Linear(5, 1)
model_mom   = nn.Linear(5, 1)
loss_fn     = nn.MSELoss()

# Different optimisers
opt_sgd  = optim.SGD(model_sgd.parameters(),  lr=0.01)
opt_adam = optim.Adam(model_adam.parameters(), lr=0.001)   # Default β1=0.9, β2=0.999
opt_mom  = optim.SGD(model_mom.parameters(),  lr=0.01, momentum=0.9)

dataset = torch.utils.data.TensorDataset(X, y.unsqueeze(1))
loader  = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(10):
    losses = {'sgd': [], 'adam': [], 'momentum': []}
    for X_batch, y_batch in loader:
        for model, opt, name in [(model_sgd, opt_sgd, 'sgd'),
                                  (model_adam, opt_adam, 'adam'),
                                  (model_mom, opt_mom, 'momentum')]:
            opt.zero_grad()
            loss = loss_fn(model(X_batch), y_batch)
            loss.backward()
            opt.step()
            losses[name].append(loss.item())
    if epoch % 2 == 0:
        for name in losses:
            print(f"Epoch {epoch} {name}: {np.mean(losses[name]):.4f}")

Adam optimiser — the modern default

Adam maintains two exponential moving averages of the gradient g_t. The first moment m_t = β₁m_{t−1} + (1−β₁)g_t is the momentum term (default β₁ = 0.9); the second moment v_t = β₂v_{t−1} + (1−β₂)g_t² tracks the per-parameter gradient scale (default β₂ = 0.999). Both are bias-corrected: m̂_t = m_t/(1−β₁ᵗ) and v̂_t = v_t/(1−β₂ᵗ). The update is θ_t = θ_{t−1} − α·m̂_t/(√v̂_t + ε), so the effective learning rate α/(√v̂_t + ε) is large for parameters with small or rare gradients and small for parameters with large, frequent gradients.
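The moment updates and bias correction fit in a few lines. A minimal scalar sketch of the Adam update rule (the toy objective and iteration count are illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; returns the new (w, m, v). t starts at 1."""
    m = b1 * m + (1 - b1) * grad         # first moment (momentum)
    v = b2 * v + (1 - b2) * grad**2      # second moment (squared gradients)
    m_hat = m / (1 - b1**t)              # bias correction: both start at 0,
    v_hat = v / (1 - b2**t)              # so early estimates are scaled up
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimise f(w) = (w - 3)^2 with Adam.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t)
print(w)   # settles near the minimum at w = 3
```

Note that far from the minimum, m̂/√v̂ is close to ±1, so Adam's step size is roughly α regardless of the raw gradient magnitude; this is why the same default lr=1e-3 works across very different loss scales.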

Learning rate schedulers and warmup

import torch.optim as optim
from torch.optim.lr_scheduler import (StepLR, CosineAnnealingLR,
    ReduceLROnPlateau, OneCycleLR)

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# 1. Step decay: reduce LR by gamma every step_size epochs
scheduler_step = StepLR(optimizer, step_size=10, gamma=0.5)
# LR: epoch 0-9: 1e-3, epoch 10-19: 5e-4, epoch 20-29: 2.5e-4...

# 2. Cosine annealing: smoothly reduce LR to min over T_max epochs
scheduler_cos = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# 3. Reduce on plateau: reduce when metric stops improving
scheduler_plateau = ReduceLROnPlateau(optimizer, mode='min',
    factor=0.5, patience=5, min_lr=1e-7)

# 4. OneCycle: warm up then anneal (best for fast training)
scheduler_one = OneCycleLR(optimizer, max_lr=1e-2,
    steps_per_epoch=100, epochs=10)

# Training loop with a scheduler.
# Use ONE scheduler per optimizer: stepping several schedulers on the
# same optimizer compounds their decays.
for epoch in range(100):
    train_loss = train_one_epoch()   # Your training function
    val_loss = validate()            # Your validation function

    scheduler_plateau.step(val_loss)  # Plateau scheduler needs the monitored metric

    # For StepLR / CosineAnnealingLR, call scheduler.step() (no argument)
    # once per epoch instead. OneCycleLR is the exception: call its step()
    # after every batch, not every epoch.

    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.2e}, Loss = {train_loss:.4f}")
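Warmup (discussed below for transformers) is not a built-in scheduler, but it is easy to express with LambdaLR. A sketch of linear warmup followed by linear decay; the warmup length, total steps, and base LR here are illustrative:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 100
total_steps = 1000

def lr_lambda(step):
    # Multiplier on the base LR: ramp 0 -> 1 over warmup, then decay 1 -> 0.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

lrs = []
for step in range(total_steps):
    optimizer.step()       # (loss.backward() would precede this in real training)
    scheduler.step()       # per-step scheduling, like OneCycleLR
    lrs.append(optimizer.param_groups[0]['lr'])

print(max(lrs))   # peak = base LR (1e-3), reached right after warmup
```

The first updates use a near-zero learning rate, so the noisy early gradients cannot wreck the random initialisation.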

# Popular optimisers and when to use them:
# SGD + Momentum: often generalises best; the traditional choice for CV models
# Adam: fast convergence, default for NLP and transformers; mixed results on CV
# AdamW: Adam with decoupled weight decay (better generalisation); the usual
#   choice for training modern transformer LLMs
# RMSprop: good for RNNs; roughly Adam without the first moment / bias correction
# Adagrad: sparse data, adapts per parameter (but its LR shrinks towards zero)

Practice questions

  1. Learning rate of 10.0 vs 0.000001 — what happens with each? (Answer: LR=10: huge steps overshoot the minimum, oscillate wildly, may diverge (loss increases). LR=0.000001: infinitesimally small steps, learning is correct but takes millions of iterations to converge — impractically slow.)
  2. Why does SGD noise (from using one example at a time) sometimes help? (Answer: Noise helps escape local minima and saddle points — random perturbations can kick the optimiser out of flat regions. SGD noise also acts as implicit regularisation, often finding flatter minima that generalise better than the sharp minima that batch GD tends to find.)
  3. Adam uses β₁=0.9 and β₂=0.999. What do these hyperparameters control? (Answer: β₁=0.9: exponential decay rate for first moment (gradient momentum) — 90% of past gradients kept, 10% of current gradient. β₂=0.999: decay rate for second moment (gradient variance) — slow-moving estimate of per-parameter gradient squared. Higher = smoother, more history retained.)
  4. What is the difference between AdaGrad and Adam regarding learning rate decay? (Answer: AdaGrad accumulates all past squared gradients — learning rate shrinks monotonically and eventually reaches near-zero (learning stops). Adam uses exponential moving average of squared gradients — old information decays away, preventing the learning rate from shrinking to zero.)
  5. Why is learning rate warmup used in transformer training? (Answer: At the start of training, the model is randomly initialised — gradients are noisy and large. A large learning rate immediately would cause destructive updates. Warmup linearly increases LR from 0 to target over the first 1000-10000 steps, letting the model stabilise before taking large steps.)
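The AdaGrad-versus-Adam contrast in question 4 is easy to see numerically. A sketch feeding a constant gradient into both accumulator rules (the gradient value, step count, and base LR are illustrative):

```python
import numpy as np

lr, eps = 0.1, 1e-8
g = 1.0   # a constant unit gradient at every step

v_adagrad, v_adam = 0.0, 0.0
for t in range(1, 1001):
    v_adagrad += g**2                        # AdaGrad: sum of ALL squared gradients
    v_adam = 0.999 * v_adam + 0.001 * g**2   # Adam: exponential moving average

print(lr / (np.sqrt(v_adagrad) + eps))  # keeps shrinking as the sum grows
print(lr / (np.sqrt(v_adam) + eps))     # stabilises, since the EMA converges to g^2
```

After 1000 steps AdaGrad's effective step is already ~30x smaller than the base LR and still shrinking, while Adam's has settled near the base LR.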

On LumiChats

LumiChats can help you choose the right optimiser and learning rate for your specific model and dataset, debug slow convergence or loss spikes, and implement learning rate scheduling strategies in PyTorch or TensorFlow.

