
Learning Rate, SGD, Adam & Gradient Descent Variants

The mechanics of training — how fast you learn and which path you take to the minimum.


Definition

The learning rate controls how large a step gradient descent takes at each iteration. Too large, and training oscillates or diverges; too small, and it converges painfully slowly. SGD (Stochastic Gradient Descent) updates parameters using one example at a time; mini-batch SGD uses small batches. Momentum accumulates past gradients for smoother updates. Adam combines adaptive per-parameter learning rates with momentum and is the default optimiser for most deep learning. Choosing and tuning the optimiser is among the most impactful decisions in model training.
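The effect of the learning rate is easy to see on a toy one-dimensional problem. This is a minimal sketch (the quadratic, step count, and learning rates are illustrative choices, not from a real model):

```python
# Minimise f(w) = (w - 3)^2 with plain gradient descent.
# The gradient is f'(w) = 2 * (w - 3).
def descend(lr, steps=50, w0=0.0):
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)
        w = w - lr * grad        # the core update: step against the gradient
    return w

print(descend(lr=0.1))   # converges close to the minimum at w = 3
print(descend(lr=1.1))   # too large: each step overshoots and the iterates blow up
```

With lr=0.1 the iterates contract towards w = 3; with lr=1.1 each step overshoots by more than the previous error, so the sequence diverges.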

Real-life analogy: Walking down a mountain blindfolded

Imagine descending a mountain blindfolded, feeling only the slope under your feet. Gradient descent: always step in the steepest downhill direction. Learning rate: how large each step is. Too large = you might step over the valley into another hill. Too small = you take forever. Momentum: you build speed in consistent downhill directions, avoiding zig-zagging in narrow valleys. Adam: you automatically adjust your step size per direction — tiny steps in steep areas, larger steps in flat areas.

SGD and mini-batch gradient descent

| Variant | Batch size | Updates per epoch | Noise | Memory | Best for |
|---|---|---|---|---|---|
| Batch GD | All n examples | 1 | None (exact gradient) | High (needs full dataset) | Small datasets, convex problems |
| Stochastic GD (SGD) | 1 example | n | Very high | O(1) | Online learning, huge datasets |
| Mini-batch GD | 32–512 examples | n / batch_size | Moderate (beneficial) | Low | Standard deep learning (best balance) |
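The "updates per epoch" column can be checked with a from-scratch mini-batch loop. A sketch on synthetic data (the dataset, dimensions, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

def epoch_updates(batch_size, lr=0.1):
    """Run one epoch of mini-batch GD on MSE loss; return (update count, weights)."""
    w = np.zeros(3)
    idx = rng.permutation(len(X))
    updates = 0
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # MSE gradient on the batch
        w -= lr * grad
        updates += 1
    return updates, w

print(epoch_updates(batch_size=1000)[0])   # batch GD: 1 update per epoch
print(epoch_updates(batch_size=32)[0])     # mini-batch: ceil(1000/32) = 32 updates
print(epoch_updates(batch_size=1)[0])      # SGD: 1000 updates per epoch
```

Smaller batches mean more (noisier) updates per pass over the data, which is exactly the trade-off the table summarises.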

SGD variants comparison with PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Simple linear regression problem
torch.manual_seed(42)
X = torch.randn(1000, 5)
y = X @ torch.tensor([2., -1., 0.5, 3., -2.]) + 0.1 * torch.randn(1000)

model_sgd   = nn.Linear(5, 1)
model_adam  = nn.Linear(5, 1)
model_mom   = nn.Linear(5, 1)
loss_fn     = nn.MSELoss()

# Different optimisers
opt_sgd  = optim.SGD(model_sgd.parameters(),  lr=0.01)
opt_adam = optim.Adam(model_adam.parameters(), lr=0.001)   # Default β1=0.9, β2=0.999
opt_mom  = optim.SGD(model_mom.parameters(),  lr=0.01, momentum=0.9)

dataset = torch.utils.data.TensorDataset(X, y.unsqueeze(1))
loader  = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(10):
    losses = {'sgd': [], 'adam': [], 'momentum': []}
    for X_batch, y_batch in loader:
        for model, opt, name in [(model_sgd, opt_sgd, 'sgd'),
                                  (model_adam, opt_adam, 'adam'),
                                  (model_mom, opt_mom, 'momentum')]:
            opt.zero_grad()
            loss = loss_fn(model(X_batch), y_batch)
            loss.backward()
            opt.step()
            losses[name].append(loss.item())
    if epoch % 2 == 0:
        for name in losses:
            print(f"Epoch {epoch} {name}: {np.mean(losses[name]):.4f}")

Adam optimiser — the modern default

Adam maintains two exponential moving averages of the gradient g_t. The first moment m_t = β₁m_{t−1} + (1−β₁)g_t is the momentum term (default β₁ = 0.9); the second moment v_t = β₂v_{t−1} + (1−β₂)g_t² tracks the per-parameter gradient scale (default β₂ = 0.999). Both are bias-corrected: m̂_t = m_t/(1−β₁ᵗ) and v̂_t = v_t/(1−β₂ᵗ). The update is θ_t = θ_{t−1} − α·m̂_t/(√v̂_t + ε), so the effective learning rate α/(√v̂_t + ε) is large for parameters with small or rare gradients and small for parameters with large, frequent gradients.
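The moment updates and bias correction fit in a few lines. A minimal scalar sketch of the Adam update rule (the toy objective and iteration count are illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; returns the new (w, m, v). t starts at 1."""
    m = b1 * m + (1 - b1) * grad         # first moment (momentum)
    v = b2 * v + (1 - b2) * grad**2      # second moment (squared gradients)
    m_hat = m / (1 - b1**t)              # bias correction: both start at 0,
    v_hat = v / (1 - b2**t)              # so early estimates are scaled up
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimise f(w) = (w - 3)^2 with Adam.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t)
print(w)   # settles near the minimum at w = 3
```

Note that far from the minimum, m̂/√v̂ is close to ±1, so Adam's step size is roughly α regardless of the raw gradient magnitude; this is why the same default lr=1e-3 works across very different loss scales.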

Learning rate schedulers and warmup

import torch.optim as optim
from torch.optim.lr_scheduler import (StepLR, CosineAnnealingLR,
    ReduceLROnPlateau, OneCycleLR)

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# 1. Step decay: reduce LR by gamma every step_size epochs
scheduler_step = StepLR(optimizer, step_size=10, gamma=0.5)
# LR: epoch 0-9: 1e-3, epoch 10-19: 5e-4, epoch 20-29: 2.5e-4...

# 2. Cosine annealing: smoothly reduce LR to min over T_max epochs
scheduler_cos = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# 3. Reduce on plateau: reduce when metric stops improving
scheduler_plateau = ReduceLROnPlateau(optimizer, mode='min',
    factor=0.5, patience=5, min_lr=1e-7)

# 4. OneCycle: warm up then anneal (best for fast training)
scheduler_one = OneCycleLR(optimizer, max_lr=1e-2,
    steps_per_epoch=100, epochs=10)

# Training loop with a scheduler.
# Use ONE scheduler per optimizer: stepping several schedulers on the
# same optimizer compounds their decays.
for epoch in range(100):
    train_loss = train_one_epoch()   # Your training function
    val_loss = validate()            # Your validation function

    scheduler_plateau.step(val_loss)  # Plateau scheduler needs the monitored metric

    # For StepLR / CosineAnnealingLR, call scheduler.step() (no argument)
    # once per epoch instead. OneCycleLR is the exception: call its step()
    # after every batch, not every epoch.

    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.2e}, Loss = {train_loss:.4f}")
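Warmup (discussed below for transformers) is not a built-in scheduler, but it is easy to express with LambdaLR. A sketch of linear warmup followed by linear decay; the warmup length, total steps, and base LR here are illustrative:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 100
total_steps = 1000

def lr_lambda(step):
    # Multiplier on the base LR: ramp 0 -> 1 over warmup, then decay 1 -> 0.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

lrs = []
for step in range(total_steps):
    optimizer.step()       # (loss.backward() would precede this in real training)
    scheduler.step()       # per-step scheduling, like OneCycleLR
    lrs.append(optimizer.param_groups[0]['lr'])

print(max(lrs))   # peak = base LR (1e-3), reached right after warmup
```

The first updates use a near-zero learning rate, so the noisy early gradients cannot wreck the random initialisation.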

# Popular optimisers and when to use them:
# SGD + Momentum: often generalises best; the traditional choice for CV models
# Adam: fast convergence, default for NLP and transformers; mixed results on CV
# AdamW: Adam with decoupled weight decay (better generalisation); the usual
#   choice for training modern transformer LLMs
# RMSprop: good for RNNs; roughly Adam without the first moment / bias correction
# Adagrad: sparse data, adapts per parameter (but its LR shrinks towards zero)

Practice questions

  1. Learning rate of 10.0 vs 0.000001 — what happens with each? (Answer: LR=10: huge steps overshoot the minimum, oscillate wildly, may diverge (loss increases). LR=0.000001: infinitesimally small steps, learning is correct but takes millions of iterations to converge — impractically slow.)
  2. Why does SGD noise (from using one example at a time) sometimes help? (Answer: Noise helps escape local minima and saddle points — random perturbations can kick the optimiser out of flat regions. SGD noise also acts as implicit regularisation, often finding flatter minima that generalise better than the sharp minima that batch GD tends to find.)
  3. Adam uses β₁=0.9 and β₂=0.999. What do these hyperparameters control? (Answer: β₁=0.9: exponential decay rate for first moment (gradient momentum) — 90% of past gradients kept, 10% of current gradient. β₂=0.999: decay rate for second moment (gradient variance) — slow-moving estimate of per-parameter gradient squared. Higher = smoother, more history retained.)
  4. What is the difference between AdaGrad and Adam regarding learning rate decay? (Answer: AdaGrad accumulates all past squared gradients — learning rate shrinks monotonically and eventually reaches near-zero (learning stops). Adam uses exponential moving average of squared gradients — old information decays away, preventing the learning rate from shrinking to zero.)
  5. Why is learning rate warmup used in transformer training? (Answer: At the start of training, the model is randomly initialised — gradients are noisy and large. A large learning rate immediately would cause destructive updates. Warmup linearly increases LR from 0 to target over the first 1000-10000 steps, letting the model stabilise before taking large steps.)
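The AdaGrad-versus-Adam contrast in question 4 is easy to see numerically. A sketch feeding a constant gradient into both accumulator rules (the gradient value, step count, and base LR are illustrative):

```python
import numpy as np

lr, eps = 0.1, 1e-8
g = 1.0   # a constant unit gradient at every step

v_adagrad, v_adam = 0.0, 0.0
for t in range(1, 1001):
    v_adagrad += g**2                        # AdaGrad: sum of ALL squared gradients
    v_adam = 0.999 * v_adam + 0.001 * g**2   # Adam: exponential moving average

print(lr / (np.sqrt(v_adagrad) + eps))  # keeps shrinking as the sum grows
print(lr / (np.sqrt(v_adam) + eps))     # stabilises, since the EMA converges to g^2
```

After 1000 steps AdaGrad's effective step is already ~30x smaller than the base LR and still shrinking, while Adam's has settled near the base LR.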

On LumiChats

LumiChats can help you choose the right optimiser and learning rate for your specific model and dataset, debug slow convergence or loss spikes, and implement learning rate scheduling strategies in PyTorch or TensorFlow.

