
Training Dynamics — Batch Size, Epochs, Convergence & Loss Landscapes

Understanding how neural networks actually learn — the rhythm of training.


Definition

Training dynamics describes how a model's performance evolves during training. Key factors: batch size (how many examples per gradient update), epochs (how many times to pass over the full dataset), and learning rate scheduling (how to adjust step size over time). The loss landscape is the high-dimensional surface the optimizer navigates — understanding its geometry (flat minima, sharp minima, saddle points) explains why some training configs generalise better than others. Monitoring loss curves and gradient norms is essential for diagnosing training problems early.

Batch size, epochs, and effective training time

Training loop with monitoring, gradient accumulation, and scheduling

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.tensorboard import SummaryWriter
import numpy as np

# Synthetic regression data so the example runs end-to-end
X_all, y_all = torch.randn(1024, 10), torch.randn(1024, 1)
train_loader = DataLoader(TensorDataset(X_all[:896], y_all[:896]),
                          batch_size=32, shuffle=True)   # matches BATCH_SIZE below
val_loader   = DataLoader(TensorDataset(X_all[896:], y_all[896:]), batch_size=32)

model     = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
loss_fn   = nn.MSELoss()
writer    = SummaryWriter("runs/training_dynamics")

# ── Key training hyperparameters ──
BATCH_SIZE   = 32      # Gradients computed from 32 examples per step
ACCUM_STEPS  = 4       # Accumulate for 4 steps → effective batch = 128
MAX_EPOCHS   = 100
LR           = 1e-3

# Learning rate schedule: warmup + cosine decay.
# The scheduler steps once per *optimizer* update, i.e. every ACCUM_STEPS batches,
# so total_steps counts updates, not batches.
total_steps  = MAX_EPOCHS * (len(train_loader) // ACCUM_STEPS)
warmup_steps = total_steps // 10

from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# Warmup: LR linearly increases from 0 to LR for first 10% of steps
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
# Decay: LR cosine anneals from LR to 0 for remaining steps
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                          milestones=[warmup_steps])

train_losses, val_losses, grad_norms = [], [], []

for epoch in range(MAX_EPOCHS):
    model.train()
    epoch_loss = 0
    optimizer.zero_grad()   # Zero gradients at start of epoch

    for step, (X, y) in enumerate(train_loader):
        # Forward pass
        pred = model(X)
        loss = loss_fn(pred, y) / ACCUM_STEPS   # Scale loss for accumulation

        # Backward pass
        loss.backward()

        # Only update every ACCUM_STEPS steps (gradient accumulation)
        if (step + 1) % ACCUM_STEPS == 0:
            # Monitor gradient norm BEFORE clipping
            total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            grad_norms.append(total_norm.item())

            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        epoch_loss += loss.item() * ACCUM_STEPS

    # Validation
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(X_v), y_v).item() for X_v, y_v in val_loader)
        val_loss /= len(val_loader)

    avg_train_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_train_loss)
    val_losses.append(val_loss)

    # Log to TensorBoard
    writer.add_scalars("Loss", {"train": avg_train_loss, "val": val_loss}, epoch)
    writer.add_scalar("LR", optimizer.param_groups[0]["lr"], epoch)
    writer.add_scalar("GradNorm", np.mean(grad_norms[-(len(train_loader) // ACCUM_STEPS):]), epoch)

    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d}: train={avg_train_loss:.4f} val={val_loss:.4f} "
              f"lr={optimizer.param_groups[0]['lr']:.2e}")

# ── Diagnosing from loss curves (thresholds are illustrative and task-dependent) ──
gap = val_losses[-1] - train_losses[-1]
if train_losses[-1] > 0.3:
    print("Underfitting: train loss still high. Train longer or use larger model.")
elif gap > 0.2:
    print("Overfitting: large train-val gap. Add regularisation or more data.")
else:
    print("Good fit: small train-val gap with low loss.")

Loss landscapes and generalisation

Flat vs sharp minima: the loss landscape contains valleys (minima) of different shapes. Sharp minima are narrow valleys where tiny weight perturbations cause large increases in loss; models in sharp minima often overfit. Flat minima are wide valleys where small weight changes barely change the loss; models in flat minima tend to generalise better because the valley corresponds to a robust solution. Small batch sizes and the noise inherent in SGD naturally push optimisation toward flatter minima.
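One way to make "flat vs sharp" concrete is to perturb the weights with small random noise and watch how much the loss moves. A minimal sketch (the sharpness_probe helper and the sigma value are illustrative assumptions, not a standard API):

```python
import torch
import torch.nn as nn

def sharpness_probe(model, loss_fn, X, y, sigma=1e-2, n_trials=20):
    """Average loss change under small random weight perturbations.
    A larger value suggests the model sits in a sharper region."""
    originals = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        base = loss_fn(model(X), y).item()
        deltas = []
        for _ in range(n_trials):
            for p in model.parameters():
                p.add_(torch.randn_like(p) * sigma)   # perturb in place
            deltas.append(loss_fn(model(X), y).item() - base)
            for p, o in zip(model.parameters(), originals):
                p.copy_(o)                            # restore exactly
    return sum(deltas) / n_trials

torch.manual_seed(0)
X, y = torch.randn(256, 10), torch.randn(256, 1)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
print(f"mean loss change under sigma=0.01 noise: {sharpness_probe(model, nn.MSELoss(), X, y):+.6f}")
```

A model in a flat minimum reports a small mean change; a sharp minimum reports a large one. Comparing this number across training configs (e.g. different batch sizes) is a rough proxy for the flatness arguments above.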

| Scenario | Diagnosis | Fix |
| --- | --- | --- |
| Train ↓, Val ↓ (gap closing) | Both improving — normal training | Continue training, monitor for overfitting |
| Train ↓, Val → plateau (gap) | Overfitting begins | Regularise (dropout, weight decay), more data |
| Train ↓, Val ↑ (crossing) | Overfitting — past optimal | Early stopping; save model from before the crossing |
| Train →, Val → (both stuck) | Underfitting or LR too low | Increase LR, train longer, larger model |
| Train/Val oscillating wildly | LR too high | Reduce learning rate, add warmup |
| Loss → NaN | Exploding gradients or LR too high | Gradient clipping, reduce LR, check for inf in data |
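The table's rules can be sketched as a small rule-based checker. This is a hedged sketch — the `diagnose` helper, its thresholds, and the window size are illustrative assumptions, not universal constants:

```python
import math

def diagnose(train_losses, val_losses, window=5, eps=1e-3):
    # Reads the trend over the last `window` epochs, mirroring the table above.
    t, v = train_losses[-window:], val_losses[-window:]
    if any(math.isnan(x) or math.isinf(x) for x in t + v):
        return "diverged: exploding gradients or LR too high; clip gradients, lower LR"
    t_trend, v_trend = t[-1] - t[0], v[-1] - v[0]
    if t_trend < -eps and v_trend > eps:
        return "overfitting: val loss rising while train loss falls; stop early, regularise"
    if abs(t_trend) < eps and abs(v_trend) < eps:
        return "stuck: underfitting or LR too low; raise LR, train longer, or grow the model"
    if t_trend > eps:
        return "unstable: train loss climbing; LR likely too high, add warmup"
    return "healthy: both losses improving; keep training and watch the gap"

print(diagnose([1.0, 0.8, 0.6, 0.5, 0.4], [1.1, 0.9, 0.75, 0.65, 0.6]))
```

In a real run you would call this once per epoch on the recorded `train_losses` and `val_losses` lists from the training loop above.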

Practice questions

  1. You train for 100 epochs on a dataset of 10,000 examples with batch size 32. How many gradient updates total? (Answer: Steps per epoch = 10,000/32 = 312.5, i.e. 312 full batches if the last partial batch is dropped, 313 if it is kept. With drop_last, total updates = 312 × 100 = 31,200 gradient updates. Each update uses the average gradient over 32 examples.)
  2. Why do models in flat minima generalise better than models in sharp minima? (Answer: Flat minima correspond to parameter regions where the loss is insensitive to small perturbations. When the model is deployed on slightly different data (test set), the parameter perturbations it encounters do not cause large loss increases. Sharp minima are sensitive — small distribution shifts cause catastrophic performance drops.)
  3. gradient_accumulation_steps=4 with batch_size=8. What is the effective batch size? (Answer: Effective batch = 8 × 4 = 32. Gradients are computed for 8 examples per step, accumulated (summed) over 4 steps, then parameters are updated with the accumulated gradient — mathematically equivalent to computing the gradient over 32 examples at once.)
  4. Loss is 0.001 on training set but 2.5 on validation set. What is the issue and what are three fixes? (Answer: Severe overfitting. Fixes: (1) Reduce model complexity (fewer layers/neurons). (2) Add regularisation (dropout, L2 weight decay). (3) Get more training data or use augmentation. (4) Early stopping — save model from earlier epoch when val loss was lower. (5) Use LoRA for LLMs — fewer trainable params = less overfitting.)
  5. What does a warmup schedule do and why is it used for LLM fine-tuning? (Answer: Warmup linearly increases LR from near-0 to target LR over the first 5-10% of training steps. At the start of fine-tuning, weights are not calibrated for the new task — large LR immediately would cause destructive updates. Warmup lets the model stabilise before taking large steps. Especially important for large LLMs where the pretrained weights contain valuable knowledge that must be preserved.)
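The arithmetic in questions 1 and 3 can be checked in a few lines (assuming the last partial batch is dropped):

```python
dataset_size, batch_size, epochs = 10_000, 32, 100
steps_per_epoch = dataset_size // batch_size        # 312 full batches per epoch (drop_last)
total_updates   = steps_per_epoch * epochs          # 31,200 gradient updates
micro_batch, accum_steps = 8, 4
effective_batch = micro_batch * accum_steps         # 32 examples per optimizer update
print(steps_per_epoch, total_updates, effective_batch)
```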

On LumiChats

Every LLM fine-tuning run (including the Unsloth notebooks) needs careful monitoring of loss curves. LumiChats can help you diagnose training problems: paste your training logs and ask 'Why is my validation loss increasing?' or 'Is my model overfitting?' — with specific actionable fixes.

