Dropout and Batch Normalization are two of the most important training techniques in deep learning. Dropout is a regularization method that randomly deactivates neurons during training to prevent overfitting. Batch Normalization normalizes layer inputs during training to stabilize optimization and reduce sensitivity to initialization and learning rate.
Dropout: training an ensemble implicitly
Dropout (Srivastava et al., 2014): during each training step, randomly set each neuron's output to zero with probability p (typically 0.1–0.5). At test time, all neurons are active but outputs are scaled by (1−p) to preserve expected activation magnitude.
Inverted dropout (used in practice): scale active units by 1/(1-p) during training so no scaling is needed at inference. At test time the full network is used unchanged.
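The inverted-dropout rule above can be sketched in a few lines of numpy (a minimal illustration, not PyTorch's actual implementation — the function name and shapes are chosen for this example):

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(x, p, training=True):
    """Inverted dropout: zero each unit with prob p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x  # inference: identity, no rescaling needed
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1-p
    return x * mask / (1.0 - p)       # rescale so the expected output equals x

x = np.ones(100_000)
y = inverted_dropout(x, p=0.3)
print(y.mean())  # ≈ 1.0 — expected activation magnitude is preserved
```

Because the scaling happens during training, the inference path is a plain identity, which is exactly why `model.eval()` in PyTorch needs no weight rescaling.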
Implicit ensemble intuition
With n dropout-capable neurons, there are 2ⁿ possible subnetworks. Each training step samples one. At test time, running the full network with scaled weights approximates averaging over all 2ⁿ subnetworks simultaneously — cheap ensemble learning. This is why dropout is so effective as regularization.
Batch Normalization (BatchNorm)
BatchNorm (Ioffe & Szegedy, 2015): normalize each feature's activations across the current mini-batch to zero mean and unit variance, then apply a learned scale and shift:

x̂ = (x − μ_B) / √(σ²_B + ε),  y = γ·x̂ + β

μ_B and σ²_B are computed over the batch. γ and β are learned per feature, so the network can undo the normalization if that helps. ε ≈ 1e-5 prevents division by zero.
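The training-mode computation can be verified by hand in numpy (a sketch of the forward pass only — running statistics and the backward pass are omitted, and the function name is illustrative):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm: normalize each feature over the batch axis."""
    mu = x.mean(axis=0)                  # per-feature batch mean μ_B
    var = x.var(axis=0)                  # per-feature batch variance σ²_B
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # learned scale γ and shift β

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 16))  # batch of 64 samples, 16 features
y = batchnorm_forward(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=0))  # ≈ 0 for every feature
print(y.std(axis=0))   # ≈ 1 for every feature
```

At eval time, PyTorch's `nn.BatchNorm*` layers substitute running estimates of μ and σ² for the batch statistics, which is the train/eval difference noted below.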
BatchNorm vs LayerNorm in PyTorch — key differences
```python
import torch
import torch.nn as nn

# BatchNorm: normalizes over batch + spatial dims — standard in CNNs
# Note: behavior differs at train vs eval time (uses running stats at eval)
bn = nn.BatchNorm2d(num_features=64)     # for conv feature maps
bn1d = nn.BatchNorm1d(num_features=256)  # for linear layer outputs

# LayerNorm: normalizes over feature dims per sample — standard in Transformers
# Same behavior at train and eval — no batch dependency
ln = nn.LayerNorm(normalized_shape=768)  # normalize a 768-dim hidden state

# RMSNorm (used in LLaMA, Mistral) — simpler, no mean centering
# Not built into older PyTorch releases — commonly implemented by hand:
class RMSNorm(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.norm(2, dim=-1, keepdim=True) / (x.shape[-1] ** 0.5)
        return self.weight * x / (rms + 1e-8)
```

Layer Normalization vs Batch Normalization
| Property | BatchNorm | LayerNorm | RMSNorm |
|---|---|---|---|
| Normalizes over | Batch + spatial dims | Feature dims per sample | Feature dims (no mean) |
| Works with batch_size=1 | ❌ Unstable | ✅ Yes | ✅ Yes |
| Works with variable seq len | ❌ Problematic | ✅ Yes | ✅ Yes |
| Train vs eval behavior | Different (running stats) | Same | Same |
| Used in | CNNs (ResNet, EfficientNet) | Transformers (BERT, GPT-2) | Modern LLMs (LLaMA, Mistral, Qwen) |
| Parameters | 2 per feature (γ, β) | 2 per feature (γ, β) | 1 per feature (weight only) |
Why LLMs switched to RMSNorm
RMSNorm drops the mean-centering step and simply divides by the root-mean-square of the activations. Experiments show this performs as well as full LayerNorm while being roughly 15% faster. LLaMA, Mistral, Qwen, and most recent open-weight LLMs use RMSNorm.
Dropout variants and modern practices
| Variant | What it drops | Best for |
|---|---|---|
| Standard dropout | Individual neurons (p=0.1–0.5) | Dense layers, MLP blocks |
| Attention dropout | Attention weight entries | Transformer attention layers |
| Spatial dropout | Entire feature maps (channels) | CNN layers — preserves spatial structure |
| Stochastic depth | Entire residual blocks | Deep ResNets, ViTs — drops full layers with probability p |
| DropConnect | Individual weight connections | Rarely used in practice |
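Spatial dropout from the table above can be sketched in numpy — the point is that the random mask has one value per channel, broadcast over all spatial positions (the function name and tensor shapes are illustrative, not a library API):

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_dropout(x, p):
    """Drop entire channels of an (N, C, H, W) tensor, with inverted-dropout scaling."""
    n, c = x.shape[:2]
    keep = rng.random((n, c, 1, 1)) >= p  # one mask value per (sample, channel)
    return x * keep / (1.0 - p)           # mask broadcasts over H and W

x = np.ones((2, 8, 4, 4))
y = spatial_dropout(x, p=0.25)
# every channel is either entirely zero or uniformly scaled by 1/(1-p)
```

In PyTorch the same behavior is provided by `nn.Dropout2d`.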
Modern LLM practice
Most 2024–2025 LLMs (LLaMA 3, Mistral, Qwen 2) use minimal dropout (p=0.0 or 0.1) in attention, relying instead on weight decay (AdamW), gradient clipping, and careful data curation for regularization. Large training datasets make heavy dropout less necessary.
Weight initialization strategies
Proper weight initialization is critical — too small causes vanishing gradients, too large causes exploding gradients. The right strategy depends on the activation function:
| Initialization | Formula | Designed for | When to use |
|---|---|---|---|
| Zero | All weights = 0 | — | Never — all neurons compute identically (symmetry breaking failure) |
| Random small | N(0, 0.01) | Shallow nets | Vanishes in deep networks |
| Xavier / Glorot | N(0, 2/(n_in + n_out)) | tanh, sigmoid | Keeps variance constant — standard for encoders |
| He / Kaiming | N(0, 2/n_in) | ReLU, Leaky ReLU | Accounts for ReLU zeroing half inputs — default for CNNs and MLPs |
| Orthogonal | Random orthogonal matrix | RNNs, deep residual | Preserves gradient norms — useful for very deep networks |
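The variance-preservation claim behind He init can be checked with a toy numpy simulation (a sketch under simplifying assumptions: square layers, no biases, fresh Gaussian weights each layer):

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch, depth = 512, 1024, 20

x = rng.normal(size=(batch, n))
for _ in range(depth):
    # He/Kaiming: Var(W) = 2/n_in compensates for ReLU zeroing half the pre-activations
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
    x = np.maximum(x @ W, 0.0)

rms = np.sqrt((x ** 2).mean())
print(rms)  # stays near 1 through 20 layers instead of vanishing or exploding
```

Replacing the factor 2 with 1 (Xavier-style variance under ReLU) makes the activations shrink by roughly √2 per layer, which is the vanishing-signal failure mode the table warns about.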
Default for transformers
Most Transformer implementations (GPT, BERT, LLaMA) use small normal N(0, 0.02) for linear weights and zero for biases. Some use scaled initialization: divide by √(2 × n_layers) for residual branch projections to prevent residual stream explosion at initialization.
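Why the √(2 × n_layers) factor matters can be seen in a toy residual-stream simulation (a simplified sketch: each "block" is just a random linear projection added back to the stream, with the illustrative constants d=512 and 96 blocks — not an actual transformer):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 512, 96

def final_std(scale):
    x = rng.normal(size=d)
    for _ in range(n_layers):
        # residual branch projection, base std 0.02, optionally down-scaled
        W = rng.normal(0.0, 0.02 * scale, size=(d, d))
        x = x + W @ x  # simplified residual update: stream += branch(stream)
    return x.std()

plain = final_std(1.0)
scaled = final_std(1.0 / np.sqrt(2 * n_layers))
print(plain, scaled)  # unscaled stream blows up; scaled stream stays O(1)
```

Each unscaled block multiplies the stream's second moment by roughly (1 + d·σ²), so the growth compounds exponentially with depth; dividing σ by √(2·n_layers) keeps the total accumulated variance bounded at initialization.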
Practice questions
- What is the difference between dropout during training and dropout during inference? (Answer: Training: randomly set p fraction of neurons to 0 each forward pass (inverted dropout: scale remaining neurons by 1/(1-p) during training). Each mini-batch uses a different random sparse subnetwork. Inference: disable dropout entirely — all neurons active. No random masking. If using inverted dropout (PyTorch default), weights are already properly scaled so no additional scaling is needed at inference. Critical: always call model.eval() before evaluation to disable dropout.)
- What is MC Dropout (Monte Carlo Dropout) and what problem does it solve? (Answer: MC Dropout (Gal & Ghahramani 2016): keep dropout active during inference, run N forward passes (e.g., 100), compute mean and variance of predictions. Mean ≈ the model's best estimate. Variance ≈ model uncertainty. Provides Bayesian approximation to deep learning inference — uncertainty quantification without expensive Bayesian training. Use case: medical diagnosis where knowing 'the model is uncertain here' is as important as the prediction itself. Works with any dropout-trained model.)
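The MC Dropout recipe from the answer above can be sketched with a toy one-hidden-layer regressor in numpy (purely illustrative weights and shapes; the key line is that the dropout mask stays active at inference):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "dropout-trained" regressor: 1 input -> 64 hidden (ReLU) -> 1 output
W1 = rng.normal(size=(1, 64))
W2 = rng.normal(size=(64, 1)) / 8.0

def stochastic_forward(x, p=0.2):
    h = np.maximum(x @ W1, 0.0)
    h = h * (rng.random(h.shape) >= p) / (1.0 - p)  # dropout stays ON at inference
    return h @ W2

x = np.array([[0.5]])
preds = np.array([stochastic_forward(x) for _ in range(100)])  # N=100 passes
print(preds.mean())  # ≈ the model's best estimate
print(preds.std())   # spread across passes ≈ model uncertainty
```

In PyTorch the equivalent trick is calling `model.train()` (or re-enabling only the dropout modules) before the repeated forward passes, then aggregating the outputs.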
- Spatial dropout in CNNs drops entire feature maps rather than individual neurons. Why? (Answer: In convolutional layers, adjacent neurons (pixels in a feature map) are spatially correlated — dropping one neuron while its neighbours are active provides minimal regularisation. Spatial dropout drops entire feature maps (channels) — forcing the network to learn that each feature map is replaceable, promoting robustness. Particularly effective in image segmentation models (U-Net uses spatial dropout) where spatial correlation is very high.)
- What is the relationship between dropout and ensemble learning? (Answer: Training with dropout trains an exponential ensemble of 2^n different subnetworks (n = number of droppable neurons). Each forward pass uses a different random architecture. Inference without dropout is approximate averaging over this ensemble (weight sharing across all subnetworks). This is why dropout generalises well: it is equivalent to training many different models simultaneously with shared weights — the ensemble effect reduces variance without the cost of training separate models.)
- At what dropout rate should you stop adding dropout to your model? (Answer: Too high dropout (p > 0.5 on hidden layers): underfitting — too many neurons disabled each pass, model loses too much information. Too low (p < 0.1): minimal regularisation benefit. Practical guidelines: p=0.5 for large fully connected layers (Hinton's original recommendation). p=0.2–0.3 for smaller layers or when data is plentiful. p=0.0–0.1 on convolutional layers (spatial dropout preferred). For transformers: p=0.1 on attention weights. If training loss >> validation loss: reduce dropout or remove it.)