Dropout & Batch Normalization
Dropout and Batch Normalization are two of the most important training techniques in deep learning. Dropout is a regularization method that randomly deactivates neurons during training to prevent overfitting. Batch Normalization normalizes layer inputs during training to stabilize optimization and reduce sensitivity to initialization and learning rate.
Key training techniques that make deep networks work reliably.
Category: Deep Learning & Neural Networks
Dropout: training an ensemble implicitly
Dropout (Srivastava et al., 2014): during each training step, randomly set each neuron's output to zero with probability p (typically 0.1–0.5). In the original formulation, all neurons are active at test time and outputs are scaled by the keep probability (1−p) to preserve the expected activation magnitude; the equivalent "inverted dropout" used by modern frameworks (including PyTorch) instead scales surviving activations by 1/(1−p) during training, so no test-time scaling is needed:
\tilde{h}_i = \begin{cases} 0 & \text{with probability } p \\ h_i / (1-p) & \text{with probability } 1-p \end{cases}
Implicit ensemble intuition: With n dropout-capable neurons, there are 2ⁿ possible subnetworks. Each training step samples one. At test time, running the full network with scaled weights approximates averaging over all 2ⁿ subnetworks simultaneously — cheap ensemble learning. This is why dropout is so effective as regularization.
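A minimal PyTorch sketch of the training/test behavior described above (layer width and dropout rate are illustrative):
import torch
import torch.nn as nn
torch.manual_seed(0)
drop = nn.Dropout(p=0.5)      # inverted dropout: survivors scaled by 1/(1-p) in train mode
x = torch.ones(1, 8)
drop.train()
print(drop(x))                # mix of 0.0 and 2.0 (= 1 / (1 - 0.5)); the mask changes every call
drop.eval()
print(drop(x))                # identity: all ones, no scaling needed at inference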
Batch Normalization (BatchNorm)
BatchNorm (Ioffe & Szegedy, 2015): normalize each feature's activations across the current mini-batch to zero mean and unit variance, then apply learned scale and shift:
\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma^2_\mathcal{B} + \varepsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta
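A minimal sketch (shapes are illustrative) connecting this formula to PyTorch's module: in training mode, nn.BatchNorm1d computes μ_B and σ²_B per feature over the mini-batch, and γ=1, β=0 at initialization, so the hand-computed version matches:
import torch
import torch.nn as nn
x = torch.randn(32, 4)                            # batch of 32 samples, 4 features
bn = nn.BatchNorm1d(4)                            # gamma initialized to 1, beta to 0
bn.train()
manual = (x - x.mean(dim=0)) / torch.sqrt(x.var(dim=0, unbiased=False) + bn.eps)
print(torch.allclose(bn(x), manual, atol=1e-6))   # True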
import torch
import torch.nn as nn
# BatchNorm: normalizes over batch + spatial dims — standard in CNNs
# Note: behavior differs at train vs eval time (uses running stats at eval)
bn = nn.BatchNorm2d(num_features=64) # for conv feature maps
bn1d = nn.BatchNorm1d(num_features=256) # for linear layer outputs
# LayerNorm: normalizes over feature dims per sample — standard in Transformers
# Same behavior at train and eval — no batch dependency
ln = nn.LayerNorm(normalized_shape=768) # normalize 768-dim hidden state
# RMSNorm (used in LLaMA, Mistral) — simpler, no mean centering
# Not in standard PyTorch — commonly implemented as:
class RMSNorm(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature scale
    def forward(self, x):
        # RMS over the feature dimension: sqrt(mean(x^2))
        rms = x.norm(2, dim=-1, keepdim=True) / (x.shape[-1] ** 0.5)
        return self.weight * x / (rms + 1e-8)
Layer Normalization vs Batch Normalization
| Property | BatchNorm | LayerNorm | RMSNorm |
|---|---|---|---|
| Normalizes over | Batch + spatial dims | Feature dims per sample | Feature dims (no mean) |
| Works with batch_size=1 | ❌ Unstable | ✅ Yes | ✅ Yes |
| Works with variable seq len | ❌ Problematic | ✅ Yes | ✅ Yes |
| Train vs eval behavior | Different (running stats) | Same | Same |
| Used in | CNNs (ResNet, EfficientNet) | Transformers (BERT, GPT-2) | Modern LLMs (LLaMA, Mistral, Qwen) |
| Parameters | 2 per feature (γ, β) | 2 per feature (γ, β) | 1 per feature (weight only) |
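A small sketch of the batch-dependence rows above (sizes are illustrative): LayerNorm handles a batch of one identically in train and eval mode, while BatchNorm1d rejects it during training and falls back to running statistics in eval mode.
import torch
import torch.nn as nn
x = torch.randn(1, 256)                 # a single sample
ln = nn.LayerNorm(256)
print(ln(x).shape)                      # fine: normalization is per sample
bn = nn.BatchNorm1d(256)
bn.train()
try:
    bn(x)                               # batch statistics are undefined for one sample
except ValueError as err:
    print("BatchNorm (train):", err)
bn.eval()
print(bn(x).shape)                      # fine in eval: uses running statistics instead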
Why LLMs switched to RMSNorm: RMSNorm drops the mean-centering step (just divides by RMS). Experiments show this performs as well as full LayerNorm while being ~15% faster. LLaMA, Mistral, Falcon, and most 2023–2025 LLMs use RMSNorm.
Dropout variants and modern practices
| Variant | What it drops | Best for |
|---|---|---|
| Standard dropout | Individual neurons (p=0.1–0.5) | Dense layers, MLP blocks |
| Attention dropout | Attention weight entries | Transformer attention layers |
| Spatial dropout | Entire feature maps (channels) | CNN layers — preserves spatial structure |
| Stochastic depth | Entire residual blocks | Deep ResNets, ViTs — drops full layers with probability p |
| DropConnect | Individual weight connections | Rarely used in practice |
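Two of these variants sketched in PyTorch (shapes and rates are illustrative): nn.Dropout2d drops entire channels (spatial dropout), and stochastic depth can be written as a residual wrapper that skips its whole block with probability p during training, with the usual inverted rescaling when the block is kept.
import torch
import torch.nn as nn
# Spatial dropout: nn.Dropout2d zeroes whole feature maps (channels)
spatial_drop = nn.Dropout2d(p=0.2)
fmap = torch.randn(8, 64, 32, 32)                 # (batch, channels, H, W)
out = spatial_drop(fmap)                          # roughly 20% of the 64 channels zeroed per sample
# Stochastic depth: drop an entire residual block with probability p_drop (train only)
class StochasticDepthBlock(nn.Module):
    def __init__(self, block, p_drop=0.1):
        super().__init__()
        self.block, self.p_drop = block, p_drop
    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p_drop:
                return x                          # skip the block entirely
            return x + self.block(x) / (1.0 - self.p_drop)  # rescale surviving blocks
        return x + self.block(x)                  # eval: block always applied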
Modern LLM practice: Most 2024–2025 LLMs (LLaMA 3, Mistral, Qwen 2) use minimal dropout (p=0.0 or 0.1) in attention, relying instead on weight decay (AdamW), gradient clipping, and careful data curation for regularization. Large training datasets make heavy dropout less necessary.
Weight initialization strategies
Proper weight initialization is critical — too small causes vanishing gradients, too large causes exploding gradients. The right strategy depends on the activation function:
| Initialization | Formula | Designed for | When to use |
|---|---|---|---|
| Zero | All weights = 0 | — | Never — all neurons compute identically (symmetry breaking failure) |
| Random small | N(0, 0.01) | Shallow nets | Shallow networks only; activations vanish with depth |
| Xavier / Glorot | N(0, 2/(n_in + n_out)) | tanh, sigmoid | Keeps variance constant — standard for encoders |
| He / Kaiming | N(0, 2/n_in) | ReLU, Leaky ReLU | Accounts for ReLU zeroing half inputs — default for CNNs and MLPs |
| Orthogonal | Random orthogonal matrix | RNNs, deep residual | Preserves gradient norms — useful for very deep networks |
Default for transformers: Most Transformer implementations (GPT, BERT, LLaMA) use small normal N(0, 0.02) for linear weights and zero for biases. Some use scaled initialization: divide by √(2 × n_layers) for residual branch projections to prevent residual stream explosion at initialization.
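A sketch of these initializers with torch.nn.init (layer sizes, layer count, and the residual-scaling constant follow the GPT-style convention described above; names are illustrative):
import math
import torch.nn as nn
n_layers = 12
linear = nn.Linear(768, 768)
nn.init.xavier_normal_(linear.weight)                        # Xavier/Glorot: tanh/sigmoid nets
nn.init.kaiming_normal_(linear.weight, nonlinearity='relu')  # He/Kaiming: ReLU nets (overwrites the above)
# Transformer default: N(0, 0.02) weights, zero biases;
# residual-branch projections additionally divided by sqrt(2 * n_layers)
nn.init.normal_(linear.weight, mean=0.0, std=0.02)
nn.init.zeros_(linear.bias)
resid_proj = nn.Linear(768, 768)
nn.init.normal_(resid_proj.weight, mean=0.0, std=0.02 / math.sqrt(2 * n_layers))
nn.init.zeros_(resid_proj.bias)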
Practice questions
- What is the difference between dropout during training and dropout during inference? (Answer: Training: randomly set a p fraction of neurons to 0 each forward pass (inverted dropout: scale remaining neurons by 1/(1-p) during training). Each mini-batch uses a different random sparse subnetwork. Inference: disable dropout entirely — all neurons active, no random masking. If using inverted dropout (PyTorch default), activations were already rescaled during training, so no additional scaling is needed at inference. Critical: always call model.eval() before evaluation to disable dropout.)
- What is MC Dropout (Monte Carlo Dropout) and what problem does it solve? (Answer: MC Dropout (Gal & Ghahramani 2016): keep dropout active during inference, run N forward passes (e.g., 100), compute mean and variance of predictions. Mean ≈ the model's best estimate. Variance ≈ model uncertainty. Provides a Bayesian approximation to deep learning inference — uncertainty quantification without expensive Bayesian training. Use case: medical diagnosis, where knowing 'the model is uncertain here' is as important as the prediction itself. Works with any dropout-trained model; see the sketch after these questions.)
- Spatial dropout in CNNs drops entire feature maps rather than individual neurons. Why? (Answer: In convolutional layers, adjacent neurons (pixels in a feature map) are spatially correlated — dropping one neuron while its neighbors are active provides minimal regularization. Spatial dropout drops entire feature maps (channels) — forcing the network to learn that each feature map is replaceable, promoting robustness. Particularly effective in image segmentation models (U-Net uses spatial dropout) where spatial correlation is very high.)
- What is the relationship between dropout and ensemble learning? (Answer: Training with dropout trains an exponential ensemble of 2^n different subnetworks (n = number of droppable neurons). Each forward pass uses a different random architecture. Inference without dropout is approximate averaging over this ensemble (weight sharing across all subnetworks). This is why dropout generalizes well: it is equivalent to training many different models simultaneously with shared weights — the ensemble effect reduces variance without the cost of training separate models.)
- How do you choose the dropout rate, and when is it too high or too low? (Answer: Too high (p > 0.5 on hidden layers): underfitting — too many neurons disabled each pass, the model loses too much information. Too low (p < 0.1): minimal regularization benefit. Practical guidelines: p=0.5 for large fully connected layers (Hinton's original recommendation). p=0.2–0.3 for smaller layers or when data is plentiful. p=0.0–0.1 on convolutional layers (spatial dropout preferred). For transformers: p=0.1 on attention weights. If training loss >> validation loss: reduce dropout or remove it.)
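A minimal MC Dropout sketch for the question above (the model, dropout rate, and number of passes are illustrative): keep only the dropout layers stochastic at inference and aggregate repeated predictions.
import torch
import torch.nn as nn
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 1))
model.eval()                              # everything in eval mode...
for m in model.modules():                 # ...except dropout, which stays stochastic
    if isinstance(m, nn.Dropout):
        m.train()
x = torch.randn(8, 16)
with torch.no_grad():
    preds = torch.stack([model(x) for _ in range(100)])  # 100 stochastic forward passes
mean, var = preds.mean(dim=0), preds.var(dim=0)          # prediction and uncertainty estimate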