
Dropout & Batch Normalization

Key training techniques that make deep networks work reliably.


Definition

Dropout and Batch Normalization are two of the most important training techniques in deep learning. Dropout is a regularization method that randomly deactivates neurons during training to prevent overfitting. Batch Normalization normalizes layer inputs during training to stabilize optimization and reduce sensitivity to initialization and learning rate.

Dropout: training an ensemble implicitly

Dropout (Srivastava et al., 2014): during each training step, randomly set each neuron's output to zero with probability p (typically 0.1–0.5). At test time, all neurons are active but outputs are scaled by (1−p) to preserve expected activation magnitude.

Inverted dropout (used in practice): scale active units by 1/(1-p) during training so no scaling is needed at inference. At test time the full network is used unchanged.
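A minimal NumPy sketch of the train/test asymmetry (illustrative only; PyTorch's nn.Dropout implements the same inverted scheme internally):

```python
import numpy as np

def inverted_dropout(x, p, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p and scale
    survivors by 1/(1-p) during training, so inference is a no-op."""
    if not training or p == 0.0:
        return x  # test time: full network, no scaling needed
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p      # keep each unit with prob 1-p
    return x * mask / (1.0 - p)          # rescale to preserve E[output]
```

Because survivors are rescaled during training, the expected activation magnitude matches the dropout-free network, which is why no correction is needed at inference.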

Implicit ensemble intuition

With n dropout-capable neurons, there are 2ⁿ possible subnetworks. Each training step samples one. At test time, running the full network with scaled weights approximates averaging over all 2ⁿ subnetworks simultaneously — cheap ensemble learning. This is why dropout is so effective as regularization.

Batch Normalization (BatchNorm)

BatchNorm (Ioffe & Szegedy, 2015): normalize each feature's activations across the current mini-batch to zero mean and unit variance, then apply a learned scale and shift:

x̂ = (x − μ_B) / √(σ²_B + ε),   y = γ·x̂ + β

μ_B and σ²_B are the mean and variance computed over the batch. γ and β are learned per feature — the network can undo the normalization if needed. ε ≈ 1e-5 prevents division by zero.
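A minimal NumPy sketch of the training-time forward pass (real implementations also track running statistics for use at eval time):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BatchNorm over a (batch, features) array."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift
```

With γ = 1 and β = 0, each output feature has zero mean and unit variance over the batch by construction.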

BatchNorm vs LayerNorm in PyTorch — key differences

import torch
import torch.nn as nn

# BatchNorm: normalizes over batch + spatial dims — standard in CNNs
# Note: behavior differs at train vs eval time (uses running stats at eval)
bn = nn.BatchNorm2d(num_features=64)   # for conv feature maps
bn1d = nn.BatchNorm1d(num_features=256) # for linear layer outputs

# LayerNorm: normalizes over feature dims per sample — standard in Transformers
# Same behavior at train and eval — no batch dependency
ln = nn.LayerNorm(normalized_shape=768)  # normalize 768-dim hidden state

# RMSNorm (used in LLaMA, Mistral) — simpler, no mean centering
# Built into recent PyTorch as nn.RMSNorm (2.4+); often implemented manually:
class RMSNorm(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
    def forward(self, x):
        rms = x.norm(2, dim=-1, keepdim=True) / (x.shape[-1] ** 0.5)
        return self.weight * x / (rms + 1e-8)

Layer Normalization vs Batch Normalization

| Property | BatchNorm | LayerNorm | RMSNorm |
|---|---|---|---|
| Normalizes over | Batch + spatial dims | Feature dims per sample | Feature dims (no mean) |
| Works with batch_size=1 | ❌ Unstable | ✅ Yes | ✅ Yes |
| Works with variable seq len | ❌ Problematic | ✅ Yes | ✅ Yes |
| Train vs eval behavior | Different (running stats) | Same | Same |
| Used in | CNNs (ResNet, EfficientNet) | Transformers (BERT, GPT-2) | Modern LLMs (LLaMA, Mistral, Qwen) |
| Parameters | 2 per feature (γ, β) | 2 per feature (γ, β) | 1 per feature (weight only) |

Why LLMs switched to RMSNorm

RMSNorm drops the mean-centering step (just divides by RMS). Experiments show this performs as well as full LayerNorm while being ~15% faster. LLaMA, Mistral, Falcon, and most 2023–2025 LLMs use RMSNorm.
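The difference is easy to see side by side in NumPy (a sketch; exact ε placement varies between implementations, and learned scale parameters are omitted):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)   # mean-centering step
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rmsnorm(x, eps=1e-8):
    # no mean subtraction — just divide by root-mean-square
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True))
    return x / (rms + eps)
```

On exactly zero-mean inputs the two coincide; RMSNorm simply skips the mean computation and subtraction.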

Dropout variants and modern practices

| Variant | What it drops | Best for |
|---|---|---|
| Standard dropout | Individual neurons (p=0.1–0.5) | Dense layers, MLP blocks |
| Attention dropout | Attention weight entries | Transformer attention layers |
| Spatial dropout | Entire feature maps (channels) | CNN layers — preserves spatial structure |
| Stochastic depth | Entire residual blocks | Deep ResNets, ViTs — drops full layers with probability p |
| DropConnect | Individual weight connections | Rarely used in practice |
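As a sketch of the stochastic-depth idea from the table (the helper name is illustrative; the eval-time scaling by survival probability follows Huang et al., 2016):

```python
import numpy as np

def stochastic_depth_block(x, residual_fn, p_drop, training, rng):
    """Skip the entire residual branch with probability p_drop
    during training; scale it by survival probability at eval."""
    if training:
        if rng.random() < p_drop:
            return x                              # block is "dropped"
        return x + residual_fn(x)
    return x + (1.0 - p_drop) * residual_fn(x)    # expected-value scaling
```

Unlike neuron-level dropout, this shortens the effective depth of the network each step, which is why it works well for very deep residual stacks.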

Modern LLM practice

Most 2024–2025 LLMs (LLaMA 3, Mistral, Qwen 2) use minimal dropout (p=0.0 or 0.1) in attention, relying instead on weight decay (AdamW), gradient clipping, and careful data curation for regularization. Large training datasets make heavy dropout less necessary.

Weight initialization strategies

Proper weight initialization is critical — too small causes vanishing gradients, too large causes exploding gradients. The right strategy depends on the activation function:

| Initialization | Formula | Designed for | When to use |
|---|---|---|---|
| Zero | All weights = 0 | — | Never — all neurons compute identically (symmetry breaking failure) |
| Random small | N(0, 0.01) | Shallow nets | Vanishes in deep networks |
| Xavier / Glorot | N(0, 2/(n_in + n_out)) | tanh, sigmoid | Keeps variance constant — standard for encoders |
| He / Kaiming | N(0, 2/n_in) | ReLU, Leaky ReLU | Accounts for ReLU zeroing half its inputs — default for CNNs and MLPs |
| Orthogonal | Random orthogonal matrix | RNNs, deep residual nets | Preserves gradient norms — useful for very deep networks |
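A quick NumPy check of the He/Kaiming row: with variance 2/n_in, activation magnitudes neither vanish nor explode through a deep ReLU stack (layer sizes here are arbitrary):

```python
import numpy as np

def he_init(n_in, n_out, rng):
    # He/Kaiming normal: std = sqrt(2 / n_in)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
h = rng.normal(size=(512, 256))          # batch of standard-normal inputs
for _ in range(20):                      # 20-layer ReLU MLP, no biases
    h = np.maximum(h @ he_init(256, 256, rng), 0.0)
print(float((h ** 2).mean()))            # stays order 1, not ~0 or huge
```

Repeating the experiment with std 0.01 instead shows the mean square shrinking toward zero within a few layers — the vanishing case from the "Random small" row.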

Default for transformers

Most Transformer implementations (GPT, BERT, LLaMA) use small normal N(0, 0.02) for linear weights and zero for biases. Some use scaled initialization: divide by √(2 × n_layers) for residual branch projections to prevent residual stream explosion at initialization.
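A sketch of that scheme (function name and sizes are illustrative, not from any particular codebase):

```python
import numpy as np

def gpt_style_init(n_in, n_out, n_layers, residual_proj, rng):
    """Small normal init N(0, 0.02), with residual-branch projections
    scaled down by sqrt(2 * n_layers) to keep the residual stream tame."""
    std = 0.02
    if residual_proj:
        std /= np.sqrt(2 * n_layers)
    return rng.normal(0.0, std, size=(n_in, n_out))
```

The extra scaling matters because each of the n_layers blocks adds two projections into the residual stream; without it, the stream's variance grows linearly with depth at initialization.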

Practice questions

  1. What is the difference between dropout during training and dropout during inference? (Answer: Training: randomly set p fraction of neurons to 0 each forward pass (inverted dropout: scale remaining neurons by 1/(1-p) during training). Each mini-batch uses a different random sparse subnetwork. Inference: disable dropout entirely — all neurons active. No random masking. If using inverted dropout (PyTorch default), weights are already properly scaled so no additional scaling is needed at inference. Critical: always call model.eval() before evaluation to disable dropout.)
  2. What is MC Dropout (Monte Carlo Dropout) and what problem does it solve? (Answer: MC Dropout (Gal & Ghahramani 2016): keep dropout active during inference, run N forward passes (e.g., 100), compute mean and variance of predictions. Mean ≈ the model's best estimate. Variance ≈ model uncertainty. Provides Bayesian approximation to deep learning inference — uncertainty quantification without expensive Bayesian training. Use case: medical diagnosis where knowing 'the model is uncertain here' is as important as the prediction itself. Works with any dropout-trained model.)
  3. Spatial dropout in CNNs drops entire feature maps rather than individual neurons. Why? (Answer: In convolutional layers, adjacent neurons (pixels in a feature map) are spatially correlated — dropping one neuron while its neighbors are active provides minimal regularization. Spatial dropout drops entire feature maps (channels), forcing the network to learn that each feature map is replaceable, promoting robustness. Particularly effective in image segmentation models (U-Net uses spatial dropout) where spatial correlation is very high.)
  4. What is the relationship between dropout and ensemble learning? (Answer: Training with dropout trains an exponential ensemble of 2^n different subnetworks (n = number of droppable neurons). Each forward pass uses a different random architecture. Inference without dropout is approximate averaging over this ensemble (weight sharing across all subnetworks). This is why dropout generalizes well: it is equivalent to training many different models simultaneously with shared weights — the ensemble effect reduces variance without the cost of training separate models.)
  5. How do you choose a dropout rate, and when is it too high or too low? (Answer: Too high (p > 0.5 on hidden layers): underfitting — too many neurons disabled each pass, the model loses too much information. Too low (p < 0.1): minimal regularization benefit. Practical guidelines: p=0.5 for large fully connected layers (Hinton's original recommendation). p=0.2–0.3 for smaller layers or when data is plentiful. p=0.0–0.1 on convolutional layers (spatial dropout preferred). For transformers: p=0.1 on attention weights. If training loss >> validation loss: reduce dropout or remove it.)
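The MC Dropout procedure from question 2 can be sketched with a toy model (all names here are illustrative):

```python
import numpy as np

def mc_dropout_predict(forward, x, n_samples, rng):
    """Keep dropout active at inference; the spread across stochastic
    passes approximates model uncertainty (Gal & Ghahramani, 2016)."""
    preds = np.stack([forward(x, rng) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)

# Toy stochastic forward pass: one linear layer with inverted dropout
W = np.full((8, 1), 0.5)
def forward(x, rng):
    mask = rng.random(x.shape) >= 0.2          # dropout p = 0.2
    return (x * mask / 0.8) @ W

rng = np.random.default_rng(0)
mean, std = mc_dropout_predict(forward, np.ones(8), n_samples=200, rng=rng)
# mean ≈ the model's best estimate; std quantifies its uncertainty
```

Inputs where std is large are the ones the model is least certain about — exactly the signal the medical-diagnosis use case in question 2 relies on.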

