Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without activation functions, stacking any number of linear layers is equivalent to a single linear transformation — the network cannot learn non-linear decision boundaries. ReLU is the default for hidden layers; sigmoid for binary outputs; softmax for multi-class outputs; tanh for RNNs; GELU for transformers. The choice of activation significantly affects training speed, gradient flow, and ultimately model performance.
Why non-linearity is essential
Without activation functions: layer 2 output = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁+b₂) = W'x + b'. No matter how many layers, the result is always a linear function of x. A neural network without activations cannot classify circles, spirals, or any non-linearly separable data — it is just linear regression. Activation functions are what make deep learning actually deep.
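The collapse of stacked linear layers can be checked numerically. A minimal sketch with random matrices (shapes chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)

# Two linear layers applied in sequence...
two_layers = W2 @ (W1 @ x + b1) + b2
# ...equal a single linear layer with W' = W2·W1 and b' = W2·b1 + b2
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, collapsed))  # True
```

No matter how many layers you stack, the same collapse applies inductively — which is exactly why a non-linearity must sit between them.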
All major activation functions
Each major activation function, implemented and compared below:
```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

x = np.linspace(-5, 5, 100)

# ── SIGMOID ──
# Range: (0, 1) | Output: probability | Problem: vanishing gradients
def sigmoid(x): return 1 / (1 + np.exp(-x))
# Derivative: σ(x)(1-σ(x)) — max 0.25 at x=0, near 0 at extremes
# Use: binary output layer | Avoid: hidden layers (gradient vanishes)
print(f"Sigmoid(0) = {sigmoid(0):.4f}")   # 0.5
print(f"Sigmoid(5) = {sigmoid(5):.4f}")   # ~0.99

# ── TANH ──
# Range: (-1, 1) | Zero-centred (better than sigmoid) | Still vanishing gradients
def tanh(x): return np.tanh(x)
# Derivative: 1 - tanh²(x) — max 1.0 at x=0 (4× the maximum sigmoid gradient)
# Use: RNN hidden states, some hidden layers | Better than sigmoid but still saturates
print(f"Tanh(0) = {tanh(0):.4f}")         # 0.0

# ── ReLU (Rectified Linear Unit) ──
# Range: [0, ∞) | Simple, fast | Problem: dying ReLU
def relu(x): return np.maximum(0, x)
# Derivative: 1 for x>0, 0 for x<0 — no gradient vanishing for positive inputs
# This is a key reason deep networks became practical with ReLU (circa 2011)
# Dying ReLU: if a neuron's input is always < 0 (e.g. due to bad init), its gradient is 0 forever
print(f"ReLU(-3) = {relu(-3)}")           # 0
print(f"ReLU(3) = {relu(3)}")             # 3

# ── Leaky ReLU ──
# Fixes dying ReLU: small negative slope for x<0
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
print(f"LeakyReLU(-3) = {leaky_relu(-3):.4f}")  # -0.0300

# ── PReLU (Parametric ReLU) ──
# Like Leaky ReLU but alpha is learned — PyTorch: nn.PReLU()
prelu = nn.PReLU()  # alpha initialised to 0.25, learned during training

# ── ELU (Exponential Linear Unit) ──
# Negative side: α(eˣ-1) — smooth, avoids dead neurons, negative outputs allowed
def elu(x, alpha=1.0): return np.where(x > 0, x, alpha * (np.exp(x) - 1))

# ── GELU (Gaussian Error Linear Unit) ── the transformer default
# GELU(x) = x × Φ(x) where Φ is the standard Gaussian CDF
# Tanh approximation: 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
def gelu_approx(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))
# Used in GPT-2, GPT-3, BERT and many other transformer architectures
# Interpretation: multiplies x by Φ(x) = P(X ≤ x), X ~ N(0,1); a smooth,
# deterministic gate motivated by stochastic regularisation
# Smoother than ReLU, slightly better performance on language tasks
print(f"GELU(1) = {gelu_approx(1):.4f}")    # ~0.84
print(f"GELU(-1) = {gelu_approx(-1):.4f}")  # ~-0.16

# ── SOFTMAX ── (output layer for multi-class)
def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Softmax({logits}) = {probs.round(3)}")  # [0.659 0.242 0.099]
print(f"Sum = {probs.sum():.6f}")               # 1.000000

# ── SiLU / Swish ── (modern; used in LLaMA, EfficientNet)
def swish(x): return x * sigmoid(x)
# Smooth, non-monotonic — slightly better than ReLU on some tasks
print(f"Swish(2) = {swish(2):.4f}")   # ~1.76
print(f"Swish(-2) = {swish(-2):.4f}") # ~-0.24
```

Comparison and selection guide
| Function | Range | Gradient | Dying neurons? | Where to use |
|---|---|---|---|---|
| Sigmoid | (0, 1) | 0 to 0.25 (low) | No (saturates) | Binary output layer only |
| Tanh | (-1, 1) | 0 to 1 (better) | No (saturates) | RNN hidden states |
| ReLU | [0, ∞) | 0 or 1 | Yes (x<0 → dead) | Default hidden layers (CNN, MLP) |
| Leaky ReLU | (-∞, ∞) | 0.01 or 1 | No (tiny gradient) | When dying ReLU is a problem |
| ELU | (-α, ∞) | Smooth | No | Deep networks needing negative values |
| GELU | (-0.17, ∞) | Smooth | Effectively no | Transformers (BERT, GPT, Claude) |
| Swish/SiLU | (-0.28, ∞) | Smooth, non-monotonic | No | LLaMA, EfficientNet, modern CNNs |
| Softmax | (0,1), sum=1 | Complex (Jacobian) | No | Multi-class output layer only |
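The tanh formula given for GELU above is an approximation; the exact definition uses the Gaussian CDF, which the standard library can compute via the error function. A quick standard-library check of how close the two are:

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi(x) computed from erf
    return 0.5 * x * (1 + math.erf(x / math.sqrt(2)))

def gelu_tanh(x):
    # the tanh approximation used in many implementations
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

for v in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"{v:+.1f}: exact={gelu_exact(v):+.6f} approx={gelu_tanh(v):+.6f}")
# The two agree to roughly 3-4 decimal places over this range
```

The approximation exists because tanh was historically cheaper to evaluate than erf; modern frameworks offer both forms.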
Activation function quick selection guide
Hidden layers: use ReLU as the default. If dying ReLU is a problem (many zero outputs), use Leaky ReLU or ELU. For transformers and modern architectures: GELU or SiLU. For RNNs: Tanh. Output layers: binary classification → Sigmoid; multi-class → Softmax; regression → Linear (no activation). GELU's smooth gating behaviour is one reason modern transformers favour it over ReLU for language modelling.
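A sketch of the guide in practice: a tiny forward pass wired with the defaults above (ReLU hidden layer, softmax output). The weights here are random placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(8)                           # input features
W1, b1 = rng.standard_normal((16, 8)) * 0.1, np.zeros(16)
W2, b2 = rng.standard_normal((3, 16)) * 0.1, np.zeros(3)

h = np.maximum(0, W1 @ x + b1)                       # hidden layer: ReLU (the default)
logits = W2 @ h + b2                                 # output layer: linear pre-activation
e = np.exp(logits - logits.max())                    # softmax for 3-class output
probs = e / e.sum()
print(probs, probs.sum())                            # 3 probabilities summing to 1
```

For binary classification you would swap the softmax for a single sigmoid unit, and for regression you would use the raw `logits` directly.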
Practice questions
- Why does using all sigmoid activations in a deep network cause training failure? (Answer: Vanishing gradient problem. Sigmoid gradient is at most 0.25. With 10 layers: 0.25^10 = 9.5×10⁻⁷. Gradients in early layers are essentially zero — weights barely update. ReLU has gradient 1 for positive inputs, completely solving this for deep networks.)
- ReLU outputs 0 for x<0. Why is this called "dying ReLU" and how do you fix it? (Answer: If a neuron consistently receives negative inputs, its output and gradient are always 0 — it never updates and is permanently "dead". Solutions: Leaky ReLU (small slope for x<0), better weight initialisation (He initialisation), batch normalisation (keeps pre-activations in reasonable range).)
- What makes GELU better than ReLU for transformers? (Answer: GELU is smooth and differentiable everywhere (no sharp corner at 0). It stochastically gates inputs: for x near 0, sometimes the neuron fires, sometimes not — provides regularisation. Empirically improves performance on NLP tasks. Also, transformers use residual connections which reduce the vanishing gradient issue, making the smoothness of GELU more impactful.)
- Softmax is only applied to the OUTPUT layer. Why not in hidden layers? (Answer: Softmax normalises all activations to sum to 1 — if one unit increases, all others decrease. In hidden layers, this creates competition between neurons, preventing them from independently detecting different features. Hidden layers need each neuron to independently activate — ReLU/GELU are better.)
- tanh outputs zero-centred values but sigmoid does not. Why does this matter? (Answer: Sigmoid outputs are always positive (0-1). When used as layer input, all gradients in the next layer have the same sign — causing zig-zagging in weight updates (gradient descent takes longer to converge). tanh(-1 to 1) is zero-centred, so gradients can be positive or negative, enabling more efficient updates.)
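The arithmetic in the first answer can be reproduced directly: multiplying the per-layer maximum of the sigmoid derivative shows how quickly the backpropagated signal decays, compared with ReLU's gradient of 1 on positive inputs:

```python
# Best case for sigmoid: every layer sits at x = 0, where the
# derivative sigma'(0) = 0.25 is at its maximum.
sigmoid_grad_max = 0.25
layers = 10
print(sigmoid_grad_max ** layers)  # 9.5367431640625e-07 — early layers barely update
# ReLU on positive inputs: derivative is 1, so the product stays 1
print(1.0 ** layers)               # 1.0
```

In a real network the sigmoid factors are usually well below 0.25 (the units saturate), so this best-case figure understates the problem.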