Activation Functions — ReLU, Sigmoid, Tanh, Softmax & GELU
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without activation functions, stacking any number of linear layers is equivalent to a single linear transformation, so the network cannot learn non-linear decision boundaries. As a rule of thumb: ReLU is the default for hidden layers, sigmoid for binary outputs, softmax for multi-class outputs, tanh for RNNs, and GELU for transformers. The choice of activation significantly impacts training speed, gradient flow, and ultimately model performance.
The non-linearity that lets neural networks learn anything — without it, deep learning is just linear algebra.
Category: Deep Learning & Neural Networks
Why non-linearity is essential
Without activation functions: layer 2 output = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁+b₂) = W'x + b'. No matter how many layers, the result is always a linear function of x. A neural network without activations cannot classify circles, spirals, or any non-linearly separable data — it is just linear regression. Activation functions are what make deep learning actually deep.
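A quick numerical check of the collapse argument (a minimal sketch; the layer sizes and random weights below are arbitrary, chosen only for illustration):
import numpy as np
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # layer 2: 4 -> 2
x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2        # two stacked linear layers, no activation between them
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)  # single equivalent linear layer W'x + b'
print(np.allclose(two_layers, collapsed))   # True: the extra layer added no expressive power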
All major activation functions
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
x = np.linspace(-5, 5, 100)
# ── SIGMOID ──
# Range: (0, 1) | Output: probability | Problem: vanishing gradients
def sigmoid(x): return 1 / (1 + np.exp(-x))
# Derivative: σ(x)(1-σ(x)) — max 0.25 at x=0, near 0 at extremes
# Use: Binary output layer | Avoid: hidden layers (gradient vanishes)
print(f"Sigmoid(0) = {sigmoid(0):.4f}") # 0.5
print(f"Sigmoid(5) = {sigmoid(5):.4f}") # ~0.99
# ── TANH ──
# Range: (-1, 1) | Zero-centered (better than sigmoid) | Still vanishing gradients
def tanh(x): return np.tanh(x)
# Derivative: 1 - tanh²(x) — max 1.0 at x=0 (4× stronger than sigmoid gradient!)
# Use: RNN hidden states, some hidden layers | Better than sigmoid but still saturates
print(f"Tanh(0) = {tanh(0):.4f}") # 0.0
# ── ReLU (Rectified Linear Unit) ──
# Range: [0, ∞) | Simple, fast | Problem: dying ReLU
def relu(x): return np.maximum(0, x)
# Derivative: 1 for x>0, 0 for x<0 — NO gradient vanishing for positive inputs!
# This is why deep networks became practical with ReLU (2011)
# Dying ReLU: if x always < 0 (e.g., due to bad init), gradient = 0 forever
print(f"ReLU(-3) = {relu(-3)}") # 0
print(f"ReLU(3) = {relu(3)}") # 3
# ── Leaky ReLU ──
# Fixes dying ReLU: small negative slope for x<0
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
print(f"LeakyReLU(-3) = {leaky_relu(-3):.4f}") # -0.03
# ── PReLU (Parametric ReLU) ──
# Like Leaky ReLU but alpha is learned — PyTorch: nn.PReLU()
prelu = nn.PReLU() # alpha initialized to 0.25, learned during training
# ── ELU (Exponential Linear Unit) ──
# Negative side: α(eˣ-1) — smooth, avoids dead neurons, negative outputs allowed
def elu(x, alpha=1.0): return np.where(x > 0, x, alpha * (np.exp(x) - 1))
# ── GELU (Gaussian Error Linear Unit) ── MOST IMPORTANT FOR TRANSFORMERS
# GELU(x) ≈ x × Φ(x) where Φ is Gaussian CDF
# Smooth approximation: 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
def gelu_approx(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))
# Used in: GPT-2, BERT, GPT-3, GPT-4, Claude, T5 and most modern transformers
# (some, like LLaMA, use SiLU/SwiGLU instead; see below)
# Advantage: gates the input by Φ(x) = P(X ≤ x), X ~ N(0,1); a smooth, probabilistic
# weighting (the deterministic expected value of a stochastic gate)
# Smoother than ReLU, slightly better performance on language tasks
print(f"GELU(1) = {gelu_approx(1):.4f}") # ~0.84
print(f"GELU(-1) = {gelu_approx(-1):.4f}") # ~-0.16
# ── SOFTMAX ── (output layer for multi-class)
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Softmax({logits}) = {probs.round(3)}") # [0.659, 0.242, 0.099]
print(f"Sum = {probs.sum():.6f}") # 1.0
# ── SiLU / Swish ── (modern, used in LLaMA, EfficientNet)
def swish(x): return x * sigmoid(x)
# Smooth, non-monotonic — slightly better than ReLU on some tasks
print(f"Swish(2) = {swish(2):.4f}") # ~1.76
print(f"Swish(-2) = {swish(-2):.4f}") # ~-0.24
Comparison and selection guide
| Function | Range | Gradient | Dying neurons? | Where to use |
|---|---|---|---|---|
| Sigmoid | (0, 1) | 0 to 0.25 (low) | No (saturates) | Binary output layer only |
| Tanh | (-1, 1) | 0 to 1 (better) | No (saturates) | RNN hidden states |
| ReLU | [0, ∞) | 0 or 1 | Yes (x<0 → dead) | Default hidden layers (CNN, MLP) |
| Leaky ReLU | (-∞, ∞) | 0.01 or 1 | No (tiny gradient) | When dying ReLU is a problem |
| ELU | (-α, ∞) | Smooth | No | Deep networks needing negative values |
| GELU | (-0.17, ∞) | Smooth | Effectively no | Transformers (BERT, GPT, Claude) |
| Swish/SiLU | (-0.28, ∞) | Smooth, non-monotonic | No | LLaMA, EfficientNet, modern CNNs |
| Softmax | (0,1), sum=1 | Complex (Jacobian) | No | Multi-class output layer only |
Activation function quick selection guide: Hidden layers: use ReLU as the default. If dying ReLU is a problem (many zero outputs), use Leaky ReLU or ELU. For transformers and modern architectures: GELU or SiLU. For RNNs: Tanh. Output layers: binary classification → Sigmoid; multi-class → Softmax; regression → linear (no activation). GELU's smooth gating tends to work slightly better than ReLU in transformer feed-forward blocks, which is one reason it became the standard choice for large language models.
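To make the guide concrete, a minimal PyTorch sketch of a multi-class classifier (the layer sizes and 3-class output are arbitrary placeholders):
import torch.nn as nn

# Hidden layers: ReLU (swap in nn.GELU() for transformer-style blocks).
# Output layer: raw logits, because nn.CrossEntropyLoss applies log-softmax internally.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),            # 3-class output: leave as logits
)
loss_fn = nn.CrossEntropyLoss()  # expects logits, not softmax probabilities

# Binary classification variant: a single output unit with nn.BCEWithLogitsLoss(),
# which folds the sigmoid into the loss for numerical stability.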
Practice questions
- Why does using all sigmoid activations in a deep network cause training failure? (Answer: Vanishing gradient problem. The sigmoid gradient is at most 0.25, so with 10 layers the product of gradients is at most 0.25^10 ≈ 9.5×10⁻⁷. Gradients in early layers are essentially zero and their weights barely update. ReLU has gradient 1 for positive inputs, largely eliminating this problem for deep networks; see the numerical sketch after these questions.)
- ReLU outputs 0 for x<0. Why is this called "dying ReLU" and how do you fix it? (Answer: If a neuron consistently receives negative inputs, its output and gradient are always 0 — it never updates and is permanently "dead". Solutions: Leaky ReLU (small slope for x<0), better weight initialization (He initialization), batch normalization (keeps pre-activations in reasonable range).)
- What makes GELU better than ReLU for transformers? (Answer: GELU is smooth and differentiable everywhere (no sharp corner at 0). It weights each input by Φ(x), the probability that a standard Gaussian falls below x; this is the expected value of a stochastic gate and acts as a mild regularizer. Empirically it improves performance on NLP tasks. Also, transformers use residual connections that reduce the vanishing gradient issue, so GELU's smoothness matters more than raw gradient magnitude.)
- Softmax is only applied to the OUTPUT layer. Why not in hidden layers? (Answer: Softmax normalizes all activations to sum to 1 — if one unit increases, all others decrease. In hidden layers, this creates competition between neurons, preventing them from independently detecting different features. Hidden layers need each neuron to independently activate — ReLU/GELU are better.)
- Tanh outputs zero-centered values but sigmoid does not. Why does this matter? (Answer: Sigmoid outputs are always positive (0 to 1). When they feed the next layer, the gradients on that layer's weights all share the same sign, causing zig-zagging weight updates and slower convergence. Tanh outputs (-1 to 1) are zero-centered, so the inputs, and hence the weight gradients, can be positive or negative, enabling more efficient updates.)
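A numerical sketch of the vanishing-gradient arithmetic from the first question (assumes 10 layers and the best case where every sigmoid sits at its maximum gradient of 0.25):
max_sigmoid_grad, relu_grad, depth = 0.25, 1.0, 10
print(f"Sigmoid, {depth} layers: {max_sigmoid_grad ** depth:.2e}")  # 9.54e-07, effectively zero
print(f"ReLU,    {depth} layers: {relu_grad ** depth:.2e}")         # 1.00e+00, gradient preserved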
Claude uses GELU activations in its MLP layers, the same activation described here. Every token processed by Claude goes through thousands of GELU computations per layer, with the smooth gating property helping the model represent complex linguistic patterns.