Activation Functions — ReLU, Sigmoid, Tanh, Softmax & GELU
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without activation functions, stacking any number of linear layers is equivalent to a single linear transformation, so the network cannot learn non-linear decision boundaries. As a rule of thumb: ReLU is the default for hidden layers, sigmoid for binary outputs, softmax for multi-class outputs, tanh for RNNs, and GELU for transformers. The choice of activation significantly impacts training speed, gradient flow, and ultimately model performance.
The non-linearity that lets neural networks learn anything — without it, deep learning is just linear algebra.
Category: Deep Learning & Neural Networks
Why non-linearity is essential
Without activation functions: layer 2 output = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁+b₂) = W'x + b'. No matter how many layers, the result is always a linear function of x. A neural network without activations cannot classify circles, spirals, or any non-linearly separable data — it is just linear regression. Activation functions are what make deep learning actually deep.
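A quick numerical check of the collapse argument (a minimal sketch; the layer sizes and random weights below are arbitrary, chosen only for illustration):
import numpy as np
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # layer 2: 4 -> 2
x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2        # two stacked linear layers, no activation between them
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)  # single equivalent linear layer W'x + b'
print(np.allclose(two_layers, collapsed))   # True: the extra layer added no expressive power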
All major activation functions
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
x = np.linspace(-5, 5, 100)
# ── SIGMOID ──
# Range: (0, 1) | Output: probability | Problem: vanishing gradients
def sigmoid(x): return 1 / (1 + np.exp(-x))
# Derivative: σ(x)(1-σ(x)) — max 0.25 at x=0, near 0 at extremes
# Use: Binary output layer | Avoid: hidden layers (gradient vanishes)
print(f"Sigmoid(0) = {sigmoid(0):.4f}") # 0.5
print(f"Sigmoid(5) = {sigmoid(5):.4f}") # ~0.99
# ── TANH ──
# Range: (-1, 1) | Zero-centered (better than sigmoid) | Still vanishing gradients
def tanh(x): return np.tanh(x)
# Derivative: 1 - tanh²(x) — max 1.0 at x=0 (4× stronger than sigmoid gradient!)
# Use: RNN hidden states, some hidden layers | Better than sigmoid but still saturates
print(f"Tanh(0) = {tanh(0):.4f}") # 0.0
# ── ReLU (Rectified Linear Unit) ──
# Range: [0, ∞) | Simple, fast | Problem: dying ReLU
def relu(x): return np.maximum(0, x)
# Derivative: 1 for x>0, 0 for x<0 — NO gradient vanishing for positive inputs!
# This is why deep networks became practical with ReLU (2011)
# Dying ReLU: if x always < 0 (e.g., due to bad init), gradient = 0 forever
print(f"ReLU(-3) = {relu(-3)}") # 0
print(f"ReLU(3) = {relu(3)}") # 3
# ── Leaky ReLU ──
# Fixes dying ReLU: small negative slope for x<0
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
print(f"LeakyReLU(-3) = {leaky_relu(-3):.4f}") # -0.03
# ── PReLU (Parametric ReLU) ──
# Like Leaky ReLU but alpha is learned — PyTorch: nn.PReLU()
prelu = nn.PReLU() # alpha initialized to 0.25, learned during training
# ── ELU (Exponential Linear Unit) ──
# Negative side: α(eˣ-1) — smooth, avoids dead neurons, negative outputs allowed
def elu(x, alpha=1.0): return np.where(x > 0, x, alpha * (np.exp(x) - 1))
# ── GELU (Gaussian Error Linear Unit) ── MOST IMPORTANT FOR TRANSFORMERS
# GELU(x) ≈ x × Φ(x) where Φ is Gaussian CDF
# Smooth approximation: 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
def gelu_approx(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))
# Used in: GPT-2, BERT, GPT-3, GPT-4, Claude, T5 and most modern transformers
# (some, like LLaMA, use SiLU/SwiGLU instead; see below)
# Advantage: gates the input by Φ(x) = P(X ≤ x), X ~ N(0,1); a smooth, probabilistic
# weighting (the deterministic expected value of a stochastic gate)
# Smoother than ReLU, slightly better performance on language tasks
print(f"GELU(1) = {gelu_approx(1):.4f}") # ~0.84
print(f"GELU(-1) = {gelu_approx(-1):.4f}") # ~-0.16
# ── SOFTMAX ── (output layer for multi-class)
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Softmax({logits}) = {probs.round(3)}") # [0.659, 0.242, 0.099]
print(f"Sum = {probs.sum():.6f}") # 1.0
# ── SiLU / Swish ── (modern, used in LLaMA, EfficientNet)
def swish(x): return x * sigmoid(x)
# Smooth, non-monotonic — slightly better than ReLU on some tasks
print(f"Swish(2) = {swish(2):.4f}") # ~1.76
print(f"Swish(-2) = {swish(-2):.4f}") # ~-0.24
Comparison and selection guide
| Function | Range | Gradient | Dying neurons? | Where to use |
|---|---|---|---|---|
| Sigmoid | (0, 1) | 0 to 0.25 (low) | No (saturates) | Binary output layer only |
| Tanh | (-1, 1) | 0 to 1 (better) | No (saturates) | RNN hidden states |
| ReLU | [0, ∞) | 0 or 1 | Yes (x<0 → dead) | Default hidden layers (CNN, MLP) |
| Leaky ReLU | (-∞, ∞) | 0.01 or 1 | No (tiny gradient) | When dying ReLU is a problem |
| ELU | (-α, ∞) | Smooth | No | Deep networks needing negative values |
| GELU | (-0.17, ∞) | Smooth | Effectively no | Transformers (BERT, GPT, Claude) |
| Swish/SiLU | (-0.28, ∞) | Smooth, non-monotonic | No | LLaMA, EfficientNet, modern CNNs |
| Softmax | (0,1), sum=1 | Complex (Jacobian) | No | Multi-class output layer only |
Activation function quick selection guide: Hidden layers: use ReLU as the default. If dying ReLU is a problem (many zero outputs), use Leaky ReLU or ELU. For transformers and modern architectures: GELU or SiLU. For RNNs: Tanh. Output layers: binary classification → Sigmoid; multi-class → Softmax; regression → linear (no activation). GELU's smooth gating tends to work slightly better than ReLU in transformer feed-forward blocks, which is one reason it became the standard choice for large language models.
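To make the guide concrete, a minimal PyTorch sketch of a multi-class classifier (the layer sizes and 3-class output are arbitrary placeholders):
import torch.nn as nn

# Hidden layers: ReLU (swap in nn.GELU() for transformer-style blocks).
# Output layer: raw logits, because nn.CrossEntropyLoss applies log-softmax internally.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),            # 3-class output: leave as logits
)
loss_fn = nn.CrossEntropyLoss()  # expects logits, not softmax probabilities

# Binary classification variant: a single output unit with nn.BCEWithLogitsLoss(),
# which folds the sigmoid into the loss for numerical stability.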
Practice questions
- Why does using all sigmoid activations in a deep network cause training failure? (Answer: Vanishing gradient problem. The sigmoid gradient is at most 0.25, so with 10 layers the product of gradients is at most 0.25^10 ≈ 9.5×10⁻⁷. Gradients in early layers are essentially zero and their weights barely update. ReLU has gradient 1 for positive inputs, largely eliminating this problem for deep networks; see the numerical sketch after these questions.)
- ReLU outputs 0 for x<0. Why is this called "dying ReLU" and how do you fix it? (Answer: If a neuron consistently receives negative inputs, its output and gradient are always 0 — it never updates and is permanently "dead". Solutions: Leaky ReLU (small slope for x<0), better weight initialization (He initialization), batch normalization (keeps pre-activations in reasonable range).)
- What makes GELU better than ReLU for transformers? (Answer: GELU is smooth and differentiable everywhere (no sharp corner at 0). It weights each input by Φ(x), the probability that a standard Gaussian falls below x; this is the expected value of a stochastic gate and acts as a mild regularizer. Empirically it improves performance on NLP tasks. Also, transformers use residual connections that reduce the vanishing gradient issue, so GELU's smoothness matters more than raw gradient magnitude.)
- Softmax is only applied to the OUTPUT layer. Why not in hidden layers? (Answer: Softmax normalizes all activations to sum to 1 — if one unit increases, all others decrease. In hidden layers, this creates competition between neurons, preventing them from independently detecting different features. Hidden layers need each neuron to independently activate — ReLU/GELU are better.)
- Tanh outputs zero-centered values but sigmoid does not. Why does this matter? (Answer: Sigmoid outputs are always positive (0 to 1). When they feed the next layer, the gradients on that layer's weights all share the same sign, causing zig-zagging weight updates and slower convergence. Tanh outputs (-1 to 1) are zero-centered, so the inputs, and hence the weight gradients, can be positive or negative, enabling more efficient updates.)
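A numerical sketch of the vanishing-gradient arithmetic from the first question (assumes 10 layers and the best case where every sigmoid sits at its maximum gradient of 0.25):
max_sigmoid_grad, relu_grad, depth = 0.25, 1.0, 10
print(f"Sigmoid, {depth} layers: {max_sigmoid_grad ** depth:.2e}")  # 9.54e-07, effectively zero
print(f"ReLU,    {depth} layers: {relu_grad ** depth:.2e}")         # 1.00e+00, gradient preserved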
Claude uses GELU activations in its MLP layers, the same activation described here. Every token processed by Claude goes through thousands of GELU computations per layer, with the smooth gating property helping the model represent complex linguistic patterns.