
Activation Functions

The nonlinearity that makes deep learning work.


Definition

Activation functions introduce nonlinearity into neural networks — without them, any number of stacked linear layers would collapse into a single linear transformation, unable to learn complex patterns. Applied element-wise after each linear transformation, activation functions determine whether and how strongly a neuron 'fires' in response to its inputs.

Why nonlinearity is essential

A neural network is a composition of functions: f_n(f_{n-1}(…f_1(x))). If all f_i are linear (Wx + b), then the whole stack collapses to a single linear function — no matter how many layers you stack:

W₃(W₂(W₁x)) = (W₃W₂W₁)x = Wx — three stacked linear layers equal one linear transformation. Without activations, depth buys nothing.

Activation functions break this — by applying a nonlinear function after each linear step, the composition can model arbitrary nonlinear relationships. This is the key ingredient behind the Universal Approximation Theorem.
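The collapse is easy to verify numerically — a minimal sketch (matrix sizes and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three linear maps (biases omitted for clarity)
W1, W2, W3 = rng.standard_normal((3, 4, 4))
x = rng.standard_normal(4)

# Applying the layers one after another...
stacked = W3 @ (W2 @ (W1 @ x))
# ...is identical to a single linear map with W = W3 @ W2 @ W1
collapsed = (W3 @ W2 @ W1) @ x
print(np.allclose(stacked, collapsed))   # True

# Insert a nonlinearity (ReLU) between layers and the equivalence breaks
relu = lambda z: np.maximum(z, 0.0)
nonlinear = W3 @ relu(W2 @ relu(W1 @ x))
print(np.allclose(nonlinear, collapsed)) # False
```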

ReLU and its variants

ReLU (Rectified Linear Unit) is the most widely used activation in hidden layers:

ReLU(x) = max(0, x) — simply clips negative values to zero. Its gradient is 1 for x > 0 and 0 for x ≤ 0.

ReLU is popular because it is computationally trivial, avoids vanishing gradients for positive inputs, and creates sparse activations. Its main problem is 'dying ReLU' — neurons with consistently negative pre-activations get zero gradient and stop learning. Variants fix this:

Leaky ReLU(x) = max(αx, x): a small negative slope (α ≈ 0.01) keeps the gradient non-zero for negative inputs, preventing dying neurons.

ELU (Exponential Linear Unit): x for x > 0, α(eˣ − 1) for x ≤ 0 — smooth transition and negative saturation. Typically α = 1.

GELU (Gaussian Error Linear Unit): x · Φ(x), where Φ is the standard Gaussian CDF — used in BERT and GPT. Smooth, probabilistic gating that slightly outperforms ReLU in Transformers.

All ReLU variants in PyTorch

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print("ReLU:       ", F.relu(x))
print("Leaky ReLU: ", F.leaky_relu(x, negative_slope=0.01))
print("ELU:        ", F.elu(x, alpha=1.0))
print("GELU:       ", F.gelu(x))
print("SiLU/Swish: ", F.silu(x))  # = x * sigmoid(x)

# ReLU:        [0.00, 0.00, 0.00, 0.50, 2.00]
# Leaky ReLU:  [-0.02, -0.005, 0.00, 0.50, 2.00]
# ELU:         [-0.865, -0.393, 0.00, 0.50, 2.00]
# GELU:        [-0.045, -0.154, 0.00, 0.346, 1.954]
# SiLU/Swish:  [-0.238, -0.189, 0.00, 0.311, 1.762]

Sigmoid and tanh

Sigmoid: σ(x) = 1 / (1 + e⁻ˣ) maps to (0, 1). Its gradient vanishes for large |x| — the cause of vanishing gradients in deep nets.

Tanh: tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ) maps to (−1, 1). Zero-centered (unlike sigmoid), but it still suffers vanishing gradients for large |x|.

Both are largely replaced by ReLU/GELU in hidden layers of modern networks. Sigmoid remains useful in output layers for binary classification (probability output) and in LSTM gating mechanisms.

Vanishing gradient

Sigmoid and tanh saturate for |x| > 3, where the gradient becomes ≈ 0. Even at its peak the sigmoid derivative is only 0.25, so stacking 10 sigmoid layers shrinks gradients by up to a factor of 0.25¹⁰ ≈ 10⁻⁶. This killed deep network training before ReLU became standard.
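The arithmetic above can be checked directly — a small sketch (the helper functions are written out by hand for illustration, not taken from any library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # peaks at 0.25, reached at x = 0

# By the chain rule, the gradient reaching early layers is a product of
# per-layer derivatives. Even in the BEST case (every pre-activation at 0,
# where the derivative is maximal) the signal collapses:
depth = 10
best_case = 0.25 ** depth
print(f"10 sigmoid layers, best case: {best_case:.1e}")   # ~9.5e-07

# Saturated units are far worse:
print(f"sigmoid'(5) = {sigmoid_grad(5.0):.4f}")
```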

Softmax for output distributions

Softmax converts a vector of raw scores (logits) into a probability distribution:

softmax(z)ᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)

All outputs are positive and sum to 1 — a valid probability distribution. Used in classification output layers and LLM next-token prediction.

LLMs apply softmax over a vocabulary of 50,000–100,000 tokens at every generation step, producing a probability for each possible next token. Temperature scaling controls the sharpness of this distribution:

T < 1: distribution is sharper (more confident, less diverse). T > 1: distribution is flatter (more random). T → 0: always picks the argmax (greedy decoding).

Numerically stable softmax + temperature scaling

import numpy as np

def softmax(logits, T=1.0):
    """Numerically stable softmax with temperature."""
    scaled = logits / T
    shifted = scaled - scaled.max()     # subtract max for stability (avoids overflow)
    e = np.exp(shifted)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])

print("T=1.0 (default):   ", softmax(logits, T=1.0).round(3))
# [0.659, 0.242, 0.099]

print("T=0.5 (confident): ", softmax(logits, T=0.5).round(3))
# [0.864, 0.117, 0.019]  ← more peaked

print("T=2.0 (creative):  ", softmax(logits, T=2.0).round(3))
# [0.502, 0.304, 0.194]  ← flatter, more diverse

SwiGLU and modern activations in LLMs

Modern LLMs replace the standard ReLU feedforward block with gated activation units. SwiGLU (used in LLaMA, PaLM, Mistral) combines a Swish activation with a gating mechanism:

SwiGLU(x) = Swish(xW₁) ⊗ (xW₂), where Swish(x) = x · σ(x) and ⊗ is element-wise multiplication. Two parallel projections — one gating the other.

Swish (also called SiLU) — smooth, non-monotonic. Outperforms ReLU on many benchmarks. Used standalone in EfficientNet.

The gating mechanism lets the network selectively amplify or suppress parts of the representation — similar to LSTM gates but inside a feedforward layer. GeGLU is an identical structure with GELU instead of Swish. These choices meaningfully impact model performance at scale.

Activation | Used in                  | FFN formula
ReLU       | Original Transformer     | max(0, Wx + b)
GELU       | BERT, GPT-2, GPT-3       | GELU(Wx + b)
SwiGLU     | LLaMA 2/3, PaLM, Mistral | (Swish(W₁x) ⊗ W₂x) W₃
GeGLU      | T5 v1.1, Flan-T5         | (GELU(W₁x) ⊗ W₂x) W₃
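The gated feedforward block can be sketched in a few lines of PyTorch — an illustrative module, not any model's actual code (the class name is made up; the bias-free projections follow the LLaMA convention, and real models choose d_ff differently):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: (Swish(x W1) ⊗ x W2) W3 — an illustrative sketch."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w2 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w3 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish: x * sigmoid(x); the gate multiplies element-wise
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

ffn = SwiGLUFFN(d_model=64, d_ff=172)
out = ffn(torch.randn(2, 8, 64))   # (batch, seq, d_model)
print(out.shape)                   # torch.Size([2, 8, 64])
```

Note the extra weight matrix: a SwiGLU block has three projections where a ReLU/GELU FFN has two, which is why models using it often shrink d_ff to keep the parameter count comparable.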

Practice questions

  1. What is the dying ReLU problem and how can it be fixed? (Answer: Dying ReLU: a neuron permanently outputs 0 because its pre-activation is always negative — gradients are 0, so the neuron never updates (it is 'dead'). Causes: bad weight initialisation, very high learning rate. Fixes: (1) Leaky ReLU: small negative slope (0.01) for negative inputs — dead neurons can recover. (2) ELU (Exponential Linear Unit): smooth negative saturation, no dead neurons, allows negative outputs. (3) Better initialisation: He initialisation for ReLU networks prevents initial dead neurons. (4) Batch normalisation: keeps pre-activations in range where ReLU is active.)
  2. What is the GELU activation and why is it used in GPT, BERT, and most modern transformers? (Answer: GELU (Gaussian Error Linear Unit): x × Φ(x) where Φ is the standard Gaussian CDF. Approximated as: 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))). Properties: smooth everywhere (no sharp corner at 0 like ReLU), non-monotonic (small negative values for slightly negative inputs), stochastic interpretation (acts as a probabilistic gate). Outperforms ReLU empirically on NLP tasks — the smooth gating behaviour better fits language model training. Used in: GPT-2, GPT-3, BERT, and many other transformers.)
  3. What is the swish/SiLU activation and in which models is it used? (Answer: Swish (Google Brain) / SiLU (Sigmoid Linear Unit): f(x) = x × σ(x). Similar to GELU — smooth, non-monotonic, self-gated. Slightly outperforms GELU on some tasks. Used in: EfficientNet (vision), LLaMA-2 and LLaMA-3 (SiLU in FFN layers), PaLM, Mistral. The difference from GELU is subtle: SiLU uses sigmoid gating; GELU uses Gaussian CDF. Both are smooth gates that outperform ReLU on NLP and modern vision tasks.)
  4. Why does softmax behave poorly at extreme temperature values? (Answer: High temperature (softmax(logits/T) with T→∞): all probabilities approach 1/K — uniform distribution, maximum entropy, random sampling. Low temperature (T→0): probabilities approach one-hot for the maximum logit — deterministic argmax. In practice: high T in generation creates incoherent output (random tokens). Very low T in generation creates repetitive output (always the highest probability token). Very small T causes overflow in exp(logits/T), so numerical stabilisation (subtract the max logit before exponentiating) is necessary, and T = 0 is typically special-cased as greedy argmax.)
  5. What is the 'curse of dimensionality' effect on activation functions in very wide networks? (Answer: In very wide networks, ReLU creates sparse activations — many neurons output 0 (dead or just inactive for a given input). Sparsity can be beneficial (efficient computation in sparse tensor formats) but can also cause unstable gradients in very deep wide networks. Smooth activations (GELU, SiLU) maintain more neurons active per input, providing denser gradient signal. However, research suggests sparse activation can improve generalisation — Mixture of Experts architectures exploit this deliberately by routing inputs to sparse subsets of expert networks.)

