Forward propagation is the process by which input data flows through a neural network layer by layer to produce a prediction. Each layer applies a linear transformation (weights × input + bias) followed by a non-linear activation function. The output of each layer becomes the input to the next. This is the inference step — the same computation used both during training (to compute loss) and deployment (to make predictions). Understanding forward propagation is the foundation for understanding backpropagation and automatic differentiation.
Real-life analogy: The assembly line
A car assembly line has sequential stations — each station receives a partially built car, adds components (transformation), and passes it to the next. The input is raw metal; the output is a finished car. Neural network forward propagation works the same way: each layer receives the previous layer's output, applies its transformation (weights × activations + bias, passed through an activation function), and passes the result forward. The input is raw data; the output is a prediction.
Layer-by-layer computation
Forward pass equations, for layers l = 1, …, L:
z^[l] = W^[l] a^[l-1] + b^[l]  (pre-activation)
a^[l] = σ^[l](z^[l])  (activations)
where a^[0] = x is the input, W^[l] is the weight matrix of layer l, b^[l] its bias vector, σ^[l] its activation function, ŷ = a^[L] is the final output prediction, and L is the number of layers.
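The recurrence above can be sketched as a short loop. This is a minimal, hypothetical illustration — the layer sizes and activations below are arbitrary placeholders, not the network built in the next section:

```python
import numpy as np

def forward(x, layers):
    """Apply z = W @ a + b, then the activation, layer by layer."""
    a = x                      # a^[0] = input
    for W, b, act in layers:
        z = W @ a + b          # pre-activation z^[l]
        a = act(z)             # activation a^[l]
    return a                   # a^[L] = prediction

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0, z)
identity = lambda z: z
layers = [
    (rng.standard_normal((3, 4)), np.zeros(3), relu),      # 4 → 3
    (rng.standard_normal((2, 3)), np.zeros(2), identity),  # 3 → 2
]
y = forward(rng.standard_normal(4), layers)
print(y.shape)  # (2,)
```

Each iteration consumes the previous layer's output — exactly the assembly-line hand-off described above.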
Forward propagation from scratch through a 3-layer network
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # Numerical stability
    return e / e.sum(axis=1, keepdims=True)

class NeuralNetwork:
    """3-layer NN for multi-class classification."""
    def __init__(self, input_size, hidden1, hidden2, output_size):
        # He initialisation: sqrt(2 / fan_in), suited to ReLU layers
        self.W1 = np.random.randn(input_size, hidden1) * np.sqrt(2 / input_size)
        self.b1 = np.zeros(hidden1)
        self.W2 = np.random.randn(hidden1, hidden2) * np.sqrt(2 / hidden1)
        self.b2 = np.zeros(hidden2)
        self.W3 = np.random.randn(hidden2, output_size) * np.sqrt(2 / hidden2)
        self.b3 = np.zeros(output_size)
        self.cache = {}  # Store activations for backprop

    def forward(self, X):
        """
        Forward pass: X → hidden1 → hidden2 → output
        Input:  X (batch_size × input_size)
        Output: probabilities (batch_size × output_size)
        """
        # Layer 1: Linear → ReLU
        self.cache['A0'] = X
        self.cache['Z1'] = X @ self.W1 + self.b1                 # (32, 128)
        self.cache['A1'] = relu(self.cache['Z1'])                # (32, 128)
        # Layer 2: Linear → ReLU
        self.cache['Z2'] = self.cache['A1'] @ self.W2 + self.b2  # (32, 64)
        self.cache['A2'] = relu(self.cache['Z2'])                # (32, 64)
        # Layer 3: Linear → Softmax (output layer)
        self.cache['Z3'] = self.cache['A2'] @ self.W3 + self.b3  # (32, 10)
        self.cache['A3'] = softmax(self.cache['Z3'])             # (32, 10) probabilities
        return self.cache['A3']

    def predict(self, X):
        probs = self.forward(X)
        return np.argmax(probs, axis=1)  # Class with highest probability

# Example: MNIST-like problem (28×28 images → 10 digit classes)
np.random.seed(42)
network = NeuralNetwork(input_size=784, hidden1=128, hidden2=64, output_size=10)

# Batch of 32 images (flattened to 784 features each)
X_batch = np.random.randn(32, 784)
probs = network.forward(X_batch)
preds = network.predict(X_batch)

print(f"Input shape: {X_batch.shape}")              # (32, 784)
print(f"Layer 1 out: {network.cache['A1'].shape}")  # (32, 128)
print(f"Layer 2 out: {network.cache['A2'].shape}")  # (32, 64)
print(f"Output probs: {probs.shape}")               # (32, 10)
print(f"Predictions: {preds}")                      # [4, 7, 2, ...] — digit class per image
print(f"Sum of probs: {probs[0].sum():.6f}")        # Must equal 1.0

# Count total parameters
def count_params(net):
    return (net.W1.size + net.b1.size
            + net.W2.size + net.b2.size
            + net.W3.size + net.b3.size)

print(f"Total parameters: {count_params(network):,}")
# 784×128 + 128 + 128×64 + 64 + 64×10 + 10 = 109,386
Why activations (cache) are stored during forward pass
Notice the self.cache dictionary in the code above — it stores every layer's Z and A values. This is intentional. Backpropagation needs these cached values to compute gradients: the gradient of layer l's weights depends on the activations from layer l−1 and on the error signal propagated back from layer l+1. Modern frameworks (PyTorch, TensorFlow) build a computational graph during the forward pass and traverse it in reverse during backpropagation.
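As a concrete sketch of why the cache matters: for a softmax output trained with cross-entropy loss, the last layer's weight gradient is built directly from the cached activations. The arrays below are stand-ins with the same shapes as the network above, and Y is an assumed batch of one-hot labels:

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 32
A2 = rng.standard_normal((batch, 64))         # cached layer-2 activations
Z3 = A2 @ rng.standard_normal((64, 10))       # output-layer pre-activations
A3 = np.exp(Z3 - Z3.max(axis=1, keepdims=True))
A3 /= A3.sum(axis=1, keepdims=True)           # cached softmax probabilities
Y = np.eye(10)[rng.integers(0, 10, batch)]    # one-hot labels (assumed)

# Softmax + cross-entropy: the error signal at the output is simply A3 - Y
dZ3 = (A3 - Y) / batch
dW3 = A2.T @ dZ3           # gradient w.r.t. W3 requires the cached A2
db3 = dZ3.sum(axis=0)
print(dW3.shape, db3.shape)  # (64, 10) (10,)
```

Without the cached A2, computing dW3 would require re-running the forward pass up to layer 2.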
Inference vs Training mode
During inference (making predictions), you only need the forward pass — no cache needed, no gradients computed. In PyTorch, with torch.no_grad(): turns off the autograd engine, so no computational graph is built and no activations are cached, which substantially reduces memory use and speeds up inference. Always use model.eval() together with torch.no_grad() for inference.
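A minimal sketch of that pattern, assuming a toy nn.Linear model stands in for a real trained network — the key observation is that nothing inside the no_grad block tracks gradients:

```python
import torch

model = torch.nn.Linear(4, 2)
model.eval()                        # inference mode: affects dropout / batch-norm layers
with torch.no_grad():               # autograd off: no graph built, no activations cached
    out = model(torch.randn(3, 4))

print(out.requires_grad)  # False — no gradients tracked
```

model.eval() and torch.no_grad() do different jobs: the first switches layer behaviour (dropout, batch norm) to evaluation mode, the second disables gradient tracking; inference needs both.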
Practice questions
- A network has input layer (4), hidden layer (3), output layer (2). Draw the forward pass computation. (Answer: X (4,) → Z1 = W1×X + b1, W1 is (3,4), Z1 is (3,) → A1 = relu(Z1) is (3,) → Z2 = W2×A1 + b2, W2 is (2,3), Z2 is (2,) → A2 = softmax(Z2) is (2,) — probability for 2 classes.)
- Why is the softmax function used in the output layer for multi-class classification? (Answer: Softmax converts raw scores (logits) into a probability distribution over K classes that sums to exactly 1. Each output is in (0,1) and interpretable as a class probability. The class with the highest softmax output is the prediction.)
- Why does softmax subtract the maximum — exp(z − max(z)) — for numerical stability? (Answer: exp(z) can overflow for large z values. Subtracting max(z) shifts all values to ≤ 0, making exp(z − max(z)) ≤ 1. The result is mathematically identical (the shift cancels between numerator and denominator) but avoids overflow/NaN.)
- During forward propagation, why are the intermediate activations (cache) stored? (Answer: Backpropagation requires the activations from the forward pass to compute gradients. The gradient of the loss w.r.t. weights in layer l depends on the activations from layer l-1. Without caching, you would need to recompute the entire forward pass during backprop — doubling computation.)
- torch.no_grad() is used during inference. What does it do and why? (Answer: Disables PyTorch's autograd engine — no computational graph is built, no gradients are tracked. Saves memory (no activation caching) and speeds up computation. Gradients are only needed during training (backprop). Always use it for model evaluation and deployment.)
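The shapes in the first question can be checked directly in NumPy — the weights below are random placeholders for the 4 → 3 → 2 network described there:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(4)                    # input layer: 4 features
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)

Z1 = W1 @ x + b1                              # (3,)
A1 = np.maximum(0, Z1)                        # ReLU, (3,)
Z2 = W2 @ A1 + b2                             # (2,)
A2 = np.exp(Z2 - Z2.max())
A2 /= A2.sum()                                # softmax, (2,)
print(A1.shape, A2.shape)  # (3,) (2,)
print(A2.sum())            # probabilities sum to 1
```

The shapes match the answer given above: Z1 and A1 are (3,), Z2 and A2 are (2,), and the softmax outputs form a valid two-class probability distribution.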
On LumiChats
Every time you send a message to LumiChats, a forward pass runs through hundreds of transformer layers — billions of the weighted-sum-and-activation computations described here. Understanding forward propagation directly explains how language models process your text.