Deep Learning & Neural Networks

Forward Propagation — How Neural Networks Make Predictions

The data journey from input to output — layer by layer through the network.


Definition

Forward propagation is the process by which input data flows through a neural network layer by layer to produce a prediction. Each layer applies a linear transformation (weights × input + bias) followed by a non-linear activation function. The output of each layer becomes the input to the next. This is the inference step — the same computation used both during training (to compute loss) and deployment (to make predictions). Understanding forward propagation is the foundation for understanding backpropagation and automatic differentiation.
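As a minimal sketch of the per-layer computation (all values here are made up for illustration), one layer is just a matrix product, a bias add, and an activation:

```python
import numpy as np

# One layer: linear transformation followed by a non-linear activation
x = np.array([0.5, -1.2, 3.0])          # input (3 features)
W = np.array([[0.1, 0.4, -0.2],
              [0.3, -0.1, 0.2]])        # weights: 2 neurons × 3 inputs
b = np.array([0.01, -0.02])             # bias, one per neuron

z = W @ x + b                           # linear step (pre-activation)
a = np.maximum(0, z)                    # ReLU activation → this layer's output
print(z, a)                             # z ≈ [-1.02, 0.85], a ≈ [0.0, 0.85]
```

The output `a` would then feed into the next layer's `W @ a + b`, and so on to the final prediction.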

Real-life analogy: The assembly line

A car assembly line has sequential stations — each station receives a partially built car, adds components (transformation), and passes it to the next. The input is raw metal; the output is a finished car. Neural network forward propagation works identically: each layer receives the previous layer's output, applies its transformation (weights × activations + bias through activation function), and passes the result forward. The input is raw data; the output is a prediction.

Layer-by-layer computation

Forward pass equations. For a network with L layers and input x, with a^[0] = x:

    z^[l] = W^[l] a^[l-1] + b^[l]    (pre-activation: weights × previous activations + bias)
    a^[l] = σ^[l](z^[l])             (activations after the non-linearity)
    ŷ = a^[L]                        (final output prediction)

where W^[l] is the weight matrix of layer l, b^[l] its bias vector, and σ^[l] its activation function.

Forward propagation from scratch through a 3-layer network

import numpy as np

def relu(z):    return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))
def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # Numerical stability
    return e / e.sum(axis=1, keepdims=True)

class NeuralNetwork:
    """3-layer NN for multi-class classification."""

    def __init__(self, input_size, hidden1, hidden2, output_size):
        # He initialisation — sqrt(2/fan_in) scaling, suited to ReLU layers
        self.W1 = np.random.randn(input_size, hidden1)  * np.sqrt(2/input_size)
        self.b1 = np.zeros(hidden1)
        self.W2 = np.random.randn(hidden1, hidden2)      * np.sqrt(2/hidden1)
        self.b2 = np.zeros(hidden2)
        self.W3 = np.random.randn(hidden2, output_size)  * np.sqrt(2/hidden2)
        self.b3 = np.zeros(output_size)
        self.cache = {}   # Store activations for backprop

    def forward(self, X):
        """
        Forward pass: X → hidden1 → hidden2 → output
        Input: X (batch_size × input_size)
        Output: probabilities (batch_size × output_size)
        """
        # Layer 1: Linear → ReLU
        self.cache['A0'] = X
        self.cache['Z1'] = X @ self.W1 + self.b1                 # (batch, hidden1)
        self.cache['A1'] = relu(self.cache['Z1'])                # (batch, hidden1)

        # Layer 2: Linear → ReLU
        self.cache['Z2'] = self.cache['A1'] @ self.W2 + self.b2  # (batch, hidden2)
        self.cache['A2'] = relu(self.cache['Z2'])                # (batch, hidden2)

        # Layer 3: Linear → Softmax (output layer)
        self.cache['Z3'] = self.cache['A2'] @ self.W3 + self.b3  # (batch, output)
        self.cache['A3'] = softmax(self.cache['Z3'])             # (batch, output) probabilities

        return self.cache['A3']

    def predict(self, X):
        probs = self.forward(X)
        return np.argmax(probs, axis=1)   # Class with highest probability

# Example: MNIST-like problem (28×28 images → 10 digit classes)
np.random.seed(42)
network = NeuralNetwork(input_size=784, hidden1=128, hidden2=64, output_size=10)

# Batch of 32 images (flattened to 784 features each)
X_batch = np.random.randn(32, 784)
probs   = network.forward(X_batch)
preds   = network.predict(X_batch)

print(f"Input shape:  {X_batch.shape}")     # (32, 784)
print(f"Layer 1 out:  {network.cache['A1'].shape}")  # (32, 128)
print(f"Layer 2 out:  {network.cache['A2'].shape}")  # (32, 64)
print(f"Output probs: {probs.shape}")        # (32, 10)
print(f"Predictions:  {preds}")              # [4, 7, 2, ...] — digit class per image
print(f"Sum of probs: {probs[0].sum():.6f}")  # Must equal 1.0

# Count total parameters
def count_params(net):
    return sum([
        net.W1.size + net.b1.size,
        net.W2.size + net.b2.size,
        net.W3.size + net.b3.size
    ])
print(f"Total parameters: {count_params(network):,}")
# 784×128 + 128 + 128×64 + 64 + 64×10 + 10 = 109,386

Why activations (cache) are stored during forward pass

Notice the self.cache dictionary in the code above — it stores every layer's Z and A values. This is intentional. Backpropagation needs these cached values to compute gradients: the gradient of layer l's weights depends on the activations from layer l−1, and the error signal arriving at layer l is propagated back from layer l+1. Modern frameworks (PyTorch, TensorFlow) do the same thing implicitly: they build a computational graph during the forward pass and traverse it in reverse during backpropagation.
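As a rough sketch of why the cache matters (this is one step of backprop, not the full derivation, and the error signal dZ3 is just random placeholder data here), the gradient for the output layer's weights directly reuses the cached A2 from the forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
A2  = rng.standard_normal((32, 64))   # cached activations from layer 2 (forward pass)
dZ3 = rng.standard_normal((32, 10))   # placeholder error signal at the output layer

# Gradient of the loss w.r.t. W3 needs the cached A2 — no cache, no gradient
dW3 = A2.T @ dZ3 / 32                 # (64, 10): same shape as W3
db3 = dZ3.mean(axis=0)                # (10,):   same shape as b3
print(dW3.shape, db3.shape)
```

Without the cache, computing dW3 would require re-running the forward pass up to layer 2, roughly doubling the cost of each training step.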

Inference vs Training mode

During inference (making predictions), you only need the forward pass — no cache, no gradients. In PyTorch, wrapping the call in with torch.no_grad(): turns off the autograd engine, so no computational graph is built and intermediate activations are not retained, which reduces memory use and speeds up inference. Call model.eval() as well, so layers like dropout and batch normalisation switch to their inference behaviour. Use model.eval() together with torch.no_grad() for evaluation and deployment.
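In plain NumPy terms, an inference-only forward pass can simply drop the caching — a sketch of the idea, using the same layer sizes as the network above (the weight values here are random, so the predictions themselves are meaningless):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def inference_forward(X, weights):
    """Inference-only pass: nothing cached, each intermediate is freed as we go."""
    A = X
    for W, b in weights[:-1]:
        A = relu(A @ W + b)       # old A is immediately garbage-collectable
    W, b = weights[-1]
    logits = A @ W + b            # softmax unnecessary when we only need argmax
    return logits.argmax(axis=1)

rng = np.random.default_rng(42)
weights = [(rng.standard_normal((784, 128)) * 0.05, np.zeros(128)),
           (rng.standard_normal((128, 64)) * 0.1,   np.zeros(64)),
           (rng.standard_normal((64, 10)) * 0.1,    np.zeros(10))]
preds = inference_forward(rng.standard_normal((32, 784)), weights)
print(preds.shape)                # one predicted class per input
```

This is exactly what torch.no_grad() buys you in PyTorch: the forward computation without the bookkeeping that training needs.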

Practice questions

  1. A network has input layer (4), hidden layer (3), output layer (2). Draw the forward pass computation. (Answer: X (4,) → Z1 = W1×X + b1, W1 is (3,4), Z1 is (3,) → A1 = relu(Z1) is (3,) → Z2 = W2×A1 + b2, W2 is (2,3), Z2 is (2,) → A2 = softmax(Z2) is (2,) — probability for 2 classes.)
  2. Why is the softmax function used in the output layer for multi-class classification? (Answer: Softmax converts raw scores (logits) into a probability distribution over K classes that sums to exactly 1. Each output is in (0,1) and interpretable as a class probability. The class with the highest softmax output is the prediction.)
  3. What is the numerical stability fix in softmax: exp(z - max(z))? (Answer: exp(z) can overflow for large z values. Subtracting max(z) shifts all values to ≤0, making exp(z-max) ≤1. The result is mathematically identical (the max cancels in numerator and denominator) but avoids overflow/NaN.)
  4. During forward propagation, why are the intermediate activations (cache) stored? (Answer: Backpropagation requires the activations from the forward pass to compute gradients. The gradient of the loss w.r.t. weights in layer l depends on the activations from layer l-1. Without caching, you would need to recompute the entire forward pass during backprop — doubling computation.)
  5. torch.no_grad() is used during inference. What does it do and why? (Answer: Disables PyTorch's autograd engine — no computational graph is built, no gradients are tracked. Saves memory (no activation caching) and speeds up computation. Gradients are only needed during training (backprop). Always use it for model evaluation and deployment.)
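The shape walk-through in question 1 can be checked directly (column-vector convention as in the answer; the values are random, only the shapes matter):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                # input layer: 4 features
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)   # hidden layer: 3 neurons
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)   # output layer: 2 classes

z1 = W1 @ x + b1                          # (3,)
a1 = np.maximum(0, z1)                    # (3,)  ReLU
z2 = W2 @ a1 + b2                         # (2,)
e  = np.exp(z2 - z2.max())                # stable softmax
a2 = e / e.sum()                          # (2,)  probabilities, sum to 1
print(a1.shape, a2.shape, a2.sum())
```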

On LumiChats

Every time you send a message to LumiChats, a forward pass runs through hundreds of transformer layers — billions of the weighted-sum-plus-activation computations described here. Understanding forward propagation directly explains how language models process your text.

