Loss Function
A loss function (also called cost function or objective function) measures the difference between a model's predictions and the ground truth. It produces a single scalar — 'how wrong the model is' — and gradient descent minimizes this value during training. The choice of loss function fundamentally shapes what the model learns to optimize.
How ML measures its own mistakes.
Category: Machine Learning
Mean Squared Error (MSE) — for regression
\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
MSE penalizes large errors heavily (a prediction error of 10 is penalized 100× more than an error of 1) because of the squaring. This makes it sensitive to outliers.
import numpy as np
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 2.3, 2.8, 4.5, 4.9])
# Mean Squared Error
mse = np.mean((y_pred - y_true) ** 2)
print(f"MSE: {mse:.4f}") # 0.0920
# Root Mean Squared Error (same units as target)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.4f}") # 0.3033
# Mean Absolute Error (less sensitive to outliers)
mae = np.mean(np.abs(y_pred - y_true))
print(f"MAE: {mae:.4f}") # 0.2600
# Huber Loss (MSE for small errors, MAE for large) — robust to outliers
delta = 1.0
errors = y_pred - y_true
huber = np.where(
    np.abs(errors) <= delta,
    0.5 * errors**2,                         # quadratic for |e| ≤ δ
    delta * np.abs(errors) - 0.5 * delta**2  # linear for |e| > δ
)
print(f"Huber: {np.mean(huber):.4f}") # 0.0400
Cross-Entropy — for classification
For binary classification, Binary Cross-Entropy (BCE) is:
\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\bigl[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\bigr]
For multi-class classification, Categorical Cross-Entropy:
\mathcal{L}_{\text{CCE}} = -\sum_{c=1}^{C} y_c \log(\hat{p}_c)
import numpy as np
import torch
import torch.nn.functional as F
# --- Binary Cross-Entropy (from scratch) ---
y_true = np.array([1, 0, 1, 1, 0], dtype=float)
y_pred = np.array([0.9, 0.1, 0.8, 0.3, 0.7]) # predicted probabilities
eps = 1e-8 # prevent log(0)
bce = -np.mean(y_true * np.log(y_pred + eps) + (1 - y_true) * np.log(1 - y_pred + eps))
print(f"BCE: {bce:.4f}") # 0.3790
# Intuition: predicting p=0.3 when y=1 gives loss = -log(0.3) ≈ 1.20 (large penalty)
# Predicting p=0.9 when y=1 gives loss = -log(0.9) ≈ 0.10 (small penalty)
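# Sanity check with PyTorch (a sketch, assuming probability inputs):
# F.binary_cross_entropy takes predicted probabilities and float targets,
# and should reproduce the from-scratch value above (modulo eps)
bce_torch = F.binary_cross_entropy(torch.tensor(y_pred), torch.tensor(y_true))
print(f"BCE (PyTorch): {bce_torch.item():.4f}")  # 0.5684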
# --- Multi-class Cross-Entropy (PyTorch) ---
# Raw logits (before softmax) for 3 classes, batch of 4
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.1, 3.0, 0.2],
                       [0.1, 0.1, 3.5],
                       [1.5, 0.5, 0.5]])
labels = torch.tensor([0, 1, 2, 0]) # true class indices
# CrossEntropyLoss = log_softmax + NLLLoss in one step
loss = F.cross_entropy(logits, labels)
print(f"Cross-entropy: {loss.item():.4f}") # ~0.2012
Next-token prediction: the LLM training loss
Language models are trained by predicting the next token at every position in the sequence. For a sequence of n tokens, the loss is the average cross-entropy over all positions:
\mathcal{L}_{\text{LM}} = -\frac{1}{n} \sum_{t=1}^{n} \log P(x_t \mid x_{<t})
Perplexity is the standard metric derived from this loss:
\text{Perplexity} = e^{\mathcal{L}_{\text{LM}}}
Perplexity benchmarks: GPT-2 (2019): ~35 perplexity on WikiText-103. GPT-3 (2020): ~20. LLaMA 3 8B (2024): ~9. Better models become more "certain" about the next token.
KL Divergence and RLHF losses
KL Divergence measures how different one probability distribution is from another:
D_{\text{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}
In RLHF training, a KL penalty is added to prevent the policy model from deviating too far from the base model while maximizing reward:
\mathcal{L}_{\text{RLHF}} = -\mathbb{E}[r(x,y)] + \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})
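A minimal sketch of the next-token loss and perplexity in PyTorch; the logits here are random stand-ins rather than real model output, and the shapes are toy-sized:
import torch
import torch.nn.functional as F
# Hypothetical setup: vocab of 5 tokens, sequence of 4 positions,
# raw logits over the vocab at each position (random stand-ins)
torch.manual_seed(0)
logits = torch.randn(4, 5)            # [seq_len, vocab_size]
targets = torch.tensor([1, 4, 0, 2])  # true next token at each position
# LM loss = average cross-entropy over all positions
lm_loss = F.cross_entropy(logits, targets)
# Perplexity = exp(loss); lower means the model is more confident
perplexity = torch.exp(lm_loss)
print(f"LM loss: {lm_loss.item():.4f}, perplexity: {perplexity.item():.2f}")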
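And a sketch of KL divergence plus the RLHF-style penalty; the distributions, reward value, and β below are made-up numbers for illustration:
import torch
import torch.nn.functional as F
# Two example distributions over 4 outcomes (hypothetical values)
p = torch.tensor([0.4, 0.3, 0.2, 0.1])      # P, e.g. the policy
q = torch.tensor([0.25, 0.25, 0.25, 0.25])  # Q, e.g. the reference model
# KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x))
kl = torch.sum(p * torch.log(p / q))
print(f"KL(P||Q): {kl.item():.4f}")  # ~0.1064
# PyTorch equivalent: F.kl_div takes log-probs of Q as input, P as target
kl_torch = F.kl_div(q.log(), p, reduction="sum")
print(f"KL (PyTorch): {kl_torch.item():.4f}")  # ~0.1064
# Toy RLHF objective: maximize reward while staying close to the reference
reward = torch.tensor(1.7)  # hypothetical reward-model score
beta = 0.1                  # hypothetical KL penalty weight
rlhf_loss = -reward + beta * kl
print(f"RLHF loss: {rlhf_loss.item():.4f}")  # -1.6894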