
Loss Function

How ML measures its own mistakes.


Definition

A loss function (also called cost function or objective function) measures the difference between a model's predictions and the ground truth. It produces a single scalar — 'how wrong the model is' — and gradient descent minimizes this value during training. The choice of loss function fundamentally shapes what the model learns to optimize.

Mean Squared Error (MSE) — for regression

MSE = (1/n) · Σ_i (ŷ_i − y_i)² — the average of the squared differences between predictions and true values.

MSE penalizes large errors heavily (a prediction error of 10 is penalized 100× more than an error of 1) because of the squaring. This makes it sensitive to outliers.

Common regression loss functions

import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 2.3, 2.8, 4.5, 4.9])

# Mean Squared Error
mse = np.mean((y_pred - y_true) ** 2)
print(f"MSE:  {mse:.4f}")   # 0.0800

# Root Mean Squared Error (same units as target)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.4f}")  # 0.2828

# Mean Absolute Error (less sensitive to outliers)
mae = np.mean(np.abs(y_pred - y_true))
print(f"MAE:  {mae:.4f}")   # 0.2400

# Huber Loss (MSE for small errors, MAE for large) — robust to outliers
delta = 1.0
errors = y_pred - y_true
huber = np.where(
    np.abs(errors) <= delta,
    0.5 * errors**2,                      # quadratic for |e| ≤ δ
    delta * np.abs(errors) - 0.5 * delta**2  # linear for |e| > δ
)
print(f"Huber: {np.mean(huber):.4f}")  # 0.0400

Cross-Entropy — for classification

For binary classification, Binary Cross-Entropy (BCE) is:

BCE = −(1/n) · Σ_i [y_i · log(p̂_i) + (1 − y_i) · log(1 − p̂_i)]

where y_i is the true label (0 or 1) and p̂_i is the predicted probability. It punishes confident wrong predictions heavily.

For multi-class classification, Categorical Cross-Entropy is:

CE = −Σ_{c=1}^{C} y_c · log(p̂_c)

where C is the number of classes and y_c is 1 for the true class, 0 otherwise. It is combined with a softmax output layer.

Cross-entropy from scratch and with PyTorch

import numpy as np
import torch
import torch.nn.functional as F

# --- Binary Cross-Entropy (from scratch) ---
y_true = np.array([1, 0, 1, 1, 0], dtype=float)
y_pred = np.array([0.9, 0.1, 0.8, 0.3, 0.7])   # predicted probabilities

eps = 1e-8  # prevent log(0)
bce = -np.mean(y_true * np.log(y_pred + eps) + (1 - y_true) * np.log(1 - y_pred + eps))
print(f"BCE: {bce:.4f}")  # 0.5684

# Intuition: predicting p=0.3 when y=1 gives loss = -log(0.3) ≈ 1.20 (large penalty)
# Predicting p=0.9 when y=1 gives loss = -log(0.9) ≈ 0.10 (small penalty)

# --- Multi-class Cross-Entropy (PyTorch) ---
# Raw logits (before softmax) for 3 classes, batch of 4
logits = torch.tensor([[2.0, 1.0, 0.1],
                        [0.1, 3.0, 0.2],
                        [0.1, 0.1, 3.5],
                        [1.5, 0.5, 0.5]])
labels = torch.tensor([0, 1, 2, 0])  # true class indices

# CrossEntropyLoss = log_softmax + NLLLoss in one step
loss = F.cross_entropy(logits, labels)
print(f"Cross-entropy: {loss.item():.4f}")  # ~0.2857

Next-token prediction: the LLM training loss

Language models are trained by predicting the next token at every position in the sequence. For a sequence of n tokens, the loss is the average cross-entropy over all positions:

L = −(1/n) · Σ_{i=1}^{n} log P(x_i | x_1, …, x_{i−1})

This is the average negative log-probability of each true next token given all previous tokens. Minimizing it is equivalent to maximizing the model's predicted probability of the actual text.

Perplexity is the standard metric derived from this loss:

Perplexity = e^L

A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options. Lower is better.
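The loss-to-perplexity computation can be sketched in NumPy for a toy 4-token vocabulary (the logits and token ids below are made up for illustration):

```python
import numpy as np

# Toy model outputs: one row of logits per position, 4-token vocabulary
logits = np.array([
    [2.0, 0.5, 0.1, 0.1],
    [0.2, 3.0, 0.1, 0.4],
    [0.3, 0.2, 2.5, 0.1],
])
next_tokens = np.array([0, 1, 2])  # the actual next token at each position

# Numerically stable log-softmax
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Average negative log-probability of the true next tokens
nll = -log_probs[np.arange(len(next_tokens)), next_tokens].mean()
perplexity = np.exp(nll)
print(f"loss: {nll:.4f}  perplexity: {perplexity:.2f}")
```

In a real LLM the logits come from the transformer and the vocabulary has tens of thousands of entries, but the loss is exactly this average.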

Perplexity benchmarks

GPT-2 (2019): ~35 perplexity on WikiText-103. GPT-3 (2020): ~20. LLaMA 3 8B (2024): ~9. Better models become more "certain" about the next token.

KL Divergence and RLHF losses

KL Divergence measures how different one probability distribution P is from a reference distribution Q:

D_KL(P ‖ Q) = Σ_x P(x) · log(P(x) / Q(x))

KL divergence is not symmetric — D_KL(P ‖ Q) ≠ D_KL(Q ‖ P) in general — and it is zero exactly when P = Q.
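Both properties are easy to check in NumPy, using two made-up three-outcome distributions:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.5, 0.3, 0.2])

def kl(p, q):
    """D_KL(p || q) = sum over x of p(x) * log(p(x) / q(x))."""
    return float(np.sum(p * np.log(p / q)))

print(f"D_KL(P||Q) = {kl(P, Q):.4f}")
print(f"D_KL(Q||P) = {kl(Q, P):.4f}")  # a different value: KL is asymmetric
print(f"D_KL(P||P) = {kl(P, P):.4f}")  # zero when the distributions match
```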

In RLHF training, a KL penalty is added so that the policy model cannot drift too far from the base (reference) model while maximizing reward:

objective = E[r(x, y)] − β · D_KL(π_policy ‖ π_ref)

β controls the tradeoff between reward maximization and staying close to the reference. Without the KL term, models "reward hack": they produce text that scores highly under the reward model but isn't actually useful.
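A minimal sketch of the penalized objective for a single sampled response — the reward, per-token log-probabilities, and β below are all made-up numbers, and the sequence-level KL is approximated by the summed log-ratio between policy and reference:

```python
import numpy as np

beta = 0.1    # KL coefficient (hypothetical value)
reward = 2.1  # scalar score from the reward model (made up)

# Per-token log-probs of the sampled response under the policy being
# trained and under the frozen reference model (made up for illustration)
logp_policy = np.array([-0.5, -1.2, -0.3, -0.8])
logp_ref = np.array([-0.9, -1.0, -0.7, -0.8])

# Sequence-level KL estimate: sum of per-token log-ratios
kl_penalty = np.sum(logp_policy - logp_ref)

objective = reward - beta * kl_penalty
print(f"KL penalty: {kl_penalty:.2f}  objective: {objective:.2f}")
```

If the policy assigns its response much higher probability than the reference would, the penalty grows and eats into the reward.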

Practice questions

  1. Why is cross-entropy loss preferred over MSE for classification problems? (Answer: MSE with sigmoid output creates a loss landscape with near-zero gradients when predictions are near 0 or 1 — the derivative of (y - sigmoid(z))² w.r.t. z approaches zero in both correct and incorrect cases. Cross-entropy loss with sigmoid has gradient = (sigmoid(z) - y) — proportional to prediction error, no saturation. Cross-entropy learning is fast even for confidently wrong predictions; MSE learning slows dramatically.)
  2. What is the difference between binary cross-entropy and categorical cross-entropy? (Answer: Binary CE (BCELoss): for binary classification (0 or 1 output). Uses sigmoid activation. Loss = -y·log(p) - (1-y)·log(1-p). Categorical CE (CrossEntropyLoss): for multi-class (K classes). Internally applies log-softmax and NLLLoss. Loss = -log(p_correct_class). BCELoss can also handle multi-label classification where each example can have multiple labels simultaneously — each output is an independent binary classification.)
  3. Focal Loss was designed for object detection (RetinaNet). What problem does it solve? (Answer: In object detection, 99%+ of anchor boxes contain background — the model quickly learns to predict 'background' with high confidence, achieving low loss without learning to detect objects. Standard CE loss is dominated by easy negatives. Focal loss: L = -(1-p)^γ · log(p). The factor (1-p)^γ down-weights easy examples (high p) and focuses training on hard, misclassified examples. With γ=2, an example correctly predicted with p=0.9 contributes 0.01× the loss of a hard example.)
  4. What is label smoothing and how does it prevent overconfidence? (Answer: Instead of one-hot targets [0, 0, 1, 0], use smoothed targets [ε/K, ε/K, 1-ε+ε/K, ε/K] with ε=0.1. The model can never achieve loss=0 (which would require p_correct=1.0 exactly). This prevents the model from becoming overconfident (pushing logits to ±∞ which saturates softmax). Improves calibration (predicted probabilities better reflect actual correctness rates) and often improves generalisation by 0.5–2% on classification tasks.)
  5. Triplet loss is used in face recognition (FaceNet). What does it optimise? (Answer: Triplet loss: given an anchor (a person's face), positive (same person, different photo), and negative (different person): L = max(0, ||f(a)-f(p)||² - ||f(a)-f(n)||² + margin). It optimises an embedding space where: same-person faces are close (small anchor-positive distance), different-person faces are far (large anchor-negative distance) by at least 'margin'. This directly optimises for the downstream similarity search task rather than cross-entropy on softmax logits.)
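The label-smoothing targets from question 4 can be verified directly in NumPy (ε = 0.1 and K = 4 classes here, chosen for illustration):

```python
import numpy as np

K, eps = 4, 0.1
one_hot = np.array([0.0, 0.0, 1.0, 0.0])      # hard target for class 2
smoothed = one_hot * (1 - eps) + eps / K      # [0.025, 0.025, 0.925, 0.025]

probs = np.array([0.01, 0.01, 0.97, 0.01])    # a very confident model
ce_hard = -np.sum(one_hot * np.log(probs))
ce_smooth = -np.sum(smoothed * np.log(probs))
print(f"hard-target CE:     {ce_hard:.4f}")
print(f"smoothed-target CE: {ce_smooth:.4f}")
# The smoothed loss stays above zero even for near-perfect predictions,
# so the model gains nothing by pushing logits toward ±infinity.
```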
