A loss function (also called a cost or objective function) quantifies the discrepancy between the network's predictions and the true targets. The choice of loss function defines WHAT the network optimises: cross-entropy for classification (binary cross-entropy for two-class outputs, categorical cross-entropy for multi-class), mean squared error for regression. The loss function's gradient drives backpropagation — it is the signal that teaches the network. The wrong loss function yields a well-optimised model that solves the wrong problem.
Real-life analogy: The score on a test
A loss function is the exam grade that tells a student how wrong their answers are. MSE is like an exam where each point off is squared — small mistakes are penalised gently, but major mistakes are penalised severely. Cross-entropy is like a logarithmic scoring rule — saying '70% confident' on a wrong answer is penalised less than saying '99% confident' on that same wrong answer. The student (network) learns to improve their confidence calibration.
Classification losses
Binary Cross-Entropy (BCE): for binary classification (0 or 1 output). Penalises confident wrong predictions logarithmically. If y=1 and p̂=0.01: loss = -log(0.01) = 4.6 (huge penalty). If y=1 and p̂=0.99: loss = -log(0.99) ≈ 0.01 (tiny penalty).
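These numbers can be checked by hand with plain Python — a minimal sketch of the BCE formula for a single example, no framework needed:

```python
import math

def bce(y_true: float, p_hat: float) -> float:
    """Binary cross-entropy for one example: -[y·log(p̂) + (1-y)·log(1-p̂)]."""
    return -(y_true * math.log(p_hat) + (1 - y_true) * math.log(1 - p_hat))

print(f"{bce(1.0, 0.01):.3f}")  # confident and wrong → 4.605
print(f"{bce(1.0, 0.99):.3f}")  # confident and right → 0.010
```

The asymmetry is the point: being confidently wrong costs roughly 450× more than being confidently right.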
All common deep learning loss functions
import torch
import torch.nn as nn
import torch.nn.functional as F
# ── CLASSIFICATION LOSSES ──
# Binary Cross-Entropy: sigmoid output + BCE (2-class)
bce_loss = nn.BCELoss() # Expects sigmoid output (0-1)
bce_logit = nn.BCEWithLogitsLoss() # Combines sigmoid + BCE (numerically stable)
y_pred = torch.tensor([0.9, 0.2, 0.8, 0.1]) # Sigmoid output
y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(f"BCE loss: {bce_loss(y_pred, y_true):.4f}")
# Categorical Cross-Entropy: for multi-class (K classes)
ce_loss = nn.CrossEntropyLoss() # Takes raw logits (NOT softmax output)
logits = torch.tensor([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]) # Raw scores
y_classes = torch.tensor([0, 1]) # True class indices
print(f"Cross-entropy loss: {ce_loss(logits, y_classes):.4f}")
# IMPORTANT: nn.CrossEntropyLoss = LogSoftmax + NLLLoss internally
# Never apply softmax BEFORE CrossEntropyLoss — double softmax!
# NLL Loss (Negative Log Likelihood): pair with log_softmax
nll = nn.NLLLoss()
log_probs = F.log_softmax(logits, dim=1)
print(f"NLL loss: {nll(log_probs, y_classes):.4f}") # Same as CE
# Label Smoothing: prevents overconfident predictions
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)
# Replaces one-hot target [0,1,0] with smoothed [0.033, 0.933, 0.033]
# Reduces overfitting, improves calibration — used in modern vision models
# ── REGRESSION LOSSES ──
y_pred_reg = torch.tensor([2.5, 0.0, 2.0, 8.0])
y_true_reg = torch.tensor([3.0, -0.5, 2.0, 7.0])
mse = nn.MSELoss()(y_pred_reg, y_true_reg) # (y - ŷ)² mean
mae = nn.L1Loss()(y_pred_reg, y_true_reg) # |y - ŷ| mean
huber = nn.HuberLoss(delta=1.0)(y_pred_reg, y_true_reg) # Smooth L1: MSE for small errors, MAE for large
print(f"MSE: {mse:.4f} MAE: {mae:.4f} Huber: {huber:.4f}")
# ── SPECIALISED LOSSES ──
# Focal Loss: for extreme class imbalance (object detection)
# Reduces loss for easy examples, focuses on hard ones
# L = -α(1-p)^γ log(p) — γ>0 down-weights easy examples
# Triplet Loss: for metric learning (face recognition, embeddings)
triplet = nn.TripletMarginLoss(margin=1.0)
anchor = torch.randn(32, 128)
positive = anchor + 0.1 * torch.randn(32, 128) # Similar
negative = torch.randn(32, 128) # Different
print(f"Triplet loss: {triplet(anchor, positive, negative):.4f}")
# KL Divergence: measures how one distribution diverges from another (VAEs, distillation)
kl = nn.KLDivLoss(reduction='batchmean') # Expects log-probabilities as input, probabilities as target
log_q = F.log_softmax(torch.randn(4, 5), dim=1)
p = F.softmax(torch.randn(4, 5), dim=1)
print(f"KL loss: {kl(log_q, p):.4f}")
# Used in VAEs: KL(q(z|x) || p(z)) regularises the latent space
Loss function landscape and convergence
| Loss function | Use case | Output activation | Sensitive to outliers? |
|---|---|---|---|
| MSE (L2) | Regression | Linear | Yes — squares errors |
| MAE (L1) | Regression (robust) | Linear | No — linear errors |
| Huber / Smooth L1 | Regression + outliers | Linear | Partially — switches at delta |
| Binary Cross-Entropy | Binary classification | Sigmoid | No |
| Categorical Cross-Entropy | Multi-class classification | None — pass raw logits | No |
| Focal Loss | Imbalanced classification | Sigmoid | No — down-weights easy examples |
| Triplet Loss | Metric learning, embeddings | L2-normalised | Partially |
| KL Divergence | Generative models (VAE) | Softmax / Gaussian | No |
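Focal loss from the table has no built-in PyTorch module, but it can be sketched in a few lines on top of BCE. The function name and the α/γ defaults below are illustrative (the RetinaNet defaults), not an official API:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: -α_t (1 - p_t)^γ log(p_t), down-weighting easy examples."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.tensor([4.0, -4.0, 0.1])   # two easy examples, one hard
targets = torch.tensor([1.0, 0.0, 1.0])
print(f"Focal: {focal_loss(logits, targets):.4f}")
print(f"BCE:   {F.binary_cross_entropy_with_logits(logits, targets):.4f}")
```

With γ=0 and α=0.5 the modulating factor vanishes and the loss reduces to half of plain BCE; raising γ shrinks the contribution of the two easy examples so the hard one dominates.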
Critical: Never apply softmax before nn.CrossEntropyLoss
PyTorch's nn.CrossEntropyLoss internally applies log-softmax + NLLLoss. If you pass softmax-activated outputs, softmax is effectively applied twice — the model trains on the wrong objective and performance suffers badly. Always pass raw logits (pre-softmax) to CrossEntropyLoss. Apply log_softmax explicitly only when pairing with NLLLoss, and softmax only when you need probabilities for display at inference.
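The mistake is easy to demonstrate: feeding softmax outputs into nn.CrossEntropyLoss raises no error, it just silently computes a different objective. A minimal sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 3)               # raw model outputs
targets = torch.randint(0, 3, (8,))
ce = nn.CrossEntropyLoss()

correct = ce(logits, targets)                    # right: raw logits
wrong = ce(F.softmax(logits, dim=1), targets)    # wrong: softmax applied twice
print(f"correct: {correct:.4f}  double-softmax: {wrong:.4f}")
```

The double-softmax version squashes all inputs toward a uniform distribution, so its loss barely responds to how good the logits actually are.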
Practice questions
- True label y=1, predicted probability p̂=0.01. Compute BCE loss for this example. (Answer: BCE = -[1×log(0.01) + 0×log(0.99)] = -log(0.01) = -(-4.605) = 4.605. Very high — confident and wrong.)
- Why does cross-entropy penalise confident wrong predictions so harshly? (Answer: The log function: -log(p̂) → ∞ as p̂ → 0. If the model says 99% confidence on the wrong class (p̂_correct = 0.01), loss = -log(0.01) = 4.6. If it says 50% (p̂_correct = 0.5), loss = 0.69. This forces the model to be well-calibrated, not just correct.)
- When would you use Huber loss instead of MSE? (Answer: When your regression targets contain outliers. MSE squares the error — an outlier with error 100 contributes 10,000 to MSE. Huber loss uses squared error for small errors (|e| < δ) but linear error for large ones — drastically reducing the outlier's influence.)
- What is label smoothing and why does it help? (Answer: Replace hard one-hot labels [0,1,0] with soft labels [ε/K, 1-ε+ε/K, ε/K]. Prevents the model from becoming overconfident (predicting probability 1.0 for one class). Reduces overfitting, improves test accuracy and calibration. Common in image classification (ε=0.1).)
- MSE vs Cross-Entropy for classification — why never use MSE for classification? (Answer: MSE creates a non-convex loss surface for sigmoid/softmax outputs — prone to getting stuck in local minima. Cross-entropy with sigmoid/softmax creates a convex loss surface (for logistic regression). Also, MSE saturates gradient near 0 and 1 with sigmoid — very slow learning. Cross-entropy gradient is proportional to error with no saturation.)
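The outlier arithmetic in the Huber question can be checked directly; the numbers below are illustrative, not from any dataset:

```python
import torch
import torch.nn as nn

preds   = torch.tensor([1.0, 2.0, 3.0, 103.0])  # last prediction is off by 100
targets = torch.tensor([1.1, 1.9, 3.2, 3.0])

mse = nn.MSELoss()(preds, targets)              # outlier contributes 100² = 10,000
huber = nn.HuberLoss(delta=1.0)(preds, targets) # outlier contributes ~99.5 (linear regime)
print(f"MSE: {mse:.2f}  Huber: {huber:.2f}")
```

One bad point pushes MSE into the thousands while Huber stays small, which is why Huber-trained regressors are far less distorted by corrupted targets.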
On LumiChats
When you interact with LumiChats, the underlying model was trained to minimise cross-entropy over its next-token predictions across trillions of examples. Understanding the loss function explains why LLMs can be confidently wrong — cross-entropy optimises probability calibration, not just accuracy.
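At token scale the objective is the same CrossEntropyLoss shown earlier, just over a vocabulary of tokens instead of a handful of classes. A toy next-token step with a made-up five-token vocabulary:

```python
import torch
import torch.nn as nn

vocab = ["the", "cat", "sat", "on", "mat"]           # toy vocabulary (illustrative)
logits = torch.tensor([[2.0, 0.5, 0.1, 0.1, 1.5]])   # model's scores for the next token
target = torch.tensor([4])                           # true next token: "mat"
loss = nn.CrossEntropyLoss()(logits, target)
print(f"Next-token cross-entropy: {loss.item():.4f}")
```

Training nudges the logit for "mat" up and the others down; summed over every position in trillions of tokens, this single scalar is the entire training signal.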