A loss function (also called a cost or objective function) quantifies the discrepancy between the network's predictions and the true targets. The choice of loss function defines WHAT the network optimises: cross-entropy for classification (binary cross-entropy for two-class outputs, categorical cross-entropy for multi-class), mean squared error for regression. The loss function's gradient drives backpropagation — it is the signal that teaches the network. The wrong loss function yields a well-optimised model that solves the wrong problem.
Real-life analogy: The score on a test
A loss function is the exam grade that tells a student how wrong their answers are. MSE is like an exam where each point off is squared — small mistakes are penalised gently, but major mistakes are penalised severely. Cross-entropy is like a logarithmic scoring rule — saying '70% confident' on a wrong answer is penalised less than saying '99% confident' on that same wrong answer. The student (network) learns to improve their confidence calibration.
Classification losses
Binary Cross-Entropy (BCE): for binary classification (0 or 1 output). Penalises confident wrong predictions logarithmically. If y=1 and p̂=0.01: loss = -log(0.01) = 4.6 (huge penalty). If y=1 and p̂=0.99: loss = -log(0.99) ≈ 0.01 (tiny penalty).
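These numbers can be checked by hand with plain Python — a minimal sketch of the BCE formula for a single example, no framework needed:

```python
import math

def bce(y_true: float, p_hat: float) -> float:
    """Binary cross-entropy for one example: -[y·log(p̂) + (1-y)·log(1-p̂)]."""
    return -(y_true * math.log(p_hat) + (1 - y_true) * math.log(1 - p_hat))

print(f"{bce(1.0, 0.01):.3f}")  # confident and wrong → 4.605
print(f"{bce(1.0, 0.99):.3f}")  # confident and right → 0.010
```

The asymmetry is the point: being confidently wrong costs roughly 450× more than being confidently right.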
All common deep learning loss functions
import torch
import torch.nn as nn
import torch.nn.functional as F
# ── CLASSIFICATION LOSSES ──
# Binary Cross-Entropy: sigmoid output + BCE (2-class)
bce_loss = nn.BCELoss() # Expects sigmoid output (0-1)
bce_logit = nn.BCEWithLogitsLoss() # Combines sigmoid + BCE (numerically stable)
y_pred = torch.tensor([0.9, 0.2, 0.8, 0.1]) # Sigmoid output
y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(f"BCE loss: {bce_loss(y_pred, y_true):.4f}")
# Categorical Cross-Entropy: for multi-class (K classes)
ce_loss = nn.CrossEntropyLoss() # Takes raw logits (NOT softmax output)
logits = torch.tensor([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]) # Raw scores
y_classes = torch.tensor([0, 1]) # True class indices
print(f"Cross-entropy loss: {ce_loss(logits, y_classes):.4f}")
# IMPORTANT: nn.CrossEntropyLoss = LogSoftmax + NLLLoss internally
# Never apply softmax BEFORE CrossEntropyLoss — double softmax!
# NLL Loss (Negative Log Likelihood): pair with log_softmax
nll = nn.NLLLoss()
log_probs = F.log_softmax(logits, dim=1)
print(f"NLL loss: {nll(log_probs, y_classes):.4f}") # Same as CE
# Label Smoothing: prevents overconfident predictions
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)
# Replaces one-hot target [0,1,0] with smoothed [0.033, 0.933, 0.033]
# Reduces overfitting, improves calibration — used in modern vision models
# ── REGRESSION LOSSES ──
y_pred_reg = torch.tensor([2.5, 0.0, 2.0, 8.0])
y_true_reg = torch.tensor([3.0, -0.5, 2.0, 7.0])
mse = nn.MSELoss()(y_pred_reg, y_true_reg) # (y - ŷ)² mean
mae = nn.L1Loss()(y_pred_reg, y_true_reg) # |y - ŷ| mean
huber = nn.HuberLoss(delta=1.0)(y_pred_reg, y_true_reg) # Smooth L1: MSE for small errors, MAE for large
print(f"MSE: {mse:.4f} MAE: {mae:.4f} Huber: {huber:.4f}")
# ── SPECIALISED LOSSES ──
# Focal Loss: for extreme class imbalance (object detection)
# Reduces loss for easy examples, focuses on hard ones
# L = -α(1-p)^γ log(p) — γ>0 down-weights easy examples
# Triplet Loss: for metric learning (face recognition, embeddings)
triplet = nn.TripletMarginLoss(margin=1.0)
anchor = torch.randn(32, 128)
positive = anchor + 0.1 * torch.randn(32, 128) # Similar
negative = torch.randn(32, 128) # Different
print(f"Triplet loss: {triplet(anchor, positive, negative):.4f}")
# KL Divergence: measures how one distribution diverges from another (VAEs, distillation)
kl = nn.KLDivLoss(reduction='batchmean') # Expects log-probabilities as input, probabilities as target
log_q = F.log_softmax(torch.randn(4, 5), dim=1)
p = F.softmax(torch.randn(4, 5), dim=1)
print(f"KL loss: {kl(log_q, p):.4f}")
# Used in VAEs: KL(q(z|x) || p(z)) regularises the latent space
Loss function landscape and convergence
| Loss function | Use case | Output activation | Sensitive to outliers? |
|---|---|---|---|
| MSE (L2) | Regression | Linear | Yes — squares errors |
| MAE (L1) | Regression (robust) | Linear | No — linear errors |
| Huber / Smooth L1 | Regression + outliers | Linear | Partially — switches at delta |
| Binary Cross-Entropy | Binary classification | Sigmoid | No |
| Categorical Cross-Entropy | Multi-class classification | None — pass raw logits | No |
| Focal Loss | Imbalanced classification | Sigmoid | No — down-weights easy examples |
| Triplet Loss | Metric learning, embeddings | L2-normalised | Partially |
| KL Divergence | Generative models (VAE) | Softmax / Gaussian | No |
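Focal loss from the table has no built-in PyTorch module, but it can be sketched in a few lines on top of BCE. The function name and the α/γ defaults below are illustrative (the RetinaNet defaults), not an official API:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: -α_t (1 - p_t)^γ log(p_t), down-weighting easy examples."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.tensor([4.0, -4.0, 0.1])   # two easy examples, one hard
targets = torch.tensor([1.0, 0.0, 1.0])
print(f"Focal: {focal_loss(logits, targets):.4f}")
print(f"BCE:   {F.binary_cross_entropy_with_logits(logits, targets):.4f}")
```

With γ=0 and α=0.5 the modulating factor vanishes and the loss reduces to half of plain BCE; raising γ shrinks the contribution of the two easy examples so the hard one dominates.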
Critical: Never apply softmax before nn.CrossEntropyLoss
PyTorch's nn.CrossEntropyLoss internally applies log-softmax + NLLLoss. If you pass softmax-activated outputs, softmax is effectively applied twice — the model trains on the wrong objective and performance suffers badly. Always pass raw logits (pre-softmax) to CrossEntropyLoss. Apply log_softmax explicitly only when pairing with NLLLoss, and softmax only when you need probabilities for display at inference.
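The mistake is easy to demonstrate: feeding softmax outputs into nn.CrossEntropyLoss raises no error, it just silently computes a different objective. A minimal sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 3)               # raw model outputs
targets = torch.randint(0, 3, (8,))
ce = nn.CrossEntropyLoss()

correct = ce(logits, targets)                    # right: raw logits
wrong = ce(F.softmax(logits, dim=1), targets)    # wrong: softmax applied twice
print(f"correct: {correct:.4f}  double-softmax: {wrong:.4f}")
```

The double-softmax version squashes all inputs toward a uniform distribution, so its loss barely responds to how good the logits actually are.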
Practice questions
- True label y=1, predicted probability p̂=0.01. Compute BCE loss for this example. (Answer: BCE = -[1×log(0.01) + 0×log(0.99)] = -log(0.01) = -(-4.605) = 4.605. Very high — confident and wrong.)
- Why does cross-entropy penalise confident wrong predictions so harshly? (Answer: The log function: -log(p̂) → ∞ as p̂ → 0. If the model says 99% confidence on the wrong class (p̂_correct = 0.01), loss = -log(0.01) = 4.6. If it says 50% (p̂_correct = 0.5), loss = 0.69. This forces the model to be well-calibrated, not just correct.)
- When would you use Huber loss instead of MSE? (Answer: When your regression targets contain outliers. MSE squares the error — an outlier with error 100 contributes 10,000 to MSE. Huber loss uses squared error for small errors (|e| < δ) but linear error for large ones — drastically reducing the outlier's influence.)
- What is label smoothing and why does it help? (Answer: Replace hard one-hot labels [0,1,0] with soft labels [ε/K, 1-ε+ε/K, ε/K]. Prevents the model from becoming overconfident (predicting probability 1.0 for one class). Reduces overfitting, improves test accuracy and calibration. Common in image classification (ε=0.1).)
- MSE vs Cross-Entropy for classification — why never use MSE for classification? (Answer: MSE creates a non-convex loss surface for sigmoid/softmax outputs — prone to getting stuck in local minima. Cross-entropy with sigmoid/softmax creates a convex loss surface (for logistic regression). Also, MSE saturates gradient near 0 and 1 with sigmoid — very slow learning. Cross-entropy gradient is proportional to error with no saturation.)
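The outlier arithmetic in the Huber question can be checked directly; the numbers below are illustrative, not from any dataset:

```python
import torch
import torch.nn as nn

preds   = torch.tensor([1.0, 2.0, 3.0, 103.0])  # last prediction is off by 100
targets = torch.tensor([1.1, 1.9, 3.2, 3.0])

mse = nn.MSELoss()(preds, targets)              # outlier contributes 100² = 10,000
huber = nn.HuberLoss(delta=1.0)(preds, targets) # outlier contributes ~99.5 (linear regime)
print(f"MSE: {mse:.2f}  Huber: {huber:.2f}")
```

One bad point pushes MSE into the thousands while Huber stays small, which is why Huber-trained regressors are far less distorted by corrupted targets.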
On LumiChats
When you interact with LumiChats, the underlying model was trained to minimise cross-entropy over its next-token predictions across trillions of examples. Understanding the loss function explains why LLMs can be confidently wrong — cross-entropy optimises probability calibration, not just accuracy.
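At token scale the objective is the same CrossEntropyLoss shown earlier, just over a vocabulary of tokens instead of a handful of classes. A toy next-token step with a made-up five-token vocabulary:

```python
import torch
import torch.nn as nn

vocab = ["the", "cat", "sat", "on", "mat"]           # toy vocabulary (illustrative)
logits = torch.tensor([[2.0, 0.5, 0.1, 0.1, 1.5]])   # model's scores for the next token
target = torch.tensor([4])                           # true next token: "mat"
loss = nn.CrossEntropyLoss()(logits, target)
print(f"Next-token cross-entropy: {loss.item():.4f}")
```

Training nudges the logit for "mat" up and the others down; summed over every position in trillions of tokens, this single scalar is the entire training signal.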