Evaluation metrics quantify model performance beyond simple accuracy. For classification: the confusion matrix breaks predictions into TP, FP, TN and FN; precision measures prediction quality; recall measures coverage; F1 is their harmonic mean; and ROC-AUC measures discrimination ability across all thresholds. For regression: MAE measures average absolute error; MSE penalises large errors; RMSE expresses MSE in the original units; and R² is the proportion of variance explained. Choosing the right metric for your problem is as important as choosing the right algorithm.
Real-life analogy: The medical test
A COVID test with 95% accuracy sounds great — but if only 1% of people have COVID, a model that always predicts 'negative' gets 99% accuracy! Precision answers: 'Of everyone I flagged positive, how many actually had COVID?' Recall answers: 'Of everyone who actually had COVID, how many did I catch?' For medical diagnosis, recall is critical — missing a disease (false negative) is far worse than a false alarm.
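The accuracy paradox above is easy to reproduce. A minimal sketch with scikit-learn, using illustrative numbers (1,000 people, 1% prevalence) rather than real test data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1,000 people, 1% actually positive
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that always predicts negative
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))                    # 0.99 — looks great
print(recall_score(y_true, y_pred))                      # 0.0 — catches no cases
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 — nothing was flagged
```

The 99% accuracy hides the fact that recall is zero: every actual positive is missed.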
Confusion matrix and derived metrics
TP = True Positive (correctly predicted positive). FP = False Positive (predicted positive, actually negative). TN = True Negative (correctly predicted negative). FN = False Negative (predicted negative, actually positive — the dangerous miss in medical diagnosis).
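These four counts are all you need to derive the headline metrics by hand. A small sketch (the example counts are made up for illustration):

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Derive the standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # quality of positive calls
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # coverage of actual positives
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)            # harmonic mean of P and R
    return accuracy, precision, recall, f1

acc, p, r, f1 = metrics_from_counts(tp=40, fp=10, tn=930, fn=20)
print(f"accuracy={acc:.3f} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# accuracy=0.970 precision=0.800 recall=0.667 f1=0.727
```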
Complete classification evaluation with all metrics
from sklearn.metrics import (confusion_matrix, classification_report,
                             precision_score, recall_score, f1_score, accuracy_score,
                             roc_auc_score, roc_curve, average_precision_score)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
# Create imbalanced dataset (90% negative, 10% positive)
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
# Stratify so the rare positive class keeps its proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1] # Probability of positive class
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"Confusion Matrix:")
print(f" TN={tn} FP={fp}")
print(f" FN={fn} TP={tp}")
# Individual metrics
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
# Full classification report (all classes)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative','Positive']))
# For multi-class: macro vs weighted average
# macro: unweighted mean of per-class metrics (treats all classes equally)
# weighted: weighted by support (number of true instances per class)
# micro: aggregate TP/FP/FN across all classes before computing
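# A quick multi-class check of the three averages (toy labels, illustrative only):
y3_true = [0, 0, 0, 0, 1, 1, 2]
y3_pred = [0, 0, 1, 0, 1, 0, 2]
print(f1_score(y3_true, y3_pred, average='macro'))     # 0.750 — each class counts equally
print(f1_score(y3_true, y3_pred, average='weighted'))  # ~0.714 — weighted by support
print(f1_score(y3_true, y3_pred, average='micro'))     # ~0.714 — global TP/FP/FN pool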
# Threshold tuning: find threshold that maximises F1
thresholds = np.linspace(0.1, 0.9, 50)
f1_scores = [f1_score(y_test, (y_proba > t).astype(int)) for t in thresholds]
best_thresh = thresholds[np.argmax(f1_scores)]
print(f"\nOptimal threshold for F1: {best_thresh:.2f}")
print(f"F1 at optimal threshold: {max(f1_scores):.3f}")
ROC Curve and AUC
ROC curve (Receiver Operating Characteristic) plots True Positive Rate (Recall) vs False Positive Rate at every possible classification threshold. AUC (Area Under Curve) summarises the entire ROC curve in one number: 0.5 = random classifier (diagonal line), 1.0 = perfect classifier. AUC measures a model's ability to rank positives above negatives regardless of threshold — threshold-independent evaluation.
ROC curve, AUC, Precision-Recall curve
from sklearn.metrics import roc_curve, auc, precision_recall_curve
# ROC Curve
fpr, tpr, roc_thresh = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
print(f"AUC-ROC: {roc_auc:.4f}")
# fpr[i] = FP Rate at threshold roc_thresh[i]
# tpr[i] = TP Rate (Recall) at threshold roc_thresh[i]
# Precision-Recall curve (better for imbalanced datasets)
precision, recall, pr_thresh = precision_recall_curve(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)
print(f"Average Precision (AUPRC): {avg_precision:.4f}")
# Interpretation:
# AUC-ROC = 0.5: random — model has no discrimination ability
# AUC-ROC = 0.7: acceptable
# AUC-ROC = 0.8: good
# AUC-ROC = 0.9: excellent
# AUC-ROC = 1.0: perfect (likely overfitting!)
# For imbalanced data, use AUPRC (Area Under Precision-Recall Curve)
# AUPRC random baseline = positive class proportion (not 0.5)
# Always compare AUC and AUPRC to understand if your model is truly good
| Metric | Answers | When to prioritise |
|---|---|---|
| Accuracy | What % of all predictions are correct? | Balanced classes, symmetric error costs |
| Precision | Of predicted positives, what % are truly positive? | FP is costly (spam filter — missing ham is worse than missing spam) |
| Recall (Sensitivity) | Of actual positives, what % did we find? | FN is costly (cancer detection — missing a cancer is catastrophic) |
| F1 Score | Harmonic mean of Precision and Recall | Imbalanced classes, need balance of P and R |
| AUC-ROC | How well does model rank positives above negatives? | Comparing models regardless of threshold |
| AUPRC | Precision-Recall area (better for imbalance) | Highly imbalanced datasets (fraud, rare disease) |
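The claim that AUPRC's random baseline equals the positive-class proportion (while AUC-ROC's is 0.5) is easy to check empirically. A quick sketch scoring labels with pure noise (the 10% prevalence and sample size are arbitrary choices):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
n = 20000
y = (rng.random(n) < 0.10).astype(int)  # ~10% positives
scores = rng.random(n)                  # a "classifier" with no signal at all

print(f"AUC-ROC: {roc_auc_score(y, scores):.3f}")           # close to 0.5
print(f"AUPRC:   {average_precision_score(y, scores):.3f}") # close to 0.10, the prevalence
```

This is why an AUPRC of 0.3 can represent a strong model on a 1%-positive dataset, while the same number would be poor on a balanced one.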
Regression metrics
MAE: robust to outliers, same units as y. MSE: penalises large errors quadratically, sensitive to outliers. RMSE: same units as y, more common than MSE for reporting. R²: proportion of variance explained (1.0 is perfect, 0 means no better than predicting the mean, and it can go negative for models worse than that baseline).
All regression metrics comparison
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
y_true = np.array([10, 20, 30, 40, 50, 200]) # Note: 200 is an outlier
y_pred = np.array([12, 18, 28, 42, 52, 150])
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100 # Mean Absolute % Error
print(f"MAE: {mae:.2f} (robust to 200 outlier)")
print(f"MSE: {mse:.2f} (200 outlier dominates: (200-150)^2 = 2500)")
print(f"RMSE: {rmse:.2f} (same units as y)")
print(f"R²: {r2:.3f} (proportion of variance explained)")
print(f"MAPE: {mape:.1f}% (percentage error — scale-independent)")
# When to use each:
# MAE: when outliers exist and should not dominate (house prices, revenue)
# MSE/RMSE: when large errors are especially bad (safety-critical systems)
# R²: understanding how much variance the model explains
# MAPE: when relative % error matters more than absolute (forecasting)
Practice questions
- Model predicts cancer: 95% accuracy on a dataset where 95% are cancer-free. Why is this useless? (Answer: A model predicting "no cancer" for everyone achieves 95% accuracy but 0% recall — it catches zero actual cancer cases. Always report precision, recall, F1, and AUC alongside accuracy for imbalanced classification.)
- Precision = 0.90, Recall = 0.50. Compute F1 score. (Answer: F1 = 2 × (0.90 × 0.50) / (0.90 + 0.50) = 2 × 0.45 / 1.40 = 0.90 / 1.40 ≈ 0.643)
- AUC-ROC = 0.5 means: (Answer: The model is no better than random guessing — it ranks a randomly chosen positive example above a randomly chosen negative example exactly 50% of the time. The ROC curve is the diagonal line.)
- When would you prefer AUPRC over AUC-ROC? (Answer: When the dataset is highly imbalanced. AUC-ROC can be misleadingly optimistic for imbalanced datasets because it includes TN performance (which is large for rare positive class). AUPRC focuses only on the positive class performance.)
- Your regression model has MSE=1000 and MAE=20. What does this suggest? (Answer: MSE >> MAE² suggests the model makes some very large errors (outlier predictions). MSE penalises large errors quadratically so it is inflated by a few big mistakes. If MAE is acceptable but MSE is large, investigate the extreme prediction errors.)
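Two of the answers above involve arithmetic that is quick to verify in code (the error values are illustrative, not from a real model):

```python
import numpy as np

# Q2: F1 from precision 0.90 and recall 0.50
p, r = 0.90, 0.50
f1 = 2 * p * r / (p + r)
print(f"F1 = {f1:.3f}")  # 0.643

# Q5-style situation: a handful of small errors plus one huge miss
errors = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 100.0])
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
print(f"MAE = {mae:.1f}, MSE = {mse:.1f}")  # MAE = 17.5, MSE = 1667.5
print(mse > mae ** 2)  # True: MSE far exceeds MAE², flagging extreme errors
```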
On LumiChats
LumiChats can generate the full evaluation report for any ML model — confusion matrix, all metrics, ROC curve, threshold optimisation — from your predictions. Paste your y_true and y_pred arrays and ask for a complete evaluation analysis.