Evaluation metrics quantify model performance beyond simple accuracy. For classification: the confusion matrix breaks predictions into TP, FP, TN and FN; precision measures prediction quality; recall measures coverage; F1 is their harmonic mean; and ROC-AUC measures discrimination ability across all thresholds. For regression: MAE measures average absolute error; MSE penalises large errors; RMSE expresses MSE in the original units; and R² is the proportion of variance explained. Choosing the right metric for your problem is as important as choosing the right algorithm.
Real-life analogy: The medical test
A COVID test with 95% accuracy sounds great — but if only 1% of people have COVID, a model that always predicts 'negative' gets 99% accuracy! Precision answers: 'Of everyone I flagged positive, how many actually had COVID?' Recall answers: 'Of everyone who actually had COVID, how many did I catch?' For medical diagnosis, recall is critical — missing a disease (false negative) is far worse than a false alarm.
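The accuracy paradox above is easy to reproduce. A minimal sketch with scikit-learn, using illustrative numbers (1,000 people, 1% prevalence) rather than real test data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1,000 people, 1% actually positive
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that always predicts negative
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))                    # 0.99 — looks great
print(recall_score(y_true, y_pred))                      # 0.0 — catches no cases
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 — nothing was flagged
```

The 99% accuracy hides the fact that recall is zero: every actual positive is missed.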
Confusion matrix and derived metrics
TP = True Positive (correctly predicted positive). FP = False Positive (predicted positive, actually negative). TN = True Negative (correctly predicted negative). FN = False Negative (predicted negative, actually positive — the dangerous miss in medical diagnosis).
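These four counts are all you need to derive the headline metrics by hand. A small sketch (the example counts are made up for illustration):

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Derive the standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # quality of positive calls
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # coverage of actual positives
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)            # harmonic mean of P and R
    return accuracy, precision, recall, f1

acc, p, r, f1 = metrics_from_counts(tp=40, fp=10, tn=930, fn=20)
print(f"accuracy={acc:.3f} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# accuracy=0.970 precision=0.800 recall=0.667 f1=0.727
```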
Complete classification evaluation with all metrics
from sklearn.metrics import (confusion_matrix, classification_report,
                             precision_score, recall_score, f1_score, accuracy_score,
                             roc_auc_score, roc_curve, average_precision_score)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
# Create imbalanced dataset (90% negative, 10% positive)
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
# Stratify so the rare positive class keeps its proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1] # Probability of positive class
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"Confusion Matrix:")
print(f" TN={tn} FP={fp}")
print(f" FN={fn} TP={tp}")
# Individual metrics
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
# Full classification report (all classes)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative','Positive']))
# For multi-class: macro vs weighted average
# macro: unweighted mean of per-class metrics (treats all classes equally)
# weighted: weighted by support (number of true instances per class)
# micro: aggregate TP/FP/FN across all classes before computing
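# A quick multi-class check of the three averages (toy labels, illustrative only):
y3_true = [0, 0, 0, 0, 1, 1, 2]
y3_pred = [0, 0, 1, 0, 1, 0, 2]
print(f1_score(y3_true, y3_pred, average='macro'))     # 0.750 — each class counts equally
print(f1_score(y3_true, y3_pred, average='weighted'))  # ~0.714 — weighted by support
print(f1_score(y3_true, y3_pred, average='micro'))     # ~0.714 — global TP/FP/FN pool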
# Threshold tuning: find threshold that maximises F1
thresholds = np.linspace(0.1, 0.9, 50)
f1_scores = [f1_score(y_test, (y_proba > t).astype(int)) for t in thresholds]
best_thresh = thresholds[np.argmax(f1_scores)]
print(f"\nOptimal threshold for F1: {best_thresh:.2f}")
print(f"F1 at optimal threshold: {max(f1_scores):.3f}")
ROC Curve and AUC
ROC curve (Receiver Operating Characteristic) plots True Positive Rate (Recall) vs False Positive Rate at every possible classification threshold. AUC (Area Under Curve) summarises the entire ROC curve in one number: 0.5 = random classifier (diagonal line), 1.0 = perfect classifier. AUC measures a model's ability to rank positives above negatives regardless of threshold — threshold-independent evaluation.
ROC curve, AUC, Precision-Recall curve
from sklearn.metrics import roc_curve, auc, precision_recall_curve
# ROC Curve
fpr, tpr, roc_thresh = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
print(f"AUC-ROC: {roc_auc:.4f}")
# fpr[i] = FP Rate at threshold roc_thresh[i]
# tpr[i] = TP Rate (Recall) at threshold roc_thresh[i]
# Precision-Recall curve (better for imbalanced datasets)
precision, recall, pr_thresh = precision_recall_curve(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)
print(f"Average Precision (AUPRC): {avg_precision:.4f}")
# Interpretation:
# AUC-ROC = 0.5: random — model has no discrimination ability
# AUC-ROC = 0.7: acceptable
# AUC-ROC = 0.8: good
# AUC-ROC = 0.9: excellent
# AUC-ROC = 1.0: perfect (likely overfitting!)
# For imbalanced data, use AUPRC (Area Under Precision-Recall Curve)
# AUPRC random baseline = positive class proportion (not 0.5)
# Always compare AUC and AUPRC to understand if your model is truly good
| Metric | Answers | When to prioritise |
|---|---|---|
| Accuracy | What % of all predictions are correct? | Balanced classes, symmetric error costs |
| Precision | Of predicted positives, what % are truly positive? | FP is costly (spam filter — missing ham is worse than missing spam) |
| Recall (Sensitivity) | Of actual positives, what % did we find? | FN is costly (cancer detection — missing a cancer is catastrophic) |
| F1 Score | Harmonic mean of Precision and Recall | Imbalanced classes, need balance of P and R |
| AUC-ROC | How well does model rank positives above negatives? | Comparing models regardless of threshold |
| AUPRC | Precision-Recall area (better for imbalance) | Highly imbalanced datasets (fraud, rare disease) |
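The claim that AUPRC's random baseline equals the positive-class proportion (while AUC-ROC's is 0.5) is easy to check empirically. A quick sketch scoring labels with pure noise (the 10% prevalence and sample size are arbitrary choices):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
n = 20000
y = (rng.random(n) < 0.10).astype(int)  # ~10% positives
scores = rng.random(n)                  # a "classifier" with no signal at all

print(f"AUC-ROC: {roc_auc_score(y, scores):.3f}")           # close to 0.5
print(f"AUPRC:   {average_precision_score(y, scores):.3f}") # close to 0.10, the prevalence
```

This is why an AUPRC of 0.3 can represent a strong model on a 1%-positive dataset, while the same number would be poor on a balanced one.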
Regression metrics
MAE: robust to outliers, same units as y. MSE: penalises large errors quadratically, sensitive to outliers. RMSE: same units as y, more common than MSE for reporting. R²: proportion of variance explained (1.0 is perfect, 0 means no better than predicting the mean, and it can go negative for models worse than that baseline).
All regression metrics comparison
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
y_true = np.array([10, 20, 30, 40, 50, 200]) # Note: 200 is an outlier
y_pred = np.array([12, 18, 28, 42, 52, 150])
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100 # Mean Absolute % Error
print(f"MAE: {mae:.2f} (robust to 200 outlier)")
print(f"MSE: {mse:.2f} (200 outlier dominates: (200-150)^2 = 2500)")
print(f"RMSE: {rmse:.2f} (same units as y)")
print(f"R²: {r2:.3f} (proportion of variance explained)")
print(f"MAPE: {mape:.1f}% (percentage error — scale-independent)")
# When to use each:
# MAE: when outliers exist and should not dominate (house prices, revenue)
# MSE/RMSE: when large errors are especially bad (safety-critical systems)
# R²: understanding how much variance the model explains
# MAPE: when relative % error matters more than absolute (forecasting)
Practice questions
- Model predicts cancer: 95% accuracy on a dataset where 95% are cancer-free. Why is this useless? (Answer: A model predicting "no cancer" for everyone achieves 95% accuracy but 0% recall — it catches zero actual cancer cases. Always report precision, recall, F1, and AUC alongside accuracy for imbalanced classification.)
- Precision = 0.90, Recall = 0.50. Compute F1 score. (Answer: F1 = 2 × (0.90 × 0.50) / (0.90 + 0.50) = 2 × 0.45 / 1.40 = 0.90 / 1.40 ≈ 0.643)
- AUC-ROC = 0.5 means: (Answer: The model is no better than random guessing — it ranks a randomly chosen positive example above a randomly chosen negative example exactly 50% of the time. The ROC curve is the diagonal line.)
- When would you prefer AUPRC over AUC-ROC? (Answer: When the dataset is highly imbalanced. AUC-ROC can be misleadingly optimistic for imbalanced datasets because it includes TN performance (which is large for rare positive class). AUPRC focuses only on the positive class performance.)
- Your regression model has MSE=1000 and MAE=20. What does this suggest? (Answer: MSE >> MAE² suggests the model makes some very large errors (outlier predictions). MSE penalises large errors quadratically so it is inflated by a few big mistakes. If MAE is acceptable but MSE is large, investigate the extreme prediction errors.)
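Two of the answers above involve arithmetic that is quick to verify in code (the error values are illustrative, not from a real model):

```python
import numpy as np

# Q2: F1 from precision 0.90 and recall 0.50
p, r = 0.90, 0.50
f1 = 2 * p * r / (p + r)
print(f"F1 = {f1:.3f}")  # 0.643

# Q5-style situation: a handful of small errors plus one huge miss
errors = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 100.0])
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
print(f"MAE = {mae:.1f}, MSE = {mse:.1f}")  # MAE = 17.5, MSE = 1667.5
print(mse > mae ** 2)  # True: MSE far exceeds MAE², flagging extreme errors
```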
On LumiChats
LumiChats can generate the full evaluation report for any ML model — confusion matrix, all metrics, ROC curve, threshold optimisation — from your predictions. Paste your y_true and y_pred arrays and ask for a complete evaluation analysis.