Supervised learning is the most common ML paradigm, where a model learns from a dataset of input-output pairs (labeled examples). The model learns to map inputs to outputs by minimizing the difference between its predictions and the correct labels. Examples: classifying emails as spam/not-spam, predicting house prices from features, recognizing handwritten digits.
Classification vs regression
Supervised learning covers two fundamental task types, distinguished by the output type:
| Task | Output type | Loss function | Examples |
|---|---|---|---|
| Binary classification | One of two classes (0 or 1) | Binary cross-entropy | Spam detection, fraud detection |
| Multi-class classification | One of K classes | Categorical cross-entropy | Image recognition (1000 classes), digit recognition |
| Multi-label classification | Any subset of K classes | Binary cross-entropy per label | Emotion detection, topic tagging |
| Regression | Continuous numerical value | MSE or MAE | House price, stock prediction, age estimation |
Categorical cross-entropy loss for multi-class classification: L = −Σ_c y_c log(ŷ_c), where y_c = 1 for the correct class and 0 otherwise, and ŷ_c is the model's predicted probability for class c. Minimizing this maximizes the predicted probability of the correct class.
Mean Squared Error for regression: MSE = (1/n) Σ_i (y_i − ŷ_i)². It penalizes large errors quadratically: a 2× larger error gives 4× the penalty. Use MAE (mean absolute error) instead when outliers should have less influence.
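Both losses can be computed in a few lines of NumPy. A minimal sketch (illustrative, not the optimized implementations found in libraries; the epsilon guard is an assumption to avoid log(0)):

```python
import numpy as np

def categorical_cross_entropy(y_true_onehot, y_prob):
    # Mean of -log(predicted probability of the correct class)
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_prob + eps), axis=1))

def mse(y_true, y_pred):
    # Mean of squared differences
    return np.mean((y_true - y_pred) ** 2)

# Cross-entropy: a confident correct prediction gives a small loss,
# a confident wrong prediction a large one
y_true = np.array([[0, 1, 0]])
print(categorical_cross_entropy(y_true, np.array([[0.1, 0.8, 0.1]])))  # ≈ 0.223
print(categorical_cross_entropy(y_true, np.array([[0.8, 0.1, 0.1]])))  # ≈ 2.303

# MSE: doubling the error quadruples the penalty
print(mse(np.array([0.0]), np.array([1.0])))  # 1.0
print(mse(np.array([0.0]), np.array([2.0])))  # 4.0
```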
Core algorithms
The workhorse supervised learning algorithms and their key math:
Logistic Regression: linear model + sigmoid output, ŷ = σ(w·x + b), where σ(z) = 1/(1 + e^(−z)). Despite the name, it is a binary classifier. w and b are learned by maximizing the log-likelihood (equivalently, minimizing binary cross-entropy).
Ensemble methods (e.g. Random Forest, Gradient Boosting): the prediction is a weighted sum of K base learners, F(x) = Σ_k α_k h_k(x). Each base learner h_k is typically a decision tree: Random Forest averages deep, decorrelated trees, while Gradient Boosting adds shallow trees fitted sequentially to the residual errors.
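A minimal NumPy sketch of both prediction rules (illustrative only; the toy weights w and b stand in for values that training would learn by minimizing binary cross-entropy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(X, w, b):
    # P(y = 1 | x) = sigmoid(w . x + b)
    return sigmoid(X @ w + b)

# Toy parameters, standing in for learned values
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0],   # w.x + b = 2.5  -> high probability
              [0.0, 3.0]])  # w.x + b = -2.5 -> low probability
probs = logistic_predict(X, w, b)
print(probs)

def ensemble_predict(X, learners, weights):
    # Weighted sum of K base learners h_k(x)
    return sum(a * h(X) for a, h in zip(weights, learners))
```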
Comparing key supervised learning algorithms on a classification task
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100),
    "SVM (RBF kernel)": SVC(kernel='rbf', C=1.0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name:25s}: {scores.mean():.3f} ± {scores.std():.3f}")

# Logistic Regression      : 0.854 ± 0.018  ← fast, interpretable, linear
# Random Forest            : 0.907 ± 0.013  ← handles nonlinearity, robust
# Gradient Boosting        : 0.921 ± 0.009  ← usually best on tabular data
# SVM (RBF kernel)         : 0.893 ± 0.014  ← strong with proper tuning
```
Practical rule
For tabular data, try gradient-boosted trees (XGBoost or LightGBM) first: they are consistently among the strongest performers on tabular benchmarks and require minimal preprocessing. For text, images, or audio, use pretrained neural networks. For very small datasets (<500 samples), logistic regression or SVM often beats complex models.
The training / validation / test split
Properly evaluating models requires strict data separation. The golden rule: the test set must never influence any training or model selection decision.
Proper train/val/test split with sklearn
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ── Step 1: split FIRST, preprocess second ──────────────
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp)
# 0.176 of 0.85 ≈ 0.15 of total → final split: 70% train, 15% val, 15% test

# ── Step 2: fit scaler on TRAINING data only ─────────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_val_scaled = scaler.transform(X_val)          # transform only (no fit!)
X_test_scaled = scaler.transform(X_test)        # transform only (no fit!)

# WRONG (data leakage): scaler.fit_transform(X) before splitting
# This leaks test set statistics into training → optimistic estimates
print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
```
| Split | Typical size | Purpose |
|---|---|---|
| Training | 70–80% | Model learns from this. Hyperparameters affect training. |
| Validation | 10–15% | Tune hyperparameters, select model, early stopping. |
| Test | 10–15% | Final unbiased evaluation. Touch only once at the very end. |
Key metrics for evaluation
Accuracy alone is misleading for imbalanced datasets. On a dataset where 0.1% of transactions are fraud, a model that always predicts 'no fraud' scores 99.9% accuracy yet has zero utility. Use task-appropriate metrics:
Precision = TP / (TP + FP): of all predicted positives, how many are correct? Recall = TP / (TP + FN): of all actual positives, how many did we find? Improving one typically degrades the other (the precision-recall tradeoff).
F1 score: harmonic mean of precision and recall, F1 = 2PR / (P + R). F_β generalizes this: F_β = (1 + β²)PR / (β²P + R), where higher β weights recall more (useful when false negatives are more costly).
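These formulas reduce to a few lines once you have the confusion-matrix counts. A small sketch (the counts below are made up for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: fraction of predicted positives that are correct
    precision = tp / (tp + fp)
    # Recall: fraction of actual positives that were found
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 80 true positives, 20 false positives, 40 false negatives
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.80 recall=0.67 f1=0.73
```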
Comprehensive classification report with all key metrics
```python
from sklearn.metrics import (classification_report, roc_auc_score,
                             confusion_matrix, average_precision_score)

# Assume y_test (true labels), a fitted model, and X_test_scaled
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]  # probability of positive class

# Full classification report
print(classification_report(y_test, y_pred))
# Output:
#               precision    recall  f1-score   support
#      class 0       0.93      0.95      0.94       150
#      class 1       0.91      0.87      0.89       100
#     accuracy                           0.92       250

# Additional metrics for imbalanced classes
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average Precision: {average_precision_score(y_test, y_prob):.4f}")
print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred)}")
```
| Metric | Best for | Interpretation |
|---|---|---|
| Accuracy | Balanced datasets | % of all predictions that are correct |
| Precision | When false positives are costly (spam filter) | % of predicted positives that are truly positive |
| Recall | When false negatives are costly (cancer screening) | % of actual positives that are detected |
| F1 | Imbalanced, equal FP/FN cost | Harmonic mean of precision and recall |
| AUC-ROC | Ranking quality, any imbalance | Prob. that model ranks positive above negative (1.0 = perfect) |
| RMSE | Regression | Error in original units; penalizes outliers heavily |
Data quality is more important than algorithm choice
A common misconception: the algorithm drives model quality. In practice, data quality dominates by a wide margin. Clean, representative, correctly-labeled data with a 'good' algorithm consistently beats complex algorithms trained on poor data.
| Data quality issue | Effect | Detection |
|---|---|---|
| Label noise (5% mislabeled) | Significant performance drop, high variance | Cleanlab, manual review of confident wrong predictions |
| Distribution shift (train ≠ test) | Model fails silently in production | Compare feature distributions with KS test or MMD |
| Class imbalance (99:1) | Model ignores minority class | Check per-class metrics, not just accuracy |
| Selection bias | Model learns spurious correlations | Audit data collection process; holdout from different source |
| Target leakage | Artificially inflated test scores | Feature importance audit; temporal split for time data |
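The KS-test check for distribution shift mentioned in the table can be sketched with `scipy.stats.ks_2samp`. The data below is synthetic and the p-value threshold is an illustrative assumption, not a universal rule:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
prod_feature = rng.normal(loc=0.5, scale=1.0, size=5000)  # shifted in production

# Two-sample Kolmogorov-Smirnov test: a small p-value means the two
# samples are unlikely to come from the same distribution
stat, p_value = ks_2samp(train_feature, prod_feature)
print(f"KS statistic={stat:.3f}, p-value={p_value:.2e}")
if p_value < 0.01:  # threshold is a judgment call
    print("Likely distribution shift: investigate before trusting the model")
```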
The 80/20 rule of ML
In production ML projects, ~80% of time is spent on data — collection, cleaning, labeling, feature engineering. The model itself is 20%. Invest heavily in data quality before experimenting with model complexity.
Practice questions
- What is the difference between classification and regression as supervised learning tasks? (Answer: Classification: target Y is categorical — predict which class an example belongs to. Binary (spam/not spam) or multi-class (10 digits). Loss: cross-entropy. Output layer: softmax (multi-class) or sigmoid (binary). Regression: target Y is continuous — predict a real-valued quantity (house price, temperature). Loss: MSE or MAE. Output layer: linear (no activation). Some tasks are borderline: predicting a score (1–5 stars) can be classification or ordinal regression depending on the modelling assumption.)
- What is empirical risk minimisation (ERM) and what are its limitations? (Answer: ERM: minimise the average loss on the training set: θ* = argmin_θ (1/n)Σℓ(f_θ(xᵢ), yᵢ). Simple and computationally tractable. Limitations: (1) Overfitting: minimising training loss ≠ minimising test loss. (2) Distribution shift: training and test distributions may differ. (3) Label noise: ERM directly fits noisy labels. (4) Memorisation: on small datasets, ERM may memorise rather than generalise. Regularisation (add ||θ||² to loss) extends ERM to penalise complexity.)
- What is the train/validation/test split and why is it critical not to use the test set for model selection? (Answer: Train (60–70%): fit model parameters. Validation (15–20%): select hyperparameters, compare models, choose architecture. Test (15–20%): final unbiased estimate of generalisation. If you use the test set to select models (compare multiple models, choose the best), you are leaking test information into model selection — the test set is no longer held-out. The selected model will appear better than it truly is. The test set should be used EXACTLY ONCE — after all development decisions are made.)
- What is k-fold cross-validation and when should you use it instead of a fixed train/test split? (Answer: K-fold CV: split data into k equal parts; train on k-1 parts, test on the remaining part; rotate k times; average k test scores. Use when: dataset is small (<5000 examples) and a fixed test set would have high variance estimates. Provides: lower-variance performance estimate, uses all data for both training and evaluation. Computationally expensive (k× training runs). For large datasets (>100k examples): fixed split is sufficient and cross-validation adds unnecessary compute cost.)
- What is the difference between online learning and batch learning in supervised settings? (Answer: Batch learning: train on the entire training set simultaneously. Model is static after deployment. Common for most ML. Online learning: update the model one example (or mini-batch) at a time as data arrives. Handles non-stationary distributions (model adapts as the world changes). Examples: online gradient descent, Vowpal Wabbit, streaming perceptron. Required for: real-time systems (fraud detection adapting to new fraud patterns), very large datasets that don't fit in memory, systems where training data arrives continuously.)
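The online-learning idea from the last answer can be sketched with scikit-learn's `SGDClassifier` and `partial_fit`. This is a minimal illustration that replays a fixed dataset as a stream, not a production streaming system; batch size and dataset parameters are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
classes = np.unique(y)  # partial_fit needs the full class list up front

model = SGDClassifier(random_state=0)

# Simulate a stream: update the model one mini-batch at a time
batch_size = 100
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    model.partial_fit(X_batch, y_batch, classes=classes)

# Accuracy on the (already seen) data, just to confirm the model learned
print(f"accuracy: {model.score(X, y):.3f}")
```

Unlike `fit`, each `partial_fit` call preserves the existing weights, so the model keeps adapting as new batches arrive.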