Linear Discriminant Analysis (LDA) is both a supervised classification algorithm and a dimensionality reduction technique. It finds a projection of the data that maximises the ratio of between-class variance to within-class variance, maximally separating the classes while keeping each class compact. LDA assumes features are normally distributed with equal covariance matrices across classes (unlike QDA, which allows different covariances). It is tested in GATE DS&AI alongside PCA as the supervised counterpart to unsupervised dimensionality reduction.
Real-life analogy: Shining a flashlight
Imagine two groups of coloured balls scattered in 3D space. You want to find the angle to shine a flashlight so the shadows of the two groups on a wall are as separated as possible. LDA finds exactly this optimal projection direction — where the shadow separation between groups is maximum relative to how spread out each group shadow is.
Fisher criterion — what LDA maximises
Fisher criterion: choose the projection direction w that maximises the ratio of between-class scatter to within-class scatter, J(w) = wᵀS_B w / wᵀS_W w. Here S_B = Σₖ nₖ(μₖ−μ)(μₖ−μ)ᵀ and S_W = Σₖ Σᵢ∈classₖ (xᵢ−μₖ)(xᵢ−μₖ)ᵀ. The optimal w is the eigenvector of S_W⁻¹S_B with the largest eigenvalue.
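The scatter matrices and the optimal direction can be computed directly from their definitions. A minimal NumPy sketch on synthetic two-class data (the class means and spreads below are made up purely for illustration):

```python
import numpy as np

# Two synthetic 2D classes (illustrative values, not from any real dataset)
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))  # class 0
X1 = rng.normal(loc=[3, 2], scale=1.0, size=(50, 2))  # class 1
X = np.vstack([X0, X1])

mu = X.mean(axis=0)                       # overall mean
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)  # class means

# Between-class scatter: S_B = sum_k n_k (mu_k - mu)(mu_k - mu)^T
S_B = (len(X0) * np.outer(mu0 - mu, mu0 - mu)
       + len(X1) * np.outer(mu1 - mu, mu1 - mu))

# Within-class scatter: S_W = sum_k sum_{i in class k} (x_i - mu_k)(x_i - mu_k)^T
S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Optimal w: eigenvector of S_W^{-1} S_B with the largest eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w = eigvecs[:, np.argmax(eigvals.real)].real

# Fisher criterion J(w) = (w^T S_B w) / (w^T S_W w)
J = (w @ S_B @ w) / (w @ S_W @ w)
print("direction w:", w, "Fisher ratio J:", J)
```

Projecting both classes onto w (`X @ w`) collapses the 2D data to a single axis along which the class means are maximally separated relative to the within-class spread.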
LDA for classification and dimensionality reduction
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = load_iris()
X, y = iris.data, iris.target # 4 features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
# ── LDA as classifier ──
lda_clf = LinearDiscriminantAnalysis()
lda_clf.fit(X_train, y_train)
print(f"LDA accuracy: {accuracy_score(y_test, lda_clf.predict(X_test)):.3f}")
# ── LDA as dimensionality reduction (K-1 components for K classes) ──
# 3 classes → max 2 LDA components
lda_2d = LinearDiscriminantAnalysis(n_components=2)
X_train_2d = lda_2d.fit_transform(X_train, y_train)
X_test_2d = lda_2d.transform(X_test)
print(f"Shape: {X_train_2d.shape}") # (120, 2) — 4D → 2D
# Explained variance ratio
print(f"Explained variance: {lda_2d.explained_variance_ratio_}")
# Compare with PCA (unsupervised)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
print(f"PCA explained variance: {pca.explained_variance_ratio_}")
# LDA finds better separating directions than PCA for classification tasks

| Property | LDA | PCA |
|---|---|---|
| Supervision | Supervised (uses class labels) | Unsupervised (no labels needed) |
| Objective | Maximise class separation | Maximise variance |
| Max components | K−1 (K = number of classes) | min(n, p) (n = samples, p = features) |
| Assumption | Gaussian classes, equal covariance | None (linear projections) |
| Best for | Classification + dimensionality reduction | Compression, visualisation, pre-processing |
QDA — Quadratic Discriminant Analysis
LDA assumes all classes share the same covariance matrix (Σ). QDA (Quadratic Discriminant Analysis) relaxes this — each class k has its own covariance matrix Σₖ. This creates quadratic decision boundaries (curved). QDA has more parameters and needs more data. LDA is a special case of QDA when all Σₖ are equal.
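To see the difference in practice, we can fit both models on synthetic data where the two classes have deliberately unequal covariances, violating LDA's shared-Σ assumption (all means and covariances below are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

# Two classes with different covariance matrices (illustrative values)
rng = np.random.default_rng(42)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=300)
X1 = rng.multivariate_normal([2, 2], [[3.0, 1.5], [1.5, 2.0]], size=300)
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 300)

lda = LinearDiscriminantAnalysis().fit(X, y)      # one pooled covariance
qda = QuadraticDiscriminantAnalysis().fit(X, y)   # per-class covariances

print(f"LDA train accuracy: {lda.score(X, y):.3f}")
print(f"QDA train accuracy: {qda.score(X, y):.3f}")
# QDA's per-class covariances give it a curved (quadratic) boundary that
# can track the unequal spreads; LDA is restricted to a straight line.
```

With enough data per class the extra parameters of QDA pay off; with few samples, estimating a separate Σₖ per class is noisy and LDA's pooled estimate often generalises better.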
Practice questions (GATE-style)
- LDA for a 5-class problem can produce at most how many discriminant components? (Answer: K−1 = 4 components. The between-class scatter matrix S_B has rank at most K−1.)
- What assumption does LDA make that QDA does not? (Answer: LDA assumes all classes have the same covariance matrix (Σ). QDA allows each class to have its own covariance matrix Σₖ, leading to quadratic decision boundaries.)
- LDA maximises: (Answer: The Fisher criterion J(w) = wᵀS_B w / wᵀS_W w — the ratio of between-class variance to within-class variance in the projected space.)
- When would you choose PCA over LDA for preprocessing? (Answer: When you have no class labels (unsupervised setting), or when you want to preserve general variance for non-classification tasks like compression or anomaly detection.)
- LDA assumes Gaussian distributions. What happens when this assumption is violated? (Answer: The linear decision boundary may be suboptimal — non-linear classifiers (RBF SVM, neural networks) might outperform LDA. Kernel LDA can handle non-Gaussian data.)
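The K−1 cap from the first question can be checked directly: scikit-learn rejects any `n_components` above min(p, K−1). A quick sketch on iris (3 classes, so at most 2 components):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 3 classes -> at most 2 LDA components
err = None
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)  # exceeds K-1 = 2
except ValueError as e:
    err = e  # sklearn enforces the min(n_features, n_classes - 1) cap
print("Error raised:", err)
```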
On LumiChats
LDA is directly related to the concept of embedding separation in LLMs: fine-tuning objectives often try to increase between-class separation in the embedding space while keeping within-class embeddings compact — the same principle as Fisher's criterion.