Unsupervised learning finds patterns, structures, and relationships in data without using labeled examples. Without being told what the 'right answer' is, algorithms discover natural groupings, reduce dimensionality, or identify anomalies purely from the statistical properties of the data itself.
Clustering
Clustering assigns data points to groups (clusters) based on similarity — without using labels. The goal: intra-cluster similarity high, inter-cluster similarity low. Different algorithms encode different notions of 'cluster':
K-Means objective: minimize total within-cluster sum of squared distances to cluster centroids μ_k. This is NP-hard in general — K-Means uses Lloyd's algorithm (greedy iterative refinement) which converges to a local optimum.
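Lloyd's algorithm is short enough to sketch directly. A minimal NumPy version (the function name `lloyd_kmeans` is ours, not a library API, and empty clusters are not handled):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Greedy iterative refinement: assign points, recompute centroids, repeat.
    Teaching sketch only -- empty clusters are not handled."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged -- to a local optimum, not necessarily the global one
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs recover cleanly
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
labels, centroids = lloyd_kmeans(X, k=2)
```

Because each step can only lower the WCSS objective, the loop always terminates, but the answer depends on initialization — which is why scikit-learn's `n_init` reruns it from several seeds.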
Comparing K-Means, DBSCAN, and Hierarchical Clustering
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs, make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Dataset 1: spherical clusters → K-Means works well
X_blobs, _ = make_blobs(n_samples=500, centers=4, random_state=42)
# Dataset 2: arbitrary shapes → K-Means fails, DBSCAN wins
X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)
# ── K-Means ───────────────────────────────────────────────────────
km = KMeans(n_clusters=4, n_init=10, random_state=42)
km.fit(X_blobs)
print(f"K-Means silhouette (blobs): {silhouette_score(X_blobs, km.labels_):.3f}")
# ── DBSCAN ────────────────────────────────────────────────────────
# eps = neighborhood radius; min_samples = min points to form a core point
db = DBSCAN(eps=0.3, min_samples=10)
db.fit(X_moons)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = list(db.labels_).count(-1)
print(f"DBSCAN clusters found: {n_clusters}, noise points: {n_noise}")
# ── Hierarchical (Agglomerative) ──────────────────────────────────
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
hc.fit(X_blobs)
print(f"Hierarchical silhouette: {silhouette_score(X_blobs, hc.labels_):.3f}")

| Algorithm | Cluster shape | Must specify k? | Handles outliers? | Scale |
|---|---|---|---|---|
| K-Means | Spherical, equal-size | Yes | No (outliers distort centroids) | O(n·k·iter) — fast |
| DBSCAN | Arbitrary | No (uses eps/min_samples) | Yes (marks as noise, label=-1) | O(n·log n) with index |
| Hierarchical | Arbitrary | Yes (post-hoc from dendrogram) | No | O(n²) — slow for large n |
| Gaussian Mixture (GMM) | Ellipsoidal | Yes | No (soft assignments, but outliers still skew the fitted Gaussians) | O(n·k·d²) per EM step |
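The GMM's soft assignments can be seen directly: instead of a hard label, each point gets a posterior probability over clusters. A minimal sketch with scikit-learn's `GaussianMixture` (hyperparameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Fit k Gaussians via EM; 'full' covariance allows elliptical clusters
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
gmm.fit(X)

# Unlike K-Means' hard labels, predict_proba returns P(cluster | x) per point
probs = gmm.predict_proba(X)
print(probs.shape)            # (500, 4)
print(probs.sum(axis=1)[:3])  # each row sums to 1
```

Points deep inside a cluster get probabilities near 1 for that component; points near boundaries split their mass — which is the practical meaning of "soft".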
Dimensionality reduction
High-dimensional data is difficult to visualize, store, and model — the 'curse of dimensionality' makes distances meaningless in very high dimensions. Dimensionality reduction projects data into fewer dimensions while preserving important structure:
PCA: linear projection W is chosen so that the k principal components (columns of W) maximize variance. Equivalent to the top-k eigenvectors of the covariance matrix Σ = X̃^T X̃ / n of the mean-centered data X̃.
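That eigenvector equivalence is easy to verify numerically — note the data must be mean-centered before forming the covariance. The synthetic matrix below is purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features

# Eigendecomposition of the covariance of the *centered* data
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
top2 = eigvecs[:, ::-1][:, :2].T         # top-2 eigenvectors, one per row

pca = PCA(n_components=2).fit(X)
# Principal components match the top eigenvectors up to sign,
# so |dot product| of each matched pair of unit vectors is 1
agreement = [abs(c @ v) for c, v in zip(pca.components_, top2)]
print(agreement)
```

scikit-learn computes the same subspace via SVD rather than an explicit covariance matrix, which is numerically safer, but the directions agree.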
PCA and t-SNE for visualizing high-dimensional embedding spaces
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Example: visualize 1536-dim text embeddings in 2D
# (Replace with your actual embeddings)
embeddings = np.random.randn(500, 1536) # 500 embeddings, dim=1536
labels = np.random.randint(0, 5, 500) # 5 topic clusters
# ── Step 1: Reduce to 50D with PCA (fast pre-processing for t-SNE) ──
pca = PCA(n_components=50)
embeddings_50d = pca.fit_transform(embeddings)
print(f"Variance explained by 50 PCs: {pca.explained_variance_ratio_.sum():.1%}")
# ── Step 2: t-SNE for 2D visualization ──────────────────────────────
tsne = TSNE(n_components=2, perplexity=30, max_iter=1000,  # named n_iter in scikit-learn < 1.5
            random_state=42, init='pca')
embeddings_2d = tsne.fit_transform(embeddings_50d)
# ── Alternative: UMAP (faster, preserves global structure better) ────
# import umap
# reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
# embeddings_2d_umap = reducer.fit_transform(embeddings)
print(f"t-SNE output shape: {embeddings_2d.shape}")  # (500, 2)

| Method | Type | Preserves | Best for | Speed |
|---|---|---|---|---|
| PCA | Linear | Global variance, distances (approx) | Pre-processing, interpretable features | Very fast O(min(n,d)²·max(n,d)) |
| t-SNE | Nonlinear | Local structure (neighborhoods) | 2D/3D visualization | Slow O(n²) or O(n log n) |
| UMAP | Nonlinear | Local + global structure | Visualization + general purpose | Fast O(n^1.14) |
| Autoencoder | Nonlinear (learned) | Task-relevant features | Learned representations, generation | GPU needed for large models |
Anomaly detection
Anomaly detection identifies data points that deviate significantly from the norm — typically without labeled examples of anomalies (since anomalies are rare and hard to collect labels for). The key challenge is defining 'normal' purely from normal data:
Three anomaly detection algorithms compared
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report
# Generate data: mostly normal with 5% anomalies
X_normal, _ = make_blobs(n_samples=950, centers=1, cluster_std=0.5, random_state=42)
X_anomaly = np.random.uniform(-4, 4, size=(50, 2)) # outliers
X = np.vstack([X_normal, X_anomaly])
y_true = np.array([1]*950 + [-1]*50) # 1=normal, -1=anomaly
# ── Isolation Forest ──────────────────────────────────────────────
# Anomalies are easier to isolate → shorter average path length in random trees
iso = IsolationForest(contamination=0.05, random_state=42)
y_iso = iso.fit_predict(X)
print("Isolation Forest:")
print(classification_report(y_true, y_iso, target_names=["anomaly","normal"]))
# ── One-Class SVM ──────────────────────────────────────────────────
oc_svm = OneClassSVM(kernel='rbf', nu=0.05) # nu ≈ fraction of outliers
y_svm = oc_svm.fit_predict(X)
# ── Local Outlier Factor ──────────────────────────────────────────
# Compares local density of a point to its neighbors
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_lof = lof.fit_predict(X)

| Algorithm | Key idea | Best for | Limitation |
|---|---|---|---|
| Isolation Forest | Anomalies are isolated by fewer random splits | High-dimensional tabular data; fast | Struggles with very small anomaly clusters |
| One-Class SVM | Learn boundary around normal data | Low-dimensional, well-defined normal region | Slow at scale; sensitive to hyperparameters |
| Local Outlier Factor (LOF) | Compare local density to neighbors | Clustered data with varying density | Not scalable for >100K points |
| Autoencoder | Normal data reconstructs well; anomalies don't | High-dimensional, complex data (images, sequences) | Needs training data; threshold selection hard |
Autoencoder reconstruction error
Train an autoencoder on normal data only. At inference, compute the reconstruction error ||x - decode(encode(x))||². Anomalies, being out-of-distribution, reconstruct poorly and have high error. Set a threshold on the 95th–99th percentile of training reconstruction errors.
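A minimal sketch of that recipe, using PCA as a stand-in linear "autoencoder" (project = encode, reconstruct = decode) so it runs without a deep-learning framework; the synthetic data and the 99th-percentile threshold are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# "Normal" training data lives near a 3-dimensional subspace of 20-dim space
X_train = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 20))
X_train += 0.01 * rng.normal(size=X_train.shape)

# Train the "autoencoder" on normal data only
ae = PCA(n_components=3).fit(X_train)

def reconstruction_error(model, X):
    # ||x - decode(encode(x))||^2 per sample
    X_hat = model.inverse_transform(model.transform(X))
    return ((X - X_hat) ** 2).sum(axis=1)

# Threshold at the 99th percentile of training reconstruction errors
threshold = np.percentile(reconstruction_error(ae, X_train), 99)

# Out-of-distribution points fall off the learned subspace and reconstruct poorly
X_anom = rng.uniform(-5, 5, size=(50, 20))
is_anomaly = reconstruction_error(ae, X_anom) > threshold
print(is_anomaly.mean())
```

Swapping the PCA for a trained neural autoencoder changes only the encode/decode calls; the thresholding logic is identical.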
Self-supervised learning — bridging supervised and unsupervised
Self-supervised learning generates labels automatically from the structure of unlabeled data — providing the scalability of unsupervised learning with the performance of supervised learning. All modern LLMs are built on self-supervised pretraining objectives:
| Objective | How labels are generated | Model | Domain |
|---|---|---|---|
| Next-token prediction | Predict next word from preceding words | GPT, LLaMA, Claude (base) | Text generation, LLMs |
| Masked Language Modeling | Predict randomly masked tokens from context | BERT, RoBERTa, DeBERTa | Text understanding, embeddings |
| Masked Image Modeling | Reconstruct randomly masked image patches | MAE (Masked Autoencoder), BEiT | Vision, image features |
| Contrastive learning | Same image under different augmentations should be similar | SimCLR, MoCo, CLIP | Vision, multimodal alignment |
| Next frame prediction | Predict next video frame from prior frames | VideoGPT, Sora (partial) | Video understanding |
Why self-supervised learning dominates
Human-labeled datasets are expensive to create and limited in scale. The open web supplies trillions of self-labeled examples for next-token prediction: every token position in every document is a training example. This is why LLMs can scale to 15 trillion training tokens, a volume impossible to reach with human annotation.
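The "self-labeled" part is concrete: every position in a token sequence yields a (context, target) pair with no human annotation. A toy sketch with whitespace tokenization:

```python
# Every prefix of a token sequence predicts the next token --
# the labels come for free from the raw text itself.
tokens = "the cat sat on the mat".split()

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... one training example per token position
```

A six-token sentence yields five supervised examples; scale that to web-sized corpora and the training set is effectively unbounded.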
Generative unsupervised models
Generative models learn the data distribution p(x) and can sample new examples from it. Three major families dominate in 2025:
| Model family | Training objective | Sampling | Quality | Stability | Examples |
|---|---|---|---|---|---|
| VAE | Maximize ELBO = reconstruction - KL(q||p) | Single forward pass (fast) | Blurry for images | Stable | First generative DL models |
| GAN | Adversarial min-max game | Single forward pass (fast) | High for specific domains | Unstable (mode collapse) | StyleGAN3, BigGAN |
| Diffusion | Predict noise at random timestep t | 20–1000 denoising steps | State-of-the-art | Very stable | Stable Diffusion, DALL-E 3, Midjourney |
VAE Evidence Lower Bound (ELBO): maximize reconstruction likelihood while keeping the learned latent distribution q(z|x) close to the prior p(z) = N(0,I). The KL term forces a smooth, continuous latent space.
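For a diagonal Gaussian encoder q(z|x) = N(μ, diag(σ²)) and prior p(z) = N(0, I), that KL term has a closed form, KL = -½ Σ (1 + log σ² - μ² - σ²), which a few lines of NumPy can check (the function name is ours):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ) --
    the regularizer in the VAE ELBO, summed over latent dimensions."""
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=-1)

# KL is exactly zero when q already equals the prior N(0, I)
print(kl_to_standard_normal(np.zeros(8), np.zeros(8)))    # 0.0
# Moving the mean away from 0 increases the penalty: 8 dims x 2 = 16
print(kl_to_standard_normal(np.full(8, 2.0), np.zeros(8)))  # 16.0
```

Working with log σ² rather than σ directly is the standard parameterization: it keeps the variance positive without constraints and makes the exponential in the formula explicit.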
Why diffusion models won
Diffusion models (DDPM, 2020) achieved stable training and diverse, high-quality generation — solving GAN's mode collapse and VAE's blurriness problems simultaneously. By 2023, diffusion models (Stable Diffusion, Midjourney, DALL-E 3) had largely displaced GANs for state-of-the-art image generation. The tradeoff: sampling requires many denoising steps (slower than GAN's single pass).
Practice questions
- What is the difference between clustering, density estimation, and dimensionality reduction as unsupervised learning tasks? (Answer: Clustering: assign each point to a group (cluster) — hard or soft assignment. Reveals grouping structure. K-means, DBSCAN, hierarchical clustering. Density estimation: model the probability distribution P(x) — generate new samples, detect outliers. GMM, KDE, VAE, normalising flows. Dimensionality reduction: find a low-dimensional representation that preserves structure. PCA, t-SNE, UMAP, autoencoders. All are unsupervised: no labels. They address different questions: who belongs together? what is likely? what is the compact representation?)
- What is the elbow method for choosing k in k-means and what are its limitations? (Answer: Elbow method: run k-means for k=1,...,K, plot within-cluster sum of squares (WCSS) vs k. Look for the 'elbow' — where WCSS stops decreasing sharply and starts diminishing gradually. The elbow suggests the optimal k. Limitations: (1) Many real datasets show no sharp elbow — the curve is smooth. (2) WCSS decreases monotonically with k, so larger k always looks better; the plot alone cannot penalize overfitting. (3) k-means assumes spherical clusters of similar size. Better alternatives: silhouette score (measures cluster cohesion and separation); BIC/AIC for model selection; gap statistic, which compares WCSS against a reference (null) distribution.)
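The sweep behind both the elbow plot and the silhouette alternative takes a few lines (synthetic blobs with a true k of 4):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

wcss, sil = [], []
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)   # inertia_ is the WCSS the elbow plot uses
    sil.append(silhouette_score(X, km.labels_))
    print(f"k={k}  WCSS={km.inertia_:.0f}  silhouette={sil[-1]:.3f}")
# WCSS keeps falling as k grows; the silhouette instead gives a peak to pick k by
```

Plotting `wcss` against k reproduces the elbow curve; reading off the k with the highest silhouette avoids the judgment call of locating the elbow by eye.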
- What is the difference between PCA and autoencoders for dimensionality reduction? (Answer: PCA: linear dimensionality reduction. Finds orthogonal directions of maximum variance. Closed-form solution (SVD). Reconstruction = linear transformation. Best for Gaussian-distributed data, when linear structure captures most variance. Autoencoder: non-linear dimensionality reduction via neural encoder-decoder. Can learn complex, curved manifolds that PCA cannot represent. Requires training. Better for complex high-dimensional data (images, text). Trade-off: PCA is interpretable, fast, exact; autoencoders are powerful but black-box, require more data.)
- What is the inductive bias of k-means and why does it fail for non-spherical clusters? (Answer: K-means minimises within-cluster L2 distance — implicitly assumes clusters are spherical (Voronoi cells are hyperplane boundaries — equal covariance). Fails for: elongated clusters (distance to centroid misleadingly large along minor axis), varying-density clusters, non-convex clusters. Alternatives: GMM (allows elliptical clusters via full covariance matrices), DBSCAN (density-based — no shape assumption), spectral clustering (graph Laplacian eigenvectors — handles non-convex shapes).)
- What is contrastive learning and how has it enabled self-supervised pretraining? (Answer: Contrastive learning: learn representations by attracting augmented views of the same input ('positives') and repelling different inputs ('negatives') in embedding space. No labels needed — augmentation (crop, flip, colour jitter) provides positive pairs. Loss: InfoNCE = -log(exp(sim(z_i, z_j)/τ) / Σ_k exp(sim(z_i, z_k)/τ)). SimCLR, MoCo, CLIP use this principle. Achieves near-supervised ImageNet performance. CLIP: uses image-text pairs as positives — one of the most impactful self-supervised methods, enabling zero-shot classification.)
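The InfoNCE loss in that answer is compact enough to implement directly. A NumPy sketch (batch size and dimensions are arbitrary; a production version would subtract the row max before the exponential for numerical stability):

```python
import numpy as np

def info_nce(z_i, z_j, tau=0.1):
    """InfoNCE: row k of z_i and row k of z_j are two views of the same input
    (the positive pair); every other row in the batch serves as a negative."""
    # L2-normalize so the dot product is cosine similarity
    z_i = z_i / np.linalg.norm(z_i, axis=1, keepdims=True)
    z_j = z_j / np.linalg.norm(z_j, axis=1, keepdims=True)
    logits = z_i @ z_j.T / tau  # (N, N) similarity matrix, scaled by temperature
    # Cross-entropy with the diagonal (the positive pair) as the target class
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
# Nearly-matching views (e.g. two light augmentations) score a low loss;
# unrelated embeddings score a high one
aligned_loss = info_nce(z, z + 0.01 * rng.normal(size=(8, 32)))
random_loss = info_nce(z, rng.normal(size=(8, 32)))
print(aligned_loss < random_loss)  # True
```

Minimizing this loss is what pulls positives together and pushes negatives apart; the temperature τ controls how sharply the hardest negatives are weighted.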