t-SNE (t-distributed Stochastic Neighbour Embedding) and UMAP (Uniform Manifold Approximation and Projection) are non-linear dimensionality reduction techniques used to visualise high-dimensional data in 2D or 3D. Unlike PCA (which is linear), they preserve local neighbourhood structure — similar points stay close in the low-dimensional embedding. They are used extensively to visualise word embeddings, image embeddings, single-cell gene expression data, and model representations. UMAP is the modern successor to t-SNE: faster, and better at preserving global structure.
Real-life analogy: Flattening a globe to a map
A globe has 3D structure — continents, distances, topology. Flattening it to a 2D map always distorts something. t-SNE is like a map projection that perfectly preserves local distances (cities near each other stay near) but distorts global distances (Europe and Asia may look closer or farther than reality). UMAP is a better projection that preserves both local and global structure more faithfully.
t-SNE — the classic visualisation method
t-SNE works in two steps: (1) Compute pairwise similarities in the high-dimensional space using a Gaussian kernel — nearby points get high similarity. (2) Place points in 2D space and compute Student-t distribution similarities. Minimise KL divergence between the two distributions using gradient descent. The t-distribution is used in 2D to avoid the 'crowding problem' — it has heavier tails than Gaussian.
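The two steps above can be sketched in plain NumPy. This is a toy illustration of the two similarity distributions and the KL objective, not a full implementation: real t-SNE tunes a per-point Gaussian bandwidth to match the chosen perplexity, and then optimises the 2D positions by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))  # 5 points in 10-D
Y = rng.normal(size=(5, 2))   # candidate 2-D positions for the same points

# Step 1: Gaussian similarities in the high-dimensional space
# (fixed bandwidth here; real t-SNE picks one per point from the perplexity)
d_hi = ((X[:, None] - X[None]) ** 2).sum(-1)
P = np.exp(-d_hi / 2.0)
np.fill_diagonal(P, 0.0)
P /= P.sum()

# Step 2: Student-t similarities (1 degree of freedom) in 2-D.
# Heavier tails than a Gaussian relieve the crowding problem.
d_lo = ((Y[:, None] - Y[None]) ** 2).sum(-1)
Q = 1.0 / (1.0 + d_lo)
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

# Objective: KL(P || Q), minimised by moving the 2-D points
mask = P > 0
kl = float((P[mask] * np.log(P[mask] / Q[mask])).sum())
print(f"KL divergence: {kl:.4f}")
```

Since P and Q are both normalised distributions over point pairs, the KL divergence is always non-negative and reaches zero only when the 2D similarities exactly match the high-dimensional ones.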
t-SNE and UMAP for embedding visualisation
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
import numpy as np
# Load high-dimensional data: scikit-learn's digits dataset
digits = load_digits() # 1797 samples, 64 features (8x8 pixel images)
X, y = digits.data, digits.target
# Step 1: Reduce with PCA first (recommended preprocessing for t-SNE)
# t-SNE is O(n²) in the number of samples, and each pairwise distance costs O(d);
# PCA to ~30 dims speeds it up and removes noisy low-variance dimensions
pca = PCA(n_components=30, random_state=42)
X_pca = pca.fit_transform(X)
print(f"PCA reduced: {X.shape} → {X_pca.shape}")
# Step 2: t-SNE for visualisation
tsne = TSNE(
    n_components=2,
    perplexity=30,         # Roughly the number of close neighbours (try 5-50)
    learning_rate='auto',  # Scales with dataset size (scikit-learn >= 1.2 default)
    max_iter=1000,         # More iterations = better convergence (called n_iter before scikit-learn 1.5)
    random_state=42,
    init='pca'             # PCA initialisation is more stable than random
)
X_tsne = tsne.fit_transform(X_pca)
print(f"t-SNE output: {X_tsne.shape}") # (1797, 2) — 64D → 2D
# Verify clustering: digits 0-9 should form 10 clusters
from sklearn.metrics import silhouette_score
sil = silhouette_score(X_tsne, y)
print(f"Silhouette score in t-SNE space: {sil:.3f}")
# UMAP: faster and better global structure
try:
    import umap  # pip install umap-learn
    reducer = umap.UMAP(
        n_components=2,
        n_neighbors=15,      # Local neighbourhood size (analogous to perplexity)
        min_dist=0.1,        # Minimum distance between points in 2D (0.0-0.99)
        metric='euclidean',  # Distance metric in the original space
        random_state=42
    )
    X_umap = reducer.fit_transform(X)
    print(f"UMAP output: {X_umap.shape}")  # (1797, 2)
    # UMAP can also transform new points (t-SNE cannot!)
    X_new = np.random.randn(10, 64)
    X_new_umap = reducer.transform(X_new)  # Project new points into the fitted embedding
except ImportError:
    print("UMAP not installed: pip install umap-learn")
# Visualise word embeddings from a language model
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2')
# embeddings = model.encode(list_of_sentences) # (n, 384) embeddings
# X_tsne = TSNE(perplexity=15).fit_transform(embeddings) # (n, 2)
t-SNE vs UMAP vs PCA comparison
| Property | PCA | t-SNE | UMAP |
|---|---|---|---|
| Type | Linear projection | Non-linear manifold | Non-linear manifold |
| Speed | Very fast | Slow (O(n²) exact; O(n log n) with Barnes-Hut) | Fast (~O(n log n)) |
| Global structure | Preserved (by variance) | Lost (only local) | Better preserved |
| New points | Yes (transform()) | No (refit required) | Yes (transform()) |
| Deterministic | Yes (up to sign) | No (random init) | Semi (random_state) |
| Hyperparameters | n_components | perplexity, learning_rate | n_neighbors, min_dist |
| Best for | Linear structure, preprocessing | Visualisation, cluster exploration | Visualisation + downstream tasks |
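The "New points" row of the table is easy to verify: scikit-learn's PCA exposes a transform() method, while TSNE deliberately does not. A quick sketch:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))

pca = PCA(n_components=2).fit(X)
X_new = rng.normal(size=(5, 20))
print(pca.transform(X_new).shape)  # (5, 2): PCA projects unseen points

# t-SNE has no transform(); embedding new points requires a full refit
print(hasattr(TSNE(n_components=2), "transform"))  # False
```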
Critical t-SNE misinterpretation pitfalls
NEVER draw these conclusions from t-SNE plots: (1) "Cluster A is bigger than cluster B" — cluster sizes in t-SNE are meaningless. (2) "Cluster A is far from cluster B in 2D therefore they are dissimilar" — global distances are distorted. (3) "There are 5 clusters because I see 5 blobs" — t-SNE can create artificial sub-clusters or merge real clusters depending on perplexity. ONLY local neighbourhood membership is trustworthy.
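Because blob count and shape shift with perplexity, a common sanity check is to rerun t-SNE at several perplexity values and trust only structure that persists across all of them. A minimal sketch (subsampled so it runs quickly):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # subsample for speed

embeddings = {}
for perp in (5, 30, 50):
    embeddings[perp] = TSNE(
        n_components=2, perplexity=perp, random_state=42, init="pca"
    ).fit_transform(X)
    print(f"perplexity={perp}: {embeddings[perp].shape}")
# Plot the three embeddings side by side; clusters that appear at every
# perplexity are more likely to reflect real structure.
```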
Practice questions
- Why is t-SNE not used for dimensionality reduction before ML training? (Answer: t-SNE is non-deterministic, cannot project new/unseen points, is slow O(n²), and distances in the 2D embedding are not interpretable. PCA or UMAP are used for preprocessing because they are deterministic, can transform new points with transform(), and preserve more geometric meaning.)
- What does the perplexity parameter control in t-SNE? (Answer: Perplexity roughly controls the number of close neighbours each point considers. Low perplexity (5): very local structure, many small clusters. High perplexity (50): more global structure, fewer larger clusters. Typical range: 5-50. Try multiple values and compare.)
- UMAP's min_dist parameter: what happens with min_dist=0.0 vs min_dist=0.99? (Answer: min_dist=0.0: points are packed as tightly as possible — very compact clusters, good for cluster separation. min_dist=0.99: points spread out more uniformly — better for seeing overall data topology and continuous structures.)
- Can t-SNE be used to check if class labels are meaningful? (Answer: Yes — if same-class examples cluster together in t-SNE without using the labels, it suggests the features genuinely discriminate between classes. If t-SNE shows no class separation, it suggests the features may not contain enough signal for the classification task.)
- Why is PCA often applied before t-SNE? (Answer: Two reasons: (1) Speed: each of t-SNE's O(n²) pairwise distances costs O(d) to compute, so reducing from hundreds of dimensions to 30-50 with PCA cuts the cost dramatically. (2) Noise removal: PCA discards low-variance dimensions, which are often noise, so t-SNE then operates on the cleaner signal.)
On LumiChats
t-SNE and UMAP are used to visualise how LLMs internally represent knowledge. Researchers plot word embeddings, sentence embeddings, and neuron activations in 2D to understand what the model has learned. LumiChats can write the code to visualise your model representations.
Try it free