
DBSCAN — Density-Based Spatial Clustering

Clustering by density — finds clusters of arbitrary shape and automatically labels low-density points as noise (outliers).


Definition

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together (high density) and marks points in low-density regions as noise (outliers). Unlike K-Means, DBSCAN does not require specifying K in advance, discovers clusters of arbitrary shape, and explicitly identifies outliers. It has only two hyperparameters: epsilon (neighbourhood radius) and min_samples (minimum points to form a dense region). Widely used for geospatial clustering, anomaly detection, and customer segmentation.
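The geospatial use case can be sketched with scikit-learn's haversine metric, which works on (latitude, longitude) pairs in radians; the coordinates below are illustrative, not real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical (latitude, longitude) points in degrees:
# three near Berlin, three near Paris, one isolated in New York
coords_deg = np.array([
    [52.52, 13.40], [52.53, 13.41], [52.51, 13.39],
    [48.85, 2.35],  [48.86, 2.36],  [48.84, 2.34],
    [40.71, -74.01],
])

# The haversine metric expects radians; eps is then a fraction of Earth's radius
coords_rad = np.radians(coords_deg)
earth_radius_km = 6371.0
eps_km = 5.0  # 5 km neighbourhood radius

db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=3, metric="haversine")
labels = db.fit_predict(coords_rad)
print(labels)  # two city clusters plus one noise point (-1)
```

Converting eps from kilometres to radians (dividing by Earth's radius) is what lets a distance threshold in physical units drive the clustering.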

Real-life analogy: The crowded park

Imagine points as people standing in a park. DBSCAN looks at each person: if at least 5 people are within 10 metres (epsilon=10, min_samples=5), they form a 'dense neighbourhood' = a cluster. People who are isolated with fewer than 5 neighbours within 10m are 'noise' = outliers. The shapes of the clusters emerge naturally from the density — they can be L-shaped, ring-shaped, or any other form.

DBSCAN algorithm and point types

DBSCAN classifies each point into one of three types:
  1. Core point: has at least min_samples points within epsilon distance (including itself).
  2. Border point: within epsilon of a core point, but has fewer than min_samples neighbours itself.
  3. Noise point (outlier): not within epsilon of any core point.
Clusters are the connected components of core points, plus their border points.
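The three point types can be recovered from a fitted scikit-learn model via its `core_sample_indices_` and `labels_` attributes; a minimal sketch on synthetic blobs:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.6, random_state=42)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True    # core: >= min_samples neighbours within eps
noise_mask = db.labels_ == -1                # noise: label -1
border_mask = ~core_mask & ~noise_mask       # border: in a cluster but not core

print(f"core: {core_mask.sum()}, border: {border_mask.sum()}, noise: {noise_mask.sum()}")
```

Every point falls into exactly one of the three masks, which mirrors the definition above.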

DBSCAN vs K-Means on non-spherical data

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons, make_circles
from sklearn.metrics import adjusted_rand_score
import numpy as np

# Dataset 1: Two interleaved half-moons (DBSCAN excels, K-Means fails)
X_moons, y_moons = make_moons(n_samples=300, noise=0.08, random_state=42)

# Dataset 2: Concentric circles (another non-spherical case — try the same comparison)
X_circles, y_circles = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)

# DBSCAN — no need to specify K
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels_db = dbscan.fit_predict(X_moons)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
n_noise    = (labels_db == -1).sum()
print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")
# DBSCAN correctly finds 2 moon-shaped clusters!

# K-Means — must specify K=2
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_km = kmeans.fit_predict(X_moons)
print(f"K-Means ARI (moons): {adjusted_rand_score(y_moons, labels_km):.3f}")  # Low!
print(f"DBSCAN  ARI (moons): {adjusted_rand_score(y_moons[labels_db!=-1], labels_db[labels_db!=-1]):.3f}")

# Finding optimal epsilon: k-distance plot
from sklearn.neighbors import NearestNeighbors

# For optimal eps: plot k-th nearest neighbour distances (k=min_samples-1)
# The "elbow" in the plot is the optimal eps
nbrs = NearestNeighbors(n_neighbors=5).fit(X_moons)
distances, _ = nbrs.kneighbors(X_moons)
kth_distances = np.sort(distances[:, -1])  # Distance to 4th nearest neighbour
# Plot kth_distances — choose eps at the "elbow" (sudden jump in distance)
print(f"Suggested eps range: {np.percentile(kth_distances, 90):.3f} - {np.percentile(kth_distances, 95):.3f}")

# Anomaly detection with DBSCAN
X_normal = np.random.randn(200, 2)
X_anomaly = np.random.uniform(-5, 5, (20, 2))  # Outliers scattered uniformly
X_all = np.vstack([X_normal, X_anomaly])

dbscan_ad = DBSCAN(eps=0.5, min_samples=5)
labels_ad = dbscan_ad.fit_predict(X_all)
print(f"Anomalies detected: {(labels_ad == -1).sum()}")  # Should be ~20

DBSCAN hyperparameters and comparison

| Property            | K-Means                    | DBSCAN                                    | Hierarchical                 |
|---------------------|----------------------------|-------------------------------------------|------------------------------|
| Need to specify K?  | Yes — always               | No — discovered automatically             | No — cut dendrogram post-hoc |
| Cluster shape       | Spherical only             | Arbitrary (any shape)                     | Depends on linkage           |
| Handles outliers?   | No — forces into clusters  | Yes — labels as noise (-1)                | Partially                    |
| High dimensions?    | Degrades slowly            | Degrades badly (distances lose meaning)   | Degrades slowly              |
| Hyperparameters     | K                          | epsilon, min_samples                      | linkage type, distance metric|
| Time complexity     | O(nKdT)                    | O(n log n) with spatial index             | O(n² log n)                  |

DBSCAN pitfalls

DBSCAN struggles when:
  1. Clusters have very different densities — a single epsilon cannot fit all of them. HDBSCAN (hierarchical DBSCAN) addresses this.
  2. Dimensionality is high — distances concentrate (curse of dimensionality), making epsilon meaningless.
  3. Datasets are large — O(n²) without a spatial index. Use HDBSCAN or OPTICS as modern alternatives.
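The varying-density pitfall can be demonstrated with scikit-learn's OPTICS, which replaces the single global epsilon with a reachability ordering. A sketch on synthetic data (the two Gaussian groups below are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

rng = np.random.default_rng(42)
dense  = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(150, 2))  # tight cluster
sparse = rng.normal(loc=(5.0, 5.0), scale=1.0, size=(150, 2))  # diffuse cluster
X = np.vstack([dense, sparse])

# A single eps tuned for the dense cluster leaves most of the diffuse one as noise
labels_db = DBSCAN(eps=0.15, min_samples=10).fit_predict(X)
print("DBSCAN noise points:", (labels_db == -1).sum())

# OPTICS orders points by reachability instead of fixing one global eps
labels_opt = OPTICS(min_samples=10).fit_predict(X)
n_clusters = len(set(labels_opt)) - (1 if -1 in labels_opt else 0)
print("OPTICS clusters found:", n_clusters)
```

With eps tuned to the tight cluster, DBSCAN marks nearly all of the diffuse group as noise; OPTICS recovers both groups because its xi-based extraction adapts to local density.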

Practice questions

  1. What does DBSCAN return for a point with label -1? (Answer: It is a noise point — an outlier. It has fewer than min_samples points within epsilon distance and is not a border point of any cluster. DBSCAN explicitly identifies these rather than forcing them into a cluster.)
  2. How do you choose epsilon (eps) for DBSCAN? (Answer: K-distance plot — compute the distance to each point's (min_samples-1)th nearest neighbour, sort these distances, and plot them. Choose eps at the "elbow" — where the curve bends sharply, indicating the transition from dense (cluster) to sparse (noise) regions.)
  3. DBSCAN with eps=0.1 finds 50 clusters. With eps=2.0 it finds 1 cluster. What is happening? (Answer: eps=0.1: very small neighbourhood — most points are isolated, over-fragmented. eps=2.0: very large neighbourhood — all points merge into one cluster. The correct eps is at the "elbow" of the k-distance plot.)
  4. Why can DBSCAN detect anomalies? (Answer: Anomalies are isolated points in low-density regions — they have fewer than min_samples neighbours within epsilon. DBSCAN labels these as noise (-1) rather than assigning them to a cluster. The noise points ARE the detected anomalies.)
  5. Can DBSCAN find ring-shaped clusters? (Answer: Yes — this is one of DBSCAN's key advantages over K-Means. A ring has high density throughout but is not convex. DBSCAN's reachability-based approach connects points along the ring, correctly identifying it as one cluster. K-Means would split it in half.)

On LumiChats

DBSCAN is used in production systems for customer segmentation (clusters emerge naturally), geospatial analysis, and anomaly detection in system logs. LumiChats can help tune DBSCAN parameters and visualise cluster quality using silhouette scores.
