Support Vector Machine (SVM) is a powerful supervised learning algorithm that finds the optimal hyperplane separating two classes by maximising the margin — the gap between the hyperplane and the nearest training points (support vectors). For non-linearly separable data, the kernel trick implicitly maps data to higher dimensions where linear separation is possible. SVMs are effective in high-dimensional spaces, memory-efficient, and robust against overfitting. One of the highest-weightage GATE DS&AI classification topics.
Real-life analogy: The widest road between two neighbourhoods
Imagine two housing estates with a road between them. You want to build the widest possible road that still clearly separates the two estates. The edges of the road touch the nearest houses from each estate — these are the support vectors. SVM finds exactly this: the widest margin hyperplane. A wider road means drivers (new data points) can be misclassified less easily — better generalisation.
Hard margin SVM — linearly separable case
SVM primal form: maximise 2/||w|| (or equivalently minimise (1/2)||w||²) subject to all points being correctly classified with margin ≥ 1. The hyperplane is wᵀx + b = 0; support vectors satisfy yᵢ(wᵀxᵢ + b) = 1 exactly.
Support vectors are the data points that lie exactly on the margin boundaries. All other points — even if you remove them — do not change the decision boundary. This is a key property: the SVM decision is determined by a small subset of training examples.
Soft margin SVM — the C parameter
Soft margin SVM allows some misclassifications (slack variables ξᵢ ≥ 0). C controls the trade-off: large C = hard margin (low tolerance for misclassification, may overfit). Small C = wide margin (allows more misclassifications, better generalisation).
| C value | Margin width | Training errors | Risk |
|---|---|---|---|
| Very large (e.g., 1000) | Narrow | Very few | Overfitting — memorises training data |
| Medium (e.g., 1) | Moderate | Some allowed | Good generalisation — typical default |
| Very small (e.g., 0.001) | Wide | Many allowed | Underfitting — misclassifies training data |
Kernel trick — non-linear SVM
Most real-world data is not linearly separable. The kernel trick computes the dot product in a high-dimensional feature space without explicitly transforming the data — saving enormous computation.
| Kernel | Formula | When to use |
|---|---|---|
| Linear | K(x,x') = xᵀx' | Already linearly separable; high-dimensional text data |
| Polynomial | K(x,x') = (xᵀx' + c)^d | Polynomial decision boundaries; image recognition |
| RBF (Gaussian) | K(x,x') = exp(−γ||x−x'||²) | General-purpose; default choice; creates circular boundaries |
| Sigmoid | K(x,x') = tanh(αxᵀx' + c) | Similar to neural network; rarely used in practice |
SVM with RBF kernel and hyperparameter tuning
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
# Non-linearly separable data (two interleaved half-moons)
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Grid search over C and gamma
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': [0.001, 0.01, 0.1, 1],
'kernel': ['rbf']
}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
print(classification_report(y_test, grid.predict(X_test)))
# Visualise decision boundary (key insight)
best_svm = grid.best_estimator_
support_vecs = best_svm.support_vectors_
print(f"Number of support vectors: {len(support_vecs)}")GATE insight: SVM vs Logistic Regression
Both SVM and logistic regression find linear decision boundaries. Key differences: SVM only cares about support vectors (the margin); logistic regression uses ALL data points in the loss. SVM output is a class label (hard decision); logistic regression outputs probabilities. SVM with kernel handles non-linear boundaries; logistic regression requires feature engineering for non-linearity. SVM scales better to high-dimensional data (text, genomics).
Practice questions (GATE-style)
- What are support vectors in SVM? (Answer: The training data points closest to the decision boundary — they lie on the margin boundaries yᵢ(wᵀxᵢ + b) = 1. The decision boundary depends only on these points; removing all other training points leaves the boundary unchanged.)
- SVM with C=∞ is equivalent to: (Answer: Hard margin SVM — zero tolerance for misclassification. Works only when data is linearly separable; fails with noisy or overlapping data.)
- The RBF kernel parameter γ controls: (Answer: The influence radius of each support vector. High γ = small radius, complex wiggly boundary (overfitting). Low γ = large radius, smooth boundary (underfitting). Must be tuned via cross-validation.)
- Why must features be scaled before using SVM? (Answer: The margin 2/||w|| depends on the scale of w, which depends on feature magnitudes. Unscaled features with large ranges dominate the margin calculation, giving those features too much influence.)
- For a dataset with 1000 features and 500 samples, which kernel SVM would you prefer? (Answer: Linear kernel SVM — in high-dimensional spaces, data is often already linearly separable. RBF kernel tends to overfit with p >> n. Linear SVM also scales better.)
On LumiChats
SVM intuition (maximising margin, kernel trick) directly connects to modern deep learning: neural network training also involves finding good decision boundaries via gradient descent. Kernels inspired attention mechanisms — transformers can be viewed as kernel machines over token sequences.
Try it free