What is soft margin SVM — the C parameter?

Support Vector Machine (SVM): Soft margin SVM — the C parameter. Learn more in the LumiChats AI Glossary at https://lumichats.com/glossary/svm-algorithm

What is practice questions (GATE-style)?

Support Vector Machine (SVM): Practice questions (GATE-style). Learn more in the LumiChats AI Glossary at https://lumichats.com/glossary/svm-algorithm

Support Vector Machine (SVM)

Support Vector Machine (SVM) is a powerful supervised learning algorithm that finds the optimal hyperplane separating two classes by maximizing the margin — the gap between the hyperplane and the nearest training points (support vectors). For non-linearly separable data, the kernel trick implicitly maps data to higher dimensions where linear separation is possible. SVMs are effective in high-dimensional spaces, memory-efficient, and robust against overfitting. One of the highest-weightage GATE DS&AI classification topics.

Finding the widest possible boundary between classes — with or without kernels.

Category: Machine Learning

Real-life analogy: The widest road between two neighborhoods

Imagine two housing estates with a road between them. You want to build the widest possible road that still clearly separates the two estates. The edges of the road touch the nearest houses from each estate — these are the support vectors. SVM finds exactly this: the widest margin hyperplane. A wider road means drivers (new data points) can be misclassified less easily — better generalization.

Hard margin SVM — linearly separable case

\text{Maximize margin} = \frac{2}{\|\mathbf{w}\|} \quad \text{subject to} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \; \forall i

Support vectors are the data points that lie exactly on the margin boundaries. All other points — even if you remove them — do not change the decision boundary. This is a key property: the SVM decision is determined by a small subset of training examples.

Soft margin SVM — the C parameter

\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \; \xi_i \geq 0

C value	Margin width	Training errors	Risk
Very large (e.g., 1000)	Narrow	Very few	Overfitting — memorizes training data
Medium (e.g., 1)	Moderate	Some allowed	Good generalization — typical default
Very small (e.g., 0.001)	Wide	Many allowed	Underfitting — misclassifies training data

Kernel trick — non-linear SVM

Most real-world data is not linearly separable. The kernel trick computes the dot product in a high-dimensional feature space without explicitly transforming the data — saving enormous computation.

Kernel	Formula	When to use
Linear	K(x,x') = xᵀx'	Already linearly separable; high-dimensional text data
Polynomial	K(x,x') = (xᵀx' + c)^d	Polynomial decision boundaries; image recognition
RBF (Gaussian)	K(x,x') = exp(−γ\|\|x−x'\|\|²)	General-purpose; default choice; creates circular boundaries
Sigmoid	K(x,x') = tanh(αxᵀx' + c)	Similar to neural network; rarely used in practice

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Non-linearly separable data (two interleaved half-moons)
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Grid search over C and gamma
param_grid = {
    'C':     [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
print(classification_report(y_test, grid.predict(X_test)))

# Visualise decision boundary (key insight)
best_svm = grid.best_estimator_
support_vecs = best_svm.support_vectors_
print(f"Number of support vectors: {len(support_vecs)}")

GATE insight: SVM vs Logistic Regression: Both SVM and logistic regression find linear decision boundaries. Key differences: SVM only cares about support vectors (the margin); logistic regression uses ALL data points in the loss. SVM output is a class label (hard decision); logistic regression outputs probabilities. SVM with kernel handles non-linear boundaries; logistic regression requires feature engineering for non-linearity. SVM scales better to high-dimensional data (text, genomics).

Practice questions (GATE-style)

What are support vectors in SVM? (Answer: The training data points closest to the decision boundary — they lie on the margin boundaries yᵢ(wᵀxᵢ + b) = 1. The decision boundary depends only on these points; removing all other training points leaves the boundary unchanged.)
SVM with C=∞ is equivalent to: (Answer: Hard margin SVM — zero tolerance for misclassification. Works only when data is linearly separable; fails with noisy or overlapping data.)
The RBF kernel parameter γ controls: (Answer: The influence radius of each support vector. High γ = small radius, complex wiggly boundary (overfitting). Low γ = large radius, smooth boundary (underfitting). Must be tuned via cross-validation.)
Why must features be scaled before using SVM? (Answer: The margin 2/||w|| depends on the scale of w, which depends on feature magnitudes. Unscaled features with large ranges dominate the margin calculation, giving those features too much influence.)
For a dataset with 1000 features and 500 samples, which kernel SVM would you prefer? (Answer: Linear kernel SVM — in high-dimensional spaces, data is often already linearly separable. RBF kernel tends to overfit with p >> n. Linear SVM also scales better.)

SVM intuition (maximizing margin, kernel trick) directly connects to modern deep learning: neural network training also involves finding good decision boundaries via gradient descent. Kernels inspired attention mechanisms — transformers can be viewed as kernel machines over token sequences.

Definition

Real-life analogy: The widest road between two neighborhoods

Hard margin SVM — linearly separable case

SVM primal form: maximize 2/||w|| (or equivalently minimize (1/2)||w||²) subject to all points being correctly classified with margin ≥ 1. The hyperplane is wᵀx + b = 0; support vectors satisfy yᵢ(wᵀxᵢ + b) = 1 exactly.

Soft margin SVM — the C parameter

Soft margin SVM allows some misclassifications (slack variables ξᵢ ≥ 0). C controls the trade-off: large C = hard margin (low tolerance for misclassification, may overfit). Small C = wide margin (allows more misclassifications, better generalization).

C value	Margin width	Training errors	Risk
Very large (e.g., 1000)	Narrow	Very few	Overfitting — memorizes training data
Medium (e.g., 1)	Moderate	Some allowed	Good generalization — typical default
Very small (e.g., 0.001)	Wide	Many allowed	Underfitting — misclassifies training data

Kernel trick — non-linear SVM

Kernel	Formula	When to use
Linear	K(x,x') = xᵀx'	Already linearly separable; high-dimensional text data
Polynomial	K(x,x') = (xᵀx' + c)^d	Polynomial decision boundaries; image recognition
RBF (Gaussian)	K(x,x') = exp(−γ\|\|x−x'\|\|²)	General-purpose; default choice; creates circular boundaries
Sigmoid	K(x,x') = tanh(αxᵀx' + c)	Similar to neural network; rarely used in practice

SVM with RBF kernel and hyperparameter tuning

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Non-linearly separable data (two interleaved half-moons)
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Grid search over C and gamma
param_grid = {
    'C':     [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
print(classification_report(y_test, grid.predict(X_test)))

# Visualise decision boundary (key insight)
best_svm = grid.best_estimator_
support_vecs = best_svm.support_vectors_
print(f"Number of support vectors: {len(support_vecs)}")

GATE insight: SVM vs Logistic Regression

Both SVM and logistic regression find linear decision boundaries. Key differences: SVM only cares about support vectors (the margin); logistic regression uses ALL data points in the loss. SVM output is a class label (hard decision); logistic regression outputs probabilities. SVM with kernel handles non-linear boundaries; logistic regression requires feature engineering for non-linearity. SVM scales better to high-dimensional data (text, genomics).

Practice questions (GATE-style)

What are support vectors in SVM? (Answer: The training data points closest to the decision boundary — they lie on the margin boundaries yᵢ(wᵀxᵢ + b) = 1. The decision boundary depends only on these points; removing all other training points leaves the boundary unchanged.)
SVM with C=∞ is equivalent to: (Answer: Hard margin SVM — zero tolerance for misclassification. Works only when data is linearly separable; fails with noisy or overlapping data.)
The RBF kernel parameter γ controls: (Answer: The influence radius of each support vector. High γ = small radius, complex wiggly boundary (overfitting). Low γ = large radius, smooth boundary (underfitting). Must be tuned via cross-validation.)
Why must features be scaled before using SVM? (Answer: The margin 2/||w|| depends on the scale of w, which depends on feature magnitudes. Unscaled features with large ranges dominate the margin calculation, giving those features too much influence.)
For a dataset with 1000 features and 500 samples, which kernel SVM would you prefer? (Answer: Linear kernel SVM — in high-dimensional spaces, data is often already linearly separable. RBF kernel tends to overfit with p >> n. Linear SVM also scales better.)

On LumiChats

Try it free

Support Vector Machine (SVM)

Real-life analogy: The widest road between two neighborhoods

Hard margin SVM — linearly separable case

Soft margin SVM — the C parameter

Kernel trick — non-linear SVM

Practice questions (GATE-style)

Support Vector Machine (SVM)

Real-life analogy: The widest road between two neighborhoods

Hard margin SVM — linearly separable case

Soft margin SVM — the C parameter

Kernel trick — non-linear SVM

Practice questions (GATE-style)

Practice what you just learned

Related Terms