
Ridge Regression & Lasso — L2 and L1 Regularisation

Preventing overfitting by adding a penalty that shrinks large weights.


Definition

Ridge regression (L2 regularisation) and Lasso (L1 regularisation) extend linear regression by adding a penalty term to the loss function to prevent overfitting. Ridge shrinks all coefficients toward zero without setting any exactly to zero; Lasso can set coefficients to exactly zero, performing automatic feature selection. Both mitigate the multicollinearity problem that destabilises ordinary OLS. Regularisation is a high-weightage GATE DS&AI topic — typically worth 2–3 marks every year.

Real-life analogy: The strict budget

Imagine you are hiring a team and have a fixed salary budget. OLS (no regularisation) can pay some employees extremely high salaries. Ridge says: 'Keep the total salary bill reasonable — no single person can get too much.' Lasso says: 'Keep the total bill under a fixed limit — and fire people who are not contributing enough (their salary goes to zero).' Regularisation is exactly this budget constraint on the model weights.

Ridge regression — L2 penalty

Ridge objective: minimise ‖y − Xβ‖² + λΣβⱼ². λ (lambda) controls regularisation strength: λ = 0 recovers ordinary OLS, and as λ → ∞ all βⱼ → 0. The L2 penalty shrinks coefficients but never sets them exactly to zero.

Ridge closed-form solution: β̂ = (XᵀX + λI)⁻¹Xᵀy. Adding λI to XᵀX makes the matrix invertible even when XᵀX is singular (multicollinearity). This is why Ridge is preferred when features are highly correlated.
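To make the closed form concrete, here is a minimal NumPy sketch on made-up data, checking that (XᵀX + λI)⁻¹Xᵀy matches scikit-learn's Ridge when the intercept is disabled (the data and λ here are illustrative only):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 1] + 1e-6 * rng.normal(size=100)  # nearly collinear columns
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)

lam = 1.0
# Closed form: solve (XᵀX + λI) β = Xᵀy rather than inverting explicitly
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn minimises the same objective when fit_intercept=False
model = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta, model.coef_))  # → True
```

Note that even though two columns are nearly collinear (XᵀX nearly singular), the λI term keeps the system well-conditioned.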

Ridge vs Lasso comparison with cross-validated lambda selection

import numpy as np
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# Generate data with 20 features, only 5 truly relevant
X, y, true_coef = make_regression(
    n_samples=200, n_features=20, n_informative=5,
    noise=20, coef=True, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Cross-validated lambda selection
lambdas = np.logspace(-3, 3, 50)

ridge_cv = RidgeCV(alphas=lambdas, cv=5)
lasso_cv = LassoCV(alphas=lambdas, cv=5, max_iter=5000)
ridge_cv.fit(X_train, y_train)
lasso_cv.fit(X_train, y_train)

print(f"Best Ridge λ: {ridge_cv.alpha_:.4f}")
print(f"Best Lasso λ: {lasso_cv.alpha_:.4f}")
print(f"Ridge R² test: {r2_score(y_test, ridge_cv.predict(X_test)):.3f}")
print(f"Lasso R² test: {r2_score(y_test, lasso_cv.predict(X_test)):.3f}")

# Lasso zeroes out irrelevant features!
non_zero = np.sum(lasso_cv.coef_ != 0)
print(f"Lasso non-zero coefficients: {non_zero}/20")  # close to the 5 informative features

Lasso — L1 penalty and feature selection

Lasso objective: minimise ‖y − Xβ‖² + λΣ|βⱼ|, using the L1 norm (absolute values). Unlike Ridge, Lasso produces sparse solutions — some βⱼ = 0 exactly. Geometrically, the L1 constraint set is a diamond; the MSE contours typically touch it at a corner, where one or more coordinates are zero.
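There is no closed form for Lasso in general, but in the special case of an orthonormal design the Lasso solution is the OLS solution passed through a soft-thresholding operator — a minimal sketch with illustrative values:

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each value toward zero by lam; values in [-lam, lam] become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

ols_coefs = np.array([3.0, 0.4, -2.0, 0.1])
print(soft_threshold(ols_coefs, 0.5))  # → [ 2.5  0.  -1.5  0. ]
```

Small OLS coefficients (0.4 and 0.1 here) fall inside the threshold and are zeroed exactly — this is the mechanism behind Lasso's sparsity. Ridge's analogue merely rescales every coefficient, so none reach zero.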

| Property | OLS (no reg) | Ridge (L2) | Lasso (L1) | Elastic Net (L1+L2) |
| --- | --- | --- | --- | --- |
| Penalty | None | λΣβⱼ² | λΣ\|βⱼ\| | λ₁Σ\|βⱼ\| + λ₂Σβⱼ² |
| Coefficients = 0? | No (only by chance) | No (shrinks to near 0) | Yes (sparse) | Yes (sparser than Ridge) |
| Feature selection | No | No | Yes (automatic) | Yes |
| Handles multicollinearity | No (OLS unstable) | Yes (best) | Picks one arbitrarily | Yes |
| Closed-form solution | Yes | Yes | No (coordinate descent) | No |
| Best for | Few features, little correlation | Many correlated features | Sparse true model | Correlated + sparse model |
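Elastic Net, from the last column of the comparison, mixes both penalties; a brief sketch of cross-validating its mixing parameter l1_ratio (reusing the dataset settings from the earlier example — the candidate ratios are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=20, random_state=42)

# l1_ratio mixes the penalties: 1.0 → pure Lasso, 0.0 → pure Ridge
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=5, max_iter=5000)
enet.fit(X, y)

print(f"Chosen l1_ratio: {enet.l1_ratio_}")
print(f"Non-zero coefficients: {np.sum(enet.coef_ != 0)}/20")
```

Because some L1 weight is present, Elastic Net still produces exact zeros, while the L2 component stabilises groups of correlated features.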

GATE key: why L1 gives sparsity but L2 does not

Geometrically: the L2 constraint region is a smooth sphere — MSE contours touch it at a non-corner point, so no coefficient is forced to exactly zero. The L1 constraint region is a diamond (in 2D) with corners on the axes — MSE contours almost always touch the diamond at a corner where one or more β = 0. This is the fundamental geometric reason Lasso does feature selection but Ridge does not.
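The geometric claim is easy to check numerically: at a comparable penalty strength, Ridge leaves every coefficient non-zero while Lasso zeroes several. A sketch on synthetic data (the exact zero count depends on the data and λ):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # → 0
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # several exact zeros
```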

Practice questions (GATE-style)

  1. As λ increases in Ridge regression, what happens to the model bias and variance? (Answer: Bias increases (model is constrained to smaller weights), variance decreases (weights are more stable across different training sets). This is the bias-variance trade-off controlled by λ.)
  2. Why does Ridge regression outperform OLS when features are highly correlated? (Answer: OLS requires (XᵀX)⁻¹ to exist. With multicollinearity, XᵀX is near-singular and the inverse is unstable. Ridge adds λI making (XᵀX + λI) always invertible.)
  3. You have 1000 features but suspect only 20 are relevant. Which regularisation should you use? (Answer: Lasso — it performs automatic feature selection by setting irrelevant feature coefficients to exactly zero.)
  4. Ridge regression with λ=0 is equivalent to: (Answer: Ordinary Least Squares (OLS) — no penalty is applied.)
  5. In the Lasso solution, coefficient β₃ = 0. What does this mean? (Answer: Feature 3 is not contributing to the prediction — Lasso has effectively removed it from the model. This is automatic feature selection.)
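The trade-off in question 1 can also be seen numerically: the norm of the Ridge coefficient vector shrinks monotonically as λ grows. A sketch on synthetic data (λ grid chosen for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

norms = [np.linalg.norm(Ridge(alpha=lam).fit(X, y).coef_)
         for lam in (0.01, 1, 100, 10_000)]

# Larger λ → smaller weights: higher bias, lower variance
print([round(n, 2) for n in norms])  # strictly decreasing
```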

On LumiChats

Understanding Ridge and Lasso regularisation helps you explain why LLMs like GPT and Claude use weight decay (L2 regularisation) during training — it is the same principle applied to billions of parameters instead of a few regression weights.
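As an illustration of that connection, one gradient step with L2 weight decay simply adds λw to the gradient, pulling every weight toward zero — a toy sketch with made-up numbers, not any real training loop:

```python
import numpy as np

w = np.array([1.0, -2.0])    # current weights
grad = np.array([0.1, 0.1])  # gradient of the data loss
eta, lam = 0.1, 0.01         # learning rate, weight-decay strength

# w ← w − η(∇L + λw): the λw term is the L2 penalty's gradient
w = w - eta * (grad + lam * w)
print(w)  # → [ 0.989 -2.008]
```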

