
Ridge Regression & Lasso — L2 and L1 Regularisation

Preventing overfitting by adding a penalty that shrinks large weights.


Definition

Ridge regression (L2 regularisation) and Lasso (L1 regularisation) extend linear regression by adding a penalty term to the loss function to prevent overfitting. Ridge shrinks all coefficients toward zero without setting any exactly to zero; Lasso can set coefficients to exactly zero, performing automatic feature selection. Both mitigate the multicollinearity problem that destabilises ordinary OLS. Regularisation is a high-weightage GATE DS&AI topic — typically worth 2–3 marks every year.

Real-life analogy: The strict budget

Imagine you are hiring a team and have a fixed salary budget. OLS (no regularisation) can pay some employees extremely high salaries. Ridge says: 'Keep the total salary bill reasonable — no single person can get too much.' Lasso says: 'Keep the total bill under a fixed limit — and fire people who are not contributing enough (their salary goes to zero).' Regularisation is exactly this budget constraint on the model weights.

Ridge regression — L2 penalty

Ridge objective: minimise ‖y − Xβ‖² + λΣβⱼ². λ (lambda) controls regularisation strength: λ = 0 recovers ordinary OLS, and as λ → ∞ all βⱼ → 0. The L2 penalty shrinks coefficients but never sets them exactly to zero.

Ridge closed-form solution: β̂ = (XᵀX + λI)⁻¹Xᵀy. Adding λI to XᵀX makes the matrix invertible even when XᵀX is singular (multicollinearity). This is why Ridge is preferred when features are highly correlated.
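To make the closed form concrete, here is a minimal NumPy sketch on made-up data, checking that (XᵀX + λI)⁻¹Xᵀy matches scikit-learn's Ridge when the intercept is disabled (the data and λ here are illustrative only):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 1] + 1e-6 * rng.normal(size=100)  # nearly collinear columns
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)

lam = 1.0
# Closed form: solve (XᵀX + λI) β = Xᵀy rather than inverting explicitly
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn minimises the same objective when fit_intercept=False
model = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta, model.coef_))  # → True
```

Note that even though two columns are nearly collinear (XᵀX nearly singular), the λI term keeps the system well-conditioned.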

Ridge vs Lasso comparison with cross-validated lambda selection

import numpy as np
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# Generate data with 20 features, only 5 truly relevant
X, y, true_coef = make_regression(
    n_samples=200, n_features=20, n_informative=5,
    noise=20, coef=True, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Cross-validated lambda selection
lambdas = np.logspace(-3, 3, 50)

ridge_cv = RidgeCV(alphas=lambdas, cv=5)
lasso_cv = LassoCV(alphas=lambdas, cv=5, max_iter=5000)
ridge_cv.fit(X_train, y_train)
lasso_cv.fit(X_train, y_train)

print(f"Best Ridge λ: {ridge_cv.alpha_:.4f}")
print(f"Best Lasso λ: {lasso_cv.alpha_:.4f}")
print(f"Ridge R² test: {r2_score(y_test, ridge_cv.predict(X_test)):.3f}")
print(f"Lasso R² test: {r2_score(y_test, lasso_cv.predict(X_test)):.3f}")

# Lasso zeroes out irrelevant features!
non_zero = np.sum(lasso_cv.coef_ != 0)
print(f"Lasso non-zero coefficients: {non_zero}/20")  # close to the 5 informative features

Lasso — L1 penalty and feature selection

Lasso objective: minimise ‖y − Xβ‖² + λΣ|βⱼ|, using the L1 norm (absolute values). Unlike Ridge, Lasso produces sparse solutions — some βⱼ = 0 exactly. Geometrically, the L1 constraint set is a diamond; the MSE contours typically touch it at a corner, where one or more coordinates are zero.
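There is no closed form for Lasso in general, but in the special case of an orthonormal design the Lasso solution is the OLS solution passed through a soft-thresholding operator — a minimal sketch with illustrative values:

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each value toward zero by lam; values in [-lam, lam] become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

ols_coefs = np.array([3.0, 0.4, -2.0, 0.1])
print(soft_threshold(ols_coefs, 0.5))  # → [ 2.5  0.  -1.5  0. ]
```

Small OLS coefficients (0.4 and 0.1 here) fall inside the threshold and are zeroed exactly — this is the mechanism behind Lasso's sparsity. Ridge's analogue merely rescales every coefficient, so none reach zero.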

| Property | OLS (no reg) | Ridge (L2) | Lasso (L1) | Elastic Net (L1+L2) |
| --- | --- | --- | --- | --- |
| Penalty | None | λΣβⱼ² | λΣ\|βⱼ\| | λ₁Σ\|βⱼ\| + λ₂Σβⱼ² |
| Coefficients = 0? | No (only by chance) | No (shrinks to near 0) | Yes (sparse) | Yes (sparser than Ridge) |
| Feature selection | No | No | Yes (automatic) | Yes |
| Handles multicollinearity | No (OLS unstable) | Yes (best) | Picks one arbitrarily | Yes |
| Closed-form solution | Yes | Yes | No (coordinate descent) | No |
| Best for | Few features, little correlation | Many correlated features | Sparse true model | Correlated + sparse model |
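Elastic Net, from the last column of the comparison, mixes both penalties; a brief sketch of cross-validating its mixing parameter l1_ratio (reusing the dataset settings from the earlier example — the candidate ratios are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=20, random_state=42)

# l1_ratio mixes the penalties: 1.0 → pure Lasso, 0.0 → pure Ridge
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=5, max_iter=5000)
enet.fit(X, y)

print(f"Chosen l1_ratio: {enet.l1_ratio_}")
print(f"Non-zero coefficients: {np.sum(enet.coef_ != 0)}/20")
```

Because some L1 weight is present, Elastic Net still produces exact zeros, while the L2 component stabilises groups of correlated features.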

GATE key: why L1 gives sparsity but L2 does not

Geometrically: the L2 constraint region is a smooth sphere — MSE contours touch it at a non-corner point, so no coefficient is forced to exactly zero. The L1 constraint region is a diamond (in 2D) with corners on the axes — MSE contours almost always touch the diamond at a corner where one or more β = 0. This is the fundamental geometric reason Lasso does feature selection but Ridge does not.
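The geometric claim is easy to check numerically: at a comparable penalty strength, Ridge leaves every coefficient non-zero while Lasso zeroes several. A sketch on synthetic data (the exact zero count depends on the data and λ):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # → 0
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # several exact zeros
```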

Practice questions (GATE-style)

  1. As λ increases in Ridge regression, what happens to the model bias and variance? (Answer: Bias increases (model is constrained to smaller weights), variance decreases (weights are more stable across different training sets). This is the bias-variance trade-off controlled by λ.)
  2. Why does Ridge regression outperform OLS when features are highly correlated? (Answer: OLS requires (XᵀX)⁻¹ to exist. With multicollinearity, XᵀX is near-singular and the inverse is unstable. Ridge adds λI making (XᵀX + λI) always invertible.)
  3. You have 1000 features but suspect only 20 are relevant. Which regularisation should you use? (Answer: Lasso — it performs automatic feature selection by setting irrelevant feature coefficients to exactly zero.)
  4. Ridge regression with λ=0 is equivalent to: (Answer: Ordinary Least Squares (OLS) — no penalty is applied.)
  5. In the Lasso solution, coefficient β₃ = 0. What does this mean? (Answer: Feature 3 is not contributing to the prediction — Lasso has effectively removed it from the model. This is automatic feature selection.)
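The trade-off in question 1 can also be seen numerically: the norm of the Ridge coefficient vector shrinks monotonically as λ grows. A sketch on synthetic data (λ grid chosen for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

norms = [np.linalg.norm(Ridge(alpha=lam).fit(X, y).coef_)
         for lam in (0.01, 1, 100, 10_000)]

# Larger λ → smaller weights: higher bias, lower variance
print([round(n, 2) for n in norms])  # strictly decreasing
```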

On LumiChats

Understanding Ridge and Lasso regularisation helps you explain why LLMs like GPT and Claude use weight decay (L2 regularisation) during training — it is the same principle applied to billions of parameters instead of a few regression weights.
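As an illustration of that connection, one gradient step with L2 weight decay simply adds λw to the gradient, pulling every weight toward zero — a toy sketch with made-up numbers, not any real training loop:

```python
import numpy as np

w = np.array([1.0, -2.0])    # current weights
grad = np.array([0.1, 0.1])  # gradient of the data loss
eta, lam = 0.1, 0.01         # learning rate, weight-decay strength

# w ← w − η(∇L + λw): the λw term is the L2 penalty's gradient
w = w - eta * (grad + lam * w)
print(w)  # → [ 0.989 -2.008]
```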

