
Logistic Regression

Predicting probabilities and class labels — the workhorse of binary classification.


Definition

Logistic regression is a classification algorithm (despite its name) that models the probability that an input belongs to a class. It applies the sigmoid function to a linear combination of features to output a value between 0 and 1. The model is trained by Maximum Likelihood Estimation (MLE), which is equivalent to minimising the cross-entropy loss, typically via gradient descent. Logistic regression is one of the most important GATE DS&AI topics — tested almost every year. It is also the building block for neural network output layers.

Real-life analogy: The doctor's diagnosis

A doctor examines blood pressure, cholesterol, and age to decide if a patient has heart disease (yes/no). Logistic regression does exactly this: it combines multiple factors with learned weights, passes the result through a sigmoid function to get a probability (e.g., 0.82 = 82% chance of heart disease), and then classifies above 0.5 as positive. The doctor's threshold (50%) can be adjusted — if the disease is dangerous, you might use 0.3 to catch more cases.
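The effect of lowering the threshold can be sketched with a few hypothetical probabilities (the values below are made up for illustration):

```python
import numpy as np

# Hypothetical predicted probabilities for six patients
probs = np.array([0.15, 0.35, 0.48, 0.55, 0.82, 0.95])

# Default threshold 0.5 vs a more sensitive 0.3 threshold
preds_default   = (probs >= 0.5).astype(int)
preds_sensitive = (probs >= 0.3).astype(int)

print(preds_default.sum())    # 3 patients flagged at 0.5
print(preds_sensitive.sum())  # 5 patients flagged at 0.3
```

Lowering the threshold trades precision for recall: more true cases are caught, at the cost of more false alarms.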

The sigmoid function and decision boundary

The sigmoid function σ(z) = 1/(1 + e⁻ᶻ) squashes any real number z ∈ (−∞, +∞) to (0, 1). The decision boundary is where P(y=1|x) = 0.5, which occurs at z = 0, i.e., β₀ + β₁x₁ + … = 0 — a hyperplane in feature space.
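A quick numerical check of these properties (a minimal sketch using NumPy):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 -- exactly on the decision boundary
print(sigmoid(6.0))    # ~0.9975 -- saturates towards 1
print(sigmoid(-6.0))   # ~0.0025 -- saturates towards 0
```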

The log-odds (logit) interpretation: log(P/(1−P)) = βᵀx. Each unit increase in feature xⱼ multiplies the odds by e^βⱼ. If β₁ = 0.5, then every unit of x₁ increases the odds of the positive class by e^0.5 ≈ 1.65× — a 65% increase in odds.
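The odds-ratio arithmetic above can be verified directly (the baseline p = 0.4 below is an arbitrary illustrative value):

```python
import numpy as np

beta1 = 0.5
odds_ratio = np.exp(beta1)
print(round(odds_ratio, 2))        # 1.65

p    = 0.4                         # hypothetical baseline P(y=1)
odds = p / (1 - p)
new_odds = odds * odds_ratio       # odds after a one-unit increase in x1
print(round(new_odds / odds, 2))   # 1.65x the original odds
```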

Logistic regression from scratch (gradient descent) + sklearn

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# ── From scratch (binary logistic regression) ──
def sigmoid(z): return 1 / (1 + np.exp(-z))

def logistic_gradient_descent(X, y, lr=0.01, epochs=1000):
    n, p = X.shape
    beta = np.zeros(p + 1)
    X_aug = np.c_[np.ones(n), X]   # Add bias column
    for _ in range(epochs):
        z     = X_aug @ beta
        y_hat = sigmoid(z)
        grad  = X_aug.T @ (y_hat - y) / n
        beta -= lr * grad           # Gradient descent step
    return beta

# Generate binary classification data
X, y = make_classification(n_samples=500, n_features=4,
                            random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# From scratch
beta = logistic_gradient_descent(X_train, y_train, lr=0.1, epochs=1000)
X_test_aug = np.c_[np.ones(len(X_test)), X_test]
probs       = sigmoid(X_test_aug @ beta)
preds       = (probs >= 0.5).astype(int)
print(f"Scratch accuracy: {accuracy_score(y_test, preds):.3f}")

# sklearn
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"sklearn accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
print(classification_report(y_test, clf.predict(X_test)))

Loss function: binary cross-entropy (log loss)

Binary cross-entropy loss: L = −(1/n) Σᵢ [yᵢ log(p̂ᵢ) + (1−yᵢ) log(1−p̂ᵢ)]. When y=1: loss = −log(p̂) — penalises low predicted probability for true positives. When y=0: loss = −log(1−p̂) — penalises high predicted probability for true negatives. Unlike linear regression, there is no closed-form solution — the weights are found iteratively, e.g. by gradient descent.
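The loss can be computed in a few lines (a minimal sketch; the `eps` clipping is added here only to guard against log(0)):

```python
import numpy as np

def binary_cross_entropy(y, p_hat, eps=1e-12):
    p_hat = np.clip(p_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y     = np.array([1, 0, 1, 0])
p_hat = np.array([0.9, 0.1, 0.6, 0.4])     # hypothetical predictions
print(round(binary_cross_entropy(y, p_hat), 4))   # 0.3081
```

Note that the confident, correct predictions (0.9 and 0.1) contribute far less loss than the uncertain ones (0.6 and 0.4).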

Why not use MSE for logistic regression?

MSE applied to sigmoid outputs creates a non-convex loss landscape with local minima — gradient descent may stall or converge to a poor solution. Cross-entropy with sigmoid yields a loss that is convex in the weights, so gradient descent (with a suitable learning rate) converges to the global minimum. This is why cross-entropy is the standard loss for classification.
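Convexity aside, MSE also starves the model of learning signal when a prediction is confidently wrong, because its gradient with respect to z carries an extra σ′(z) = σ(z)(1 − σ(z)) factor that vanishes in the saturated regions of the sigmoid. A single-sample sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Confidently wrong prediction: true label y = 1, but z = -8 so p_hat is near 0
y, z = 1.0, -8.0
p = sigmoid(z)

grad_ce  = p - y                  # dL/dz for cross-entropy: stays large
grad_mse = (p - y) * p * (1 - p)  # dL/dz for MSE: extra sigmoid'(z) factor

print(f"{grad_ce:.4f}")    # ~ -1.0, strong learning signal
print(f"{grad_mse:.6f}")   # ~ -0.0003, almost no signal
```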

Multiclass logistic regression (Softmax)

Softmax regression (multinomial logistic regression) extends logistic regression to K classes. Each class has its own weight vector wₖ, and P(y=k|x) = e^(wₖᵀx) / Σⱼ e^(wⱼᵀx). Softmax outputs a probability distribution over the K classes that sums to 1. This is the output layer of most neural network classifiers.
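A minimal softmax sketch (the logits below are arbitrary illustrative scores for K = 3 classes):

```python
import numpy as np

def softmax(z):
    z = z - z.max()      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical class scores
p = softmax(logits)
print(p.round(3))              # one probability per class
print(round(p.sum(), 6))       # 1.0
```

Subtracting the maximum logit before exponentiating leaves the output unchanged mathematically but prevents overflow for large scores.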

Practice questions (GATE-style)

  1. What does the sigmoid output of 0.72 mean in logistic regression? (Answer: The model predicts a 72% probability of the input belonging to class 1. With threshold 0.5, it classifies as class 1.)
  2. Why is logistic regression called "regression" when it does classification? (Answer: It models the log-odds as a linear regression: log(P/(1-P)) = βᵀx. The "regression" refers to modelling the log-odds, not the binary class label directly.)
  3. A logistic regression model has coefficient β₁ = 1.2 for feature "hours studied". What is the odds ratio? (Answer: e^1.2 ≈ 3.32. Each additional hour of study multiplies the odds of passing by 3.32×.)
  4. For a 3-class problem, logistic regression uses: (Answer: Softmax (multinomial logistic regression) with 3 weight vectors — one per class. Output is a probability vector summing to 1.)
  5. What is the gradient of cross-entropy loss with respect to weights? (Answer: ∇_β L = (1/n) Xᵀ(ŷ − y) — identical in form to linear regression gradient but with ŷ = sigmoid(Xβ) instead of Xβ.)
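The gradient formula in question 5 can be sanity-checked against a central finite-difference approximation (a minimal sketch on random data):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(beta, X, y, eps=1e-12):
    p = np.clip(sigmoid(X @ beta), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng  = np.random.default_rng(0)
X    = rng.normal(size=(50, 3))
y    = (rng.random(50) < 0.5).astype(float)
beta = rng.normal(size=3)

# Analytic gradient: (1/n) X^T (y_hat - y)
analytic = X.T @ (sigmoid(X @ beta) - y) / len(y)

# Central finite differences, one coordinate at a time
h = 1e-6
numeric = np.array([
    (loss(beta + h * np.eye(3)[j], X, y)
     - loss(beta - h * np.eye(3)[j], X, y)) / (2 * h)
    for j in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```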

On LumiChats

Logistic regression is the mathematical core of neural network output layers. When a language model outputs a probability distribution over vocabulary tokens, it uses softmax — the multi-class generalisation of logistic regression — applied to the final hidden state.

