Part II · Chapter 5

Logistic Regression

Logistic regression is the canonical probabilistic binary classifier. Despite its name, it is a classification model. In this chapter we derive the sigmoid function from the log-odds, the cross-entropy loss from Bernoulli MLE, and an elegantly simple closed-form gradient.

1. The Sigmoid Function: Derived from Log-Odds

We model the log-odds (logit) of the probability \(P(y=1|\mathbf{x})\) as a linear function of the features:

\[ \log\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \mathbf{w}^\top\mathbf{x} = z \]

Solving for \(p = P(y=1|\mathbf{x})\):

\[ \frac{p}{1-p} = e^z \quad \Rightarrow \quad p = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}} \]
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

Useful properties of the sigmoid:

  • \(\sigma(z) \in (0,1)\): always a valid probability
  • \(\sigma(-z) = 1 - \sigma(z)\): symmetric about 0.5
  • \(\sigma'(z) = \sigma(z)(1-\sigma(z))\): a clean derivative, crucial for backpropagation
[Figure: the sigmoid curve \(\sigma(z)\); the decision boundary at \(z = 0\) separates Class 0 (\(p < 0.5\)) from Class 1 (\(p > 0.5\)).]

The sigmoid maps any real-valued score z to a probability. The decision boundary is at z = 0 (p = 0.5).
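The three properties above can be checked numerically. This is a minimal sketch (NumPy assumed); the piecewise form of `sigmoid` is one common way to avoid overflow for large negative \(z\), not something the text prescribes.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))          # always in (0, 1], never overflows
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z = np.linspace(-5.0, 5.0, 11)

# sigma(z) in (0, 1): always a valid probability
assert np.all((sigmoid(z) > 0) & (sigmoid(z) < 1))

# symmetry: sigma(-z) = 1 - sigma(z)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))

# derivative sigma'(z) = sigma(z)(1 - sigma(z)), checked by central differences
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert np.allclose(numeric, sigmoid(z) * (1 - sigmoid(z)), atol=1e-8)
```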

2. Cross-Entropy Loss: Derived from Bernoulli MLE

For binary labels \(y_i \in \{0,1\}\), the predicted probability is \(\hat{p}_i = \sigma(\mathbf{w}^\top\mathbf{x}_i)\). The Bernoulli likelihood for one example is:

\[ P(y_i | \mathbf{x}_i; \mathbf{w}) = \hat{p}_i^{y_i}(1-\hat{p}_i)^{1-y_i} \]

Assuming i.i.d. data, the log-likelihood is:

\[ \ell(\mathbf{w}) = \sum_{i=1}^n \Big[y_i \log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big] \]

Maximising the log-likelihood is equivalent to minimising the negative log-likelihood; averaged over the \(n\) examples, this is the binary cross-entropy loss:

\[ \mathcal{L}(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^n \Big[y_i \log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big] \]
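The equivalence between the averaged Bernoulli NLL and binary cross-entropy can be verified directly. A small sketch (NumPy assumed; the `eps` clipping is a standard guard against \(\log 0\), not part of the derivation):

```python
import numpy as np

def bce(y, p, eps=1e-12):
    """Binary cross-entropy, averaged over examples."""
    p = np.clip(p, eps, 1 - eps)    # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1, 0], dtype=float)
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# the Bernoulli likelihood of each label is p_hat if y=1, else 1 - p_hat;
# BCE is exactly the averaged negative log of that likelihood
nll = -np.mean(np.log(np.where(y == 1, p_hat, 1 - p_hat)))
assert np.isclose(bce(y, p_hat), nll)
```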

3. Gradient Derivation

We derive the gradient \(\nabla_{\mathbf{w}}\mathcal{L}\) step by step. Let \(z_i = \mathbf{w}^\top\mathbf{x}_i\) and \(\hat{p}_i = \sigma(z_i)\).

Step 1: Derivative of the per-example log-likelihood \(\ell_i\) with respect to \(z_i\). From \(\sigma'(z) = \sigma(z)(1-\sigma(z))\) it follows that \(\partial \log\hat p_i / \partial z_i = 1 - \hat p_i\) and \(\partial \log(1-\hat p_i)/\partial z_i = -\hat p_i\), so

\[ \frac{\partial \ell_i}{\partial z_i} = y_i(1-\hat{p}_i) - (1-y_i)\hat{p}_i = y_i - \hat{p}_i \]

Step 2: Chain rule to \(\mathbf{w}\), using \(\partial z_i / \partial \mathbf{w} = \mathbf{x}_i\):

\[ \frac{\partial \ell_i}{\partial \mathbf{w}} = (y_i - \hat{p}_i)\mathbf{x}_i \]

Step 3: Average over all examples, flipping the sign to turn the log-likelihood into a loss:

\[ \nabla_{\mathbf{w}}\mathcal{L} = \frac{1}{n}\mathbf{X}^\top\big(\boldsymbol{\sigma}(\mathbf{X}\mathbf{w}) - \mathbf{y}\big) \]

The gradient has the same elegant form as the OLS gradient \(\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y})/n\): features times residuals. The gradient descent update is \(\mathbf{w} \leftarrow \mathbf{w} - \eta\,\mathbf{X}^\top(\hat{\mathbf{p}} - \mathbf{y})/n\).
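The closed-form gradient can be validated against a finite-difference approximation of the loss, a standard gradient check. A sketch (NumPy assumed, random data for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w):
    """Binary cross-entropy at weights w."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# analytic gradient: X^T (p_hat - y) / n
analytic = X.T @ (sigmoid(X @ w) - y) / len(y)

# central finite differences, one coordinate at a time
h = 1e-6
numeric = np.array([(loss(w + h * e) - loss(w - h * e)) / (2 * h)
                    for e in np.eye(3)])
assert np.allclose(analytic, numeric, atol=1e-6)
```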

4. Newton's Method / IRLS

Newton's method uses the Hessian for second-order updates. The Hessian of the logistic loss is:

\[ \mathbf{H} = \nabla^2\mathcal{L} = \frac{1}{n}\mathbf{X}^\top\mathbf{W}\mathbf{X}, \quad \mathbf{W} = \mathrm{diag}\!\left(\hat{p}_i(1-\hat{p}_i)\right) \]

The Newton update is:

\[ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1}\nabla\mathcal{L} = (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{z} \]

where \(\mathbf{z} = \mathbf{X}\mathbf{w}^{(t)} + \mathbf{W}^{-1}(\mathbf{y} - \hat{\mathbf{p}})\) is the adjusted response. This is exactly Iteratively Reweighted Least Squares (IRLS): each Newton step solves a weighted OLS problem. Near the optimum, Newton's method converges quadratically, versus the linear rate of gradient descent.
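The two equations above translate almost line for line into code. This is a minimal sketch (NumPy assumed; the `1e-8` floor on the weights and the fixed iteration count are pragmatic choices, not part of the derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_irls(X, y, n_iter=10):
    """Newton's method as IRLS: one weighted least-squares solve per step."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        W = p * (1 - p)                              # diagonal of the weight matrix
        z = X @ w + (y - p) / np.maximum(W, 1e-8)    # adjusted response
        # weighted normal equations: (X^T W X) w = X^T W z
        w = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return w

# illustrative data: Bernoulli labels from a known linear model (bias column first)
rng = np.random.default_rng(1)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ w_true)).astype(float)

w_hat = fit_irls(X, y)
acc = np.mean((sigmoid(X @ w_hat) > 0.5) == y)
```

Note that IRLS diverges on perfectly separable data (the MLE weights go to infinity), which is why the labels here are sampled noisily.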

5. Multi-Class: Softmax Derivation

For \(K\) classes, generalise logistic regression to model log-odds relative to a reference class. The class probabilities are given by the softmax function:

\[ P(y=k|\mathbf{x}) = \frac{e^{\mathbf{w}_k^\top\mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^\top\mathbf{x}}} \]

Softmax is derived by requiring that all \(K\) probabilities sum to 1 and are proportional to exponentiated scores. For \(K=2\), softmax reduces to sigmoid. The loss is the multi-class cross-entropy (categorical NLL):

\[ \mathcal{L} = -\frac{1}{n}\sum_{i=1}^n\sum_{k=1}^K \mathbf{1}[y_i=k]\log P(y=k|\mathbf{x}_i) \]

The gradient for class \(k\) is \(\nabla_{\mathbf{w}_k}\mathcal{L} = \frac{1}{n}\mathbf{X}^\top(\hat{\mathbf{p}}_k - \mathbf{y}_k)\) where \(\hat{\mathbf{p}}_k\) is the vector of predicted probabilities for class \(k\).
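The reduction of softmax to the sigmoid for \(K=2\) is easy to verify: fixing the reference class's score at 0 gives \(P(y=1|\mathbf{x}) = e^z/(1+e^z) = \sigma(z)\). A sketch (NumPy assumed; the max-subtraction is the usual numerical-stability trick):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax over score matrix Z (n examples x K classes)."""
    Z = Z - Z.max(axis=1, keepdims=True)   # shift scores for numerical stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
# K = 2 with the reference class's score fixed at 0
two_class = softmax(np.c_[np.zeros_like(z), z])

# P(y=1) = softmax([0, z])[1] = sigma(z)
assert np.allclose(two_class[:, 1], sigmoid(z))
# each row is a valid probability distribution
assert np.allclose(two_class.sum(axis=1), 1.0)
```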

Python: Logistic Regression from Scratch

We implement gradient descent for logistic regression from scratch, plot the decision boundary on 2D data, and show convergence of the cross-entropy loss.

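A minimal sketch of the training loop described above (NumPy only, with the plotting omitted; the synthetic two-blob data and the `1e-12` log guard are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the binary cross-entropy loss."""
    w = np.zeros(X.shape[1])
    losses = []
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        losses.append(-np.mean(y * np.log(p + 1e-12)
                               + (1 - y) * np.log(1 - p + 1e-12)))
        w -= lr * X.T @ (p - y) / len(y)    # gradient: X^T(p_hat - y)/n
    return w, losses

# synthetic 2D data: two Gaussian blobs, bias handled via a column of ones
rng = np.random.default_rng(42)
n = 100
X0 = rng.normal([-1.0, -1.0], 1.0, size=(n, 2))
X1 = rng.normal([+1.0, +1.0], 1.0, size=(n, 2))
X = np.c_[np.ones(2 * n), np.vstack([X0, X1])]
y = np.r_[np.zeros(n), np.ones(n)]

w, losses = train_logreg(X, y)
acc = np.mean((sigmoid(X @ w) > 0.5) == y)
print(f"final loss {losses[-1]:.3f}, train accuracy {acc:.2f}")

# with a small enough learning rate the cross-entropy loss decreases
assert losses[-1] < losses[0]
```

The learned decision boundary is the line \(w_0 + w_1 x_1 + w_2 x_2 = 0\), which can be drawn over the scatter of the two blobs.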

Key Takeaways

  • ✓ The sigmoid arises naturally from modelling the log-odds as a linear function of the features.
  • ✓ Cross-entropy loss is the negative Bernoulli log-likelihood; minimising it is equivalent to MLE.
  • ✓ The gradient \(X^\top(\hat{p} - y)/n\) has the same elegant residual form as the OLS gradient, derived via \(\sigma' = \sigma(1-\sigma)\).
  • ✓ IRLS (Newton's method) converges quadratically by solving a sequence of weighted least squares problems.
  • ✓ Softmax generalises the sigmoid to \(K\) classes and is the output layer of most neural network classifiers.