Part II Β· Chapter 4

Linear Regression

Linear regression is the simplest supervised learning algorithm, yet it contains the essential ideas of loss minimisation, closed-form solutions, regularisation, and the bias-variance tradeoff that recur throughout machine learning.

1. The Linear Model

We model a response \(y \in \mathbb{R}\) as a linear function of features \(\mathbf{x} \in \mathbb{R}^p\) plus Gaussian noise:

\[ y = \boldsymbol{\beta}^\top \mathbf{x} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]

In matrix form, for \(n\) observations collected into a design matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) (rows are data points) and response vector \(\mathbf{y} \in \mathbb{R}^n\):

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}) \]
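This generative model is easy to simulate. A minimal sketch (the coefficients, noise level, and sample size are illustrative choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
beta_true = np.array([1.5, -2.0, 0.5])   # hypothetical true coefficients

X = rng.normal(size=(n, p))              # design matrix, rows are data points
eps = rng.normal(scale=0.3, size=n)      # Gaussian noise, sigma = 0.3
y = X @ beta_true + eps                  # y = X beta + eps

print(X.shape, y.shape)                  # (200, 3) (200,)
```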

2. Ordinary Least Squares β€” Step-by-Step Derivation

OLS minimises the sum of squared residuals. Define the loss:

\[ \mathcal{L}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]

Step 1: Expand

\[ \mathcal{L}(\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \]

Step 2: Take gradient with respect to \(\boldsymbol{\beta}\) (using \(\nabla_\beta (\beta^\top A \beta) = 2A\beta\) for symmetric \(A\)):

\[ \nabla_{\boldsymbol{\beta}}\mathcal{L} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \]

Step 3: Set gradient to zero

\[ \mathbf{X}^\top\mathbf{X}\,\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y} \quad \text{(Normal Equations)} \]

Step 4: Solve (when \(\mathbf{X}^\top\mathbf{X}\) is invertible, i.e., \(\mathbf{X}\) has full column rank):

\[ \hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \]

Step 5: Verify it is a minimum β€” the Hessian \(\nabla^2 \mathcal{L} = 2\mathbf{X}^\top\mathbf{X} \succeq 0\) is positive semi-definite, confirming a global minimum. It is strictly PD when \(\mathbf{X}\) has full column rank.
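The closed form can be checked numerically. A sketch using synthetic data (coefficients and noise level are illustrative): solving the normal equations directly should agree with NumPy's least-squares routine, which uses a numerically stabler factorisation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
beta_true = np.array([1.5, -2.0, 0.5])   # hypothetical ground truth
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X^T X beta = X^T y.
# np.linalg.solve is preferred over forming the explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against lstsq, which solves the same problem via SVD.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```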

3. Geometric Interpretation: Projection

The fitted values \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) are the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\):

\[ \hat{\mathbf{y}} = \underbrace{\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top}_{\mathbf{H} \text{ (hat matrix)}}\,\mathbf{y} \]

The hat matrix \(\mathbf{H}\) is an orthogonal projector: \(\mathbf{H}^2 = \mathbf{H}\), \(\mathbf{H}^\top = \mathbf{H}\). The residuals \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}\) are orthogonal to the column space: \(\mathbf{X}^\top \mathbf{e} = \mathbf{0}\).

[Figure: the vector y and the plane col(X) spanned by x₁, x₂; the projection ŷ = Hy is the closest point in col(X) to y, and the residual e = y βˆ’ ŷ satisfies e βŠ₯ col(X), so Xα΅€Xβ̂ = Xα΅€y.]

OLS finds the closest point in the column space of X to y, with residuals orthogonal to every column of X.
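The projector identities are straightforward to verify numerically. A sketch with random data (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

# Hat matrix H = X (X^T X)^{-1} X^T, built without an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(H @ H, H))    # idempotent: H^2 = H
print(np.allclose(H, H.T))      # symmetric: H^T = H
e = y - H @ y                   # residuals e = (I - H) y
print(np.allclose(X.T @ e, 0))  # orthogonal to every column of X
```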

4. Ridge Regression (L2 Regularisation)

When \(\mathbf{X}^\top\mathbf{X}\) is near-singular or \(p \gg n\), OLS is ill-conditioned. Ridge adds a squared penalty:

\[ \mathcal{L}_{\mathrm{Ridge}} = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2 \]

Taking the gradient and setting to zero:

\[ -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} + 2\lambda\boldsymbol{\beta} = \mathbf{0} \]
\[ \hat{\boldsymbol{\beta}}_{\mathrm{Ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y} \]

Adding \(\lambda \mathbf{I}\) shifts all eigenvalues of \(\mathbf{X}^\top\mathbf{X}\) up by \(\lambda\), ensuring the matrix is invertible. Ridge is equivalent to MAP estimation with a Gaussian prior \(\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \sigma^2/\lambda\,\mathbf{I})\).
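Both claims, the closed form and the eigenvalue shift, can be checked directly. A sketch with a deliberately near-collinear design (the data and \(\lambda\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 1e-6 * rng.normal(size=n)  # nearly collinear columns
y = rng.normal(size=n)
lam = 1.0

# Ridge closed form: (X^T X + lam I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Eigenvalues of X^T X + lam I are those of X^T X shifted up by lam.
eigs = np.linalg.eigvalsh(X.T @ X)
eigs_ridge = np.linalg.eigvalsh(X.T @ X + lam * np.eye(p))
print(np.allclose(eigs + lam, eigs_ridge))  # True
```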

Lasso (L1 Regularisation)

Lasso uses an L1 penalty, which induces sparsity β€” many coefficients are driven to exactly zero:

\[ \mathcal{L}_{\mathrm{Lasso}} = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|_1 \]

The L1 penalty has a corner at zero (it is not differentiable there), so solutions with many exact zeros are geometrically favoured. Coordinate descent solves the Lasso via the soft-thresholding operator: for the loss as written above, \(\hat{\beta}_j = \mathrm{sign}(\rho_j)\max(|\rho_j| - \lambda/2,\, 0) / \|\mathbf{x}_j\|^2\), where \(\rho_j = \mathbf{x}_j^\top(\mathbf{y} - \mathbf{X}_{-j}\boldsymbol{\beta}_{-j})\) is the correlation of feature \(j\) with the partial residual. (Under the \(\tfrac{1}{2n}\)-scaled loss used by scikit-learn, the threshold becomes \(\lambda n\).)
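A minimal coordinate-descent sketch for the unscaled loss \(\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|_1\) (so the threshold is \(\lambda/2\)); the sparse ground truth and \(\lambda\) are illustrative choices:

```python
import numpy as np

def soft_threshold(rho, thresh):
    """sign(rho) * max(|rho| - thresh, 0)"""
    return np.sign(rho) * max(abs(rho) - thresh, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            rho = X[:, j] @ r
            beta[j] = soft_threshold(rho, lam / 2) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
beta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])  # sparse truth
y = X @ beta_true + rng.normal(scale=0.5, size=100)

beta = lasso_cd(X, y, lam=20.0)
print(np.round(beta, 2))  # most inactive coefficients end up exactly zero
```

Note that the zeros are exact, not merely small: whenever \(|\rho_j| \le \lambda/2\), soft-thresholding returns precisely 0.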

5. Bias-Variance Tradeoff β€” Full Derivation

For a new test point \(\mathbf{x}\) with true value \(y = f(\mathbf{x}) + \varepsilon\), the expected MSE of an estimator \(\hat{f}\) decomposes as:

\[ \mathbb{E}\!\left[(y - \hat{f}(\mathbf{x}))^2\right] = \mathbb{E}\!\left[(\hat{f} - f)^2\right] + \sigma^2 \]

since the cross term \(2\,\mathbb{E}[\varepsilon(f - \hat{f})]\) vanishes: the test noise \(\varepsilon\) has mean zero and is independent of \(\hat{f}\).

Expanding \(\mathbb{E}[(\hat{f}-f)^2]\) by adding and subtracting \(\mathbb{E}[\hat{f}]\):

\[ = \underbrace{\left(\mathbb{E}[\hat{f}] - f\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\!\left[\left(\hat{f} - \mathbb{E}[\hat{f}]\right)^2\right]}_{\text{Variance}} \]
\[ \underbrace{\mathbb{E}[(y-\hat{f})^2]}_{\text{Test MSE}} = \underbrace{(\mathbb{E}[\hat{f}]-f)^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}(\hat{f})}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}} \]

Simple models (low complexity) have high bias but low variance. Complex models (high degree, small regularisation) have low bias but high variance. Regularisation (Ridge, Lasso) trades some extra bias for a large reduction in variance, often lowering total MSE.

Python: OLS vs Ridge vs Lasso & Bias-Variance Tradeoff

We fit polynomial regression models of increasing degree, compare OLS, Ridge, and Lasso fits, and empirically plot the bias-variance decomposition by averaging over many training sets.

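The full interactive script is not reproduced here; the following is a minimal stand-in for its bias-variance portion. The true function \(f(x) = \sin 2\pi x\), the noise level, the test point, and the polynomial degrees are illustrative choices:

```python
import numpy as np

# Empirical bias-variance decomposition for polynomial regression:
# refit on many fresh training sets and decompose the predictions
# at a single test point.
rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
sigma = 0.3
x_test = 0.35                        # test point at which we decompose
n_train, n_reps = 30, 500

results = {}
for degree in (1, 3, 9):
    preds = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(scale=sigma, size=n_train)
        coefs = np.polyfit(x, y, degree)     # OLS polynomial fit
        preds[r] = np.polyval(coefs, x_test)
    bias2 = (preds.mean() - f(x_test)) ** 2
    var = preds.var()
    results[degree] = (bias2, var)
    print(f"degree {degree}: bias^2={bias2:.4f}  variance={var:.4f}")
```

The low-degree fit shows large squared bias and small variance; the high-degree fit shows the reverse, matching the decomposition above.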

Key Takeaways

  • OLS normal equations \(\hat\beta = (X^\top X)^{-1}X^\top y\) follow from setting \(\nabla_\beta \|y-X\beta\|^2 = 0\).
  • Geometrically, OLS projects \(y\) onto the column space of \(X\); residuals are orthogonal to all columns.
  • Ridge adds \(\lambda I\) to make \(X^\top X\) invertible and shrinks coefficients β€” equivalent to a Gaussian prior (MAP).
  • Lasso's L1 penalty promotes sparsity via soft-thresholding, performing implicit feature selection.
  • Test MSE = BiasΒ² + Variance + Noise; regularisation reduces variance at the cost of a small increase in bias.