Linear Regression
Linear regression is the simplest supervised learning algorithm, yet it contains the essential ideas of loss minimisation, closed-form solutions, regularisation, and the bias-variance tradeoff that recur throughout machine learning.
1. The Linear Model
We model a response \(y \in \mathbb{R}\) as a linear function of features \(\mathbf{x} \in \mathbb{R}^p\) plus Gaussian noise:

\[ y = \mathbf{x}^\top \boldsymbol{\beta} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2). \]
In matrix form, for \(n\) observations collected into a design matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) (rows are data points) and response vector \(\mathbf{y} \in \mathbb{R}^n\):

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}). \]
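As a quick illustration, synthetic data can be drawn from this model with NumPy. The sizes, coefficients, and noise level below are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                   # design matrix, rows are data points
beta_true = np.array([2.0, -1.0, 0.5])        # "true" coefficients (illustrative)
sigma = 0.3                                   # noise standard deviation
y = X @ beta_true + rng.normal(scale=sigma, size=n)   # y = X beta + eps
```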
2. Ordinary Least Squares: Step-by-Step Derivation
OLS minimises the sum of squared residuals. Define the loss:

\[ \mathcal{L}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}). \]

Step 1: Expand

\[ \mathcal{L}(\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}. \]

Step 2: Take gradient with respect to \(\boldsymbol{\beta}\) (using \(\nabla_\beta (\beta^\top A \beta) = 2A\beta\) for symmetric \(A\)):

\[ \nabla_{\boldsymbol{\beta}} \mathcal{L} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}. \]

Step 3: Set gradient to zero, giving the normal equations:

\[ \mathbf{X}^\top\mathbf{X}\,\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}. \]

Step 4: Solve (when \(\mathbf{X}^\top\mathbf{X}\) is invertible, i.e., \(\mathbf{X}\) has full column rank):

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}. \]
Step 5: Verify it is a minimum. The Hessian \(\nabla^2 \mathcal{L} = 2\mathbf{X}^\top\mathbf{X} \succeq 0\) is positive semi-definite, confirming a global minimum; it is strictly positive definite when \(\mathbf{X}\) has full column rank.
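The closed-form solution can be checked numerically against NumPy's least-squares solver; the data here is synthetic, generated only for the check:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

# Normal equations: solve (X^T X) beta = X^T y rather than forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the numerically stabler built-in least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

In practice one solves the normal equations (or uses a QR/SVD-based routine like `lstsq`) instead of explicitly inverting \(\mathbf{X}^\top\mathbf{X}\), which is slower and less stable.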
3. Geometric Interpretation: Projection
The fitted values \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) are the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\):

\[ \hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = \mathbf{H}\mathbf{y}, \qquad \mathbf{H} := \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top. \]
The hat matrix \(\mathbf{H}\) is an orthogonal projector: \(\mathbf{H}^2 = \mathbf{H}\), \(\mathbf{H}^\top = \mathbf{H}\). The residuals \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}\) are orthogonal to the column space: \(\mathbf{X}^\top \mathbf{e} = \mathbf{0}\).
OLS finds the closest point in the column space of X to y, with residuals orthogonal to every column of X.
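These projector identities are easy to verify numerically on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X^T X)^{-1} X^T
e = y - H @ y                           # residuals e = (I - H) y

assert np.allclose(H @ H, H)            # idempotent: projecting twice changes nothing
assert np.allclose(H, H.T)              # symmetric: it is an *orthogonal* projector
assert np.allclose(X.T @ e, 0)          # residuals orthogonal to every column of X
```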
4. Ridge Regression (L2 Regularisation)
When \(\mathbf{X}^\top\mathbf{X}\) is near-singular or \(p \gg n\), OLS is ill-conditioned. Ridge adds a squared penalty:

\[ \mathcal{L}_{\text{ridge}}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|^2. \]
Taking the gradient and setting to zero:

\[ \hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}. \]
Adding \(\lambda \mathbf{I}\) shifts every eigenvalue of \(\mathbf{X}^\top\mathbf{X}\) up by \(\lambda\), so the matrix is invertible for any \(\lambda > 0\). Ridge is equivalent to MAP estimation under a Gaussian prior \(\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, (\sigma^2/\lambda)\,\mathbf{I})\).
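A quick numerical check of the ridge solution and the eigenvalue shift, on synthetic data with an arbitrary \(\lambda\):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 2.0

A = X.T @ X
beta_ridge = np.linalg.solve(A + lam * np.eye(5), X.T @ y)   # ridge closed form

# Eigenvalues of X^T X + lam*I are exactly those of X^T X shifted up by lam
assert np.allclose(np.linalg.eigvalsh(A + lam * np.eye(5)),
                   np.linalg.eigvalsh(A) + lam)
```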
Lasso (L1 Regularisation)
Lasso uses an L1 penalty, which induces sparsity: many coefficients are driven to exactly zero. With the scaling that matches the coordinate-descent update below,

\[ \mathcal{L}_{\text{lasso}}(\boldsymbol{\beta}) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1. \]
The L1 penalty has a corner at zero (not differentiable), so solutions with many zeros are geometrically favoured. Coordinate descent solves Lasso by the soft-thresholding operator: \(\hat{\beta}_j = \mathrm{sign}(\rho_j)\max(|\rho_j| - \lambda n, 0) / \|\mathbf{x}_j\|^2\) where \(\rho_j = \mathbf{x}_j^\top(\mathbf{y} - \mathbf{X}_{-j}\boldsymbol{\beta}_{-j})\).
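A minimal coordinate-descent sketch built on this soft-thresholding update (assuming the \(1/(2n)\)-scaled objective; `lasso_cd` and `soft_threshold` are illustrative names, not a library API):

```python
import numpy as np

def soft_threshold(rho, t):
    """Soft-thresholding operator: sign(rho) * max(|rho| - t, 0)."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with coordinate j removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam * n) / (X[:, j] @ X[:, j])
    return beta
```

With `lam=0` the threshold is inactive and the iteration converges to the OLS solution; for large `lam` every coefficient is driven to exactly zero.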
5. Bias-Variance Tradeoff: Full Derivation
For a new test point \(\mathbf{x}\) with true value \(y = f(\mathbf{x}) + \varepsilon\), \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\), the expected MSE of an estimator \(\hat{f}\) decomposes as:

\[ \mathbb{E}\big[(y - \hat{f}(\mathbf{x}))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(\mathbf{x})] - f(\mathbf{x})\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[\big(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]\big)^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}. \]
Expanding \(\mathbb{E}[(\hat{f}-f)^2]\) by adding and subtracting \(\mathbb{E}[\hat{f}]\):

\[ \mathbb{E}\big[(\hat{f} - \mathbb{E}[\hat{f}] + \mathbb{E}[\hat{f}] - f)^2\big] = \mathbb{E}\big[(\hat{f} - \mathbb{E}[\hat{f}])^2\big] + \big(\mathbb{E}[\hat{f}] - f\big)^2, \]

since the cross term \(2\,\mathbb{E}\big[\hat{f} - \mathbb{E}[\hat{f}]\big]\big(\mathbb{E}[\hat{f}] - f\big)\) vanishes. Adding the irreducible noise \(\sigma^2\) gives the decomposition above.
Simple models (low complexity) have high bias but low variance. Complex models (high degree, small regularisation) have low bias but high variance. Regularisation (Ridge, Lasso) trades some extra bias for a large reduction in variance, often lowering total MSE.
Python: OLS vs Ridge vs Lasso & Bias-Variance Tradeoff
We fit polynomial regression models of increasing degree, compare OLS, Ridge, and Lasso fits, and empirically plot the bias-variance decomposition by averaging over many training sets.
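The runnable demo is not included in the text. A self-contained NumPy-only sketch of the experiment it describes (polynomial fits of increasing degree, with bias\(^2\) and variance estimated by averaging predictions over many training sets) might look like the following; the target function, sample sizes, and \(\lambda\) are assumptions for illustration, and a Lasso fit could be slotted in via the soft-thresholding coordinate descent described earlier:

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    return np.sin(2 * np.pi * x)          # assumed ground-truth function

def poly_features(x, degree):
    return np.vander(x, degree + 1, increasing=True)   # [1, x, x^2, ...]

def fit(X, y, lam):
    """OLS via lstsq when lam == 0, otherwise the ridge closed form."""
    if lam == 0.0:
        return np.linalg.lstsq(X, y, rcond=None)[0]
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def bias_variance(degree, lam, n_train=30, n_sets=200, sigma=0.3):
    """Estimate bias^2 and variance by refitting on many fresh training sets."""
    x_test = np.linspace(0, 1, 50)
    X_test = poly_features(x_test, degree)
    preds = np.empty((n_sets, x_test.size))
    for s in range(n_sets):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + sigma * rng.normal(size=n_train)
        preds[s] = X_test @ fit(poly_features(x, degree), y, lam)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    var = np.mean(preds.var(axis=0))
    return bias2, var

b_lo, v_lo = bias_variance(degree=1, lam=0.0)    # simple model: high bias, low variance
b_hi, v_hi = bias_variance(degree=9, lam=0.0)    # complex model: low bias, high variance
b_rg, v_rg = bias_variance(degree=9, lam=1e-3)   # ridge on the complex model: variance tamed
```

Plotting `bias2`, `var`, and their sum against degree (or against \(\lambda\)) reproduces the classic U-shaped test-error curve.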
Key Takeaways
- OLS normal equations \(\hat\beta = (X^\top X)^{-1}X^\top y\) follow from setting \(\nabla_\beta \|y-X\beta\|^2 = 0\).
- Geometrically, OLS projects \(y\) onto the column space of \(X\); residuals are orthogonal to all columns.
- Ridge adds \(\lambda I\) to make \(X^\top X\) invertible and shrinks coefficients; it is equivalent to MAP estimation with a Gaussian prior.
- Lasso's L1 penalty promotes sparsity via soft-thresholding, performing implicit feature selection.
- Test MSE = Bias\(^2\) + Variance + Noise; regularisation reduces variance at the cost of a small increase in bias.