Part II Β· Chapter 4

Linear Regression

Linear regression is the simplest supervised learning algorithm, yet it contains the essential ideas of loss minimisation, closed-form solutions, regularisation, and the bias-variance tradeoff that recur throughout machine learning.

1. The Linear Model

We model a response \(y \in \mathbb{R}\) as a linear function of features \(\mathbf{x} \in \mathbb{R}^p\) plus Gaussian noise:

\[ y = \boldsymbol{\beta}^\top \mathbf{x} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]

In matrix form, for \(n\) observations collected into a design matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) (rows are data points) and response vector \(\mathbf{y} \in \mathbb{R}^n\):

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}) \]
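This generative model is easy to simulate. A minimal sketch (the coefficients, noise level, and sample size are illustrative choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
beta_true = np.array([1.5, -2.0, 0.5])   # hypothetical true coefficients

X = rng.normal(size=(n, p))              # design matrix, rows are data points
eps = rng.normal(scale=0.3, size=n)      # Gaussian noise, sigma = 0.3
y = X @ beta_true + eps                  # y = X beta + eps

print(X.shape, y.shape)                  # (200, 3) (200,)
```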

2. Ordinary Least Squares β€” Step-by-Step Derivation

OLS minimises the sum of squared residuals. Define the loss:

\[ \mathcal{L}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]

Step 1: Expand

\[ \mathcal{L}(\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \]

Step 2: Take gradient with respect to \(\boldsymbol{\beta}\) (using \(\nabla_\beta (\beta^\top A \beta) = 2A\beta\) for symmetric \(A\)):

\[ \nabla_{\boldsymbol{\beta}}\mathcal{L} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \]

Step 3: Set gradient to zero

\[ \mathbf{X}^\top\mathbf{X}\,\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y} \quad \text{(Normal Equations)} \]

Step 4: Solve (when \(\mathbf{X}^\top\mathbf{X}\) is invertible, i.e., \(\mathbf{X}\) has full column rank):

\[ \hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \]

Step 5: Verify it is a minimum β€” the Hessian \(\nabla^2 \mathcal{L} = 2\mathbf{X}^\top\mathbf{X} \succeq 0\) is positive semi-definite, confirming a global minimum. It is strictly PD when \(\mathbf{X}\) has full column rank.
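The closed form can be checked numerically. A sketch using synthetic data (coefficients and noise level are illustrative): solving the normal equations directly should agree with NumPy's least-squares routine, which uses a numerically stabler factorisation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
beta_true = np.array([1.5, -2.0, 0.5])   # hypothetical ground truth
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X^T X beta = X^T y.
# np.linalg.solve is preferred over forming the explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against lstsq, which solves the same problem via SVD.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```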

3. Geometric Interpretation: Projection

The fitted values \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) are the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\):

\[ \hat{\mathbf{y}} = \underbrace{\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top}_{\mathbf{H} \text{ (hat matrix)}}\,\mathbf{y} \]

The hat matrix \(\mathbf{H}\) is an orthogonal projector: \(\mathbf{H}^2 = \mathbf{H}\), \(\mathbf{H}^\top = \mathbf{H}\). The residuals \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}\) are orthogonal to the column space: \(\mathbf{X}^\top \mathbf{e} = \mathbf{0}\).

[Figure: the vector y and the plane col(X) spanned by x₁, x₂; the projection ŷ = Hy is the closest point in col(X) to y, and the residual e = y βˆ’ ŷ satisfies e βŠ₯ col(X), so Xα΅€Xβ̂ = Xα΅€y.]

OLS finds the closest point in the column space of X to y, with residuals orthogonal to every column of X.
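The projector identities are straightforward to verify numerically. A sketch with random data (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

# Hat matrix H = X (X^T X)^{-1} X^T, built without an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(H @ H, H))    # idempotent: H^2 = H
print(np.allclose(H, H.T))      # symmetric: H^T = H
e = y - H @ y                   # residuals e = (I - H) y
print(np.allclose(X.T @ e, 0))  # orthogonal to every column of X
```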

4. Ridge Regression (L2 Regularisation)

When \(\mathbf{X}^\top\mathbf{X}\) is near-singular or \(p \gg n\), OLS is ill-conditioned. Ridge adds a squared penalty:

\[ \mathcal{L}_{\mathrm{Ridge}} = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2 \]

Taking the gradient and setting to zero:

\[ -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} + 2\lambda\boldsymbol{\beta} = \mathbf{0} \]
\[ \hat{\boldsymbol{\beta}}_{\mathrm{Ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y} \]

Adding \(\lambda \mathbf{I}\) shifts all eigenvalues of \(\mathbf{X}^\top\mathbf{X}\) up by \(\lambda\), ensuring the matrix is invertible. Ridge is equivalent to MAP estimation with a Gaussian prior \(\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \sigma^2/\lambda\,\mathbf{I})\).
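Both claims, the closed form and the eigenvalue shift, can be checked directly. A sketch with a deliberately near-collinear design (the data and \(\lambda\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 1e-6 * rng.normal(size=n)  # nearly collinear columns
y = rng.normal(size=n)
lam = 1.0

# Ridge closed form: (X^T X + lam I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Eigenvalues of X^T X + lam I are those of X^T X shifted up by lam.
eigs = np.linalg.eigvalsh(X.T @ X)
eigs_ridge = np.linalg.eigvalsh(X.T @ X + lam * np.eye(p))
print(np.allclose(eigs + lam, eigs_ridge))  # True
```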

Lasso (L1 Regularisation)

Lasso uses an L1 penalty, which induces sparsity β€” many coefficients are driven to exactly zero:

\[ \mathcal{L}_{\mathrm{Lasso}} = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|_1 \]

The L1 penalty has a corner at zero (it is not differentiable there), so solutions with many exact zeros are geometrically favoured. Coordinate descent solves the Lasso via the soft-thresholding operator: for the loss as written above, \(\hat{\beta}_j = \mathrm{sign}(\rho_j)\max(|\rho_j| - \lambda/2,\, 0) / \|\mathbf{x}_j\|^2\), where \(\rho_j = \mathbf{x}_j^\top(\mathbf{y} - \mathbf{X}_{-j}\boldsymbol{\beta}_{-j})\) is the correlation of feature \(j\) with the partial residual. (Under the \(\tfrac{1}{2n}\)-scaled loss used by scikit-learn, the threshold becomes \(\lambda n\).)
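A minimal coordinate-descent sketch for the unscaled loss \(\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|_1\) (so the threshold is \(\lambda/2\)); the sparse ground truth and \(\lambda\) are illustrative choices:

```python
import numpy as np

def soft_threshold(rho, thresh):
    """sign(rho) * max(|rho| - thresh, 0)"""
    return np.sign(rho) * max(abs(rho) - thresh, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            rho = X[:, j] @ r
            beta[j] = soft_threshold(rho, lam / 2) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
beta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])  # sparse truth
y = X @ beta_true + rng.normal(scale=0.5, size=100)

beta = lasso_cd(X, y, lam=20.0)
print(np.round(beta, 2))  # most inactive coefficients end up exactly zero
```

Note that the zeros are exact, not merely small: whenever \(|\rho_j| \le \lambda/2\), soft-thresholding returns precisely 0.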

5. Bias-Variance Tradeoff β€” Full Derivation

For a new test point \(\mathbf{x}\) with true value \(y = f(\mathbf{x}) + \varepsilon\), the expected MSE of an estimator \(\hat{f}\) decomposes as:

\[ \mathbb{E}\!\left[(y - \hat{f}(\mathbf{x}))^2\right] = \mathbb{E}\!\left[(\hat{f} - f)^2\right] + \sigma^2 \]

since the cross term \(2\,\mathbb{E}[\varepsilon(f - \hat{f})]\) vanishes: the test noise \(\varepsilon\) has mean zero and is independent of \(\hat{f}\).

Expanding \(\mathbb{E}[(\hat{f}-f)^2]\) by adding and subtracting \(\mathbb{E}[\hat{f}]\):

\[ = \underbrace{\left(\mathbb{E}[\hat{f}] - f\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\!\left[\left(\hat{f} - \mathbb{E}[\hat{f}]\right)^2\right]}_{\text{Variance}} \]
\[ \underbrace{\mathbb{E}[(y-\hat{f})^2]}_{\text{Test MSE}} = \underbrace{(\mathbb{E}[\hat{f}]-f)^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}(\hat{f})}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}} \]

Simple models (low complexity) have high bias but low variance. Complex models (high degree, small regularisation) have low bias but high variance. Regularisation (Ridge, Lasso) trades some extra bias for a large reduction in variance, often lowering total MSE.

Python: OLS vs Ridge vs Lasso & Bias-Variance Tradeoff

We fit polynomial regression models of increasing degree, compare OLS, Ridge, and Lasso fits, and empirically plot the bias-variance decomposition by averaging over many training sets.

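The full interactive script is not reproduced here; the following is a minimal stand-in for its bias-variance portion. The true function \(f(x) = \sin 2\pi x\), the noise level, the test point, and the polynomial degrees are illustrative choices:

```python
import numpy as np

# Empirical bias-variance decomposition for polynomial regression:
# refit on many fresh training sets and decompose the predictions
# at a single test point.
rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
sigma = 0.3
x_test = 0.35                        # test point at which we decompose
n_train, n_reps = 30, 500

results = {}
for degree in (1, 3, 9):
    preds = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(scale=sigma, size=n_train)
        coefs = np.polyfit(x, y, degree)     # OLS polynomial fit
        preds[r] = np.polyval(coefs, x_test)
    bias2 = (preds.mean() - f(x_test)) ** 2
    var = preds.var()
    results[degree] = (bias2, var)
    print(f"degree {degree}: bias^2={bias2:.4f}  variance={var:.4f}")
```

The low-degree fit shows large squared bias and small variance; the high-degree fit shows the reverse, matching the decomposition above.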

Key Takeaways

  • OLS normal equations \(\hat\beta = (X^\top X)^{-1}X^\top y\) follow from setting \(\nabla_\beta \|y-X\beta\|^2 = 0\).
  • Geometrically, OLS projects \(y\) onto the column space of \(X\); residuals are orthogonal to all columns.
  • Ridge adds \(\lambda I\) to make \(X^\top X\) invertible and shrinks coefficients β€” equivalent to a Gaussian prior (MAP).
  • Lasso's L1 penalty promotes sparsity via soft-thresholding, performing implicit feature selection.
  • Test MSE = BiasΒ² + Variance + Noise; regularisation reduces variance at the cost of a small increase in bias.