Part III: Methods | Chapter 4

Model Selection

Balancing model complexity against predictive accuracy through information criteria, cross-validation, and regularization

Historical Context

The problem of model selection is as old as statistics itself, but the modern theory crystallized in the 1970s. Hirotugu Akaike’s 1973 paper introduced the Akaike Information Criterion (AIC), providing the first principled way to trade off goodness-of-fit against model complexity. Akaike’s key insight was that the log-likelihood evaluated at the MLE overestimates the expected log-likelihood by approximately the number of parameters—a bias correction that leads directly to AIC.

Gideon Schwarz proposed the Bayesian Information Criterion (BIC) in 1978, deriving a different penalty from the Laplace approximation to the marginal likelihood. While AIC targets prediction, BIC targets consistent model identification. Cross-validation, with roots in the work of Seymour Geisser (1975) and Mervyn Stone (1974), provided a distribution-free alternative. The LASSO, introduced by Robert Tibshirani in 1996 (building on earlier work in geophysics by Claerbout and Muir), revolutionized model selection by performing variable selection and estimation simultaneously through $\ell_1$ penalization.

4.1 The Bias-Variance Tradeoff

For a prediction problem with $Y = f(x) + \varepsilon$, where $E[\varepsilon] = 0$ and $\text{Var}(\varepsilon) = \sigma^2$, the expected prediction error of an estimator $\hat{f}(x)$ at a point $x_0$ decomposes as:

$$E\!\left[(Y - \hat{f}(x_0))^2\right] = \underbrace{\sigma^2}_{\text{irreducible}} + \underbrace{\bigl(\text{Bias}[\hat{f}(x_0)]\bigr)^2}_{\text{bias}^2} + \underbrace{\text{Var}[\hat{f}(x_0)]}_{\text{variance}}$$

The Fundamental Tradeoff

Simple models (few parameters) have high bias but low variance—they underfit. Complex models (many parameters) have low bias but high variance—they overfit. The optimal model complexity minimizes total prediction error by finding the sweet spot between these two sources of error.

This tradeoff is the conceptual foundation for all model selection criteria: each tries to estimate the prediction error, or a quantity proportional to it, from the training data alone.
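The decomposition can be checked directly by simulation. The sketch below is illustrative rather than part of the chapter's lab: the true function, noise level, sample size, and evaluation point are arbitrary choices. It repeatedly regenerates data, fits polynomials of several degrees, and estimates the squared bias and variance of the fit at a fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # true regression function (illustrative choice)
sigma, n, reps, x0 = 0.3, 30, 2000, 0.25

x = np.linspace(0, 1, n)
results = {}
for degree in (1, 3, 9):
    preds = np.empty(reps)
    for r in range(reps):
        y = f(x) + sigma * rng.normal(size=n)   # fresh training set each replicate
        coef = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coef, x0)         # prediction at the fixed point x0
    results[degree] = ((preds.mean() - f(x0)) ** 2, preds.var())

for degree, (bias2, var) in results.items():
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

The underfit linear model should show large squared bias, while the degree-9 model should show the largest variance, with the cubic near the sweet spot.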

For linear regression with $p$ predictors fit by OLS, the in-sample training error $\overline{\text{err}} = \frac{1}{n}\sum(y_i - \hat{y}_i)^2$ is an optimistically biased estimate of prediction error. Specifically, the optimism is $\frac{2p\sigma^2}{n}$, leading to Mallows' $C_p$ statistic $C_p = \overline{\text{err}} + \frac{2p\hat{\sigma}^2}{n}$ as a bias-corrected estimate.
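A minimal sketch of the $C_p$ correction, assuming for simplicity that $\sigma^2$ is known (in practice it must be estimated, e.g. from the largest candidate model); the quadratic data-generating model and candidate sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 100, 1.0
x = np.linspace(-2, 2, n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + sigma * rng.normal(size=n)  # true model: quadratic

cp_by_p = {}
for p in (1, 2, 3, 6):                      # p = number of polynomial coefficients
    X = np.vander(x, p, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    err = np.mean((y - X @ beta) ** 2)      # training error (optimistic)
    cp_by_p[p] = err + 2 * p * sigma**2 / n # Mallows' Cp correction
    print(f"p = {p}: training error = {err:.3f}, Cp = {cp_by_p[p]:.3f}")
```

Training error falls monotonically as $p$ grows, while $C_p$ should bottom out near the true model size ($p = 3$).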

4.2 Information Criteria: AIC and BIC

Definition: Akaike Information Criterion (AIC)

The AIC estimates the expected Kullback-Leibler divergence from the true model:

$$\text{AIC} = -2\ell(\hat{\theta}) + 2p$$

where $\ell(\hat{\theta})$ is the maximized log-likelihood and $p$ is the number of estimated parameters. Select the model with the smallest AIC. For small samples, the corrected version $\text{AIC}_c = \text{AIC} + \frac{2p(p+1)}{n-p-1}$ provides a better approximation.

Definition: Bayesian Information Criterion (BIC)

The BIC approximates the log marginal likelihood under a unit-information prior:

$$\text{BIC} = -2\ell(\hat{\theta}) + p\log n$$

BIC penalizes complexity more heavily than AIC when $n \geq 8$ (since $\log n > 2$). It is consistent: if the true model is among the candidates, BIC selects it with probability approaching 1 as $n \to \infty$. AIC is not consistent but tends to select models with better predictive performance.

The choice between AIC and BIC depends on the goal. If the aim is prediction, AIC (or its corrected variant) is preferred. If the aim is identifying the true data-generating model (assuming it exists among the candidates), BIC is preferred. In practice, both criteria should be examined, and substantial disagreement between them signals that the sample size is too small to distinguish between competing models.
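For a Gaussian linear model fit by OLS, the maximized log-likelihood has the closed form $\ell(\hat\theta) = -\tfrac{n}{2}\bigl[\log(2\pi\hat\sigma^2) + 1\bigr]$ with $\hat\sigma^2 = \text{RSS}/n$, so both criteria are easy to compute by hand. A sketch, with simulated data and candidate polynomial degrees chosen purely for illustration:

```python
import numpy as np

def aic_bic(y, X):
    """AIC and BIC for a Gaussian linear model fit by OLS."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.sum((y - X @ beta) ** 2) / n       # MLE of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    p = k + 1                                      # coefficients plus sigma^2
    return -2 * loglik + 2 * p, -2 * loglik + p * np.log(n)

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)             # true model: degree 1

scores = {deg: aic_bic(y, np.vander(x, deg + 1, increasing=True))
          for deg in (0, 1, 2, 5)}
for deg, (aic, bic) in scores.items():
    print(f"degree {deg}: AIC = {aic:.1f}, BIC = {bic:.1f}")
```

Both criteria strongly reject the underfit intercept-only model, and because $\log 200 > 2$, the gap between BIC and AIC widens as the candidate model grows.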

4.3 Cross-Validation

Cross-validation directly estimates prediction error by repeatedly splitting the data into training and validation sets.

K-Fold Cross-Validation

Partition the data into $K$ roughly equal folds. For each fold $k$: fit the model on the remaining $K-1$ folds, predict the held-out fold, and compute the validation error. The CV estimate is:

$$\text{CV}(K) = \frac{1}{n}\sum_{k=1}^{K}\sum_{i \in \text{fold } k} L(y_i, \hat{f}^{-k}(x_i))$$

Common choices are $K = 5$ or $K = 10$. These provide a good tradeoff between bias (which decreases with $K$) and variance (which increases with $K$).
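The procedure is short enough to implement from scratch. A minimal sketch for polynomial regression with squared-error loss; the data-generating model, $K = 10$, and candidate degrees are illustrative choices:

```python
import numpy as np

def kfold_cv(x, y, degree, K=10, seed=0):
    """K-fold CV estimate of squared-error risk for polynomial regression."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)   # shuffle before splitting
    folds = np.array_split(idx, K)
    total = 0.0
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[train], y[train], degree)  # fit on K-1 folds
        total += np.sum((y[test] - np.polyval(coef, x[test])) ** 2)
    return total / n

rng = np.random.default_rng(3)
n = 150
x = rng.uniform(-2, 2, n)
y = x - 0.5 * x**2 + 0.5 * rng.normal(size=n)          # true model: quadratic

cv = {deg: kfold_cv(x, y, deg) for deg in (1, 2, 8)}
for deg, score in cv.items():
    print(f"degree {deg}: CV(10) = {score:.3f}")
```

The underfit linear model pays a clear bias penalty relative to the quadratic.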

Leave-One-Out Cross-Validation (LOOCV)

LOOCV uses $K = n$, fitting the model $n$ times, each time leaving out one observation. For linear regression, LOOCV has a closed-form shortcut:

$$\text{CV}(n) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_i}{1 - h_{ii}}\right)^2$$

where $e_i$ are residuals and $h_{ii}$ are leverages from the full-data fit. This requires fitting the model only once. LOOCV is approximately unbiased for prediction error but can have high variance.
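The shortcut is an exact identity for OLS, which can be confirmed numerically by comparing it against the brute-force approach of refitting $n$ times (the simulated design below is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Full-data fit: hat matrix, residuals, and leverages
H = X @ np.linalg.solve(X.T @ X, X.T)
e = y - H @ y
h = np.diag(H)

# Closed-form LOOCV from a single fit
loocv_shortcut = np.mean((e / (1 - h)) ** 2)

# Brute force: refit n times, each time leaving one observation out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ beta) ** 2)
loocv_brute = np.mean(errs)

print(loocv_shortcut, loocv_brute)   # identical up to floating-point error
```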

4.4 Bayesian Model Selection

The Bayesian approach compares models $\mathcal{M}_1, \ldots, \mathcal{M}_K$ through their marginal likelihoods (also called model evidence):

$$p(\mathbf{y} \mid \mathcal{M}_k) = \int p(\mathbf{y} \mid \theta, \mathcal{M}_k)\,p(\theta \mid \mathcal{M}_k)\,d\theta$$

Bayes Factors

The Bayes factor comparing $\mathcal{M}_1$ to $\mathcal{M}_2$ is:

$$B_{12} = \frac{p(\mathbf{y} \mid \mathcal{M}_1)}{p(\mathbf{y} \mid \mathcal{M}_2)}$$

$B_{12} > 1$ favors $\mathcal{M}_1$. Kass and Raftery (1995) suggest interpreting $2\log B_{12}$ on the scale: 0–2 (not worth more than a bare mention), 2–6 (positive), 6–10 (strong), $>10$ (very strong). The marginal likelihood naturally penalizes complexity because diffuse priors spread probability mass over regions of parameter space that are inconsistent with the data.

A practical challenge is computing the marginal likelihood, which involves a high-dimensional integral. Methods include the Laplace approximation (which gives BIC as a first-order approximation), harmonic mean estimation (simple but unstable), bridge sampling, and thermodynamic integration. For nested models, the Savage-Dickey density ratio provides a simple alternative: $B_{12} = p(\psi_0 \mid \mathbf{y}, \mathcal{M}_2) / p(\psi_0 \mid \mathcal{M}_2)$, where $\psi_0$ is the restricted parameter value under $\mathcal{M}_1$.

4.5 Regularization: LASSO and Elastic Net

Regularization methods perform model selection implicitly by shrinking or zeroing out coefficient estimates. They solve penalized least squares problems:

Penalized Regression Methods

Ridge regression ($\ell_2$ penalty):

$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \left\{\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_2^2\right\}$$

Shrinks coefficients toward zero but never sets them exactly to zero.

LASSO ($\ell_1$ penalty, Tibshirani 1996):

$$\hat{\boldsymbol{\beta}}_{\text{lasso}} = \arg\min_{\boldsymbol{\beta}} \left\{\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1\right\}$$

The geometry of the $\ell_1$ ball (a diamond in 2D) causes some coefficients to be set exactly to zero, performing automatic variable selection.

Elastic net (Zou and Hastie 2005):

$$\hat{\boldsymbol{\beta}}_{\text{enet}} = \arg\min_{\boldsymbol{\beta}} \left\{\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\bigl(\alpha\|\boldsymbol{\beta}\|_1 + (1-\alpha)\|\boldsymbol{\beta}\|_2^2\bigr)\right\}$$

Combines the sparsity of LASSO with the grouping effect of ridge, handling correlated predictors more gracefully. The mixing parameter $\alpha \in [0,1]$ interpolates between ridge ($\alpha = 0$) and LASSO ($\alpha = 1$).

The regularization parameter $\lambda$ is typically chosen by cross-validation. The LASSO solution path—the set of coefficient estimates as $\lambda$ varies from $\infty$ (all coefficients zero) to 0 (full OLS)—is piecewise linear and can be computed efficiently by the LARS (Least Angle Regression) algorithm. From a Bayesian perspective, ridge regression corresponds to a Gaussian prior on $\boldsymbol{\beta}$ and LASSO to a Laplace (double-exponential) prior.
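The LASSO can also be solved by cyclic coordinate descent with soft-thresholding, the approach popularized by glmnet. A minimal sketch, assuming the $\frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$ scaling common in software (so $\lambda$ is on a different scale from the display above); the simulated design and $\lambda$ values are illustrative:

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator: the proximal map of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            rho = X[:, j] @ r / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(6)
n, p = 100, 8
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])  # only two active predictors
y = X @ beta_true + rng.normal(size=n)

beta_small = lasso_cd(X, y, lam=0.05)   # light penalty: many nonzero estimates
beta_large = lasso_cd(X, y, lam=0.5)    # heavier penalty: sparse, shrunken fit
print(np.round(beta_small, 2))
print(np.round(beta_large, 2))
```

Increasing $\lambda$ both shrinks the active coefficients toward zero and zeroes out the noise variables, exactly the behavior the $\ell_1$ geometry predicts.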

Computational Lab: Model Selection in Practice

We visualize the bias-variance decomposition, compare AIC and BIC for polynomial regression, implement k-fold cross-validation, and trace the LASSO coefficient path.
