Part III: Methods | Chapter 2

Regression Analysis

Modeling conditional relationships between variables through least-squares estimation and generalized linear models

Historical Context

The method of least squares was invented independently by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss around 1795 (published in 1809). Gauss used it to predict the orbit of the asteroid Ceres from sparse observations—a triumph that cemented the method’s reputation. The term “regression” itself was coined by Francis Galton, who in 1886 observed that children’s heights tended to “regress” toward the population mean relative to their parents’ heights.

The modern matrix formulation of multiple regression was developed by R.A. Fisher and others in the early twentieth century, while the Gauss-Markov theorem (proved by Gauss and formalized by Andrey Markov) established the optimality of ordinary least squares. The generalization to non-normal responses through generalized linear models (GLMs) came in 1972 with the seminal paper by John Nelder and Robert Wedderburn, unifying logistic regression, Poisson regression, and classical linear models within a single framework based on exponential family distributions and link functions.

2.1 Simple Linear Regression and OLS

The simple linear regression model posits that the conditional mean of $Y$ given $x$ is a linear function:

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \stackrel{\text{iid}}{\sim} (0, \sigma^2)$$

Definition: Ordinary Least Squares

The OLS estimators minimize the residual sum of squares $\text{RSS} = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2$:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$$

The coefficient of determination $R^2 = 1 - \text{RSS}/\text{TSS}$ measures the fraction of variance in $Y$ explained by $x$, where $\text{TSS} = \sum(Y_i - \bar{Y})^2$.

Under the model assumptions, the OLS estimators are unbiased with $\text{Var}(\hat{\beta}_1) = \sigma^2 / S_{xx}$. If we further assume $\varepsilon_i \sim N(0, \sigma^2)$, the OLS estimators coincide with the maximum likelihood estimators, and the $t$-statistic $t = \hat{\beta}_1 / \text{SE}(\hat{\beta}_1)$ follows a $t_{n-2}$ distribution under $H_0: \beta_1 = 0$.
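The closed-form estimators above translate directly into code. A minimal NumPy sketch on synthetic data (the true coefficients $\beta_0 = 2$, $\beta_1 = 0.5$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)  # true beta0 = 2, beta1 = 0.5

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))
b1 = Sxy / Sxx                  # slope estimate S_xy / S_xx
b0 = ybar - b1 * xbar           # intercept estimate

resid = y - (b0 + b1 * x)
rss = np.sum(resid ** 2)
tss = np.sum((y - ybar) ** 2)
r2 = 1 - rss / tss              # coefficient of determination

s2 = rss / (n - 2)              # unbiased estimate of sigma^2
se_b1 = np.sqrt(s2 / Sxx)
t_stat = b1 / se_b1             # ~ t_{n-2} under H0: beta1 = 0
```

The same fit is obtained from `np.polyfit(x, y, 1)`, which solves the identical least-squares problem.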

2.2 Multiple Regression in Matrix Form

With $p$ predictors, the model becomes $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{X}$ is the $n \times (p+1)$ design matrix (including an intercept column of ones) and $\boldsymbol{\varepsilon} \sim (0, \sigma^2 \mathbf{I}_n)$.

The OLS Solution and Hat Matrix

The OLS estimator is

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{Y}$$

The fitted values are $\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$, where the hat matrix is

$$\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$$

$\mathbf{H}$ is an orthogonal projection onto the column space of $\mathbf{X}$. Its diagonal elements $h_{ii}$ are called leverages, satisfying $0 \leq h_{ii} \leq 1$ and $\sum h_{ii} = p + 1$. Points with $h_{ii} > 2(p+1)/n$ are considered high-leverage.

The residual vector is $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$, and the unbiased variance estimate is $s^2 = \mathbf{e}^\top\mathbf{e}/(n - p - 1)$. The covariance of $\hat{\boldsymbol{\beta}}$ is $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$, estimated by $s^2(\mathbf{X}^\top\mathbf{X})^{-1}$.
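These matrix formulas can be checked numerically. A short sketch with a synthetic design (in practice one solves the least-squares problem directly rather than inverting $\mathbf{X}^\top\mathbf{X}$, which is what `np.linalg.lstsq` does):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p+1) design
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.normal(0, 0.3, n)

# OLS fit; lstsq avoids explicitly forming (X'X)^{-1}
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix
h = np.diag(H)                           # leverages h_ii
e = Y - H @ Y                            # residuals (I - H) Y
s2 = e @ e / (n - p - 1)                 # unbiased variance estimate
cov_beta_hat = s2 * np.linalg.inv(X.T @ X)
```

The hat matrix is a symmetric, idempotent projection, so $\mathbf{H}^2 = \mathbf{H}$ and $\sum_i h_{ii} = \operatorname{tr}(\mathbf{H}) = p + 1$, both of which are easy to verify on this example.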

2.3 The Gauss-Markov Theorem

Theorem (Gauss-Markov)

Under the assumptions $E[\boldsymbol{\varepsilon}] = \mathbf{0}$ and $\text{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}$ (no normality required), the OLS estimator $\hat{\boldsymbol{\beta}}$ is the Best Linear Unbiased Estimator (BLUE).

That is, for any linear estimator $\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{Y}$ that is unbiased for $\boldsymbol{\beta}$, we have $\text{Var}(\mathbf{a}^\top\tilde{\boldsymbol{\beta}}) \geq \text{Var}(\mathbf{a}^\top\hat{\boldsymbol{\beta}})$ for all $\mathbf{a} \in \mathbb{R}^{p+1}$.

The Gauss-Markov theorem is powerful but has important limitations. It applies only to linear, unbiased estimators. A biased estimator such as ridge regression can achieve lower MSE by trading bias for reduced variance—this is the bias-variance tradeoff that motivates regularization methods.

When the error covariance is $\text{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \boldsymbol{\Omega}$ for known $\boldsymbol{\Omega} \neq \mathbf{I}$ (heteroscedasticity or correlation), the BLUE becomes the generalized least squares (GLS) estimator:

$$\hat{\boldsymbol{\beta}}_{\text{GLS}} = (\mathbf{X}^\top \boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top \boldsymbol{\Omega}^{-1}\mathbf{Y}$$
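For diagonal $\boldsymbol{\Omega}$, GLS reduces to weighted least squares: transforming by $\boldsymbol{\Omega}^{-1/2}$ restores homoscedastic errors, after which OLS on the transformed data is BLUE. A sketch with known heteroscedastic weights (the weight range is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w = rng.uniform(0.5, 4.0, n)              # known variance weights: Var(eps_i) = sigma^2 * w_i
Y = X @ np.array([0.0, 1.0]) + rng.normal(0, 1, n) * np.sqrt(w)

# Direct GLS formula with Omega = diag(w)
Oinv = np.diag(1.0 / w)
beta_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ Y)

# Equivalent: OLS on the transformed data Omega^{-1/2} X, Omega^{-1/2} Y
Xw = X / np.sqrt(w)[:, None]
Yw = Y / np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(Xw, Yw, rcond=None)
```

The two computations give the same estimator, which illustrates why GLS is often described as "OLS after whitening."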

2.4 Regression Diagnostics

Fitting a model is only half the work; we must also verify that the model assumptions hold. Key diagnostic tools include residual analysis, leverage, and influence measures.

Key Diagnostic Quantities

Standardized residuals: $r_i = e_i / (s\sqrt{1 - h_{ii}})$, where $\text{Var}(e_i) = \sigma^2(1 - h_{ii})$. These should be approximately $N(0,1)$.

Studentized residuals: $t_i = e_i / (s_{(i)}\sqrt{1 - h_{ii}})$, where $s_{(i)}$ is the residual standard error computed without observation $i$. Under the model, $t_i \sim t_{n-p-2}$.

Leverage: $h_{ii}$ measures how far observation $i$ is from the center of the predictor space. High leverage points have outsized influence on the fitted line.

Cook’s distance: measures the overall influence of observation $i$ on all fitted values:

$$D_i = \frac{(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})^\top(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})}{(p+1)s^2} = \frac{r_i^2}{p+1}\cdot\frac{h_{ii}}{1-h_{ii}}$$

Values of $D_i > 1$ (or $D_i > 4/n$ as a more sensitive threshold) indicate influential observations.
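The second form of $D_i$ above needs only the residuals and leverages, so influence screening is cheap once the model is fitted. A sketch that plants one high-leverage outlier in otherwise clean data and recovers it (the planted coordinates are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)
x[0], y[0] = 10.0, 0.0           # plant a high-leverage outlier

X = np.column_stack([np.ones(n), x])
k = X.shape[1]                   # number of parameters, p + 1
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - k)

r = e / np.sqrt(s2 * (1 - h))    # standardized residuals
D = r**2 / k * h / (1 - h)       # Cook's distance

influential = np.where(D > 4 / n)[0]
```

Observation 0 combines a large leverage (far from the predictor mean) with a large residual, so it dominates $D_i$; a point with high leverage but a small residual would not.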

2.5 Generalized Linear Models

GLMs extend regression to non-normal responses by specifying three components: a random component (an exponential family distribution for $Y$), a systematic component (the linear predictor $\eta = \mathbf{X}\boldsymbol{\beta}$), and a link function $g$ such that $g(\mu) = \eta$, where $\mu = E[Y]$.

Common GLMs

Logistic regression: $Y \sim \text{Bernoulli}(p)$, link $g(p) = \log\frac{p}{1-p}$ (logit). Used for binary classification.

Poisson regression: $Y \sim \text{Poisson}(\mu)$, link $g(\mu) = \log\mu$. Used for count data.

Gamma regression: $Y \sim \text{Gamma}(\alpha, \beta)$, link $g(\mu) = 1/\mu$ (inverse). Used for positive continuous responses with constant CV.

In practice, GLMs are fitted by iteratively reweighted least squares (IRLS), which is equivalent to Fisher scoring applied to the log-likelihood:

$$\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{z}^{(t)}$$

where $\mathbf{W}$ is a diagonal weight matrix and $\mathbf{z}$ is the working response, both depending on the current parameter estimates.

For logistic regression, the coefficient $\beta_j$ has a natural interpretation: a unit increase in $x_j$ multiplies the odds $p/(1-p)$ by $e^{\beta_j}$. The deviance $D = -2(\ell(\hat{\boldsymbol{\beta}}) - \ell_{\text{saturated}})$ plays the role that RSS plays in linear regression, and deviance residuals are the GLM analogue of ordinary residuals.
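The IRLS update is short enough to implement directly. A sketch for logistic regression (for the canonical logit link, Fisher scoring coincides with Newton's method; the weights are $w_i = \mu_i(1-\mu_i)$ and the working response is $z_i = \eta_i + (y_i - \mu_i)/w_i$):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.5]))))
y = rng.binomial(1, p_true)

beta = np.zeros(2)
for _ in range(25):                        # IRLS / Fisher scoring
    eta = X @ beta                         # linear predictor
    mu = 1 / (1 + np.exp(-eta))            # inverse logit link
    W = mu * (1 - mu)                      # diagonal weights
    z = eta + (y - mu) / W                 # working response
    beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

odds_ratio = np.exp(beta[1])   # multiplicative change in odds per unit increase in x
```

At convergence the score equations $\mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}$ are satisfied, which provides a simple correctness check on the fit.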

Computational Lab: Regression Diagnostics

We fit an OLS model, examine residual plots and leverage, detect influential points via Cook’s distance, and fit a logistic regression with a decision boundary.
