Least Squares
Optimal approximation in overdetermined systems and data fitting
Historical Context
The method of least squares was developed independently by Adrien-Marie Legendre (published 1805) and Carl Friedrich Gauss (in use by 1795, published 1809) for fitting astronomical observations. Gauss used it to predict the orbit of Ceres from a handful of observations, a spectacular demonstration that established least squares as the standard method for data fitting. The Gauss-Markov theorem (Gauss 1821, Markov 1900) provided the theoretical justification: among all linear unbiased estimators, the least squares solution has minimum variance.
3.1 The Normal Equations
Given an overdetermined system $Ax = b$ with $A \in \mathbb{R}^{m \times n}$ ($m > n$), we seek $x^*$ minimizing $\|Ax - b\|_2^2$. Setting the gradient to zero, $\nabla_x \|Ax - b\|_2^2 = 2A^T(Ax - b) = 0$, yields the normal equations $A^TAx = A^Tb$. When $A$ has full column rank, $A^TA$ is symmetric positive definite and the minimizer $x^* = (A^TA)^{-1}A^Tb$ is unique.
Geometric Interpretation
The residual $r = b - Ax^*$ is orthogonal to the column space of $A$: $Ax^*$ is the orthogonal projection of $b$ onto $\text{col}(A)$. The projection matrix is $P = A(A^TA)^{-1}A^T$.
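The orthogonality of the residual is easy to check numerically. A minimal NumPy sketch (illustrative random data, not taken from the lab code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))   # overdetermined: m = 10 > n = 3
b = rng.standard_normal(10)

# Normal equations: A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# The residual r = b - Ax* is orthogonal to col(A), so A^T r = 0
r = b - A @ x
print(np.allclose(A.T @ r, 0))
```

The same solution comes from `np.linalg.lstsq(A, b)`, which uses an orthogonal factorization internally rather than forming $A^TA$.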
3.2 Least Squares via QR
The normal equations can be ill-conditioned: $\kappa(A^TA) = \kappa(A)^2$. The QR approach avoids forming $A^TA$. With a thin QR factorization $A = QR$ ($Q \in \mathbb{R}^{m \times n}$ with orthonormal columns, $R \in \mathbb{R}^{n \times n}$ upper triangular), $\|Ax - b\|^2 = \|Rx - Q^Tb\|^2 + \|(I - QQ^T)b\|^2$, so the minimizer solves $Rx = Q^Tb$.
This is a simple triangular solve with condition number $\kappa(R) = \kappa(A)$ instead of $\kappa(A)^2$. QR via Householder reflections is the standard algorithm in LAPACK.
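A short sketch of the QR route, checked against NumPy's LAPACK-backed solver (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

# Thin QR: Q is 10x3 with orthonormal columns, R is 3x3 upper triangular
Q, R = np.linalg.qr(A)
# Solve the triangular system R x = Q^T b
x_qr = np.linalg.solve(R, Q.T @ b)

# Agrees with the reference least-squares driver
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_qr, x_ref))
```

(`np.linalg.solve` does not exploit triangularity; `scipy.linalg.solve_triangular` would, but the result is identical.)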
3.3 Least Squares via SVD
The SVD handles the general case, including rank-deficient $A$. With $A = U\Sigma V^T$, the solution is $x^* = V\Sigma^+ U^T b$, where $\Sigma^+$ inverts nonzero singular values and zeros the rest. This gives the minimum-norm least squares solution: among all minimizers of $\|Ax - b\|$, it selects the one with smallest $\|x\|$.
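A sketch of the pseudoinverse solution on a deliberately rank-deficient matrix (constructed data, chosen so the rank deficiency is exact):

```python
import numpy as np

rng = np.random.default_rng(2)
# Rank-deficient: the third column is the sum of the first two
A = rng.standard_normal((8, 2))
A = np.column_stack([A, A.sum(axis=1)])
b = rng.standard_normal(8)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = max(A.shape) * np.finfo(float).eps * s[0]
s_inv = np.where(s > tol, 1.0 / s, 0.0)   # Sigma^+: invert only nonzero sigmas
x = Vt.T @ (s_inv * (U.T @ b))            # minimum-norm least squares solution

# Matches NumPy's pseudoinverse
print(np.allclose(x, np.linalg.pinv(A) @ b))
```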
3.4 Weighted and Generalized Least Squares
When observations have unequal variances $\text{Var}(\epsilon) = W^{-1}$, weighted least squares minimizes $\|W^{1/2}(Ax - b)\|^2$, yielding the normal equations $A^TWAx = A^TWb$.
Generalized least squares handles correlated errors with covariance $\Omega$: transform to $\Omega^{-1/2}Ax = \Omega^{-1/2}b$ and apply ordinary least squares.
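For diagonal $W$ the two views coincide: solve the weighted normal equations directly, or rescale each row by $\sqrt{w_i}$ and run ordinary least squares. A minimal sketch with illustrative random weights:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((12, 2))
b = rng.standard_normal(12)
w = rng.uniform(0.5, 2.0, size=12)   # weights = inverse variances (diagonal W assumed)

# Weighted normal equations: A^T W A x = A^T W b
W = np.diag(w)
x_w = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

# Equivalent: scale rows by sqrt(w) and apply ordinary least squares
sw = np.sqrt(w)
x_s = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]
print(np.allclose(x_w, x_s))
```

For full covariance $\Omega$, the same rescaling idea applies with a Cholesky factor of $\Omega^{-1}$ in place of $\sqrt{w}$.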
3.5 Regularization
Tikhonov Regularization (Ridge Regression)
Minimizes $\|Ax - b\|^2 + \alpha\|x\|^2$, with solution $(A^TA + \alpha I)x_\alpha = A^Tb$. The parameter $\alpha > 0$ trades fit quality for solution norm, addressing ill-conditioning.
Via the SVD, the regularized solution is $x_\alpha = \sum_i \frac{\sigma_i}{\sigma_i^2 + \alpha}(u_i^Tb)v_i$, showing that regularization damps contributions from small singular values. The L-curve method plots $\|x_\alpha\|$ vs. $\|Ax_\alpha - b\|$ (typically on log-log axes) and selects $\alpha$ at the corner of the curve.
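The direct solve and the SVD filter form give the same $x_\alpha$; a small sketch verifying this on random data (the value $\alpha = 0.1$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 4))
b = rng.standard_normal(10)
alpha = 0.1

# Direct: (A^T A + alpha I) x = A^T b
n = A.shape[1]
x_direct = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ b)

# SVD filter form: x = sum_i sigma_i / (sigma_i^2 + alpha) * (u_i^T b) v_i
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((s / (s**2 + alpha)) * (U.T @ b))

print(np.allclose(x_direct, x_svd))
```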
LASSO ($L^1$ penalty: $\|Ax - b\|^2 + \alpha\|x\|_1$) promotes sparsity in the solution, performing simultaneous estimation and variable selection—a cornerstone of modern high-dimensional statistics.
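The $L^1$ objective is non-smooth, so it cannot be solved by normal equations; one standard approach is proximal gradient descent (ISTA), sketched below on synthetic sparse data. The step size, iteration count, and $\alpha$ are illustrative choices, not prescribed by the text:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, b, alpha, n_iter=5000):
    # Minimize ||Ax - b||^2 + alpha * ||x||_1 by proximal gradient (ISTA)
    L = 2 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2 * A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, alpha / L)
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[[1, 6]] = [3.0, -2.0]            # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(50)

x = lasso_ista(A, b, alpha=5.0)
print(x)   # the inactive coefficients are shrunk to (near) zero
```

The soft-thresholding step is what produces exact zeros, which is why the $L^1$ penalty performs variable selection while the $L^2$ penalty of ridge regression merely shrinks.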
Computational Laboratory
This simulation compares normal equations, QR, and SVD least squares methods, analyzes condition number growth, demonstrates Tikhonov regularization, and visualizes the L-curve for parameter selection.
Least Squares: Methods and Regularization