Least Squares
Optimal approximation in overdetermined systems and data fitting
Historical Context
The method of least squares was developed independently by Adrien-Marie Legendre (published 1805) and Carl Friedrich Gauss (in use by 1795, published 1809) for fitting astronomical observations. Gauss used it to predict the orbit of Ceres from a handful of observations, a spectacular demonstration that established least squares as the standard method for data fitting. The Gauss-Markov theorem (Gauss 1821, Markov 1900) provided the theoretical justification: among all linear unbiased estimators, the least squares solution has minimum variance.
3.1 The Normal Equations
Given an overdetermined system $Ax = b$ with $A \in \mathbb{R}^{m \times n}$ ($m > n$), we seek $x^*$ minimizing $\|Ax - b\|_2^2$. Setting the gradient to zero, $\nabla_x \|Ax - b\|_2^2 = 2A^T(Ax - b) = 0$, yields the normal equations $A^TAx = A^Tb$. When $A$ has full column rank, $A^TA$ is symmetric positive definite and the minimizer $x^* = (A^TA)^{-1}A^Tb$ is unique.
Geometric Interpretation
The residual $r = b - Ax^*$ is orthogonal to the column space of $A$: $Ax^*$ is the orthogonal projection of $b$ onto $\text{col}(A)$. The projection matrix is $P = A(A^TA)^{-1}A^T$.
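The orthogonality of the residual is easy to check numerically. A minimal NumPy sketch (illustrative random data, not taken from the lab code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))   # overdetermined: m = 10 > n = 3
b = rng.standard_normal(10)

# Normal equations: A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# The residual r = b - Ax* is orthogonal to col(A), so A^T r = 0
r = b - A @ x
print(np.allclose(A.T @ r, 0))
```

The same solution comes from `np.linalg.lstsq(A, b)`, which uses an orthogonal factorization internally rather than forming $A^TA$.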
3.2 Least Squares via QR
The normal equations can be ill-conditioned: $\kappa(A^TA) = \kappa(A)^2$. The QR approach avoids forming $A^TA$. With a thin QR factorization $A = QR$ ($Q \in \mathbb{R}^{m \times n}$ with orthonormal columns, $R \in \mathbb{R}^{n \times n}$ upper triangular), $\|Ax - b\|^2 = \|Rx - Q^Tb\|^2 + \|(I - QQ^T)b\|^2$, so the minimizer solves $Rx = Q^Tb$.
This is a simple triangular solve with condition number $\kappa(R) = \kappa(A)$ instead of $\kappa(A)^2$. QR via Householder reflections is the standard algorithm in LAPACK.
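A short sketch of the QR route, checked against NumPy's LAPACK-backed solver (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

# Thin QR: Q is 10x3 with orthonormal columns, R is 3x3 upper triangular
Q, R = np.linalg.qr(A)
# Solve the triangular system R x = Q^T b
x_qr = np.linalg.solve(R, Q.T @ b)

# Agrees with the reference least-squares driver
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_qr, x_ref))
```

(`np.linalg.solve` does not exploit triangularity; `scipy.linalg.solve_triangular` would, but the result is identical.)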
3.3 Least Squares via SVD
The SVD handles the general case, including rank-deficient $A$. With $A = U\Sigma V^T$, the solution is $x^* = V\Sigma^+ U^T b$, where $\Sigma^+$ inverts nonzero singular values and zeros the rest. This gives the minimum-norm least squares solution: among all minimizers of $\|Ax - b\|$, it selects the one with smallest $\|x\|$.
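A sketch of the pseudoinverse solution on a deliberately rank-deficient matrix (constructed data, chosen so the rank deficiency is exact):

```python
import numpy as np

rng = np.random.default_rng(2)
# Rank-deficient: the third column is the sum of the first two
A = rng.standard_normal((8, 2))
A = np.column_stack([A, A.sum(axis=1)])
b = rng.standard_normal(8)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = max(A.shape) * np.finfo(float).eps * s[0]
s_inv = np.where(s > tol, 1.0 / s, 0.0)   # Sigma^+: invert only nonzero sigmas
x = Vt.T @ (s_inv * (U.T @ b))            # minimum-norm least squares solution

# Matches NumPy's pseudoinverse
print(np.allclose(x, np.linalg.pinv(A) @ b))
```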
3.4 Weighted and Generalized Least Squares
When observations have unequal variances $\text{Var}(\epsilon) = W^{-1}$, weighted least squares minimizes $\|W^{1/2}(Ax - b)\|^2$, yielding the normal equations $A^TWAx = A^TWb$.
Generalized least squares handles correlated errors with covariance $\Omega$: transform to $\Omega^{-1/2}Ax = \Omega^{-1/2}b$ and apply ordinary least squares.
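For diagonal $W$ the two views coincide: solve the weighted normal equations directly, or rescale each row by $\sqrt{w_i}$ and run ordinary least squares. A minimal sketch with illustrative random weights:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((12, 2))
b = rng.standard_normal(12)
w = rng.uniform(0.5, 2.0, size=12)   # weights = inverse variances (diagonal W assumed)

# Weighted normal equations: A^T W A x = A^T W b
W = np.diag(w)
x_w = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

# Equivalent: scale rows by sqrt(w) and apply ordinary least squares
sw = np.sqrt(w)
x_s = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]
print(np.allclose(x_w, x_s))
```

For full covariance $\Omega$, the same rescaling idea applies with a Cholesky factor of $\Omega^{-1}$ in place of $\sqrt{w}$.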
3.5 Regularization
Tikhonov Regularization (Ridge Regression)
Minimizes $\|Ax - b\|^2 + \alpha\|x\|^2$, with solution $(A^TA + \alpha I)x_\alpha = A^Tb$. The parameter $\alpha > 0$ trades fit quality for solution norm, addressing ill-conditioning.
Via the SVD, the regularized solution is $x_\alpha = \sum_i \frac{\sigma_i}{\sigma_i^2 + \alpha}(u_i^Tb)v_i$, showing that regularization damps contributions from small singular values. The L-curve method plots $\|x_\alpha\|$ vs. $\|Ax_\alpha - b\|$ (typically on log-log axes) and selects $\alpha$ at the corner of the curve.
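The direct solve and the SVD filter form give the same $x_\alpha$; a small sketch verifying this on random data (the value $\alpha = 0.1$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 4))
b = rng.standard_normal(10)
alpha = 0.1

# Direct: (A^T A + alpha I) x = A^T b
n = A.shape[1]
x_direct = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ b)

# SVD filter form: x = sum_i sigma_i / (sigma_i^2 + alpha) * (u_i^T b) v_i
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((s / (s**2 + alpha)) * (U.T @ b))

print(np.allclose(x_direct, x_svd))
```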
LASSO ($L^1$ penalty: $\|Ax - b\|^2 + \alpha\|x\|_1$) promotes sparsity in the solution, performing simultaneous estimation and variable selection—a cornerstone of modern high-dimensional statistics.
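The $L^1$ objective is non-smooth, so it cannot be solved by normal equations; one standard approach is proximal gradient descent (ISTA), sketched below on synthetic sparse data. The step size, iteration count, and $\alpha$ are illustrative choices, not prescribed by the text:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, b, alpha, n_iter=5000):
    # Minimize ||Ax - b||^2 + alpha * ||x||_1 by proximal gradient (ISTA)
    L = 2 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2 * A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, alpha / L)
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[[1, 6]] = [3.0, -2.0]            # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(50)

x = lasso_ista(A, b, alpha=5.0)
print(x)   # the inactive coefficients are shrunk to (near) zero
```

The soft-thresholding step is what produces exact zeros, which is why the $L^1$ penalty performs variable selection while the $L^2$ penalty of ridge regression merely shrinks.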
Computational Laboratory
This simulation compares normal equations, QR, and SVD least squares methods, analyzes condition number growth, demonstrates Tikhonov regularization, and visualizes the L-curve for parameter selection.
Least Squares: Methods and Regularization