Part IV: Advanced Topics | Chapter 4

Applications in Data Science

How linear algebra powers modern machine learning and data analysis

Historical Context

Karl Pearson invented PCA in 1901 to find "lines and planes of closest fit." The Netflix Prize (2006-2009) demonstrated the power of matrix factorization for recommender systems. Google's PageRank (1998) showed that the dominant eigenvector of a web graph encodes page importance. Today, every major ML algorithm—from linear regression to transformers—is built on linear algebra primitives: matrix multiplication, SVD, eigendecomposition, and gradient computation through matrix calculus.

4.1 Principal Component Analysis

PCA finds orthogonal directions of maximum variance. Given a centered data matrix $X \in \mathbb{R}^{n \times p}$, the sample covariance is $C = \frac{1}{n-1}X^TX$. The principal components are eigenvectors of $C$, ordered by eigenvalue magnitude.

$$C = V\Lambda V^T, \quad \text{variance along PC}_k = \lambda_k$$

Dimensionality reduction to $k$ components retains a fraction $\sum_{i=1}^k \lambda_i / \sum_{i=1}^p \lambda_i$ of the total variance, and projecting onto the top $k$ eigenvectors yields the optimal rank-$k$ approximation of the centered data by the Eckart-Young theorem.
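The eigendecomposition route above can be sketched directly in NumPy. This is a minimal illustration on synthetic correlated 2D data (the data-generating matrix is an arbitrary choice for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2D data: variance concentrated along one direction
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)                      # center the data

C = X.T @ X / (X.shape[0] - 1)              # sample covariance
eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # reorder descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first principal component (rank-1 reduction)
Z = X @ eigvecs[:, :1]
explained = eigvals[0] / eigvals.sum()
print(f"variance explained by PC1: {explained:.3f}")
```

Because the synthetic data is strongly elongated along one axis, the first component captures the bulk of the variance.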

4.2 Linear Regression and Regularization

Linear regression $y = X\beta + \epsilon$ is solved by the normal equations $\hat\beta = (X^TX)^{-1}X^Ty$ (assuming $X$ has full column rank). Ridge regression adds an $L^2$ penalty: $\hat\beta_\alpha = (X^TX + \alpha I)^{-1}X^Ty$, while the LASSO uses an $L^1$ penalty: $\min_\beta \|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, promoting sparsity.
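Both closed-form estimators can be computed with one linear solve each. A minimal sketch on synthetic data (the true coefficients and noise level are arbitrary choices for the example); note it uses `np.linalg.solve` rather than forming the inverse explicitly, which is both faster and better conditioned:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Ordinary least squares via the normal equations
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: add alpha * I to X^T X, which shrinks the coefficients
alpha = 10.0
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)
```

The ridge solution has strictly smaller norm than the OLS solution; the penalty trades a little bias for reduced variance.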

4.3 Recommender Systems

The user-item rating matrix $R \approx UV^T$ is approximated via low-rank factorization. The SVD gives the optimal rank-$k$ approximation. In practice, alternating least squares (ALS) or stochastic gradient descent handle missing entries and scale to millions of users and items.
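The SVD route can be sketched on a small synthetic rating matrix. This toy example assumes a fully observed matrix generated from a true rank-2 structure (real systems use ALS or SGD precisely because most entries are missing):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic ratings with true rank-2 structure: 20 users x 10 items
U_true = rng.normal(size=(20, 2))
V_true = rng.normal(size=(10, 2))
R = U_true @ V_true.T

# Truncated SVD gives the optimal rank-k approximation (Eckart-Young)
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err = np.linalg.norm(R - R_k)
print(f"rank-{k} reconstruction error: {err:.2e}")
```

Since $R$ is exactly rank 2 by construction, the rank-2 truncation recovers it to machine precision.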

4.4 Graph Theory and Spectral Methods

The graph Laplacian $L = D - W$ (degree matrix minus adjacency) encodes graph connectivity. Its eigenvalues reveal structure: the number of zero eigenvalues equals the number of connected components, and the Fiedler vector (the eigenvector for the second-smallest eigenvalue) gives a spectral relaxation of the minimum-cut bisection problem; thresholding its entries yields an approximate graph bisection.
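The zero-eigenvalue count is easy to verify on a small hand-built graph. A minimal sketch, using an adjacency matrix chosen to have exactly two connected components:

```python
import numpy as np

# Adjacency of a graph with two components: {0, 1, 2} and {3, 4}
W = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))      # degree matrix
L = D - W                       # graph Laplacian

eigvals = np.linalg.eigvalsh(L)
n_components = int(np.sum(eigvals < 1e-10))
print("eigenvalues:", np.round(eigvals, 3))
print("connected components:", n_components)
```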

PageRank computes the dominant eigenvector of the stochastic transition matrix $G = \alpha H + (1-\alpha)\frac{1}{n}ee^T$, where $\alpha \approx 0.85$ is the damping factor.
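Because $G$ is stochastic, its dominant eigenvalue is 1 and power iteration converges to the PageRank vector. A minimal sketch on a hypothetical four-page web (the link matrix $H$ below is an invented toy example, written column-stochastic so that $Gr$ maps probability vectors to probability vectors):

```python
import numpy as np

# Toy column-stochastic link matrix: column j spreads page j's
# probability equally over the pages it links to
H = np.array([
    [0,   0,   1/2, 0],
    [1/3, 0,   0,   0],
    [1/3, 1/2, 0,   1],
    [1/3, 1/2, 1/2, 0],
])

n = H.shape[0]
alpha = 0.85
G = alpha * H + (1 - alpha) * np.ones((n, n)) / n   # Google matrix

# Power iteration: repeatedly apply G to a uniform start vector
r = np.ones(n) / n
for _ in range(100):
    r = G @ r

print("PageRank:", np.round(r, 4))
```

The damping term $(1-\alpha)\frac{1}{n}ee^T$ makes $G$ strictly positive, so by the Perron-Frobenius theorem the dominant eigenvector is unique and the iteration converges at rate roughly $\alpha$ per step.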

4.5 Neural Networks

A neural network layer computes $h = \sigma(Wx + b)$ where $W$ is a weight matrix. Backpropagation is the chain rule applied to compositions of matrix operations. The attention mechanism in transformers computes $\text{Attention}(Q,K,V) = \text{softmax}(QK^T/\sqrt{d_k})V$—a sequence of matrix multiplications. Understanding the linear algebra of weight matrices, their singular values, and their condition numbers is essential for training stability and generalization.
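The attention formula is short enough to implement directly. A minimal NumPy sketch of single-head scaled dot-product attention (shapes and random inputs are arbitrary choices for the example; real implementations batch over heads and sequences):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(3)
n, d_k, d_v = 4, 8, 8
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

out, weights = attention(Q, K, V)
print("output shape:", out.shape)
print("attention rows sum to 1:", np.allclose(weights.sum(axis=1), 1.0))
```

Each output row is a convex combination of the rows of $V$, with weights given by the softmaxed query-key similarities; the $\sqrt{d_k}$ scaling keeps the scores from saturating the softmax as dimension grows.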

Computational Laboratory

This simulation demonstrates PCA on 2D data, spectral clustering of concentric circles, PageRank computation, and matrix factorization for recommender systems.

Linear Algebra in Data Science
