Part I · Chapter 1

Linear Algebra for ML

Vectors, matrices, eigendecomposition, and the Singular Value Decomposition: the geometric and algebraic backbone of machine learning.

1. Vectors and Matrices

A vector \(\mathbf{x} \in \mathbb{R}^n\) is an ordered tuple of real numbers. The inner product (dot product) of two vectors is

\[ \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i = \|\mathbf{x}\| \|\mathbf{y}\| \cos\theta \]

which encodes both the product of magnitudes and the angle between the vectors. A matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) represents a linear map from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). For \(\mathbf{A} \in \mathbb{R}^{m \times p}\) and \(\mathbf{B} \in \mathbb{R}^{p \times n}\), the \((i,j)\) entry of the product \(\mathbf{C} = \mathbf{AB}\) is

\[ C_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj} \]

Key identities used throughout ML:

  • Transpose: \((\mathbf{AB})^\top = \mathbf{B}^\top \mathbf{A}^\top\)
  • Inverse: \((\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\) when both inverses exist
  • Trace: \(\mathrm{tr}(\mathbf{AB}) = \mathrm{tr}(\mathbf{BA})\) (cyclic property)
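These identities are easy to verify numerically. A quick NumPy sanity check (the vectors and matrix sizes are arbitrary choices for illustration):

```python
import numpy as np

# Angle from the inner product: x = e1 and y = (1, 1) are 45 degrees apart
x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])
cos_theta = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.degrees(np.arccos(cos_theta)))   # 45 degrees, up to rounding

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# Transpose and inverse of a product reverse the factor order
assert np.allclose((A @ B).T, B.T @ A.T)
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))

# Cyclic property of the trace
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```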

2. Determinant and Invertibility

The determinant \(\det(\mathbf{A})\) measures the signed volume scaling factor of the linear transformation. For a \(2 \times 2\) matrix:

\[ \det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc \]

A square matrix is invertible if and only if \(\det(\mathbf{A}) \neq 0\), equivalently if and only if its columns are linearly independent. The inverse satisfies \(\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}\).

Key property: \(\det(\mathbf{AB}) = \det(\mathbf{A})\det(\mathbf{B})\). Singular matrices (zero determinant) cannot be inverted: they collapse at least one dimension, so the map cannot be undone.
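Both the \(2 \times 2\) formula and the product rule can be checked directly (the matrices below are arbitrary examples):

```python
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 3.0]])
B = np.array([[1.0, 4.0], [2.0, 1.0]])

# ad - bc for a 2x2 matrix
assert np.isclose(np.linalg.det(A), 2 * 3 - 1 * 0)

# Multiplicativity: det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))

# A singular matrix (linearly dependent columns) has det 0 and no inverse
S = np.array([[1.0, 2.0], [2.0, 4.0]])
assert np.isclose(np.linalg.det(S), 0.0)
try:
    np.linalg.inv(S)
except np.linalg.LinAlgError:
    print("S is singular: not invertible")
```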

3. Eigendecomposition

An eigenvector of \(\mathbf{A} \in \mathbb{R}^{n \times n}\) is a non-zero vector that is only scaled (not rotated) by \(\mathbf{A}\):

\[ \mathbf{A}\mathbf{q}_i = \lambda_i \mathbf{q}_i \]

The eigenvalue \(\lambda_i\) gives the scale factor. To find eigenvalues we solve the characteristic polynomial \(\det(\mathbf{A} - \lambda \mathbf{I}) = 0\).

If \(\mathbf{A}\) has \(n\) linearly independent eigenvectors collected as columns of a matrix \(\mathbf{Q} = [\mathbf{q}_1, \ldots, \mathbf{q}_n]\), then:

\[ \mathbf{A}\mathbf{Q} = \mathbf{Q}\boldsymbol{\Lambda} \quad \Longrightarrow \quad \mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{-1} \]
where \(\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)\).

For symmetric matrices \(\mathbf{A} = \mathbf{A}^\top\) (e.g., covariance matrices), the eigenvectors can be chosen orthonormal (\(\mathbf{Q}^{-1} = \mathbf{Q}^\top\)), so:

\[ \mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top = \sum_{i=1}^{n} \lambda_i \mathbf{q}_i \mathbf{q}_i^\top \]

This spectral decomposition shows that \(\mathbf{A}\) is a sum of rank-1 projections weighted by eigenvalues: the foundation of PCA.

[Figure: the map A = QΛQᵀ sends the unit circle in the input space (axes e₁, e₂) to an ellipse in the output space, aligned with the eigenvectors q₁ (λ₁) and q₂ (λ₂).]

The linear map A stretches the unit circle into an ellipse along the eigenvector directions, with eigenvalues giving the stretch factors.
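For a concrete symmetric example, the decomposition can be built and verified with NumPy's `eigh` (the \(2 \times 2\) matrix is an illustrative choice):

```python
import numpy as np

# Symmetric matrix: eigh returns eigenvalues ascending and orthonormal eigenvectors
A = np.array([[2.0, 1.0], [1.0, 2.0]])
lam, Q = np.linalg.eigh(A)   # eigenvalues: [1, 3]

# Q is orthogonal: Q^{-1} = Q^T
assert np.allclose(Q.T @ Q, np.eye(2))

# Each column of Q satisfies A q = lambda q
for l, q in zip(lam, Q.T):
    assert np.allclose(A @ q, l * q)

# Spectral decomposition: sum of rank-1 projections weighted by eigenvalues
A_rebuilt = sum(l * np.outer(q, q) for l, q in zip(lam, Q.T))
assert np.allclose(A_rebuilt, A)
```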

4. Rank and Null Space

The rank of \(\mathbf{A} \in \mathbb{R}^{m \times n}\) is the dimension of its column space (range): \(\mathrm{rank}(\mathbf{A}) = \dim(\mathrm{col}(\mathbf{A}))\). By the rank-nullity theorem:

\[ \mathrm{rank}(\mathbf{A}) + \dim(\mathrm{null}(\mathbf{A})) = n \]

The null space \(\mathrm{null}(\mathbf{A}) = \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{0}\}\) contains all vectors that \(\mathbf{A}\) maps to zero; it is central to understanding the solution sets of linear systems \(\mathbf{A}\mathbf{x} = \mathbf{b}\).

A matrix is positive definite (PD) if \(\mathbf{x}^\top \mathbf{A} \mathbf{x} > 0\) for all \(\mathbf{x} \neq \mathbf{0}\), equivalently all eigenvalues are positive. PD matrices arise naturally as Hessians at strict local minima and as covariance matrices. The matrix \(\mathbf{X}^\top \mathbf{X}\) is always positive semi-definite, and PD when \(\mathbf{X}\) has full column rank.
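Rank, nullity, and positive definiteness can all be probed numerically; a small sketch (the rank-deficient matrix below is constructed by hand, with its third row the sum of the first two):

```python
import numpy as np

# A 3x4 matrix of rank 2: row 3 = row 1 + row 2
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0],
              [1.0, 1.0, 3.0, 4.0]])

r = np.linalg.matrix_rank(A)
assert r == 2

# Rank-nullity: dim(null(A)) = n - rank(A); the last rows of V^T span null(A)
_, _, Vt = np.linalg.svd(A)
null_basis = Vt[r:]
assert null_basis.shape[0] == A.shape[1] - r
assert np.allclose(A @ null_basis.T, 0.0, atol=1e-10)

# X^T X is PSD, and PD when X has full column rank (all eigenvalues positive)
X = np.random.default_rng(0).standard_normal((10, 3))
eigvals = np.linalg.eigvalsh(X.T @ X)
assert np.all(eigvals > 0)
```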

5. Singular Value Decomposition (SVD)

The SVD generalises eigendecomposition to rectangular matrices. For \(\mathbf{A} \in \mathbb{R}^{m \times n}\) with \(m \geq n\), the SVD is:

\[ \mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top \]
where \(\mathbf{U} \in \mathbb{R}^{m \times m}\) is orthogonal, \(\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}\) is diagonal with \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0\), and \(\mathbf{V} \in \mathbb{R}^{n \times n}\) is orthogonal.

Derivation from \(\mathbf{A}^\top \mathbf{A}\)

Consider the symmetric PSD matrix \(\mathbf{A}^\top \mathbf{A} \in \mathbb{R}^{n \times n}\). By the spectral theorem it has an orthonormal eigenbasis:

\[ \mathbf{A}^\top \mathbf{A} \,\mathbf{v}_i = \lambda_i \mathbf{v}_i, \quad \lambda_i \geq 0 \]

Define the singular values \(\sigma_i = \sqrt{\lambda_i}\) and the left singular vectors \(\mathbf{u}_i = \mathbf{A}\mathbf{v}_i / \sigma_i\) (for \(\sigma_i > 0\)). Then:

\[ \mathbf{A}\mathbf{v}_i = \sigma_i \mathbf{u}_i \quad \Longleftrightarrow \quad \mathbf{A}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma} \quad \Longleftrightarrow \quad \mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top \]

The \(\mathbf{u}_i\) are orthonormal: \(\mathbf{u}_i^\top \mathbf{u}_j = (\mathbf{A}\mathbf{v}_i)^\top(\mathbf{A}\mathbf{v}_j)/(\sigma_i\sigma_j) = \mathbf{v}_i^\top \mathbf{A}^\top \mathbf{A} \mathbf{v}_j / (\sigma_i \sigma_j) = \lambda_j \delta_{ij} / (\sigma_i \sigma_j) = \delta_{ij}\).
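The derivation above translates line by line into code: eigendecompose \(\mathbf{A}^\top\mathbf{A}\), take square roots for the singular values, and form \(\mathbf{u}_i = \mathbf{A}\mathbf{v}_i/\sigma_i\) (the matrix size is an arbitrary full-rank example, so all \(\sigma_i > 0\)):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

# Eigendecomposition of A^T A gives V and the squared singular values
lam, V = np.linalg.eigh(A.T @ A)
lam, V = lam[::-1], V[:, ::-1]          # sort descending
sigma = np.sqrt(lam)

# Left singular vectors u_i = A v_i / sigma_i (column-wise division)
U = A @ V / sigma

# U has orthonormal columns, and A = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(3), atol=1e-8)
assert np.allclose(U @ np.diag(sigma) @ V.T, A)

# The singular values match NumPy's SVD
s_np = np.linalg.svd(A, compute_uv=False)
assert np.allclose(sigma, s_np)
```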

Eckart–Young Theorem (Best Low-Rank Approximation)

The rank-\(k\) truncated SVD \(\mathbf{A}_k = \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^\top = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top\) solves:

\[ \min_{\mathbf{B}: \,\mathrm{rank}(\mathbf{B})\leq k} \|\mathbf{A} - \mathbf{B}\|_F \quad = \quad \|\mathbf{A} - \mathbf{A}_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2} \]

This is the theoretical foundation of PCA, image compression, and latent-factor models.
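The Eckart–Young error formula can be verified directly: truncate the SVD at rank \(k\) and compare the Frobenius error against the discarded singular values (sizes and \(k\) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # rank-k truncated SVD

# Frobenius error equals the sqrt of the sum of discarded sigma_i^2
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))

# A random rank-k competitor does no better than the truncated SVD
B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))
assert np.linalg.norm(A - B, "fro") >= err
```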

Python: SVD on a Data Matrix

We construct a low-rank data matrix, compute its full SVD, analyse the singular value spectrum, and show how few components capture most of the variance: the core idea behind PCA and data compression.

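A minimal sketch of the experiment described above (the matrix sizes, true rank, noise level, and 99% energy threshold are illustrative assumptions, not taken from the original script):

```python
import numpy as np

rng = np.random.default_rng(0)

# Construct a noisy rank-3 data matrix: 100 samples x 20 features
n, d, true_rank = 100, 20, 3
X = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, d))
X += 0.05 * rng.standard_normal((n, d))     # small additive noise

# Thin SVD and the singular value spectrum
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print("top singular values:", np.round(s[:5], 2))

# Fraction of energy (squared Frobenius norm) captured by the first k components
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(energy, 0.99) + 1)
print(f"{k} components capture {energy[k - 1]:.1%} of the variance")

# Rank-k reconstruction and its relative error
X_k = (U[:, :k] * s[:k]) @ Vt[:k]
rel_err = np.linalg.norm(X - X_k) / np.linalg.norm(X)
print("relative reconstruction error:", rel_err)
```

Because the signal is rank 3 and the noise is small, the spectrum drops sharply after the third singular value, and a handful of components suffice for a near-perfect reconstruction.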

Key Takeaways

  • ✓ Matrix multiplication composes linear transformations; the inner product encodes angle and length.
  • ✓ Eigendecomposition \(A = Q\Lambda Q^{-1}\) diagonalises a matrix in its own basis; symmetric matrices have orthonormal eigenvectors.
  • ✓ SVD \(A = U\Sigma V^\top\) extends eigendecomposition to rectangular matrices and yields the best low-rank approximation.
  • ✓ The rank-nullity theorem governs solution existence and uniqueness; PD matrices guarantee unique solutions.
  • ✓ Truncated SVD with \(k\) components achieves optimal Frobenius-norm compression by retaining the largest singular values.