Part I · Chapter 1

Linear Algebra for ML

Vectors, matrices, eigendecomposition, and the Singular Value Decomposition: the geometric and algebraic backbone of machine learning.

1. Vectors and Matrices

A vector \(\mathbf{x} \in \mathbb{R}^n\) is an ordered tuple of real numbers. The inner product (dot product) of two vectors is

\[ \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i = \|\mathbf{x}\| \|\mathbf{y}\| \cos\theta \]

which encodes both the product of magnitudes and the angle between the vectors. A matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) represents a linear map from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). For \(\mathbf{A} \in \mathbb{R}^{m \times p}\) and \(\mathbf{B} \in \mathbb{R}^{p \times n}\), the \((i,j)\) entry of the product \(\mathbf{C} = \mathbf{AB}\) is

\[ C_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj} \]

Key identities used throughout ML:

  • Transpose: \((\mathbf{AB})^\top = \mathbf{B}^\top \mathbf{A}^\top\)
  • Inverse: \((\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\) when both inverses exist
  • Trace: \(\mathrm{tr}(\mathbf{AB}) = \mathrm{tr}(\mathbf{BA})\) (cyclic property)
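These identities are easy to verify numerically. A quick NumPy sanity check (the vectors and matrix sizes are arbitrary choices for illustration):

```python
import numpy as np

# Angle from the inner product: x = e1 and y = (1, 1) are 45 degrees apart
x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])
cos_theta = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.degrees(np.arccos(cos_theta)))   # 45 degrees, up to rounding

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# Transpose and inverse of a product reverse the factor order
assert np.allclose((A @ B).T, B.T @ A.T)
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))

# Cyclic property of the trace
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```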

2. Determinant and Invertibility

The determinant \(\det(\mathbf{A})\) measures the signed volume scaling factor of the linear transformation. For a \(2 \times 2\) matrix:

\[ \det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc \]

A square matrix is invertible if and only if \(\det(\mathbf{A}) \neq 0\), equivalently if and only if its columns are linearly independent. The inverse satisfies \(\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}\).

Key property: \(\det(\mathbf{AB}) = \det(\mathbf{A})\det(\mathbf{B})\). Singular matrices (zero determinant) cannot be inverted: they collapse at least one dimension, so the map cannot be undone.
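Both the \(2 \times 2\) formula and the product rule can be checked directly (the matrices below are arbitrary examples):

```python
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 3.0]])
B = np.array([[1.0, 4.0], [2.0, 1.0]])

# ad - bc for a 2x2 matrix
assert np.isclose(np.linalg.det(A), 2 * 3 - 1 * 0)

# Multiplicativity: det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))

# A singular matrix (linearly dependent columns) has det 0 and no inverse
S = np.array([[1.0, 2.0], [2.0, 4.0]])
assert np.isclose(np.linalg.det(S), 0.0)
try:
    np.linalg.inv(S)
except np.linalg.LinAlgError:
    print("S is singular: not invertible")
```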

3. Eigendecomposition

An eigenvector of \(\mathbf{A} \in \mathbb{R}^{n \times n}\) is a non-zero vector that is only scaled (not rotated) by \(\mathbf{A}\):

\[ \mathbf{A}\mathbf{q}_i = \lambda_i \mathbf{q}_i \]

The eigenvalue \(\lambda_i\) gives the scale factor. To find eigenvalues we solve the characteristic polynomial \(\det(\mathbf{A} - \lambda \mathbf{I}) = 0\).

If \(\mathbf{A}\) has \(n\) linearly independent eigenvectors collected as columns of a matrix \(\mathbf{Q} = [\mathbf{q}_1, \ldots, \mathbf{q}_n]\), then:

\[ \mathbf{A}\mathbf{Q} = \mathbf{Q}\boldsymbol{\Lambda} \quad \Longrightarrow \quad \mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{-1} \]
where \(\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)\).

For symmetric matrices \(\mathbf{A} = \mathbf{A}^\top\) (e.g., covariance matrices), the eigenvectors can be chosen orthonormal (\(\mathbf{Q}^{-1} = \mathbf{Q}^\top\)), so:

\[ \mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top = \sum_{i=1}^{n} \lambda_i \mathbf{q}_i \mathbf{q}_i^\top \]

This spectral decomposition shows that \(\mathbf{A}\) is a sum of rank-1 projections weighted by eigenvalues: the foundation of PCA.

[Figure: the map A = QΛQᵀ sends the unit circle in the input space (axes e₁, e₂) to an ellipse in the output space, aligned with the eigenvectors q₁ (λ₁) and q₂ (λ₂).]

The linear map A stretches the unit circle into an ellipse along the eigenvector directions, with eigenvalues giving the stretch factors.
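For a concrete symmetric example, the decomposition can be built and verified with NumPy's `eigh` (the \(2 \times 2\) matrix is an illustrative choice):

```python
import numpy as np

# Symmetric matrix: eigh returns eigenvalues ascending and orthonormal eigenvectors
A = np.array([[2.0, 1.0], [1.0, 2.0]])
lam, Q = np.linalg.eigh(A)   # eigenvalues: [1, 3]

# Q is orthogonal: Q^{-1} = Q^T
assert np.allclose(Q.T @ Q, np.eye(2))

# Each column of Q satisfies A q = lambda q
for l, q in zip(lam, Q.T):
    assert np.allclose(A @ q, l * q)

# Spectral decomposition: sum of rank-1 projections weighted by eigenvalues
A_rebuilt = sum(l * np.outer(q, q) for l, q in zip(lam, Q.T))
assert np.allclose(A_rebuilt, A)
```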

4. Rank and Null Space

The rank of \(\mathbf{A} \in \mathbb{R}^{m \times n}\) is the dimension of its column space (range): \(\mathrm{rank}(\mathbf{A}) = \dim(\mathrm{col}(\mathbf{A}))\). By the rank-nullity theorem:

\[ \mathrm{rank}(\mathbf{A}) + \dim(\mathrm{null}(\mathbf{A})) = n \]

The null space \(\mathrm{null}(\mathbf{A}) = \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{0}\}\) contains all vectors that \(\mathbf{A}\) maps to zero; it is central to understanding the solution sets of linear systems \(\mathbf{A}\mathbf{x} = \mathbf{b}\).

A matrix is positive definite (PD) if \(\mathbf{x}^\top \mathbf{A} \mathbf{x} > 0\) for all \(\mathbf{x} \neq \mathbf{0}\), equivalently all eigenvalues are positive. PD matrices arise naturally as Hessians at strict local minima and as covariance matrices. The matrix \(\mathbf{X}^\top \mathbf{X}\) is always positive semi-definite, and PD when \(\mathbf{X}\) has full column rank.
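Rank, nullity, and positive definiteness can all be probed numerically; a small sketch (the rank-deficient matrix below is constructed by hand, with its third row the sum of the first two):

```python
import numpy as np

# A 3x4 matrix of rank 2: row 3 = row 1 + row 2
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0],
              [1.0, 1.0, 3.0, 4.0]])

r = np.linalg.matrix_rank(A)
assert r == 2

# Rank-nullity: dim(null(A)) = n - rank(A); the last rows of V^T span null(A)
_, _, Vt = np.linalg.svd(A)
null_basis = Vt[r:]
assert null_basis.shape[0] == A.shape[1] - r
assert np.allclose(A @ null_basis.T, 0.0, atol=1e-10)

# X^T X is PSD, and PD when X has full column rank (all eigenvalues positive)
X = np.random.default_rng(0).standard_normal((10, 3))
eigvals = np.linalg.eigvalsh(X.T @ X)
assert np.all(eigvals > 0)
```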

5. Singular Value Decomposition (SVD)

The SVD generalises eigendecomposition to rectangular matrices. For \(\mathbf{A} \in \mathbb{R}^{m \times n}\) with \(m \geq n\), the SVD is:

\[ \mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top \]
where \(\mathbf{U} \in \mathbb{R}^{m \times m}\) is orthogonal, \(\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}\) is diagonal with \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0\), and \(\mathbf{V} \in \mathbb{R}^{n \times n}\) is orthogonal.

Derivation from \(\mathbf{A}^\top \mathbf{A}\)

Consider the symmetric PSD matrix \(\mathbf{A}^\top \mathbf{A} \in \mathbb{R}^{n \times n}\). By the spectral theorem it has an orthonormal eigenbasis:

\[ \mathbf{A}^\top \mathbf{A} \,\mathbf{v}_i = \lambda_i \mathbf{v}_i, \quad \lambda_i \geq 0 \]

Define the singular values \(\sigma_i = \sqrt{\lambda_i}\) and the left singular vectors \(\mathbf{u}_i = \mathbf{A}\mathbf{v}_i / \sigma_i\) (for \(\sigma_i > 0\)). Then:

\[ \mathbf{A}\mathbf{v}_i = \sigma_i \mathbf{u}_i \quad \Longleftrightarrow \quad \mathbf{A}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma} \quad \Longleftrightarrow \quad \mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top \]

The \(\mathbf{u}_i\) are orthonormal: \(\mathbf{u}_i^\top \mathbf{u}_j = (\mathbf{A}\mathbf{v}_i)^\top(\mathbf{A}\mathbf{v}_j)/(\sigma_i\sigma_j) = \mathbf{v}_i^\top \mathbf{A}^\top \mathbf{A} \mathbf{v}_j / (\sigma_i \sigma_j) = \lambda_j \delta_{ij} / (\sigma_i \sigma_j) = \delta_{ij}\).
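The derivation above translates line by line into code: eigendecompose \(\mathbf{A}^\top\mathbf{A}\), take square roots for the singular values, and form \(\mathbf{u}_i = \mathbf{A}\mathbf{v}_i/\sigma_i\) (the matrix size is an arbitrary full-rank example, so all \(\sigma_i > 0\)):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

# Eigendecomposition of A^T A gives V and the squared singular values
lam, V = np.linalg.eigh(A.T @ A)
lam, V = lam[::-1], V[:, ::-1]          # sort descending
sigma = np.sqrt(lam)

# Left singular vectors u_i = A v_i / sigma_i (column-wise division)
U = A @ V / sigma

# U has orthonormal columns, and A = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(3), atol=1e-8)
assert np.allclose(U @ np.diag(sigma) @ V.T, A)

# The singular values match NumPy's SVD
s_np = np.linalg.svd(A, compute_uv=False)
assert np.allclose(sigma, s_np)
```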

Eckart–Young Theorem (Best Low-Rank Approximation)

The rank-\(k\) truncated SVD \(\mathbf{A}_k = \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^\top = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top\) solves:

\[ \min_{\mathbf{B}: \,\mathrm{rank}(\mathbf{B})\leq k} \|\mathbf{A} - \mathbf{B}\|_F \quad = \quad \|\mathbf{A} - \mathbf{A}_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2} \]

This is the theoretical foundation of PCA, image compression, and latent-factor models.
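The Eckart–Young error formula can be verified directly: truncate the SVD at rank \(k\) and compare the Frobenius error against the discarded singular values (sizes and \(k\) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # rank-k truncated SVD

# Frobenius error equals the sqrt of the sum of discarded sigma_i^2
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))

# A random rank-k competitor does no better than the truncated SVD
B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))
assert np.linalg.norm(A - B, "fro") >= err
```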

Python: SVD on a Data Matrix

We construct a low-rank data matrix, compute its full SVD, analyse the singular value spectrum, and show how few components capture most of the variance: the core idea behind PCA and data compression.

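A minimal sketch of the experiment described above (the matrix sizes, true rank, noise level, and 99% energy threshold are illustrative assumptions, not taken from the original script):

```python
import numpy as np

rng = np.random.default_rng(0)

# Construct a noisy rank-3 data matrix: 100 samples x 20 features
n, d, true_rank = 100, 20, 3
X = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, d))
X += 0.05 * rng.standard_normal((n, d))     # small additive noise

# Thin SVD and the singular value spectrum
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print("top singular values:", np.round(s[:5], 2))

# Fraction of energy (squared Frobenius norm) captured by the first k components
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(energy, 0.99) + 1)
print(f"{k} components capture {energy[k - 1]:.1%} of the variance")

# Rank-k reconstruction and its relative error
X_k = (U[:, :k] * s[:k]) @ Vt[:k]
rel_err = np.linalg.norm(X - X_k) / np.linalg.norm(X)
print("relative reconstruction error:", rel_err)
```

Because the signal is rank 3 and the noise is small, the spectrum drops sharply after the third singular value, and a handful of components suffice for a near-perfect reconstruction.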

Key Takeaways

  • ✓ Matrix multiplication composes linear transformations; the inner product encodes angle and length.
  • ✓ Eigendecomposition \(A = Q\Lambda Q^{-1}\) diagonalises a matrix in its own basis; symmetric matrices have orthonormal eigenvectors.
  • ✓ SVD \(A = U\Sigma V^\top\) extends eigendecomposition to rectangular matrices and yields the best low-rank approximation.
  • ✓ The rank-nullity theorem governs solution existence and uniqueness; PD matrices guarantee unique solutions.
  • ✓ Truncated SVD with \(k\) components achieves optimal Frobenius-norm compression by retaining the largest singular values.