Autoencoders & Dimensionality Reduction
From linear autoencoders recovering PCA to denoising and sparse variants — learning compressed representations of scientific data
Why Dimensionality Reduction?
Scientific datasets are often high-dimensional: particle physics events with thousands of detector channels, protein structures with thousands of atomic coordinates, or spectral measurements across hundreds of wavelengths. Autoencoders learn a nonlinear mapping from high-dimensional input space to a low-dimensional latent (bottleneck) space, capturing the essential structure of the data.
We begin by proving that a linear autoencoder with a squared-error loss recovers exactly the PCA solution, then extend to nonlinear, denoising, and sparse variants with applications in anomaly detection for particle physics and materials science.
1. The Autoencoder Framework
An autoencoder consists of two functions: an encoder $f_\phi: \mathbb{R}^D \to \mathbb{R}^d$ that maps the input to a latent representation, and a decoder $g_\theta: \mathbb{R}^d \to \mathbb{R}^D$ that reconstructs the input from the latent code, where $d \ll D$ is the bottleneck dimension.
Reconstruction Loss
Given training data $\{x_1, \ldots, x_N\} \subset \mathbb{R}^D$, we minimise:
$$\mathcal{L}(\phi, \theta) = \frac{1}{N}\sum_{i=1}^{N} \|x_i - g_\theta(f_\phi(x_i))\|^2$$
The bottleneck forces the network to learn a compressed representation that captures the most important variation in the data. The latent code $z_i = f_\phi(x_i) \in \mathbb{R}^d$ is the learned low-dimensional embedding.
2. Linear Autoencoder = PCA (Proof)
When both the encoder and decoder are linear (no activation functions), the autoencoder recovers the principal component analysis (PCA) solution. This is a foundational result connecting neural networks to classical statistics.
Setup
Let the encoder be $f(x) = W_e x$ with $W_e \in \mathbb{R}^{d \times D}$ and the decoder be $g(z) = W_d z$ with $W_d \in \mathbb{R}^{D \times d}$. Assume the data is centred ($\bar{x} = 0$). The loss becomes:
$$\mathcal{L}(W_e, W_d) = \frac{1}{N}\sum_{i=1}^{N} \|x_i - W_d W_e x_i\|^2 = \frac{1}{N}\|X - W_d W_e X\|_F^2$$
where $X \in \mathbb{R}^{D \times N}$ is the data matrix (each column is a sample).
Theorem: Optimal Linear AE Recovers PCA
Claim: The optimal reconstruction matrix $P^* = W_d^* W_e^*$ is the orthogonal projection onto the subspace spanned by the top $d$ eigenvectors of the data covariance matrix $C = \frac{1}{N}XX^\top$.
Proof:
Step 1. Define $P = W_d W_e \in \mathbb{R}^{D \times D}$. Since $W_e$ maps to $\mathbb{R}^d$ and $W_d$ maps back, we have $\text{rank}(P) \leq d$. The loss is:
$$\mathcal{L} = \frac{1}{N}\text{tr}\left[(X - PX)(X - PX)^\top\right] = \text{tr}\left[(I - P)C(I - P)^\top\right]$$
Step 2. Let the eigendecomposition of $C$ be $C = U \Lambda U^\top$ where $\Lambda = \text{diag}(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D \geq 0)$. The loss in the eigenbasis becomes:
$$\mathcal{L} = \text{tr}\left[(I - P) U \Lambda U^\top (I - P)^\top\right]$$
Step 3. For any matrix $P$ with $\text{rank}(P) \leq d$, the Eckart-Young-Mirsky theorem (or a direct variational argument) shows the minimum is achieved when $P$ is the orthogonal projection onto the span of the top $d$ eigenvectors:
$$P^* = U_d U_d^\top, \quad \text{where } U_d = [u_1, \ldots, u_d]$$
Step 4. The minimum loss equals the sum of the discarded eigenvalues:
$$\mathcal{L}^* = \sum_{j=d+1}^{D} \lambda_j$$
Step 5. One valid factorisation is $W_e^* = U_d^\top$ and $W_d^* = U_d$, giving the standard PCA projection. Note that the factorisation is not unique: any invertible $A \in \mathbb{R}^{d \times d}$ gives an equivalent solution $W_e = A U_d^\top$, $W_d = U_d A^{-1}$. However, the reconstruction $P^* = W_d W_e = U_d U_d^\top$ is unique. $\blacksquare$
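The theorem can be checked numerically. The sketch below (a minimal numpy construction, not a trained network) builds the optimal factorisation from Step 5 in closed form and verifies Steps 4 and 5 on synthetic data; the dimensions `D`, `d`, `N` and the noise scale are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centred data with approximate rank-3 structure: D=10, N=500.
D, d, N = 10, 3, 500
X = rng.normal(size=(D, d)) @ rng.normal(size=(d, N)) \
    + 0.05 * rng.normal(size=(D, N))
X -= X.mean(axis=1, keepdims=True)

# Eigendecomposition of C = XX^T / N (np.linalg.eigh returns ascending order).
C = X @ X.T / N
eigvals, U = np.linalg.eigh(C)
U_d = U[:, -d:]                       # top-d eigenvectors
P_star = U_d @ U_d.T                  # optimal reconstruction map (Step 3)

# Step 4: the minimum loss equals the sum of the discarded eigenvalues.
loss = np.mean(np.sum((X - P_star @ X) ** 2, axis=0))
print(np.isclose(loss, eigvals[:-d].sum()))    # True

# Step 5: the factorisation is not unique, but the product W_d W_e is.
A = rng.normal(size=(d, d))
W_e = A @ U_d.T                       # encoder warped by an invertible A
W_d = U_d @ np.linalg.inv(A)          # decoder absorbs A^{-1}
print(np.allclose(W_d @ W_e, P_star)) # True
```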
3. The Bottleneck Representation
The magic of autoencoders is the bottleneck layer. By forcing information through a low-dimensional channel, the network must learn an efficient encoding of the data manifold. We can analyse this rigorously.
Information-Theoretic View
Suppose $x \in \mathbb{R}^D$ lies on (or near) a $d$-dimensional manifold $\mathcal{M} \subset \mathbb{R}^D$. The encoder learns a coordinate chart $f: \mathcal{M} \to \mathbb{R}^d$, and the decoder learns the inverse chart $g: \mathbb{R}^d \to \mathcal{M}$.
More formally, the optimal encoder-decoder pair minimises the expected reconstruction error:
$$\min_{f,g} \mathbb{E}_{x \sim p(x)}\left[\|x - g(f(x))\|^2\right]$$
If $f$ and $g$ are unconstrained (e.g., universal approximators), the minimum is achieved when $g \circ f$ is the orthogonal projection onto the data manifold: $g(f(x)) = \text{proj}_{\mathcal{M}}(x)$. The residual $\|x - g(f(x))\|^2$ then captures only the off-manifold noise.
Derivation: Gradient of the Bottleneck Representation
For a deep autoencoder with encoder $f_\phi(x) = \sigma(W_L \cdots \sigma(W_2 \sigma(W_1 x + b_1) + b_2) \cdots + b_L)$, the Jacobian of the encoder at a point $x$ is:
$$J_f(x) = \frac{\partial f_\phi}{\partial x} = D_L W_L \cdots D_2 W_2 D_1 W_1 \in \mathbb{R}^{d \times D}$$
where $D_l = \text{diag}(\sigma'(z_l))$ are the activation derivative matrices. The rows of $J_f$ span the tangent space of the learned manifold at $x$. At the optimum, the row space of $J_f$ aligns with the top $d$ principal directions of the local data distribution around $x$.
This gives a geometric interpretation: the encoder learns a nonlinear coordinate system adapted to the data manifold, generalising PCA to curved manifolds.
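The Jacobian product formula is easy to verify directly. Below is a minimal numpy sketch with a hypothetical two-layer tanh encoder (all weights random, dimensions arbitrary); the analytic product $D_2 W_2 D_1 W_1$ is checked against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy two-layer tanh encoder f(x) = tanh(W2 tanh(W1 x + b1) + b2).
D, h, d = 6, 5, 2
W1, b1 = 0.5 * rng.normal(size=(h, D)), 0.1 * rng.normal(size=h)
W2, b2 = 0.5 * rng.normal(size=(d, h)), 0.1 * rng.normal(size=d)

def encode(x):
    return np.tanh(W2 @ np.tanh(W1 @ x + b1) + b2)

def jacobian(x):
    # Analytic Jacobian J_f = D_2 W_2 D_1 W_1, with D_l = diag(sigma'(z_l)).
    z1 = W1 @ x + b1
    z2 = W2 @ np.tanh(z1) + b2
    D1 = np.diag(1 - np.tanh(z1) ** 2)   # tanh'(z) = 1 - tanh(z)^2
    D2 = np.diag(1 - np.tanh(z2) ** 2)
    return D2 @ W2 @ D1 @ W1             # shape (d, D)

# Check against central finite differences at a random point.
x = rng.normal(size=D)
J = jacobian(x)
eps = 1e-6
J_fd = np.column_stack([
    (encode(x + eps * e) - encode(x - eps * e)) / (2 * eps)
    for e in np.eye(D)
])
print(np.allclose(J, J_fd, atol=1e-6))   # True
```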
4. Denoising Autoencoders (DAE)
A denoising autoencoder is trained to reconstruct the clean input from a corrupted version. This forces the network to learn robust features rather than memorising the identity function.
DAE Training Procedure
Given a corruption process $q(\tilde{x}|x)$ (e.g., additive Gaussian noise or masking), the DAE minimises:
$$\mathcal{L}_{\text{DAE}} = \mathbb{E}_{x \sim p(x)} \mathbb{E}_{\tilde{x} \sim q(\tilde{x}|x)} \left[\|x - g_\theta(f_\phi(\tilde{x}))\|^2\right]$$
Key insight (Vincent et al., 2008): The optimal denoising function is the posterior mean:
$$g^*(f^*(\tilde{x})) = \mathbb{E}[x | \tilde{x}]$$
For Gaussian corruption $\tilde{x} = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, the optimal reconstruction direction points from $\tilde{x}$ toward the data manifold. By Tweedie's formula:
$$g^*(f^*(\tilde{x})) - \tilde{x} = \sigma^2 \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$$
where $p_\sigma$ is the density of the corrupted data. This means the DAE implicitly learns the score function (gradient of the log-density), connecting it to score-based generative models and diffusion models.
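The posterior-mean result can be illustrated in a toy Gaussian model where everything is available in closed form; no network is trained here, the exact shrinkage factor plays the role of the optimal denoiser, and the variances `s2`, `sigma2` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian toy model: x ~ N(0, s^2), corrupted x~ = x + eps, eps ~ N(0, sigma^2).
s2, sigma2 = 4.0, 1.0
N = 200_000
x = rng.normal(scale=np.sqrt(s2), size=N)
x_tilde = x + rng.normal(scale=np.sqrt(sigma2), size=N)

# Optimal denoiser = posterior mean E[x | x~] = s^2 / (s^2 + sigma^2) * x~.
a_star = s2 / (s2 + sigma2)
denoise = lambda xt: a_star * xt

# Score of the corrupted marginal p_sigma(x~) = N(0, s^2 + sigma^2).
score = lambda xt: -xt / (s2 + sigma2)

# Tweedie's formula: g*(x~) - x~ = sigma^2 * grad log p_sigma(x~).
xt = np.linspace(-3, 3, 7)
print(np.allclose(denoise(xt) - xt, sigma2 * score(xt)))   # True

# The posterior mean also minimises the empirical denoising loss:
# any other linear denoiser a * x~ has a higher mean-squared error.
mse = lambda a: np.mean((x - a * x_tilde) ** 2)
print(mse(a_star) < mse(a_star + 0.1) and mse(a_star) < mse(a_star - 0.1))
```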
5. Sparse Autoencoders
Sparse autoencoders encourage the latent representation to be sparse — most latent units should be inactive (close to zero) for any given input. This leads to more interpretable features.
Sparsity Penalty
The loss function includes a sparsity penalty. Let $\hat{\rho}_j = \frac{1}{N}\sum_{i=1}^{N} a_j(x_i)$ be the average activation of hidden unit $j$. We want $\hat{\rho}_j \approx \rho$ for a small target $\rho$ (e.g., 0.05). The KL divergence penalty is:
$$\Omega_{\text{sparse}} = \sum_{j=1}^{d_h} \text{KL}(\rho \| \hat{\rho}_j) = \sum_{j=1}^{d_h} \left[\rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right]$$
Alternatively, a simpler $L^1$ penalty can be applied directly to the activations of each sample:
$$\mathcal{L}_{\text{sparse}} = \mathcal{L}_{\text{recon}} + \frac{\lambda}{N} \sum_{i=1}^{N} \sum_{j=1}^{d_h} |a_j(x_i)|$$
In the overcomplete case ($d_h > D$), sparsity prevents the trivial identity solution. Each input activates a different subset of hidden units, giving a distributed code.
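A minimal numpy implementation of the KL sparsity penalty (with clipping added for numerical safety, an assumption beyond the formula itself) behaves as expected: it vanishes when every unit's mean activation hits the target $\rho$ and grows as activations drift away.

```python
import numpy as np

def kl_sparsity(rho, rho_hat):
    """Bernoulli KL(rho || rho_hat_j), summed over hidden units."""
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)   # guard against log(0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rho = 0.05
on_target = kl_sparsity(rho, np.full(4, rho))   # every unit at the target
drifting = kl_sparsity(rho, np.full(4, 0.1))    # units mildly too active
saturated = kl_sparsity(rho, np.full(4, 0.5))   # units far too active
print(on_target, drifting, saturated)           # 0.0 < drifting < saturated
```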
6. Contractive Autoencoders
Contractive autoencoders (Rifai et al., 2011) add a penalty on the Frobenius norm of the encoder Jacobian, encouraging the representation to be locally invariant to small perturbations:
$$\mathcal{L}_{\text{CAE}} = \mathcal{L}_{\text{recon}} + \lambda \left\|\frac{\partial f_\phi(x)}{\partial x}\right\|_F^2 = \mathcal{L}_{\text{recon}} + \lambda \sum_{ij} \left(\frac{\partial z_j}{\partial x_i}\right)^2$$
For a single-layer encoder $f(x) = \sigma(Wx + b)$ with sigmoid activation, the Jacobian is $J_f = \text{diag}(\sigma'(Wx+b)) \cdot W$, and the penalty becomes:
$$\|J_f\|_F^2 = \sum_{j=1}^{d} a_j^2(1 - a_j)^2 \sum_{i=1}^{D} W_{ji}^2$$
where $a_j = \sigma(w_j^\top x + b_j)$, using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. This penalises sensitivity to input variations orthogonal to the data manifold: the representation is encouraged to change only along directions of genuine data variation.
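For a single-layer sigmoid encoder, $\sigma'(z) = \sigma(z)(1-\sigma(z))$ implies the penalty reduces to $\sum_j a_j^2(1-a_j)^2 \sum_i W_{ji}^2$. The sketch below checks this closed form against the Jacobian computed directly (all dimensions and weights are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

D, d = 8, 4
W = rng.normal(size=(d, D))
b = rng.normal(size=d)
x = rng.normal(size=D)

a = sigmoid(W @ x + b)
# Direct Jacobian: J_f = diag(sigma'(Wx + b)) W, with sigma' = a(1 - a).
J = np.diag(a * (1 - a)) @ W

# Closed form: ||J_f||_F^2 = sum_j a_j^2 (1 - a_j)^2 sum_i W_ji^2.
closed = np.sum((a * (1 - a)) ** 2 * np.sum(W ** 2, axis=1))
print(np.isclose(np.sum(J ** 2), closed))   # True
```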
7. Anomaly Detection in Physics
One of the most powerful applications of autoencoders in science is anomaly detection. The key idea: train an autoencoder on "normal" data (e.g., known Standard Model physics). Anomalous events (potential new physics) will have high reconstruction error because they lie far from the learned manifold.
Anomaly Score
The anomaly score for a new observation $x$ is simply the reconstruction error:
$$s(x) = \|x - g_\theta(f_\phi(x))\|^2$$
A threshold $\tau$ is set (e.g., at the 99th percentile of training errors). Points with $s(x) > \tau$ are flagged as anomalies.
This approach has been applied at the LHC to search for beyond-Standard-Model physics in a model-independent way, in gravitational wave detection, and in materials science to identify unusual crystal structures.
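Here is a minimal sketch of the recipe, using the closed-form linear autoencoder (the PCA projection from Section 2) as the model; the "normal" data is synthetic and lies near a 2-D plane in 10-D, while anomalies are drawn isotropically, so they sit far off the learned manifold.

```python
import numpy as np

rng = np.random.default_rng(4)

# "Normal" training data near a 2-D plane in 10-D; anomalies are isotropic.
D, d, N = 10, 2, 2000
basis, _ = np.linalg.qr(rng.normal(size=(D, d)))   # orthonormal plane basis
normal = rng.normal(size=(N, d)) @ basis.T + 0.05 * rng.normal(size=(N, D))
anomalies = rng.normal(size=(50, D))

# Fit the linear autoencoder in closed form (projection onto top-d PCs).
mu = normal.mean(axis=0)
Xc = normal - mu
C = Xc.T @ Xc / N
_, U = np.linalg.eigh(C)
P = U[:, -d:] @ U[:, -d:].T

def score(x):
    """Anomaly score s(x) = squared reconstruction error."""
    xc = x - mu
    return np.sum((xc - xc @ P) ** 2, axis=-1)

# Threshold at the 99th percentile of training scores.
tau = np.percentile(score(normal), 99)
flagged = np.mean(score(anomalies) > tau)
print(f"fraction of anomalies flagged: {flagged:.2f}")
```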
8. Deep Autoencoder Architectures
Convolutional AE
For image and spatial field data, convolutional autoencoders use conv layers (with stride) in the encoder and transposed convolutions (upsampling) in the decoder. This preserves spatial structure and dramatically reduces parameters compared to fully connected layers.
Encoder: Conv $\to$ ReLU $\to$ Pool, repeated. Decoder: Upsample $\to$ Conv $\to$ ReLU, repeated.
Graph AE
For molecular or network data, graph autoencoders use graph neural network layers. The encoder produces node-level or graph-level embeddings via message passing:
$$h_v^{(l+1)} = \sigma\left(W^{(l)} \frac{1}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} h_u^{(l)}\right)$$
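A single mean-aggregation message-passing step can be written in a few lines of numpy. Everything below is a toy assumption: a 4-node graph, random features and weights, and ReLU standing in for $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy undirected graph on 4 nodes with 3-dim node features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))       # h_v^(l), one row per node
W = rng.normal(size=(3, 3))       # W^(l)

# h_v^(l+1) = ReLU(W * mean of neighbour features h_u^(l)).
deg = A.sum(axis=1, keepdims=True)          # |N(v)| per node
H_next = np.maximum((A / deg) @ H @ W.T, 0) # row v: ReLU(W mean_u h_u)
print(H_next.shape)                         # (4, 3)
```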
Tied Weights
A common regularisation technique is to tie the decoder weights to the transpose of the encoder weights: $W_d = W_e^\top$. This halves the number of parameters and enforces that the encoder and decoder share the same feature basis. In the linear case, this gives $P = W_e^\top W_e$, which is an orthogonal projection exactly when the rows of $W_e$ are orthonormal ($W_e W_e^\top = I$); for general $W_e$ it is symmetric but not a projection.
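As a quick numerical check of the tied-weight reconstruction map (numpy sketch with a random toy matrix): $P = W_e^\top W_e$ is idempotent, i.e. an orthogonal projection, precisely when the rows of $W_e$ are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(6)

# Tied weights: P = W_e^T W_e for a random (non-orthonormal) W_e.
W = rng.normal(size=(3, 8))
P = W.T @ W
print(np.allclose(P @ P, P))     # False: not idempotent in general

# Orthonormalise the rows of W_e; now P is an orthogonal projection.
Q, _ = np.linalg.qr(W.T)         # Q has orthonormal columns
W_orth = Q.T                     # rows of W_orth are orthonormal
P_orth = W_orth.T @ W_orth
print(np.allclose(P_orth @ P_orth, P_orth))   # True: P^2 = P
```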
9. Comparison: AE vs. PCA vs. t-SNE vs. UMAP
| Method | Type | Invertible? | Scalability |
|---|---|---|---|
| PCA | Linear | Yes (exact) | Excellent |
| Linear AE | Linear | Yes (= PCA) | Excellent |
| Deep AE | Nonlinear | Yes (decoder) | Good (GPU) |
| t-SNE | Nonlinear | No | Moderate |
| UMAP | Nonlinear | Approximate | Good |
The key advantage of autoencoders over t-SNE and UMAP is that the decoder provides a generative model: we can sample points in latent space and decode them to generate new data. This is the foundation for variational autoencoders (VAEs).
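The generative-decoder advantage shows up even in a toy setting. Below, data on a circle (a curved 1-D manifold) is compressed to $d = 1$ by PCA and by an idealised nonlinear autoencoder whose encoder/decoder pair is written down by hand (the angle parametrisation) rather than learned; the decoder then generates new on-manifold samples.

```python
import numpy as np

rng = np.random.default_rng(5)

# Data on a curved 1-D manifold (unit circle in 2-D) plus small noise.
theta = rng.uniform(0, 2 * np.pi, size=1000)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
X += 0.01 * rng.normal(size=X.shape)

# PCA with d=1: project onto the top principal direction.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
X_pca = (X - mu) @ Vt[:1].T @ Vt[:1] + mu

# Idealised nonlinear AE with d=1: encoder = angle, decoder = (cos, sin).
encode = lambda x: np.arctan2(x[:, 1], x[:, 0])
decode = lambda z: np.stack([np.cos(z), np.sin(z)], axis=1)
X_ae = decode(encode(X))

err = lambda Xhat: np.mean(np.sum((X - Xhat) ** 2, axis=1))
print(f"PCA error: {err(X_pca):.3f}, nonlinear AE error: {err(X_ae):.5f}")

# The decoder is generative: sample latent codes, decode new circle points.
samples = decode(rng.uniform(0, 2 * np.pi, size=5))
print(np.linalg.norm(samples, axis=1))   # all 1.0 (on the manifold)
```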
Chapter Summary
- Autoencoders learn compressed representations by minimising reconstruction error through a bottleneck.
- Linear AE = PCA: with a linear encoder and decoder, the optimal solution projects onto the top eigenvectors of the covariance matrix (proved via Eckart-Young).
- Denoising AEs learn to reconstruct clean data from corrupted inputs, implicitly learning the score function $\nabla \log p(x)$.
- Sparse AEs encourage few active latent units per input, learning overcomplete but interpretable representations.
- Contractive AEs penalise the encoder Jacobian norm, learning locally invariant features.
- Anomaly detection uses high reconstruction error to flag inputs that deviate from the training manifold, a model-independent approach to new-physics searches.
- Deep nonlinear AEs can substantially outperform PCA on data with curved manifold structure.