Autoencoders & Dimensionality Reduction
From linear autoencoders recovering PCA to denoising and sparse variants — learning compressed representations of scientific data
Why Dimensionality Reduction?
Scientific datasets are often high-dimensional: particle physics events with thousands of detector channels, protein structures with thousands of atomic coordinates, or spectral measurements across hundreds of wavelengths. Autoencoders learn a nonlinear mapping from high-dimensional input space to a low-dimensional latent (bottleneck) space, capturing the essential structure of the data.
We begin by proving that a linear autoencoder with a squared-error loss recovers exactly the PCA solution, then extend to nonlinear, denoising, and sparse variants with applications in anomaly detection for particle physics and materials science.
1. The Autoencoder Framework
An autoencoder consists of two functions: an encoder $f_\phi: \mathbb{R}^D \to \mathbb{R}^d$ that maps the input to a latent representation, and a decoder $g_\theta: \mathbb{R}^d \to \mathbb{R}^D$ that reconstructs the input from the latent code, where $d \ll D$ is the bottleneck dimension.
Reconstruction Loss
Given training data $\{x_1, \ldots, x_N\} \subset \mathbb{R}^D$, we minimise:
$$\mathcal{L}(\phi, \theta) = \frac{1}{N}\sum_{i=1}^{N} \|x_i - g_\theta(f_\phi(x_i))\|^2$$
The bottleneck forces the network to learn a compressed representation that captures the most important variation in the data. The latent code $z_i = f_\phi(x_i) \in \mathbb{R}^d$ is the learned low-dimensional embedding.
2. Linear Autoencoder = PCA (Proof)
When both the encoder and decoder are linear (no activation functions), the autoencoder recovers the principal component analysis (PCA) solution. This is a foundational result connecting neural networks to classical statistics.
Setup
Let the encoder be $f(x) = W_e x$ with $W_e \in \mathbb{R}^{d \times D}$ and the decoder be $g(z) = W_d z$ with $W_d \in \mathbb{R}^{D \times d}$. Assume the data is centred ($\bar{x} = 0$). The loss becomes:
$$\mathcal{L}(W_e, W_d) = \frac{1}{N}\sum_{i=1}^{N} \|x_i - W_d W_e x_i\|^2 = \frac{1}{N}\|X - W_d W_e X\|_F^2$$
where $X \in \mathbb{R}^{D \times N}$ is the data matrix (each column is a sample).
Theorem: Optimal Linear AE Recovers PCA
Claim: The optimal reconstruction matrix $P^* = W_d^* W_e^*$ is the orthogonal projection onto the subspace spanned by the top $d$ eigenvectors of the data covariance matrix $C = \frac{1}{N}XX^\top$.
Proof:
Step 1. Define $P = W_d W_e \in \mathbb{R}^{D \times D}$. Since $W_e$ maps to $\mathbb{R}^d$ and $W_d$ maps back, we have $\text{rank}(P) \leq d$. The loss is:
$$\mathcal{L} = \frac{1}{N}\text{tr}\left[(X - PX)(X - PX)^\top\right] = \text{tr}\left[(I - P)C(I - P)^\top\right]$$
Step 2. Let the eigendecomposition of $C$ be $C = U \Lambda U^\top$ where $\Lambda = \text{diag}(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D \geq 0)$. The loss in the eigenbasis becomes:
$$\mathcal{L} = \text{tr}\left[(I - P) U \Lambda U^\top (I - P)^\top\right]$$
Step 3. For any matrix $P$ with $\text{rank}(P) \leq d$, the Eckart-Young-Mirsky theorem (or a direct variational argument) shows the minimum is achieved when $P$ is the orthogonal projection onto the span of the top $d$ eigenvectors:
$$P^* = U_d U_d^\top, \quad \text{where } U_d = [u_1, \ldots, u_d]$$
Step 4. The minimum loss equals the sum of the discarded eigenvalues:
$$\mathcal{L}^* = \sum_{j=d+1}^{D} \lambda_j$$
Step 5. One valid factorisation is $W_e^* = U_d^\top$ and $W_d^* = U_d$, giving the standard PCA projection. Note that the factorisation is not unique: any invertible $A \in \mathbb{R}^{d \times d}$ gives an equivalent solution $W_e = A U_d^\top$, $W_d = U_d A^{-1}$. However, the reconstruction $P^* = W_d W_e = U_d U_d^\top$ is unique. $\blacksquare$
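The theorem can be checked numerically. The sketch below (a minimal numpy construction, not a trained network) builds the optimal factorisation from Step 5 in closed form and verifies Steps 4 and 5 on synthetic data; the dimensions `D`, `d`, `N` and the noise scale are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centred data with approximate rank-3 structure: D=10, N=500.
D, d, N = 10, 3, 500
X = rng.normal(size=(D, d)) @ rng.normal(size=(d, N)) \
    + 0.05 * rng.normal(size=(D, N))
X -= X.mean(axis=1, keepdims=True)

# Eigendecomposition of C = XX^T / N (np.linalg.eigh returns ascending order).
C = X @ X.T / N
eigvals, U = np.linalg.eigh(C)
U_d = U[:, -d:]                       # top-d eigenvectors
P_star = U_d @ U_d.T                  # optimal reconstruction map (Step 3)

# Step 4: the minimum loss equals the sum of the discarded eigenvalues.
loss = np.mean(np.sum((X - P_star @ X) ** 2, axis=0))
print(np.isclose(loss, eigvals[:-d].sum()))    # True

# Step 5: the factorisation is not unique, but the product W_d W_e is.
A = rng.normal(size=(d, d))
W_e = A @ U_d.T                       # encoder warped by an invertible A
W_d = U_d @ np.linalg.inv(A)          # decoder absorbs A^{-1}
print(np.allclose(W_d @ W_e, P_star)) # True
```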
3. The Bottleneck Representation
The magic of autoencoders is the bottleneck layer. By forcing information through a low-dimensional channel, the network must learn an efficient encoding of the data manifold. We can analyse this rigorously.
Information-Theoretic View
Suppose $x \in \mathbb{R}^D$ lies on (or near) a $d$-dimensional manifold $\mathcal{M} \subset \mathbb{R}^D$. The encoder learns a coordinate chart $f: \mathcal{M} \to \mathbb{R}^d$, and the decoder learns the inverse chart $g: \mathbb{R}^d \to \mathcal{M}$.
More formally, the optimal encoder-decoder pair minimises the expected reconstruction error:
$$\min_{f,g} \mathbb{E}_{x \sim p(x)}\left[\|x - g(f(x))\|^2\right]$$
If $f$ and $g$ are unconstrained (e.g., universal approximators), the minimum is achieved when $g \circ f$ is the orthogonal projection onto the data manifold: $g(f(x)) = \text{proj}_{\mathcal{M}}(x)$. The residual $\|x - g(f(x))\|^2$ then captures only the off-manifold noise.
Derivation: Gradient of the Bottleneck Representation
For a deep autoencoder with encoder $f_\phi(x) = \sigma(W_L \cdots \sigma(W_2 \sigma(W_1 x + b_1) + b_2) \cdots + b_L)$, the Jacobian of the encoder at a point $x$ is:
$$J_f(x) = \frac{\partial f_\phi}{\partial x} = D_L W_L \cdots D_2 W_2 D_1 W_1 \in \mathbb{R}^{d \times D}$$
where $D_l = \text{diag}(\sigma'(z_l))$ are the activation derivative matrices. The rows of $J_f$ span the tangent space of the learned manifold at $x$. At the optimum, the row space of $J_f$ aligns with the top $d$ principal directions of the local data distribution around $x$.
This gives a geometric interpretation: the encoder learns a nonlinear coordinate system adapted to the data manifold, generalising PCA to curved manifolds.
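The Jacobian product formula is easy to verify directly. Below is a minimal numpy sketch with a hypothetical two-layer tanh encoder (all weights random, dimensions arbitrary); the analytic product $D_2 W_2 D_1 W_1$ is checked against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy two-layer tanh encoder f(x) = tanh(W2 tanh(W1 x + b1) + b2).
D, h, d = 6, 5, 2
W1, b1 = 0.5 * rng.normal(size=(h, D)), 0.1 * rng.normal(size=h)
W2, b2 = 0.5 * rng.normal(size=(d, h)), 0.1 * rng.normal(size=d)

def encode(x):
    return np.tanh(W2 @ np.tanh(W1 @ x + b1) + b2)

def jacobian(x):
    # Analytic Jacobian J_f = D_2 W_2 D_1 W_1, with D_l = diag(sigma'(z_l)).
    z1 = W1 @ x + b1
    z2 = W2 @ np.tanh(z1) + b2
    D1 = np.diag(1 - np.tanh(z1) ** 2)   # tanh'(z) = 1 - tanh(z)^2
    D2 = np.diag(1 - np.tanh(z2) ** 2)
    return D2 @ W2 @ D1 @ W1             # shape (d, D)

# Check against central finite differences at a random point.
x = rng.normal(size=D)
J = jacobian(x)
eps = 1e-6
J_fd = np.column_stack([
    (encode(x + eps * e) - encode(x - eps * e)) / (2 * eps)
    for e in np.eye(D)
])
print(np.allclose(J, J_fd, atol=1e-6))   # True
```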
4. Denoising Autoencoders (DAE)
A denoising autoencoder is trained to reconstruct the clean input from a corrupted version. This forces the network to learn robust features rather than memorising the identity function.
DAE Training Procedure
Given a corruption process $q(\tilde{x}|x)$ (e.g., additive Gaussian noise or masking), the DAE minimises:
$$\mathcal{L}_{\text{DAE}} = \mathbb{E}_{x \sim p(x)} \mathbb{E}_{\tilde{x} \sim q(\tilde{x}|x)} \left[\|x - g_\theta(f_\phi(\tilde{x}))\|^2\right]$$
Key insight (Vincent et al., 2008): The optimal denoising function is the posterior mean:
$$g^*(f^*(\tilde{x})) = \mathbb{E}[x | \tilde{x}]$$
For Gaussian corruption $\tilde{x} = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, the optimal reconstruction direction points from $\tilde{x}$ toward the data manifold. By Tweedie's formula:
$$g^*(f^*(\tilde{x})) - \tilde{x} = \sigma^2 \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$$
where $p_\sigma$ is the density of the corrupted data. This means the DAE implicitly learns the score function (gradient of the log-density), connecting it to score-based generative models and diffusion models.
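The posterior-mean result can be illustrated in a toy Gaussian model where everything is available in closed form; no network is trained here, the exact shrinkage factor plays the role of the optimal denoiser, and the variances `s2`, `sigma2` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian toy model: x ~ N(0, s^2), corrupted x~ = x + eps, eps ~ N(0, sigma^2).
s2, sigma2 = 4.0, 1.0
N = 200_000
x = rng.normal(scale=np.sqrt(s2), size=N)
x_tilde = x + rng.normal(scale=np.sqrt(sigma2), size=N)

# Optimal denoiser = posterior mean E[x | x~] = s^2 / (s^2 + sigma^2) * x~.
a_star = s2 / (s2 + sigma2)
denoise = lambda xt: a_star * xt

# Score of the corrupted marginal p_sigma(x~) = N(0, s^2 + sigma^2).
score = lambda xt: -xt / (s2 + sigma2)

# Tweedie's formula: g*(x~) - x~ = sigma^2 * grad log p_sigma(x~).
xt = np.linspace(-3, 3, 7)
print(np.allclose(denoise(xt) - xt, sigma2 * score(xt)))   # True

# The posterior mean also minimises the empirical denoising loss:
# any other linear denoiser a * x~ has a higher mean-squared error.
mse = lambda a: np.mean((x - a * x_tilde) ** 2)
print(mse(a_star) < mse(a_star + 0.1) and mse(a_star) < mse(a_star - 0.1))
```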
5. Sparse Autoencoders
Sparse autoencoders encourage the latent representation to be sparse — most latent units should be inactive (close to zero) for any given input. This leads to more interpretable features.
Sparsity Penalty
The loss function includes a sparsity penalty. Let $\hat{\rho}_j = \frac{1}{N}\sum_{i=1}^{N} a_j(x_i)$ be the average activation of hidden unit $j$. We want $\hat{\rho}_j \approx \rho$ for a small target $\rho$ (e.g., 0.05). The KL divergence penalty is:
$$\Omega_{\text{sparse}} = \sum_{j=1}^{d_h} \text{KL}(\rho \| \hat{\rho}_j) = \sum_{j=1}^{d_h} \left[\rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right]$$
Alternatively, a simpler $L^1$ penalty can be applied directly to the activations of each sample:
$$\mathcal{L}_{\text{sparse}} = \mathcal{L}_{\text{recon}} + \frac{\lambda}{N} \sum_{i=1}^{N} \sum_{j=1}^{d_h} |a_j(x_i)|$$
In the overcomplete case ($d_h > D$), sparsity prevents the trivial identity solution. Each input activates a different subset of hidden units, giving a distributed code.
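A minimal numpy implementation of the KL sparsity penalty (with clipping added for numerical safety, an assumption beyond the formula itself) behaves as expected: it vanishes when every unit's mean activation hits the target $\rho$ and grows as activations drift away.

```python
import numpy as np

def kl_sparsity(rho, rho_hat):
    """Bernoulli KL(rho || rho_hat_j), summed over hidden units."""
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)   # guard against log(0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rho = 0.05
on_target = kl_sparsity(rho, np.full(4, rho))   # every unit at the target
drifting = kl_sparsity(rho, np.full(4, 0.1))    # units mildly too active
saturated = kl_sparsity(rho, np.full(4, 0.5))   # units far too active
print(on_target, drifting, saturated)           # 0.0 < drifting < saturated
```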
6. Contractive Autoencoders
Contractive autoencoders (Rifai et al., 2011) add a penalty on the Frobenius norm of the encoder Jacobian, encouraging the representation to be locally invariant to small perturbations:
$$\mathcal{L}_{\text{CAE}} = \mathcal{L}_{\text{recon}} + \lambda \left\|\frac{\partial f_\phi(x)}{\partial x}\right\|_F^2 = \mathcal{L}_{\text{recon}} + \lambda \sum_{ij} \left(\frac{\partial z_j}{\partial x_i}\right)^2$$
For a single-layer encoder $f(x) = \sigma(Wx + b)$ with sigmoid activation, the Jacobian is $J_f = \text{diag}(\sigma'(Wx+b)) \cdot W$, and the penalty becomes:
$$\|J_f\|_F^2 = \sum_{j=1}^{d} a_j^2(1 - a_j)^2 \sum_{i=1}^{D} W_{ji}^2$$
where $a_j = \sigma(w_j^\top x + b_j)$, using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. This penalises sensitivity to input variations orthogonal to the data manifold: the representation is encouraged to change only along directions of genuine data variation.
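For a single-layer sigmoid encoder, $\sigma'(z) = \sigma(z)(1-\sigma(z))$ implies the penalty reduces to $\sum_j a_j^2(1-a_j)^2 \sum_i W_{ji}^2$. The sketch below checks this closed form against the Jacobian computed directly (all dimensions and weights are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

D, d = 8, 4
W = rng.normal(size=(d, D))
b = rng.normal(size=d)
x = rng.normal(size=D)

a = sigmoid(W @ x + b)
# Direct Jacobian: J_f = diag(sigma'(Wx + b)) W, with sigma' = a(1 - a).
J = np.diag(a * (1 - a)) @ W

# Closed form: ||J_f||_F^2 = sum_j a_j^2 (1 - a_j)^2 sum_i W_ji^2.
closed = np.sum((a * (1 - a)) ** 2 * np.sum(W ** 2, axis=1))
print(np.isclose(np.sum(J ** 2), closed))   # True
```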
7. Anomaly Detection in Physics
One of the most powerful applications of autoencoders in science is anomaly detection. The key idea: train an autoencoder on "normal" data (e.g., known Standard Model physics). Anomalous events (potential new physics) will have high reconstruction error because they lie far from the learned manifold.
Anomaly Score
The anomaly score for a new observation $x$ is simply the reconstruction error:
$$s(x) = \|x - g_\theta(f_\phi(x))\|^2$$
A threshold $\tau$ is set (e.g., at the 99th percentile of training errors). Points with $s(x) > \tau$ are flagged as anomalies.
This approach has been applied at the LHC to search for beyond-Standard-Model physics in a model-independent way, in gravitational wave detection, and in materials science to identify unusual crystal structures.
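Here is a minimal sketch of the recipe, using the closed-form linear autoencoder (the PCA projection from Section 2) as the model; the "normal" data is synthetic and lies near a 2-D plane in 10-D, while anomalies are drawn isotropically, so they sit far off the learned manifold.

```python
import numpy as np

rng = np.random.default_rng(4)

# "Normal" training data near a 2-D plane in 10-D; anomalies are isotropic.
D, d, N = 10, 2, 2000
basis, _ = np.linalg.qr(rng.normal(size=(D, d)))   # orthonormal plane basis
normal = rng.normal(size=(N, d)) @ basis.T + 0.05 * rng.normal(size=(N, D))
anomalies = rng.normal(size=(50, D))

# Fit the linear autoencoder in closed form (projection onto top-d PCs).
mu = normal.mean(axis=0)
Xc = normal - mu
C = Xc.T @ Xc / N
_, U = np.linalg.eigh(C)
P = U[:, -d:] @ U[:, -d:].T

def score(x):
    """Anomaly score s(x) = squared reconstruction error."""
    xc = x - mu
    return np.sum((xc - xc @ P) ** 2, axis=-1)

# Threshold at the 99th percentile of training scores.
tau = np.percentile(score(normal), 99)
flagged = np.mean(score(anomalies) > tau)
print(f"fraction of anomalies flagged: {flagged:.2f}")
```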
8. Deep Autoencoder Architectures
Convolutional AE
For image and spatial field data, convolutional autoencoders use conv layers (with stride) in the encoder and transposed convolutions (upsampling) in the decoder. This preserves spatial structure and dramatically reduces parameters compared to fully connected layers.
Encoder: Conv $\to$ ReLU $\to$ Pool, repeated. Decoder: Upsample $\to$ Conv $\to$ ReLU, repeated.
Graph AE
For molecular or network data, graph autoencoders use graph neural network layers. The encoder produces node-level or graph-level embeddings via message passing:
$$h_v^{(l+1)} = \sigma\left(W^{(l)} \frac{1}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} h_u^{(l)}\right)$$
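A single mean-aggregation message-passing step can be written in a few lines of numpy. Everything below is a toy assumption: a 4-node graph, random features and weights, and ReLU standing in for $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy undirected graph on 4 nodes with 3-dim node features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))       # h_v^(l), one row per node
W = rng.normal(size=(3, 3))       # W^(l)

# h_v^(l+1) = ReLU(W * mean of neighbour features h_u^(l)).
deg = A.sum(axis=1, keepdims=True)          # |N(v)| per node
H_next = np.maximum((A / deg) @ H @ W.T, 0) # row v: ReLU(W mean_u h_u)
print(H_next.shape)                         # (4, 3)
```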
Tied Weights
A common regularisation technique is to tie the decoder weights to the transpose of the encoder weights: $W_d = W_e^\top$. This halves the number of parameters and enforces that the encoder and decoder share the same feature basis. In the linear case, this gives $P = W_e^\top W_e$, which is an orthogonal projection exactly when the rows of $W_e$ are orthonormal ($W_e W_e^\top = I$); for general $W_e$ it is symmetric but not a projection.
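As a quick numerical check of the tied-weight reconstruction map (numpy sketch with a random toy matrix): $P = W_e^\top W_e$ is idempotent, i.e. an orthogonal projection, precisely when the rows of $W_e$ are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(6)

# Tied weights: P = W_e^T W_e for a random (non-orthonormal) W_e.
W = rng.normal(size=(3, 8))
P = W.T @ W
print(np.allclose(P @ P, P))     # False: not idempotent in general

# Orthonormalise the rows of W_e; now P is an orthogonal projection.
Q, _ = np.linalg.qr(W.T)         # Q has orthonormal columns
W_orth = Q.T                     # rows of W_orth are orthonormal
P_orth = W_orth.T @ W_orth
print(np.allclose(P_orth @ P_orth, P_orth))   # True: P^2 = P
```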
9. Comparison: AE vs. PCA vs. t-SNE vs. UMAP
| Method | Type | Invertible? | Scalability |
|---|---|---|---|
| PCA | Linear | Yes (exact) | Excellent |
| Linear AE | Linear | Yes (= PCA) | Excellent |
| Deep AE | Nonlinear | Yes (decoder) | Good (GPU) |
| t-SNE | Nonlinear | No | Moderate |
| UMAP | Nonlinear | Approximate | Good |
The key advantage of autoencoders over t-SNE and UMAP is that the decoder provides a generative model: we can sample points in latent space and decode them to generate new data. This is the foundation for variational autoencoders (VAEs).
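The generative-decoder advantage shows up even in a toy setting. Below, data on a circle (a curved 1-D manifold) is compressed to $d = 1$ by PCA and by an idealised nonlinear autoencoder whose encoder/decoder pair is written down by hand (the angle parametrisation) rather than learned; the decoder then generates new on-manifold samples.

```python
import numpy as np

rng = np.random.default_rng(5)

# Data on a curved 1-D manifold (unit circle in 2-D) plus small noise.
theta = rng.uniform(0, 2 * np.pi, size=1000)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
X += 0.01 * rng.normal(size=X.shape)

# PCA with d=1: project onto the top principal direction.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
X_pca = (X - mu) @ Vt[:1].T @ Vt[:1] + mu

# Idealised nonlinear AE with d=1: encoder = angle, decoder = (cos, sin).
encode = lambda x: np.arctan2(x[:, 1], x[:, 0])
decode = lambda z: np.stack([np.cos(z), np.sin(z)], axis=1)
X_ae = decode(encode(X))

err = lambda Xhat: np.mean(np.sum((X - Xhat) ** 2, axis=1))
print(f"PCA error: {err(X_pca):.3f}, nonlinear AE error: {err(X_ae):.5f}")

# The decoder is generative: sample latent codes, decode new circle points.
samples = decode(rng.uniform(0, 2 * np.pi, size=5))
print(np.linalg.norm(samples, axis=1))   # all 1.0 (on the manifold)
```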
Chapter Summary
- Autoencoders learn compressed representations by minimising reconstruction error through a bottleneck.
- Linear AE = PCA: with a linear encoder and decoder, the optimal solution projects onto the top eigenvectors of the covariance matrix (proved via Eckart-Young).
- Denoising AEs learn to reconstruct clean data from corrupted inputs, implicitly learning the score function $\nabla \log p(x)$.
- Sparse AEs encourage few active latent units per input, learning overcomplete but interpretable representations.
- Contractive AEs penalise the encoder Jacobian norm, learning locally invariant features.
- Anomaly detection uses high reconstruction error to flag inputs that deviate from the training manifold, a model-independent approach to new-physics searches.
- Deep nonlinear AEs can substantially outperform PCA on data with curved manifold structure.