Part IV · Chapter 12

Autoencoders & Variational Autoencoders

Autoencoders learn compact representations by compressing data into a latent code and reconstructing it. Variational Autoencoders (VAEs) add a probabilistic prior on the latent space, enabling principled generation of new samples. We derive the VAE objective from the evidence lower bound, prove the reparameterisation trick, and compute the KL divergence in closed form.

1. Vanilla Autoencoder

An autoencoder is an encoder-decoder pair trained to reconstruct its input:

\[ \mathbf{z} = f_\phi(\mathbf{x}), \qquad \hat{\mathbf{x}} = g_\theta(\mathbf{z}), \qquad \mathcal{L}_\mathrm{AE} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2 \]

The bottleneck (latent dimension \(d \ll D\)) forces the encoder to learn a compressed, informative representation. The decoder must reconstruct the input from this code.

A denoising autoencoder corrupts the input during training: \(\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\epsilon},\;\boldsymbol{\epsilon} \sim \mathcal{N}(0,\sigma^2\mathbf{I})\), then minimises \(\|\mathbf{x} - g_\theta(f_\phi(\tilde{\mathbf{x}}))\|^2\). This encourages learning robust features that are invariant to small perturbations.
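The reconstruction objective above can be sketched in a few lines of NumPy. This is a minimal linear autoencoder trained by gradient descent on synthetic low-rank data; the dataset, dimensions, and learning rate are illustrative assumptions, not taken from the chapter's script:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 500 points in D = 5 dimensions lying near a d = 2 subspace
basis = rng.standard_normal((2, 5))
X = rng.standard_normal((500, 2)) @ basis + 0.05 * rng.standard_normal((500, 5))

W_enc = 0.1 * rng.standard_normal((2, 5))   # f_phi: R^D -> R^d
W_dec = 0.1 * rng.standard_normal((5, 2))   # g_theta: R^d -> R^D
lr = 0.01

mse0 = np.mean((X - X @ W_enc.T @ W_dec.T) ** 2)
for _ in range(3000):
    Z = X @ W_enc.T                  # latent codes, shape (500, 2)
    Xhat = Z @ W_dec.T               # reconstructions, shape (500, 5)
    G = (Xhat - X) / len(X)          # grad of 0.5 * mean_n ||x - xhat||^2
    g_enc = (G @ W_dec).T @ X        # chain rule through the decoder
    g_dec = G.T @ Z
    W_enc -= lr * g_enc
    W_dec -= lr * g_dec
mse1 = np.mean((X - X @ W_enc.T @ W_dec.T) ** 2)
print(f"reconstruction MSE: {mse0:.3f} -> {mse1:.3f}")
```

Because the bottleneck has \(d = 2\), the trained pair recovers the 2-D subspace the data lies in (for a linear autoencoder this coincides with the top principal subspace).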

2. Variational Autoencoder: Full Derivation

The VAE (Kingma & Welling 2014) is a latent variable model with prior \(p(\mathbf{z}) = \mathcal{N}(\mathbf{0},\mathbf{I})\) and likelihood \(p_\theta(\mathbf{x}|\mathbf{z})\). We want to maximise the log-likelihood \(\log p_\theta(\mathbf{x})\), but this requires integrating over all \(\mathbf{z}\):

\[ \log p_\theta(\mathbf{x}) = \log\int p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z} \]

This integral is intractable. Introduce an approximate posterior \(q_\phi(\mathbf{z}|\mathbf{x})\) (the encoder) and apply the identity:

ELBO decomposition

\[ \log p_\theta(\mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \mathrm{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x})\,\big\|\,p(\mathbf{z})\right)}_{\text{ELBO}} + \mathrm{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x})\,\big\|\,p(\mathbf{z}|\mathbf{x})\right) \]

Proof: multiply and divide inside the log by \(q_\phi(\mathbf{z}|\mathbf{x})\):

\[ \log p_\theta(\mathbf{x}) = \log \mathbb{E}_{q_\phi}\!\left[\frac{p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right] \geq \mathbb{E}_{q_\phi}\!\left[\log \frac{p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right] \]

by Jensen's inequality (log is concave). The lower bound equals \(\log p_\theta(\mathbf{x})\) iff \(q_\phi = p(\mathbf{z}|\mathbf{x})\).

ELBO rewritten

\[ \mathcal{L}_\mathrm{ELBO} = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right]}_{\text{reconstruction}} - \underbrace{\mathrm{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x})\,\big\|\,p(\mathbf{z})\right)}_{\text{regularisation}} \]

The reconstruction term encourages the decoder to reproduce the input. The KL term regularises the approximate posterior toward the prior, preventing overfitting to individual data points.

Gaussian encoder: \(q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x})))\)

The KL between two Gaussians has a closed form. For diagonal \(q = \mathcal{N}(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2))\) and \(p = \mathcal{N}(\mathbf{0}, \mathbf{I})\):

\[ \mathrm{KL}(q \| p) = \frac{1}{2}\sum_{j=1}^d \left(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2\right) \]

Derivation: \(\mathrm{KL}(q\|p) = \mathbb{E}_q[\log q - \log p] = -\frac{1}{2}\sum_j(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2)\), using the entropy of a Gaussian and the log-normaliser of \(p\).
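The closed form can be sanity-checked against a Monte Carlo estimate of \(\mathbb{E}_q[\log q(\mathbf{z}) - \log p(\mathbf{z})]\). A small NumPy check, with arbitrary test values for \(\boldsymbol{\mu}\) and \(\boldsymbol{\sigma}^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])       # arbitrary posterior mean
sigma2 = np.array([0.4, 1.5, 0.9])    # arbitrary posterior variances

# Closed form: 0.5 * sum(sigma_j^2 + mu_j^2 - 1 - log sigma_j^2)
kl_closed = 0.5 * np.sum(sigma2 + mu**2 - 1 - np.log(sigma2))

# Monte Carlo estimate of E_q[log q(z) - log p(z)]
z = mu + np.sqrt(sigma2) * rng.standard_normal((1_000_000, 3))
log_q = -0.5 * np.sum((z - mu) ** 2 / sigma2 + np.log(2 * np.pi * sigma2), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = (log_q - log_p).mean()

print(f"closed form: {kl_closed:.4f}, Monte Carlo: {kl_mc:.4f}")
```

The two values agree to within Monte Carlo error, and in practice the closed form is what the training loss uses: no samples are needed for the KL term.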

Reparameterisation trick: why it enables gradients

We need to backpropagate through the sampling operation \(\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})\). NaΓ―ve Monte Carlo sampling is not differentiable w.r.t. \(\phi\) because the distribution itself depends on \(\phi\).

Reparameterisation moves the randomness into a parameter-free noise variable, so that the sample becomes a deterministic function of \(\phi\):

\[ \mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \]

Now \(\mathbf{z}\) is a deterministic function of \((\phi, \boldsymbol{\varepsilon})\). Gradients \(\partial \mathbf{z}/\partial \boldsymbol{\mu}_\phi = \mathbf{I}\) and \(\partial \mathbf{z}/\partial \boldsymbol{\sigma}_\phi = \mathrm{diag}(\boldsymbol{\varepsilon})\) flow through \(\mathbf{z}\) to \(\phi\) via the chain rule.
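A quick numerical illustration: for the scalar objective \(\mathbb{E}[z^2]\) with \(z = \mu + \sigma\varepsilon\), the pathwise (reparameterised) gradient estimates should match the analytic gradients \(\partial_\mu \mathbb{E}[z^2] = 2\mu\) and \(\partial_\sigma \mathbb{E}[z^2] = 2\sigma\). The values \(\mu = 1.5\), \(\sigma = 0.8\) are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8                  # arbitrary test values
eps = rng.standard_normal(1_000_000)
z = mu + sigma * eps                  # reparameterised samples

# Pathwise gradients: d(z^2)/d mu = 2z, d(z^2)/d sigma = 2z * eps
grad_mu = np.mean(2 * z)              # analytic value: 2 * mu = 3.0
grad_sigma = np.mean(2 * z * eps)     # analytic value: 2 * sigma = 1.6
print(f"grad wrt mu: {grad_mu:.3f}, grad wrt sigma: {grad_sigma:.3f}")
```

Without reparameterisation there is no path from \(\mu\) or \(\sigma\) to the samples, and an estimator like this would require the higher-variance score-function (REINFORCE) trick instead.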

VAE Training Objective (maximise)

\[ \mathcal{L}_\mathrm{VAE}(\phi,\theta;\mathbf{x}) = \mathbb{E}_{\boldsymbol{\varepsilon}\sim\mathcal{N}(0,\mathbf{I})}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \frac{1}{2}\sum_j\!\left(\sigma_{\phi,j}^2 + \mu_{\phi,j}^2 - 1 - \log\sigma_{\phi,j}^2\right) \]

For Gaussian likelihood \(p_\theta(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\hat{\mathbf{x}}, \mathbf{I})\), the reconstruction term becomes \(-\|\mathbf{x}-\hat{\mathbf{x}}\|^2/2\) (MSE loss up to constants).

3. VAE Architecture Diagram

[Figure: VAE architecture. Input \(\mathbf{x}\) → encoder \(q_\phi(\mathbf{z}|\mathbf{x})\) → \(\boldsymbol{\mu}_\phi\) and \(\log\boldsymbol{\sigma}^2_\phi\) → reparameterisation \(\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}\), \(\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})\) (enables backprop) → decoder \(p_\theta(\mathbf{x}|\mathbf{z})\) → output \(\hat{\mathbf{x}}\).]

The encoder outputs \(\boldsymbol{\mu}_\phi\) and \(\log\boldsymbol{\sigma}^2_\phi\). The reparameterisation node samples \(\mathbf{z}\) differentiably, enabling gradients to flow from the decoder loss all the way back through to the encoder parameters.

4. Python: VAE on Synthetic 2D Data

Full NumPy VAE implementation with manual backpropagation through the reparameterisation trick. Trained on a 2D three-cluster dataset; we visualise the latent space, reconstructions, and new samples decoded from the prior \(\mathcal{N}(\mathbf{0},\mathbf{I})\).
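The full script is not reproduced in this excerpt. As a sketch of its core training step, here is a minimal VAE with linear encoder and decoder heads and manual backpropagation through the reparameterisation trick; the cluster centres, dimensions, and learning rate are illustrative assumptions (the chapter's script uses nonlinear networks and adds the visualisations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three-cluster 2D dataset (illustrative stand-in for the chapter's data)
centres = np.array([[2.0, 0.0], [-1.0, 2.0], [-1.0, -2.0]])
X = np.concatenate([c + 0.3 * rng.standard_normal((100, 2)) for c in centres])
N, D, d, lr = len(X), 2, 2, 0.05

We = 0.1 * rng.standard_normal((d, D)); be = np.zeros(d)  # encoder mean head
Wv = 0.1 * rng.standard_normal((d, D)); bv = np.zeros(d)  # encoder log-variance head
Wd = 0.1 * rng.standard_normal((D, d)); bd = np.zeros(D)  # decoder

losses = []
for _ in range(400):
    mu = X @ We.T + be
    logvar = X @ Wv.T + bv
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps                  # reparameterisation
    xhat = z @ Wd.T + bd
    recon = 0.5 * np.sum((X - xhat) ** 2) / N            # -log p(x|z) up to const.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1 - logvar) / N
    losses.append(recon + kl)
    # --- manual backprop (gradients of the batch-averaged negative ELBO) ---
    gxhat = (xhat - X) / N
    gz = gxhat @ Wd
    gmu = gz + mu / N                                    # recon path + dKL/dmu
    glv = gz * 0.5 * (z - mu) + 0.5 * (np.exp(logvar) - 1) / N
    Wd -= lr * gxhat.T @ z;  bd -= lr * gxhat.sum(0)
    We -= lr * gmu.T @ X;    be -= lr * gmu.sum(0)
    Wv -= lr * glv.T @ X;    bv -= lr * glv.sum(0)

print(f"negative ELBO: {losses[0]:.3f} -> {np.mean(losses[-20:]):.3f}")
```

Note the logvar gradient: since \(\mathbf{z} = \boldsymbol{\mu} + e^{\frac{1}{2}\log\boldsymbol{\sigma}^2} \odot \boldsymbol{\varepsilon}\), we have \(\partial \mathbf{z}/\partial \log\boldsymbol{\sigma}^2 = \frac{1}{2}(\mathbf{z} - \boldsymbol{\mu})\), which is exactly the `0.5 * (z - mu)` factor above.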
