ML for Science/Part II: Deep Learning

Generative Models (VAEs & GANs)

Learning to generate data — from variational inference to adversarial training and diffusion

Introduction

Generative models learn the underlying probability distribution of data and can generate new samples from it. Unlike discriminative models that learn $p(y|\mathbf{x})$, generative models learn $p(\mathbf{x})$ or $p(\mathbf{x}|y)$. In science, generative models enable data augmentation, simulation acceleration, anomaly detection, and sampling from complex distributions.

Key Topics

  • 1. Latent Variable Models
  • 2. Variational Autoencoders (VAEs): Deriving the ELBO
  • 3. The Reparameterization Trick
  • 4. Generative Adversarial Networks (GANs)
  • 5. Normalizing Flows
  • 6. Diffusion Models
  • 7. Scientific Applications

1. Latent Variable Models

We assume data $\mathbf{x}$ is generated by first sampling a latent variable $\mathbf{z} \sim p(\mathbf{z})$ from a simple prior (e.g., Gaussian), then sampling $\mathbf{x} \sim p_\theta(\mathbf{x}|\mathbf{z})$ from a decoder:

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})\,d\mathbf{z}$$

The Intractability Problem

We want to maximize the marginal likelihood $\log p_\theta(\mathbf{x})$, but the integral is intractable for neural network decoders. The posterior $p_\theta(\mathbf{z}|\mathbf{x})$ is also intractable:

$$p_\theta(\mathbf{z}|\mathbf{x}) = \frac{p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p_\theta(\mathbf{x})} \quad \text{(intractable denominator)}$$

The VAE resolves this by introducing an approximate posterior $q_\phi(\mathbf{z}|\mathbf{x})$ (the encoder) and deriving a tractable lower bound.

2. VAE: Deriving the ELBO

Derivation from KL Divergence

Start with the KL divergence between the approximate and true posterior:

$$D_{\text{KL}}\left(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x})\right) = \mathbb{E}_{q_\phi}\left[\log\frac{q_\phi(\mathbf{z}|\mathbf{x})}{p_\theta(\mathbf{z}|\mathbf{x})}\right] \geq 0$$

Substituting Bayes' rule $p_\theta(\mathbf{z}|\mathbf{x}) = p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})/p_\theta(\mathbf{x})$:

$$D_{\text{KL}} = \mathbb{E}_{q_\phi}\left[\log q_\phi(\mathbf{z}|\mathbf{x}) - \log p_\theta(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z}) + \log p_\theta(\mathbf{x})\right]$$

Rearranging (noting $\log p_\theta(\mathbf{x})$ is constant w.r.t. $\mathbf{z}$):

$$\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})) + D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$$

Since the last term is non-negative, we get the Evidence Lower Bound (ELBO):

$$\boxed{\log p_\theta(\mathbf{x}) \geq \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))}_{\text{regularization}} = \text{ELBO}}$$

Interpreting the ELBO Terms

  • Reconstruction term $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})]$: How well can the decoder reconstruct $\mathbf{x}$ from sampled $\mathbf{z}$? For a Gaussian decoder, this is the negative MSE up to a constant: $-\|\mathbf{x} - \hat{\mathbf{x}}\|^2/(2\sigma^2)$.
  • KL regularization $D_{\text{KL}}(q_\phi \| p)$: Keeps the encoder distribution close to the prior. Prevents the latent space from collapsing to point masses.

Closed-Form KL for Gaussians

When $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$:

$$\boxed{D_{\text{KL}} = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)}$$

Derivation: Using $D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\|\mathcal{N}(\mathbf{0},\mathbf{I})) = \frac{1}{2}[\text{tr}(\boldsymbol{\Sigma}) + \boldsymbol{\mu}^T\boldsymbol{\mu} - d - \log\det\boldsymbol{\Sigma}]$ with diagonal $\boldsymbol{\Sigma}$.
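The closed form can be sanity-checked against a Monte Carlo estimate of the KL divergence. A minimal NumPy sketch (the values of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy diagonal-Gaussian posterior parameters for a d = 3 latent space.
mu = np.array([0.5, -1.0, 0.2])
sigma = np.array([0.8, 1.2, 0.5])

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).
kl_closed = -0.5 * np.sum(1 + np.log(sigma**2) - mu**2 - sigma**2)

# Monte Carlo check: KL = E_q[ log q(z) - log p(z) ].
z = mu + sigma * rng.standard_normal((200_000, 3))
log_q = np.sum(-0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma)
               - 0.5 * np.log(2 * np.pi), axis=1)
log_p = np.sum(-0.5 * z**2 - 0.5 * np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(f"closed form: {kl_closed:.4f}  Monte Carlo: {kl_mc:.4f}")  # both ≈ 1.044
```

The two estimates agree to sampling error, confirming the boxed formula term by term.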

3. The Reparameterization Trick

The Problem

The ELBO involves an expectation over $q_\phi(\mathbf{z}|\mathbf{x})$. We approximate this by sampling, but sampling is not differentiable — we cannot backpropagate through $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$.

The Solution

Instead of sampling $\mathbf{z}$ directly, reparameterize as a deterministic function of the parameters and an independent noise variable:

$$\boxed{\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}$$

Now $\mathbf{z}$ is a differentiable function of $\phi$ (through $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$), and the stochasticity comes from $\boldsymbol{\epsilon}$ which does not depend on $\phi$. Gradients flow through the deterministic path.

The expectation becomes $\mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})}[\cdot]$, which we estimate with a single sample per data point in each mini-batch.
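A quick numerical illustration of the pathwise (reparameterized) gradient, for a scalar latent and the toy function $f(z) = z^2$ whose expectation has a known analytic gradient (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.5, 0.7
eps = rng.standard_normal(500_000)

# Reparameterize: z is a deterministic, differentiable function of (mu, sigma);
# all randomness lives in eps, which does not depend on the parameters.
z = mu + sigma * eps

# Pathwise gradients of E[f(z)] for f(z) = z^2, via the chain rule:
# df/dmu = 2z * dz/dmu = 2z,   df/dsigma = 2z * dz/dsigma = 2z * eps.
grad_mu = np.mean(2 * z)
grad_sigma = np.mean(2 * z * eps)

# Analytic check: E[z^2] = mu^2 + sigma^2, so the true gradients are
# d/dmu = 2*mu = 3.0 and d/dsigma = 2*sigma = 1.4.
print(grad_mu, grad_sigma)  # ≈ 3.0, ≈ 1.4
```

Autodiff frameworks compute exactly these pathwise gradients when $\mathbf{z}$ is built from $\boldsymbol{\epsilon}$ this way.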

4. Generative Adversarial Networks (GANs)

GANs (Goodfellow et al., 2014) learn to generate data through a two-player minimax game between a generator $G$ and a discriminator $D$.

The Minimax Objective

$$\boxed{\min_G \max_D \; V(D, G) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z}\sim p(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))]}$$

Discriminator ($\max_D$): Tries to correctly classify real data as real ($D(\mathbf{x}) \to 1$) and fake data as fake ($D(G(\mathbf{z})) \to 0$).

Generator ($\min_G$): Tries to fool the discriminator by generating data that $D$ classifies as real ($D(G(\mathbf{z})) \to 1$).

Optimal Discriminator

For fixed $G$, the optimal discriminator is:

$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x})}$$

Proof: The value function can be written as:

$$V(D,G) = \int \left[p_{\text{data}}(\mathbf{x})\log D(\mathbf{x}) + p_G(\mathbf{x})\log(1-D(\mathbf{x}))\right]d\mathbf{x}$$

Taking the derivative w.r.t. $D(\mathbf{x})$ and setting to zero gives the result. The function$a\log y + b\log(1-y)$ is maximized at $y = a/(a+b)$.
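The pointwise maximization can be checked numerically on a grid. A minimal sketch, with illustrative values for $p_{\text{data}}(\mathbf{x})$ and $p_G(\mathbf{x})$ at one fixed point:

```python
import numpy as np

# At a fixed point x, let a = p_data(x) and b = p_G(x) (illustrative values).
a, b = 0.7, 0.3
y = np.linspace(1e-4, 1 - 1e-4, 100_000)

# The pointwise objective a*log(y) + b*log(1 - y) from the integrand of V(D, G).
value = a * np.log(y) + b * np.log(1 - y)
y_star = y[np.argmax(value)]

print(y_star, a / (a + b))  # both ≈ 0.7
```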

Global Optimum

Substituting $D^*$ back into $V$:

$$V(D^*, G) = -\log 4 + 2\,D_{\text{JS}}(p_{\text{data}} \| p_G)$$

where $D_{\text{JS}}$ is the Jensen-Shannon divergence:

$$D_{\text{JS}}(p\|q) = \frac{1}{2}D_{\text{KL}}\left(p\left\|\frac{p+q}{2}\right.\right) + \frac{1}{2}D_{\text{KL}}\left(q\left\|\frac{p+q}{2}\right.\right)$$

The global minimum is achieved when $p_G = p_{\text{data}}$, giving $D_{\text{JS}} = 0$ and $D^*(\mathbf{x}) = 1/2$ everywhere.

Training Instabilities and Remedies

  • Mode collapse: Generator produces only a few modes of the data distribution
  • Vanishing gradients: If $D$ is too strong, $\log(1-D(G(\mathbf{z})))$ saturates near zero and its gradient vanishes, leaving the generator with no learning signal
  • Non-saturating loss: Train $G$ to maximize $\log D(G(\mathbf{z}))$ instead (provides stronger gradients early)
  • Wasserstein GAN: Uses Earth Mover distance instead of JS divergence, providing smoother gradients
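The saturation problem can be made concrete by writing the generator losses in terms of the discriminator logit $s$ (with $D = \sigma(s)$); this parameterization is an illustrative assumption, not part of the original formulation above:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# s is the discriminator logit on a generated sample; very negative s means
# D confidently labels the sample as fake (typical early in training).
s = -6.0

# Saturating generator loss log(1 - D): gradient w.r.t. s is -sigmoid(s).
grad_saturating = -sigmoid(s)
# Non-saturating loss -log D: gradient w.r.t. s is sigmoid(s) - 1.
grad_nonsaturating = sigmoid(s) - 1.0

print(grad_saturating)     # ≈ -0.0025 (vanishing)
print(grad_nonsaturating)  # ≈ -0.9975 (strong signal)
```

When $D$ confidently rejects a sample, the saturating loss gives a near-zero gradient while the non-saturating loss gives a gradient of magnitude near one — the motivation for the non-saturating variant.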

5. Normalizing Flows

Normalizing flows transform a simple base distribution through a sequence of invertible maps, providing exact likelihood computation.

Change of Variables

If $\mathbf{x} = f(\mathbf{z})$ where $f$ is invertible and $\mathbf{z} \sim p_Z(\mathbf{z})$:

$$\boxed{p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \left|\det\frac{\partial f^{-1}}{\partial \mathbf{x}}\right| = p_Z(f^{-1}(\mathbf{x})) \left|\det\frac{\partial f}{\partial \mathbf{z}}\right|^{-1}}$$

Composition of Flows

A normalizing flow composes $K$ invertible transformations:

$$\mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z}_0), \quad \mathbf{z}_0 \sim p_0(\mathbf{z}_0)$$

The log-likelihood is:

$$\log p(\mathbf{x}) = \log p_0(\mathbf{z}_0) - \sum_{k=1}^{K}\log\left|\det\frac{\partial f_k}{\partial \mathbf{z}_{k-1}}\right|$$

The key design challenge is making the Jacobian determinant efficient to compute. Coupling layers (RealNVP, Glow) achieve $O(d)$ Jacobian computation.
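The composed log-likelihood formula can be verified on a toy two-step affine flow, where the composition is itself a Gaussian with a known density (the coefficients are illustrative):

```python
import numpy as np

# Two invertible affine maps f1(z) = 3z + 1 and f2(u) = 0.5u - 3.
# Composition: x = f2(f1(z)) = 1.5z - 2.5, so with z ~ N(0, 1) we have
# X ~ N(-2.5, 1.5^2), giving an analytic density to check against.

def log_normal(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

x = np.array([-4.0, -2.5, 0.0, 1.0])

# Invert the flow step by step to recover z0.
z1 = (x + 3.0) / 0.5   # f2^{-1}
z0 = (z1 - 1.0) / 3.0  # f1^{-1}

# log p(x) = log p0(z0) - sum_k log|det dfk/dz|  (here |3| and |0.5|).
log_px_flow = log_normal(z0, 0.0, 1.0) - (np.log(3.0) + np.log(0.5))

# Analytic density of N(-2.5, 1.5^2).
log_px_true = log_normal(x, -2.5, 1.5)

print(np.max(np.abs(log_px_flow - log_px_true)))  # ≈ 0 (floating-point roundoff)
```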

6. Diffusion Models

Diffusion models (Ho et al., 2020; Song & Ermon, 2019) generate data by learning to reverse a gradual noising process. They now produce the highest quality samples across many domains.

Forward Process (Noising)

Gradually add Gaussian noise over $T$ steps with variance schedule $\beta_1, \ldots, \beta_T$:

$$q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t\mathbf{I})$$

Using $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, we can sample any step directly:

$$\boxed{q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})}$$
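The closed-form marginal can be checked by iterating the one-step kernel and comparing the resulting moments. A scalar toy example with an illustrative linear $\beta$ schedule:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 50
betas = np.linspace(1e-4, 0.2, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = np.full(100_000, 2.0)  # many copies of the same scalar "data point"

# Iterate the one-step kernel q(x_t | x_{t-1}) for all T steps.
x = x0.copy()
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# Closed form predicts x_T ~ N(sqrt(alpha_bar_T) * x0, 1 - alpha_bar_T).
print(x.mean(), np.sqrt(alpha_bar[-1]) * 2.0)  # should agree
print(x.var(), 1.0 - alpha_bar[-1])            # should agree
```

The empirical mean and variance after $T$ steps match the closed form, which is what allows training to sample a random $t$ without simulating the chain.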

Reverse Process (Denoising)

Learn a neural network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ to predict the noise:

$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2\mathbf{I})$$

The training loss (simplified) is:

$$\boxed{\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}, t)\|^2\right]}$$

This is derived from the variational bound and simplifies to predicting the noise $\boldsymbol{\epsilon}$ added to the clean data.

Score-Based Perspective

The connection to score matching: the model learns the score function (gradient of log-density):

$$\nabla_{\mathbf{x}}\log p_t(\mathbf{x}) \approx -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}, t)}{\sqrt{1-\bar{\alpha}_t}}$$

Sampling then corresponds to Langevin dynamics, following the score to high-density regions.
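For the conditional $q(\mathbf{x}_t|\mathbf{x}_0)$ the score identity is exact, with the trained network $\boldsymbol{\epsilon}_\theta$ standing in for the true noise; a quick numerical check (values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_bar = 0.6
x0 = 1.2

# Sample x_t from the closed-form forward marginal via its noise variable.
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# Score of the Gaussian q(x_t | x_0) = N(sqrt(ab)*x0, 1 - ab):
score = -(x_t - np.sqrt(alpha_bar) * x0) / (1 - alpha_bar)

# The claimed identity: score = -eps / sqrt(1 - alpha_bar).
print(np.max(np.abs(score + eps / np.sqrt(1 - alpha_bar))))  # 0 up to roundoff
```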

7. Scientific Applications

Molecular Design

VAEs generate novel drug-like molecules by learning a continuous latent space of molecular structures. Interpolation in latent space produces smooth transitions between molecules, enabling property-guided optimization.

Simulation Acceleration

GANs generate synthetic physics simulation outputs (e.g., particle showers in calorimeters) orders of magnitude faster than full Monte Carlo simulation. CaloGAN and similar models produce detector-level data for high-energy physics.

Cosmological Simulation

Generative models create synthetic dark matter density fields and galaxy catalogs, accelerating Bayesian inference for cosmological parameters by replacing expensive N-body simulations.

Anomaly Detection

VAEs detect anomalies via high reconstruction error: data points unlike the training distribution yield poor reconstructions. This is used for new physics searches in particle collider data and for detecting unusual astronomical transients.

Boltzmann Generators

Normalizing flows can be trained to sample from Boltzmann distributions in statistical mechanics (Noé et al., 2019). Given a target energy function $U(\mathbf{x})$:

$$p(\mathbf{x}) = \frac{e^{-U(\mathbf{x})/k_BT}}{Z}$$

A normalizing flow trained with the loss $D_{\text{KL}}(q_\theta\|p)$ generates independent samples from the target distribution, bypassing the autocorrelation problem of MCMC methods. This enables efficient sampling of molecular configurations.
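A minimal 1D sketch of this objective, assuming $k_BT = 1$, a harmonic toy energy, and a single affine flow layer (all illustrative choices): the reverse KL, estimated by sampling the base distribution, is minimized exactly when the flow maps onto the Boltzmann density.

```python
import numpy as np

rng = np.random.default_rng(4)

def U(x):
    # Harmonic toy energy: exp(-U) is proportional to N(2, 0.5^2) (k_B T = 1).
    return (x - 2.0) ** 2 / (2 * 0.5**2)

def reverse_kl(a, b, z):
    # KL(q_theta || p) up to the constant log Z, for the affine flow x = a*z + b:
    # E_z[ log q_theta(x) + U(x) ], with log q_theta(x) = log p0(z) - log|a|
    # by the change-of-variables formula.
    log_p0 = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
    x = a * z + b
    return np.mean(log_p0 - np.log(abs(a)) + U(x))

z = rng.standard_normal(200_000)
print(reverse_kl(0.5, 2.0, z))  # matched flow: ≈ -0.226 (= -log Z, KL term zero)
print(reverse_kl(1.0, 0.0, z))  # mismatched flow: strictly larger
```

Once trained, drawing $z \sim p_0$ and pushing it through the flow yields i.i.d. samples from the Boltzmann distribution, with no MCMC autocorrelation.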

8. Comparison of Generative Models

| Property | VAE | GAN | Flow | Diffusion |
|---|---|---|---|---|
| Exact likelihood | Lower bound | No | Yes | Lower bound |
| Sample quality | Blurry | Sharp | Good | Excellent |
| Training | Stable | Unstable | Stable | Stable |
| Latent space | Structured | Unstructured | Structured | N/A |
| Mode coverage | Good | Mode collapse | Good | Excellent |

9. Python Simulation: VAE & GAN

This simulation implements a VAE and GAN on a 2D mixture of Gaussians, demonstrating the ELBO, reparameterization trick, and adversarial training.

VAE & GAN on 2D Gaussian Mixture


Summary

  • VAE ELBO: $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$
  • Reparameterization: $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ enables backprop through sampling
  • GAN objective: $\min_G\max_D\;\mathbb{E}[\log D(\mathbf{x})] + \mathbb{E}[\log(1-D(G(\mathbf{z})))]$
  • Normalizing flows: Exact likelihood via change of variables and invertible transforms
  • Diffusion: Learn to denoise; train by predicting added noise at random timesteps
  • Science: Generative models accelerate simulations, augment data, and sample from complex distributions