ML for Science/Part II: Deep Learning

Generative Models (VAEs & GANs)

Learning to generate data — from variational inference to adversarial training and diffusion

Introduction

Generative models learn the underlying probability distribution of data and can generate new samples from it. Unlike discriminative models that learn $p(y|\mathbf{x})$, generative models learn $p(\mathbf{x})$ or $p(\mathbf{x}|y)$. In science, generative models enable data augmentation, simulation acceleration, anomaly detection, and sampling from complex distributions.

Key Topics

  • 1. Latent Variable Models
  • 2. Variational Autoencoders (VAEs): Deriving the ELBO
  • 3. The Reparameterization Trick
  • 4. Generative Adversarial Networks (GANs)
  • 5. Normalizing Flows
  • 6. Diffusion Models
  • 7. Scientific Applications

1. Latent Variable Models

We assume data $\mathbf{x}$ is generated by first sampling a latent variable $\mathbf{z} \sim p(\mathbf{z})$ from a simple prior (e.g., Gaussian), then sampling $\mathbf{x} \sim p_\theta(\mathbf{x}|\mathbf{z})$ from a decoder:

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})\,d\mathbf{z}$$

The Intractability Problem

We want to maximize the marginal likelihood $\log p_\theta(\mathbf{x})$, but the integral is intractable for neural network decoders. The posterior $p_\theta(\mathbf{z}|\mathbf{x})$ is also intractable:

$$p_\theta(\mathbf{z}|\mathbf{x}) = \frac{p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p_\theta(\mathbf{x})} \quad \text{(intractable denominator)}$$

The VAE resolves this by introducing an approximate posterior $q_\phi(\mathbf{z}|\mathbf{x})$ (the encoder) and deriving a tractable lower bound.

2. VAE: Deriving the ELBO

Derivation from KL Divergence

Start with the KL divergence between the approximate and true posterior:

$$D_{\text{KL}}\left(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x})\right) = \mathbb{E}_{q_\phi}\left[\log\frac{q_\phi(\mathbf{z}|\mathbf{x})}{p_\theta(\mathbf{z}|\mathbf{x})}\right] \geq 0$$

Substituting Bayes' rule $p_\theta(\mathbf{z}|\mathbf{x}) = p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})/p_\theta(\mathbf{x})$:

$$D_{\text{KL}} = \mathbb{E}_{q_\phi}\left[\log q_\phi(\mathbf{z}|\mathbf{x}) - \log p_\theta(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z}) + \log p_\theta(\mathbf{x})\right]$$

Rearranging (noting $\log p_\theta(\mathbf{x})$ is constant w.r.t. $\mathbf{z}$):

$$\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})) + D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$$

Since the last term is non-negative, we get the Evidence Lower Bound (ELBO):

$$\boxed{\log p_\theta(\mathbf{x}) \geq \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))}_{\text{regularization}} = \text{ELBO}}$$

Interpreting the ELBO Terms

  • Reconstruction term $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})]$: How well can the decoder reconstruct $\mathbf{x}$ from sampled $\mathbf{z}$? For a Gaussian decoder, this is the negative MSE up to a constant: $-\|\mathbf{x} - \hat{\mathbf{x}}\|^2/(2\sigma^2)$.
  • KL regularization $D_{\text{KL}}(q_\phi \| p)$: Keeps the encoder distribution close to the prior. Prevents the latent space from collapsing to point masses.

Closed-Form KL for Gaussians

When $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$:

$$\boxed{D_{\text{KL}} = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)}$$

Derivation: Using $D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\|\mathcal{N}(\mathbf{0},\mathbf{I})) = \frac{1}{2}[\text{tr}(\boldsymbol{\Sigma}) + \boldsymbol{\mu}^T\boldsymbol{\mu} - d - \log\det\boldsymbol{\Sigma}]$ with diagonal $\boldsymbol{\Sigma}$.
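The closed form can be sanity-checked against a Monte Carlo estimate of the KL divergence. A minimal NumPy sketch (the values of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy diagonal-Gaussian posterior parameters for a d = 3 latent space.
mu = np.array([0.5, -1.0, 0.2])
sigma = np.array([0.8, 1.2, 0.5])

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).
kl_closed = -0.5 * np.sum(1 + np.log(sigma**2) - mu**2 - sigma**2)

# Monte Carlo check: KL = E_q[ log q(z) - log p(z) ].
z = mu + sigma * rng.standard_normal((200_000, 3))
log_q = np.sum(-0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma)
               - 0.5 * np.log(2 * np.pi), axis=1)
log_p = np.sum(-0.5 * z**2 - 0.5 * np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(f"closed form: {kl_closed:.4f}  Monte Carlo: {kl_mc:.4f}")  # both ≈ 1.044
```

The two estimates agree to sampling error, confirming the boxed formula term by term.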

3. The Reparameterization Trick

The Problem

The ELBO involves an expectation over $q_\phi(\mathbf{z}|\mathbf{x})$. We approximate this by sampling, but sampling is not differentiable — we cannot backpropagate through $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$.

The Solution

Instead of sampling $\mathbf{z}$ directly, reparameterize as a deterministic function of the parameters and an independent noise variable:

$$\boxed{\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}$$

Now $\mathbf{z}$ is a differentiable function of $\phi$ (through $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$), and the stochasticity comes from $\boldsymbol{\epsilon}$ which does not depend on $\phi$. Gradients flow through the deterministic path.

The expectation becomes $\mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})}[\cdot]$, which we estimate with a single sample per data point in each mini-batch.
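A quick numerical illustration of the pathwise (reparameterized) gradient, for a scalar latent and the toy function $f(z) = z^2$ whose expectation has a known analytic gradient (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.5, 0.7
eps = rng.standard_normal(500_000)

# Reparameterize: z is a deterministic, differentiable function of (mu, sigma);
# all randomness lives in eps, which does not depend on the parameters.
z = mu + sigma * eps

# Pathwise gradients of E[f(z)] for f(z) = z^2, via the chain rule:
# df/dmu = 2z * dz/dmu = 2z,   df/dsigma = 2z * dz/dsigma = 2z * eps.
grad_mu = np.mean(2 * z)
grad_sigma = np.mean(2 * z * eps)

# Analytic check: E[z^2] = mu^2 + sigma^2, so the true gradients are
# d/dmu = 2*mu = 3.0 and d/dsigma = 2*sigma = 1.4.
print(grad_mu, grad_sigma)  # ≈ 3.0, ≈ 1.4
```

Autodiff frameworks compute exactly these pathwise gradients when $\mathbf{z}$ is built from $\boldsymbol{\epsilon}$ this way.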

4. Generative Adversarial Networks (GANs)

GANs (Goodfellow et al., 2014) learn to generate data through a two-player minimax game between a generator $G$ and a discriminator $D$.

The Minimax Objective

$$\boxed{\min_G \max_D \; V(D, G) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z}\sim p(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))]}$$

Discriminator ($\max_D$): Tries to correctly classify real data as real ($D(\mathbf{x}) \to 1$) and fake data as fake ($D(G(\mathbf{z})) \to 0$).

Generator ($\min_G$): Tries to fool the discriminator by generating data that $D$ classifies as real ($D(G(\mathbf{z})) \to 1$).

Optimal Discriminator

For fixed $G$, the optimal discriminator is:

$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x})}$$

Proof: The value function can be written as:

$$V(D,G) = \int \left[p_{\text{data}}(\mathbf{x})\log D(\mathbf{x}) + p_G(\mathbf{x})\log(1-D(\mathbf{x}))\right]d\mathbf{x}$$

Taking the derivative w.r.t. $D(\mathbf{x})$ and setting to zero gives the result. The function$a\log y + b\log(1-y)$ is maximized at $y = a/(a+b)$.
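The pointwise maximization can be checked numerically on a grid. A minimal sketch, with illustrative values for $p_{\text{data}}(\mathbf{x})$ and $p_G(\mathbf{x})$ at one fixed point:

```python
import numpy as np

# At a fixed point x, let a = p_data(x) and b = p_G(x) (illustrative values).
a, b = 0.7, 0.3
y = np.linspace(1e-4, 1 - 1e-4, 100_000)

# The pointwise objective a*log(y) + b*log(1 - y) from the integrand of V(D, G).
value = a * np.log(y) + b * np.log(1 - y)
y_star = y[np.argmax(value)]

print(y_star, a / (a + b))  # both ≈ 0.7
```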

Global Optimum

Substituting $D^*$ back into $V$:

$$V(D^*, G) = -\log 4 + 2\,D_{\text{JS}}(p_{\text{data}} \| p_G)$$

where $D_{\text{JS}}$ is the Jensen-Shannon divergence:

$$D_{\text{JS}}(p\|q) = \frac{1}{2}D_{\text{KL}}\left(p\left\|\frac{p+q}{2}\right.\right) + \frac{1}{2}D_{\text{KL}}\left(q\left\|\frac{p+q}{2}\right.\right)$$

The global minimum is achieved when $p_G = p_{\text{data}}$, giving $D_{\text{JS}} = 0$ and $D^*(\mathbf{x}) = 1/2$ everywhere.

Training Instabilities and Remedies

  • Mode collapse: Generator produces only a few modes of the data distribution
  • Vanishing gradients: If $D$ is too strong, $\log(1-D(G(\mathbf{z})))$ saturates near zero and its gradient vanishes, leaving the generator with no learning signal
  • Non-saturating loss: Train $G$ to maximize $\log D(G(\mathbf{z}))$ instead (provides stronger gradients early)
  • Wasserstein GAN: Uses Earth Mover distance instead of JS divergence, providing smoother gradients
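The saturation problem can be made concrete by writing the generator losses in terms of the discriminator logit $s$ (with $D = \sigma(s)$); this parameterization is an illustrative assumption, not part of the original formulation above:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# s is the discriminator logit on a generated sample; very negative s means
# D confidently labels the sample as fake (typical early in training).
s = -6.0

# Saturating generator loss log(1 - D): gradient w.r.t. s is -sigmoid(s).
grad_saturating = -sigmoid(s)
# Non-saturating loss -log D: gradient w.r.t. s is sigmoid(s) - 1.
grad_nonsaturating = sigmoid(s) - 1.0

print(grad_saturating)     # ≈ -0.0025 (vanishing)
print(grad_nonsaturating)  # ≈ -0.9975 (strong signal)
```

When $D$ confidently rejects a sample, the saturating loss gives a near-zero gradient while the non-saturating loss gives a gradient of magnitude near one — the motivation for the non-saturating variant.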

5. Normalizing Flows

Normalizing flows transform a simple base distribution through a sequence of invertible maps, providing exact likelihood computation.

Change of Variables

If $\mathbf{x} = f(\mathbf{z})$ where $f$ is invertible and $\mathbf{z} \sim p_Z(\mathbf{z})$:

$$\boxed{p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \left|\det\frac{\partial f^{-1}}{\partial \mathbf{x}}\right| = p_Z(f^{-1}(\mathbf{x})) \left|\det\frac{\partial f}{\partial \mathbf{z}}\right|^{-1}}$$

Composition of Flows

A normalizing flow composes $K$ invertible transformations:

$$\mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z}_0), \quad \mathbf{z}_0 \sim p_0(\mathbf{z}_0)$$

The log-likelihood is:

$$\log p(\mathbf{x}) = \log p_0(\mathbf{z}_0) - \sum_{k=1}^{K}\log\left|\det\frac{\partial f_k}{\partial \mathbf{z}_{k-1}}\right|$$

The key design challenge is making the Jacobian determinant efficient to compute. Coupling layers (RealNVP, Glow) achieve $O(d)$ Jacobian computation.
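The composed log-likelihood formula can be verified on a toy two-step affine flow, where the composition is itself a Gaussian with a known density (the coefficients are illustrative):

```python
import numpy as np

# Two invertible affine maps f1(z) = 3z + 1 and f2(u) = 0.5u - 3.
# Composition: x = f2(f1(z)) = 1.5z - 2.5, so with z ~ N(0, 1) we have
# X ~ N(-2.5, 1.5^2), giving an analytic density to check against.

def log_normal(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

x = np.array([-4.0, -2.5, 0.0, 1.0])

# Invert the flow step by step to recover z0.
z1 = (x + 3.0) / 0.5   # f2^{-1}
z0 = (z1 - 1.0) / 3.0  # f1^{-1}

# log p(x) = log p0(z0) - sum_k log|det dfk/dz|  (here |3| and |0.5|).
log_px_flow = log_normal(z0, 0.0, 1.0) - (np.log(3.0) + np.log(0.5))

# Analytic density of N(-2.5, 1.5^2).
log_px_true = log_normal(x, -2.5, 1.5)

print(np.max(np.abs(log_px_flow - log_px_true)))  # ≈ 0 (floating-point roundoff)
```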

6. Diffusion Models

Diffusion models (Ho et al., 2020; Song & Ermon, 2019) generate data by learning to reverse a gradual noising process. They now produce the highest quality samples across many domains.

Forward Process (Noising)

Gradually add Gaussian noise over $T$ steps with variance schedule $\beta_1, \ldots, \beta_T$:

$$q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t\mathbf{I})$$

Using $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, we can sample any step directly:

$$\boxed{q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})}$$
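The closed-form marginal can be checked by iterating the one-step kernel and comparing the resulting moments. A scalar toy example with an illustrative linear $\beta$ schedule:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 50
betas = np.linspace(1e-4, 0.2, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = np.full(100_000, 2.0)  # many copies of the same scalar "data point"

# Iterate the one-step kernel q(x_t | x_{t-1}) for all T steps.
x = x0.copy()
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# Closed form predicts x_T ~ N(sqrt(alpha_bar_T) * x0, 1 - alpha_bar_T).
print(x.mean(), np.sqrt(alpha_bar[-1]) * 2.0)  # should agree
print(x.var(), 1.0 - alpha_bar[-1])            # should agree
```

The empirical mean and variance after $T$ steps match the closed form, which is what allows training to sample a random $t$ without simulating the chain.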

Reverse Process (Denoising)

Learn a neural network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ to predict the noise:

$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2\mathbf{I})$$

The training loss (simplified) is:

$$\boxed{\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}, t)\|^2\right]}$$

This is derived from the variational bound and simplifies to predicting the noise $\boldsymbol{\epsilon}$ added to the clean data.

Score-Based Perspective

The connection to score matching: the model learns the score function (gradient of log-density):

$$\nabla_{\mathbf{x}}\log p_t(\mathbf{x}) \approx -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}, t)}{\sqrt{1-\bar{\alpha}_t}}$$

Sampling then corresponds to Langevin dynamics, following the score to high-density regions.
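For the conditional $q(\mathbf{x}_t|\mathbf{x}_0)$ the score identity is exact, with the trained network $\boldsymbol{\epsilon}_\theta$ standing in for the true noise; a quick numerical check (values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_bar = 0.6
x0 = 1.2

# Sample x_t from the closed-form forward marginal via its noise variable.
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# Score of the Gaussian q(x_t | x_0) = N(sqrt(ab)*x0, 1 - ab):
score = -(x_t - np.sqrt(alpha_bar) * x0) / (1 - alpha_bar)

# The claimed identity: score = -eps / sqrt(1 - alpha_bar).
print(np.max(np.abs(score + eps / np.sqrt(1 - alpha_bar))))  # 0 up to roundoff
```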

7. Scientific Applications

Molecular Design

VAEs generate novel drug-like molecules by learning a continuous latent space of molecular structures. Interpolation in latent space produces smooth transitions between molecules, enabling property-guided optimization.

Simulation Acceleration

GANs generate synthetic physics simulation outputs (e.g., particle showers in calorimeters) orders of magnitude faster than full Monte Carlo simulation. CaloGAN and similar models produce detector-level data for high-energy physics.

Cosmological Simulation

Generative models create synthetic dark matter density fields and galaxy catalogs, accelerating Bayesian inference for cosmological parameters by replacing expensive N-body simulations.

Anomaly Detection

VAEs detect anomalies via high reconstruction error: data points unlike the training distribution yield poor reconstructions. This is used for new physics searches in particle collider data and for detecting unusual astronomical transients.

Boltzmann Generators

Normalizing flows can be trained to sample from Boltzmann distributions in statistical mechanics (Noé et al., 2019). Given a target energy function $U(\mathbf{x})$:

$$p(\mathbf{x}) = \frac{e^{-U(\mathbf{x})/k_BT}}{Z}$$

A normalizing flow trained with the loss $D_{\text{KL}}(q_\theta\|p)$ generates independent samples from the target distribution, bypassing the autocorrelation problem of MCMC methods. This enables efficient sampling of molecular configurations.
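A minimal 1D sketch of this objective, assuming $k_BT = 1$, a harmonic toy energy, and a single affine flow layer (all illustrative choices): the reverse KL, estimated by sampling the base distribution, is minimized exactly when the flow maps onto the Boltzmann density.

```python
import numpy as np

rng = np.random.default_rng(4)

def U(x):
    # Harmonic toy energy: exp(-U) is proportional to N(2, 0.5^2) (k_B T = 1).
    return (x - 2.0) ** 2 / (2 * 0.5**2)

def reverse_kl(a, b, z):
    # KL(q_theta || p) up to the constant log Z, for the affine flow x = a*z + b:
    # E_z[ log q_theta(x) + U(x) ], with log q_theta(x) = log p0(z) - log|a|
    # by the change-of-variables formula.
    log_p0 = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
    x = a * z + b
    return np.mean(log_p0 - np.log(abs(a)) + U(x))

z = rng.standard_normal(200_000)
print(reverse_kl(0.5, 2.0, z))  # matched flow: ≈ -0.226 (= -log Z, KL term zero)
print(reverse_kl(1.0, 0.0, z))  # mismatched flow: strictly larger
```

Once trained, drawing $z \sim p_0$ and pushing it through the flow yields i.i.d. samples from the Boltzmann distribution, with no MCMC autocorrelation.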

8. Comparison of Generative Models

| Property | VAE | GAN | Flow | Diffusion |
|---|---|---|---|---|
| Exact likelihood | Lower bound | No | Yes | Lower bound |
| Sample quality | Blurry | Sharp | Good | Excellent |
| Training | Stable | Unstable | Stable | Stable |
| Latent space | Structured | Unstructured | Structured | N/A |
| Mode coverage | Good | Mode collapse | Good | Excellent |

9. Python Simulation: VAE & GAN

This simulation implements a VAE and GAN on a 2D mixture of Gaussians, demonstrating the ELBO, reparameterization trick, and adversarial training.

VAE & GAN on 2D Gaussian Mixture


Summary

  • VAE ELBO: $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$
  • Reparameterization: $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ enables backprop through sampling
  • GAN objective: $\min_G\max_D\;\mathbb{E}[\log D(\mathbf{x})] + \mathbb{E}[\log(1-D(G(\mathbf{z})))]$
  • Normalizing flows: Exact likelihood via change of variables and invertible transforms
  • Diffusion: Learn to denoise; train by predicting added noise at random timesteps
  • Science: Generative models accelerate simulations, augment data, and sample from complex distributions