Part VII — Advanced Topics

Chapter 21: Diffusion Models

Diffusion models define a forward process that gradually destroys data by adding noise, then learn a reverse process that reconstructs data from noise. We derive the closed-form forward marginal, simplify the variational lower bound to a noise-prediction objective, and connect the framework to score matching.

Forward & Reverse Processes

Figure: the forward process carries clean data x₀ through increasingly noisy states x₁, …, x_{T−1} to pure noise x_T. Forward q(x_t | x_{t-1}) adds Gaussian noise; reverse p_θ(x_{t-1} | x_t) is learned denoising.

1. Forward Process

The forward process is a fixed (non-learned) Markov chain that gradually adds Gaussian noise:

\[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\right) \]

where \( \{\beta_t\}_{t=1}^T \) is a fixed variance schedule (e.g. linear from \( 10^{-4} \) to \( 0.02 \)). Define \( \alpha_t = 1 - \beta_t \) and \( \bar{\alpha}_t = \prod_{s=1}^t \alpha_s \).
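For concreteness, the schedule and its derived quantities can be tabulated in a few lines of NumPy (a sketch; the function and array names are ours):

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1..beta_T with its derived quantities."""
    betas = np.linspace(beta_start, beta_end, T)   # beta_t
    alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
    alpha_bars = np.cumprod(alphas)                # alpha_bar_t = prod_{s<=t} alpha_s
    return betas, alphas, alpha_bars

betas, alphas, alpha_bars = linear_beta_schedule()
# alpha_bar_T is nearly 0 (~4e-5 for this schedule), so x_T is essentially pure noise
```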

Closed-Form Marginal q(x_t | x_0)

We proceed by induction on \( t \). At step 1:

\[ \mathbf{x}_1 = \sqrt{\alpha_1}\,\mathbf{x}_0 + \sqrt{1-\alpha_1}\,\boldsymbol{\epsilon}_1, \quad \boldsymbol{\epsilon}_1 \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \]

At step 2:

\[ \mathbf{x}_2 = \sqrt{\alpha_2}\,\mathbf{x}_1 + \sqrt{1-\alpha_2}\,\boldsymbol{\epsilon}_2 = \sqrt{\alpha_1\alpha_2}\,\mathbf{x}_0 + \sqrt{\alpha_2(1-\alpha_1)}\,\boldsymbol{\epsilon}_1 + \sqrt{1-\alpha_2}\,\boldsymbol{\epsilon}_2 \]

Using the property that the sum of independent Gaussians \( \mathcal{N}(0,\sigma_1^2) + \mathcal{N}(0,\sigma_2^2) \sim \mathcal{N}(0,\sigma_1^2+\sigma_2^2) \), the noise terms combine:

\[ \text{Var of noise} = \alpha_2(1-\alpha_1) + (1-\alpha_2) = 1 - \alpha_1\alpha_2 = 1 - \bar{\alpha}_2 \]

By induction at arbitrary \( t \):

\[ \boxed{q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\; (1-\bar{\alpha}_t)\mathbf{I}\right)} \]

Or equivalently via reparameterisation: \( \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon} \), \( \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) \).

For \( T \) large enough that \( \bar{\alpha}_T \approx 0 \), we get \( \mathbf{x}_T \approx \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \) — the data has been fully destroyed.
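The reparameterised marginal means any \( \mathbf{x}_t \) can be sampled in one shot, without simulating the chain step by step. A minimal sketch (function name ours):

```python
import numpy as np

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) directly via the closed-form marginal."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal(10000)             # toy unit-variance "data"
xT, _ = q_sample(x0, 999, alpha_bars, rng)  # final step: approximately N(0, 1)
```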

2. Reverse Process

The true reverse \( q(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \) is intractable (requires knowledge of the data distribution). We instead learn a parametric approximation:

\[ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\; \Sigma_\theta(\mathbf{x}_t, t)\right) \]

The reverse conditioned on \( \mathbf{x}_0 \) is tractable by Bayes' rule:

\[ q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\; \tilde{\beta}_t \mathbf{I}\right) \]

where the posterior mean and variance are:

\[ \tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t \]
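These formulas translate directly to code (a sketch, our naming). Note \( \tilde\beta_t < \beta_t \): the conditioned reverse steps have lower variance than the forward ones.

```python
import numpy as np

def q_posterior(x0, xt, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0)."""
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    mean = (np.sqrt(ab_prev) * betas[t] / (1 - ab_t)) * x0 \
         + (np.sqrt(alphas[t]) * (1 - ab_prev) / (1 - ab_t)) * xt
    var = (1 - ab_prev) / (1 - ab_t) * betas[t]    # beta_tilde_t
    return mean, var

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
mean, var = q_posterior(0.0, 1.0, 500, betas, alphas, alpha_bars)
```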

3. Training Objective — Full Derivation

3.1 Variational Lower Bound (ELBO)

We maximise the log-likelihood \( \log p_\theta(\mathbf{x}_0) \). Since the forward process defines a joint \( q(\mathbf{x}_{1:T}|\mathbf{x}_0) \), we apply the ELBO (Jensen's inequality):

\[ \log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_q\!\left[\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\right] =: -\mathcal{L}_\text{ELBO} \]

Expand the joint probabilities and collect KL divergences at each step:

\[ \mathcal{L}_\text{ELBO} = \mathbb{E}_q\!\left[\underbrace{D_\text{KL}(q(\mathbf{x}_T|\mathbf{x}_0)\,\|\,p(\mathbf{x}_T))}_{\mathcal{L}_T,\,\text{constant}} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))}_{\mathcal{L}_{t-1}} \underbrace{-\log p_\theta(\mathbf{x}_0|\mathbf{x}_1)}_{\mathcal{L}_0}\right] \]

Since both distributions in \( \mathcal{L}_{t-1} \) are Gaussian, the KL has a closed form. Parameterising \( \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right) \) and substituting the reparameterisation \( \mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol\epsilon \):

\[ \mathcal{L}_{t-1} = \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}\,\mathbb{E}_{\mathbf{x}_0,\boldsymbol\epsilon}\!\left[\left\|\boldsymbol\epsilon - \boldsymbol\epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,\;t\right)\right\|^2\right] \]

Ho et al. (2020) drop the time-dependent prefactor and replace the sum over timesteps with a single term at a uniformly sampled \( t \):

\[ \boxed{\mathcal{L}_\text{simple} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol\epsilon}\!\left[\left\|\boldsymbol\epsilon - \boldsymbol\epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,\;t\right)\right\|^2\right]} \]

The network \( \boldsymbol\epsilon_\theta \) simply predicts the noise added to \( \mathbf{x}_0 \). This is the DDPM training objective.
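A Monte Carlo estimate of \( \mathcal{L}_\text{simple} \) captures the whole training-step recipe — sample \( t \), sample \( \boldsymbol\epsilon \), form \( \mathbf{x}_t \), regress. A sketch (our naming, with a dummy predictor standing in for the network):

```python
import numpy as np

def l_simple(eps_model, x0_batch, alpha_bars, rng):
    """One Monte Carlo estimate of L_simple over a batch of clean samples."""
    T = len(alpha_bars)
    t = rng.integers(0, T, size=x0_batch.shape[0])   # uniform timestep per example
    eps = rng.standard_normal(x0_batch.shape)        # target noise
    ab = alpha_bars[t][:, None]
    xt = np.sqrt(ab) * x0_batch + np.sqrt(1 - ab) * eps  # reparameterised forward sample
    pred = eps_model(xt, t)
    return np.mean((eps - pred) ** 2)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((64, 2))
# A predictor that always outputs zero scores E||eps||^2, i.e. ~1 per dimension
loss = l_simple(lambda xt, t: np.zeros_like(xt), x0, alpha_bars, rng)
```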

4. Connection to Score Matching

The score function of a distribution is the gradient of its log-density:

\[ \mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) \]

For the noisy distribution \( q(\mathbf{x}_t|\mathbf{x}_0) \), the score is:

\[ \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\boldsymbol\epsilon}{\sqrt{1-\bar\alpha_t}} \]

Therefore predicting the noise \( \boldsymbol\epsilon_\theta \) is equivalent to estimating the (scaled negative) score: \( \boldsymbol\epsilon_\theta(\mathbf{x}_t, t) \approx -\sqrt{1-\bar\alpha_t}\,\nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t) \). This connects DDPM to score-based generative modelling (Song & Ermon, 2019) and the stochastic differential equation (SDE) framework (Song et al., 2021).
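The identity above is easy to verify numerically: differentiate the Gaussian log-density of \( q(\mathbf{x}_t\mid\mathbf{x}_0) \) analytically and substitute the reparameterisation (a sketch, our naming):

```python
import numpy as np

def score_q(xt, x0, ab_t):
    """Score of q(x_t | x_0) = N(sqrt(ab_t) x0, (1 - ab_t) I)."""
    return -(xt - np.sqrt(ab_t) * x0) / (1 - ab_t)

rng = np.random.default_rng(0)
x0, eps, ab = rng.standard_normal(5), rng.standard_normal(5), 0.3
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps
# Substituting xt collapses the score to -eps / sqrt(1 - ab), as claimed
```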

5. Sampling: DDPM & DDIM

DDPM Sampling

Sample \( \mathbf{x}_T \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \), then iterate:

\[ \mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\boldsymbol\epsilon_\theta(\mathbf{x}_t,t)\right) + \sigma_t\,\mathbf{z} \]

Here \( \mathbf{z} \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \) for \( t > 1 \) and \( \mathbf{z} = \mathbf{0} \) at the final step; the variance is typically set to \( \sigma_t^2 = \beta_t \) or \( \tilde\beta_t \). Sampling requires \( T \) sequential network evaluations, which is slow (typically \( T = 1000 \)).
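The full ancestral sampler is a short loop (our sketch, taking \( \sigma_t^2 = \beta_t \)). For toy data that is itself standard normal, the optimal noise predictor is known in closed form, \( \boldsymbol\epsilon^*(\mathbf{x}_t, t) = \sqrt{1-\bar\alpha_t}\,\mathbf{x}_t \), which gives an end-to-end sanity check:

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, alphas, alpha_bars, rng):
    """Ancestral DDPM sampling with sigma_t^2 = beta_t."""
    x = rng.standard_normal(shape)                    # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)
        x = (x - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                     # no noise on the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
rng = np.random.default_rng(0)
# Oracle predictor for standard-normal data: q(x_t) = N(0, I) at every t
oracle = lambda x, t: np.sqrt(1 - alpha_bars[t]) * x
samples = ddpm_sample(oracle, (5000,), betas, alphas, alpha_bars, rng)
# samples should again be approximately N(0, 1)
```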

DDIM (Deterministic Sampler)

Song et al. (2020) derive a non-Markovian process with the same marginals but deterministic reverse:

\[ \mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\underbrace{\!\left(\frac{\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\boldsymbol\epsilon_\theta}{\sqrt{\bar\alpha_t}}\right)}_{\text{pred. }\mathbf{x}_0} + \sqrt{1-\bar\alpha_{t-1}}\,\boldsymbol\epsilon_\theta \]

DDIM can skip steps, enabling 50-step or 10-step generation with minimal quality loss.
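One DDIM update is a two-liner (our sketch). A useful property: if \( \boldsymbol\epsilon \) is the exact noise that produced \( \mathbf{x}_t \), the update lands exactly on the corresponding earlier sample, no matter how many steps are skipped — which is why aggressive step skipping works.

```python
import numpy as np

def ddim_step(xt, eps, ab_t, ab_prev):
    """Deterministic DDIM update from alpha_bar_t to alpha_bar_{t'} (t' < t)."""
    x0_pred = (xt - np.sqrt(1 - ab_t) * eps) / np.sqrt(ab_t)   # predicted x_0
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1 - ab_prev) * eps

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
ab_t, ab_prev = 0.2, 0.7           # a large jump across many timesteps at once
xt = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps
x_prev = ddim_step(xt, eps, ab_t, ab_prev)
```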

Classifier-Free Guidance

To improve sample quality and controllability, Ho & Salimans (2022) jointly train a conditional model \( \boldsymbol\epsilon_\theta(\mathbf{x}_t, t, c) \) and an unconditional model \( \boldsymbol\epsilon_\theta(\mathbf{x}_t, t, \varnothing) \) by randomly dropping the condition \( c \) during training. At inference, the guided noise estimate is:

\[ \tilde{\boldsymbol\epsilon}_\theta(\mathbf{x}_t, t, c) = (1 + w)\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t, c) - w\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t, \varnothing) \]

Guidance weight \( w > 0 \) amplifies the conditional direction. Larger \( w \) yields more faithful samples at the cost of diversity.
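The guidance rule itself is one line (our naming, toy values); \( w = 0 \) recovers the plain conditional prediction, and larger \( w \) extrapolates further along the direction \( \boldsymbol\epsilon_c - \boldsymbol\epsilon_\varnothing \):

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate past the unconditional prediction."""
    return (1 + w) * eps_cond - w * eps_uncond

e_c = np.array([1.0, 2.0])       # conditional prediction (toy values)
e_u = np.array([0.5, 0.5])       # unconditional prediction
guided = cfg_eps(e_c, e_u, 3.0)  # pushed further along e_c - e_u
```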

6. Modern Architectures

U-Net (DDPM / Stable Diffusion)

  • Encoder–decoder with skip connections at matching resolutions
  • Residual blocks + group normalisation
  • Self-attention layers at lower resolutions for global context
  • Cross-attention to condition on text embeddings
  • Sinusoidal time embeddings injected via FiLM scaling
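The time embedding in the last bullet is the standard Transformer-style sinusoidal encoding applied to the scalar timestep (a sketch; the frequency base 10000 follows the usual convention):

```python
import numpy as np

def sinusoidal_time_embedding(t, dim):
    """Embed a scalar timestep t as dim sinusoidal features."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)  # geometric frequencies
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = sinusoidal_time_embedding(10, 64)   # a distinct, smooth code for each timestep
```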

Diffusion Transformer (DiT)

  • Patchify latent: divide image into patches, embed as tokens
  • Standard Transformer blocks (self-attention + MLP)
  • Adaptive LayerNorm: time and class condition via adaLN-Zero
  • Scales better than U-Net (DiT-XL/2 → SOTA FID)
  • Used in Sora, Stable Diffusion 3, Flux
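The patchify step in the first bullet is a pure reshape; a sketch for an (H, W, C) latent (our naming, illustrative sizes):

```python
import numpy as np

def patchify(latent, p):
    """Split an (H, W, C) latent into non-overlapping p x p patch tokens."""
    H, W, C = latent.shape
    assert H % p == 0 and W % p == 0
    blocks = latent.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(-1, p * p * C)   # (num_tokens, token_dim)

latent = np.arange(32 * 32 * 4, dtype=float).reshape(32, 32, 4)
tokens = patchify(latent, 2)   # patch size 2, as in DiT-XL/2: 16x16 = 256 tokens
```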

Python Simulation: 1D Diffusion Model

We implement a complete 1D diffusion model from scratch using NumPy. A small MLP with sinusoidal time embeddings learns to denoise a mixture-of-Gaussians distribution. We visualise the noise schedule, training loss, and compare the original vs DDPM-sampled distributions.
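The full script is too long to reproduce here, but its core training loop can be condensed as follows — a sketch with our own layer sizes and learning rate, using a two-layer tanh MLP with manual backprop and a plain scaled timestep in place of the script's network and sinusoidal embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution: 1D mixture of two Gaussians
def sample_data(n):
    centers = rng.choice([-2.0, 2.0], size=n)
    return centers + 0.3 * rng.standard_normal(n)

# Noise schedule
T = 200
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# Two-layer tanh MLP eps_theta([x_t, t/T]) trained by manual backprop
H = 64
W1 = 0.5 * rng.standard_normal((H, 2)); b1 = np.zeros(H)
W2 = 0.5 * rng.standard_normal((1, H)); b2 = np.zeros(1)

def train_step(x0, lr=1e-3):
    global W1, b1, W2, b2
    n = len(x0)
    t = rng.integers(0, T, size=n)
    eps = rng.standard_normal(n)
    ab = alpha_bars[t]
    xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps   # forward-process sample
    inp = np.stack([xt, t / T])                     # (2, n) network input
    h = np.tanh(W1 @ inp + b1[:, None])             # hidden layer
    pred = (W2 @ h + b2[:, None])[0]                # predicted noise
    # Backprop of the L_simple MSE loss
    d = 2.0 * (pred - eps) / n
    gW2 = d[None, :] @ h.T
    gb2 = np.array([d.sum()])
    dh = (W2.T @ d[None, :]) * (1 - h ** 2)
    gW1 = dh @ inp.T
    gb1 = dh.sum(axis=1)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return float(np.mean((pred - eps) ** 2))

losses = [train_step(sample_data(256)) for _ in range(5000)]
# The noise-prediction loss should fall steadily towards ~1 (the variance of eps)
```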
