Part VII — Advanced Topics

Chapter 21: Diffusion Models

Diffusion models define a forward process that gradually destroys data by adding noise, then learn a reverse process that reconstructs data from noise. We derive the closed-form forward marginal, simplify the variational lower bound to a noise-prediction objective, and connect the framework to score matching.

Forward & Reverse Processes

Figure: the forward process carries clean data x₀ through increasingly noisy states x₁, …, x_{T−1} to pure noise x_T. Forward q(x_t | x_{t-1}) adds Gaussian noise; reverse p_θ(x_{t-1} | x_t) is learned denoising.

1. Forward Process

The forward process is a fixed (non-learned) Markov chain that gradually adds Gaussian noise:

\[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\right) \]

where \( \{\beta_t\}_{t=1}^T \) is a fixed variance schedule (e.g. linear from \( 10^{-4} \) to \( 0.02 \)). Define \( \alpha_t = 1 - \beta_t \) and \( \bar{\alpha}_t = \prod_{s=1}^t \alpha_s \).
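For concreteness, the schedule and its derived quantities can be tabulated in a few lines of NumPy (a sketch; the function and array names are ours):

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1..beta_T with its derived quantities."""
    betas = np.linspace(beta_start, beta_end, T)   # beta_t
    alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
    alpha_bars = np.cumprod(alphas)                # alpha_bar_t = prod_{s<=t} alpha_s
    return betas, alphas, alpha_bars

betas, alphas, alpha_bars = linear_beta_schedule()
# alpha_bar_T is nearly 0 (~4e-5 for this schedule), so x_T is essentially pure noise
```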

Closed-Form Marginal q(x_t | x_0)

We proceed by induction on \( t \). At step 1:

\[ \mathbf{x}_1 = \sqrt{\alpha_1}\,\mathbf{x}_0 + \sqrt{1-\alpha_1}\,\boldsymbol{\epsilon}_1, \quad \boldsymbol{\epsilon}_1 \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \]

At step 2:

\[ \mathbf{x}_2 = \sqrt{\alpha_2}\,\mathbf{x}_1 + \sqrt{1-\alpha_2}\,\boldsymbol{\epsilon}_2 = \sqrt{\alpha_1\alpha_2}\,\mathbf{x}_0 + \sqrt{\alpha_2(1-\alpha_1)}\,\boldsymbol{\epsilon}_1 + \sqrt{1-\alpha_2}\,\boldsymbol{\epsilon}_2 \]

Using the property that the sum of independent Gaussians \( \mathcal{N}(0,\sigma_1^2) + \mathcal{N}(0,\sigma_2^2) \sim \mathcal{N}(0,\sigma_1^2+\sigma_2^2) \), the noise terms combine:

\[ \text{Var of noise} = \alpha_2(1-\alpha_1) + (1-\alpha_2) = 1 - \alpha_1\alpha_2 = 1 - \bar{\alpha}_2 \]

By induction at arbitrary \( t \):

\[ \boxed{q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\; (1-\bar{\alpha}_t)\mathbf{I}\right)} \]

Or equivalently via reparameterisation: \( \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon} \), \( \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) \).

For \( T \) large enough that \( \bar{\alpha}_T \approx 0 \), we get \( \mathbf{x}_T \approx \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \) — the data has been fully destroyed.
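The reparameterised marginal means any \( \mathbf{x}_t \) can be sampled in one shot, without simulating the chain step by step. A minimal sketch (function name ours):

```python
import numpy as np

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) directly via the closed-form marginal."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal(10000)             # toy unit-variance "data"
xT, _ = q_sample(x0, 999, alpha_bars, rng)  # final step: approximately N(0, 1)
```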

2. Reverse Process

The true reverse \( q(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \) is intractable (requires knowledge of the data distribution). We instead learn a parametric approximation:

\[ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\; \Sigma_\theta(\mathbf{x}_t, t)\right) \]

The reverse conditioned on \( \mathbf{x}_0 \) is tractable by Bayes' rule:

\[ q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\; \tilde{\beta}_t \mathbf{I}\right) \]

where the posterior mean and variance are:

\[ \tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t \]
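These formulas translate directly to code (a sketch, our naming). Note \( \tilde\beta_t < \beta_t \): the conditioned reverse steps have lower variance than the forward ones.

```python
import numpy as np

def q_posterior(x0, xt, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0)."""
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    mean = (np.sqrt(ab_prev) * betas[t] / (1 - ab_t)) * x0 \
         + (np.sqrt(alphas[t]) * (1 - ab_prev) / (1 - ab_t)) * xt
    var = (1 - ab_prev) / (1 - ab_t) * betas[t]    # beta_tilde_t
    return mean, var

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
mean, var = q_posterior(0.0, 1.0, 500, betas, alphas, alpha_bars)
```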

3. Training Objective — Full Derivation

3.1 Variational Lower Bound (ELBO)

We maximise the log-likelihood \( \log p_\theta(\mathbf{x}_0) \). Since the forward process defines a joint \( q(\mathbf{x}_{1:T}|\mathbf{x}_0) \), we apply the ELBO (Jensen's inequality):

\[ \log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_q\!\left[\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\right] =: -\mathcal{L}_\text{ELBO} \]

Expand the joint probabilities and collect KL divergences at each step:

\[ \mathcal{L}_\text{ELBO} = \mathbb{E}_q\!\left[\underbrace{D_\text{KL}(q(\mathbf{x}_T|\mathbf{x}_0)\,\|\,p(\mathbf{x}_T))}_{\mathcal{L}_T,\,\text{constant}} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))}_{\mathcal{L}_{t-1}} \underbrace{-\log p_\theta(\mathbf{x}_0|\mathbf{x}_1)}_{\mathcal{L}_0}\right] \]

Since both distributions in \( \mathcal{L}_{t-1} \) are Gaussian, the KL has a closed form. Parameterising \( \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right) \) and substituting the reparameterisation \( \mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol\epsilon \):

\[ \mathcal{L}_{t-1} = \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}\,\mathbb{E}_{\mathbf{x}_0,\boldsymbol\epsilon}\!\left[\left\|\boldsymbol\epsilon - \boldsymbol\epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,\;t\right)\right\|^2\right] \]

Ho et al. (2020) drop the time-dependent prefactor and replace the sum over timesteps with a single term at a uniformly sampled \( t \):

\[ \boxed{\mathcal{L}_\text{simple} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol\epsilon}\!\left[\left\|\boldsymbol\epsilon - \boldsymbol\epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,\;t\right)\right\|^2\right]} \]

The network \( \boldsymbol\epsilon_\theta \) simply predicts the noise added to \( \mathbf{x}_0 \). This is the DDPM training objective.
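A Monte Carlo estimate of \( \mathcal{L}_\text{simple} \) captures the whole training-step recipe — sample \( t \), sample \( \boldsymbol\epsilon \), form \( \mathbf{x}_t \), regress. A sketch (our naming, with a dummy predictor standing in for the network):

```python
import numpy as np

def l_simple(eps_model, x0_batch, alpha_bars, rng):
    """One Monte Carlo estimate of L_simple over a batch of clean samples."""
    T = len(alpha_bars)
    t = rng.integers(0, T, size=x0_batch.shape[0])   # uniform timestep per example
    eps = rng.standard_normal(x0_batch.shape)        # target noise
    ab = alpha_bars[t][:, None]
    xt = np.sqrt(ab) * x0_batch + np.sqrt(1 - ab) * eps  # reparameterised forward sample
    pred = eps_model(xt, t)
    return np.mean((eps - pred) ** 2)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((64, 2))
# A predictor that always outputs zero scores E||eps||^2, i.e. ~1 per dimension
loss = l_simple(lambda xt, t: np.zeros_like(xt), x0, alpha_bars, rng)
```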

4. Connection to Score Matching

The score function of a distribution is the gradient of its log-density:

\[ \mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) \]

For the noisy distribution \( q(\mathbf{x}_t|\mathbf{x}_0) \), the score is:

\[ \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\boldsymbol\epsilon}{\sqrt{1-\bar\alpha_t}} \]

Therefore predicting the noise \( \boldsymbol\epsilon_\theta \) is equivalent to estimating the (scaled negative) score: \( \boldsymbol\epsilon_\theta(\mathbf{x}_t, t) \approx -\sqrt{1-\bar\alpha_t}\,\nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t) \). This connects DDPM to score-based generative modelling (Song & Ermon, 2019) and the stochastic differential equation (SDE) framework (Song et al., 2021).
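The identity above is easy to verify numerically: differentiate the Gaussian log-density of \( q(\mathbf{x}_t\mid\mathbf{x}_0) \) analytically and substitute the reparameterisation (a sketch, our naming):

```python
import numpy as np

def score_q(xt, x0, ab_t):
    """Score of q(x_t | x_0) = N(sqrt(ab_t) x0, (1 - ab_t) I)."""
    return -(xt - np.sqrt(ab_t) * x0) / (1 - ab_t)

rng = np.random.default_rng(0)
x0, eps, ab = rng.standard_normal(5), rng.standard_normal(5), 0.3
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps
# Substituting xt collapses the score to -eps / sqrt(1 - ab), as claimed
```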

5. Sampling: DDPM & DDIM

DDPM Sampling

Sample \( \mathbf{x}_T \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \), then iterate:

\[ \mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\boldsymbol\epsilon_\theta(\mathbf{x}_t,t)\right) + \sigma_t\,\mathbf{z} \]

Here \( \mathbf{z} \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \) for \( t > 1 \) and \( \mathbf{z} = \mathbf{0} \) at the final step; the variance is typically set to \( \sigma_t^2 = \beta_t \) or \( \tilde\beta_t \). Sampling requires \( T \) sequential network evaluations, which is slow (typically \( T = 1000 \)).
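The full ancestral sampler is a short loop (our sketch, taking \( \sigma_t^2 = \beta_t \)). For toy data that is itself standard normal, the optimal noise predictor is known in closed form, \( \boldsymbol\epsilon^*(\mathbf{x}_t, t) = \sqrt{1-\bar\alpha_t}\,\mathbf{x}_t \), which gives an end-to-end sanity check:

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, alphas, alpha_bars, rng):
    """Ancestral DDPM sampling with sigma_t^2 = beta_t."""
    x = rng.standard_normal(shape)                    # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)
        x = (x - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                     # no noise on the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
rng = np.random.default_rng(0)
# Oracle predictor for standard-normal data: q(x_t) = N(0, I) at every t
oracle = lambda x, t: np.sqrt(1 - alpha_bars[t]) * x
samples = ddpm_sample(oracle, (5000,), betas, alphas, alpha_bars, rng)
# samples should again be approximately N(0, 1)
```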

DDIM (Deterministic Sampler)

Song et al. (2020) derive a non-Markovian process with the same marginals but deterministic reverse:

\[ \mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\underbrace{\!\left(\frac{\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\boldsymbol\epsilon_\theta}{\sqrt{\bar\alpha_t}}\right)}_{\text{pred. }\mathbf{x}_0} + \sqrt{1-\bar\alpha_{t-1}}\,\boldsymbol\epsilon_\theta \]

DDIM can skip steps, enabling 50-step or 10-step generation with minimal quality loss.
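One DDIM update is a two-liner (our sketch). A useful property: if \( \boldsymbol\epsilon \) is the exact noise that produced \( \mathbf{x}_t \), the update lands exactly on the corresponding earlier sample, no matter how many steps are skipped — which is why aggressive step skipping works.

```python
import numpy as np

def ddim_step(xt, eps, ab_t, ab_prev):
    """Deterministic DDIM update from alpha_bar_t to alpha_bar_{t'} (t' < t)."""
    x0_pred = (xt - np.sqrt(1 - ab_t) * eps) / np.sqrt(ab_t)   # predicted x_0
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1 - ab_prev) * eps

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
ab_t, ab_prev = 0.2, 0.7           # a large jump across many timesteps at once
xt = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps
x_prev = ddim_step(xt, eps, ab_t, ab_prev)
```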

Classifier-Free Guidance

To improve sample quality and controllability, Ho & Salimans (2022) jointly train a conditional model \( \boldsymbol\epsilon_\theta(\mathbf{x}_t, t, c) \) and an unconditional model \( \boldsymbol\epsilon_\theta(\mathbf{x}_t, t, \varnothing) \) by randomly dropping the condition \( c \) during training. At inference, the guided noise estimate is:

\[ \tilde{\boldsymbol\epsilon}_\theta(\mathbf{x}_t, t, c) = (1 + w)\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t, c) - w\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t, \varnothing) \]

Guidance weight \( w > 0 \) amplifies the conditional direction. Larger \( w \) yields more faithful samples at the cost of diversity.
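The guidance rule itself is one line (our naming, toy values); \( w = 0 \) recovers the plain conditional prediction, and larger \( w \) extrapolates further along the direction \( \boldsymbol\epsilon_c - \boldsymbol\epsilon_\varnothing \):

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate past the unconditional prediction."""
    return (1 + w) * eps_cond - w * eps_uncond

e_c = np.array([1.0, 2.0])       # conditional prediction (toy values)
e_u = np.array([0.5, 0.5])       # unconditional prediction
guided = cfg_eps(e_c, e_u, 3.0)  # pushed further along e_c - e_u
```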

6. Modern Architectures

U-Net (DDPM / Stable Diffusion)

  • Encoder–decoder with skip connections at matching resolutions
  • Residual blocks + group normalisation
  • Self-attention layers at lower resolutions for global context
  • Cross-attention to condition on text embeddings
  • Sinusoidal time embeddings injected via FiLM scaling
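The time embedding in the last bullet is the standard Transformer-style sinusoidal encoding applied to the scalar timestep (a sketch; the frequency base 10000 follows the usual convention):

```python
import numpy as np

def sinusoidal_time_embedding(t, dim):
    """Embed a scalar timestep t as dim sinusoidal features."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)  # geometric frequencies
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = sinusoidal_time_embedding(10, 64)   # a distinct, smooth code for each timestep
```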

Diffusion Transformer (DiT)

  • Patchify latent: divide image into patches, embed as tokens
  • Standard Transformer blocks (self-attention + MLP)
  • Adaptive LayerNorm: time and class condition via adaLN-Zero
  • Scales better than U-Net (DiT-XL/2 → SOTA FID)
  • Used in Sora, Stable Diffusion 3, Flux
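The patchify step in the first bullet is a pure reshape; a sketch for an (H, W, C) latent (our naming, illustrative sizes):

```python
import numpy as np

def patchify(latent, p):
    """Split an (H, W, C) latent into non-overlapping p x p patch tokens."""
    H, W, C = latent.shape
    assert H % p == 0 and W % p == 0
    blocks = latent.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(-1, p * p * C)   # (num_tokens, token_dim)

latent = np.arange(32 * 32 * 4, dtype=float).reshape(32, 32, 4)
tokens = patchify(latent, 2)   # patch size 2, as in DiT-XL/2: 16x16 = 256 tokens
```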

Python Simulation: 1D Diffusion Model

We implement a complete 1D diffusion model from scratch using NumPy. A small MLP with sinusoidal time embeddings learns to denoise a mixture-of-Gaussians distribution. We visualise the noise schedule, training loss, and compare the original vs DDPM-sampled distributions.
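The full script is too long to reproduce here, but its core training loop can be condensed as follows — a sketch with our own layer sizes and learning rate, using a two-layer tanh MLP with manual backprop and a plain scaled timestep in place of the script's network and sinusoidal embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution: 1D mixture of two Gaussians
def sample_data(n):
    centers = rng.choice([-2.0, 2.0], size=n)
    return centers + 0.3 * rng.standard_normal(n)

# Noise schedule
T = 200
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# Two-layer tanh MLP eps_theta([x_t, t/T]) trained by manual backprop
H = 64
W1 = 0.5 * rng.standard_normal((H, 2)); b1 = np.zeros(H)
W2 = 0.5 * rng.standard_normal((1, H)); b2 = np.zeros(1)

def train_step(x0, lr=1e-3):
    global W1, b1, W2, b2
    n = len(x0)
    t = rng.integers(0, T, size=n)
    eps = rng.standard_normal(n)
    ab = alpha_bars[t]
    xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps   # forward-process sample
    inp = np.stack([xt, t / T])                     # (2, n) network input
    h = np.tanh(W1 @ inp + b1[:, None])             # hidden layer
    pred = (W2 @ h + b2[:, None])[0]                # predicted noise
    # Backprop of the L_simple MSE loss
    d = 2.0 * (pred - eps) / n
    gW2 = d[None, :] @ h.T
    gb2 = np.array([d.sum()])
    dh = (W2.T @ d[None, :]) * (1 - h ** 2)
    gW1 = dh @ inp.T
    gb1 = dh.sum(axis=1)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return float(np.mean((pred - eps) ** 2))

losses = [train_step(sample_data(256)) for _ in range(5000)]
# The noise-prediction loss should fall steadily towards ~1 (the variance of eps)
```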
