Chapter 21: Diffusion Models
Diffusion models define a forward process that gradually destroys data by adding noise, then learn a reverse process that reconstructs data from noise. We derive the closed-form forward marginal, simplify the variational lower bound to a noise-prediction objective, and connect the framework to score matching.
Forward & Reverse Processes
1. Forward Process
The forward process is a fixed (non-learned) Markov chain that gradually adds Gaussian noise:
\[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t\mathbf{I}\big), \qquad q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \]
where \( \{\beta_t\}_{t=1}^T \) is a fixed variance schedule (e.g. linear from \( 10^{-4} \) to \( 0.02 \)). Define \( \alpha_t = 1 - \beta_t \) and \( \bar{\alpha}_t = \prod_{s=1}^t \alpha_s \).
Closed-Form Marginal q(x_t | x_0)
By induction. At step 1:
\[ \mathbf{x}_1 = \sqrt{\alpha_1}\,\mathbf{x}_0 + \sqrt{1-\alpha_1}\,\boldsymbol{\epsilon}_1, \qquad \boldsymbol{\epsilon}_1 \sim \mathcal{N}(\mathbf{0},\mathbf{I}). \]
At step 2:
\[ \mathbf{x}_2 = \sqrt{\alpha_2}\,\mathbf{x}_1 + \sqrt{1-\alpha_2}\,\boldsymbol{\epsilon}_2 = \sqrt{\alpha_2\alpha_1}\,\mathbf{x}_0 + \sqrt{\alpha_2(1-\alpha_1)}\,\boldsymbol{\epsilon}_1 + \sqrt{1-\alpha_2}\,\boldsymbol{\epsilon}_2. \]
Using the property that the sum of independent Gaussians \( \mathcal{N}(0,\sigma_1^2) + \mathcal{N}(0,\sigma_2^2) \sim \mathcal{N}(0,\sigma_1^2+\sigma_2^2) \), the noise terms combine (note \( \alpha_2(1-\alpha_1) + (1-\alpha_2) = 1-\alpha_2\alpha_1 = 1-\bar{\alpha}_2 \)):
\[ \mathbf{x}_2 = \sqrt{\bar{\alpha}_2}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_2}\,\bar{\boldsymbol{\epsilon}}, \qquad \bar{\boldsymbol{\epsilon}} \sim \mathcal{N}(\mathbf{0},\mathbf{I}). \]
By induction, at arbitrary \( t \):
\[ q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big). \]
Or equivalently via reparameterisation: \( \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon} \), \( \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) \).
As \( t \to T \), \( \bar{\alpha}_T \to 0 \), so \( \mathbf{x}_T \approx \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \) — the data has been fully destroyed.
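The closed-form marginal means any \( \mathbf{x}_t \) can be drawn in a single step, without simulating the chain. A minimal NumPy sketch (the schedule values follow the text; the helper name `q_sample` is just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear variance schedule from the text: beta_t from 1e-4 to 0.02 over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot via the closed-form marginal."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal(5)            # toy "data" point
x_mid = q_sample(x0, 500, rng)         # partially noised
x_T = q_sample(x0, T - 1, rng)         # nearly pure noise

print(f"alpha_bar_T = {alpha_bars[-1]:.2e}")   # close to 0, so x_T ~ N(0, I)
```

This one-shot sampling is what makes training efficient: each minibatch can use a different random \( t \) without running the chain forward.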
2. Reverse Process
The true reverse \( q(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \) is intractable (it requires knowledge of the data distribution). We instead learn a parametric approximation:
\[ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big). \]
The reverse conditioned on \( \mathbf{x}_0 \) is tractable by Bayes' rule:
\[ q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\ \tilde{\beta}_t\mathbf{I}\big), \]
where the posterior mean and variance are:
\[ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t. \]
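These posterior formulas follow from standard Gaussian conjugacy (combining the precisions of \( q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \), viewed as a function of \( \mathbf{x}_{t-1} \), with the prior \( q(\mathbf{x}_{t-1} \mid \mathbf{x}_0) \)). A quick scalar numeric check, under the linear schedule assumed in the text:

```python
import numpy as np

# Schedule as in the text: alpha_t = 1 - beta_t, alpha_bar_t = prod of alpha_s.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500  # arbitrary interior step (0-indexed)
beta_t, alpha_t = betas[t], alphas[t]
abar_t, abar_tm1 = alpha_bars[t], alpha_bars[t - 1]

# Posterior variance as stated in the text.
tilde_beta = (1.0 - abar_tm1) / (1.0 - abar_t) * beta_t

# Same quantity from Gaussian conjugacy: add the likelihood and prior precisions.
precision = alpha_t / beta_t + 1.0 / (1.0 - abar_tm1)
assert np.isclose(tilde_beta, 1.0 / precision)

# Check the posterior mean the same way for arbitrary scalar x0, x_t.
x0, xt = 0.7, -1.2
mean_conj = (np.sqrt(alpha_t) * xt / beta_t
             + np.sqrt(abar_tm1) * x0 / (1.0 - abar_tm1)) / precision
mean_text = (np.sqrt(abar_tm1) * beta_t * x0
             + np.sqrt(alpha_t) * (1.0 - abar_tm1) * xt) / (1.0 - abar_t)
assert np.isclose(mean_conj, mean_text)
print("posterior mean and variance formulas agree")
```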
3. Training Objective — Full Derivation
3.1 Variational Lower Bound (ELBO)
We maximise the log-likelihood \( \log p_\theta(\mathbf{x}_0) \). Since the forward process defines a joint \( q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) \), we apply the ELBO (Jensen's inequality):
\[ \log p_\theta(\mathbf{x}_0) \ge \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\!\left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \right] = -\mathcal{L}_{\mathrm{VLB}}. \]
Expand the joint probabilities and collect KL divergences at each step:
\[ \mathcal{L}_{\mathrm{VLB}} = \underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\big)}_{\mathcal{L}_T} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big)}_{\mathcal{L}_{t-1}} \underbrace{-\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}_{\mathcal{L}_0}. \]
Since both distributions in \( \mathcal{L}_{t-1} \) are Gaussian, the KL has a closed form. Parameterising \( \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right) \) and substituting the reparameterisation \( \mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon \):
\[ \mathcal{L}_{t-1} = \mathbb{E}_{\mathbf{x}_0,\,\boldsymbol\epsilon}\!\left[ \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)} \left\| \boldsymbol\epsilon - \boldsymbol\epsilon_\theta(\mathbf{x}_t, t) \right\|^2 \right]. \]
Ho et al. (2020) dropped the prefactor and simplified the sum to a single uniformly sampled term:
\[ \mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{t \sim \mathcal{U}\{1,\dots,T\},\, \mathbf{x}_0,\, \boldsymbol\epsilon}\!\left[ \left\| \boldsymbol\epsilon - \boldsymbol\epsilon_\theta\big(\sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,\ t\big) \right\|^2 \right]. \]
The network \( \boldsymbol\epsilon_\theta \) simply predicts the noise added to \( \mathbf{x}_0 \). This is the DDPM training objective.
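A single training step of this objective reduces to a few lines. The sketch below uses a zero-output placeholder for \( \boldsymbol\epsilon_\theta \) (in practice a neural network, as in the simulation at the end of this chapter); `simple_loss` is an illustrative name, not a library function:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # Stand-in for the trained noise-prediction network (e.g. an MLP or U-Net).
    return np.zeros_like(x_t)

def simple_loss(x0, rng):
    """One Monte-Carlo estimate of L_simple for a single data point."""
    t = rng.integers(0, T)                           # uniformly sampled timestep (0-indexed)
    eps = rng.standard_normal(x0.shape)              # eps ~ N(0, I)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)   # ||eps - eps_theta(x_t, t)||^2

x0 = rng.standard_normal(8)
print("loss:", simple_loss(x0, rng))
```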
4. Connection to Score Matching
The score function of a distribution is the gradient of its log-density:
\[ s(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}). \]
For the noisy distribution \( q(\mathbf{x}_t \mid \mathbf{x}_0) \), the score is:
\[ \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar\alpha_t}\,\mathbf{x}_0}{1-\bar\alpha_t} = -\frac{\boldsymbol\epsilon}{\sqrt{1-\bar\alpha_t}}. \]
Therefore predicting the noise \( \boldsymbol\epsilon_\theta \) is equivalent to estimating the (scaled negative) score: \( \boldsymbol\epsilon_\theta(\mathbf{x}_t, t) \approx -\sqrt{1-\bar\alpha_t}\,\nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t) \). This connects DDPM to score-based generative modelling (Song & Ermon, 2019) and the stochastic differential equation (SDE) framework (Song et al., 2021).
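The score identity for \( q(\mathbf{x}_t \mid \mathbf{x}_0) \) can be verified numerically in the scalar case by finite-differencing the Gaussian log-density (a sanity check under the text's linear schedule):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

t, x0 = 300, 0.5
abar = alpha_bars[t]

def log_q(x_t):
    # log N(x_t; sqrt(abar) * x0, 1 - abar), dropping the constant term
    return -0.5 * (x_t - np.sqrt(abar) * x0) ** 2 / (1.0 - abar)

x_t = 1.3
# The eps that would have produced this x_t under the reparameterisation.
eps_implied = (x_t - np.sqrt(abar) * x0) / np.sqrt(1.0 - abar)
score_analytic = -eps_implied / np.sqrt(1.0 - abar)

# Central finite difference of the log-density.
h = 1e-6
score_numeric = (log_q(x_t + h) - log_q(x_t - h)) / (2 * h)
print(score_analytic, score_numeric)   # the two should agree closely
```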
5. Sampling: DDPM & DDIM
DDPM Sampling
Sample \( \mathbf{x}_T \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \), then iterate for \( t = T, \dots, 1 \):
\[ \mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t) \right) + \sigma_t\,\mathbf{z}, \]
where \( \mathbf{z} \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \) for \( t > 1 \), and \( \mathbf{z} = \mathbf{0} \) for \( t = 1 \). Sampling requires \( T \) network evaluations (slow: typically \( T = 1000 \)).
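The ancestral sampling loop can be sketched as follows, again with a placeholder \( \boldsymbol\epsilon_\theta \) (a trained network in practice) and the common choice \( \sigma_t^2 = \beta_t \):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t):
    # Placeholder for a trained noise-prediction network;
    # with zeros the samples are meaningless but the loop structure is real.
    return np.zeros_like(x_t)

def ddpm_sample(shape, rng):
    """Ancestral DDPM sampling: x_T ~ N(0, I), then T reverse steps."""
    x = rng.standard_normal(shape)
    for t in range(T - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_theta(x, t)) \
               / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z   # sigma_t^2 = beta_t
    return x

sample = ddpm_sample((4,), rng)
print(sample.shape)
```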
DDIM (Deterministic Sampler)
Song et al. (2020) derive a non-Markovian process with the same marginals but a deterministic reverse update:
\[ \mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\underbrace{\left( \frac{\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)}{\sqrt{\bar\alpha_t}} \right)}_{\text{predicted } \mathbf{x}_0} + \sqrt{1-\bar\alpha_{t-1}}\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t). \]
DDIM can skip steps, enabling 50-step or 10-step generation with minimal quality loss.
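Step skipping falls out naturally: the update only needs \( \bar\alpha \) at the two timesteps involved, so any decreasing subsequence of timesteps works. A sketch with a placeholder network (`ddim_sample` and the uniform subsequence choice are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # Placeholder for a trained noise-prediction network.
    return np.zeros_like(x_t)

def ddim_sample(shape, rng, n_steps=50):
    """Deterministic DDIM sampling over a subsequence of n_steps timesteps."""
    ts = np.linspace(T - 1, 0, n_steps).round().astype(int)
    x = rng.standard_normal(shape)
    for i in range(len(ts) - 1):
        t, t_prev = ts[i], ts[i + 1]
        eps = eps_theta(x, t)
        # Predict x_0, then jump directly to the marginal at t_prev.
        x0_pred = (x - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
        x = np.sqrt(alpha_bars[t_prev]) * x0_pred + np.sqrt(1.0 - alpha_bars[t_prev]) * eps
    return x

sample = ddim_sample((4,), rng, n_steps=50)
print(sample.shape)
```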
Classifier-Free Guidance
To improve sample quality and controllability, Ho & Salimans (2022) jointly train a conditional model \( \boldsymbol\epsilon_\theta(\mathbf{x}_t, t, c) \) and an unconditional model \( \boldsymbol\epsilon_\theta(\mathbf{x}_t, t, \varnothing) \) (by randomly dropping the condition \( c \) during training). At inference, the guided prediction is:
\[ \tilde{\boldsymbol\epsilon}_\theta(\mathbf{x}_t, t, c) = (1+w)\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t, c) - w\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t, \varnothing). \]
Guidance weight \( w > 0 \) amplifies the conditional direction. Larger \( w \) yields more faithful samples at the cost of diversity.
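The guidance combination itself is a one-liner applied to the two network outputs at every sampling step (`guided_eps` is an illustrative name):

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate along the conditional direction."""
    return (1.0 + w) * eps_cond - w * eps_uncond
    # Equivalently: eps_cond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 0.0])   # conditional prediction (toy values)
eps_u = np.array([0.5, 0.0])   # unconditional prediction
print(guided_eps(eps_c, eps_u, w=2.0))  # moves further along eps_c - eps_u
```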
6. Modern Architectures
U-Net (DDPM / Stable Diffusion)
- Encoder–decoder with skip connections at matching resolutions
- Residual blocks + group normalisation
- Self-attention layers at lower resolutions for global context
- Cross-attention to condition on text embeddings
- Sinusoidal time embeddings injected via FiLM scaling
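The sinusoidal time embedding mentioned above uses the same construction as Transformer positional encodings. A minimal sketch (the function name and `max_period` default are illustrative, following common DDPM implementations):

```python
import numpy as np

def timestep_embedding(t, dim=64, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a dim-vector."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(500, dim=64)
print(emb.shape)  # (64,)
```

In a U-Net this vector is passed through a small MLP and injected into each residual block, e.g. via FiLM-style scale and shift.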
Diffusion Transformer (DiT)
- Patchify latent: divide image into patches, embed as tokens
- Standard Transformer blocks (self-attention + MLP)
- Adaptive LayerNorm: time and class condition via adaLN-Zero
- Scales better than U-Net (DiT-XL/2 → SOTA FID)
- Used in Sora, Stable Diffusion 3, Flux
Python Simulation: 1D Diffusion Model
We implement a complete 1D diffusion model from scratch using NumPy. A small MLP with sinusoidal time embeddings learns to denoise a mixture-of-Gaussians distribution. We visualise the noise schedule, training loss, and compare the original vs DDPM-sampled distributions.