Markov Chain Monte Carlo
Constructing Markov chains whose stationary distributions solve intractable integration problems
Historical Context
The story of Markov chain Monte Carlo begins at Los Alamos Scientific Laboratory (now Los Alamos National Laboratory) in 1953, when Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, and Edward Teller published their landmark paper on equation-of-state calculations for systems of interacting particles. The algorithm they proposed—now known as the Metropolis algorithm—used a cleverly constructed random walk to sample from the Boltzmann distribution of a many-body system, circumventing the curse of dimensionality that made direct numerical integration infeasible.
In 1970, W. Keith Hastings generalized the Metropolis algorithm by allowing asymmetric proposal distributions, producing what we now call the Metropolis-Hastings algorithm. This generalization went largely unnoticed by the statistics community for over a decade. The revolution came in 1984, when Stuart Geman and Donald Geman introduced Gibbs sampling in the context of Bayesian image restoration, and independently in 1990, when Adrian Smith and Alan Gelfand demonstrated that MCMC methods could transform Bayesian computation. Their work showed that complex posterior distributions that had been analytically intractable could now be routinely explored by simulation. Today, MCMC is the computational backbone of modern Bayesian statistics, with extensions such as Hamiltonian Monte Carlo powering probabilistic programming frameworks like Stan and PyMC.
3.1 Monte Carlo Integration
Monte Carlo methods exploit the law of large numbers to approximate integrals by sample averages. If we wish to compute $I = \int h(x)\,\pi(x)\,dx$ where $\pi$ is a probability density, and we can draw $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \pi$, then the Monte Carlo estimator
$$\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} h(X_i)$$
converges almost surely to $I$ by the strong law of large numbers, with a standard error of order $O(n^{-1/2})$ regardless of the dimension of the integration domain.
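As a minimal sketch of this estimator (the target, test function, and sample size here are illustrative choices, not from the lab), consider estimating $\mathbb{E}[X^2] = 1$ for $X \sim \mathcal{N}(0,1)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate I = E[h(X)] for X ~ N(0, 1) with h(x) = x^2 (true value: 1).
n = 100_000
x = rng.standard_normal(n)
h = x**2

i_hat = h.mean()                      # Monte Carlo estimate of I
se = h.std(ddof=1) / np.sqrt(n)      # standard error, order O(n^{-1/2})
print(f"estimate = {i_hat:.4f} +/- {se:.4f}")
```

Doubling the accuracy requires quadrupling $n$, but the $O(n^{-1/2})$ rate is unchanged if $x$ were a vector in $\mathbb{R}^{100}$.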
Definition: Importance Sampling
When direct sampling from $\pi$ is difficult, we introduce a proposal density $q$ such that $q(x) > 0$ whenever $h(x)\pi(x) \neq 0$. Then:
$$I = \int h(x)\,\frac{\pi(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_q\!\left[h(X)\,\frac{\pi(X)}{q(X)}\right] \approx \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(X_i)}{q(X_i)}\,h(X_i), \qquad X_i \stackrel{\text{iid}}{\sim} q.$$
The ratio $w(x) = \pi(x)/q(x)$ is called the importance weight. The self-normalized importance sampling estimator is $\hat{I} = \sum_i w_i h(X_i) / \sum_i w_i$, which is consistent even when $\pi$ is known only up to a normalizing constant.
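A minimal sketch of the self-normalized estimator, using an illustrative unnormalized target (an $\mathcal{N}(3,1)$ density with its constant dropped) and a wide normal proposal; these choices are assumptions for the example, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target known only up to a constant: unnormalized N(3, 1) density.
log_pi = lambda x: -0.5 * (x - 3.0) ** 2
# Proposal q = N(0, 3^2), positive wherever pi is.
n = 200_000
x = rng.normal(0.0, 3.0, n)
log_q = -0.5 * (x / 3.0) ** 2 - np.log(3.0)   # shared constants cancel below
log_w = log_pi(x) - log_q
w = np.exp(log_w - log_w.max())               # stabilize before exponentiating

# Self-normalized estimator of E_pi[X] (true value: 3); the unknown
# normalizing constant of pi cancels in the ratio.
i_hat = np.sum(w * x) / np.sum(w)
print(f"self-normalized IS estimate: {i_hat:.3f}")
```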
Definition: Rejection Sampling
Given a target density $\pi(x)$ and a proposal density $q(x)$ with envelope constant $M$ satisfying $\pi(x) \leq M\,q(x)$ for all $x$, rejection sampling proceeds as follows: draw $X \sim q$ and $U \sim \text{Uniform}(0,1)$; accept $X$ if $U \leq \pi(X)/(M\,q(X))$. The accepted samples are exactly distributed as $\pi$, with acceptance probability $1/M$. In high dimensions, $M$ grows exponentially, making rejection sampling impractical and motivating the move to MCMC.
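The accept/reject step can be sketched in a few lines; the Beta(2, 2) target and uniform proposal here are illustrative assumptions, chosen so the envelope constant is exactly $M = 1.5$ (the target's mode):

```python
import numpy as np

rng = np.random.default_rng(2)

# Target: Beta(2, 2), pi(x) = 6x(1-x) on [0, 1]; proposal q = Uniform(0, 1).
# Envelope: pi(x) <= M q(x) with M = 1.5, since max pi = pi(1/2) = 1.5.
M = 1.5
n = 100_000
x = rng.uniform(size=n)
u = rng.uniform(size=n)
accepted = x[u <= 6 * x * (1 - x) / M]

rate = len(accepted) / n
print(f"acceptance rate: {rate:.3f}  (theory: 1/M = {1/M:.3f})")
print(f"sample mean: {accepted.mean():.3f}  (theory: 0.5)")
```

The observed acceptance rate matches $1/M = 2/3$; a looser envelope (larger $M$) wastes proportionally more draws, which is exactly the failure mode in high dimensions.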
3.2 Markov Chains on General State Spaces
A Markov chain $\{X_t\}_{t \geq 0}$ on a measurable space $(\mathcal{X}, \mathcal{B})$ is defined by a transition kernel $K(x, A) = P(X_{t+1} \in A \mid X_t = x)$ for all $x \in \mathcal{X}$ and $A \in \mathcal{B}$. The chain satisfies the Markov property: conditioned on the present state, the future is independent of the past.
Definition: Stationarity
A distribution $\pi$ is stationary (or invariant) for the kernel $K$ if:
$$\pi(A) = \int_{\mathcal{X}} K(x, A)\,\pi(dx) \quad \text{for all } A \in \mathcal{B}.$$
If the chain starts in $\pi$, it remains in $\pi$ at all subsequent times.
Definition: Detailed Balance (Reversibility)
A kernel $K$ satisfies detailed balance with respect to $\pi$ if:
$$\pi(dx)\,K(x, dy) = \pi(dy)\,K(y, dx).$$
Detailed balance is a sufficient condition for stationarity. In the discrete case, this becomes $\pi_i\,K_{ij} = \pi_j\,K_{ji}$ for all states $i, j$. A chain satisfying detailed balance is called reversible.
Definition: Ergodicity
A Markov chain with stationary distribution $\pi$ is ergodic if for $\pi$-almost all starting points $x$ and all measurable functions $h$ with $\mathbb{E}_\pi[|h|] < \infty$:
$$\frac{1}{n}\sum_{t=1}^{n} h(X_t) \xrightarrow{\text{a.s.}} \mathbb{E}_\pi[h(X)].$$
Sufficient conditions include $\pi$-irreducibility and aperiodicity. Harris recurrence strengthens this to convergence from every starting point. The ergodic theorem for Markov chains is the foundation that guarantees MCMC estimators converge.
3.3 The Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm constructs a reversible Markov chain whose stationary distribution is a prescribed target $\pi$. Given the current state $X_t = x$, the algorithm proceeds as follows:
- Propose: Draw a candidate $y \sim q(\cdot \mid x)$ from the proposal distribution.
- Compute the acceptance probability: $$\alpha(x, y) = \min\!\left(1,\;\frac{\pi(y)\,q(x \mid y)}{\pi(x)\,q(y \mid x)}\right)$$
- Accept or reject: Set $X_{t+1} = y$ with probability $\alpha(x, y)$, otherwise set $X_{t+1} = x$.
The crucial insight is that $\pi$ need only be known up to a normalizing constant, since the ratio $\pi(y)/\pi(x)$ cancels any unknown normalization. This makes the algorithm ideal for Bayesian posterior computation where $\pi(\theta \mid \mathbf{y}) \propto L(\mathbf{y} \mid \theta)\,p(\theta)$.
Theorem: Correctness of Metropolis-Hastings
The transition kernel of the Metropolis-Hastings chain satisfies detailed balance with respect to $\pi$. If, in addition, the chain is $\pi$-irreducible and aperiodic, then for any initial distribution $\mu_0$:
$$\|\mu_0 K^n - \pi\|_{\text{TV}} \to 0 \quad \text{as } n \to \infty,$$
where $\|\cdot\|_{\text{TV}}$ denotes total variation distance. The proof follows from verifying $\pi(x)\,q(y \mid x)\,\alpha(x, y) = \pi(y)\,q(x \mid y)\,\alpha(y, x)$ for all $x, y$.
Common Proposal Distributions
Random walk Metropolis: $q(y \mid x) = q(y - x)$ with a symmetric kernel (e.g., $y = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$). Since $q$ is symmetric, the acceptance ratio simplifies to $\alpha = \min(1, \pi(y)/\pi(x))$. The scale $\sigma$ must be tuned: too small yields high acceptance but slow exploration; too large yields frequent rejection. The celebrated result of Roberts, Gelman, and Gilks (1997) shows that the optimal acceptance rate is approximately 23.4% in high dimensions.
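A minimal random walk Metropolis sketch (separate from the lab code in Section 3.6; the standard normal target and step scale $\sigma = 2.4$ are illustrative assumptions), working on the log scale for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(3)

def rw_metropolis(log_pi, x0, sigma, n):
    """Random walk Metropolis with N(0, sigma^2) increments."""
    x = x0
    chain = np.empty(n)
    accepts = 0
    for t in range(n):
        y = x + sigma * rng.standard_normal()   # symmetric proposal
        # Symmetric q cancels: alpha = min(1, pi(y)/pi(x)).
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x, accepts = y, accepts + 1
        chain[t] = x                             # on rejection, repeat x
    return chain, accepts / n

# Target: standard normal, known only up to a constant.
chain, rate = rw_metropolis(lambda x: -0.5 * x * x, 0.0, 2.4, 50_000)
print(f"acceptance rate: {rate:.2f}, sample mean: {chain.mean():.3f}")
```

Note that rejected proposals still contribute a copy of the current state to the chain; dropping them would bias the sampler.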
Independence sampler: $q(y \mid x) = q(y)$ does not depend on the current state. This works well only when $q$ closely approximates $\pi$; otherwise the chain can become stuck. If $\pi(y)/q(y) \leq M$ for all $y$, then the chain is uniformly ergodic with geometric convergence rate $1 - 1/M$.
3.4 Gibbs Sampling
Gibbs sampling is a special case of Metropolis-Hastings where the proposal is always accepted. For a target distribution $\pi(x_1, x_2, \ldots, x_p)$, the systematic scan Gibbs sampler iterates through each component, sampling from its full conditional distribution:
At iteration $t$, for $j = 1, \ldots, p$:
$$x_j^{(t)} \sim \pi\!\left(x_j \,\middle|\, x_1^{(t)}, \ldots, x_{j-1}^{(t)},\, x_{j+1}^{(t-1)}, \ldots, x_p^{(t-1)}\right).$$
Each update can be viewed as a Metropolis-Hastings step with proposal equal to the full conditional, yielding acceptance probability 1. The full conditional of $x_j$ is proportional to the joint as a function of $x_j$ alone.
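A minimal sketch for a bivariate normal target with correlation $\rho$ (an illustrative choice, not the lab code), where both full conditionals are univariate normals:

```python
import numpy as np

rng = np.random.default_rng(4)

# Gibbs sampler for a standard bivariate normal with correlation rho:
# full conditionals are x1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically.
rho = 0.8
n = 50_000
chain = np.empty((n, 2))
x1, x2 = 0.0, 0.0
sd = np.sqrt(1 - rho**2)
for t in range(n):
    x1 = rng.normal(rho * x2, sd)   # draw from pi(x1 | x2)
    x2 = rng.normal(rho * x1, sd)   # draw from pi(x2 | x1), using the new x1
    chain[t] = x1, x2

corr = np.corrcoef(chain.T)[0, 1]
print(f"empirical correlation: {corr:.3f}  (target: {rho})")
```

The stronger the correlation $\rho$, the slower this coordinate-wise scheme mixes, since each conditional draw can move only a short distance along the narrow ridge of the joint density.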
Data Augmentation
The data augmentation algorithm of Tanner and Wong (1987) introduces latent variables $Z$ to simplify computation. If $\pi(\theta \mid \mathbf{y})$ is intractable but $\pi(\theta \mid \mathbf{y}, Z)$ and $\pi(Z \mid \mathbf{y}, \theta)$ are both tractable, we alternate between sampling $Z$ and $\theta$. Classic examples include the EM algorithm's stochastic counterpart, missing data problems, and mixture models where the latent allocation variables make the full conditionals conjugate.
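A small sketch of the mixture-model case: all concrete choices here (known component means $\pm 2$, unit variances, a Beta(1, 1) prior on the mixing weight $\theta$) are illustrative assumptions. The latent allocations $Z$ make both full conditionals standard distributions:

```python
import numpy as np

rng = np.random.default_rng(9)

# Two-component normal mixture with known means (+2, -2), unit variances,
# and unknown weight theta ~ Beta(1, 1). Augmenting with allocations Z_i
# gives a two-block Gibbs sampler: Z | theta, y and theta | Z.
true_theta = 0.7
y = np.where(rng.uniform(size=500) < true_theta,
             rng.normal(2.0, 1.0, 500), rng.normal(-2.0, 1.0, 500))

theta, draws = 0.5, []
for t in range(3_000):
    # pi(Z_i = 1 | theta, y_i): Bernoulli via Bayes' rule on the components.
    p1 = theta * np.exp(-0.5 * (y - 2.0) ** 2)
    p0 = (1 - theta) * np.exp(-0.5 * (y + 2.0) ** 2)
    z = rng.uniform(size=len(y)) < p1 / (p0 + p1)
    # pi(theta | Z) = Beta(1 + sum z, 1 + sum(1 - z))  (conjugate update).
    theta = rng.beta(1 + z.sum(), 1 + (~z).sum())
    if t >= 500:                      # discard burn-in
        draws.append(theta)

post_mean = float(np.mean(draws))
print(f"posterior mean of theta: {post_mean:.3f}  (truth: {true_theta})")
```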
Rao-Blackwellization
Given MCMC output $(X_1^{(t)}, X_2^{(t)})$ from a joint distribution $\pi(x_1, x_2)$, the Rao-Blackwell theorem implies that the conditional expectation estimator:
$$\hat{\mu}_{\text{RB}} = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}\!\left[h(X_1) \mid X_2^{(t)}\right]$$
has variance no greater than the naive estimator $\hat{\mu} = \frac{1}{n}\sum_t h(X_1^{(t)})$. When the conditional expectation is available in closed form, Rao-Blackwellization can dramatically reduce Monte Carlo variance at negligible computational cost.
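A sketch of the variance reduction, using the bivariate normal where $\mathbb{E}[X_1 \mid X_2] = \rho X_2$ in closed form. For simplicity this uses iid draws rather than MCMC output; the Rao-Blackwell comparison is the same in spirit:

```python
import numpy as np

rng = np.random.default_rng(5)

# Bivariate normal with correlation rho; E[X1 | X2] = rho * X2 exactly.
rho, n, reps = 0.8, 500, 2_000
naive, rb = np.empty(reps), np.empty(reps)
for r in range(reps):
    x2 = rng.standard_normal(n)
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    naive[r] = x1.mean()          # naive estimator of E[X1] = 0
    rb[r] = (rho * x2).mean()     # Rao-Blackwellized: average E[X1 | X2]

print(f"var(naive) = {naive.var():.5f}, var(RB) = {rb.var():.5f}")
```

Here the variances are $1/n$ versus $\rho^2/n$, so Rao-Blackwellization cuts the Monte Carlo variance by the factor $\rho^2$.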
3.5 Diagnostics and Advanced Methods
Since MCMC produces correlated samples whose distribution is correct only asymptotically, careful diagnostics are essential to ensure the chain has converged and the Monte Carlo error is acceptably small.
Burn-in and Thinning
The initial portion of the chain, before it has reached its stationary distribution, is discarded as burn-in. The length of burn-in is assessed by examining trace plots and formal diagnostics. Thinning—retaining every $k$-th sample—reduces storage but does not improve estimation efficiency per unit of computation. Link and Eaton (2012) showed that thinning is suboptimal compared to using the full chain, though it remains common under practical memory constraints.
Gelman-Rubin Diagnostic ($\hat{R}$)
Run $m \geq 2$ chains of length $n$ from overdispersed starting values. Compute the within-chain variance $W$ and between-chain variance $B$:
$$W = \frac{1}{m}\sum_{j=1}^{m} s_j^2, \qquad B = \frac{n}{m-1}\sum_{j=1}^{m} \left(\bar{x}_{j\cdot} - \bar{x}_{\cdot\cdot}\right)^2, \qquad \hat{R} = \sqrt{\frac{\frac{n-1}{n}W + \frac{1}{n}B}{W}},$$
where $s_j^2$ and $\bar{x}_{j\cdot}$ are the sample variance and mean of chain $j$, and $\bar{x}_{\cdot\cdot}$ is the grand mean.
As $n \to \infty$, $\hat{R} \to 1$. Values of $\hat{R} < 1.01$ (or the more lenient threshold of 1.1) suggest convergence. The split-$\hat{R}$ variant of Vehtari et al. (2021) additionally splits each chain in half, providing better sensitivity to non-stationarity.
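The computation above can be sketched directly (a basic $\hat{R}$, without the split-chain refinement; the synthetic "mixed" and "unmixed" chains are illustrative):

```python
import numpy as np

def gelman_rubin(chains):
    """Basic R-hat from an (m, n) array of m chains of length n."""
    m, n = chains.shape
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    B = n * means.var(ddof=1)                 # between-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(6)
mixed = rng.standard_normal((4, 2_000))          # 4 chains, same target
unmixed = mixed + np.array([0, 0, 3, 3])[:, None]  # two chains offset by 3

print(f"R-hat (mixed):   {gelman_rubin(mixed):.3f}")
print(f"R-hat (unmixed): {gelman_rubin(unmixed):.3f}")
```

Well-mixed chains give $\hat{R} \approx 1$; the offset chains inflate the between-chain variance $B$ and push $\hat{R}$ well above the 1.01 threshold.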
Effective Sample Size
The effective sample size accounts for autocorrelation in the chain:
$$n_{\text{eff}} = \frac{n}{1 + 2\sum_{k=1}^{\infty} \rho_k},$$
where $\rho_k$ is the lag-$k$ autocorrelation. In practice, the sum is truncated when the autocorrelation estimates become noisy. An $n_{\text{eff}}$ much smaller than $n$ indicates high autocorrelation and inefficient sampling.
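A minimal implementation with a simple truncation rule (cutting the sum once $\hat{\rho}_k$ drops below 0.05 is one pragmatic choice among several), tested on an AR(1) chain whose true efficiency is $(1-\phi)/(1+\phi) \approx 0.053$ at $\phi = 0.9$:

```python
import numpy as np

def effective_sample_size(x, max_lag=200):
    """n_eff = n / (1 + 2 * sum_k rho_k), truncating noisy small lags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    denom = np.dot(x, x)
    total = 0.0
    for k in range(1, max_lag):
        rho_k = np.dot(x[:-k], x[k:]) / denom   # lag-k autocorrelation
        if rho_k < 0.05:                        # truncation rule
            break
        total += rho_k
    return n / (1 + 2 * total)

# AR(1) chain with phi = 0.9: rho_k = 0.9^k, so n_eff << n.
rng = np.random.default_rng(7)
phi, n = 0.9, 50_000
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()

n_eff = effective_sample_size(x)
print(f"n = {n}, estimated n_eff = {n_eff:.0f}")
```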
Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) augments the parameter space $\theta$ with auxiliary momentum variables $p$ and defines the Hamiltonian $H(\theta, p) = -\log\pi(\theta) + \frac{1}{2}p^T M^{-1} p$. Proposals are generated by simulating Hamilton's equations using the leapfrog integrator for $L$ steps of size $\epsilon$. Writing $U(\theta) = -\log\pi(\theta)$, one leapfrog step is:
$$p_{1/2} = p_0 - \tfrac{\epsilon}{2}\,\nabla U(\theta_0), \qquad \theta_1 = \theta_0 + \epsilon\, M^{-1} p_{1/2}, \qquad p_1 = p_{1/2} - \tfrac{\epsilon}{2}\,\nabla U(\theta_1).$$
The symplectic integrator preserves volume and is time-reversible, so the Metropolis acceptance step corrects only for discretization error. HMC explores the target distribution far more efficiently than random walk Metropolis in high dimensions because proposals follow the geometry of the posterior. The No-U-Turn Sampler (NUTS) of Hoffman and Gelman (2014) eliminates the need to hand-tune $L$ and $\epsilon$ by adaptively building a trajectory tree.
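A minimal one-dimensional HMC sketch for a standard normal target (so $U(\theta) = \theta^2/2$, $\nabla U(\theta) = \theta$, and $M = 1$); the step size, path length, and iteration count are illustrative tuning choices:

```python
import numpy as np

rng = np.random.default_rng(8)

def hmc_step(theta, eps, L):
    """One HMC transition for pi = N(0, 1): U(t) = t^2/2, grad U = t."""
    p = rng.standard_normal()               # resample momentum, p ~ N(0, 1)
    theta_new, p_new = theta, p
    # Leapfrog integration of Hamilton's equations.
    p_new -= 0.5 * eps * theta_new          # initial half step for momentum
    for i in range(L):
        theta_new += eps * p_new            # full step for position
        if i < L - 1:
            p_new -= eps * theta_new        # full step for momentum
    p_new -= 0.5 * eps * theta_new          # final half step for momentum
    # Metropolis correction for the leapfrog discretization error.
    h_old = 0.5 * theta**2 + 0.5 * p**2
    h_new = 0.5 * theta_new**2 + 0.5 * p_new**2
    return theta_new if np.log(rng.uniform()) < h_old - h_new else theta

theta, draws = 0.0, np.empty(20_000)
for t in range(len(draws)):
    theta = hmc_step(theta, eps=0.2, L=10)
    draws[t] = theta
print(f"mean = {draws.mean():.3f}, var = {draws.var():.3f}")
```

Because the integrator nearly conserves $H$, almost every proposal is accepted despite traveling a long distance in each step, which is the source of HMC's efficiency relative to random walk Metropolis.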
3.6 Computational Lab
The following simulation implements the Metropolis-Hastings algorithm for a normal posterior, a Gibbs sampler for a bivariate normal distribution, and provides convergence diagnostics including trace plots, autocorrelation analysis, and the Gelman-Rubin statistic.
MCMC Methods: Metropolis-Hastings, Gibbs Sampling, and Diagnostics
Summary and Key Takeaways
Monte Carlo Foundation
Monte Carlo integration replaces intractable high-dimensional integrals with sample averages converging at rate $O(n^{-1/2})$ regardless of dimension. Importance and rejection sampling are direct methods limited by the curse of dimensionality.
Metropolis-Hastings
Constructs a reversible Markov chain with the target as its stationary distribution. The acceptance ratio requires $\pi$ only up to proportionality, making it ideal for Bayesian posteriors.
Gibbs Sampling
A coordinate-wise updating scheme that samples from full conditionals with acceptance probability 1. Data augmentation and Rao-Blackwellization are powerful extensions.
Convergence Diagnostics
The Gelman-Rubin $\hat{R}$ statistic, effective sample size, trace plots, and autocorrelation analysis are essential tools. Modern practice demands $\hat{R} < 1.01$ and sufficient ESS for reliable inference.
Hamiltonian Monte Carlo
HMC uses gradient information to make distant, high-acceptance proposals by simulating Hamiltonian dynamics. NUTS automates tuning and powers modern probabilistic programming.