Markov Chain Monte Carlo
Constructing Markov chains whose stationary distributions solve intractable integration problems
Historical Context
The story of Markov chain Monte Carlo begins at Los Alamos Scientific Laboratory (now Los Alamos National Laboratory) in 1953, when Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, and Edward Teller published their landmark paper on equation-of-state calculations for systems of interacting particles. The algorithm they proposed—now known as the Metropolis algorithm—used a cleverly constructed random walk to sample from the Boltzmann distribution of a many-body system, circumventing the curse of dimensionality that made direct numerical integration infeasible.
In 1970, W. Keith Hastings generalized the Metropolis algorithm by allowing asymmetric proposal distributions, producing what we now call the Metropolis-Hastings algorithm. This generalization went largely unnoticed by the statistics community for over a decade. The revolution came in 1984, when Stuart Geman and Donald Geman introduced Gibbs sampling in the context of Bayesian image restoration, and independently in 1990, when Adrian Smith and Alan Gelfand demonstrated that MCMC methods could transform Bayesian computation. Their work showed that complex posterior distributions that had been analytically intractable could now be routinely explored by simulation. Today, MCMC is the computational backbone of modern Bayesian statistics, with extensions such as Hamiltonian Monte Carlo powering probabilistic programming frameworks like Stan and PyMC.
3.1 Monte Carlo Integration
Monte Carlo methods exploit the law of large numbers to approximate integrals by sample averages. If we wish to compute $I = \int h(x)\,\pi(x)\,dx$ where $\pi$ is a probability density, and we can draw $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \pi$, then the Monte Carlo estimator
$$\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} h(X_i)$$
converges almost surely to $I$ by the strong law of large numbers, with a standard error of order $O(n^{-1/2})$ regardless of the dimension of the integration domain.
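As a minimal sketch of this estimator (the target, test function, and sample size here are illustrative choices, not from the lab), consider estimating $\mathbb{E}[X^2] = 1$ for $X \sim \mathcal{N}(0,1)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate I = E[h(X)] for X ~ N(0, 1) with h(x) = x^2 (true value: 1).
n = 100_000
x = rng.standard_normal(n)
h = x**2

i_hat = h.mean()                      # Monte Carlo estimate of I
se = h.std(ddof=1) / np.sqrt(n)      # standard error, order O(n^{-1/2})
print(f"estimate = {i_hat:.4f} +/- {se:.4f}")
```

Doubling the accuracy requires quadrupling $n$, but the $O(n^{-1/2})$ rate is unchanged if $x$ were a vector in $\mathbb{R}^{100}$.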
Definition: Importance Sampling
When direct sampling from $\pi$ is difficult, we introduce a proposal density $q$ such that $q(x) > 0$ whenever $h(x)\pi(x) \neq 0$. Then:
$$I = \int h(x)\,\frac{\pi(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_q\!\left[h(X)\,\frac{\pi(X)}{q(X)}\right] \approx \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(X_i)}{q(X_i)}\,h(X_i), \qquad X_i \stackrel{\text{iid}}{\sim} q.$$
The ratio $w(x) = \pi(x)/q(x)$ is called the importance weight. The self-normalized importance sampling estimator is $\hat{I} = \sum_i w_i h(X_i) / \sum_i w_i$, which is consistent even when $\pi$ is known only up to a normalizing constant.
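A minimal sketch of the self-normalized estimator, using an illustrative unnormalized target (an $\mathcal{N}(3,1)$ density with its constant dropped) and a wide normal proposal; these choices are assumptions for the example, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target known only up to a constant: unnormalized N(3, 1) density.
log_pi = lambda x: -0.5 * (x - 3.0) ** 2
# Proposal q = N(0, 3^2), positive wherever pi is.
n = 200_000
x = rng.normal(0.0, 3.0, n)
log_q = -0.5 * (x / 3.0) ** 2 - np.log(3.0)   # shared constants cancel below
log_w = log_pi(x) - log_q
w = np.exp(log_w - log_w.max())               # stabilize before exponentiating

# Self-normalized estimator of E_pi[X] (true value: 3); the unknown
# normalizing constant of pi cancels in the ratio.
i_hat = np.sum(w * x) / np.sum(w)
print(f"self-normalized IS estimate: {i_hat:.3f}")
```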
Definition: Rejection Sampling
Given a target density $\pi(x)$ and a proposal density $q(x)$ with envelope constant $M$ satisfying $\pi(x) \leq M\,q(x)$ for all $x$, rejection sampling proceeds as follows: draw $X \sim q$ and $U \sim \text{Uniform}(0,1)$; accept $X$ if $U \leq \pi(X)/(M\,q(X))$. The accepted samples are exactly distributed as $\pi$, with acceptance probability $1/M$. In high dimensions, $M$ grows exponentially, making rejection sampling impractical and motivating the move to MCMC.
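The accept/reject step can be sketched in a few lines; the Beta(2, 2) target and uniform proposal here are illustrative assumptions, chosen so the envelope constant is exactly $M = 1.5$ (the target's mode):

```python
import numpy as np

rng = np.random.default_rng(2)

# Target: Beta(2, 2), pi(x) = 6x(1-x) on [0, 1]; proposal q = Uniform(0, 1).
# Envelope: pi(x) <= M q(x) with M = 1.5, since max pi = pi(1/2) = 1.5.
M = 1.5
n = 100_000
x = rng.uniform(size=n)
u = rng.uniform(size=n)
accepted = x[u <= 6 * x * (1 - x) / M]

rate = len(accepted) / n
print(f"acceptance rate: {rate:.3f}  (theory: 1/M = {1/M:.3f})")
print(f"sample mean: {accepted.mean():.3f}  (theory: 0.5)")
```

The observed acceptance rate matches $1/M = 2/3$; a looser envelope (larger $M$) wastes proportionally more draws, which is exactly the failure mode in high dimensions.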
3.2 Markov Chains on General State Spaces
A Markov chain $\{X_t\}_{t \geq 0}$ on a measurable space $(\mathcal{X}, \mathcal{B})$ is defined by a transition kernel $K(x, A) = P(X_{t+1} \in A \mid X_t = x)$ for all $x \in \mathcal{X}$ and $A \in \mathcal{B}$. The chain satisfies the Markov property: conditioned on the present state, the future is independent of the past.
Definition: Stationarity
A distribution $\pi$ is stationary (or invariant) for the kernel $K$ if:
$$\pi(A) = \int_{\mathcal{X}} K(x, A)\,\pi(dx) \quad \text{for all } A \in \mathcal{B}.$$
If the chain starts in $\pi$, it remains in $\pi$ at all subsequent times.
Definition: Detailed Balance (Reversibility)
A kernel $K$ satisfies detailed balance with respect to $\pi$ if:
$$\pi(dx)\,K(x, dy) = \pi(dy)\,K(y, dx).$$
Detailed balance is a sufficient condition for stationarity. In the discrete case, this becomes $\pi_i\,K_{ij} = \pi_j\,K_{ji}$ for all states $i, j$. A chain satisfying detailed balance is called reversible.
Definition: Ergodicity
A Markov chain with stationary distribution $\pi$ is ergodic if for $\pi$-almost all starting points $x$ and all measurable functions $h$ with $\mathbb{E}_\pi[|h|] < \infty$:
$$\frac{1}{n}\sum_{t=1}^{n} h(X_t) \xrightarrow{\text{a.s.}} \mathbb{E}_\pi[h(X)].$$
Sufficient conditions include $\pi$-irreducibility and aperiodicity. Harris recurrence strengthens this to convergence from every starting point. The ergodic theorem for Markov chains is the foundation that guarantees MCMC estimators converge.
3.3 The Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm constructs a reversible Markov chain whose stationary distribution is a prescribed target $\pi$. Given the current state $X_t = x$, the algorithm proceeds as follows:
- Propose: Draw a candidate $y \sim q(\cdot \mid x)$ from the proposal distribution.
- Compute the acceptance probability: $$\alpha(x, y) = \min\!\left(1,\;\frac{\pi(y)\,q(x \mid y)}{\pi(x)\,q(y \mid x)}\right)$$
- Accept or reject: Set $X_{t+1} = y$ with probability $\alpha(x, y)$, otherwise set $X_{t+1} = x$.
The crucial insight is that $\pi$ need only be known up to a normalizing constant, since the ratio $\pi(y)/\pi(x)$ cancels any unknown normalization. This makes the algorithm ideal for Bayesian posterior computation where $\pi(\theta \mid \mathbf{y}) \propto L(\mathbf{y} \mid \theta)\,p(\theta)$.
Theorem: Correctness of Metropolis-Hastings
The transition kernel of the Metropolis-Hastings chain satisfies detailed balance with respect to $\pi$. If, in addition, the chain is $\pi$-irreducible and aperiodic, then for any initial distribution $\mu_0$:
$$\|\mu_0 K^n - \pi\|_{\text{TV}} \to 0 \quad \text{as } n \to \infty,$$
where $\|\cdot\|_{\text{TV}}$ denotes total variation distance. The proof follows from verifying $\pi(x)\,q(y \mid x)\,\alpha(x, y) = \pi(y)\,q(x \mid y)\,\alpha(y, x)$ for all $x, y$.
Common Proposal Distributions
Random walk Metropolis: $q(y \mid x) = q(y - x)$ with a symmetric kernel (e.g., $y = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$). Since $q$ is symmetric, the acceptance ratio simplifies to $\alpha = \min(1, \pi(y)/\pi(x))$. The scale $\sigma$ must be tuned: too small yields high acceptance but slow exploration; too large yields frequent rejection. The celebrated result of Roberts, Gelman, and Gilks (1997) shows that the optimal acceptance rate is approximately 23.4% in high dimensions.
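A minimal random walk Metropolis sketch (separate from the lab code in Section 3.6; the standard normal target and step scale $\sigma = 2.4$ are illustrative assumptions), working on the log scale for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(3)

def rw_metropolis(log_pi, x0, sigma, n):
    """Random walk Metropolis with N(0, sigma^2) increments."""
    x = x0
    chain = np.empty(n)
    accepts = 0
    for t in range(n):
        y = x + sigma * rng.standard_normal()   # symmetric proposal
        # Symmetric q cancels: alpha = min(1, pi(y)/pi(x)).
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x, accepts = y, accepts + 1
        chain[t] = x                             # on rejection, repeat x
    return chain, accepts / n

# Target: standard normal, known only up to a constant.
chain, rate = rw_metropolis(lambda x: -0.5 * x * x, 0.0, 2.4, 50_000)
print(f"acceptance rate: {rate:.2f}, sample mean: {chain.mean():.3f}")
```

Note that rejected proposals still contribute a copy of the current state to the chain; dropping them would bias the sampler.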
Independence sampler: $q(y \mid x) = q(y)$ does not depend on the current state. This works well only when $q$ closely approximates $\pi$; otherwise the chain can become stuck. If $\pi(y)/q(y) \leq M$ for all $y$, then the chain is uniformly ergodic with geometric convergence rate $1 - 1/M$.
3.4 Gibbs Sampling
Gibbs sampling is a special case of Metropolis-Hastings where the proposal is always accepted. For a target distribution $\pi(x_1, x_2, \ldots, x_p)$, the systematic scan Gibbs sampler iterates through each component, sampling from its full conditional distribution:
At iteration $t$, for $j = 1, \ldots, p$:
$$x_j^{(t)} \sim \pi\!\left(x_j \,\middle|\, x_1^{(t)}, \ldots, x_{j-1}^{(t)},\, x_{j+1}^{(t-1)}, \ldots, x_p^{(t-1)}\right).$$
Each update can be viewed as a Metropolis-Hastings step with proposal equal to the full conditional, yielding acceptance probability 1. The full conditional of $x_j$ is proportional to the joint as a function of $x_j$ alone.
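A minimal sketch for a bivariate normal target with correlation $\rho$ (an illustrative choice, not the lab code), where both full conditionals are univariate normals:

```python
import numpy as np

rng = np.random.default_rng(4)

# Gibbs sampler for a standard bivariate normal with correlation rho:
# full conditionals are x1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically.
rho = 0.8
n = 50_000
chain = np.empty((n, 2))
x1, x2 = 0.0, 0.0
sd = np.sqrt(1 - rho**2)
for t in range(n):
    x1 = rng.normal(rho * x2, sd)   # draw from pi(x1 | x2)
    x2 = rng.normal(rho * x1, sd)   # draw from pi(x2 | x1), using the new x1
    chain[t] = x1, x2

corr = np.corrcoef(chain.T)[0, 1]
print(f"empirical correlation: {corr:.3f}  (target: {rho})")
```

The stronger the correlation $\rho$, the slower this coordinate-wise scheme mixes, since each conditional draw can move only a short distance along the narrow ridge of the joint density.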
Data Augmentation
The data augmentation algorithm of Tanner and Wong (1987) introduces latent variables $Z$ to simplify computation. If $\pi(\theta \mid \mathbf{y})$ is intractable but $\pi(\theta \mid \mathbf{y}, Z)$ and $\pi(Z \mid \mathbf{y}, \theta)$ are both tractable, we alternate between sampling $Z$ and $\theta$. Classic examples include the EM algorithm's stochastic counterpart, missing data problems, and mixture models where the latent allocation variables make the full conditionals conjugate.
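A small sketch of the mixture-model case: all concrete choices here (known component means $\pm 2$, unit variances, a Beta(1, 1) prior on the mixing weight $\theta$) are illustrative assumptions. The latent allocations $Z$ make both full conditionals standard distributions:

```python
import numpy as np

rng = np.random.default_rng(9)

# Two-component normal mixture with known means (+2, -2), unit variances,
# and unknown weight theta ~ Beta(1, 1). Augmenting with allocations Z_i
# gives a two-block Gibbs sampler: Z | theta, y and theta | Z.
true_theta = 0.7
y = np.where(rng.uniform(size=500) < true_theta,
             rng.normal(2.0, 1.0, 500), rng.normal(-2.0, 1.0, 500))

theta, draws = 0.5, []
for t in range(3_000):
    # pi(Z_i = 1 | theta, y_i): Bernoulli via Bayes' rule on the components.
    p1 = theta * np.exp(-0.5 * (y - 2.0) ** 2)
    p0 = (1 - theta) * np.exp(-0.5 * (y + 2.0) ** 2)
    z = rng.uniform(size=len(y)) < p1 / (p0 + p1)
    # pi(theta | Z) = Beta(1 + sum z, 1 + sum(1 - z))  (conjugate update).
    theta = rng.beta(1 + z.sum(), 1 + (~z).sum())
    if t >= 500:                      # discard burn-in
        draws.append(theta)

post_mean = float(np.mean(draws))
print(f"posterior mean of theta: {post_mean:.3f}  (truth: {true_theta})")
```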
Rao-Blackwellization
Given MCMC output $(X_1^{(t)}, X_2^{(t)})$ from a joint distribution $\pi(x_1, x_2)$, the Rao-Blackwell theorem implies that the conditional expectation estimator:
$$\hat{\mu}_{\text{RB}} = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}\!\left[h(X_1) \mid X_2^{(t)}\right]$$
has variance no greater than the naive estimator $\hat{\mu} = \frac{1}{n}\sum_t h(X_1^{(t)})$. When the conditional expectation is available in closed form, Rao-Blackwellization can dramatically reduce Monte Carlo variance at negligible computational cost.
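A sketch of the variance reduction, using the bivariate normal where $\mathbb{E}[X_1 \mid X_2] = \rho X_2$ in closed form. For simplicity this uses iid draws rather than MCMC output; the Rao-Blackwell comparison is the same in spirit:

```python
import numpy as np

rng = np.random.default_rng(5)

# Bivariate normal with correlation rho; E[X1 | X2] = rho * X2 exactly.
rho, n, reps = 0.8, 500, 2_000
naive, rb = np.empty(reps), np.empty(reps)
for r in range(reps):
    x2 = rng.standard_normal(n)
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    naive[r] = x1.mean()          # naive estimator of E[X1] = 0
    rb[r] = (rho * x2).mean()     # Rao-Blackwellized: average E[X1 | X2]

print(f"var(naive) = {naive.var():.5f}, var(RB) = {rb.var():.5f}")
```

Here the variances are $1/n$ versus $\rho^2/n$, so Rao-Blackwellization cuts the Monte Carlo variance by the factor $\rho^2$.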
3.5 Diagnostics and Advanced Methods
Since MCMC produces correlated samples whose distribution is correct only asymptotically, careful diagnostics are essential to ensure the chain has converged and the Monte Carlo error is acceptably small.
Burn-in and Thinning
The initial portion of the chain, before it has reached its stationary distribution, is discarded as burn-in. The length of burn-in is assessed by examining trace plots and formal diagnostics. Thinning—retaining every $k$-th sample—reduces storage but does not improve estimation efficiency per unit of computation. Link and Eaton (2012) showed that thinning is suboptimal compared to using the full chain, though it remains common under practical memory constraints.
Gelman-Rubin Diagnostic ($\hat{R}$)
Run $m \geq 2$ chains of length $n$ from overdispersed starting values. Compute the within-chain variance $W$ and between-chain variance $B$:
$$W = \frac{1}{m}\sum_{j=1}^{m} s_j^2, \qquad B = \frac{n}{m-1}\sum_{j=1}^{m} \left(\bar{x}_{j\cdot} - \bar{x}_{\cdot\cdot}\right)^2, \qquad \hat{R} = \sqrt{\frac{\frac{n-1}{n}W + \frac{1}{n}B}{W}},$$
where $s_j^2$ and $\bar{x}_{j\cdot}$ are the sample variance and mean of chain $j$, and $\bar{x}_{\cdot\cdot}$ is the grand mean.
As $n \to \infty$, $\hat{R} \to 1$. Values of $\hat{R} < 1.01$ (or the more lenient threshold of 1.1) suggest convergence. The split-$\hat{R}$ variant of Vehtari et al. (2021) additionally splits each chain in half, providing better sensitivity to non-stationarity.
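The computation above can be sketched directly (a basic $\hat{R}$, without the split-chain refinement; the synthetic "mixed" and "unmixed" chains are illustrative):

```python
import numpy as np

def gelman_rubin(chains):
    """Basic R-hat from an (m, n) array of m chains of length n."""
    m, n = chains.shape
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    B = n * means.var(ddof=1)                 # between-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(6)
mixed = rng.standard_normal((4, 2_000))          # 4 chains, same target
unmixed = mixed + np.array([0, 0, 3, 3])[:, None]  # two chains offset by 3

print(f"R-hat (mixed):   {gelman_rubin(mixed):.3f}")
print(f"R-hat (unmixed): {gelman_rubin(unmixed):.3f}")
```

Well-mixed chains give $\hat{R} \approx 1$; the offset chains inflate the between-chain variance $B$ and push $\hat{R}$ well above the 1.01 threshold.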
Effective Sample Size
The effective sample size accounts for autocorrelation in the chain:
$$n_{\text{eff}} = \frac{n}{1 + 2\sum_{k=1}^{\infty} \rho_k},$$
where $\rho_k$ is the lag-$k$ autocorrelation. In practice, the sum is truncated when the autocorrelation estimates become noisy. An $n_{\text{eff}}$ much smaller than $n$ indicates high autocorrelation and inefficient sampling.
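A minimal implementation with a simple truncation rule (cutting the sum once $\hat{\rho}_k$ drops below 0.05 is one pragmatic choice among several), tested on an AR(1) chain whose true efficiency is $(1-\phi)/(1+\phi) \approx 0.053$ at $\phi = 0.9$:

```python
import numpy as np

def effective_sample_size(x, max_lag=200):
    """n_eff = n / (1 + 2 * sum_k rho_k), truncating noisy small lags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    denom = np.dot(x, x)
    total = 0.0
    for k in range(1, max_lag):
        rho_k = np.dot(x[:-k], x[k:]) / denom   # lag-k autocorrelation
        if rho_k < 0.05:                        # truncation rule
            break
        total += rho_k
    return n / (1 + 2 * total)

# AR(1) chain with phi = 0.9: rho_k = 0.9^k, so n_eff << n.
rng = np.random.default_rng(7)
phi, n = 0.9, 50_000
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()

n_eff = effective_sample_size(x)
print(f"n = {n}, estimated n_eff = {n_eff:.0f}")
```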
Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) augments the parameter space $\theta$ with auxiliary momentum variables $p$ and defines the Hamiltonian $H(\theta, p) = -\log\pi(\theta) + \frac{1}{2}p^T M^{-1} p$. Proposals are generated by simulating Hamilton's equations using the leapfrog integrator for $L$ steps of size $\epsilon$. Writing $U(\theta) = -\log\pi(\theta)$, one leapfrog step is:
$$p_{1/2} = p_0 - \tfrac{\epsilon}{2}\,\nabla U(\theta_0), \qquad \theta_1 = \theta_0 + \epsilon\, M^{-1} p_{1/2}, \qquad p_1 = p_{1/2} - \tfrac{\epsilon}{2}\,\nabla U(\theta_1).$$
The symplectic integrator preserves volume and is time-reversible, so the Metropolis acceptance step corrects only for discretization error. HMC explores the target distribution far more efficiently than random walk Metropolis in high dimensions because proposals follow the geometry of the posterior. The No-U-Turn Sampler (NUTS) of Hoffman and Gelman (2014) eliminates the need to hand-tune $L$ and $\epsilon$ by adaptively building a trajectory tree.
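A minimal one-dimensional HMC sketch for a standard normal target (so $U(\theta) = \theta^2/2$, $\nabla U(\theta) = \theta$, and $M = 1$); the step size, path length, and iteration count are illustrative tuning choices:

```python
import numpy as np

rng = np.random.default_rng(8)

def hmc_step(theta, eps, L):
    """One HMC transition for pi = N(0, 1): U(t) = t^2/2, grad U = t."""
    p = rng.standard_normal()               # resample momentum, p ~ N(0, 1)
    theta_new, p_new = theta, p
    # Leapfrog integration of Hamilton's equations.
    p_new -= 0.5 * eps * theta_new          # initial half step for momentum
    for i in range(L):
        theta_new += eps * p_new            # full step for position
        if i < L - 1:
            p_new -= eps * theta_new        # full step for momentum
    p_new -= 0.5 * eps * theta_new          # final half step for momentum
    # Metropolis correction for the leapfrog discretization error.
    h_old = 0.5 * theta**2 + 0.5 * p**2
    h_new = 0.5 * theta_new**2 + 0.5 * p_new**2
    return theta_new if np.log(rng.uniform()) < h_old - h_new else theta

theta, draws = 0.0, np.empty(20_000)
for t in range(len(draws)):
    theta = hmc_step(theta, eps=0.2, L=10)
    draws[t] = theta
print(f"mean = {draws.mean():.3f}, var = {draws.var():.3f}")
```

Because the integrator nearly conserves $H$, almost every proposal is accepted despite traveling a long distance in each step, which is the source of HMC's efficiency relative to random walk Metropolis.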
3.6 Computational Lab
The following simulation implements the Metropolis-Hastings algorithm for a normal posterior, a Gibbs sampler for a bivariate normal distribution, and provides convergence diagnostics including trace plots, autocorrelation analysis, and the Gelman-Rubin statistic.
MCMC Methods: Metropolis-Hastings, Gibbs Sampling, and Diagnostics
Summary and Key Takeaways
Monte Carlo Foundation
Monte Carlo integration replaces intractable high-dimensional integrals with sample averages converging at rate $O(n^{-1/2})$ regardless of dimension. Importance and rejection sampling are direct methods limited by the curse of dimensionality.
Metropolis-Hastings
Constructs a reversible Markov chain with the target as its stationary distribution. The acceptance ratio requires $\pi$ only up to proportionality, making it ideal for Bayesian posteriors.
Gibbs Sampling
A coordinate-wise updating scheme that samples from full conditionals with acceptance probability 1. Data augmentation and Rao-Blackwellization are powerful extensions.
Convergence Diagnostics
The Gelman-Rubin $\hat{R}$ statistic, effective sample size, trace plots, and autocorrelation analysis are essential tools. Modern practice demands $\hat{R} < 1.01$ and sufficient ESS for reliable inference.
Hamiltonian Monte Carlo
HMC uses gradient information to make distant, high-acceptance proposals by simulating Hamiltonian dynamics. NUTS automates tuning and powers modern probabilistic programming.