Part II: Inference | Chapter 3

Bayesian Inference

Bayes' theorem, conjugate priors, posterior estimation, credible intervals, and hierarchical models

Historical Context

Bayesian inference originates from Reverend Thomas Bayes, whose posthumous 1763 paper “An Essay towards solving a Problem in the Doctrine of Chances” laid the foundation. Pierre-Simon Laplace independently developed and greatly extended these ideas in his Théorie analytique des probabilités (1812), applying them to astronomical and demographic problems. For most of the 20th century, the frequentist paradigm dominated, but the Bayesian approach experienced a renaissance starting in the 1950s through the work of Leonard Jimmie Savage, Dennis Lindley, and others. The computational revolution brought by Markov Chain Monte Carlo (MCMC) methods in the 1990s—particularly the Gibbs sampler and the Metropolis–Hastings algorithm—made Bayesian inference practical for complex models.

3.1 The Bayesian Framework

In Bayesian inference, unknown parameters $\theta$ are treated as random variables with a prior distribution reflecting our beliefs before observing data.

Figure (density over the parameter θ). Bayesian updating: the prior (blue) is combined with the likelihood (red) to produce the posterior (green), which lies between the two and is more concentrated where both agree.

Bayes' Theorem for Parameters

Given data $\mathbf{x} = (x_1, \ldots, x_n)$ and parameter $\theta$:

$$\underbrace{p(\theta \mid \mathbf{x})}_{\text{posterior}} = \frac{\overbrace{p(\mathbf{x} \mid \theta)}^{\text{likelihood}} \cdot \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathbf{x})}_{\text{marginal likelihood}}}$$

The marginal likelihood $p(\mathbf{x}) = \int p(\mathbf{x} \mid \theta)\, p(\theta)\, d\theta$ is a normalizing constant. Hence the key identity: posterior $\propto$ likelihood $\times$ prior.

Choosing Priors

Common prior choices include:

  • Informative priors: encode genuine domain knowledge (e.g., previous studies).
  • Weakly informative priors: provide soft constraints (e.g., $N(0, 10^2)$).
  • Non-informative (reference) priors: Jeffreys prior $p(\theta) \propto \sqrt{I(\theta)}$, where $I(\theta)$ is the Fisher information.
  • Improper priors: e.g., $p(\theta) \propto 1$ on $\mathbb{R}$; valid if the posterior is proper.
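As a concrete check of the Jeffreys construction, consider a Bernoulli likelihood: the Fisher information is $I(\theta) = 1/(\theta(1-\theta))$, so the Jeffreys prior $\sqrt{I(\theta)}$ has the shape of a Beta(1/2, 1/2) density. A minimal numerical sketch (the grid and tolerances are illustrative choices):

```python
import numpy as np
from scipy import stats

# Jeffreys prior for a Bernoulli likelihood: I(theta) = 1/(theta(1-theta)),
# so p(theta) ∝ theta^(-1/2) (1-theta)^(-1/2), i.e. a Beta(1/2, 1/2) shape.
theta = np.linspace(0.01, 0.99, 99)
jeffreys_unnorm = np.sqrt(1.0 / (theta * (1 - theta)))

# Compare with the normalized Beta(1/2, 1/2) density
beta_half = stats.beta(0.5, 0.5).pdf(theta)
ratio = jeffreys_unnorm / beta_half

# A constant ratio means the two curves have the same shape up to normalization
assert np.allclose(ratio, ratio[0])
```

The constant ratio is exactly $B(1/2, 1/2) = \pi$, the normalizing constant the improper construction omits.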

3.2 Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same family as the prior, enabling closed-form updates.

Beta–Binomial Conjugacy

Prior: $\theta \sim \text{Beta}(\alpha, \beta)$. Likelihood: $X \mid \theta \sim \text{Binomial}(n, \theta)$ with $x$ successes.

$$\theta \mid x \sim \text{Beta}(\alpha + x, \; \beta + n - x)$$

The posterior mean is $\frac{\alpha + x}{\alpha + \beta + n}$, a weighted average of the prior mean $\frac{\alpha}{\alpha + \beta}$ and the MLE $\frac{x}{n}$. As $n \to \infty$, the posterior concentrates around the MLE.
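The update and the weighted-average identity can be verified numerically. The prior parameters and data below are hypothetical, chosen only for illustration:

```python
from scipy import stats

# Hypothetical example: Beta(2, 2) prior, x = 7 successes in n = 10 trials
a, b, n, x = 2.0, 2.0, 10, 7

# Conjugate update: theta | x ~ Beta(a + x, b + n - x)
post = stats.beta(a + x, b + n - x)

prior_mean = a / (a + b)               # 0.5
mle = x / n                            # 0.7
post_mean = (a + x) / (a + b + n)      # 9/14 ≈ 0.643

# Weighted-average identity: weight n/(a + b + n) on the MLE
w = n / (a + b + n)
assert abs(post_mean - (w * mle + (1 - w) * prior_mean)) < 1e-12
assert abs(post.mean() - post_mean) < 1e-12
```

As $n$ grows, the weight $w$ on the MLE approaches 1, which is the concentration result stated above.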

Normal–Normal Conjugacy

Prior: $\mu \sim N(\mu_0, \sigma_0^2)$. Likelihood: $X_1, \ldots, X_n \mid \mu \sim N(\mu, \sigma^2)$ (known variance).

$$\mu \mid \mathbf{x} \sim N\!\left(\frac{\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}, \; \frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}\right)$$

The posterior precision (inverse variance) equals the sum of the prior precision and data precision. The posterior mean is the precision-weighted average of the prior mean and sample mean.
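The precision-weighted update is short enough to write as a function. The numbers below (prior $N(0, 4)$, 25 observations with mean 1.2, $\sigma^2 = 1$) are hypothetical:

```python
def normal_posterior(mu0, sigma0_sq, xbar, sigma_sq, n):
    """Posterior of mu for a N(mu0, sigma0_sq) prior and n observations
    with sample mean xbar and known variance sigma_sq."""
    prior_prec = 1.0 / sigma0_sq   # prior precision
    data_prec = n / sigma_sq       # data precision
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * mu0 + data_prec * xbar)
    return post_mean, post_var

# Hypothetical numbers: prior N(0, 4), n = 25, xbar = 1.2, sigma^2 = 1
m, v = normal_posterior(0.0, 4.0, 1.2, 1.0, 25)
# Posterior precision 0.25 + 25 = 25.25; mean = 30/25.25 ≈ 1.188
```

Note how the posterior mean sits almost at the sample mean: with 25 observations the data precision (25) dwarfs the prior precision (0.25).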

3.3 Point Estimation: MAP and Posterior Mean

Maximum A Posteriori (MAP)

The MAP estimate is the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \, p(\theta \mid \mathbf{x}) = \arg\max_\theta \left[\log p(\mathbf{x} \mid \theta) + \log p(\theta)\right]$$

With a uniform prior, the MAP equals the MLE. With a $N(0, \tau^2)$ prior on regression coefficients, MAP yields ridge regression ($L^2$ penalization with $\lambda = \sigma^2/\tau^2$).
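The MAP–ridge correspondence can be confirmed by comparing the ridge closed form with a direct numerical maximization of the log posterior. The simulated data, dimensions, and variances below are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Hypothetical simulated regression data
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 1.0, 0.5            # noise variance, prior variance
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Ridge / MAP closed form: (X'X + lambda I)^(-1) X'y with lambda = sigma2/tau2
lam = sigma2 / tau2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Numerically maximize the log posterior (minimize its negative) to confirm
def neg_log_post(b):
    return (np.sum((y - X @ b) ** 2) / (2 * sigma2)
            + np.sum(b ** 2) / (2 * tau2))

beta_opt = minimize(neg_log_post, np.zeros(p)).x
assert np.allclose(beta_map, beta_opt, atol=1e-3)
```

The two routes agree because the negative log posterior is exactly the ridge objective up to an additive constant.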

Posterior Mean

The posterior mean minimizes the Bayes risk under squared error loss:

$$\hat{\theta}_{\text{Bayes}} = E[\theta \mid \mathbf{x}] = \int \theta \, p(\theta \mid \mathbf{x}) \, d\theta$$

Under absolute error loss, the optimal estimate is the posterior median. Under 0–1 loss, it is the MAP. The choice of loss function determines which summary of the posterior is optimal.
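This loss-dependence is easy to see by Monte Carlo: for a skewed posterior, minimize the estimated posterior risk over a grid of candidate estimates. The Gamma(2) stand-in posterior, sample size, and grid are illustrative choices:

```python
import numpy as np
from scipy import stats

# Hypothetical skewed posterior: Gamma(shape=2), mean 2, median ≈ 1.68
post = stats.gamma(2.0)
samples = post.rvs(size=50_000, random_state=0)
grid = np.linspace(0.5, 4.0, 400)

# Monte Carlo posterior risk of each candidate estimate under two losses
sq_risk = [np.mean((samples - a) ** 2) for a in grid]
abs_risk = [np.mean(np.abs(samples - a)) for a in grid]

best_sq = grid[np.argmin(sq_risk)]    # minimizer ≈ posterior mean
best_abs = grid[np.argmin(abs_risk)]  # minimizer ≈ posterior median
```

For this right-skewed posterior the two minimizers differ visibly (mean 2 versus median about 1.68), confirming that the loss function, not the posterior alone, picks the point estimate.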

3.4 Credible Intervals and Bayes Factors

Credible Intervals

A $100(1-\alpha)\%$ credible interval $C$ satisfies

$$P(\theta \in C \mid \mathbf{x}) = 1 - \alpha$$

Two common types:

  • Equal-tailed interval: $[\theta_{\alpha/2}, \theta_{1-\alpha/2}]$ where $\theta_q$ is the $q$-th posterior quantile.
  • Highest Posterior Density (HPD): the shortest interval containing $1-\alpha$ posterior probability; preferred for skewed posteriors.

Unlike frequentist confidence intervals, a credible interval directly gives the probability that the parameter lies in the interval, given the observed data.
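Both interval types are straightforward to compute from posterior quantiles. The sketch below uses a hypothetical right-skewed Beta(2, 8) posterior and finds the HPD interval by a simple grid search over intervals of fixed posterior mass (assuming a unimodal posterior):

```python
import numpy as np
from scipy import stats

# Right-skewed posterior: Beta(2, 8), e.g. 1 success in 8 trials under a Beta(1, 1) prior
post = stats.beta(2, 8)

# Equal-tailed 95% interval directly from posterior quantiles
et_lo, et_hi = post.ppf([0.025, 0.975])

def hpd_interval(dist, mass=0.95, grid_size=10_000):
    """Shortest interval containing the given posterior mass (unimodal case)."""
    lowers = np.linspace(0.0, 1.0 - mass, grid_size)   # lower-tail probabilities
    widths = dist.ppf(lowers + mass) - dist.ppf(lowers)
    i = np.argmin(widths)
    return dist.ppf(lowers[i]), dist.ppf(lowers[i] + mass)

hpd_lo, hpd_hi = hpd_interval(post)
# For a right-skewed posterior the HPD interval is shorter than the equal-tailed one
assert (hpd_hi - hpd_lo) < (et_hi - et_lo)
```

The grid search exploits the fact that every interval $[\theta_q, \theta_{q + 0.95}]$ contains 95% posterior mass; the HPD is the shortest of these.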

Bayes Factors

For model comparison between $M_1$ and $M_2$:

$$BF_{12} = \frac{p(\mathbf{x} \mid M_1)}{p(\mathbf{x} \mid M_2)} = \frac{\int p(\mathbf{x} \mid \theta_1, M_1) p(\theta_1 \mid M_1) d\theta_1}{\int p(\mathbf{x} \mid \theta_2, M_2) p(\theta_2 \mid M_2) d\theta_2}$$

$BF_{12} > 1$ favors $M_1$. Jeffreys' scale: $BF > 10$ is “strong evidence,” $BF > 100$ is “decisive.” Unlike p-values, Bayes factors naturally penalize model complexity via the marginal likelihood integration.
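When the marginal likelihoods have closed forms, the Bayes factor is a one-line computation. A hedged sketch with hypothetical data, comparing a uniform-prior binomial model against a fixed fair-coin model (the Beta-prior marginal likelihood uses the Beta function identity):

```python
import numpy as np
from scipy.special import betaln, comb

# Hypothetical data: x = 8 successes in n = 10 trials
n, x = 10, 8

# M1: theta ~ Beta(1, 1). Closed-form marginal likelihood:
# p(x | M1) = C(n, x) * B(x + 1, n - x + 1) / B(1, 1)
log_ml_1 = np.log(comb(n, x)) + betaln(x + 1, n - x + 1) - betaln(1, 1)

# M2: theta fixed at 1/2, so p(x | M2) = C(n, x) * 0.5**n
log_ml_2 = np.log(comb(n, x)) + n * np.log(0.5)

bf_12 = np.exp(log_ml_1 - log_ml_2)   # ≈ 2.07: only weak evidence for M1
```

Even though 8/10 looks far from 0.5, the Bayes factor is modest: $M_1$ spreads its prior mass over all of $[0, 1]$, and the marginal likelihood integration charges it for that flexibility.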

3.5 Hierarchical Models

Structure

In a hierarchical (multilevel) model, parameters themselves have distributions governed by hyperparameters:

$$X_i \mid \theta_i \sim f(x \mid \theta_i), \quad \theta_i \mid \phi \sim g(\theta \mid \phi), \quad \phi \sim h(\phi)$$

This enables partial pooling: individual estimates are shrunk toward the group mean, borrowing strength across units. The amount of shrinkage depends on the relative precision of within-group and between-group variation.

Example: Normal Hierarchical Model

Consider $J$ groups, each with $n_j$ observations:

$$X_{ij} \mid \theta_j \sim N(\theta_j, \sigma^2), \quad \theta_j \mid \mu, \tau^2 \sim N(\mu, \tau^2)$$

The posterior mean of $\theta_j$ is approximately

$$\hat{\theta}_j^{\text{Bayes}} \approx \frac{\frac{n_j}{\sigma^2}\bar{x}_j + \frac{1}{\tau^2}\mu}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}$$

Groups with fewer observations are shrunk more toward the grand mean $\mu$. This is the Bayesian analog of James–Stein estimation.
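The shrinkage pattern is visible directly from the formula above. The group sizes, group means, and variance components below are hypothetical, and the hyperparameters $\mu$, $\tau^2$ are treated as known for simplicity:

```python
import numpy as np

# Hypothetical setup: J = 5 groups, known sigma^2, mu, tau^2
sigma2, mu, tau2 = 1.0, 0.0, 0.25
n_j = np.array([2, 5, 10, 50, 200])            # group sizes
xbar = np.array([1.5, 1.2, 0.8, 0.4, 0.3])     # observed group means

# Precision-weighted average of each group mean and the grand mean mu
data_prec = n_j / sigma2
prior_prec = 1.0 / tau2
theta_hat = (data_prec * xbar + prior_prec * mu) / (data_prec + prior_prec)

# Fraction of each group mean pulled toward mu (here mu = 0)
shrinkage = 1 - theta_hat / xbar
# Smaller groups are shrunk more: shrinkage decreases as n_j grows
```

With these numbers the two-observation group is shrunk about two-thirds of the way to the grand mean, while the 200-observation group moves by only about 2%.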

Computational Laboratory

We demonstrate Beta–Binomial conjugate updating, posterior evolution with increasing data, and credible interval computation.

Beta-Binomial Update, Normal Conjugacy, Credible Intervals & Prior Sensitivity
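A condensed version of the lab's first demonstration, sequential Beta–Binomial updating with growing data, can be sketched as follows (the true success probability, batch sizes, and seed are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
theta_true = 0.3             # hypothetical true success probability
alpha, beta = 1.0, 1.0       # flat Beta(1, 1) prior

# Sequential conjugate updating: the posterior after one batch is the prior for the next
for n_batch in [10, 100, 1000]:
    data = rng.binomial(1, theta_true, size=n_batch)
    alpha += data.sum()
    beta += n_batch - data.sum()
    post = stats.beta(alpha, beta)
    lo, hi = post.ppf([0.025, 0.975])
    print(f"after {n_batch:4d} more obs: mean={post.mean():.3f}  "
          f"95% credible interval=({lo:.3f}, {hi:.3f})")
```

The printed intervals narrow as data accumulate, and the posterior mean settles near the true value, illustrating both posterior evolution and credible interval computation.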
