Part II: Inference | Chapter 3

Bayesian Inference

Bayes' theorem, conjugate priors, posterior estimation, credible intervals, and hierarchical models

Historical Context

Bayesian inference originates from Reverend Thomas Bayes, whose posthumous 1763 paper “An Essay towards solving a Problem in the Doctrine of Chances” laid the foundation. Pierre-Simon Laplace independently developed and greatly extended these ideas in his Théorie analytique des probabilités (1812), applying them to astronomical and demographic problems. For most of the 20th century, the frequentist paradigm dominated, but the Bayesian approach experienced a renaissance starting in the 1950s through the work of Leonard Jimmie Savage, Dennis Lindley, and others. The computational revolution brought by Markov Chain Monte Carlo (MCMC) methods in the 1990s—particularly the Gibbs sampler and the Metropolis–Hastings algorithm—made Bayesian inference practical for complex models.

3.1 The Bayesian Framework

In Bayesian inference, unknown parameters $\theta$ are treated as random variables with a prior distribution reflecting our beliefs before observing data.

Figure (density over the parameter θ). Bayesian updating: the prior (blue) is combined with the likelihood (red) to produce the posterior (green), which lies between the two and is more concentrated where both agree.

Bayes' Theorem for Parameters

Given data $\mathbf{x} = (x_1, \ldots, x_n)$ and parameter $\theta$:

$$\underbrace{p(\theta \mid \mathbf{x})}_{\text{posterior}} = \frac{\overbrace{p(\mathbf{x} \mid \theta)}^{\text{likelihood}} \cdot \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathbf{x})}_{\text{marginal likelihood}}}$$

The marginal likelihood $p(\mathbf{x}) = \int p(\mathbf{x} \mid \theta)\, p(\theta)\, d\theta$ is a normalizing constant. Hence the key identity: posterior $\propto$ likelihood $\times$ prior.

Choosing Priors

Common prior choices include:

  • Informative priors: encode genuine domain knowledge (e.g., previous studies).
  • Weakly informative priors: provide soft constraints (e.g., $N(0, 10^2)$).
  • Non-informative (reference) priors: Jeffreys prior $p(\theta) \propto \sqrt{I(\theta)}$, where $I(\theta)$ is the Fisher information.
  • Improper priors: e.g., $p(\theta) \propto 1$ on $\mathbb{R}$; valid if the posterior is proper.
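As a concrete check of the Jeffreys construction, consider a Bernoulli likelihood: the Fisher information is $I(\theta) = 1/(\theta(1-\theta))$, so the Jeffreys prior $\sqrt{I(\theta)}$ has the shape of a Beta(1/2, 1/2) density. A minimal numerical sketch (the grid and tolerances are illustrative choices):

```python
import numpy as np
from scipy import stats

# Jeffreys prior for a Bernoulli likelihood: I(theta) = 1/(theta(1-theta)),
# so p(theta) ∝ theta^(-1/2) (1-theta)^(-1/2), i.e. a Beta(1/2, 1/2) shape.
theta = np.linspace(0.01, 0.99, 99)
jeffreys_unnorm = np.sqrt(1.0 / (theta * (1 - theta)))

# Compare with the normalized Beta(1/2, 1/2) density
beta_half = stats.beta(0.5, 0.5).pdf(theta)
ratio = jeffreys_unnorm / beta_half

# A constant ratio means the two curves have the same shape up to normalization
assert np.allclose(ratio, ratio[0])
```

The constant ratio is exactly $B(1/2, 1/2) = \pi$, the normalizing constant the improper construction omits.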

3.2 Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same family as the prior, enabling closed-form updates.

Beta–Binomial Conjugacy

Prior: $\theta \sim \text{Beta}(\alpha, \beta)$. Likelihood: $X \mid \theta \sim \text{Binomial}(n, \theta)$ with $x$ successes.

$$\theta \mid x \sim \text{Beta}(\alpha + x, \; \beta + n - x)$$

The posterior mean is $\frac{\alpha + x}{\alpha + \beta + n}$, a weighted average of the prior mean $\frac{\alpha}{\alpha + \beta}$ and the MLE $\frac{x}{n}$. As $n \to \infty$, the posterior concentrates around the MLE.
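The update and the weighted-average identity can be verified numerically. The prior parameters and data below are hypothetical, chosen only for illustration:

```python
from scipy import stats

# Hypothetical example: Beta(2, 2) prior, x = 7 successes in n = 10 trials
a, b, n, x = 2.0, 2.0, 10, 7

# Conjugate update: theta | x ~ Beta(a + x, b + n - x)
post = stats.beta(a + x, b + n - x)

prior_mean = a / (a + b)               # 0.5
mle = x / n                            # 0.7
post_mean = (a + x) / (a + b + n)      # 9/14 ≈ 0.643

# Weighted-average identity: weight n/(a + b + n) on the MLE
w = n / (a + b + n)
assert abs(post_mean - (w * mle + (1 - w) * prior_mean)) < 1e-12
assert abs(post.mean() - post_mean) < 1e-12
```

As $n$ grows, the weight $w$ on the MLE approaches 1, which is the concentration result stated above.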

Normal–Normal Conjugacy

Prior: $\mu \sim N(\mu_0, \sigma_0^2)$. Likelihood: $X_1, \ldots, X_n \mid \mu \sim N(\mu, \sigma^2)$ (known variance).

$$\mu \mid \mathbf{x} \sim N\!\left(\frac{\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}, \; \frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}\right)$$

The posterior precision (inverse variance) equals the sum of the prior precision and data precision. The posterior mean is the precision-weighted average of the prior mean and sample mean.
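The precision-weighted update is short enough to write as a function. The numbers below (prior $N(0, 4)$, 25 observations with mean 1.2, $\sigma^2 = 1$) are hypothetical:

```python
def normal_posterior(mu0, sigma0_sq, xbar, sigma_sq, n):
    """Posterior of mu for a N(mu0, sigma0_sq) prior and n observations
    with sample mean xbar and known variance sigma_sq."""
    prior_prec = 1.0 / sigma0_sq   # prior precision
    data_prec = n / sigma_sq       # data precision
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * mu0 + data_prec * xbar)
    return post_mean, post_var

# Hypothetical numbers: prior N(0, 4), n = 25, xbar = 1.2, sigma^2 = 1
m, v = normal_posterior(0.0, 4.0, 1.2, 1.0, 25)
# Posterior precision 0.25 + 25 = 25.25; mean = 30/25.25 ≈ 1.188
```

Note how the posterior mean sits almost at the sample mean: with 25 observations the data precision (25) dwarfs the prior precision (0.25).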

3.3 Point Estimation: MAP and Posterior Mean

Maximum A Posteriori (MAP)

The MAP estimate is the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \, p(\theta \mid \mathbf{x}) = \arg\max_\theta \left[\log p(\mathbf{x} \mid \theta) + \log p(\theta)\right]$$

With a uniform prior, the MAP equals the MLE. With a $N(0, \tau^2)$ prior on regression coefficients, MAP yields ridge regression ($L^2$ penalization with $\lambda = \sigma^2/\tau^2$).
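The MAP–ridge correspondence can be confirmed by comparing the ridge closed form with a direct numerical maximization of the log posterior. The simulated data, dimensions, and variances below are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Hypothetical simulated regression data
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 1.0, 0.5            # noise variance, prior variance
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Ridge / MAP closed form: (X'X + lambda I)^(-1) X'y with lambda = sigma2/tau2
lam = sigma2 / tau2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Numerically maximize the log posterior (minimize its negative) to confirm
def neg_log_post(b):
    return (np.sum((y - X @ b) ** 2) / (2 * sigma2)
            + np.sum(b ** 2) / (2 * tau2))

beta_opt = minimize(neg_log_post, np.zeros(p)).x
assert np.allclose(beta_map, beta_opt, atol=1e-3)
```

The two routes agree because the negative log posterior is exactly the ridge objective up to an additive constant.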

Posterior Mean

The posterior mean minimizes the Bayes risk under squared error loss:

$$\hat{\theta}_{\text{Bayes}} = E[\theta \mid \mathbf{x}] = \int \theta \, p(\theta \mid \mathbf{x}) \, d\theta$$

Under absolute error loss, the optimal estimate is the posterior median. Under 0–1 loss, it is the MAP. The choice of loss function determines which summary of the posterior is optimal.
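This loss-dependence is easy to see by Monte Carlo: for a skewed posterior, minimize the estimated posterior risk over a grid of candidate estimates. The Gamma(2) stand-in posterior, sample size, and grid are illustrative choices:

```python
import numpy as np
from scipy import stats

# Hypothetical skewed posterior: Gamma(shape=2), mean 2, median ≈ 1.68
post = stats.gamma(2.0)
samples = post.rvs(size=50_000, random_state=0)
grid = np.linspace(0.5, 4.0, 400)

# Monte Carlo posterior risk of each candidate estimate under two losses
sq_risk = [np.mean((samples - a) ** 2) for a in grid]
abs_risk = [np.mean(np.abs(samples - a)) for a in grid]

best_sq = grid[np.argmin(sq_risk)]    # minimizer ≈ posterior mean
best_abs = grid[np.argmin(abs_risk)]  # minimizer ≈ posterior median
```

For this right-skewed posterior the two minimizers differ visibly (mean 2 versus median about 1.68), confirming that the loss function, not the posterior alone, picks the point estimate.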

3.4 Credible Intervals and Bayes Factors

Credible Intervals

A $100(1-\alpha)\%$ credible interval $C$ satisfies

$$P(\theta \in C \mid \mathbf{x}) = 1 - \alpha$$

Two common types:

  • Equal-tailed interval: $[\theta_{\alpha/2}, \theta_{1-\alpha/2}]$ where $\theta_q$ is the $q$-th posterior quantile.
  • Highest Posterior Density (HPD): the shortest interval containing $1-\alpha$ posterior probability; preferred for skewed posteriors.

Unlike frequentist confidence intervals, a credible interval directly gives the probability that the parameter lies in the interval, given the observed data.
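Both interval types are straightforward to compute from posterior quantiles. The sketch below uses a hypothetical right-skewed Beta(2, 8) posterior and finds the HPD interval by a simple grid search over intervals of fixed posterior mass (assuming a unimodal posterior):

```python
import numpy as np
from scipy import stats

# Right-skewed posterior: Beta(2, 8), e.g. 1 success in 8 trials under a Beta(1, 1) prior
post = stats.beta(2, 8)

# Equal-tailed 95% interval directly from posterior quantiles
et_lo, et_hi = post.ppf([0.025, 0.975])

def hpd_interval(dist, mass=0.95, grid_size=10_000):
    """Shortest interval containing the given posterior mass (unimodal case)."""
    lowers = np.linspace(0.0, 1.0 - mass, grid_size)   # lower-tail probabilities
    widths = dist.ppf(lowers + mass) - dist.ppf(lowers)
    i = np.argmin(widths)
    return dist.ppf(lowers[i]), dist.ppf(lowers[i] + mass)

hpd_lo, hpd_hi = hpd_interval(post)
# For a right-skewed posterior the HPD interval is shorter than the equal-tailed one
assert (hpd_hi - hpd_lo) < (et_hi - et_lo)
```

The grid search exploits the fact that every interval $[\theta_q, \theta_{q + 0.95}]$ contains 95% posterior mass; the HPD is the shortest of these.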

Bayes Factors

For model comparison between $M_1$ and $M_2$:

$$BF_{12} = \frac{p(\mathbf{x} \mid M_1)}{p(\mathbf{x} \mid M_2)} = \frac{\int p(\mathbf{x} \mid \theta_1, M_1) p(\theta_1 \mid M_1) d\theta_1}{\int p(\mathbf{x} \mid \theta_2, M_2) p(\theta_2 \mid M_2) d\theta_2}$$

$BF_{12} > 1$ favors $M_1$. Jeffreys' scale: $BF > 10$ is “strong evidence,” $BF > 100$ is “decisive.” Unlike p-values, Bayes factors naturally penalize model complexity via the marginal likelihood integration.
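When the marginal likelihoods have closed forms, the Bayes factor is a one-line computation. A hedged sketch with hypothetical data, comparing a uniform-prior binomial model against a fixed fair-coin model (the Beta-prior marginal likelihood uses the Beta function identity):

```python
import numpy as np
from scipy.special import betaln, comb

# Hypothetical data: x = 8 successes in n = 10 trials
n, x = 10, 8

# M1: theta ~ Beta(1, 1). Closed-form marginal likelihood:
# p(x | M1) = C(n, x) * B(x + 1, n - x + 1) / B(1, 1)
log_ml_1 = np.log(comb(n, x)) + betaln(x + 1, n - x + 1) - betaln(1, 1)

# M2: theta fixed at 1/2, so p(x | M2) = C(n, x) * 0.5**n
log_ml_2 = np.log(comb(n, x)) + n * np.log(0.5)

bf_12 = np.exp(log_ml_1 - log_ml_2)   # ≈ 2.07: only weak evidence for M1
```

Even though 8/10 looks far from 0.5, the Bayes factor is modest: $M_1$ spreads its prior mass over all of $[0, 1]$, and the marginal likelihood integration charges it for that flexibility.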

3.5 Hierarchical Models

Structure

In a hierarchical (multilevel) model, parameters themselves have distributions governed by hyperparameters:

$$X_i \mid \theta_i \sim f(x \mid \theta_i), \quad \theta_i \mid \phi \sim g(\theta \mid \phi), \quad \phi \sim h(\phi)$$

This enables partial pooling: individual estimates are shrunk toward the group mean, borrowing strength across units. The amount of shrinkage depends on the relative precision of within-group and between-group variation.

Example: Normal Hierarchical Model

Consider $J$ groups, each with $n_j$ observations:

$$X_{ij} \mid \theta_j \sim N(\theta_j, \sigma^2), \quad \theta_j \mid \mu, \tau^2 \sim N(\mu, \tau^2)$$

The posterior mean of $\theta_j$ is approximately

$$\hat{\theta}_j^{\text{Bayes}} \approx \frac{\frac{n_j}{\sigma^2}\bar{x}_j + \frac{1}{\tau^2}\mu}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}$$

Groups with fewer observations are shrunk more toward the grand mean $\mu$. This is the Bayesian analog of James–Stein estimation.
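The shrinkage pattern is visible directly from the formula above. The group sizes, group means, and variance components below are hypothetical, and the hyperparameters $\mu$, $\tau^2$ are treated as known for simplicity:

```python
import numpy as np

# Hypothetical setup: J = 5 groups, known sigma^2, mu, tau^2
sigma2, mu, tau2 = 1.0, 0.0, 0.25
n_j = np.array([2, 5, 10, 50, 200])            # group sizes
xbar = np.array([1.5, 1.2, 0.8, 0.4, 0.3])     # observed group means

# Precision-weighted average of each group mean and the grand mean mu
data_prec = n_j / sigma2
prior_prec = 1.0 / tau2
theta_hat = (data_prec * xbar + prior_prec * mu) / (data_prec + prior_prec)

# Fraction of each group mean pulled toward mu (here mu = 0)
shrinkage = 1 - theta_hat / xbar
# Smaller groups are shrunk more: shrinkage decreases as n_j grows
```

With these numbers the two-observation group is shrunk about two-thirds of the way to the grand mean, while the 200-observation group moves by only about 2%.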

Computational Laboratory

We demonstrate Beta–Binomial conjugate updating, posterior evolution with increasing data, and credible interval computation.

Beta-Binomial Update, Normal Conjugacy, Credible Intervals & Prior Sensitivity
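A condensed version of the lab's first demonstration, sequential Beta–Binomial updating with growing data, can be sketched as follows (the true success probability, batch sizes, and seed are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
theta_true = 0.3             # hypothetical true success probability
alpha, beta = 1.0, 1.0       # flat Beta(1, 1) prior

# Sequential conjugate updating: the posterior after one batch is the prior for the next
for n_batch in [10, 100, 1000]:
    data = rng.binomial(1, theta_true, size=n_batch)
    alpha += data.sum()
    beta += n_batch - data.sum()
    post = stats.beta(alpha, beta)
    lo, hi = post.ppf([0.025, 0.975])
    print(f"after {n_batch:4d} more obs: mean={post.mean():.3f}  "
          f"95% credible interval=({lo:.3f}, {hi:.3f})")
```

The printed intervals narrow as data accumulate, and the posterior mean settles near the true value, illustrating both posterior evolution and credible interval computation.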
