Bayesian Inference
Bayes' theorem, conjugate priors, posterior estimation, credible intervals, and hierarchical models
Historical Context
Bayesian inference originates from Reverend Thomas Bayes, whose posthumous 1763 paper “An Essay towards solving a Problem in the Doctrine of Chances” laid the foundation. Pierre-Simon Laplace independently developed and greatly extended these ideas in his Théorie analytique des probabilités (1812), applying them to astronomical and demographic problems. For most of the 20th century, the frequentist paradigm dominated, but the Bayesian approach experienced a renaissance starting in the 1950s through the work of Leonard Jimmie Savage, Dennis Lindley, and others. The computational revolution brought by Markov Chain Monte Carlo (MCMC) methods in the 1990s—particularly the Gibbs sampler and the Metropolis–Hastings algorithm—made Bayesian inference practical for complex models.
3.1 The Bayesian Framework
In Bayesian inference, unknown parameters $\theta$ are treated as random variables with a prior distribution reflecting our beliefs before observing data.
Bayes' Theorem for Parameters
Given data $\mathbf{x} = (x_1, \ldots, x_n)$ and parameter $\theta$:

$$p(\theta \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \theta)\, p(\theta)}{p(\mathbf{x})}$$

The marginal likelihood $p(\mathbf{x}) = \int p(\mathbf{x} \mid \theta)\, p(\theta)\, d\theta$ is a normalizing constant. Hence the key identity: posterior $\propto$ likelihood $\times$ prior.
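The proportionality can be checked numerically: evaluate likelihood $\times$ prior on a grid and normalize at the end, sidestepping $p(\mathbf{x})$ entirely. A minimal sketch with hypothetical data (7 successes in 10 Bernoulli trials, flat prior):

```python
import numpy as np

# Grid approximation of the posterior for a Bernoulli/Binomial parameter theta.
# Hypothetical data: 7 successes in 10 trials; prior: uniform on (0, 1).
theta = np.linspace(0.001, 0.999, 999)           # grid with step 0.001
prior = np.ones_like(theta)                      # p(theta) proportional to 1
likelihood = theta**7 * (1 - theta)**3           # p(x | theta) up to a constant
unnormalized = likelihood * prior                # posterior ∝ likelihood × prior
step = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * step)  # normalize numerically

# With a flat prior, the posterior mode equals the MLE x/n = 0.7.
print(theta[np.argmax(posterior)])
```

Normalizing at the end is exactly why only the product likelihood $\times$ prior matters in practice.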
Choosing Priors
Common prior choices include:
- Informative priors: encode genuine domain knowledge (e.g., previous studies).
- Weakly informative priors: provide soft constraints (e.g., $N(0, 10^2)$).
- Non-informative (reference) priors: Jeffreys prior $p(\theta) \propto \sqrt{I(\theta)}$, where $I(\theta)$ is the Fisher information.
- Improper priors: e.g., $p(\theta) \propto 1$ on $\mathbb{R}$; valid if the posterior is proper.
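For a concrete case of the Jeffreys prior, take the Bernoulli model: the Fisher information is $I(\theta) = 1/(\theta(1-\theta))$, so $p(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2}(1-\theta)^{-1/2}$, the kernel of a $\text{Beta}(1/2, 1/2)$ distribution. A short numerical check:

```python
import numpy as np
from scipy.stats import beta

# Jeffreys prior for the Bernoulli model: I(theta) = 1 / (theta (1 - theta)),
# so p(theta) is proportional to sqrt(I(theta)).
theta = np.linspace(0.01, 0.99, 99)

jeffreys = np.sqrt(1.0 / (theta * (1.0 - theta)))   # sqrt of Fisher information
beta_half = beta.pdf(theta, 0.5, 0.5)               # Beta(1/2, 1/2) density

# The two agree up to the normalizing constant B(1/2, 1/2) = pi.
print(np.allclose(jeffreys / beta_half, np.pi))
```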
3.2 Conjugate Priors
A prior is conjugate to a likelihood if the posterior belongs to the same family as the prior, enabling closed-form updates.
Beta–Binomial Conjugacy
Prior: $\theta \sim \text{Beta}(\alpha, \beta)$. Likelihood: $X \mid \theta \sim \text{Binomial}(n, \theta)$ with $x$ successes. The posterior is

$$\theta \mid x \sim \text{Beta}(\alpha + x,\ \beta + n - x).$$

The posterior mean is $\frac{\alpha + x}{\alpha + \beta + n}$, a weighted average of the prior mean $\frac{\alpha}{\alpha + \beta}$ and the MLE $\frac{x}{n}$. As $n \to \infty$, the posterior concentrates around the MLE.
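The closed-form update makes the computation a one-liner. A sketch with hypothetical numbers (a $\text{Beta}(2,2)$ prior and 13 successes in 20 trials):

```python
from scipy import stats

# Conjugate Beta-Binomial update with hypothetical numbers:
# Beta(2, 2) prior, then x = 13 successes observed in n = 20 trials.
alpha, beta_param = 2.0, 2.0
n, x = 20, 13

posterior = stats.beta(alpha + x, beta_param + n - x)   # Beta(15, 9)

prior_mean = alpha / (alpha + beta_param)               # 0.5
mle = x / n                                             # 0.65
post_mean = (alpha + x) / (alpha + beta_param + n)      # 0.625

# The posterior mean lies between the prior mean and the MLE.
print(prior_mean, post_mean, mle)
```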
Normal–Normal Conjugacy
Prior: $\mu \sim N(\mu_0, \sigma_0^2)$. Likelihood: $X_1, \ldots, X_n \mid \mu \sim N(\mu, \sigma^2)$ (known variance). The posterior is $\mu \mid \mathbf{x} \sim N(\mu_n, \sigma_n^2)$ with

$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \sigma_n^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{n \bar{x}}{\sigma^2} \right).$$

The posterior precision (inverse variance) equals the sum of the prior precision and the data precision. The posterior mean is the precision-weighted average of the prior mean and the sample mean.
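The "precisions add" rule translates directly into code. A minimal sketch with assumed values (prior $N(0, 2^2)$, known $\sigma = 1$, simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal-Normal update with known sampling variance sigma^2 (assumed values).
mu0, sigma0 = 0.0, 2.0        # prior N(mu0, sigma0^2)
sigma = 1.0                   # known sampling sd
x = rng.normal(1.5, sigma, size=25)

prior_prec = 1.0 / sigma0**2
data_prec = len(x) / sigma**2

post_prec = prior_prec + data_prec                 # precisions add
post_var = 1.0 / post_prec
post_mean = (prior_prec * mu0 + data_prec * x.mean()) / post_prec

# Posterior mean lies between the prior mean and the sample mean.
print(post_mean, post_var)
```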
3.3 Point Estimation: MAP and Posterior Mean
Maximum A Posteriori (MAP)
The MAP estimate is the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(\theta \mid \mathbf{x}) = \arg\max_{\theta}\, \left[ \log p(\mathbf{x} \mid \theta) + \log p(\theta) \right].$$
With a uniform prior, the MAP equals the MLE. With a $N(0, \tau^2)$ prior on regression coefficients, MAP yields ridge regression ($L^2$ penalization with $\lambda = \sigma^2/\tau^2$).
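The MAP–ridge correspondence can be verified directly: the closed-form ridge solution with $\lambda = \sigma^2/\tau^2$ is the MAP estimate, and it shrinks the coefficients toward zero relative to the MLE (OLS). A sketch on simulated data with assumed $\sigma$ and $\tau$:

```python
import numpy as np

rng = np.random.default_rng(1)

# MAP under beta ~ N(0, tau^2 I) with noise variance sigma^2 equals ridge
# regression with penalty lambda = sigma^2 / tau^2 (assumed values below).
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 1.0, 2.0
y = X @ beta_true + rng.normal(0, sigma, size=n)

lam = sigma**2 / tau**2

# Closed-form ridge / MAP solution: (X'X + lambda I)^{-1} X'y
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ordinary least squares (the MLE) for comparison.
beta_mle = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_map, beta_mle)  # MAP is shrunk toward zero relative to the MLE
```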
Posterior Mean
The posterior mean minimizes the Bayes risk under squared error loss:

$$\hat{\theta}_{\text{mean}} = E[\theta \mid \mathbf{x}] = \int \theta\, p(\theta \mid \mathbf{x})\, d\theta.$$
Under absolute error loss, the optimal estimate is the posterior median. Under 0–1 loss, it is the MAP. The choice of loss function determines which summary of the posterior is optimal.
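The loss–summary correspondence can be checked by Monte Carlo: draw from a skewed posterior, then find which candidate estimate minimizes each empirical risk. A sketch using a $\text{Gamma}(2)$ stand-in posterior (an assumed example, not from the text):

```python
import numpy as np
from scipy import stats

# For a skewed posterior (here Gamma(shape=2) as a stand-in), verify that the
# posterior mean minimizes expected squared error and the posterior median
# minimizes expected absolute error, over a grid of candidate estimates.
post = stats.gamma(a=2.0)
rng = np.random.default_rng(2)
draws = post.rvs(size=200_000, random_state=rng)

candidates = np.linspace(0.5, 4.0, 351)          # candidate point estimates
sq_risk = [np.mean((draws - c)**2) for c in candidates]
abs_risk = [np.mean(np.abs(draws - c)) for c in candidates]

best_sq = candidates[np.argmin(sq_risk)]
best_abs = candidates[np.argmin(abs_risk)]

print(best_sq, post.mean())      # squared loss -> posterior mean (2.0)
print(best_abs, post.median())   # absolute loss -> posterior median (~1.68)
```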
3.4 Credible Intervals and Bayes Factors
Credible Intervals
A $100(1-\alpha)\%$ credible interval $C$ satisfies

$$P(\theta \in C \mid \mathbf{x}) = \int_C p(\theta \mid \mathbf{x})\, d\theta = 1 - \alpha.$$
Two common types:
- Equal-tailed interval: $[\theta_{\alpha/2}, \theta_{1-\alpha/2}]$ where $\theta_q$ is the $q$-th posterior quantile.
- Highest Posterior Density (HPD): the shortest interval containing $1-\alpha$ posterior probability; preferred for skewed posteriors.
Unlike frequentist confidence intervals, a credible interval directly gives the probability that the parameter lies in the interval, given the observed data.
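Both interval types are easy to compute from posterior quantiles; for a unimodal posterior, the HPD interval can be found by searching over intervals with the required coverage and keeping the shortest. A sketch for an assumed skewed $\text{Beta}(2, 8)$ posterior:

```python
import numpy as np
from scipy import stats

# Equal-tailed vs HPD 95% intervals for a skewed Beta(2, 8) posterior.
post = stats.beta(2, 8)
alpha_level = 0.05

# Equal-tailed: cut alpha/2 of probability from each tail.
eq_tail = (post.ppf(alpha_level / 2), post.ppf(1 - alpha_level / 2))

# HPD via grid search: among all intervals with 95% coverage, take the shortest.
lowers = np.linspace(0, alpha_level, 2001)       # lower-tail probability mass
widths = post.ppf(lowers + 1 - alpha_level) - post.ppf(lowers)
i = np.argmin(widths)
hpd = (post.ppf(lowers[i]), post.ppf(lowers[i] + 1 - alpha_level))

print(eq_tail, hpd)  # the HPD interval is shorter for this skewed posterior
```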
Bayes Factors
For model comparison between $M_1$ and $M_2$:

$$BF_{12} = \frac{p(\mathbf{x} \mid M_1)}{p(\mathbf{x} \mid M_2)}, \qquad p(\mathbf{x} \mid M_i) = \int p(\mathbf{x} \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i.$$

$BF_{12} > 1$ favors $M_1$. On Jeffreys' scale, $BF > 10$ is “strong evidence” and $BF > 100$ is “decisive.” Unlike p-values, Bayes factors naturally penalize model complexity through the marginal likelihood integration.
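For conjugate models the marginal likelihoods are available in closed form, so the Bayes factor needs no integration code. A sketch with hypothetical data, comparing a point null $\theta = 0.5$ against a $\text{Beta}(1,1)$ prior on $\theta$ for binomial data:

```python
import numpy as np
from scipy.special import betaln, gammaln

# Bayes factor for binomial data (hypothetical: x = 15 successes in n = 20).
# M1 fixes theta = 0.5; M2 places a Beta(1, 1) prior on theta.
n, x = 20, 15

def log_binom(n, x):
    # log of the binomial coefficient C(n, x)
    return gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)

# M1: p(x | M1) = C(n, x) * 0.5^n  (no free parameter, no integral)
log_m1 = log_binom(n, x) + n * np.log(0.5)

# M2: p(x | M2) = C(n, x) * B(1 + x, 1 + n - x) / B(1, 1)  (Beta integral)
log_m2 = log_binom(n, x) + betaln(1 + x, 1 + n - x) - betaln(1, 1)

bf_21 = np.exp(log_m2 - log_m1)
print(bf_21)  # > 1 favors M2 for this data, but only mildly
```

Working on the log scale avoids underflow for larger $n$; note that under the uniform prior $p(x \mid M_2) = 1/(n+1)$ exactly.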
3.5 Hierarchical Models
Structure
In a hierarchical (multilevel) model, parameters themselves have distributions governed by hyperparameters:

$$x_{ij} \mid \theta_j \sim p(x \mid \theta_j), \qquad \theta_j \mid \phi \sim p(\theta \mid \phi), \qquad \phi \sim p(\phi).$$
This enables partial pooling: individual estimates are shrunk toward the group mean, borrowing strength across units. The amount of shrinkage depends on the relative precision of within-group and between-group variation.
Example: Normal Hierarchical Model
Consider $J$ groups, each with $n_j$ observations:

$$x_{ij} \mid \theta_j \sim N(\theta_j, \sigma^2), \qquad \theta_j \sim N(\mu, \tau^2), \qquad i = 1, \ldots, n_j, \quad j = 1, \ldots, J.$$

The posterior mean of $\theta_j$ is approximately

$$\hat{\theta}_j \approx \frac{\frac{n_j}{\sigma^2}\, \bar{x}_j + \frac{1}{\tau^2}\, \mu}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}.$$
Groups with fewer observations are shrunk more toward the grand mean $\mu$. This is the Bayesian analog of James–Stein estimation.
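The size-dependent shrinkage is easy to see numerically: give three groups the same sample mean but very different sample sizes, and the small group is pulled hardest toward $\mu$. A sketch with assumed (known) hyperparameters:

```python
import numpy as np

# Partial pooling in a normal hierarchical model: posterior means of the
# group effects theta_j shrink toward the grand mean mu, more so for small
# groups. Hyperparameters are assumed known for illustration.
mu, tau = 0.0, 1.0        # between-group mean and sd
sigma = 2.0               # within-group sd

group_means = np.array([1.8, 1.8, 1.8])   # identical sample means
group_sizes = np.array([2, 10, 100])      # very different sample sizes

data_prec = group_sizes / sigma**2        # n_j / sigma^2
prior_prec = 1.0 / tau**2                 # 1 / tau^2

theta_hat = (data_prec * group_means + prior_prec * mu) / (data_prec + prior_prec)
print(theta_hat)  # shrinkage toward mu decreases as n_j grows
```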
Computational Laboratory
We demonstrate Beta–Binomial conjugate updating, posterior evolution with increasing data, and credible interval computation.
Beta–Binomial Update, Normal Conjugacy, Credible Intervals & Prior Sensitivity
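A sketch of the laboratory along the lines described above: sequential Beta–Binomial updating on simulated coin flips, posterior concentration as data accrue, equal-tailed credible intervals, and sensitivity to three assumed priors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated Bernoulli data with true success probability 0.3 (assumed value).
theta_true = 0.3
flips = rng.random(500) < theta_true

# Three assumed priors to probe prior sensitivity.
priors = {"uniform Beta(1,1)": (1, 1),
          "informative Beta(6,14)": (6, 14),
          "skeptical Beta(1,9)": (1, 9)}

for n in (10, 100, 500):
    x = int(flips[:n].sum())
    print(f"n = {n}, successes = {x}")
    for name, (a, b) in priors.items():
        post = stats.beta(a + x, b + n - x)       # conjugate update
        lo, hi = post.ppf(0.025), post.ppf(0.975)  # equal-tailed 95% interval
        print(f"  {name:24s} mean = {post.mean():.3f}  "
              f"95% CI = ({lo:.3f}, {hi:.3f})")
```

As $n$ grows, the three posteriors converge: with enough data the likelihood dominates any reasonable prior.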