Chapter 13: Bayesian Inference
Bayesian inference is a principled framework for learning from data: instead of a single parameter estimate, we maintain a full probability distribution over hypotheses, updating it as evidence arrives. Every prior belief, every observation, and every prediction is part of a coherent calculus of uncertainty.
1. Bayes' Theorem as a Learning Rule
Let \(\theta\) be a parameter (e.g. the bias of a coin) and \(\mathcal{D} = \{x_1,\ldots,x_n\}\) be observed data. Bayes' theorem states:
\[ P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}, \qquad P(\mathcal{D}) = \int P(\mathcal{D} \mid \theta)\, P(\theta)\, d\theta. \]
The evidence \(P(\mathcal{D})\) is typically intractable for continuous \(\theta\) — this is why approximations like MCMC and variational inference are needed. For conjugate priors, the posterior kernel is recognised by inspection, so the normaliser never has to be computed explicitly.
2. Conjugate Priors
A prior \(P(\theta)\) is conjugate to a likelihood \(P(\mathcal{D}|\theta)\) if the posterior is in the same distributional family as the prior. Conjugacy makes Bayesian updating a simple parameter update.
2.1 Beta-Bernoulli (Coin Flipping)
For coin flips \(x_i \in \{0,1\}\) with unknown bias \(p\), the likelihood of a particular sequence with \(h\) heads in \(n\) flips is:
\[ P(\mathcal{D} \mid p) = p^{h}(1-p)^{n-h}. \]
Choose a Beta prior: \(P(p) = \mathrm{Beta}(\alpha_0, \beta_0) \propto p^{\alpha_0-1}(1-p)^{\beta_0-1}\). Then:
\[ P(p \mid \mathcal{D}) \propto p^{\alpha_0+h-1}(1-p)^{\beta_0+n-h-1} = \mathrm{Beta}(\alpha_0+h,\; \beta_0+n-h). \]
The hyper-parameters \(\alpha_0, \beta_0\) act as pseudo-counts: \(\alpha_0\) is like having seen \(\alpha_0\) extra heads (and \(\beta_0\) extra tails) before any real data. The posterior mean is \(\hat{p} = \frac{\alpha_0+h}{\alpha_0+\beta_0+n}\), a weighted average between the prior mean \(\frac{\alpha_0}{\alpha_0+\beta_0}\) and the MLE \(\frac{h}{n}\).
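This update can be sketched in a few lines of NumPy (the prior values, seed, and function name are illustrative, not prescribed by the text):

```python
import numpy as np

def beta_update(alpha0, beta0, flips):
    """Conjugate Beta-Bernoulli update: heads add to alpha, tails to beta."""
    h = int(np.sum(flips))
    n = len(flips)
    return alpha0 + h, beta0 + (n - h)

rng = np.random.default_rng(0)
flips = rng.random(100) < 0.65          # simulated coin with bias 0.65
a, b = beta_update(2.0, 2.0, flips)     # Beta(2, 2) prior as pseudo-counts
posterior_mean = a / (a + b)            # weighted average of prior mean and MLE
```

Note that the data enter only through the sufficient statistics \(h\) and \(n\); the full sequence of flips never needs to be stored.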
2.2 Normal-Normal (Known Variance)
Suppose observations \(x_i \sim \mathcal{N}(\mu, \sigma^2)\) with known \(\sigma^2\), and prior \(\mu \sim \mathcal{N}(\mu_0, \tau_0^2)\). The likelihood for \(n\) observations is:
\[ P(\mathcal{D} \mid \mu) \propto \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right). \]
Adding the log-prior \(-\frac{1}{2\tau_0^2}(\mu-\mu_0)^2\) and completing the square in \(\mu\) gives:
\[ \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \tau_n^2\left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma^2}\right). \]
So the posterior is \(\mu \mid \mathcal{D} \sim \mathcal{N}(\mu_n, \tau_n^2)\). The posterior mean \(\mu_n\) is a precision-weighted average of the prior mean and the MLE \(\bar{x}\). As \(n \to \infty\), the data dominates and \(\mu_n \to \bar{x}\).
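A minimal sketch of this posterior computation (the prior and data values are illustrative):

```python
import numpy as np

def normal_posterior(x, sigma2, mu0, tau0_sq):
    """Posterior over the mean mu, given known observation variance sigma2."""
    n = len(x)
    precision = 1.0 / tau0_sq + n / sigma2          # precisions add
    tau_n_sq = 1.0 / precision
    mu_n = tau_n_sq * (mu0 / tau0_sq + n * np.mean(x) / sigma2)
    return mu_n, tau_n_sq

rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.0, size=500)                  # true mean 3, known variance 1
mu_n, tau_n_sq = normal_posterior(x, sigma2=1.0, mu0=0.0, tau0_sq=10.0)
```

Even with a prior centred at 0, five hundred observations pull \(\mu_n\) very close to the sample mean, and the posterior variance \(\tau_n^2\) shrinks roughly like \(\sigma^2/n\).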
3. Bayesian Updating Diagram
4. Predictive Distribution
Rather than plugging in a point estimate of \(\theta\), Bayesian prediction integrates over the posterior — this accounts for parameter uncertainty:
\[ P(x_{\text{new}} \mid \mathcal{D}) = \int P(x_{\text{new}} \mid \theta)\, P(\theta \mid \mathcal{D})\, d\theta. \]
For the Beta-Bernoulli model with posterior \(\mathrm{Beta}(\alpha_n, \beta_n)\), the probability of the next flip being heads is simply \(\mathbb{E}[p \mid \mathcal{D}] = \frac{\alpha_n}{\alpha_n+\beta_n}\). For predicting \(k\) heads in the next \(m\) flips, the result is the Beta-Binomial distribution:
\[ P(k \mid \mathcal{D}) = \binom{m}{k} \frac{B(\alpha_n+k,\; \beta_n+m-k)}{B(\alpha_n, \beta_n)}, \]
where \(B(\cdot,\cdot)\) is the Beta function.
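The Beta-Binomial pmf can be computed stably in log-space using only the standard library (the posterior counts below are illustrative):

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """log B(a, b) via log-gamma, avoiding overflow for large counts."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, m, alpha_n, beta_n):
    """P(k heads in the next m flips) under a Beta(alpha_n, beta_n) posterior."""
    return comb(m, k) * exp(log_beta(alpha_n + k, beta_n + m - k)
                            - log_beta(alpha_n, beta_n))

alpha_n, beta_n = 15.0, 9.0                      # hypothetical posterior counts
pmf = [beta_binomial_pmf(k, 10, alpha_n, beta_n) for k in range(11)]
```

As a sanity check, the \(m=1\) case collapses to the posterior-mean rule from above: \(P(1 \mid \mathcal{D}) = \alpha_n/(\alpha_n+\beta_n)\).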
5. MCMC: Metropolis-Hastings
For complex posteriors where conjugacy fails, Markov Chain Monte Carlo (MCMC) draws samples from \(P(\theta|\mathcal{D})\) without needing to compute the normaliser. The Metropolis-Hastings algorithm works as follows:
- Start at some \(\theta^{(0)}\).
- At step \(t\), propose \(\theta^* \sim q(\theta^* \mid \theta^{(t-1)})\) (e.g., Gaussian random walk).
- Compute the acceptance ratio:
  \[ \alpha = \min\!\left(1,\; \frac{P(\theta^* \mid \mathcal{D})\, q(\theta^{(t-1)} \mid \theta^*)}{P(\theta^{(t-1)} \mid \mathcal{D})\, q(\theta^* \mid \theta^{(t-1)})}\right) \]
  Note: since \(P(\theta^*|\mathcal{D}) \propto P(\mathcal{D}|\theta^*)P(\theta^*)\), the normaliser cancels in the ratio.
- Accept \(\theta^{(t)} = \theta^*\) with probability \(\alpha\); otherwise stay: \(\theta^{(t)} = \theta^{(t-1)}\).
- Discard the burn-in period. The remaining samples are correlated draws whose marginal distribution approximates the posterior; thinning the chain can reduce the autocorrelation if needed.
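The steps above can be sketched for the coin-flip posterior, where the exact answer \(\mathrm{Beta}(\alpha_0+h, \beta_0+n-h)\) is known and serves as a check (step size, seed, and burn-in fraction are illustrative choices):

```python
import numpy as np

def log_post(p, h, n, a0, b0):
    """Unnormalised log posterior: Beta(a0, b0) prior x Bernoulli likelihood."""
    if p <= 0.0 or p >= 1.0:
        return -np.inf
    return (a0 - 1 + h) * np.log(p) + (b0 - 1 + n - h) * np.log(1 - p)

def metropolis_hastings(h, n, a0=1.0, b0=1.0, steps=20000, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    p = 0.5                                   # arbitrary starting point
    chain = []
    for _ in range(steps):
        prop = p + rng.normal(0.0, step)      # symmetric proposal: q cancels
        # Accept with probability min(1, posterior ratio), in log-space.
        if np.log(rng.random()) < log_post(prop, h, n, a0, b0) - log_post(p, h, n, a0, b0):
            p = prop
        chain.append(p)
    return np.array(chain[steps // 4:])       # discard the first quarter as burn-in

samples = metropolis_hastings(h=65, n=100, a0=2.0, b0=2.0)
```

With a Gaussian random walk the proposal is symmetric, so the \(q\) terms in the acceptance ratio cancel and only the unnormalised posterior is ever evaluated.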
Why does MH converge to the correct distribution?
The chain satisfies detailed balance: \(\pi(\theta)\,T(\theta^*|\theta) = \pi(\theta^*)\,T(\theta|\theta^*)\), where \(\pi = P(\theta|\mathcal{D})\) is the target and \(T\) is the transition kernel. Detailed balance ensures \(\pi\) is the stationary distribution of the chain.
5.1 Gibbs Sampling
When \(\theta = (\theta_1,\ldots,\theta_K)\) is multivariate but each full conditional \(P(\theta_j \mid \theta_{-j}, \mathcal{D})\) is tractable, Gibbs sampling cycles through sampling each variable from its full conditional. It is a special case of MH in which the acceptance probability is always 1.
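A classic illustration, not tied to the coin example, is Gibbs sampling from a standard bivariate normal with correlation \(\rho\), where both full conditionals are univariate normals (the value of \(\rho\) and the chain length are illustrative):

```python
import numpy as np

def gibbs_bivariate_normal(rho, steps=20000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.
    Each full conditional is N(rho * other, 1 - rho^2), so no draw is rejected."""
    rng = np.random.default_rng(seed)
    x = y = 0.0
    out = np.empty((steps, 2))
    sd = np.sqrt(1.0 - rho**2)
    for t in range(steps):
        x = rng.normal(rho * y, sd)   # x | y ~ N(rho*y, 1 - rho^2)
        y = rng.normal(rho * x, sd)   # y | x ~ N(rho*x, 1 - rho^2)
        out[t] = (x, y)
    return out[steps // 4:]           # discard burn-in

draws = gibbs_bivariate_normal(0.8)
```

The sample correlation of the retained draws should match \(\rho\), confirming that cycling through exact full conditionals leaves the target distribution invariant.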
Cross-Disciplinary Connection
This Bayesian framework is the same one used to model musical expectations in the Music & Mathematics: Neuroscience & Perception chapter. The brain maintains a prior over upcoming notes and updates it with each heard sound — exactly the Beta-Bernoulli update derived above, with musical context playing the role of \(\alpha_0, \beta_0\).
6. Python Simulation: Bayesian Coin Flip
We simulate a coin with true bias \(p=0.65\) and watch how three different Beta priors converge to the true value as evidence accumulates. We also implement Metropolis-Hastings and compare its histogram to the exact Beta posterior, and plot the Beta-Binomial predictive distribution.
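The prior-convergence part of that experiment can be sketched as follows (the three priors, the seed, and the checkpoints are illustrative choices, and plotting is omitted):

```python
import numpy as np

rng = np.random.default_rng(42)
p_true = 0.65
flips = (rng.random(1000) < p_true).astype(int)

# Three hypothetical Beta priors: flat, sceptical of heads, confident the coin is fair.
priors = {"flat": (1, 1), "sceptical": (2, 8), "confident-fair": (20, 20)}

for n in (10, 100, 1000):
    h = flips[:n].sum()
    means = {name: (a + h) / (a + b + n) for name, (a, b) in priors.items()}
    print(n, {k: round(v, 3) for k, v in means.items()})
```

With only 10 flips the three posterior means disagree noticeably; by 1000 flips all of them sit close to the true bias 0.65, illustrating how the data swamp the prior.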