Chapter 14: Gaussian Processes
A Gaussian Process (GP) is a distribution over functions: instead of a finite parameter vector, it places a prior over the entire function space. GP regression yields an exact Bayesian posterior with analytic uncertainty estimates, making it the gold standard for small-data, high-uncertainty settings.
1. Multivariate Gaussian Conditioning
The GP posterior derivation rests on conditioning a multivariate Gaussian. Suppose \(\mathbf{x}_1\) and \(\mathbf{x}_2\) are jointly Gaussian:

\[
\begin{bmatrix}\mathbf{x}_1 \\ \mathbf{x}_2\end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix}\boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{bmatrix},\; \begin{bmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{bmatrix}\right)
\]

Then the conditional distribution is:

\[
\mathbf{x}_1 \mid \mathbf{x}_2 \sim \mathcal{N}\!\big(\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\;\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big)
\]
Derivation sketch: Complete the square in the joint Gaussian density with respect to \(\mathbf{x}_1\) while treating \(\mathbf{x}_2\) as fixed. The resulting quadratic form identifies both the conditional mean (linear in \(\mathbf{x}_2\)) and the conditional covariance (independent of \(\mathbf{x}_2\)).
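The conditioning formulas can be checked numerically: condition analytically, then compare against a Monte Carlo estimate from the joint. A minimal sketch (the mean vector, covariance, and observed value below are made-up illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint Gaussian over (x1, x2); the numbers are made up for the demo.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

x2_obs = -1.5  # observed value of x2

# Conditional formulas: x1 | x2 ~ N(mu1 + S12 S22^{-1}(x2 - mu2), S11 - S12 S22^{-1} S21)
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_obs - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# Monte Carlo check: sample the joint, keep draws whose x2 lands near the observation.
samples = rng.multivariate_normal(mu, Sigma, size=500_000)
near = samples[np.abs(samples[:, 1] - x2_obs) < 0.02, 0]
print(mu_cond, var_cond)        # 1.4, 1.36
print(near.mean(), near.var())  # should be close to the values above
```

Note that the empirical variance of the retained slice matches the analytic conditional variance regardless of which value of \(\mathbf{x}_2\) was observed, as the derivation sketch predicts.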
2. Gaussian Process Definition
A Gaussian Process is a collection of random variables, any finite subset of which has a joint Gaussian distribution. It is fully specified by a mean function and a covariance (kernel) function:

\[
f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big), \qquad m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathrm{Cov}\big(f(x), f(x')\big)
\]
For any finite set of inputs \(\mathbf{X} = \{x_1,\ldots,x_n\}\), the function values are jointly Gaussian: \(\mathbf{f} \sim \mathcal{N}(\mathbf{m},\mathbf{K})\) where \(K_{ij} = k(x_i, x_j)\).
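This finite-dimensional consistency is what makes GPs computable: to visualise the prior, draw from the multivariate Gaussian induced on a grid of inputs. A minimal sketch with an RBF kernel (grid size and hyperparameters are arbitrary choices):

```python
import numpy as np

def rbf(x1, x2, sigma2=1.0, ell=1.0):
    # Squared-exponential kernel k(x, x') = sigma^2 exp(-(x - x')^2 / (2 ell^2))
    d = x1[:, None] - x2[None, :]
    return sigma2 * np.exp(-0.5 * (d / ell) ** 2)

x = np.linspace(0, 5, 50)
K = rbf(x, x)                # K_ij = k(x_i, x_j)
K += 1e-8 * np.eye(len(x))   # jitter for numerical positive-definiteness

rng = np.random.default_rng(1)
# Three draws from the prior: f ~ N(0, K) on the grid
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)  # (3, 50)
```

Each row of `samples` is one plausible function under the prior; a finer grid gives a smoother-looking curve but the same distribution at shared points.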
2.1 Kernel Functions
RBF (squared exponential): \(k(x,x') = \sigma^2 \exp\!\left(-\frac{(x-x')^2}{2\ell^2}\right)\). Infinitely differentiable, so it models very smooth functions. \(\ell\) controls the length scale (how quickly the correlation decays); \(\sigma^2\) controls the output variance.
Matérn-3/2: \(k(x,x') = \sigma^2\left(1+\frac{\sqrt{3}\,|x-x'|}{\ell}\right)\exp\!\left(-\frac{\sqrt{3}\,|x-x'|}{\ell}\right)\). Once (mean-square) differentiable, so rougher than RBF and more realistic for physical processes and spatial data.
Periodic: \(k(x,x') = \sigma^2 \exp\!\left(-\frac{2\sin^2(\pi |x-x'|/p)}{\ell^2}\right)\). Models periodic functions with period \(p\); useful for time series with daily or seasonal patterns.
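The three kernels above can be sketched as functions of the distance \(r = |x - x'|\) (hyperparameter defaults below are arbitrary):

```python
import numpy as np

def rbf(r, sigma2=1.0, ell=1.0):
    # k(r) = sigma^2 exp(-r^2 / (2 ell^2))
    return sigma2 * np.exp(-0.5 * (r / ell) ** 2)

def matern32(r, sigma2=1.0, ell=1.0):
    # k(r) = sigma^2 (1 + sqrt(3) r / ell) exp(-sqrt(3) r / ell)
    a = np.sqrt(3.0) * r / ell
    return sigma2 * (1.0 + a) * np.exp(-a)

def periodic(r, sigma2=1.0, ell=1.0, p=1.0):
    # k(r) = sigma^2 exp(-2 sin^2(pi r / p) / ell^2)
    return sigma2 * np.exp(-2.0 * np.sin(np.pi * r / p) ** 2 / ell ** 2)

r = np.linspace(0, 3, 7)
# All three equal sigma^2 at r = 0; the periodic kernel returns to sigma^2 at r = p.
print(rbf(r)[0], matern32(r)[0], periodic(r)[0])
```

Plotting these three curves against \(r\) shows the qualitative differences: RBF decays smoothly, Matérn-3/2 has a kink at the origin (the source of its roughness), and the periodic kernel oscillates with period \(p\).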
3. GP Regression: Full Posterior Derivation
Given training data \(\mathcal{D} = \{(\mathbf{X}, \mathbf{y})\}\) with noise model \(y_i = f(x_i) + \varepsilon_i\), \(\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)\), we want the posterior over function values \(\mathbf{f}_* = f(\mathbf{X}_*)\) at test points \(\mathbf{X}_*\).
The joint prior over training and test function values is:

\[
\begin{bmatrix}\mathbf{y} \\ \mathbf{f}_*\end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix}K(\mathbf{X},\mathbf{X})+\sigma_n^2 I & K(\mathbf{X},\mathbf{X}_*) \\ K(\mathbf{X}_*,\mathbf{X}) & K(\mathbf{X}_*,\mathbf{X}_*)\end{bmatrix}\right)
\]
Applying the conditional Gaussian formulas with \(\Sigma_{22} = K(\mathbf{X},\mathbf{X})+\sigma_n^2 I\), \(\Sigma_{12} = K(\mathbf{X}_*,\mathbf{X})\):

\[
\mathbf{f}_* \mid \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)
\]
\[
\boldsymbol{\mu}_* = K(\mathbf{X}_*,\mathbf{X})\,\big[K(\mathbf{X},\mathbf{X})+\sigma_n^2 I\big]^{-1}\mathbf{y}
\]
\[
\Sigma_* = K(\mathbf{X}_*,\mathbf{X}_*) - K(\mathbf{X}_*,\mathbf{X})\,\big[K(\mathbf{X},\mathbf{X})+\sigma_n^2 I\big]^{-1}K(\mathbf{X},\mathbf{X}_*)
\]
The vector \(\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1}\mathbf{y}\) is computed once. The posterior mean \(\mu_*(x_*) = \mathbf{k}(x_*,\mathbf{X})\boldsymbol{\alpha}\) is a kernel-weighted sum over training points: a non-parametric function that passes close to the data.
Computational note
Naive inversion is \(O(n^3)\). In practice use a Cholesky factorisation \(LL^\top = K + \sigma_n^2 I\), then obtain \(\boldsymbol{\alpha}\) from two triangular solves, \(L\mathbf{z} = \mathbf{y}\) and \(L^\top\boldsymbol{\alpha} = \mathbf{z}\). The factorisation costs \(O(n^3)\) once; each predictive mean is then \(O(n)\) and each predictive variance \(O(n^2)\) per test point. Memory is \(O(n^2)\).
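A sketch of the Cholesky-based computation, using SciPy's `cho_factor`/`cho_solve`; the toy data (a noisy sine) and hyperparameters are illustrative assumptions, not part of the derivation:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(2)

def rbf(a, b, sigma2=1.0, ell=1.0):
    # k(x, x') = sigma^2 exp(-(x - x')^2 / (2 ell^2))
    return sigma2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

# Toy 1-D data; the target function and noise level are assumptions for the demo.
X = np.linspace(0, 5, 9)
y = np.sin(X) + 0.1 * rng.standard_normal(9)
Xs = np.linspace(0, 5, 100)
sn2 = 0.1 ** 2

Ky = rbf(X, X) + sn2 * np.eye(len(X))
c, low = cho_factor(Ky)            # O(n^3) factorisation, done once
alpha = cho_solve((c, low), y)     # two O(n^2) triangular solves

Ks = rbf(Xs, X)                    # K(X_*, X)
mu = Ks @ alpha                    # posterior mean: O(n) per test point
# Posterior variance: k(x*, x*) - k_*^T (K + sigma_n^2 I)^{-1} k_*
var = rbf(Xs, Xs).diagonal() - np.sum(Ks * cho_solve((c, low), Ks.T).T, axis=1)
print(mu.shape, float(var.min()))
```

The Cholesky route is also numerically safer than forming the explicit inverse, since the triangular solves avoid amplifying rounding error in ill-conditioned kernel matrices.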
3.1 Marginal Likelihood for Hyperparameter Optimisation
The log marginal likelihood (integrating out \(\mathbf{f}\)) has a closed form:

\[
\log p(\mathbf{y}\mid\mathbf{X},\boldsymbol{\theta}) = -\tfrac{1}{2}\mathbf{y}^\top\big(K+\sigma_n^2 I\big)^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\big|K+\sigma_n^2 I\big| \;-\; \tfrac{n}{2}\log 2\pi
\]
The first term rewards data fit; the second penalises model complexity (the log-determinant grows with model flexibility). Maximising this w.r.t. the kernel hyperparameters \(\boldsymbol{\theta} = (\sigma^2, \ell, \sigma_n^2)\) by gradient-based optimisation automatically balances fit against complexity, with no cross-validation needed.
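A sketch of hyperparameter optimisation by maximising the log marginal likelihood. For brevity this uses a derivative-free optimiser (Nelder-Mead) rather than the analytic kernel gradients; the toy data are an assumption for the demo, and parameters are optimised in log space to keep them positive:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = np.linspace(0, 5, 9)
y = np.sin(X) + 0.1 * rng.standard_normal(9)

def neg_log_marginal(log_theta):
    # theta = (sigma^2, ell, sigma_n^2), parameterised in log space
    sigma2, ell, sigma_n2 = np.exp(log_theta)
    d = X[:, None] - X[None, :]
    Ky = sigma2 * np.exp(-0.5 * (d / ell) ** 2) + sigma_n2 * np.eye(len(X))
    c, low = cho_factor(Ky)
    alpha = cho_solve((c, low), y)
    # log|Ky| = 2 * sum(log diag(L)) from the Cholesky factor
    logdet = 2.0 * np.sum(np.log(np.diag(c)))
    n = len(X)
    return 0.5 * y @ alpha + 0.5 * logdet + 0.5 * n * np.log(2 * np.pi)

res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0, 0.01]), method="Nelder-Mead")
sigma2, ell, sigma_n2 = np.exp(res.x)
print(res.fun, sigma2, ell, sigma_n2)
```

Minimising the negative log marginal likelihood is equivalent to the maximisation described above; with analytic gradients one would typically switch to L-BFGS.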
4. GP Prior and Posterior Diagram
5. Python Simulation: GP Regression from Scratch
We implement GP regression using only NumPy and SciPy. The simulation shows: (1) prior samples from the RBF kernel, (2) the posterior mean and 95% confidence band after observing 9 noisy data points, (3) posterior samples, and (4) a comparison of RBF, Matérn-3/2, and periodic kernels, each with their log marginal likelihood score.
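A condensed sketch of the core of that simulation: posterior mean, variance, and log marginal likelihood under each of the three kernels, on an assumed noisy-sine dataset (the target function, noise level, and kernel hyperparameters below are illustrative choices; plotting is omitted):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(42)

# Kernels with unit output variance; length scales and period are illustrative.
def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def matern32(a, b, ell=1.0):
    s = np.sqrt(3.0) * np.abs(a[:, None] - b[None, :]) / ell
    return (1.0 + s) * np.exp(-s)

def periodic(a, b, ell=1.0, p=2.0):
    r = np.abs(a[:, None] - b[None, :])
    return np.exp(-2.0 * np.sin(np.pi * r / p) ** 2 / ell ** 2)

# 9 noisy observations of an assumed target function
X = np.linspace(0, 5, 9)
y = np.sin(X) + 0.1 * rng.standard_normal(9)
Xs = np.linspace(0, 5, 200)
sn2 = 0.1 ** 2

def gp_posterior_and_lml(kernel):
    Ky = kernel(X, X) + sn2 * np.eye(len(X))
    c, low = cho_factor(Ky)
    alpha = cho_solve((c, low), y)
    Ks = kernel(Xs, X)
    mu = Ks @ alpha  # posterior mean at test points
    var = kernel(Xs, Xs).diagonal() - np.sum(Ks * cho_solve((c, low), Ks.T).T, axis=1)
    # log marginal likelihood via the Cholesky factor
    lml = -0.5 * y @ alpha - np.sum(np.log(np.diag(c))) - 0.5 * len(X) * np.log(2 * np.pi)
    return mu, var, lml

for name, k in [("RBF", rbf), ("Matern-3/2", matern32), ("Periodic", periodic)]:
    mu, var, lml = gp_posterior_and_lml(k)
    band = 1.96 * np.sqrt(np.maximum(var + sn2, 0.0))  # 95% predictive band
    print(f"{name}: log marginal likelihood = {lml:.2f}, max band half-width = {band.max():.2f}")
```

Ranking kernels by log marginal likelihood is the model-selection criterion from Section 3.1; plotting `mu` with the `band` envelope reproduces the posterior figures described above.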