Module 1 · Foundations

MAML — Mathematical Formulation

Model-Agnostic Meta-Learning (Finn, Abbeel & Levine, 2017) is a framework for learning an initialisation of model parameters that is maximally sensitive to task-specific gradient information — enabling rapid adaptation to new tasks with minimal data.

1.1 The Task Distribution

Let \(p(\mathcal{T})\) be a distribution over tasks. Each task \(\tau_i \sim p(\mathcal{T})\) consists of a support set \(\mathcal{S}_i\) (adaptation data) and a query set \(\mathcal{Q}_i\) (evaluation data).

Task Definition for Protein Simulation

A task \(\tau_i = (\text{scaffold}_i, T_i, \text{isotope}_i)\) specifies a protein microenvironment (scaffold geometry, temperature, isotopic substitution). The support set \(\mathcal{S}_i\) contains a small number of QM reference configurations; the query set contains held-out configurations for generalisation.
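As a concrete sketch, a task with its support/query split might be represented as below. This is a hypothetical container: the field names (`scaffold_id`, `temperature`, `isotope`) and the random stand-in data are illustrative, not from the MAML paper or any specific codebase.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical task container; field names are illustrative only.
@dataclass
class Task:
    scaffold_id: str    # protein scaffold geometry label
    temperature: float  # T_i in kelvin
    isotope: str        # isotopic substitution, e.g. "H" vs "D"
    support: tuple      # (R, E, F): configurations, QM energies, QM forces
    query: tuple        # held-out (R, E, F) for meta-evaluation

rng = np.random.default_rng(0)

def random_split(n_configs, n_atoms=5):
    """Random stand-in for real QM reference data."""
    R = rng.normal(size=(n_configs, n_atoms, 3))  # atomic positions
    E = rng.normal(size=n_configs)                # reference energies
    F = rng.normal(size=(n_configs, n_atoms, 3))  # reference forces
    return R, E, F

# Few-shot regime: small support set, larger query set.
task = Task("scaffold-A", 300.0, "D", random_split(8), random_split(32))
```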

1.2 Bi-Level Optimisation

The MAML objective is a two-level optimisation. The inner loop adapts the model to task \(\tau_i\) via gradient descent on the support loss. The outer loop updates the meta-initialisation using the adapted model evaluated on the query set.

\[ \theta^* = \arg\min_{\theta} \sum_{\tau_i \sim p(\mathcal{T})} \mathcal{L}_{\tau_i}\!\left(f_{\phi_i};\, \mathcal{Q}_i\right) \tag{1.1}\]

where the adapted parameters \(\phi_i\) after \(k\) inner gradient steps are obtained by the recursion (each gradient evaluated at the current iterate, not at the meta-initialisation):

\[ \theta^{(s+1)} = \theta^{(s)} - \alpha\, \nabla_{\theta^{(s)}} \mathcal{L}_{\tau_i}\!\left(f_{\theta^{(s)}};\, \mathcal{S}_i\right), \qquad \theta^{(0)} = \theta, \qquad \phi_i = \theta^{(k)} \tag{1.2}\]
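The inner loop of Eq. (1.2) can be sketched on a toy least-squares task loss (the model, data, and step size here are illustrative assumptions, not part of MAML itself):

```python
import numpy as np

def inner_adapt(theta, X_s, y_s, alpha=0.1, k=3):
    """Eq. (1.2) for the toy task loss L(θ) = mean((X_s θ − y_s)²);
    each gradient is taken at the current iterate θ^(s)."""
    phi = theta.copy()
    for _ in range(k):
        grad = 2.0 * X_s.T @ (X_s @ phi - y_s) / len(y_s)
        phi = phi - alpha * grad
    return phi

rng = np.random.default_rng(0)
X_s = rng.normal(size=(8, 4))                 # 8-shot support set
theta_true = np.array([1.0, -2.0, 0.5, 3.0])  # task-specific target
y_s = X_s @ theta_true
theta0 = np.zeros(4)                          # meta-initialisation θ
phi = inner_adapt(theta0, X_s, y_s)           # adapted parameters φ_i
```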

The outer-loop meta-gradient is obtained by differentiating through the inner loop, which requires second-order derivatives of \(\mathcal{L}\) with respect to \(\theta\). For a single inner step (\(k = 1\)):

\[ \nabla_\theta \mathcal{L}(f_{\phi_i};\, \mathcal{Q}_i) = \left(\frac{\partial \phi_i}{\partial \theta}\right)^{\!\top} \nabla_{\phi_i} \mathcal{L} = \left(I - \alpha \nabla^2_\theta \mathcal{L}_{\tau_i}^{\mathcal{S}}\right) \nabla_{\phi_i} \mathcal{L} \tag{1.3}\]

The term \(\nabla^2_\theta \mathcal{L}\) (the Hessian of the task loss) is the source of computational cost. First-Order MAML (FOMAML) drops this term; Reptile (Nichol et al. 2018) approximates the meta-gradient by moving \(\theta\) toward \(\phi_i\).
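Eq. (1.3) can be checked numerically on a synthetic least-squares task with one inner step, where the support Hessian is exact and constant; all data and the \(\alpha\) value below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic one-step example: for a linear model the support Hessian
# is constant, H = 2 X_sᵀ X_s / n.
X_s, y_s = rng.normal(size=(16, 4)), rng.normal(size=16)
X_q, y_q = rng.normal(size=(16, 4)), rng.normal(size=16)
theta, alpha = rng.normal(size=4), 0.05

H = 2.0 * X_s.T @ X_s / len(y_s)                       # ∇²_θ L^S
grad_s = 2.0 * X_s.T @ (X_s @ theta - y_s) / len(y_s)  # support gradient
phi = theta - alpha * grad_s                           # one inner step
grad_q = 2.0 * X_q.T @ (X_q @ phi - y_q) / len(y_q)    # ∇_φ L(Q)

meta_grad = (np.eye(4) - alpha * H) @ grad_q           # Eq. (1.3), full MAML
fomaml_grad = grad_q                                   # second-order term dropped
```

The gap between `meta_grad` and `fomaml_grad` is exactly the curvature correction \(\alpha H \nabla_{\phi}\mathcal{L}\) that FOMAML discards.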

1.3 The Energy–Force Loss

For molecular systems, the task loss combines energy and force supervision:

\[ \mathcal{L}_\tau = \lambda_E \sum_{j=1}^{|\mathcal{S}|} \bigl(E_\theta(\mathbf{R}_j) - E_j^{\text{QM}}\bigr)^2 + \lambda_F \sum_{j=1}^{|\mathcal{S}|} \bigl\|\mathbf{F}_\theta(\mathbf{R}_j) - \mathbf{F}_j^{\text{QM}}\bigr\|^2 \tag{1.4}\]

Force supervision is critical: for a system of \(N\) atoms, each configuration's forces are \(3N\)-dimensional and provide much richer geometric information than a single scalar energy. Typical values are \(\lambda_F / \lambda_E \approx 10\)–\(100\).
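A minimal sketch of Eq. (1.4), using a toy one-parameter harmonic energy model (an assumption for illustration, not a real NNP) so that the forces follow analytically as \(\mathbf{F} = -\partial E / \partial \mathbf{R}\):

```python
import numpy as np

def energy_force_loss(theta, R, E_qm, F_qm, lam_E=1.0, lam_F=50.0):
    """Eq. (1.4) for a toy harmonic model with one scalar parameter:
    E_θ(R) = 0.5·θ·Σ_a ||r_a||², hence F_θ(R) = −θ·R.
    λ_F/λ_E = 50 sits in the typical 10–100 range."""
    E_pred = 0.5 * theta * np.sum(R**2, axis=(1, 2))  # (n_configs,)
    F_pred = -theta * R                                # (n_configs, N, 3)
    return (lam_E * np.sum((E_pred - E_qm) ** 2)
            + lam_F * np.sum((F_pred - F_qm) ** 2))

rng = np.random.default_rng(2)
R = rng.normal(size=(8, 5, 3))   # 8 configurations, 5 atoms each
theta_true = 1.3
E_qm = 0.5 * theta_true * np.sum(R**2, axis=(1, 2))  # synthetic "QM" labels
F_qm = -theta_true * R
```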

1.4 Variants and Approximations

FOMAML — First-Order Approximation

Drop the second-order term in Eq. (1.3): \(\nabla_\theta \mathcal{L}(f_{\phi_i}) \approx \nabla_{\phi_i} \mathcal{L}\). This avoids backpropagating through the inner-loop trajectory, substantially reducing memory and compute, but loses curvature information. Adequate when the number of inner-loop steps is small (\(k \leq 3\)) and the loss landscape is sufficiently quadratic. For equivariant NNPs with Clebsch–Gordan coupling, where second-order graphs are expensive, FOMAML is the practical default.

Reptile — Interpolation Meta-Gradient

Update \(\theta \leftarrow \theta + \beta(\phi_i - \theta)\) — simply move the meta-initialisation toward the task-adapted parameters. Equivalent to FOMAML in expectation under certain conditions. Extremely simple to implement; requires no second-order computation. Good baseline for quick experiments.
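The Reptile meta-step itself is a one-liner; the sketch below uses hand-picked adapted parameter vectors \(\phi_i\) purely for illustration:

```python
import numpy as np

def reptile_meta_step(theta, adapted_params, beta=0.1):
    """θ ← θ + β · mean_i(φ_i − θ): pull the meta-initialisation
    toward the task-adapted parameter vectors φ_i."""
    deltas = [phi - theta for phi in adapted_params]
    return theta + beta * np.mean(deltas, axis=0)

theta = np.zeros(3)
# Hypothetical adapted parameters from two tasks (illustrative values).
adapted = [np.array([1.0, 0.0, 2.0]), np.array([3.0, 0.0, 0.0])]
theta_new = reptile_meta_step(theta, adapted, beta=0.5)
```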

ANIL — Almost No Inner Loop

Only the final linear layer (head) is adapted in the inner loop; the feature extractor (body) is updated only in the outer loop. Efficient for transfer-learning settings where the molecular representation is stable across tasks. Relevant when the NequIP body is pretrained on a diverse molecular dataset (e.g. SPICE or ANI-2x) and only the energy head requires task-specific adaptation.
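The head-only inner loop can be sketched as below; the tanh featurizer and all names are illustrative stand-ins for a pretrained equivariant body:

```python
import numpy as np

def anil_inner_adapt(body_W, head_w, X, y, alpha=0.1, k=5):
    """ANIL inner loop: adapt only the linear head on frozen features.
    body_W (the featurizer) is touched only by the outer loop."""
    feats = np.tanh(X @ body_W)   # frozen feature extractor (the "body")
    w = head_w.copy()
    for _ in range(k):
        grad = 2.0 * feats.T @ (feats @ w - y) / len(y)
        w = w - alpha * grad
    return w

rng = np.random.default_rng(3)
body_W = rng.normal(size=(4, 8))   # stand-in for a pretrained body
X, y = rng.normal(size=(16, 4)), rng.normal(size=16)
head0 = np.zeros(8)
head_adapted = anil_inner_adapt(body_W, head0, X, y)
```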

Uncertainty-Aware MAML with Gaussian Processes

Replace the deterministic NNP with a GP posterior or a deep ensemble. MAML adapts the GP kernel hyperparameters (or the ensemble diversity) to each task. Provides calibrated uncertainty estimates on the adapted potential — critical for active-learning workflows where one wants to sample new QM configurations in regions of high uncertainty.
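A cheap version of this idea, sketched with a deep-ensemble proxy rather than a full GP posterior (the one-parameter harmonic energy model and all values below are illustrative assumptions):

```python
import numpy as np

def ensemble_uncertainty(thetas, R):
    """Disagreement across adapted ensemble members as an uncertainty
    proxy, using the toy harmonic model E_θ(R) = 0.5·θ·Σ_a ||r_a||²."""
    E = np.stack([0.5 * th * np.sum(R**2, axis=(1, 2)) for th in thetas])
    return E.std(axis=0)   # per-configuration std of predicted energy

rng = np.random.default_rng(4)
thetas = 1.0 + 0.1 * rng.normal(size=5)  # five adapted ensemble members
R = rng.normal(size=(10, 4, 3))          # candidate configurations
scores = ensemble_uncertainty(thetas, R)
next_config = int(np.argmax(scores))     # most uncertain: send to QM
```

High-variance configurations are the natural candidates for new QM labels in the active-learning loop described above.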