ML for Science/Part III: Scientific Applications

Symbolic Regression

Discovering equations from data — from genetic programming to SINDy and the search for physical laws

Introduction

Symbolic regression is the task of finding a mathematical expression that best describes a dataset. Unlike standard regression, which fits parameters of a fixed functional form, symbolic regression searches over the space of possible equations themselves. This makes it uniquely powerful for scientific discovery: it can rediscover known physical laws and potentially find new ones.

While neural networks are powerful function approximators, they produce opaque black boxes. Symbolic regression produces interpretable, generalizable equations that scientists can analyze, verify against dimensional analysis, and use for extrapolation.

Key Topics

  • 1. The Symbolic Regression Problem
  • 2. Genetic Programming
  • 3. Sparse Regression (SINDy)
  • 4. Dimensional Analysis Constraints
  • 5. PySR and Modern Approaches
  • 6. Discovering Physical Laws

1. The Symbolic Regression Problem

Problem Statement

Given data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, find a symbolic expression $f^*$ that minimizes:

$$\boxed{f^* = \arg\min_{f \in \mathcal{F}} \left[\sum_{i=1}^{n}(y_i - f(\mathbf{x}_i))^2 + \lambda \cdot C(f)\right]}$$

where $\mathcal{F}$ is the space of symbolic expressions built from a set of operations ($+, -, \times, \div, \sin, \cos, \exp, \log, \ldots$), and $C(f)$ is a complexity penalty (e.g., expression length) that promotes parsimony.

This is an NP-hard combinatorial optimization problem. The search space is enormous: the number of possible expressions of length $L$ with $k$ operations grows as $O(k^L)$.

Expression Trees

Symbolic expressions are represented as expression trees:

  • Leaves: Variables ($x_1, x_2, \ldots$) and constants ($c_1, c_2, \ldots$)
  • Unary internal nodes: Unary operations ($\sin, \cos, \exp, \log, \text{abs}, \sqrt{\cdot}$)
  • Binary internal nodes: Binary operations ($+, -, \times, \div, \text{pow}$)

For example, $f(x) = \sin(2x) + x^2$ is a tree with "+" at the root, "sin(2*x)" and "pow(x, 2)" as subtrees. The complexity $C(f)$ is often the number of nodes.
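To make the representation concrete, here is a minimal sketch of tree evaluation and node counting; the tuple-based node format, the `OPS` table, and the helper names are illustrative choices, not any particular library's API:

```python
import math

# Illustrative node format: a node is a variable name (str), a numeric
# constant, or an (operator, child, ...) tuple.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "sin": lambda a: math.sin(a),
    "pow": lambda a, b: a ** b,
}

def evaluate(node, env):
    """Recursively evaluate an expression tree given variable bindings."""
    if isinstance(node, str):           # leaf: variable
        return env[node]
    if isinstance(node, (int, float)):  # leaf: constant
        return node
    op, *children = node
    return OPS[op](*(evaluate(c, env) for c in children))

def size(node):
    """Complexity C(f) = number of nodes in the tree."""
    if not isinstance(node, tuple):
        return 1
    return 1 + sum(size(c) for c in node[1:])

# f(x) = sin(2x) + x^2
f = ("+", ("sin", ("*", 2, "x")), ("pow", "x", 2))
print(evaluate(f, {"x": 1.0}))  # sin(2) + 1 ≈ 1.9093
print(size(f))                  # 8 nodes
```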

Pareto Optimality

Instead of fixing $\lambda$, we seek the Pareto front of accuracy vs. complexity. An expression is Pareto-optimal if no other expression is at least as good on both criteria and strictly better on at least one:

$$\text{Pareto front} = \{f \in \mathcal{F} : \nexists\, g \text{ with } \text{MSE}(g) \leq \text{MSE}(f) \text{ and } C(g) \leq C(f), \text{ at least one inequality strict}\}$$

The scientist then examines the Pareto front and selects the equation at the "elbow": the point beyond which additional complexity buys only marginal accuracy.
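A sketch of extracting the Pareto front from scored candidates, assuming a hypothetical `(mse, complexity, expr)` tuple format:

```python
def pareto_front(candidates):
    """Keep candidates not dominated by any other: a candidate is dominated
    if another is at least as accurate and at least as simple, with at
    least one of the two comparisons strict."""
    front = []
    for mse, c, expr in candidates:
        dominated = any(
            m2 <= mse and c2 <= c and (m2 < mse or c2 < c)
            for m2, c2, _ in candidates
        )
        if not dominated:
            front.append((mse, c, expr))
    return sorted(front, key=lambda t: t[1])  # order by complexity

# Hypothetical (MSE, complexity, expression) scores
cands = [(0.50, 1, "x"), (0.10, 3, "x^2"), (0.12, 5, "x^2+c"),
         (0.01, 7, "sin(2x)+x^2"), (0.01, 9, "bloated")]
print(pareto_front(cands))  # drops the dominated "x^2+c" and "bloated"
```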

2. Genetic Programming

Genetic programming (GP) is the classic approach to symbolic regression. It evolves a population of expression trees using bio-inspired operators.

The Evolutionary Algorithm

  1. Initialization: Generate a population of $P$ random expression trees
  2. Fitness evaluation: Compute $\text{fitness}(f) = -\text{MSE}(f) - \lambda C(f)$ for each individual
  3. Selection: Choose parents using tournament selection (pick $k$ random individuals, select the fittest)
  4. Crossover: Swap random subtrees between two parent trees to create offspring
  5. Mutation: Randomly modify nodes (change operation, replace subtree, add/remove nodes)
  6. Replacement: Form the next generation from offspring and possibly elite individuals
  7. Repeat steps 2-6 for $G$ generations

Genetic Operators

  • Crossover: Select random nodes in two parent trees and swap the subtrees rooted at those nodes. This combines useful subexpressions from different individuals.
  • Point mutation: Replace a random node with another of the same arity (e.g., change $+$ to $\times$, or $\sin$ to $\cos$).
  • Subtree mutation: Replace a random subtree with a newly generated random subtree. Introduces fresh genetic material.
  • Constant optimization: After tree structure is fixed, optimize numerical constants using gradient descent or BFGS.
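The crossover and point-mutation operators can be sketched on trees stored as nested Python lists (mutable, so subtrees can be swapped in place); the node format and the `SAME_ARITY` table are illustrative assumptions:

```python
import random

# Illustrative node format: a mutable list [op, child, ...], a variable
# name string, or a numeric constant.
SAME_ARITY = {"+": ["-", "*"], "-": ["+", "*"], "*": ["+", "-"],
              "sin": ["cos"], "cos": ["sin"]}

def slots(tree, parent=None, idx=None, out=None):
    """Collect (parent, index) positions of every non-root node, so the
    subtree stored there can be replaced in place."""
    if out is None:
        out = []
    if parent is not None:
        out.append((parent, idx))
    if isinstance(tree, list):
        for i in range(1, len(tree)):
            slots(tree[i], tree, i, out)
    return out

def subtrees(tree, out=None):
    """List every operator node in the tree."""
    if out is None:
        out = []
    if isinstance(tree, list):
        out.append(tree)
        for child in tree[1:]:
            subtrees(child, out)
    return out

def crossover(a, b, rng):
    """Swap a random subtree of `a` with a random subtree of `b`."""
    (pa, ia), (pb, ib) = rng.choice(slots(a)), rng.choice(slots(b))
    pa[ia], pb[ib] = pb[ib], pa[ia]

def point_mutation(tree, rng):
    """Replace a random operator with another of the same arity."""
    ops = [t for t in subtrees(tree) if t[0] in SAME_ARITY]
    if ops:
        node = rng.choice(ops)
        node[0] = rng.choice(SAME_ARITY[node[0]])

rng = random.Random(0)
a = ["+", ["sin", "x"], ["*", "x", "x"]]  # sin(x) + x*x
b = ["-", ["cos", "y"], 3.0]              # cos(y) - 3
crossover(a, b, rng)
point_mutation(a, rng)
print(a)
print(b)
```

Both operators preserve the total number of nodes across the population, which is why separate bloat control (next subsection) is still needed: growth comes from repeatedly swapping small subtrees for large ones.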

Bloat Control

GP suffers from bloat: expressions grow increasingly large without improving fitness (the inert extra material is called intron code, by analogy with non-coding DNA). Strategies to control bloat:

  • Maximum tree depth or size limits
  • Parsimony pressure in the fitness function
  • Lexicographic tournament: among equally fit individuals, prefer smaller ones
  • Simplification: periodically simplify expressions (e.g., $x + 0 \to x$)

3. Sparse Regression: SINDy

The Sparse Identification of Nonlinear Dynamics (SINDy) method (Brunton, Proctor & Kutz, 2016) takes a fundamentally different approach. Instead of searching over tree structures, it constructs a library of candidate functions and uses sparse regression to select the active terms.

The SINDy Framework

For a dynamical system $\dot{\mathbf{x}} = f(\mathbf{x})$:

  1. Construct the library $\boldsymbol{\Theta}(\mathbf{X})$: a matrix where each column is a candidate function evaluated on the data. For state variables $(x_1, x_2)$:
    $$\boldsymbol{\Theta}(\mathbf{X}) = [1 \;|\; x_1 \;|\; x_2 \;|\; x_1^2 \;|\; x_1 x_2 \;|\; x_2^2 \;|\; x_1^3 \;|\; \cdots \;|\; \sin(x_1) \;|\; \cdots]$$
  2. Compute derivatives $\dot{\mathbf{X}}$ from data (using finite differences, polynomial fitting, or total variation regularized differentiation).
  3. Solve the sparse regression problem:
    $$\boxed{\dot{\mathbf{X}} = \boldsymbol{\Theta}(\mathbf{X})\boldsymbol{\Xi}, \quad \text{minimize } \|\dot{\mathbf{X}} - \boldsymbol{\Theta}\boldsymbol{\Xi}\|_2^2 + \lambda\|\boldsymbol{\Xi}\|_1}$$

The L1 penalty promotes sparsity in $\boldsymbol{\Xi}$, selecting only the most important terms. Each nonzero entry identifies an active term in the governing equation.

Sequential Thresholded Least Squares (STLS)

SINDy uses a simple iterative algorithm instead of Lasso:

  1. Solve the least-squares problem: $\boldsymbol{\Xi} = (\boldsymbol{\Theta}^T\boldsymbol{\Theta})^{-1}\boldsymbol{\Theta}^T\dot{\mathbf{X}}$
  2. Set small coefficients to zero: $\Xi_{ij} \leftarrow 0$ if $|\Xi_{ij}| < \tau$
  3. Re-solve least squares using only the remaining (nonzero) terms
  4. Repeat until convergence

The threshold $\tau$ controls sparsity: larger $\tau$ gives simpler equations.
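The four steps above translate directly into NumPy; the function name, threshold value, and synthetic test problem below are illustrative:

```python
import numpy as np

def stls(Theta, dXdt, tau=0.1, n_iter=10):
    """Sequential thresholded least squares, a sketch of SINDy's solver.
    Theta: (n, p) library matrix; dXdt: (n, m) derivative data."""
    Xi, *_ = np.linalg.lstsq(Theta, dXdt, rcond=None)   # step 1: full LSQ
    for _ in range(n_iter):
        Xi[np.abs(Xi) < tau] = 0.0                      # step 2: threshold
        for k in range(dXdt.shape[1]):                  # step 3: refit survivors
            big = np.abs(Xi[:, k]) >= tau
            if big.any():
                Xi[big, k], *_ = np.linalg.lstsq(Theta[:, big], dXdt[:, k],
                                                 rcond=None)
    return Xi

# Synthetic test: dx/dt = 0.5 x - 2 x^3 plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))
Theta = np.hstack([np.ones_like(x), x, x**2, x**3])     # library [1, x, x^2, x^3]
dxdt = 0.5 * x - 2.0 * x**3 + 0.01 * rng.standard_normal(x.shape)
Xi = stls(Theta, dxdt, tau=0.2)
print(Xi.ravel())   # ≈ [0, 0.5, 0, -2]
```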

Example: Lotka-Volterra Discovery

Given time-series data from a predator-prey system, SINDy can recover:

$$\dot{x}_1 = \alpha x_1 - \beta x_1 x_2$$
$$\dot{x}_2 = -\gamma x_2 + \delta x_1 x_2$$

from a library containing $\{1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, \ldots\}$. The sparse solution correctly identifies only the four active terms.
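The full pipeline — simulate, differentiate, build the library, threshold — can be sketched end-to-end. The parameter values, initial condition, and fixed-step RK4 integrator below are illustrative choices:

```python
import numpy as np

def rk4(f, x0, dt, n):
    """Fixed-step RK4 integration of dx/dt = f(x)."""
    xs = [np.array(x0, dtype=float)]
    for _ in range(n):
        x = xs[-1]
        k1 = f(x)
        k2 = f(x + dt / 2 * k1)
        k3 = f(x + dt / 2 * k2)
        k4 = f(x + dt * k3)
        xs.append(x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(xs)

# Lotka-Volterra with illustrative parameters
alpha, beta, gamma, delta = 1.5, 1.0, 3.0, 1.0
rhs = lambda x: np.array([alpha * x[0] - beta * x[0] * x[1],
                          -gamma * x[1] + delta * x[0] * x[1]])
dt = 0.001
X = rk4(rhs, [4.0, 2.0], dt, 20000)

# Central-difference estimate of the derivatives
dX = (X[2:] - X[:-2]) / (2 * dt)
x1, x2 = X[1:-1, 0], X[1:-1, 1]

# Library of monomials up to degree 2
Theta = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2])
names = ["1", "x1", "x2", "x1^2", "x1*x2", "x2^2"]

# Sequential thresholded least squares
tau = 0.1
Xi, *_ = np.linalg.lstsq(Theta, dX, rcond=None)
for _ in range(10):
    Xi[np.abs(Xi) < tau] = 0.0
    for k in range(Xi.shape[1]):
        big = np.abs(Xi[:, k]) >= tau
        Xi[big, k], *_ = np.linalg.lstsq(Theta[:, big], dX[:, k], rcond=None)

for k, lhs in enumerate(["dx1/dt", "dx2/dt"]):
    terms = " ".join(f"{Xi[j, k]:+.3f}*{names[j]}"
                     for j in range(6) if Xi[j, k] != 0)
    print(lhs, "=", terms)
```

With clean data, the recovered coefficients match $\alpha, \beta, \gamma, \delta$ to within the finite-difference error, and the other twelve library coefficients are thresholded to zero.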

4. Dimensional Analysis Constraints

In physics, valid equations must be dimensionally consistent. This powerful constraint dramatically prunes the search space.

Buckingham Pi Theorem

If a physical relationship involves $n$ variables with $k$ independent dimensions (mass, length, time, ...), it can be expressed in terms of $n - k$ dimensionless groups ($\Pi$ groups):

$$\boxed{f(\Pi_1, \Pi_2, \ldots, \Pi_{n-k}) = 0}$$

Example: For drag force $F$ on a sphere with density $\rho$, velocity $v$, diameter $d$, and viscosity $\mu$ (5 variables, 3 dimensions):

$$\frac{F}{\rho v^2 d^2} = g\left(\frac{\rho v d}{\mu}\right) = g(\text{Re})$$

The symbolic regression problem reduces from finding $F(\rho, v, d, \mu)$ to finding the function $g(\text{Re})$ of a single dimensionless variable.

Dimensional Constraints in Expression Trees

Dimensional analysis constrains which operations are valid:

  • Addition/subtraction: Operands must have the same dimensions
  • Multiplication: Dimensions multiply ($[ab] = [a][b]$)
  • Division: Dimensions divide ($[a/b] = [a]/[b]$)
  • Transcendental functions ($\sin, \exp, \log$): Arguments must be dimensionless
  • Powers: Exponents must be dimensionless; the base's dimension is raised to the power

Enforcing these constraints during search can reduce the search space by orders of magnitude.
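These rules translate directly into a recursive dimension checker over expression trees, with each dimension stored as an exponent vector over (mass, length, time); the tuple node format mirrors the illustrative representation used earlier and is an assumption, not a standard API:

```python
# Dimensions as exponent vectors over (M, L, T): e.g. force = M L T^-2 = (1, 1, -2)
DIMLESS = (0, 0, 0)

def dmul(a, b): return tuple(x + y for x, y in zip(a, b))  # [ab] = [a][b]
def ddiv(a, b): return tuple(x - y for x, y in zip(a, b))  # [a/b] = [a]/[b]

def check(node, dims):
    """Return the dimension of an expression tree, raising on inconsistency.
    Nodes are (op, children...) tuples, variable names, or bare constants."""
    if isinstance(node, str):
        return dims[node]
    if isinstance(node, (int, float)):
        return DIMLESS
    op, *children = node
    d = [check(c, dims) for c in children]
    if op in ("+", "-"):                    # operands must match
        if d[0] != d[1]:
            raise ValueError(f"cannot add {d[0]} to {d[1]}")
        return d[0]
    if op == "*":
        return dmul(d[0], d[1])
    if op == "/":
        return ddiv(d[0], d[1])
    if op in ("sin", "cos", "exp", "log"):  # argument must be dimensionless
        if d[0] != DIMLESS:
            raise ValueError(f"{op} needs a dimensionless argument, got {d[0]}")
        return DIMLESS
    raise ValueError(f"unknown operation {op}")

# m*v*v + m*g*h: both terms must come out as energy, M L^2 T^-2
dims = {"m": (1, 0, 0), "v": (0, 1, -1), "g": (0, 1, -2), "h": (0, 1, 0)}
E = ("+", ("*", ("*", "m", "v"), "v"), ("*", ("*", "m", "g"), "h"))
print(check(E, dims))  # (1, 2, -2)
```

A search procedure can call such a checker on every candidate and discard inconsistent trees before ever fitting constants, which is where the pruning pays off.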

5. PySR and Modern Approaches

PySR (Cranmer, 2023)

PySR is a state-of-the-art open-source symbolic regression tool. Key innovations:

  • Multi-population evolution: Multiple island populations evolve independently with occasional migration, promoting diversity
  • Simulated annealing: Temperature-controlled acceptance of mutations helps escape local optima
  • BFGS constant optimization: After every structural mutation, numerical constants are optimized by gradient-based methods
  • Pareto front tracking: Returns the full complexity vs. accuracy tradeoff
  • Custom operators: Users can define domain-specific operations

Neural-Guided Symbolic Regression

Recent approaches use neural networks to guide the symbolic search:

  • AI Feynman (Udrescu & Tegmark, 2020): Uses neural networks to detect symmetries, separability, and compositionality in data, then decomposes the problem into simpler subproblems.
  • Deep Symbolic Regression: Train a neural network (e.g., RNN or Transformer) to generate expression trees token-by-token, using reinforcement learning with MSE reward.
  • Graph Neural Networks: Learn to propose likely equation structures from dataset features (e.g., number of variables, symmetries).

Information-Theoretic Model Selection

We can formalize model selection using information criteria. For an expression $f$ with $p$ free parameters fit to $n$ data points:

$$\text{AIC} = 2p - 2\log\hat{L}, \quad \text{BIC} = p\log n - 2\log\hat{L}$$

The Minimum Description Length (MDL) principle provides a deeper justification:

$$\text{MDL}(f, \text{data}) = \underbrace{L(\text{model})}_{\text{description of } f} + \underbrace{L(\text{data}|\text{model})}_{\text{residuals}}$$

The best model minimizes the total description length — the number of bits needed to describe both the model and the data residuals.
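A small sketch of computing AIC and BIC under a Gaussian noise model, where the maximized log-likelihood has the closed form $-\tfrac{n}{2}(\log(2\pi\hat{\sigma}^2) + 1)$ with $\hat{\sigma}^2$ the mean squared residual; the example data are synthetic:

```python
import numpy as np

def gaussian_ic(residuals, p):
    """AIC and BIC using the Gaussian maximum log-likelihood
    logL = -(n/2) * (log(2*pi*sigma2_hat) + 1)."""
    n = len(residuals)
    sigma2 = np.mean(np.square(residuals))
    logL = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * p - 2 * logL, p * np.log(n) - 2 * logL

# Compare a 2-parameter line against a 6-parameter quintic on linear data
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = 2.0 * x + 0.1 * rng.standard_normal(100)

ics = {}
for p in (2, 6):
    coef = np.polyfit(x, y, p - 1)
    ics[p] = gaussian_ic(y - np.polyval(coef, x), p)
    print(f"p={p}: AIC={ics[p][0]:.1f}, BIC={ics[p][1]:.1f}")
```

Note that BIC's penalty gap over AIC, $p(\log n - 2)$, grows with $p$, so BIC pushes harder toward simple equations on large datasets.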

6. Discovering Physical Laws

Kepler's Third Law

Given orbital period $T$ and semi-major axis $a$ for planets:

$T^2 = \frac{4\pi^2}{GM}a^3$

Symbolic regression on $(a_i, T_i)$ data recovers $T \propto a^{3/2}$.
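For a power law, the exponent falls out of a simple log-log fit; the sketch below uses approximate published orbital elements (in AU and years):

```python
import numpy as np

# Approximate semi-major axes (AU) and periods (years), Mercury .. Saturn
a = np.array([0.387, 0.723, 1.000, 1.524, 5.203, 9.537])
T = np.array([0.241, 0.615, 1.000, 1.881, 11.862, 29.457])

# If T = C * a^k, then log T = k log a + log C: the exponent is a line's slope
k, logC = np.polyfit(np.log(a), np.log(T), 1)
print(f"fitted exponent k = {k:.3f}")   # ≈ 1.5, i.e. T ∝ a^{3/2}
```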

Newton's Law of Gravity

Given force measurements between masses at varying distances:

$F = G\frac{m_1 m_2}{r^2}$

Symbolic regression recovers the inverse-square law from data.

Conservation Laws

From trajectory data of colliding particles, symbolic regression can discover:

$E = \frac{1}{2}mv^2 + mgh = \text{const}$

Energy conservation emerges as an invariant in the data.

Dark Matter Acceleration

Cranmer et al. (2020) applied symbolic regression to the internal functions learned by graph neural networks trained on N-body simulations, yielding new analytic approximations to dark matter dynamics.

7. Python Simulation: Symbolic Regression & SINDy

This simulation implements a basic genetic programming symbolic regressor and the SINDy algorithm for discovering dynamical systems.

Symbolic Regression & SINDy from Scratch

The accompanying script (`script.py`, 303 lines) runs under Python 3.

8. Challenges and Open Problems

Current Limitations

  • Scalability: GP search becomes intractable beyond roughly 5–10 input variables
  • Noise sensitivity: Noisy data makes it hard to distinguish signal from noise in equation structure
  • Non-uniqueness: Many algebraically different expressions compute the same function
  • Numerical constants: Finding the right constants is itself a hard optimization problem
  • PDEs: Extending to partial differential equations is far more challenging

Frontier Directions

  • LLM-guided search: Using large language models to propose equation structures based on domain knowledge
  • Equivariant symbolic regression: Building in symmetries (rotation, translation) as hard constraints
  • Causal discovery: Combining symbolic regression with causal inference to distinguish correlation from causation
  • Multi-scale equations: Discovering equations that operate across different length/time scales

Summary

  • Symbolic regression: Search for mathematical expressions that explain data, balancing accuracy and simplicity
  • Genetic programming: Evolve expression trees using crossover, mutation, and selection
  • SINDy: Sparse regression over a library of candidate functions; $\dot{\mathbf{X}} = \boldsymbol{\Theta}\boldsymbol{\Xi}$ with $\|\boldsymbol{\Xi}\|_1$ penalty
  • Dimensional analysis: Buckingham Pi theorem constrains valid expressions to dimensionally consistent forms
  • PySR: State-of-the-art tool using multi-population evolution with constant optimization
  • Discovery: Symbolic regression has recovered Kepler's laws, conservation laws, and novel physics from data