ML for Science/Part III: Scientific Applications

Symbolic Regression

Discovering equations from data — from genetic programming to SINDy and the search for physical laws

Introduction

Symbolic regression is the task of finding a mathematical expression that best describes a dataset. Unlike standard regression, which fits parameters of a fixed functional form, symbolic regression searches over the space of possible equations themselves. This makes it uniquely powerful for scientific discovery: it can rediscover known physical laws and potentially find new ones.

While neural networks are powerful function approximators, they produce opaque black boxes. Symbolic regression produces interpretable, generalizable equations that scientists can analyze, verify against dimensional analysis, and use for extrapolation.

Key Topics

  • 1. The Symbolic Regression Problem
  • 2. Genetic Programming
  • 3. Sparse Regression (SINDy)
  • 4. Dimensional Analysis Constraints
  • 5. PySR and Modern Approaches
  • 6. Discovering Physical Laws

1. The Symbolic Regression Problem

Problem Statement

Given data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, find a symbolic expression $f^*$ that minimizes:

$$\boxed{f^* = \arg\min_{f \in \mathcal{F}} \left[\sum_{i=1}^{n}(y_i - f(\mathbf{x}_i))^2 + \lambda \cdot C(f)\right]}$$

where $\mathcal{F}$ is the space of symbolic expressions built from a set of operations ($+, -, \times, \div, \sin, \cos, \exp, \log, \ldots$), and $C(f)$ is a complexity penalty (e.g., expression length) that promotes parsimony.

This is an NP-hard combinatorial optimization problem. The search space is enormous: the number of possible expressions of length $L$ with $k$ operations grows as $O(k^L)$.

Expression Trees

Symbolic expressions are represented as expression trees:

  • Leaves: Variables ($x_1, x_2, \ldots$) and constants ($c_1, c_2, \ldots$)
  • Unary internal nodes: Unary operations ($\sin, \cos, \exp, \log, \text{abs}, \sqrt{\cdot}$)
  • Binary internal nodes: Binary operations ($+, -, \times, \div, \text{pow}$)

For example, $f(x) = \sin(2x) + x^2$ is a tree with "+" at the root, "sin(2*x)" and "pow(x, 2)" as subtrees. The complexity $C(f)$ is often the number of nodes.
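To make the representation concrete, here is a minimal sketch of tree evaluation and node counting; the tuple-based node format, the `OPS` table, and the helper names are illustrative choices, not any particular library's API:

```python
import math

# Illustrative node format: a node is a variable name (str), a numeric
# constant, or an (operator, child, ...) tuple.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "sin": lambda a: math.sin(a),
    "pow": lambda a, b: a ** b,
}

def evaluate(node, env):
    """Recursively evaluate an expression tree given variable bindings."""
    if isinstance(node, str):           # leaf: variable
        return env[node]
    if isinstance(node, (int, float)):  # leaf: constant
        return node
    op, *children = node
    return OPS[op](*(evaluate(c, env) for c in children))

def size(node):
    """Complexity C(f) = number of nodes in the tree."""
    if not isinstance(node, tuple):
        return 1
    return 1 + sum(size(c) for c in node[1:])

# f(x) = sin(2x) + x^2
f = ("+", ("sin", ("*", 2, "x")), ("pow", "x", 2))
print(evaluate(f, {"x": 1.0}))  # sin(2) + 1 ≈ 1.9093
print(size(f))                  # 8 nodes
```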

Pareto Optimality

Instead of fixing $\lambda$, we seek the Pareto front of accuracy vs. complexity. An expression is Pareto-optimal if no other expression is at least as good on both criteria and strictly better on at least one:

$$\text{Pareto front} = \{f \in \mathcal{F} : \nexists\, g \text{ with } \text{MSE}(g) \leq \text{MSE}(f) \text{ and } C(g) \leq C(f), \text{ at least one inequality strict}\}$$

The scientist then examines the Pareto front and selects the equation at the "elbow": the point beyond which additional complexity buys only marginal accuracy.
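A sketch of extracting the Pareto front from scored candidates, assuming a hypothetical `(mse, complexity, expr)` tuple format:

```python
def pareto_front(candidates):
    """Keep candidates not dominated by any other: a candidate is dominated
    if another is at least as accurate and at least as simple, with at
    least one of the two comparisons strict."""
    front = []
    for mse, c, expr in candidates:
        dominated = any(
            m2 <= mse and c2 <= c and (m2 < mse or c2 < c)
            for m2, c2, _ in candidates
        )
        if not dominated:
            front.append((mse, c, expr))
    return sorted(front, key=lambda t: t[1])  # order by complexity

# Hypothetical (MSE, complexity, expression) scores
cands = [(0.50, 1, "x"), (0.10, 3, "x^2"), (0.12, 5, "x^2+c"),
         (0.01, 7, "sin(2x)+x^2"), (0.01, 9, "bloated")]
print(pareto_front(cands))  # drops the dominated "x^2+c" and "bloated"
```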

2. Genetic Programming

Genetic programming (GP) is the classic approach to symbolic regression. It evolves a population of expression trees using bio-inspired operators.

The Evolutionary Algorithm

  1. Initialization: Generate a population of $P$ random expression trees
  2. Fitness evaluation: Compute $\text{fitness}(f) = -\text{MSE}(f) - \lambda C(f)$ for each individual
  3. Selection: Choose parents using tournament selection (pick $k$ random individuals, select the fittest)
  4. Crossover: Swap random subtrees between two parent trees to create offspring
  5. Mutation: Randomly modify nodes (change operation, replace subtree, add/remove nodes)
  6. Replacement: Form the next generation from offspring and possibly elite individuals
  7. Repeat steps 2-6 for $G$ generations

Genetic Operators

  • Crossover: Select random nodes in two parent trees and swap the subtrees rooted at those nodes. This combines useful subexpressions from different individuals.
  • Point mutation: Replace a random node with another of the same arity (e.g., change $+$ to $\times$, or $\sin$ to $\cos$).
  • Subtree mutation: Replace a random subtree with a newly generated random subtree. Introduces fresh genetic material.
  • Constant optimization: After tree structure is fixed, optimize numerical constants using gradient descent or BFGS.
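The crossover and point-mutation operators can be sketched on trees stored as nested Python lists (mutable, so subtrees can be swapped in place); the node format and the `SAME_ARITY` table are illustrative assumptions:

```python
import random

# Illustrative node format: a mutable list [op, child, ...], a variable
# name string, or a numeric constant.
SAME_ARITY = {"+": ["-", "*"], "-": ["+", "*"], "*": ["+", "-"],
              "sin": ["cos"], "cos": ["sin"]}

def slots(tree, parent=None, idx=None, out=None):
    """Collect (parent, index) positions of every non-root node, so the
    subtree stored there can be replaced in place."""
    if out is None:
        out = []
    if parent is not None:
        out.append((parent, idx))
    if isinstance(tree, list):
        for i in range(1, len(tree)):
            slots(tree[i], tree, i, out)
    return out

def subtrees(tree, out=None):
    """List every operator node in the tree."""
    if out is None:
        out = []
    if isinstance(tree, list):
        out.append(tree)
        for child in tree[1:]:
            subtrees(child, out)
    return out

def crossover(a, b, rng):
    """Swap a random subtree of `a` with a random subtree of `b`."""
    (pa, ia), (pb, ib) = rng.choice(slots(a)), rng.choice(slots(b))
    pa[ia], pb[ib] = pb[ib], pa[ia]

def point_mutation(tree, rng):
    """Replace a random operator with another of the same arity."""
    ops = [t for t in subtrees(tree) if t[0] in SAME_ARITY]
    if ops:
        node = rng.choice(ops)
        node[0] = rng.choice(SAME_ARITY[node[0]])

rng = random.Random(0)
a = ["+", ["sin", "x"], ["*", "x", "x"]]  # sin(x) + x*x
b = ["-", ["cos", "y"], 3.0]              # cos(y) - 3
crossover(a, b, rng)
point_mutation(a, rng)
print(a)
print(b)
```

Both operators preserve the total number of nodes across the population, which is why separate bloat control (next subsection) is still needed: growth comes from repeatedly swapping small subtrees for large ones.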

Bloat Control

GP suffers from bloat: expressions grow increasingly large without improving fitness (the inert extra material is called intron code, by analogy with non-coding DNA). Strategies to control bloat:

  • Maximum tree depth or size limits
  • Parsimony pressure in the fitness function
  • Lexicographic tournament: among equally fit individuals, prefer smaller ones
  • Simplification: periodically simplify expressions (e.g., $x + 0 \to x$)

3. Sparse Regression: SINDy

The Sparse Identification of Nonlinear Dynamics (SINDy) method (Brunton, Proctor & Kutz, 2016) takes a fundamentally different approach. Instead of searching over tree structures, it constructs a library of candidate functions and uses sparse regression to select the active terms.

The SINDy Framework

For a dynamical system $\dot{\mathbf{x}} = f(\mathbf{x})$:

  1. Construct the library $\boldsymbol{\Theta}(\mathbf{X})$: a matrix where each column is a candidate function evaluated on the data. For state variables $(x_1, x_2)$:
    $$\boldsymbol{\Theta}(\mathbf{X}) = [1 \;|\; x_1 \;|\; x_2 \;|\; x_1^2 \;|\; x_1 x_2 \;|\; x_2^2 \;|\; x_1^3 \;|\; \cdots \;|\; \sin(x_1) \;|\; \cdots]$$
  2. Compute derivatives $\dot{\mathbf{X}}$ from data (using finite differences, polynomial fitting, or total variation regularized differentiation).
  3. Solve the sparse regression problem:
    $$\boxed{\dot{\mathbf{X}} = \boldsymbol{\Theta}(\mathbf{X})\boldsymbol{\Xi}, \quad \text{minimize } \|\dot{\mathbf{X}} - \boldsymbol{\Theta}\boldsymbol{\Xi}\|_2^2 + \lambda\|\boldsymbol{\Xi}\|_1}$$

The L1 penalty promotes sparsity in $\boldsymbol{\Xi}$, selecting only the most important terms. Each nonzero entry identifies an active term in the governing equation.

Sequential Thresholded Least Squares (STLS)

SINDy uses a simple iterative algorithm instead of Lasso:

  1. Solve the least-squares problem: $\boldsymbol{\Xi} = (\boldsymbol{\Theta}^T\boldsymbol{\Theta})^{-1}\boldsymbol{\Theta}^T\dot{\mathbf{X}}$
  2. Set small coefficients to zero: $\Xi_{ij} \leftarrow 0$ if $|\Xi_{ij}| < \tau$
  3. Re-solve least squares using only the remaining (nonzero) terms
  4. Repeat until convergence

The threshold $\tau$ controls sparsity: larger $\tau$ gives simpler equations.
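The four steps above translate directly into NumPy; the function name, threshold value, and synthetic test problem below are illustrative:

```python
import numpy as np

def stls(Theta, dXdt, tau=0.1, n_iter=10):
    """Sequential thresholded least squares, a sketch of SINDy's solver.
    Theta: (n, p) library matrix; dXdt: (n, m) derivative data."""
    Xi, *_ = np.linalg.lstsq(Theta, dXdt, rcond=None)   # step 1: full LSQ
    for _ in range(n_iter):
        Xi[np.abs(Xi) < tau] = 0.0                      # step 2: threshold
        for k in range(dXdt.shape[1]):                  # step 3: refit survivors
            big = np.abs(Xi[:, k]) >= tau
            if big.any():
                Xi[big, k], *_ = np.linalg.lstsq(Theta[:, big], dXdt[:, k],
                                                 rcond=None)
    return Xi

# Synthetic test: dx/dt = 0.5 x - 2 x^3 plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))
Theta = np.hstack([np.ones_like(x), x, x**2, x**3])     # library [1, x, x^2, x^3]
dxdt = 0.5 * x - 2.0 * x**3 + 0.01 * rng.standard_normal(x.shape)
Xi = stls(Theta, dxdt, tau=0.2)
print(Xi.ravel())   # ≈ [0, 0.5, 0, -2]
```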

Example: Lotka-Volterra Discovery

Given time-series data from a predator-prey system, SINDy can recover:

$$\dot{x}_1 = \alpha x_1 - \beta x_1 x_2$$
$$\dot{x}_2 = -\gamma x_2 + \delta x_1 x_2$$

from a library containing $\{1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, \ldots\}$. The sparse solution correctly identifies only the four active terms.
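The full pipeline — simulate, differentiate, build the library, threshold — can be sketched end-to-end. The parameter values, initial condition, and fixed-step RK4 integrator below are illustrative choices:

```python
import numpy as np

def rk4(f, x0, dt, n):
    """Fixed-step RK4 integration of dx/dt = f(x)."""
    xs = [np.array(x0, dtype=float)]
    for _ in range(n):
        x = xs[-1]
        k1 = f(x)
        k2 = f(x + dt / 2 * k1)
        k3 = f(x + dt / 2 * k2)
        k4 = f(x + dt * k3)
        xs.append(x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(xs)

# Lotka-Volterra with illustrative parameters
alpha, beta, gamma, delta = 1.5, 1.0, 3.0, 1.0
rhs = lambda x: np.array([alpha * x[0] - beta * x[0] * x[1],
                          -gamma * x[1] + delta * x[0] * x[1]])
dt = 0.001
X = rk4(rhs, [4.0, 2.0], dt, 20000)

# Central-difference estimate of the derivatives
dX = (X[2:] - X[:-2]) / (2 * dt)
x1, x2 = X[1:-1, 0], X[1:-1, 1]

# Library of monomials up to degree 2
Theta = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2])
names = ["1", "x1", "x2", "x1^2", "x1*x2", "x2^2"]

# Sequential thresholded least squares
tau = 0.1
Xi, *_ = np.linalg.lstsq(Theta, dX, rcond=None)
for _ in range(10):
    Xi[np.abs(Xi) < tau] = 0.0
    for k in range(Xi.shape[1]):
        big = np.abs(Xi[:, k]) >= tau
        Xi[big, k], *_ = np.linalg.lstsq(Theta[:, big], dX[:, k], rcond=None)

for k, lhs in enumerate(["dx1/dt", "dx2/dt"]):
    terms = " ".join(f"{Xi[j, k]:+.3f}*{names[j]}"
                     for j in range(6) if Xi[j, k] != 0)
    print(lhs, "=", terms)
```

With clean data, the recovered coefficients match $\alpha, \beta, \gamma, \delta$ to within the finite-difference error, and the other twelve library coefficients are thresholded to zero.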

4. Dimensional Analysis Constraints

In physics, valid equations must be dimensionally consistent. This powerful constraint dramatically prunes the search space.

Buckingham Pi Theorem

If a physical relationship involves $n$ variables with $k$ independent dimensions (mass, length, time, ...), it can be expressed in terms of $n - k$ dimensionless groups ($\Pi$ groups):

$$\boxed{f(\Pi_1, \Pi_2, \ldots, \Pi_{n-k}) = 0}$$

Example: For drag force $F$ on a sphere with density $\rho$, velocity $v$, diameter $d$, and viscosity $\mu$ (5 variables, 3 dimensions):

$$\frac{F}{\rho v^2 d^2} = g\left(\frac{\rho v d}{\mu}\right) = g(\text{Re})$$

The symbolic regression problem reduces from finding $F(\rho, v, d, \mu)$ to finding the function $g(\text{Re})$ of a single dimensionless variable.

Dimensional Constraints in Expression Trees

Dimensional analysis constrains which operations are valid:

  • Addition/subtraction: Operands must have the same dimensions
  • Multiplication: Dimensions multiply ($[ab] = [a][b]$)
  • Division: Dimensions divide ($[a/b] = [a]/[b]$)
  • Transcendental functions ($\sin, \exp, \log$): Arguments must be dimensionless
  • Powers: Exponents must be dimensionless; the base's dimension is raised to the power

Enforcing these constraints during search can reduce the search space by orders of magnitude.
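These rules translate directly into a recursive dimension checker over expression trees, with each dimension stored as an exponent vector over (mass, length, time); the tuple node format mirrors the illustrative representation used earlier and is an assumption, not a standard API:

```python
# Dimensions as exponent vectors over (M, L, T): e.g. force = M L T^-2 = (1, 1, -2)
DIMLESS = (0, 0, 0)

def dmul(a, b): return tuple(x + y for x, y in zip(a, b))  # [ab] = [a][b]
def ddiv(a, b): return tuple(x - y for x, y in zip(a, b))  # [a/b] = [a]/[b]

def check(node, dims):
    """Return the dimension of an expression tree, raising on inconsistency.
    Nodes are (op, children...) tuples, variable names, or bare constants."""
    if isinstance(node, str):
        return dims[node]
    if isinstance(node, (int, float)):
        return DIMLESS
    op, *children = node
    d = [check(c, dims) for c in children]
    if op in ("+", "-"):                    # operands must match
        if d[0] != d[1]:
            raise ValueError(f"cannot add {d[0]} to {d[1]}")
        return d[0]
    if op == "*":
        return dmul(d[0], d[1])
    if op == "/":
        return ddiv(d[0], d[1])
    if op in ("sin", "cos", "exp", "log"):  # argument must be dimensionless
        if d[0] != DIMLESS:
            raise ValueError(f"{op} needs a dimensionless argument, got {d[0]}")
        return DIMLESS
    raise ValueError(f"unknown operation {op}")

# m*v*v + m*g*h: both terms must come out as energy, M L^2 T^-2
dims = {"m": (1, 0, 0), "v": (0, 1, -1), "g": (0, 1, -2), "h": (0, 1, 0)}
E = ("+", ("*", ("*", "m", "v"), "v"), ("*", ("*", "m", "g"), "h"))
print(check(E, dims))  # (1, 2, -2)
```

A search procedure can call such a checker on every candidate and discard inconsistent trees before ever fitting constants, which is where the pruning pays off.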

5. PySR and Modern Approaches

PySR (Cranmer, 2023)

PySR is a state-of-the-art open-source symbolic regression tool. Key innovations:

  • Multi-population evolution: Multiple island populations evolve independently with occasional migration, promoting diversity
  • Simulated annealing: Temperature-controlled acceptance of mutations helps escape local optima
  • BFGS constant optimization: After every structural mutation, numerical constants are optimized by gradient-based methods
  • Pareto front tracking: Returns the full complexity vs. accuracy tradeoff
  • Custom operators: Users can define domain-specific operations

Neural-Guided Symbolic Regression

Recent approaches use neural networks to guide the symbolic search:

  • AI Feynman (Udrescu & Tegmark, 2020): Uses neural networks to detect symmetries, separability, and compositionality in data, then decomposes the problem into simpler subproblems.
  • Deep Symbolic Regression: Train a neural network (e.g., RNN or Transformer) to generate expression trees token-by-token, using reinforcement learning with MSE reward.
  • Graph Neural Networks: Learn to propose likely equation structures from dataset features (e.g., number of variables, symmetries).

Information-Theoretic Model Selection

We can formalize model selection using information criteria. For an expression $f$ with $p$ free parameters fit to $n$ data points:

$$\text{AIC} = 2p - 2\log\hat{L}, \quad \text{BIC} = p\log n - 2\log\hat{L}$$

The Minimum Description Length (MDL) principle provides a deeper justification:

$$\text{MDL}(f, \text{data}) = \underbrace{L(\text{model})}_{\text{description of } f} + \underbrace{L(\text{data}|\text{model})}_{\text{residuals}}$$

The best model minimizes the total description length — the number of bits needed to describe both the model and the data residuals.
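A small sketch of computing AIC and BIC under a Gaussian noise model, where the maximized log-likelihood has the closed form $-\tfrac{n}{2}(\log(2\pi\hat{\sigma}^2) + 1)$ with $\hat{\sigma}^2$ the mean squared residual; the example data are synthetic:

```python
import numpy as np

def gaussian_ic(residuals, p):
    """AIC and BIC using the Gaussian maximum log-likelihood
    logL = -(n/2) * (log(2*pi*sigma2_hat) + 1)."""
    n = len(residuals)
    sigma2 = np.mean(np.square(residuals))
    logL = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * p - 2 * logL, p * np.log(n) - 2 * logL

# Compare a 2-parameter line against a 6-parameter quintic on linear data
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = 2.0 * x + 0.1 * rng.standard_normal(100)

ics = {}
for p in (2, 6):
    coef = np.polyfit(x, y, p - 1)
    ics[p] = gaussian_ic(y - np.polyval(coef, x), p)
    print(f"p={p}: AIC={ics[p][0]:.1f}, BIC={ics[p][1]:.1f}")
```

Note that BIC's penalty gap over AIC, $p(\log n - 2)$, grows with $p$, so BIC pushes harder toward simple equations on large datasets.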

6. Discovering Physical Laws

Kepler's Third Law

Given orbital period $T$ and semi-major axis $a$ for planets:

$T^2 = \frac{4\pi^2}{GM}a^3$

Symbolic regression on $(a_i, T_i)$ data recovers $T \propto a^{3/2}$.
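For a power law, the exponent falls out of a simple log-log fit; the sketch below uses approximate published orbital elements (in AU and years):

```python
import numpy as np

# Approximate semi-major axes (AU) and periods (years), Mercury .. Saturn
a = np.array([0.387, 0.723, 1.000, 1.524, 5.203, 9.537])
T = np.array([0.241, 0.615, 1.000, 1.881, 11.862, 29.457])

# If T = C * a^k, then log T = k log a + log C: the exponent is a line's slope
k, logC = np.polyfit(np.log(a), np.log(T), 1)
print(f"fitted exponent k = {k:.3f}")   # ≈ 1.5, i.e. T ∝ a^{3/2}
```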

Newton's Law of Gravity

Given force measurements between masses at varying distances:

$F = G\frac{m_1 m_2}{r^2}$

Symbolic regression recovers the inverse-square law from data.

Conservation Laws

From trajectory data of colliding particles, symbolic regression can discover:

$E = \frac{1}{2}mv^2 + mgh = \text{const}$

Energy conservation emerges as an invariant in the data.

Dark Matter Acceleration

Cranmer et al. (2020) applied symbolic regression to the internal functions learned by graph neural networks trained on N-body simulations, yielding new analytic approximations to dark matter dynamics.

7. Python Simulation: Symbolic Regression & SINDy

This simulation implements a basic genetic programming symbolic regressor and the SINDy algorithm for discovering dynamical systems.

Symbolic Regression & SINDy from Scratch

The accompanying script (`script.py`, 303 lines) runs under Python 3.

8. Challenges and Open Problems

Current Limitations

  • Scalability: GP search becomes intractable beyond roughly 5–10 input variables
  • Noise sensitivity: Noisy data makes it hard to distinguish signal from noise in equation structure
  • Non-uniqueness: Many algebraically different expressions compute the same function
  • Numerical constants: Finding the right constants is itself a hard optimization problem
  • PDEs: Extending to partial differential equations is far more challenging

Frontier Directions

  • LLM-guided search: Using large language models to propose equation structures based on domain knowledge
  • Equivariant symbolic regression: Building in symmetries (rotation, translation) as hard constraints
  • Causal discovery: Combining symbolic regression with causal inference to distinguish correlation from causation
  • Multi-scale equations: Discovering equations that operate across different length/time scales

Summary

  • Symbolic regression: Search for mathematical expressions that explain data, balancing accuracy and simplicity
  • Genetic programming: Evolve expression trees using crossover, mutation, and selection
  • SINDy: Sparse regression over a library of candidate functions; $\dot{\mathbf{X}} = \boldsymbol{\Theta}\boldsymbol{\Xi}$ with $\|\boldsymbol{\Xi}\|_1$ penalty
  • Dimensional analysis: Buckingham Pi theorem constrains valid expressions to dimensionally consistent forms
  • PySR: State-of-the-art tool using multi-population evolution with constant optimization
  • Discovery: Symbolic regression has recovered Kepler's laws, conservation laws, and novel physics from data