ML for Science/Part III: Scientific Applications

Genetic Algorithms & Evolutionary Computation

Optimization inspired by natural selection — evolving populations of candidate solutions through selection, crossover, and mutation

Introduction

Genetic algorithms (GAs) are a class of metaheuristic optimization methods inspired by Darwinian evolution. The core idea is elegantly simple: maintain a population of candidate solutions (chromosomes), evaluate their quality (fitness), and apply biologically-inspired operators — selection, crossover (recombination), and mutation — to produce new generations of progressively better solutions.

Unlike gradient-based methods that require differentiable objectives, GAs work with any fitness function — discrete, discontinuous, stochastic, or multi-modal. They are particularly powerful when the search landscape is rugged, high-dimensional, or poorly understood, making them invaluable in scientific applications where analytical solutions are intractable.

The theoretical foundation rests on Holland's Schema Theorem and the building blocks hypothesis: GAs succeed by implicitly identifying and combining short, high-fitness patterns (building blocks) into increasingly fit complete solutions. This implicit parallelism allows GAs to process an exponential number of schemata with only a linear population.

Key Topics

  • 1. Fitness-Proportionate Selection & Tournament Selection
  • 2. Schema Theorem & Building Blocks Hypothesis
  • 3. Crossover Operators & Schema Disruption
  • 4. Mutation & Exploration-Exploitation Balance
  • 5. Convergence Theory & No Free Lunch Theorem
  • 6. Multi-Objective Optimization & Pareto Fronts
  • 7. Advanced Variants: GP, DE, CMA-ES, NEAT
  • 8. Applications in Science

The Genetic Algorithm Framework

Every generation of a GA follows this cycle:

$$\text{Population} \xrightarrow{\text{Evaluate}} \text{Fitness} \xrightarrow{\text{Select}} \text{Parents} \xrightarrow{\text{Crossover + Mutate}} \text{Offspring} \xrightarrow{\text{Replace}} \text{New Population}$$

The algorithm terminates when a stopping criterion is met: maximum generations, fitness threshold, or convergence detection. The best individual found across all generations is returned as the solution.

1. Fitness-Proportionate Selection (Roulette Wheel)

The simplest selection mechanism assigns each individual a probability of being selected proportional to its fitness. Given a population of $N$ individuals with fitness values $f_1, f_2, \ldots, f_N$ (all non-negative), the selection probability for individual $i$ is:

$$\boxed{P_i = \frac{f_i}{\sum_{j=1}^{N} f_j} = \frac{f_i}{N \cdot \bar{f}}}$$

where $\bar{f} = \frac{1}{N}\sum_{j=1}^N f_j$ is the average fitness. The expected number of copies of individual $i$ in the next generation is:

$$E[n_i] = N \cdot P_i = N \cdot \frac{f_i}{N \cdot \bar{f}} = \frac{f_i}{\bar{f}}$$

An individual with twice the average fitness is expected to receive two copies. This creates selection pressure toward fitter individuals.
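
The expected-copies behavior is easy to check empirically. Below is a minimal roulette-wheel sketch in numpy; the function name `roulette_select` and the fitness values are illustrative, not taken from the chapter's script:

```python
import numpy as np

def roulette_select(fitness, rng, n_parents):
    """Fitness-proportionate selection: P_i = f_i / sum_j f_j.

    Assumes all fitness values are non-negative.
    """
    fitness = np.asarray(fitness, dtype=float)
    probs = fitness / fitness.sum()
    return rng.choice(len(fitness), size=n_parents, p=probs)

rng = np.random.default_rng(0)
fitness = np.array([1.0, 2.0, 4.0, 1.0])  # mean fitness = 2.0
parents = roulette_select(fitness, rng, n_parents=10_000)
# Individual 2 has twice the average fitness (P = 4/8), so it should
# appear in roughly half of all draws.
print(np.mean(parents == 2))
```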

Selection Intensity & Takeover Time

The selection intensity measures how strongly selection shifts the population mean. Under proportionate selection, the best individual with fitness $f_{\max}$ receives $f_{\max}/\bar{f}$ expected copies per generation. After $t$ generations of selection only (no crossover or mutation), the number of copies of the best individual grows as:

$$n_{\text{best}}(t) \approx \left(\frac{f_{\max}}{\bar{f}}\right)^t$$

The takeover time $t^*$ is when the best individual fills the entire population ($n_{\text{best}} = N$):

$$N = \left(\frac{f_{\max}}{\bar{f}}\right)^{t^*} \implies \boxed{t^* = \frac{\ln N}{\ln(f_{\max}/\bar{f})}}$$

Problem: Premature Convergence

Proportionate selection suffers from premature convergence: early in the search, a single super-individual can dominate, reducing diversity before the search space is adequately explored. Several alternatives address this.

Tournament Selection

Select $k$ individuals uniformly at random (without replacement), then choose the best. For tournament size $k$, the probability that the $i$-th ranked individual (rank 1 = best) wins is:

$$P(\text{rank } i \text{ wins}) = \frac{\binom{N-i}{k-1}}{\binom{N}{k}} \approx \frac{(N-i+1)^k - (N-i)^k}{N^k}$$

where the approximation corresponds to sampling the tournament with replacement.

Tournament selection is preferred in practice because: (a) selection pressure is controlled by $k$, independent of fitness scaling; (b) it requires only ordinal comparisons; (c) it parallelizes trivially.
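
A minimal tournament-selection sketch (the function name and fitness values are illustrative). Note that only `argmax` over the contestants is needed, so any monotone rescaling of fitness leaves the result unchanged:

```python
import numpy as np

def tournament_select(fitness, rng, k=3, n_parents=1):
    """k-way tournament: draw k candidates uniformly, keep the fittest."""
    fitness = np.asarray(fitness, dtype=float)
    winners = []
    for _ in range(n_parents):
        # draw k distinct contestants; only ordinal comparison is used
        contestants = rng.choice(len(fitness), size=k, replace=False)
        winners.append(contestants[np.argmax(fitness[contestants])])
    return np.array(winners)

rng = np.random.default_rng(1)
fitness = np.array([0.1, 0.5, 0.2, 0.9, 0.3])
parents = tournament_select(fitness, rng, k=2, n_parents=5000)
# The best individual (index 3) wins every tournament it enters;
# with k=2 it is drawn into a fraction 2/5 of tournaments.
print(np.mean(parents == 3))
```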

Rank Selection

Assign selection probabilities based on rank rather than raw fitness. For linear ranking with selective pressure $s \in [1.0, 2.0]$:

$$P_i = \frac{1}{N}\left(2 - s + 2(s-1)\frac{\text{rank}_i - 1}{N - 1}\right)$$

Boltzmann Selection

Inspired by simulated annealing, use a temperature $T$ to control selection pressure: $P_i \propto e^{f_i/T}$. High $T$ gives near-uniform selection (exploration); low $T$ sharpens selection toward the best (exploitation). Annealing $T$ over generations provides adaptive control.
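
The temperature effect can be seen directly by computing the probabilities at two extremes (a small sketch; the helper name `boltzmann_probs` is illustrative):

```python
import numpy as np

def boltzmann_probs(fitness, T):
    """Selection probabilities P_i proportional to exp(f_i / T)."""
    z = np.asarray(fitness, dtype=float) / T
    z -= z.max()            # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

fitness = [1.0, 2.0, 3.0]
hot = boltzmann_probs(fitness, T=100.0)   # near-uniform: exploration
cold = boltzmann_probs(fitness, T=0.1)    # sharply peaked: exploitation
print(hot.round(3), cold.round(3))
```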

2. Schema Theorem (Holland 1975)

The Schema Theorem provides the theoretical foundation for understanding why genetic algorithms work. A schema (plural: schemata) $H$ is a template over the alphabet $\lbrace 0, 1, *\rbrace$, where $*$ is a wildcard that matches either 0 or 1.

Schema Properties

For a schema $H$ over chromosomes of length $L$:

  • Order $o(H)$: The number of fixed (non-$*$) positions. Example: $o(1**0*1) = 3$
  • Defining length $\delta(H)$: The distance between the outermost fixed positions. Example: $\delta(1**0*1) = 5$ (positions 0 to 5)
  • Fitness $f(H)$: The average fitness of all individuals matching schema $H$
  • Count $m(H, t)$: The number of individuals matching $H$ at generation $t$
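
These definitions translate directly into code. A small sketch (helper names are illustrative), reproducing the examples above:

```python
def schema_order(H):
    """o(H): number of fixed (non-*) positions."""
    return sum(c != '*' for c in H)

def schema_defining_length(H):
    """delta(H): distance between the outermost fixed positions."""
    fixed = [i for i, c in enumerate(H) if c != '*']
    return fixed[-1] - fixed[0] if fixed else 0

def matches(individual, H):
    """True if a bit string is an instance of schema H."""
    return all(h in ('*', b) for b, h in zip(individual, H))

print(schema_order('1**0*1'))            # 3
print(schema_defining_length('1**0*1'))  # 5
print(matches('110011', '1**0*1'))       # True
```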

Deriving the Growth Equation

Step 1: Selection. Under proportionate selection, the expected number of representatives of $H$ after selection is:

$$E[m(H, t+1)]_{\text{sel}} = m(H, t) \cdot \frac{f(H)}{\bar{f}}$$

Step 2: Crossover survival. A schema survives single-point crossover if the crossover point does not fall within the defining length. The crossover point is chosen uniformly from $L-1$ positions, so the disruption probability is:

$$P_d(H) = \frac{\delta(H)}{L - 1}$$

With crossover probability $p_c$, the survival probability is at least $1 - p_c \cdot \delta(H)/(L-1)$ (a lower bound, since crossover with a matching partner preserves the schema).

Step 3: Mutation survival. Each fixed position must survive mutation (probability $1 - p_m$ per bit). With $o(H)$ fixed positions:

$$P_{\text{mut survive}} = (1 - p_m)^{o(H)}$$

Combining all three effects yields the Schema Theorem:

$$\boxed{E[m(H, t+1)] \geq m(H, t) \cdot \frac{f(H)}{\bar{f}} \cdot \left(1 - p_c \cdot \frac{\delta(H)}{L-1}\right) \cdot (1 - p_m)^{o(H)}}$$
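
The bound is easy to evaluate numerically. A sketch with hypothetical parameter values (the function name and all numbers are illustrative), contrasting a short, low-order schema with a long, high-order one of the same fitness:

```python
def schema_growth_lower_bound(m, f_H, f_bar, delta, order, L, p_c, p_m):
    """Lower bound on E[m(H, t+1)] from the Schema Theorem."""
    selection = f_H / f_bar                       # f(H) / f-bar
    crossover_survival = 1 - p_c * delta / (L - 1)
    mutation_survival = (1 - p_m) ** order
    return m * selection * crossover_survival * mutation_survival

# Short, low-order, above-average schema: expected count grows.
grow = schema_growth_lower_bound(m=10, f_H=1.5, f_bar=1.0,
                                 delta=3, order=2, L=100, p_c=0.7, p_m=0.01)
# Long, high-order schema of the same fitness: the bound shrinks.
shrink = schema_growth_lower_bound(m=10, f_H=1.5, f_bar=1.0,
                                   delta=90, order=40, L=100, p_c=0.7, p_m=0.01)
print(round(grow, 2), round(shrink, 2))
```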

Implications: Building Blocks

The Schema Theorem shows that schemata which are simultaneously:

  • Short (small $\delta(H)$): survive crossover with high probability
  • Low-order (small $o(H)$): survive mutation with high probability
  • Above-average ($f(H) > \bar{f}$): receive more copies through selection

grow exponentially in the population. These are called building blocks. The building blocks hypothesis states that GAs work by identifying, preserving, and combining these short, fit schemata into complete high-fitness solutions.

Implicit parallelism: every binary string of length $L$ is an instance of $2^L$ schemata, so a population of $N$ strings samples up to $N \cdot 2^L$ distinct schemata while requiring only $N$ fitness evaluations per generation. Holland estimated that $O(N^3)$ schemata are usefully processed — a massive parallel search at linear computational cost.

3. Crossover Operators & Schema Disruption

Crossover is the primary exploration operator in GAs, combining genetic material from two parents to produce offspring. The key design question is: how much does a crossover operator disrupt useful schemata?

Single-Point Crossover

Choose a random crossover point $c \in \lbrace 1, 2, \ldots, L-1\rbrace$ uniformly. Swap the segments after position $c$ between two parents. A schema $H$ with defining length $\delta(H)$ is disrupted if $c$ falls between the outermost fixed positions:

$$P_{\text{disrupt}}^{1\text{pt}}(H) = \frac{\delta(H)}{L - 1}$$

Therefore, the survival probability is $P_{\text{survive}}^{1\text{pt}}(H) = 1 - \frac{\delta(H)}{L-1}$. Short schemata ($\delta(H) \ll L$) are almost never disrupted.

Two-Point Crossover

Choose two crossover points and swap the segment between them. A schema survives if neither crossover point falls within the defining length, or both fall within it (the defining region is swapped out and back). The disruption probability is:

$$P_{\text{disrupt}}^{2\text{pt}}(H) = \frac{2\delta(H)(L - 1 - \delta(H))}{(L-1)(L-2)}$$

Two-point crossover is less positional-biased than single-point: it treats the chromosome more like a ring than a linear string, reducing the advantage of schemata located at the ends.

Uniform Crossover

At each position independently, exchange the bit between parents with probability $p_0$ (typically $p_0 = 0.5$). A schema $H$ with order $o(H)$ survives if none of its fixed positions are exchanged:

$$\boxed{P_{\text{survive}}^{\text{uniform}}(H) = (1 - p_0)^{o(H)}}$$

Note that this depends only on the order, not the defining length. This is a fundamental difference: uniform crossover is more disruptive for compact, high-order schemata (many fixed positions clustered together, which single-point crossover would rarely cut), but it is more explorative overall since it can recombine any subset of positions.

For $p_0 = 0.5$ and $o(H) = 5$: $P_{\text{survive}} = 0.5^5 = 0.03125$, while single-point crossover on the same schema (with $\delta(H) = 5$ and $L = 100$) yields $P_{\text{survive}} = 1 - 5/99 \approx 0.95$. Thus uniform crossover is far more disruptive for high-order schemata but provides superior mixing of alleles from different regions of the chromosome.
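
Both operators are a few lines of numpy. A sketch (function names are illustrative); note that each position of each offspring receives its allele from exactly one parent, so alleles are conserved across the pair:

```python
import numpy as np

def single_point_crossover(p1, p2, rng):
    """Swap tails after a random cut point c in {1, ..., L-1}."""
    c = rng.integers(1, len(p1))
    return np.concatenate([p1[:c], p2[c:]]), np.concatenate([p2[:c], p1[c:]])

def uniform_crossover(p1, p2, rng, p0=0.5):
    """Exchange each position independently with probability p0."""
    swap = rng.random(len(p1)) < p0
    c1, c2 = p1.copy(), p2.copy()
    c1[swap], c2[swap] = p2[swap], p1[swap]
    return c1, c2

rng = np.random.default_rng(2)
p1 = np.zeros(10, dtype=int)
p2 = np.ones(10, dtype=int)
c1, c2 = single_point_crossover(p1, p2, rng)
u1, u2 = uniform_crossover(p1, p2, rng)
print(c1, u1)
```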

4. Mutation & Exploration-Exploitation Balance

Mutation introduces random perturbations into chromosomes, serving as the primary mechanism for introducing new genetic material not present in the current population. While crossover recombines existing alleles, mutation can reach any point in the search space.

Optimal Mutation Rate

For a binary chromosome of length $L$, each bit is flipped independently with probability $p_m$. The expected number of mutations per chromosome is $L \cdot p_m$. The classic result for the optimal mutation rate is:

$$\boxed{p_m^* = \frac{1}{L}}$$

This gives exactly one expected bit flip per chromosome. The derivation follows from balancing exploration and exploitation:

Exploration requirement: The probability that a specific bit position is mutated in at least one individual across the population of size $N$ is:

$$P_{\text{explore}} = 1 - (1 - p_m)^N$$

For adequate exploration, we need $P_{\text{explore}}$ to be significant, requiring $p_m \gtrsim 1/N$.

Exploitation requirement: The probability that a good chromosome (a building block of order $o(H)$) survives mutation is $(1 - p_m)^{o(H)}$. For the entire chromosome ($o(H) = L$), the survival probability is:

$$P_{\text{survive}} = (1 - p_m)^L \approx e^{-p_m L}$$

At $p_m = 1/L$, this gives $P_{\text{survive}} \approx e^{-1} \approx 0.37$, meaning about 37% of individuals survive without any mutation — a reasonable balance. Higher rates destroy too much structure; lower rates explore too slowly.
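
Both numbers — one expected flip per chromosome and roughly $e^{-1}$ of individuals left untouched — can be verified empirically with a small sketch (the function name is illustrative):

```python
import numpy as np

def bit_flip_mutation(population, rng, p_m=None):
    """Flip each bit independently with probability p_m (default 1/L)."""
    L = population.shape[1]
    if p_m is None:
        p_m = 1.0 / L             # one expected flip per chromosome
    mask = rng.random(population.shape) < p_m
    return population ^ mask      # XOR flips exactly the masked bits

rng = np.random.default_rng(3)
pop = rng.integers(0, 2, size=(10_000, 50))
mutated = bit_flip_mutation(pop, rng)
changed = (pop != mutated).sum(axis=1)   # flips per chromosome
print(changed.mean())                    # close to L * p_m = 1
print(np.mean(changed == 0))             # close to (1 - 1/L)^L = e^-1
```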

Mutation & Deceptive Problems

A problem is deceptive if low-order building blocks mislead the search toward a suboptimal solution. Formally, a deceptive function of order $k$ has the property that:

$$\bar{f}(H_{\text{complement}}) > \bar{f}(H_{\text{optimal}}) \quad \text{for all schemata of order} < k$$

Without mutation, a GA using only selection and crossover converges to a suboptimal solution with probability 1 on deceptive problems. Mutation provides the random jumps needed to escape deceptive basins. This can be proven by noting that without mutation, once an allele is lost from the population it can never be recovered — making the GA equivalent to a greedy search on schemata.

5. Convergence & No Free Lunch Theorem

The No Free Lunch Theorem (Wolpert & Macready, 1997)

The NFL theorem is one of the most profound results in optimization theory. It states:

"For any two optimization algorithms $A_1$ and $A_2$, when performance is averaged uniformly over all possible fitness functions, they perform identically."

Formally, let $\mathcal{F}$ be the set of all functions $f: \mathcal{X} \to \mathbb{R}$ on a finite search space $\mathcal{X}$. For any performance measure $\Phi$ (e.g., number of evaluations to reach the optimum):

$$\boxed{\sum_{f \in \mathcal{F}} \Phi(A_1, f) = \sum_{f \in \mathcal{F}} \Phi(A_2, f)}$$

This holds for any two algorithms, including GAs, random search, gradient descent, and exhaustive search.

Consequences for Genetic Algorithms

The NFL theorem implies that GAs must exploit problem structure to outperform random search. The specific structure GAs exploit is the building block structure: fitness functions where short, low-order schemata provide reliable information about the global optimum.

GAs provably outperform hill climbing on problems with:

  • Multimodal landscapes: Many local optima trap hill climbers, but GAs maintain population diversity to search multiple basins simultaneously
  • Deceptive landscapes: Where local gradient information leads away from the global optimum, but building blocks at higher orders point correctly
  • Rugged landscapes: Where small changes in the input cause large changes in fitness, making gradient estimation unreliable
  • Non-differentiable or discrete spaces: Where gradient methods are inapplicable (e.g., combinatorial optimization, symbolic search)

Conversely, on smooth unimodal landscapes, gradient descent vastly outperforms GAs — and the NFL theorem tells us this advantage must be balanced by the GA's advantage elsewhere.

6. Multi-Objective Optimization (Pareto)

Many real-world problems involve optimizing multiple conflicting objectives simultaneously. Rather than producing a single solution, multi-objective optimization finds the set of Pareto-optimal trade-offs.

Pareto Dominance

For a minimization problem with $M$ objectives, solution $\mathbf{x}$ dominates solution $\mathbf{y}$ (written $\mathbf{x} \prec \mathbf{y}$) if and only if:

$$\boxed{\mathbf{x} \prec \mathbf{y} \iff \forall i \in \lbrace 1,\ldots,M\rbrace: f_i(\mathbf{x}) \leq f_i(\mathbf{y}) \;\wedge\; \exists j: f_j(\mathbf{x}) < f_j(\mathbf{y})}$$

The Pareto front is the set of all non-dominated solutions: no solution in the Pareto set can improve on one objective without worsening another.

NSGA-II: Non-Dominated Sorting

The NSGA-II algorithm (Deb et al., 2002) is the most widely used multi-objective GA. It ranks solutions by non-dominated sorting:

  • Front 1 (rank 1): All non-dominated solutions in the population
  • Front 2 (rank 2): Non-dominated solutions after removing Front 1
  • Front $k$: Non-dominated solutions after removing Fronts $1, \ldots, k-1$

For each solution $p$, the algorithm computes $n_p$ (the number of solutions dominating $p$) and $S_p$ (the set of solutions dominated by $p$). Front 1 consists of all $p$ with $n_p = 0$. For each $p$ in Front 1, decrement $n_q$ for all $q \in S_p$; those reaching $n_q = 0$ form Front 2, and so on. Complexity: $O(MN^2)$.
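
The bookkeeping above maps directly onto code. A minimal sketch of the fast non-dominated sort for a minimization problem (function names and the toy objective matrix are illustrative):

```python
import numpy as np

def dominates(a, b):
    """Pareto dominance for minimization: a <= b everywhere, < somewhere."""
    return np.all(a <= b) and np.any(a < b)

def non_dominated_sort(F):
    """NSGA-II fast non-dominated sort on an (N, M) objective matrix.
    Returns a list of fronts, each a list of row indices."""
    N = len(F)
    S = [[] for _ in range(N)]          # S_p: solutions dominated by p
    n = np.zeros(N, dtype=int)          # n_p: count of solutions dominating p
    for p in range(N):
        for q in range(N):
            if dominates(F[p], F[q]):
                S[p].append(q)
            elif dominates(F[q], F[p]):
                n[p] += 1
    fronts = [[p for p in range(N) if n[p] == 0]]
    while fronts[-1]:
        nxt = []
        for p in fronts[-1]:
            for q in S[p]:
                n[q] -= 1
                if n[q] == 0:
                    nxt.append(q)
        fronts.append(nxt)
    return fronts[:-1]                  # drop the trailing empty front

F = np.array([[1.0, 4.0], [2.0, 2.0], [4.0, 1.0],   # mutually non-dominated
              [3.0, 3.0], [5.0, 5.0]])
print(non_dominated_sort(F))
```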

Crowding Distance

Within each front, NSGA-II uses crowding distance to maintain diversity. For solution $i$ in a front, the crowding distance is:

$$d_i = \sum_{m=1}^{M} \frac{f_m^{(i+1)} - f_m^{(i-1)}}{f_m^{\max} - f_m^{\min}}$$

where $f_m^{(i+1)}$ and $f_m^{(i-1)}$ are the objective values of the neighboring solutions when sorted by objective $m$. Solutions with higher crowding distance are preferred, promoting a well-spread Pareto front.
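
A sketch of the crowding-distance computation for a single front (minimization; the function name is illustrative). Boundary solutions along each objective receive infinite distance so they are always retained:

```python
import numpy as np

def crowding_distance(F):
    """NSGA-II crowding distance for one front, F of shape (n, M)."""
    n, M = F.shape
    d = np.zeros(n)
    for m in range(M):
        order = np.argsort(F[:, m])
        d[order[0]] = d[order[-1]] = np.inf     # boundary solutions
        span = F[order[-1], m] - F[order[0], m]
        if span == 0:
            continue                            # degenerate objective
        for i in range(1, n - 1):
            # gap between the two neighbors along objective m, normalized
            d[order[i]] += (F[order[i + 1], m] - F[order[i - 1], m]) / span
    return d

F = np.array([[1.0, 4.0], [2.0, 2.0], [4.0, 1.0]])
print(crowding_distance(F))
```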

Hypervolume Indicator

The hypervolume indicator (HV) measures the quality of a Pareto front approximation. Given a reference point $\mathbf{r}$ (dominated by all solutions), the hypervolume is the $M$-dimensional volume of the space dominated by the Pareto front and bounded by $\mathbf{r}$:

$$\text{HV}(\mathcal{P}, \mathbf{r}) = \text{Vol}\left(\bigcup_{\mathbf{x} \in \mathcal{P}} [\mathbf{f}(\mathbf{x}), \mathbf{r}]\right)$$

The hypervolume is the only unary quality indicator known to be strictly monotonic with respect to Pareto dominance: if set $A$ dominates set $B$, then $\text{HV}(A) > \text{HV}(B)$. For two objectives, computing HV is $O(n \log n)$; for $M$ objectives, exact computation is $\#P$-hard.

7. Advanced Variants

Genetic Programming (GP)

Introduced by Koza (1992), GP evolves computer programs represented as expression trees rather than fixed-length binary strings. The chromosome is a tree where internal nodes are functions ($+, -, \times, \sin, \text{if-then}$) and leaves are terminals (variables, constants). Crossover swaps subtrees; mutation replaces a subtree with a random new one.

GP has been used for symbolic regression, automatic circuit design, game strategies, and evolving robot controllers. A key challenge is bloat: programs tend to grow in size without improving fitness, requiring parsimony pressure.

Differential Evolution (DE)

DE (Storn & Price, 1997) operates on real-valued vectors. For each target vector $\mathbf{x}_i$, a donor vector is created by:

$$\boxed{\mathbf{v}_i = \mathbf{x}_{r_1} + F \cdot (\mathbf{x}_{r_2} - \mathbf{x}_{r_3})}$$

where $r_1, r_2, r_3$ are distinct random indices and $F \in [0, 2]$ is the scale factor. The mutation term $F(\mathbf{x}_{r_2} - \mathbf{x}_{r_3})$ is a scaled difference vector, which automatically adapts to the local geometry of the search space: when the population is spread out, mutations are large; as it converges, they shrink.

A trial vector is formed by binomial crossover with rate $CR$, and replaces $\mathbf{x}_i$ only if it has better fitness (greedy selection). DE is remarkably effective for continuous optimization with few control parameters.
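
A minimal sketch of one DE/rand/1/bin generation (function name, test function, and parameter values are illustrative, not the chapter's script):

```python
import numpy as np

def de_step(pop, fitness_fn, rng, F=0.8, CR=0.9):
    """One generation of DE/rand/1/bin with greedy replacement (minimization)."""
    N, D = pop.shape
    new_pop = pop.copy()
    for i in range(N):
        # three distinct indices, all different from the target i
        r1, r2, r3 = rng.choice([j for j in range(N) if j != i],
                                size=3, replace=False)
        donor = pop[r1] + F * (pop[r2] - pop[r3])
        # binomial crossover; one forced position guarantees a donor gene
        cross = rng.random(D) < CR
        cross[rng.integers(D)] = True
        trial = np.where(cross, donor, pop[i])
        if fitness_fn(trial) <= fitness_fn(pop[i]):   # greedy selection
            new_pop[i] = trial
    return new_pop

sphere = lambda x: np.sum(x ** 2)
rng = np.random.default_rng(4)
pop = rng.uniform(-5, 5, size=(20, 5))
for _ in range(100):
    pop = de_step(pop, sphere, rng)
print(min(sphere(x) for x in pop))   # far below the initial best
```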

CMA-ES: Covariance Matrix Adaptation

CMA-ES (Hansen & Ostermeier, 2001) is arguably the most sophisticated evolution strategy. It maintains a multivariate normal distribution and adapts its covariance matrix to learn the local landscape geometry:

$$\mathbf{x}_k \sim \mathcal{N}(\mathbf{m}, \sigma^2 \mathbf{C})$$

The covariance matrix adaptation updates $\mathbf{C}$ using two complementary mechanisms:

  • Rank-one update (evolution path): $\mathbf{C} \leftarrow (1-c_1)\mathbf{C} + c_1 \mathbf{p}_c \mathbf{p}_c^T$, where $\mathbf{p}_c$ tracks the cumulative mean shift
  • Rank-$\mu$ update: $\mathbf{C} \leftarrow (1-c_\mu)\mathbf{C} + c_\mu \sum_{i=1}^\mu w_i \mathbf{y}_{i:\lambda}\mathbf{y}_{i:\lambda}^T$, using the $\mu$ best samples

CMA-ES is rotation-invariant, scale-free, and regarded as the gold standard for derivative-free optimization in continuous spaces up to a few hundred dimensions.
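
The sampling side of the ask/tell cycle can be sketched in a few lines. This is a deliberately stripped-down illustration — mean update by weighted recombination only, with $\sigma$ and $\mathbf{C}$ held fixed; the rank-one and rank-$\mu$ covariance updates above, and step-size control, are omitted. All names and parameter values are illustrative:

```python
import numpy as np

def sample_and_update_mean(m, sigma, C, fitness_fn, rng, lam=12, mu=6):
    """One CMA-ES-style step: sample N(m, sigma^2 C), recombine the mu best.
    Covariance adaptation is intentionally omitted in this sketch."""
    A = np.linalg.cholesky(C)                            # C = A A^T
    X = m + sigma * (rng.standard_normal((lam, len(m))) @ A.T)
    order = np.argsort([fitness_fn(x) for x in X])       # minimization
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))  # log-rank weights
    w /= w.sum()
    m_new = w @ X[order[:mu]]                            # weighted recombination
    return m_new, X

sphere = lambda x: np.sum(x ** 2)
rng = np.random.default_rng(5)
m = np.array([3.0, -2.0])
C = np.eye(2)
for _ in range(40):
    m, _ = sample_and_update_mean(m, sigma=0.5, C=C, fitness_fn=sphere, rng=rng)
print(np.round(m, 3))   # drifts toward the optimum at the origin
```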

Neuroevolution: NEAT

NeuroEvolution of Augmenting Topologies (NEAT, Stanley & Miikkulainen, 2002) evolves both the weights and topology of neural networks. Key innovations:

  • Historical markings: Each gene (connection) has a global innovation number enabling meaningful crossover between networks with different topologies
  • Speciation: Networks are grouped into species based on structural similarity, protecting innovations through niche protection
  • Complexification: Start minimal and grow topology, avoiding the search through unnecessarily large spaces

8. Applications in Science

Protein Structure & Drug Discovery

Before AlphaFold, GAs were primary tools for protein structure prediction. The Rosetta suite used fragment assembly with evolutionary search, encoding backbone dihedral angles ($\phi, \psi$) as chromosomes with physics-based fitness (van der Waals, electrostatics, solvation). In drug discovery, evolutionary algorithms optimize molecular structures represented as SMILES strings for binding affinity, solubility, and toxicity using multi-objective Pareto optimization.

NASA ST5 Evolved Antenna

In 2006, NASA's ST5 mission flew the first evolved hardware in space: an X-band antenna designed entirely by a GA. The algorithm evolved 3D wire structures evaluated by EM simulation, producing an irregular, asymmetric design no human would create — yet it met all performance requirements with superior radiation patterns.

Materials Science, Astrophysics & Engineering

GAs optimize multi-component alloy compositions for mechanical properties, fit parametric galaxy models to photometry (handling multimodal likelihood surfaces from age-metallicity-dust degeneracies), evolve electronic circuits (analog filters, oscillators), and solve complex scheduling problems (telescope observations, accelerator beam time).

9. Historical Context

  • 1960s: Rechenberg/Schwefel develop evolution strategies (Germany); Fogel/Owens/Walsh create evolutionary programming (US)
  • 1975: Holland publishes Adaptation in Natural and Artificial Systems — Schema Theorem, implicit parallelism
  • 1989: Goldberg's Genetic Algorithms in Search, Optimization, and Machine Learning formalizes building blocks hypothesis
  • 1992: Koza introduces Genetic Programming — evolving expression trees
  • 1997: Wolpert & Macready prove the No Free Lunch theorems
  • 2002: Deb publishes NSGA-II — fast non-dominated sorting with crowding distance
  • 2001-03: Hansen develops CMA-ES — gold standard for derivative-free continuous optimization
  • 2006: NASA ST5 flies the first computer-evolved antenna in space

10. Python Simulation

We implement a complete genetic algorithm in numpy to solve two challenging problems: (1) the 10-dimensional Rastrigin function (highly multimodal with $10^{10}$ local minima), and (2) the ZDT1 multi-objective benchmark with NSGA-II-style non-dominated sorting and Pareto front visualization.

Genetic Algorithm: Rastrigin Optimization & Pareto Front

The accompanying Python script (`script.py`, 282 lines) implements both experiments.

Summary

  • Genetic algorithms optimize by evolving a population through selection, crossover, and mutation — requiring no gradients or differentiability
  • Schema Theorem: $E[m(H,t\!+\!1)] \geq m(H,t) \cdot \frac{f(H)}{\bar{f}} \cdot (1 - p_c \frac{\delta(H)}{L-1}) \cdot (1-p_m)^{o(H)}$ — short, low-order, above-average schemata grow exponentially
  • Selection: Tournament selection ($k$-way) is preferred over roulette wheel for controllable, fitness-scale-independent selection pressure
  • Crossover: Uniform crossover survival $(1-p_0)^{o(H)}$ depends on order; single-point survival $1 - \delta(H)/(L-1)$ depends on defining length
  • Mutation rate: Optimal $p_m = 1/L$ balances exploration of new schemata with preservation of good building blocks
  • No Free Lunch: GAs must exploit building block structure to outperform random search; they excel on multimodal and deceptive landscapes
  • Multi-objective: NSGA-II with non-dominated sorting and crowding distance produces well-spread Pareto-optimal trade-offs
  • Variants: DE ($\mathbf{v} = \mathbf{x}_{r_1} + F(\mathbf{x}_{r_2} - \mathbf{x}_{r_3})$), CMA-ES (covariance adaptation), GP (evolving programs), NEAT (evolving network topology)
  • Applications: Protein structure, drug design, antenna evolution (NASA ST5), materials optimization, astrophysical model fitting