Module 1 · Canonical GA

Genetic Algorithms

The canonical genetic algorithm (GA), in the form codified by Holland and his students, is the most-cited member of the evolutionary-computation family. It is also the most misunderstood: textbook descriptions emphasise the metaphor of biological evolution and understate the profound differences from real population genetics. This module gives the algorithm cleanly, then states what is and is not known about why it works.

1. Representation

A candidate solution is encoded as a fixed-length string \(\mathbf{x} \in \{0,1\}^L\) (the chromosome or genotype). The fitness function \(f : \{0,1\}^L \to \mathbb{R}_{\ge 0}\) maps a chromosome to a non-negative scalar. The decoding from genotype to phenotype (the actual object whose performance is measured) is problem-specific and is one of the hardest design choices in any GA application.

Common alternatives to binary encoding: integer strings (for combinatorial problems), permutations (TSP, scheduling), and real-valued vectors (handled more naturally by ES, Module 2). Each demands its own variation operators.
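As a concrete illustration of genotype-to-phenotype decoding, here is a minimal sketch that interprets a binary chromosome as a real number (the 16-bit width and the interval \([-5, 5]\) are arbitrary illustrative choices, not part of any standard):

```python
def decode(bits, lo=-5.0, hi=5.0):
    """Decode a list of 0/1 bits into a real value in [lo, hi]
    by treating the bits as an unsigned integer and rescaling."""
    as_int = int("".join(map(str, bits)), 2)
    return lo + (hi - lo) * as_int / (2 ** len(bits) - 1)
```

The all-zeros chromosome maps to `lo` and the all-ones chromosome to `hi`; note that under this encoding adjacent phenotypes can differ in many bits (the Hamming-cliff problem), which is exactly the kind of encoding decision the paragraph above warns about.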

2. The Outer Loop

At each generation, a population of \(N\) individuals is evaluated, parents selected, offspring produced by recombination and mutation, and the population replaced. The pseudocode:

initialise population P_0 of size N at random
evaluate fitness f(x) for each x in P_0
for t = 0, 1, 2, ... :
    parents  = select(P_t, f)            # selection
    children = []
    while len(children) < N:
        a, b = pick two parents
        c, d = crossover(a, b)            # recombination
        c    = mutate(c)
        d    = mutate(d)
        children.extend([c, d])
    P_{t+1} = replace(P_t, children, f)   # generational or steady-state
    record best-so-far

Two replacement schemes dominate: generational (children entirely replace parents) and steady-state (a few children replace a few of the worst individuals per iteration). Generational with elitism — preserving the top-\(k\) unmodified — is the most common compromise.
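The generational-with-elitism compromise can be sketched as a single step in Python; the `make_child` callback (a placeholder name standing in for selection, crossover, and mutation) produces one offspring from the current population:

```python
def next_generation(population, fitness, make_child, k_elite=1):
    """One generational step with top-k elitism: the best k individuals
    survive unchanged; the rest of the new population is freshly bred."""
    ranked = sorted(population, key=fitness, reverse=True)
    elites = ranked[:k_elite]
    children = [make_child(population) for _ in range(len(population) - k_elite)]
    return elites + children
```

With \(k=0\) this degenerates to pure generational replacement; with elitism the best-so-far fitness is monotonically non-decreasing.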

3. Selection

Selection injects the search bias toward higher fitness. Three workhorses:

  • Fitness-proportionate (roulette-wheel): Pick parent \(i\) with probability \(p_i = f_i / \sum_j f_j\). Conceptually clean, but pathological when fitnesses span many orders of magnitude (one superfit individual hijacks the wheel) or are nearly equal (no selection pressure at all). Rarely used in modern practice.
  • Rank selection: Sort the population by fitness, assign selection probability as a function of rank only. Robust to fitness rescaling. The standard linear-rank scheme: \(p_i \propto 2 - s + 2(s-1)(i-1)/(N-1)\) with selection pressure \(s \in [1, 2]\).
  • Tournament selection: Sample \(k\) individuals uniformly at random and return the best. Tournament size \(k=2\) is gentle, \(k=7\) is aggressive. Easy to implement, parallelisable, and the de facto standard in modern GA codebases.

The selection pressure — expected number of copies of the best individual after one selection step — controls the exploration/exploitation trade-off: too high and the population converges prematurely on a local optimum; too low and the search drifts.
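Tournament selection is short enough to show in full. This sketch uses sampling with replacement, one common convention (sampling without replacement is the other):

```python
import random

def tournament_select(population, fitness, k=2):
    """Sample k contestants uniformly at random (with replacement)
    and return the fittest of them."""
    contestants = random.choices(population, k=k)
    return max(contestants, key=fitness)
```

With \(k=2\) the pressure is mild; raising \(k\) sharpens it, since a weak individual must survive more comparisons to be selected at all.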

4. Crossover

Crossover combines material from two parents to produce two children. Standard variants for fixed-length strings:

  • One-point: pick a cut \(c\) uniformly in \([1,L-1]\), swap suffixes.
  • Two-point: pick two cuts, swap the middle segment.
  • Uniform: for each locus, with probability \(p_x=0.5\), swap parents’ bits.

Crossover rate \(p_c\) (typically \(0.6 - 1.0\)) is the per-mating probability that crossover is applied; otherwise the parents are passed through unchanged.

The positional bias of one- and two-point crossover — loci that are physically far apart on the chromosome are more likely to be split than nearby loci — is a feature, not a bug, when the encoding has been chosen so that interacting genes lie close together. When you cannot do that, uniform crossover is safer.
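One-point and uniform crossover can be sketched directly; both return two children, as in the outer loop:

```python
import random

def one_point(a, b):
    """One-point crossover: cut uniformly in [1, L-1], swap suffixes."""
    c = random.randrange(1, len(a))
    return a[:c] + b[c:], b[:c] + a[c:]

def uniform(a, b, p_x=0.5):
    """Uniform crossover: swap the parents' bits at each locus
    independently with probability p_x."""
    c, d = list(a), list(b)
    for i in range(len(a)):
        if random.random() < p_x:
            c[i], d[i] = d[i], c[i]
    return c, d
```

Both operators conserve alleles: at every locus the two children jointly carry exactly the two parental bits, which is why crossover alone can never introduce an allele absent from the population.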

5. Mutation

For binary strings, mutation flips each bit independently with probability \(p_m\). The classical recommendation, due to De Jong and refined by Mühlenbein, is

\[ p_m \;=\; \frac{1}{L} \]

i.e. one bit-flip per chromosome on average. This rate emerges naturally from analyses of OneMax, Royal Roads, and similar test functions, and remains the default in the absence of problem-specific information. Crossover is the dominant force in classical GA; mutation is the safety net that maintains diversity and prevents loss of alleles.
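Bit-flip mutation with the \(1/L\) default is a one-liner:

```python
import random

def mutate(bits, p_m=None):
    """Flip each bit independently with probability p_m (default 1/L)."""
    if p_m is None:
        p_m = 1 / len(bits)
    return [b ^ 1 if random.random() < p_m else b for b in bits]
```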

6. The Schema Theorem

A schema \(H\) is a similarity template over \(\{0,1,*\}^L\) where \(*\) is a wild-card. The schema \(1{*}{*}0{*}\) matches all strings whose first bit is 1 and fourth is 0. Two key descriptors:

  • Order \(o(H)\): number of fixed positions.
  • Defining length \(\delta(H)\): distance between leftmost and rightmost fixed positions.
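Both descriptors, and schema membership, are easy to compute directly from a schema written as a string over \(\{0,1,*\}\):

```python
def order(H):
    """o(H): number of fixed (non-*) positions."""
    return sum(1 for c in H if c != "*")

def defining_length(H):
    """delta(H): distance between leftmost and rightmost fixed
    positions (0 if fewer than two positions are fixed)."""
    fixed = [i for i, c in enumerate(H) if c != "*"]
    return fixed[-1] - fixed[0] if len(fixed) >= 2 else 0

def matches(H, x):
    """Does bit-string x belong to schema H?"""
    return all(c == "*" or c == b for c, b in zip(H, x))
```

For the example schema above, `1**0*` has order 2 and defining length 3.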

Holland’s schema theorem (1975), in its simplest form, bounds the expected number \(m(H,t+1)\) of representatives of \(H\) in the next generation:

\[ \mathbb{E}[\,m(H,t+1)\,] \;\ge\; m(H,t)\,\frac{f(H,t)}{\bar f(t)}\,\Bigl[\,1 - p_c\,\frac{\delta(H)}{L-1} - o(H)\,p_m\,\Bigr] \]

In words: short, low-order, above-average schemata receive an exponentially increasing number of trials over time. This is the building-block hypothesis — that GAs work by combining good short building-blocks into ever-better solutions.
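The bound itself is a one-line computation; this sketch simply evaluates the right-hand side for given schema statistics, which makes the trade-off visible — a large \(\delta(H)\) or \(o(H)\) eats into the growth from an above-average fitness ratio:

```python
def schema_lower_bound(m, f_H, f_bar, delta, o, L, p_c, p_m):
    """RHS of the schema theorem: lower bound on the expected number
    of representatives of schema H in the next generation, given its
    current count m, mean fitness f_H, population mean f_bar,
    defining length delta, order o, and the GA parameters."""
    return m * (f_H / f_bar) * (1 - p_c * delta / (L - 1) - o * p_m)
```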

7. The Schema Theorem, Critically

The schema theorem is a true statement about expectations — but it is much weaker than it first appears. Its limitations, well documented since the late 1990s:

  • It is an inequality on a one-step expected change. It does not predict the long-run dynamics, and \(f(H,t)\) itself drifts as the population composition changes.
  • It says nothing about why the building-block hypothesis should hold for an arbitrary fitness function. Deceptive functions (Goldberg 1987) have low-order schemata pointing away from the optimum — the GA is actively misled.
  • Wolpert & Macready’s no-free-lunch theorems (1997): averaged over all possible fitness functions, every black-box algorithm has identical expected performance. There is no “the GA is good” without specifying the class of problems.
  • Vose (1999) gave an exact Markov-chain analysis of the simple GA showing the schema theorem is essentially the first-order term of a much richer dynamical system.

The modern view: GAs work well on problems where the encoding has been chosen so that fitness is approximately separable into short, low-order interactions. When this linkage assumption fails, more sophisticated methods (estimation-of-distribution algorithms, Module 2’s CMA-ES, structured GP) are required.

8. Worked Example: OneMax

The OneMax function \(f(\mathbf{x}) = \sum_{i=1}^L x_i\) is the “hello world” of GAs: maximise the number of 1s. Trivial gradient information is present in every locus. With population size \(N=100\), tournament size 2, uniform crossover at \(p_c=1.0\), and bit-flip mutation at \(p_m=1/L\), a simple GA solves OneMax with \(L=100\) in roughly 30–50 generations. The expected first hitting time scales as \(\Theta(L \log L)\) for the (1+1)-EA — a result now standard in runtime analysis, the rigorous-theory wing of evolutionary computation that has matured since the 2000s (Droste, Jansen, Wegener; Doerr’s 2020 textbook).

OneMax is the simplest problem on which to tune your GA hyperparameters and to verify your implementation actually evolves rather than drifts.
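A self-contained run, following the parameter choices above (tournament size 2, uniform crossover applied to every mating, \(p_m = 1/L\)) plus single-individual elitism; the exact generation counts will vary with the seed:

```python
import random

def onemax(x):
    return sum(x)

def run_ga(L=100, N=100, generations=300, seed=0):
    """Simple GA on OneMax: tournament(k=2) selection, uniform
    crossover on every mating, 1/L bit-flip mutation, 1-elitism.
    Returns (generation reached, best chromosome)."""
    rng = random.Random(seed)
    p_m = 1 / L
    pop = [[rng.randint(0, 1) for _ in range(L)] for _ in range(N)]

    def select():
        a, b = rng.choice(pop), rng.choice(pop)   # tournament of size 2
        return a if onemax(a) >= onemax(b) else b

    for gen in range(generations):
        best = max(pop, key=onemax)
        if onemax(best) == L:
            return gen, best                      # optimum found
        children = [list(best)]                   # elitism: keep incumbent
        while len(children) < N:
            a, b = select(), select()
            c = [a[i] if rng.random() < 0.5 else b[i] for i in range(L)]
            c = [bit ^ 1 if rng.random() < p_m else bit for bit in c]
            children.append(c)
        pop = children
    return generations, max(pop, key=onemax)
```

Tracking the best-so-far curve of this loop against a mutation-only baseline is a quick way to perform the evolving-versus-drifting check suggested above.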

9. Practical Pitfalls

  • Premature convergence: the population collapses onto a single point before the search has explored. Detect by tracking population diversity (Hamming-distance variance, entropy of allele frequencies). Mitigate by increasing population size, lowering selection pressure, or introducing explicit diversity mechanisms (sharing, crowding, niching — Module 5).
  • Poor encoding: if epistatic genes are placed far apart on the chromosome, one-point crossover destroys them. Either redesign the encoding or switch to uniform crossover / a linkage-learning EDA.
  • Fitness scaling: with proportional selection, raw fitness values often need rescaling (linear, sigma-truncation, Boltzmann) to maintain useful selection pressure throughout the run. Or just use tournament selection and stop worrying.
  • Hyperparameter sensitivity: \(N, p_c, p_m\), tournament size, elitism — each matters. The dominant cost of GA work is hyperparameter sweep; budget for it.
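One of the diversity measures mentioned above, the entropy of allele frequencies, can be sketched as a mean per-locus entropy (values near 1 bit indicate a diverse population; values near 0 indicate convergence):

```python
import math

def allele_entropy(population):
    """Mean per-locus Shannon entropy (in bits) of the allele
    frequencies of a population of binary chromosomes."""
    L, N = len(population[0]), len(population)
    total = 0.0
    for i in range(L):
        p = sum(x[i] for x in population) / N   # frequency of allele 1
        if 0 < p < 1:
            total += -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return total / L
```

Plotting this quantity per generation makes premature convergence visible long before the best-so-far curve flattens.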