Part 9: Recombination and Mobile Elements

Genetic Shuffling: Engines of Genome Evolution

Recombination and mobile genetic elements are the primary engines of genome plasticity. Homologous recombination enables high-fidelity DNA repair and generates genetic diversity during meiosis. Site-specific recombination drives programmed DNA rearrangements from phage integration to antibody diversification. Transposable elements, comprising nearly half the human genome, reshape chromosomal architecture over evolutionary time. Together with CRISPR-Cas adaptive immunity in prokaryotes, these mechanisms reveal a dynamic genome in constant flux.

~45%

Human genome from transposons

~30

Crossovers per human meiosis

~10¹¹

Possible antibodies via V(D)J

1.1M

Alu copies in human genome

1. Homologous Recombination (HR)

Homologous recombination exchanges genetic information between DNA molecules that share extensive sequence similarity. It is essential for accurate repair of double-strand breaks (DSBs), restart of collapsed replication forks, and generation of crossovers during meiosis. The core reaction involves strand invasion of a homologous duplex by a single-stranded 3' overhang, catalyzed by RecA-family recombinases.

1.1 The Holliday Model (1964)

Robin Holliday proposed the first mechanistic model for homologous recombination. Although later refined, it introduced fundamental concepts still central to our understanding.

Step 1: Strand Nicking and Invasion

Endonuclease nicks occur at corresponding positions on homologous duplexes. One strand from each duplex dissociates from its complement and invades the other duplex, forming heteroduplex DNA where strands from different parental molecules are paired.

Step 2: Branch Migration

The crossover point (branch point) can migrate along the DNA, extending the region of heteroduplex. Branch migration is spontaneous but slow; in cells, the RuvAB complex (or Rad54 in eukaryotes) catalyzes rapid, directional branch migration using ATP hydrolysis. RuvA recognizes the junction structure while RuvB acts as a hexameric ring helicase that pumps DNA through the junction.

Step 3: Holliday Junction (Four-Way DNA Junction)

The intermediate contains a four-stranded DNA structure called the Holliday junction (HJ). In solution, the HJ adopts a stacked X-structure with two continuous and two exchanging strands. The junction can isomerize between two conformations by exchanging which pair of arms is stacked, and this isomerization determines the outcome of resolution. Crystal structures reveal the junction is stabilized by base stacking at the branch point with a 60-degree angle between the arms.

Step 4: Resolution

The RuvC resolvase (or GEN1/Yen1 in eukaryotes) cleaves the Holliday junction symmetrically. Two orientations of cleavage are possible:

Horizontal cleavage (non-crossover)

Cuts the exchanging strands. Produces patched duplexes with a region of heteroduplex but no exchange of flanking markers. Gene conversion without crossover.

Vertical cleavage (crossover)

Cuts the continuous strands. Produces reciprocal exchange of flanking markers — a true crossover event. This is the basis of genetic recombination observed in meiosis.

1.2 Double-Strand Break Repair Model (Szostak et al., 1983)

The Szostak/DSBR model superseded the Holliday model as the primary pathway for meiotic recombination. It begins with a deliberate DSB rather than single-strand nicks, explaining both gene conversion and crossover outcomes.

5' to 3' End Resection

After DSB formation, the 5' ends are resected by the MRN complex (Mre11-Rad50-Nbs1) together with CtIP (Sae2 in yeast) for initial short-range resection. Long-range resection is carried out by Exo1 (a 5'→3' exonuclease) or the Sgs1-Dna2 helicase-nuclease complex. This generates long 3' single-stranded overhangs (typically 1-2 kb) that become substrates for RecA/Rad51 loading.

3' Overhang Strand Invasion and D-loop Formation

One 3' overhang, coated with the Rad51 nucleoprotein filament, invades the homologous duplex. The invading strand displaces one strand of the target, forming a displacement loop (D-loop). The 3' end of the invading strand primes new DNA synthesis using the homologous strand as template, copying genetic information across.

Second End Capture

The displaced strand of the D-loop anneals with the second 3' overhang from the other side of the DSB. This “second end capture” is facilitated by Rad52 (in yeast) or BRCA2-mediated annealing. DNA synthesis from both 3' ends fills the gaps, and ligation produces two intact Holliday junctions.

Double Holliday Junction (dHJ) and Resolution

The dHJ intermediate can be resolved in two ways: (1) Symmetric cleavage by resolvases (Mus81-Mms4, GEN1/Yen1, SLX1-SLX4) at both junctions, yielding crossover or non-crossover depending on cleavage orientation. (2) Dissolution by the BLM-TopIIIα-RMI1-RMI2 (BTR) complex, which converges the two junctions via branch migration and decatenates them topologically. Dissolution exclusively produces non-crossover products and is the dominant pathway in mitotic cells to suppress loss of heterozygosity.

Gene Conversion

The region of new DNA synthesis around the DSB results in gene conversion — non-reciprocal transfer of information from the donor to the recipient. Gene conversion tracts are typically 1-2 kb in yeast and can be associated with or without crossover of flanking markers. Mismatch repair of heteroduplex DNA within the conversion tract can alter the ratio of 6:2 vs 5:3 segregation patterns in tetrad analysis.

Derivation: Holliday Junction Energetics — Free Energy of Branch Migration

Starting from base-pairing thermodynamics, we derive the free energy landscape of branch migration through a Holliday junction.

Step 1: Define the branch migration step

Each step of branch migration exchanges one base pair on each of the four strands. The branch point moves by one position, breaking two base pairs and forming two new ones:

$$\text{Step}: \;\text{bp}_{A:B}^{(i)} + \text{bp}_{C:D}^{(i)} \rightarrow \text{bp}_{A:D}^{(i)} + \text{bp}_{C:B}^{(i)}$$

Step 2: Calculate the free energy change per step

For perfectly homologous sequences, old and new base pairs are identical, so the net free energy change per step is approximately zero:

$$\Delta G_{\text{step}} = \Delta G_{\text{bp,new}} - \Delta G_{\text{bp,old}} \approx 0 \qquad \text{(homologous sequences)}$$

Step 3: Account for the stacking energy barrier

However, each step requires unstacking bases at the junction. Using nearest-neighbor stacking parameters, the activation barrier for a single step involves disrupting two stacking interactions:

$$\Delta G^\ddagger_{\text{step}} = 2 \times \Delta G_{\text{unstack}} \approx 2 \times 5\;\text{kcal/mol} \approx 10\;\text{kcal/mol}$$

Step 4: Derive the spontaneous migration rate

From transition state theory, the rate of spontaneous branch migration per step is:

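$$k_{\text{step}} = \frac{k_BT}{h}\exp\left(-\frac{\Delta G^\ddagger}{RT}\right) \approx 6 \times 10^{12}\;\text{s}^{-1} \times e^{-16.9} \approx 3 \times 10^{5}\;\text{s}^{-1}$$

A barrier of 10 kcal/mol therefore predicts roughly 3 × 10⁵ steps per second. Measured step rates for junctions folded in the presence of Mg²⁺ are far lower, on the order of a few per second, which corresponds to an effective barrier near 17 kcal/mol (exponent of about -28.3). The estimates below use k_step ≈ 3 s⁻¹.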
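The sensitivity of the step rate to the barrier height can be checked numerically. The sketch below evaluates the transition-state expression at 25 °C for two barrier values: the 10 kcal/mol stacking estimate, and 16.8 kcal/mol, which is included here only as an illustrative effective barrier that yields a rate near 3 s⁻¹. The function name and the choice of temperature are assumptions made for this example.

```python
import math

K_B = 1.380649e-23       # Boltzmann constant, J/K
H = 6.62607015e-34       # Planck constant, J*s
R_KCAL = 1.987204e-3     # gas constant, kcal/(mol*K)

def eyring_rate(barrier_kcal_per_mol, temperature_k=298.15):
    """Rate constant (1/s) from transition state theory for a given barrier."""
    prefactor = K_B * temperature_k / H              # about 6e12 1/s at 25 C
    exponent = barrier_kcal_per_mol / (R_KCAL * temperature_k)
    return prefactor * math.exp(-exponent)

# Illustrative barrier values (kcal/mol); 16.8 is an assumed effective barrier.
for barrier in (10.0, 16.8):
    rate = eyring_rate(barrier)
    print(f"barrier = {barrier:5.1f} kcal/mol -> k_step = {rate:.3g} 1/s")
```

Because the barrier sits in an exponent, a change of about 7 kcal/mol shifts the predicted rate by roughly five orders of magnitude.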

Step 5: Model as a random walk

Since ΔG_step ≈ 0 for homologous DNA, branch migration is an unbiased random walk. The root-mean-square distance migrated after n steps is:

$$\langle x^2 \rangle = n \cdot d^2 \implies \text{distance} = d\sqrt{n} \qquad \text{(} d = 3.4\;\text{\AA per bp)}$$

To migrate 100 bp requires n = 100² = 10⁴ steps (random walk), taking ~10⁴/3 ≈ 3,300 s (~55 min). This is far too slow for in vivo recombination.

Step 6: RuvAB-catalyzed directional migration

The RuvAB complex converts random walk to directed motion using ATP hydrolysis. RuvB hexameric rings pump DNA through the junction at ~10–20 bp/s, corresponding to an effective driving force:

$$\Delta G_{\text{ATP}} = -7.3\;\text{kcal/mol per ATP} \qquad v_{\text{RuvAB}} \approx 10\text{-}20\;\text{bp/s}$$

With RuvAB, migrating 1,000 bp takes ~50–100 s. An unbiased random walk at ~3 steps per second would need 1,000² = 10⁶ steps, or roughly 90 hours.
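The comparison between diffusive and motor-driven migration is simple arithmetic, sketched below. It takes the spontaneous step rate of about 3 s⁻¹ and the RuvAB speed range of 10 to 20 bp/s from the text; the function names are chosen for this example only.

```python
def random_walk_time_s(distance_bp, k_step=3.0):
    """Time at which the rms displacement of an unbiased 1D walk reaches distance_bp.

    The walk needs n = distance_bp**2 steps, taken at k_step steps per second.
    """
    n_steps = distance_bp ** 2
    return n_steps / k_step

def directed_time_s(distance_bp, speed_bp_per_s):
    """Time for a motor moving at constant speed to cover distance_bp."""
    return distance_bp / speed_bp_per_s

for distance in (100, 1000):
    walk = random_walk_time_s(distance)
    fast = directed_time_s(distance, 20.0)
    slow = directed_time_s(distance, 10.0)
    print(f"{distance:5d} bp: random walk {walk / 3600:8.2f} h, "
          f"RuvAB {fast:.0f}-{slow:.0f} s")
```

The quadratic scaling is the key point: a tenfold longer tract costs a random walk a hundredfold more time, but costs a directed motor only tenfold more.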

Derivation: Gene Conversion Tract Length from the DSBR Model

Starting from the double-strand break repair (DSBR) model, we derive the expected length of gene conversion tracts.

Step 1: Identify the conversion tract boundaries

Gene conversion tracts arise from new DNA synthesis that copies information from the donor template. The tract extends from the point of strand invasion to where synthesis terminates. In the DSBR model, two conversion tracts originate from each side of the DSB.

$$L_{\text{tract}} = L_{\text{resection}} + L_{\text{synthesis}} - L_{\text{overlap}}$$

Step 2: Model DNA synthesis as a processivity-limited process

The repair polymerase (Pol δ in eukaryotes) extends the invading strand. Synthesis terminates stochastically with probability p per bp, giving an exponential tract length distribution:

$$P(L > x) = e^{-x/\lambda} \qquad \text{where } \lambda = 1/p \text{ is the mean processivity}$$

Step 3: Account for 5' resection length

The minimum tract includes the resected region (typically 1–2 kb). The total heteroduplex/conversion region has a minimum length set by resection:

$$\langle L_{\text{tract}} \rangle = L_{\text{resection}} + \lambda \approx 1\;\text{kb} + \lambda$$

Step 4: Include mismatch repair effects

If the donor and recipient differ at a marker, the heteroduplex DNA within the conversion tract contains a mismatch that is subject to mismatch repair (MMR). Repair direction determines whether the outcome is conversion toward the donor (gene conversion) or back to the recipient (restoration). For a marker at distance d (in bp) from the DSB:

$$P(\text{conversion at marker}) = P(\text{in tract}) \times P(\text{repair toward donor}) \approx e^{-d/\lambda} \times 0.5$$

Step 5: Predict observable tract lengths

In yeast meiosis, measured conversion tracts average 1–2 kb (ranging from ~100 bp to >10 kb), consistent with λ ≈ 0.5–1.5 kb for repair polymerase processivity in the D-loop context:

$$\langle L_{\text{observed}} \rangle \approx 1\text{-}2\;\text{kb (yeast)} \qquad \langle L_{\text{observed}} \rangle \approx 0.2\text{-}1\;\text{kb (mammals)}$$

Mammalian tracts are shorter, consistent with the SDSA (synthesis-dependent strand annealing) pathway being dominant in mitotic cells, where the D-loop is disrupted after limited synthesis.
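The tract-length model above can be explored with a short numerical sketch. It assumes, as in the derivation, an exponential synthesis length with mean λ, a fixed resection contribution, and an unbiased repair direction (0.5). The function names and the example values of λ are illustrative, not measured parameters.

```python
import math
import random

def mean_tract_length(resection_bp, lam_bp):
    """Expected tract length: resection contribution plus mean synthesis length."""
    return resection_bp + lam_bp

def p_conversion(distance_bp, lam_bp, p_toward_donor=0.5):
    """Probability that a marker at distance_bp from the DSB is converted."""
    return math.exp(-distance_bp / lam_bp) * p_toward_donor

def sample_synthesis_lengths(lam_bp, n, seed=0):
    """Draw n synthesis lengths from an exponential distribution with mean lam_bp."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / lam_bp) for _ in range(n)]

lam = 1000.0  # illustrative mean synthesis length in bp
samples = sample_synthesis_lengths(lam, 20000)
print("simulated mean synthesis length:", round(sum(samples) / len(samples)))
print("mean tract with 1 kb resection:", mean_tract_length(1000.0, lam))
for d in (0, 500, 1000, 2000):
    print(f"marker at {d:4d} bp: P(conversion) = {p_conversion(d, lam):.3f}")
```

Under these assumptions the conversion probability halves for roughly every 0.7 λ of distance from the break, which is why markers close to a DSB hotspot show the highest conversion frequencies.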

1.3 The RecBCD Pathway (E. coli)

In E. coli, the RecBCD enzyme initiates homologous recombination at double-strand breaks, playing a critical role in DNA repair and in degrading foreign linear DNA.

RecBCD Enzyme: A Molecular Machine

  • RecB: 3'→5' helicase and nuclease; the primary nuclease domain that degrades the 3'-terminated strand
  • RecC: Recognizes the Chi site; no enzymatic activity but serves as the molecular “scanner”
  • RecD: 5'→3' helicase (faster motor); drives the leading edge of unwinding
  • Processes dsDNA at ~1,000 bp/sec, one of the fastest helicases known

Chi Site Recognition

The Chi (crossover hotspot instigator) sequence is 5'-GCTGGTGG-3'. It occurs every ~4.5 kb in E. coli and is over-represented in the genome.

Upon encountering Chi, RecBCD undergoes a dramatic conformational change: the RecB nuclease switches from degrading the 3'-strand to degrading the 5'-strand. This produces a 3' single-stranded tail with the Chi sequence at its tip — the ideal substrate for RecA loading. RecBCD itself loads RecA onto this tail, preventing SSB from competing for binding.

Biological Significance of Chi

Chi sites serve dual functions: (1) They mark DNA as “self” — foreign DNA (phage, plasmid) lacking Chi is degraded by RecBCD, providing innate immunity. (2) They stimulate recombination up to 10-fold locally, acting as recombination hotspots. Chi is recognized only on one strand during RecBCD translocation, explaining its orientation-dependent stimulation of recombination in genetic crosses.
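Because Chi is recognized on only one strand, a sequence scan has to report hits separately for each orientation. The following sketch searches a DNA string for the Chi octamer and for its reverse complement. The toy input sequence is hypothetical, and which orientation is active for a given RecBCD entry point is left to the reader.

```python
CHI = "GCTGGTGG"  # Chi sequence as given in the text, written 5' to 3'

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_chi(seq):
    """Return (top_hits, bottom_hits) as 0-based start positions on the given strand.

    top_hits: Chi reads 5'->3' on the strand supplied.
    bottom_hits: Chi reads 5'->3' on the complementary strand.
    """
    seq = seq.upper()
    rc_chi = reverse_complement(CHI)
    last = len(seq) - len(CHI) + 1
    top = [i for i in range(last) if seq.startswith(CHI, i)]
    bottom = [i for i in range(last) if seq.startswith(rc_chi, i)]
    return top, bottom

# Hypothetical toy sequence with one Chi site in each orientation.
toy = "AAGCTGGTGGTTCCACCAGCAA"
print(find_chi(toy))
```

Counting hits per orientation across a genome is one way to look at the strand bias of Chi relative to the direction of replication.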

1.4 RecA/Rad51: The Universal Recombinase

RecA (bacteria) and Rad51 (eukaryotes) are the central catalysts of homologous recombination. They form a helical nucleoprotein filament on ssDNA that performs the remarkable feat of searching the entire genome for homologous sequences and catalyzing strand exchange.

Nucleoprotein Filament Assembly

RecA binds ssDNA cooperatively in an ATP-dependent manner. Each RecA monomer binds 3 nucleotides, and the filament has a helical pitch of ~95 Angstroms with 6 monomers per turn. The DNA within the filament is stretched to ~150% of B-form length (5.1 Angstrom rise per nucleotide vs 3.4 Angstrom per bp in B-form). This stretching is critical: it underwinds the DNA, making the bases accessible for homology testing. In eukaryotes, Rad51 filament assembly requires mediators: BRCA2 loads Rad51 onto RPA-coated ssDNA, displacing RPA. Loss of BRCA2 function causes HR deficiency and predisposition to breast and ovarian cancer.
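The filament numbers quoted above can be cross-checked against each other. This small sketch multiplies out the stoichiometry and rise values from the text; the variable names are chosen for this example only.

```python
NT_PER_MONOMER = 3        # nucleotides bound per RecA monomer
MONOMERS_PER_TURN = 6     # RecA monomers per helical turn
RISE_FILAMENT_A = 5.1     # axial rise per nucleotide in the filament, angstroms
RISE_B_FORM_A = 3.4       # axial rise per base pair in B-form DNA, angstroms

nt_per_turn = NT_PER_MONOMER * MONOMERS_PER_TURN
pitch_a = nt_per_turn * RISE_FILAMENT_A
extension = RISE_FILAMENT_A / RISE_B_FORM_A

print(f"nucleotides per turn: {nt_per_turn}")
print(f"implied pitch: {pitch_a:.1f} angstroms")
print(f"extension relative to B-form: {extension:.0%}")
```

The implied pitch lands close to the quoted ~95 Angstroms, and the rise ratio reproduces the ~150% extension.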

Homology Search: 3D Genome Scanning

The filament must find a homologous sequence among millions of base pairs. Single-molecule studies reveal this occurs by: (1) 3D diffusion-driven sampling: the filament randomly contacts dsDNA segments. (2) Rapid 1D sliding (~70 bp on average) along contacted DNA for local homology testing. (3) Microhomology testing in 8-nucleotide steps: the filament tests triplets of bases through the major groove without full strand separation. As few as ~8 nt of contiguous homology trigger stable pairing; ~15 nt lead to irreversible strand exchange. The entire search process takes minutes in vivo despite the genome's vast size.

Strand Exchange Mechanism

Strand exchange proceeds through two intermediates: (1) Paranemic joint: initial pairing in which the incoming ssDNA wraps around the dsDNA without topological intertwining. Watson-Crick base pairs form between the incoming strand and the complementary strand of the duplex, but the displaced strand is not fully released. (2) Plectonemic joint: the strands become topologically interlocked, with the incoming strand fully replacing the displaced strand. This transition requires DNA rotation and is essentially irreversible. ATP hydrolysis by RecA drives filament disassembly from the 5' end, which allows recycling, but it is not required for strand exchange per se (RecA-ATPγS supports exchange but not turnover).

1.5 Meiotic Recombination

Meiotic recombination is a programmed process that generates crossovers between homologous chromosomes. These crossovers (chiasmata) provide the physical connections needed for proper chromosome segregation at meiosis I, while also generating genetic diversity.

Spo11: Programmed DSB Formation

Meiotic recombination initiates with deliberate DSBs created by Spo11, a topoisomerase II-like transesterase. Spo11 acts as a homodimer, with each subunit covalently attached to a 5' end of the break via a tyrosine-phosphodiester bond (Tyr135 in S. cerevisiae). Release of Spo11 with a short oligonucleotide (Spo11-oligo complex) by MRN/Sae2 generates the substrate for long-range resection. Humans form ~200-300 DSBs per meiosis, but only ~30 become crossovers; the rest are repaired as non-crossovers.

Crossover vs Non-Crossover Decision

The CO/NCO decision occurs early, likely at or before stable strand exchange. Two pathways produce crossovers: (1) Class I COs (~85% in yeast): require the ZMM proteins (Zip1-4, Msh4-Msh5, Mer3, Spo16). Msh4-Msh5 forms a sliding clamp that stabilizes dHJ intermediates and promotes their resolution as crossovers by the MutLγ complex (Mlh1-Mlh3). Class I COs exhibit interference. (2) Class II COs (~15%): generated by Mus81-Mms4 resolvase; do not exhibit interference. Non-crossovers are produced primarily by synthesis-dependent strand annealing (SDSA), where the invading strand is displaced after limited synthesis and reanneals with the other resected end.

Crossover Interference

Crossover interference is the phenomenon where one crossover inhibits the formation of additional crossovers nearby on the same chromosome. This ensures even spacing of crossovers and guarantees the “obligate crossover” — at least one CO per chromosome (or per chromosome arm in some organisms). Interference is modeled by the counting model (or gamma model): a fixed number m of non-crossover intermediates must be resolved between successive COs. The coefficient of coincidence (COC) measures interference:

$$\text{COC} = \frac{\text{Observed double crossovers}}{\text{Expected double crossovers}} = \frac{f_{AB}}{f_A \times f_B}$$

COC = 0 means complete interference; COC = 1 means no interference. The interference value I = 1 - COC. Physical basis remains debated: leading models include mechanical stress propagation along the synaptonemal complex and diffusion of a crossover-promoting factor that is locally depleted by each CO event.
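As a worked instance of the COC formula, the sketch below computes interference from a three-point cross. The counts and recombination frequencies are made-up illustrative values, and the function names are chosen for this example.

```python
def coefficient_of_coincidence(n_double, n_total, rf_a, rf_b):
    """Observed double crossovers divided by the number expected under independence.

    rf_a and rf_b are the recombination frequencies of the two adjacent intervals.
    """
    expected = rf_a * rf_b * n_total
    if expected == 0:
        raise ValueError("expected double crossovers is zero; COC is undefined")
    return n_double / expected

def interference(coc):
    """Interference value I = 1 - COC."""
    return 1.0 - coc

# Illustrative data: 1000 progeny, intervals of 10% and 20% recombination,
# 8 observed double crossovers (20 expected if crossovers were independent).
coc = coefficient_of_coincidence(n_double=8, n_total=1000, rf_a=0.10, rf_b=0.20)
print(f"COC = {coc:.2f}, I = {interference(coc):.2f}")
```

With these numbers fewer than half of the expected double crossovers are observed, which indicates positive interference.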

Obligate Crossover and ZMM Proteins

Every chromosome pair must receive at least one crossover for proper disjunction. Failure causes nondisjunction, aneuploidy, and conditions like Down syndrome (trisomy 21). The ZMM group proteins are essential for this: Zip1/SYCP1 forms the transverse filaments of the synaptonemal complex, Zip3/RNF212 marks crossover-designated sites with SUMO/ubiquitin, and Msh4-Msh5 (MutSγ) stabilizes recombination intermediates at these sites. Mer3 is a helicase that promotes 3' extension during D-loop formation specifically at crossover-designated sites.

2. Site-Specific Recombination

Unlike homologous recombination, site-specific recombination occurs between specific DNA sequences recognized by dedicated recombinases. These enzymes catalyze precise DNA rearrangements including integration, excision, inversion, and resolution. Two major families exist, distinguished by their catalytic mechanism and the amino acid that forms the covalent DNA-protein intermediate.

2.1 Integrase Family (Tyrosine Recombinases)

Tyrosine recombinases use a conserved tyrosine residue to attack the DNA phosphodiester backbone, forming a 3'-phosphotyrosine covalent intermediate. Recombination proceeds through a Holliday junction intermediate with sequential pairs of strand exchanges.

Lambda Phage Integration

The paradigm for site-specific recombination. Phage λ integrase (Int) catalyzes recombination between attP (240 bp on phage) and attB (25 bp on E. coli chromosome) to produce the integrated prophage flanked by attL and attR:

Integration: attB × attP → attL — prophage — attR

Excision: attL × attR → attB + attP (requires Xis + IHF)

Int has two DNA-binding domains: the N-terminal “arm” domain binds arm-type sites (P1, P2, P'1-3) in attP, while the C-terminal catalytic domain binds core-type sites flanking the 7-bp overlap region where strand exchange occurs. Integration requires Int + IHF (integration host factor); excision additionally requires Xis (excisionase), which bends DNA to favor the excision topology. This directional control ensures lysogeny is stable until the SOS response induces Xis expression.

Cre-lox System

Cre recombinase (from phage P1) catalyzes recombination between 34-bp loxP sites without accessory factors. Each loxP site has two 13-bp inverted repeats flanking an 8-bp asymmetric spacer that determines directionality. The outcome depends on loxP orientation: direct repeats → excision (deletion) of intervening DNA; inverted repeats → inversion; intermolecular → integration. In molecular biology, Cre-lox is used extensively for conditional knockouts: a gene is flanked by loxP sites (“floxed”), and Cre expression under a tissue-specific promoter deletes the gene only in target tissues. Inducible versions (CreERT2, activated by tamoxifen) add temporal control. Cre-dependent reporters (e.g., ROSA26-loxP-STOP-loxP-GFP) enable lineage tracing.
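The architecture of a loxP site and the orientation rule can be made concrete with a short sketch. The loxP sequence used here is the commonly cited wild-type site and does not come from the text above, so treat it as an assumption. The `cre_outcome` helper is a hypothetical illustration of the orientation logic, not a model of the enzyme.

```python
# Commonly cited wild-type loxP site (assumption; not taken from the text above).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def split_loxp(site):
    """Split a 34-bp site into (left 13-bp arm, 8-bp spacer, right 13-bp arm)."""
    if len(site) != 34:
        raise ValueError("a loxP site is 34 bp long")
    return site[:13], site[13:21], site[21:]

def cre_outcome(orientation_1, orientation_2, same_molecule=True):
    """Predict the recombination outcome from the orientations of two loxP sites."""
    if not same_molecule:
        return "integration"
    return "excision" if orientation_1 == orientation_2 else "inversion"

left, spacer, right = split_loxp(LOXP)
print("arms are inverted repeats:", reverse_complement(left) == right)
print("spacer is asymmetric:", reverse_complement(spacer) != spacer)
print(cre_outcome("+", "+"), cre_outcome("+", "-"), cre_outcome("+", "+", same_molecule=False))
```

The asymmetric spacer is what gives each site a direction. If the spacer were palindromic, direct and inverted arrangements would be indistinguishable.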

FLP-FRT System

The FLP recombinase from the yeast 2-micron plasmid acts on FRT (FLP recombinase target) sites. Mechanistically analogous to Cre-lox, FLP-FRT is widely used in Drosophila genetics for generating mitotic clones (MARCM system) and in mammalian cells for recombinase-mediated cassette exchange (RMCE). FLP has a lower optimal temperature (30°C) than Cre (37°C), making FLPe (enhanced thermostable variant) preferred for mammalian applications.

2.2 Resolvase/Invertase Family (Serine Recombinases)

Serine recombinases use a serine nucleophile to form a 5'-phosphoserine intermediate. All four strands are cleaved simultaneously, and recombination proceeds through a 180-degree subunit rotation mechanism without a Holliday junction intermediate.

Hin Recombinase

Controls flagellar phase variation in Salmonella typhimurium. Hin inverts a ~1 kb DNA segment containing the promoter for the H2 flagellin gene. In one orientation, H2 flagellin and a repressor of H1 are expressed; in the other, H1 is expressed. This allows the bacterium to evade the host immune response by alternating surface antigens at a frequency of ~10⁻³ per cell per generation. The inversion requires Hin recombinase, an enhancer sequence, and the Fis protein.

γδ Resolvase

Resolves cointegrate intermediates during replicative transposition of the γδ (Tn1000) element. Two copies of the res site are brought together (one from each copy of the transposon in the cointegrate), and resolvase catalyzes recombination to separate the donor and target replicons. The reaction requires supercoiled substrate and occurs within a synaptic complex called the synaptosome, which traps three negative supercoils between the res sites. Crystal structures of the γδ resolvase tetramer revealed the 180-degree rotation mechanism.

2.3 V(D)J Recombination: Generating Antibody Diversity

V(D)J recombination is the somatic DNA rearrangement that assembles functional immunoglobulin and T-cell receptor genes from variable (V), diversity (D), and joining (J) gene segments. This process generates the enormous diversity of the adaptive immune system.

RAG1/RAG2 Recombinase

The recombination-activating genes RAG1 and RAG2 form the V(D)J recombinase. RAG1 contains the catalytic DDE motif (shared with transposases) and mediates DNA cleavage. RAG2 is required for DNA binding and stabilization of the synaptic complex. Structurally, the RAG1/2 complex resembles a transposase, supporting the hypothesis that V(D)J recombination evolved from an ancient transposon that was “domesticated” by the vertebrate immune system ~500 million years ago. In vitro, RAG1/2 can catalyze transposition, reinforcing this evolutionary connection.

Recombination Signal Sequences (RSS) and the 12/23 Rule

Each V, D, and J segment is flanked by RSS consisting of conserved heptamer (5'-CACAGTG-3') and nonamer (5'-ACAAAAACC-3') sequences separated by a spacer of either 12 or 23 bp. The 12/23 rule dictates that recombination occurs only between a 12-RSS and a 23-RSS, ensuring that V segments join to D (not other V segments) and D joins to J. The spacer lengths correspond to one or two turns of the DNA helix, placing the heptamer and nonamer on the same face of the DNA.

V segment —[23-RSS]— × —[12-RSS]— D segment —[12-RSS]— × —[23-RSS]— J segment
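The 12/23 rule reduces to a simple set comparison, sketched below for the heavy-chain layout shown in the diagram. The dictionary assigns one spacer length per segment type, which is a simplification since D segments carry a 12-RSS on both sides. The names are chosen for this example.

```python
from itertools import combinations_with_replacement

# RSS spacer length contributed by each segment type, following the diagram above.
RSS_SPACER = {"V": 23, "D": 12, "J": 23}

def obeys_12_23_rule(spacer_a, spacer_b):
    """True only when one RSS has a 12-bp spacer and the other a 23-bp spacer."""
    return {spacer_a, spacer_b} == {12, 23}

def allowed_joins(rss_spacer=RSS_SPACER):
    """List the segment pairings permitted by the 12/23 rule."""
    return [
        (a, b)
        for a, b in combinations_with_replacement(sorted(rss_spacer), 2)
        if obeys_12_23_rule(rss_spacer[a], rss_spacer[b])
    ]

print(allowed_joins())
```

Only the D-J and V-D pairings pass, so direct V-to-J and V-to-V joining is excluded at a locus with this arrangement.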

Cleavage and Joining Mechanism

RAG1/2 introduces a nick between the coding segment and the RSS heptamer, generating a 3'-OH. This hydroxyl attacks the phosphodiester bond on the opposite strand in a transesterification reaction, creating a hairpin at the coding end and a blunt signal end. Signal ends are joined precisely (signal joint). Coding ends are opened by the Artemis nuclease (activated by DNA-PKcs) at variable positions, creating diversity. The opened ends are then joined by non-homologous end joining (NHEJ): Ku70/Ku80 → DNA-PKcs → Artemis → polymerases μ/λ → XLF-XRCC4-Ligase IV.

Junctional Diversity

P-nucleotides

Palindromic nucleotides added when the hairpin is opened asymmetrically. The resulting single-stranded overhang is filled in, creating a short palindromic insertion at the coding joint.

N-nucleotides (TdT)

Terminal deoxynucleotidyl transferase (TdT) adds random nucleotides (up to ~15) to coding ends in a template-independent manner. TdT is expressed only in lymphocytes and is the major source of junctional diversity. It preferentially adds G nucleotides.

Total Antibody Diversity

Combinatorial diversity: ~50 VH × ~27 DH × ~6 JH ≈ 8,100 heavy chains; ~40 Vκ × ~5 Jκ ≈ 200 light chains. Heavy × light pairing: 8,100 × 200 ≈ 1.6 × 10⁶. Adding junctional diversity (P- and N-nucleotides) increases this to an estimated ~10¹¹ unique antibody specificities. Somatic hypermutation (AID-dependent) in germinal centers further increases diversity after antigen encounter.
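The combinatorial arithmetic can be reproduced in a few lines, which also shows how much of the total must come from junctional diversity. The segment counts and the ~10¹¹ total are the approximate figures quoted above; the variable names are chosen for this example.

```python
# Approximate segment counts quoted in the text.
heavy_chains = 50 * 27 * 6           # V_H x D_H x J_H
light_chains = 40 * 5                # V_kappa x J_kappa
combinatorial = heavy_chains * light_chains

TOTAL_ESTIMATE = 1e11                # estimated total specificities from the text
junctional_factor = TOTAL_ESTIMATE / combinatorial

print(f"heavy chains: {heavy_chains}")
print(f"kappa light chains: {light_chains}")
print(f"heavy x light pairings: {combinatorial:.2e}")
print(f"implied junctional multiplier: {junctional_factor:.1e}")
```

On these figures, junctional diversity contributes a factor of several tens of thousands, far more than segment choice alone.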

3. Transposable Elements (TEs)

Transposable elements are DNA sequences capable of moving (transposing) from one genomic location to another. Discovered by Barbara McClintock in maize (1948, Nobel Prize 1983), TEs are now recognized as the single largest component of most eukaryotic genomes. They are classified by transposition mechanism into Class I (retrotransposons, copy-and-paste via RNA) and Class II (DNA transposons, cut-and-paste via DNA).

3.1 DNA Transposons (Class II): Cut-and-Paste

Structure and Mechanism

DNA transposons encode a transposase flanked by terminal inverted repeats (TIRs, typically 10-40 bp). The transposase recognizes and binds the TIRs, excises the element from the donor site, and inserts it into a new target site. Upon insertion, the transposase makes staggered cuts in the target DNA; repair of the resulting gaps generates target site duplications (TSDs) of characteristic length (e.g., 9 bp for IS10, 5 bp for Tn3). TSDs are the hallmark of transposition and are used to identify transposon insertions computationally.
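How a staggered cut produces a target site duplication can be shown on a single strand of sequence. In this sketch the element joins the two offset cut positions, and gap repair copies the bases between them onto both flanks. The toy target and element strings are hypothetical, and the function name is chosen for this example.

```python
def insert_with_tsd(target, cut_pos, element, stagger):
    """Insert element at a staggered cut and return (product, tsd).

    The bases target[cut_pos:cut_pos + stagger] lie between the two offset nicks;
    after gap repair they appear on both sides of the inserted element.
    """
    tsd = target[cut_pos:cut_pos + stagger]
    if len(tsd) != stagger:
        raise ValueError("staggered cut extends past the end of the target")
    product = target[:cut_pos + stagger] + element + target[cut_pos:]
    return product, tsd

# Hypothetical target with a 9-bp stagger, as for IS10; element shown in lowercase.
target = "GGGG" + "ATCGATCGA" + "CCCC"
product, tsd = insert_with_tsd(target, cut_pos=4, element="tnnnnt", stagger=9)
print(tsd)
print(product)
```

Running the same logic in reverse, by looking for short identical sequences immediately flanking a candidate element, is the basis for computational detection of insertions.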

Tn10 / IS10 (Conservative Transposition)

Tn10 is a composite transposon carrying tetracycline resistance flanked by IS10 elements. IS10 transposase catalyzes cut-and-paste transposition via a hairpin intermediate at the donor site (similar to V(D)J recombination). The excised transposon is not replicated, so copy number does not increase. Transposition is tightly regulated: (1) antisense RNA inhibits transposase translation, (2) Dam methylation of GATC sites in the IS10 promoter represses transcription (hemimethylated DNA after replication transiently derepresses), (3) the transposase acts preferentially in cis on the element that encoded it.

Tn3 (Replicative Transposition)

Tn3 (ampicillin resistance) transposes via a cointegrate intermediate, where donor and target replicons are fused with a copy of the transposon at each junction. The transposase (TnpA) catalyzes the initial strand transfer that creates the cointegrate. Resolution by the site-specific resolvase (TnpR) at the res site separates the replicons, each now carrying a copy of Tn3. This replicative mechanism increases transposon copy number. TnpR also acts as a repressor of both tnpA and tnpR transcription, providing autoregulation.

Eukaryotic DNA Transposons

DNA transposons comprise ~3% of the human genome but are all inactive (“fossils”). Major superfamilies include hAT (hobo/Ac/Tam3), Tc1/mariner, and PiggyBac. The Sleeping Beauty transposon was “resurrected” from salmonid fish fossils by correcting inactivating mutations, creating a powerful gene transfer tool. PiggyBac is used in gene therapy and transgenesis due to its large cargo capacity (>100 kb) and precise excision (leaving no footprint). Helitrons, a distinct class, transpose by a rolling-circle mechanism and have captured thousands of host gene fragments in maize and other plants.

3.2 Retrotransposons (Class I): Copy-and-Paste via RNA

LTR Retrotransposons

Structurally and mechanistically similar to retroviruses. Flanked by long terminal repeats (LTRs, 200-500 bp) containing promoter and polyadenylation signals. Internal genes include gag (structural proteins forming virus-like particles) and pol (protease, reverse transcriptase, RNase H, integrase). The element is transcribed into mRNA, which is both translated (to produce Gag-Pol polyprotein) and packaged into VLPs. Reverse transcription within the VLP produces dsDNA, which is imported into the nucleus and integrated by the integrase (DDE motif, same fold as retroviral integrase and RAG1). LTR retrotransposons include Ty1/Ty3 in yeast, copia/gypsy in Drosophila, and ERVs (endogenous retroviruses) in mammals. Human ERVs (HERVs) comprise ~8% of our genome; HERV-W envelope protein was co-opted as syncytin for placental development.

LINE Elements (Long Interspersed Nuclear Elements)

LINEs are non-LTR retrotransposons that are the dominant TEs in mammalian genomes. Human LINE-1 (L1) is 6 kb full-length and encodes:

  • ORF1p: RNA-binding protein with nucleic acid chaperone activity. Forms trimers that coat L1 RNA (essential for retrotransposition but not catalytic).
  • ORF2p: Bifunctional enzyme with (1) an N-terminal endonuclease (APE-like, nicks target DNA at 5'-TTTT/AA-3' consensus) and (2) a C-terminal reverse transcriptase that copies L1 RNA into DNA at the target site.

L1 retrotransposes by target-primed reverse transcription (TPRT): (1) ORF2p EN nicks the bottom strand of target DNA. (2) The exposed 3'-OH primes reverse transcription of L1 RNA by ORF2p RT. (3) Second-strand synthesis and integration complete the insertion. Most L1 copies are 5'-truncated due to premature termination of reverse transcription. Of ~500,000 L1 copies in the human genome, only ~100 are retrotransposition-competent. L1 activity is suppressed by DNA methylation, piRNA pathway (PIWI proteins), and APOBEC3 cytidine deaminases.

SINE Elements (Short Interspersed Nuclear Elements)

SINEs are non-autonomous elements (~100-400 bp) that parasitize LINE machinery for retrotransposition. They evolved from RNA polymerase III transcripts (tRNA or 7SL RNA). Alu elements, derived from 7SL RNA, are the most abundant TEs in the human genome: ~1.1 million copies comprising ~11% of the genome (~300 bp each). Alu is a dimeric element with left and right monomers joined by an A-rich linker and ending with an A-tail. Alu is transcribed by RNA Pol III and retrotransposes using L1 ORF2p (and possibly ORF1p). Alu insertions cause ~0.1% of human genetic diseases through insertional mutagenesis and Alu-Alu recombination causing deletions. However, Alu elements also provide regulatory innovation: they contain transcription factor binding sites, influence mRNA splicing (Alu exonization), and serve as A-to-I RNA editing substrates (ADAR acts on Alu dsRNA structures in 3' UTRs).

3.3 Genomic Impact of Transposable Elements

Composition of the Human Genome

  • LINEs: ~20% (~500,000 L1 copies, ~17% genome)
  • SINEs: ~13% (~1.8 million copies; Alu ~11%)
  • LTR elements: ~8% (HERVs, MaLR)
  • DNA transposons: ~3% (all inactive)
  • Total TE-derived: ~45% (vs ~1.5% protein-coding)
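The percentages in this list can be sanity-checked against the copy numbers quoted earlier. The sketch below assumes a haploid genome size of 3.1 × 10⁹ bp, a figure not given in the text, and uses the rounded class percentages from the list. The variable names are chosen for this example.

```python
GENOME_BP = 3.1e9          # assumed haploid human genome size in bp (not from the text)

# Alu: ~1.1 million copies of ~300 bp each.
alu_fraction = 1.1e6 * 300 / GENOME_BP

# Rounded class percentages from the list above.
te_classes = {
    "LINEs": 20,
    "SINEs": 13,
    "LTR elements": 8,
    "DNA transposons": 3,
}
total_percent = sum(te_classes.values())

print(f"Alu share of genome: {alu_fraction:.1%}")
print(f"sum of TE classes: {total_percent}%")
```

Both checks agree with the quoted values to within rounding: Alu comes out near 11%, and the four classes sum to about 45%.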

Functional Consequences

  • Insertional mutagenesis: ~100 disease-causing L1 insertions documented
  • Chromosomal rearrangements: Alu-Alu recombination, L1-mediated deletions
  • Gene regulation (exaptation): TE-derived enhancers, promoters, insulators
  • New genes: Syncytin (HERV-W), RAG1/2 (transposon-derived)
  • Epigenetic effects: TE methylation spreads to nearby genes

The Genome as a TE Ecosystem

The relationship between TEs and their host genome is best understood as an ecological interaction. TEs are “selfish genetic elements” whose proliferation is countered by host defense mechanisms (DNA methylation, piRNA, KRAB-ZFPs). The balance between TE amplification and host suppression determines the TE content of each species. Maize is ~85% TE-derived; Arabidopsis ~15%; pufferfish (Takifugu) only ~3%. Polyploidy events often trigger transposon bursts due to epigenetic relaxation (“genomic shock” as described by McClintock). Over time, most TE copies accumulate inactivating mutations and become “fossils,” but their sequences remain as a record of past transposition activity.

4. CRISPR-Cas: Bacterial Adaptive Immunity

CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) and Cas (CRISPR-associated) proteins constitute a prokaryotic adaptive immune system that provides sequence-specific defense against bacteriophages and plasmids. Discovered in E. coli (1987) and functionally characterized as an immune system (2007), CRISPR-Cas has since been repurposed as one of the most powerful genome editing tools available.

Three Stages of CRISPR Immunity

Stage 1: Spacer Acquisition (Adaptation)

When a new phage infects, the Cas1-Cas2 complex (the universal “adaptation module”) captures a short fragment (~30 bp) of the invader's DNA and integrates it as a new spacer at the leader end of the CRISPR array. Spacers are separated by conserved repeat sequences (~36 bp). The Cas1-Cas2 complex is structurally a heterohexamer (Cas1$_4$-Cas2$_2$) that catalyzes integration via a mechanism analogous to retroviral integration. In Type I systems, Cas4 trims the protospacer before integration. The PAM sequence on the invading DNA is recognized during acquisition, ensuring that spacers are functional for interference. Naive acquisition is rare (~$10^{-7}$ per infection), but primed acquisition (using existing partial immunity) is ~1000-fold more efficient.

Stage 2: crRNA Processing (Expression)

The CRISPR array is transcribed as a long pre-crRNA, which is processed into individual crRNAs (CRISPR RNAs), each containing one spacer flanked by partial repeats. Processing differs by system type: In Type I, Cas6 endoribonuclease cleaves within repeats. In Type II (Cas9), a trans-activating crRNA (tracrRNA) base-pairs with the repeat region, and RNase III cleaves the resulting dsRNA. In Type III, Cas6 processes pre-crRNA, then the 3' end is trimmed by unknown nucleases. The mature crRNA is loaded into the effector complex for target recognition.
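The processing step can be pictured with a toy string model: cut once inside every repeat and keep the fragments that carry a spacer. This is only a sketch. The repeat and spacer sequences, the function name, and the `cut_offset` parameter are hypothetical, and real cut sites depend on the system type.

```python
def process_pre_crrna(pre_crrna, repeat, cut_offset):
    """Cut a toy pre-crRNA once inside every repeat copy.

    cut_offset is the number of repeat bases left 5' of each cut
    (a hypothetical parameter; real cut sites depend on the system).
    Returns the fragments lying between consecutive cuts, each of which
    contains one complete spacer flanked by partial repeats.
    """
    # Locate every non-overlapping repeat copy in the transcript.
    starts = []
    pos = pre_crrna.find(repeat)
    while pos != -1:
        starts.append(pos)
        pos = pre_crrna.find(repeat, pos + len(repeat))
    cuts = [s + cut_offset for s in starts]
    # Fragment = 3' part of one repeat + spacer + 5' part of the next repeat.
    return [pre_crrna[a:b] for a, b in zip(cuts, cuts[1:])]


repeat = "GUUCCG"                   # toy repeat, not a real sequence
spacers = ["AAAAAA", "CACACA"]      # toy spacers
pre_crrna = repeat + repeat.join(spacers) + repeat
crrnas = process_pre_crrna(pre_crrna, repeat, cut_offset=4)
print(crrnas)
```

Each returned fragment has the layout described in the text: one spacer with partial repeat sequence on both sides.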

Stage 3: Target Interference

The crRNA-guided effector complex scans incoming DNA for sequences complementary to the spacer (the “protospacer”). For Cas9 (Type II-A):

  • PAM recognition: Cas9 first recognizes a short protospacer adjacent motif (PAM; 5'-NGG-3' for S. pyogenes Cas9) on the non-target strand. PAM recognition is mediated by the PAM-interacting domain (PID) and causes local DNA melting.
  • R-loop formation: The crRNA spacer invades the target strand, forming an RNA-DNA hybrid (R-loop) that propagates from the PAM-proximal “seed” region (8-12 nt). Mismatches in the seed abort interference.
  • DNA cleavage: Full R-loop formation triggers conformational activation of two nuclease domains: HNH cuts the target strand (complementary to crRNA) and RuvC cuts the non-target strand, producing a blunt DSB 3 bp upstream of the PAM.
  • Self vs non-self: The CRISPR repeat in the host array lacks a PAM adjacent to the spacer, preventing autoimmunity (self-targeting).
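The PAM and cut-site rules above can be sketched as a simple scan of one DNA strand. This is a minimal illustration, not a guide-design tool: the function name and the 20-nt default spacer length are assumptions, only the strand passed in is searched, and no seed or off-target scoring is attempted.

```python
def find_cas9_sites(seq, spacer_len=20):
    """Return candidate SpCas9 targets on the given strand.

    Each hit is (protospacer, pam, cut_index). The blunt cut falls
    between seq[cut_index - 1] and seq[cut_index].
    """
    seq = seq.upper()
    hits = []
    # The PAM (5'-NGG-3') sits immediately 3' of the protospacer.
    for pam_start in range(spacer_len, len(seq) - 2):
        pam = seq[pam_start:pam_start + 3]
        if pam[1:] == "GG":
            protospacer = seq[pam_start - spacer_len:pam_start]
            cut_index = pam_start - 3  # cut 3 bp upstream of the PAM
            hits.append((protospacer, pam, cut_index))
    return hits


example = "ACGT" * 5 + "TGG" + "AAAA"   # toy sequence with one PAM
for protospacer, pam, cut_index in find_cas9_sites(example):
    print(protospacer, pam, cut_index)
```

Scanning the reverse complement in the same way would cover targets on the opposite strand.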

Genome Editing Applications

The programmability of Cas9 (guide RNA determines target) has transformed molecular biology:

Gene Knockout

DSB repaired by error-prone NHEJ creates indels, disrupting gene function. Efficiency ~50-90% in most cell types.

Precise Editing (HDR)

DSB + donor template enables precise sequence changes via homology-directed repair. Efficiency ~5-50%, limited to dividing cells.

Base Editing

Catalytically impaired Cas9 (dCas9 or, in most current editors, a Cas9 nickase) fused to deaminases (CBE: C→T, ABE: A→G) enables point mutations without DSBs. No donor template needed.

Prime Editing

Cas9 nickase fused to a reverse transcriptase (RT) uses a prime editing guide RNA (pegRNA) to template precise edits. It can install all 12 types of point mutation as well as small insertions and deletions.

CRISPRi / CRISPRa

dCas9 fused to KRAB (repression) or VP64/p65/Rta (activation) for programmable gene regulation without DNA cutting.

Diagnostics (SHERLOCK/DETECTR)

The collateral cleavage activities of Cas13 (RNA targeting) and Cas12 (DNA targeting) enable rapid, sensitive nucleic acid detection (e.g., SARS-CoV-2).

5. Recombination in Evolution

Recombination and mobile elements are among the most powerful forces shaping genome evolution. They create the raw material for natural selection by generating new gene combinations, duplicating genes, and rearranging chromosomal architecture.

Horizontal Gene Transfer (HGT)

Recombination enables integration of foreign DNA acquired via transformation, transduction, or conjugation. HGT is pervasive in prokaryotes: up to ~30% of genes in some bacteria have been acquired horizontally. Antibiotic resistance genes spread via conjugative plasmids and integrons, with site-specific recombination integrating resistance cassettes. Pathogenicity islands (10-200 kb regions encoding virulence factors) are acquired by HGT and integrated at tRNA genes by phage-like integrases.

Gene Duplication

Unequal crossing over between misaligned repeats (including TEs) generates tandem gene duplications. The duplicated gene is freed from selective constraint and can acquire new functions (neofunctionalization), partition ancestral functions (subfunctionalization), or degenerate into a pseudogene. The globin gene family (α, β, myoglobin, neuroglobin) exemplifies progressive duplication and divergence from an ancestral single-domain globin over ~800 million years.
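A toy model makes the outcome of unequal crossing over concrete. In this sketch chromosomes are plain lists of labeled elements, and the function name and labels are hypothetical; it only illustrates how an exchange between misaligned repeats yields one duplication product and one deletion product.

```python
def unequal_crossover(chrom_a, chrom_b, i, j):
    """Exchange between element i of chrom_a and element j of chrom_b.

    The two elements must be homologous (same label). Returns the two
    reciprocal recombinant products.
    """
    if chrom_a[i] != chrom_b[j]:
        raise ValueError("crossover requires homologous elements")
    product_1 = chrom_a[:i + 1] + chrom_b[j + 1:]
    product_2 = chrom_b[:j + 1] + chrom_a[i + 1:]
    return product_1, product_2


# A gene flanked by two copies of the same TE (toy labels).
chromosome = ["TE", "gene", "TE"]
# Misalignment: the downstream TE on one homolog pairs with the
# upstream TE on the other.
duplication, deletion = unequal_crossover(chromosome, chromosome, 2, 0)
print(duplication)
print(deletion)
```

One product carries two tandem copies of the gene and the reciprocal product has lost it, which is the starting point for the divergence outcomes listed above.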

Exon Shuffling

Recombination within introns can juxtapose exons from different genes, creating chimeric proteins with novel domain combinations. This is facilitated by phase-compatible introns (introns at the same position within a codon). L1-mediated 3' transduction (when L1 retrotransposes downstream flanking sequence) can also mobilize exons. Many modular proteins (e.g., tissue plasminogen activator with kringle, EGF, and fibronectin domains) evolved by exon shuffling.

Genome Plasticity

TE-driven genome expansion and contraction create dramatic size variation: Paris japonica (150 Gb, 50× human) vs Utricularia gibba (82 Mb). Segmental duplications mediated by TE-rich regions cause recurrent genomic disorders (e.g., 22q11.2 deletion/DiGeorge syndrome). Chromosome rearrangements (inversions, translocations) at TE sites can drive speciation by creating reproductive barriers. TE domestication contributes novel regulatory networks: the interferon response in mammals uses thousands of TE-derived binding sites for IRF and STAT transcription factors.

Mathematical Framework

Recombination Frequency and Genetic Mapping

The recombination frequency (RF) between two loci is proportional to their physical distance for small distances, but saturates at 50% for distant loci (independent assortment). Map functions relate the true genetic distance d (in Morgans, where 1 Morgan is the interval over which one crossover is expected per chromatid per meiosis) to the observed recombination frequency r.

Haldane Map Function (No Interference)

Assumes crossovers are distributed as a Poisson process (no interference). The probability of an odd number of crossovers (which produces a recombinant) is:

$$r = \frac{1}{2}\left(1 - e^{-2d}\right)$$

Inverse: $d = -\frac{1}{2}\ln(1 - 2r)$

As $d \to \infty$, $r \to 0.5$ (independent assortment). For small $d$, $r \approx d$ (1 cM ≈ 1% recombination).
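The Haldane function and its inverse translate directly into code. The following is a minimal sketch with illustrative function names; distances are in Morgans.

```python
import math


def haldane_r(d):
    """Recombination frequency for a genetic distance d (Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def haldane_d(r):
    """Genetic distance (Morgans) for a recombination frequency r < 0.5."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return -0.5 * math.log(1.0 - 2.0 * r)


for d in (0.01, 0.1, 0.5, 2.0):
    print(f"d = {d:4.2f} M -> r = {haldane_r(d):.4f}")
```

The inverse is undefined at r = 0.5, which reflects the fact that unlinked loci carry no information about map distance.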

Kosambi Map Function (With Interference)

Accounts for positive crossover interference, where one crossover inhibits nearby crossovers:

$$r = \frac{1}{2}\tanh(2d) = \frac{1}{2}\cdot\frac{e^{4d} - 1}{e^{4d} + 1}$$

Inverse: $d = \frac{1}{4}\ln\left(\frac{1 + 2r}{1 - 2r}\right)$

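The Kosambi function gives a higher RF than Haldane for the same genetic distance, because interference reduces the double crossovers that would otherwise restore the parental configuration and go undetected. Recombination fractions across adjacent intervals combine by Kosambi's addition formula: if $r_{AC}$ is the RF for interval A-C, and B lies between A and C: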

$$r_{AC} = \frac{r_{AB} + r_{BC}}{1 + 4\,r_{AB}\,r_{BC}}$$
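The addition formula follows from the identity for $\tanh(a + b)$ and can be checked numerically. This is a small sketch with illustrative function names and arbitrary example distances.

```python
import math


def kosambi_r(d):
    """Recombination frequency for a genetic distance d (Morgans)."""
    return 0.5 * math.tanh(2.0 * d)


def kosambi_d(r):
    """Genetic distance (Morgans) for a recombination frequency r < 0.5."""
    return 0.25 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_add(r_ab, r_bc):
    """Combine RFs of adjacent intervals A-B and B-C into the A-C RF."""
    return (r_ab + r_bc) / (1.0 + 4.0 * r_ab * r_bc)


d_ab, d_bc = 0.10, 0.15                      # arbitrary example distances
r_ac_direct = kosambi_r(d_ab + d_bc)         # map distances add
r_ac_formula = kosambi_add(kosambi_r(d_ab), kosambi_r(d_bc))
print(r_ac_direct, r_ac_formula)
```

The two values agree because map distances are additive while recombination fractions are not.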

Coefficient of Coincidence and Interference

$$\text{COC} = \frac{\text{Observed double recombinants}}{\text{Expected double recombinants}} = \frac{f_{\text{double CO}}}{r_1 \times r_2}$$
$$\text{Interference}\;(I) = 1 - \text{COC}$$

$I = 1$: complete interference (no double COs); $I = 0$: no interference (Haldane); $I < 0$: negative interference (excess double COs, observed in some fungi).
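For a three-point cross, COC and interference can be computed from progeny counts. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; the function name is illustrative.

```python
def coincidence_and_interference(n_total, single_1, single_2, double):
    """Return (COC, I) from three-point cross progeny counts.

    single_1 / single_2: progeny recombinant only in interval 1 / 2.
    double: progeny recombinant in both intervals.
    """
    # Double recombinants count toward the RF of both intervals.
    r1 = (single_1 + double) / n_total
    r2 = (single_2 + double) / n_total
    expected_double = r1 * r2 * n_total
    coc = double / expected_double
    return coc, 1.0 - coc


# Hypothetical counts: r1 = 0.10, r2 = 0.20, so 20 doubles are expected.
coc, interference = coincidence_and_interference(
    n_total=1000, single_1=90, single_2=190, double=10)
print(f"COC = {coc:.2f}, I = {interference:.2f}")
```

With these counts only half of the expected double recombinants are observed, which corresponds to positive interference.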

Derivation: The Haldane Mapping Function

Starting from the Poisson distribution of crossover events, we derive the relationship between genetic distance d (in Morgans) and recombination frequency r.

Step 1: Assume crossovers follow a Poisson process

Under the Haldane model (no crossover interference), the number of crossovers k that a chromatid experiences in an interval of genetic length d Morgans follows a Poisson distribution with mean m = d, because one Morgan is defined as one expected crossover per chromatid. (Equivalently, the four-strand bundle carries a Poisson number of crossovers with mean 2d, and each crossover involves a given chromatid with probability 1/2.)

$$P(k) = \frac{d^k \, e^{-d}}{k!}$$

Step 2: Relate recombination to odd crossover number

A recombinant chromatid is produced only when the number of crossovers between two loci is odd. Even numbers of crossovers restore the parental configuration:

$$r = P(\text{odd number of COs}) = P(1) + P(3) + P(5) + \cdots$$

Step 3: Use the Poisson identity for odd vs. even sums

The probability of an odd number from a Poisson distribution with mean m has a closed-form solution. Using the identity $e^{x} - e^{-x} = 2\sinh(x)$, the odd terms sum to $e^{-m}\sinh(m)$:

$$P(\text{odd}) = e^{-m}\sinh(m) = \frac{1}{2}\left(1 - e^{-2m}\right) \quad \text{where } m = d$$

Equivalently, $P(\text{even}) - P(\text{odd}) = \sum_k (-1)^k m^k e^{-m}/k! = e^{-m} \cdot e^{-m} = e^{-2m}$, and combining this with $P(\text{even}) + P(\text{odd}) = 1$ gives the same result.

Step 4: Obtain the Haldane map function

Substituting m = d into the odd-probability formula:

$$r = \frac{1}{2}\left(1 - e^{-2d}\right) \qquad \text{(Haldane map function)}$$

Step 5: Derive the inverse function

Solving for genetic distance d given observed recombination frequency r:

$$1 - 2r = e^{-2d} \implies d = -\frac{1}{2}\ln(1 - 2r)$$

Step 6: Verify limiting behavior

Check the two key limits: for small distances (d → 0), r ≈ d (1 cM ≈ 1% recombination). For large distances (d → ∞), r → 0.5 (independent assortment):

$$d \to 0:\; r \approx \frac{1}{2}(1-(1-2d)) = d \;\;\checkmark \qquad d \to \infty:\; r \to \frac{1}{2}(1-0) = 0.5 \;\;\checkmark$$
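The odd-sum identity used in Step 3 can be checked numerically by summing the odd Poisson terms directly. This is a small verification sketch; the function name and the truncation at 60 terms are arbitrary choices.

```python
import math


def poisson_odd_probability(mean, terms=60):
    """Sum P(k) over odd k < terms for a Poisson distribution."""
    total = 0.0
    for k in range(1, terms, 2):
        total += mean ** k * math.exp(-mean) / math.factorial(k)
    return total


for m in (0.05, 0.5, 2.0):
    closed_form = 0.5 * (1.0 - math.exp(-2.0 * m))
    print(f"mean = {m}: sum = {poisson_odd_probability(m):.6f}, "
          f"closed form = {closed_form:.6f}")
```

For the means shown, the truncated sum and the closed form agree to well beyond the printed precision.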

Transposable Element Dynamics

The copy number n of a replicative TE evolves according to:

$$\frac{dn}{dt} = n(u - v) - sn^2$$

where u = transposition rate per element per generation, v = excision rate, s = selection coefficient per element. The $sn^2$ term reflects purifying selection against TE load (each additional copy imposes a fitness cost proportional to total copy number). At equilibrium ($dn/dt = 0$):

$$n_{\text{eq}} = \frac{u - v}{s}$$

This is a logistic-like equation: TE copy number increases when $u > v$ but is limited by selection. The model predicts that weakly deleterious TEs (small s) accumulate to high copy numbers, consistent with the observation that most TEs in large genomes reside in gene-poor regions where their selective cost is minimal.
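The ODE above can be integrated numerically to watch copy number approach equilibrium. The sketch below uses a simple forward Euler step of one generation; the function name and the parameter values are hypothetical and chosen only for illustration.

```python
def simulate_te_copy_number(n0, u, v, s, generations, dt=1.0):
    """Forward-Euler integration of dn/dt = n(u - v) - s n^2."""
    n = float(n0)
    trajectory = [n]
    for _ in range(int(generations / dt)):
        n += dt * (n * (u - v) - s * n * n)
        n = max(n, 0.0)  # copy number cannot be negative
        trajectory.append(n)
    return trajectory


u, v, s = 0.02, 0.002, 1e-4   # hypothetical illustrative rates
trajectory = simulate_te_copy_number(n0=1, u=u, v=v, s=s, generations=2000)
print(f"final n = {trajectory[-1]:.2f}, predicted n_eq = {(u - v) / s:.2f}")
```

Starting from a single copy, the trajectory rises sigmoidally and settles at $(u - v)/s$. Setting $u < v$ instead drives the element toward extinction.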

Derivation: Transposon Copy Number Dynamics from Birth-Death Process

Starting from first principles of replicative transposition and purifying selection, we derive the logistic-like equation governing TE copy number evolution.

Step 1: Define the gain and loss rates

For a genome carrying n copies of a replicative TE, each element transposes with rate u per generation (birth) and is excised/deleted with rate v (death):

$$\text{Birth rate} = u \cdot n \qquad \text{Death rate (excision)} = v \cdot n$$

Step 2: Model purifying selection against TE load

Each TE insertion carries a small fitness cost s. The total fitness cost scales with copy number, and selection removes TEs in proportion to the product of individual fitness cost and total TE burden:

$$\text{Loss from selection} = s \cdot n^2 \qquad (\text{each of } n \text{ copies has fitness cost } s \cdot n)$$

Step 3: Write the complete ODE

Combining gain and loss terms:

$$\frac{dn}{dt} = un - vn - sn^2 = n(u - v) - sn^2$$

Step 4: Recognize the logistic form

Factoring, the equation is a logistic growth equation with growth rate (u − v) and carrying capacity (u − v)/s:

$$\frac{dn}{dt} = (u-v) \cdot n \left(1 - \frac{n}{(u-v)/s}\right) = (u-v) \cdot n \left(1 - \frac{sn}{u-v}\right)$$

Step 5: Solve for the equilibrium copy number

At equilibrium (dn/dt = 0), either n = 0 (trivial, TE extinction) or:

$$n(u - v) - sn^2 = 0 \implies n_{\text{eq}} = \frac{u - v}{s} \qquad (\text{stable if } u > v)$$

Step 6: Numerical example for human LINE-1 elements

Using estimated parameters for active LINE-1 elements: u ≈ 0.05 per element per generation, v ≈ 0.005, s ≈ $10^{-3}$:

$$n_{\text{eq}} = \frac{0.05 - 0.005}{0.001} = 45 \;\text{active copies}$$

This is the same order of magnitude as the estimated ~80–100 active LINE-1 elements in the human genome. The remaining ~500,000 LINE-1 copies are “dead” fossils (truncated/mutated) that no longer transpose. Over evolutionary time, the accumulation of inactive copies explains the massive TE content (~45%) of the human genome.
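Because the equation is logistic, it also has the standard closed-form solution $n(t) = K / (1 + \frac{K - n_0}{n_0} e^{-(u-v)t})$ with $K = (u-v)/s$. The sketch below evaluates it with the Step 6 parameters; the starting value of one copy and the function names are assumptions made for illustration.

```python
import math


def logistic_copy_number(t, n0, u, v, s):
    """Closed-form solution of dn/dt = n(u - v) - s n^2 at time t."""
    r = u - v
    k = r / s                      # carrying capacity (u - v) / s
    return k / (1.0 + (k - n0) / n0 * math.exp(-r * t))


def time_to_fraction(fraction, n0, u, v, s):
    """Generations needed to reach `fraction` of the equilibrium."""
    r = u - v
    k = r / s
    a = (k - n0) / n0
    # Solve k / (1 + a * exp(-r t)) = fraction * k for t.
    return math.log(a * fraction / (1.0 - fraction)) / r


u, v, s = 0.05, 0.005, 1e-3        # parameters from Step 6
n0 = 1                             # ASSUMPTION: invasion by one copy
t90 = time_to_fraction(0.9, n0, u, v, s)
print(f"n_eq = {(u - v) / s:.0f} copies")
print(f"generations to reach 90% of n_eq: {t90:.0f}")
```

Under these assumptions the approach to equilibrium takes on the order of a hundred generations, which is short on evolutionary timescales.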

Python: Meiotic Crossover Simulation with Interference

This simulation implements the counting model of crossover interference. In this model, after each crossover, a fixed number m of non-crossover intermediates must be resolved before another crossover is permitted. The simulation places crossovers along chromosomes for thousands of meioses, then analyzes the crossover distribution, inter-crossover distances (showing interference), and genetic map distances vs physical distances. It also compares observed recombination frequencies with predictions from the Haldane and Kosambi map functions.
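A compact sketch of the counting model is given below. It is not the page's original script: the function names, the chromosome length of 1 Morgan, the locus positions, and the assumption of no chromatid interference are all choices made here for illustration.

```python
import math
import random


def simulate_crossovers(length_morgans, m, rng):
    """Place crossovers on one bivalent (four-strand bundle).

    Intermediates fall as a Poisson process; every (m + 1)-th one becomes
    a crossover, so m non-crossover intermediates separate adjacent
    crossovers. Positions are in Morgans from the left end.
    """
    # d Morgans means 2 * d expected crossovers per bivalent, so
    # intermediates must be (m + 1) times denser than that.
    rate = 2.0 * (m + 1)
    skip = rng.randrange(m + 1)  # random starting phase of the counter
    crossovers = []
    count = 0
    x = rng.expovariate(rate)
    while x < length_morgans:
        if count % (m + 1) == skip:
            crossovers.append(x)
        count += 1
        x += rng.expovariate(rate)
    return crossovers


def recombination_fraction(a, b, length_morgans, m, n_meioses, seed=0):
    """Estimate RF between loci at positions a < b (in Morgans)."""
    rng = random.Random(seed)
    recombinants = 0
    for _ in range(n_meioses):
        # Each crossover involves a given chromatid with probability 1/2
        # (assumes no chromatid interference).
        hits = sum(1 for x in simulate_crossovers(length_morgans, m, rng)
                   if a <= x < b and rng.random() < 0.5)
        recombinants += hits % 2  # odd number of exchanges = recombinant
    return recombinants / n_meioses


d = 0.5  # map distance between the two loci, in Morgans
haldane = 0.5 * (1.0 - math.exp(-2.0 * d))
kosambi = 0.5 * math.tanh(2.0 * d)
for m in (0, 2, 4):
    rf = recombination_fraction(0.25, 0.75, 1.0, m, n_meioses=20000)
    print(f"m={m}: simulated RF={rf:.3f}  "
          f"Haldane={haldane:.3f}  Kosambi={kosambi:.3f}")
```

With m = 0 the placement reduces to a Poisson process, so the estimate should approach the Haldane prediction up to sampling noise. Larger m spaces crossovers more evenly, and the simulated values can then be compared against both map functions.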

Fortran: Transposable Element Copy Number Dynamics

This Fortran program models the dynamics of transposable element copy number over generations under four scenarios. It implements the logistic-like TE dynamics equation where replicative transposition (rate u) increases copy number, excision (rate v) decreases it, and purifying selection (coefficient s) imposes a fitness cost that scales quadratically with copy number. The equilibrium copy number $n_{eq} = (u-v)/s$ balances amplification against purging. Scenarios compare LINE-like replicative TEs, DNA transposons, strong vs weak selection, and TE invasion from low copy number.

Key Concepts Summary

Homologous Recombination

  • DSB repair model (Szostak): resection → strand invasion → D-loop → dHJ → resolution
  • RecBCD + Chi (5'-GCTGGTGG-3') initiates HR in E. coli
  • RecA/Rad51 filament: ATP-dependent homology search and strand exchange
  • Paranemic → plectonemic joint transition
  • Meiotic: Spo11 DSBs, ZMM proteins, crossover interference, obligate CO
  • Class I COs (interfering, MutLγ) vs Class II (Mus81, non-interfering)

Site-Specific Recombination

  • Tyrosine recombinases: 3'-phosphotyrosine, HJ intermediate (Int, Cre, FLP)
  • Serine recombinases: 5'-phosphoserine, subunit rotation (Hin, γδ)
  • λ integration: attB × attP → attL + attR
  • Cre-lox: conditional knockouts, lineage tracing
  • V(D)J: RAG1/2, 12/23 rule, P- and N-nucleotides, ~$10^{11}$ antibodies
  • RAG evolved from a domesticated transposon

Transposable Elements

  • Class II (DNA): cut-and-paste, transposase, TIRs, TSDs
  • Class I LTR: retrovirus-like, gag-pol, VLP, integrase
  • LINE (L1): ORF1p + ORF2p (EN + RT), target-primed reverse transcription
  • SINE (Alu): ~1.1M copies, ~11% genome, parasitizes L1 machinery
  • ~45% human genome is TE-derived; exaptation creates new functions
  • Equilibrium: $n_{eq} = (u-v)/s$

CRISPR-Cas & Evolution

  • Adaptation: Cas1-Cas2 acquires spacers from invaders
  • Expression: pre-crRNA processed to guide RNAs
  • Interference: Cas9 PAM → R-loop → HNH + RuvC cleavage
  • Editing: knockouts (NHEJ), HDR, base editing, prime editing
  • HGT, gene duplication, exon shuffling drive genome evolution
  • TE domestication: syncytin, RAG1/2, regulatory networks