Part 10: Molecular Biology Techniques

Tools of the Trade

Modern molecular biology is built upon a powerful arsenal of techniques that allow researchers to isolate, manipulate, amplify, sequence, and edit nucleic acids and proteins. From the earliest restriction enzyme digests in the 1970s to today's single-cell multi-omics and CRISPR-based gene editing, these methods have transformed our understanding of life at the molecular level and enabled breakthroughs in medicine, agriculture, forensics, and biotechnology.

This chapter provides a comprehensive survey of the major techniques in molecular biology, covering the principles, reagents, applications, and quantitative foundations behind each method. We progress from classical cloning and electrophoresis through PCR and sequencing to modern genome editing and omics technologies.

1. DNA Cloning

1.1 Restriction Enzymes

Restriction endonucleases are bacterial defense enzymes that cleave DNA at specific recognition sequences, typically 4-8 bp palindromes. Type II restriction enzymes are the workhorses of molecular cloning because they cut at defined positions within or adjacent to their recognition sites, producing reproducible fragments.

Sticky-End Cutters

Generate single-stranded overhangs (cohesive ends) that facilitate directional cloning through complementary base pairing.

EcoRI: 5'-G|AATTC-3' (4-nt 5' overhang)

3'-CTTAA|G-5'

BamHI: 5'-G|GATCC-3' (4-nt 5' overhang)

3'-CCTAG|G-5'

HindIII: 5'-A|AGCTT-3' (4-nt 5' overhang)

3'-TTCGA|A-5'

Blunt-End Cutters

Cut both strands at the same position, producing flush ends without overhangs. Less efficient ligation but allows joining of any two blunt-ended fragments.

SmaI: 5'-CCC|GGG-3'

3'-GGG|CCC-5'

EcoRV: 5'-GAT|ATC-3'

3'-CTA|TAG-5'

Key Concepts

Palindromic recognition: Most Type II enzymes recognize sequences that read the same on both strands in the 5'-to-3' direction (e.g., EcoRI: 5'-GAATTC-3' / 3'-CTTAAG-5').
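Isoschizomers: Different enzymes recognizing the same sequence and cutting at the same position (e.g., HpaII and MspI, both C|CGG). Neoschizomers recognize the same sequence but cut at different positions (e.g., SmaI CCC|GGG and XmaI C|CCGGG). Isocaudomers recognize different sequences but produce compatible cohesive ends (e.g., BamHI G|GATCC and BglII A|GATCT, both leaving GATC overhangs).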
Methylation sensitivity: Many restriction enzymes are blocked by DNA methylation (Dam: GATC, Dcm: CCWGG). This is exploited in bacterial defense (R-M systems) and must be considered when cloning from methylation-positive hosts.
Star activity: Under suboptimal conditions (low ionic strength, high glycerol, excessive enzyme), some enzymes exhibit relaxed specificity, cutting at non-canonical sites.
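The palindrome property described above can be checked mechanically: a site is palindromic when it equals its own reverse complement. The following is a minimal sketch; the function names are illustrative and not taken from any particular library.

```python
# Translation table mapping each base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_palindromic_site(site: str) -> bool:
    """True if the site reads the same 5'-to-3' on both strands."""
    return site.upper() == reverse_complement(site)


if __name__ == "__main__":
    # Recognition sequences quoted earlier in this section.
    sites = {
        "EcoRI": "GAATTC",
        "BamHI": "GGATCC",
        "HindIII": "AAGCTT",
        "SmaI": "CCCGGG",
        "EcoRV": "GATATC",
    }
    for enzyme, site in sites.items():
        print(enzyme, site, is_palindromic_site(site))
```

All five sites listed in this section pass the check; a non-palindromic sequence such as GAATTA does not.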

The expected number of restriction sites in a random DNA sequence follows:

$ \text{Expected cuts} = \frac{L}{4^n} $

where L is the sequence length and n is the recognition site length. A 6-cutter (n=6) cuts on average every 4,096 bp; a 4-cutter every 256 bp; an 8-cutter every 65,536 bp.
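As a quick numerical check of this formula, the sketch below computes the expected number of sites and the mean spacing. It assumes a random sequence with equal base frequencies (25% each); real genomes deviate from this, so the values are only rough guides. The 50,000 bp sequence length is an arbitrary illustration.

```python
def expected_cuts(seq_length_bp: int, site_length: int) -> float:
    """Expected number of sites of a given length in a random sequence.

    Assumes each base occurs with probability 1/4, independently.
    """
    return seq_length_bp / 4 ** site_length


def mean_fragment_size(site_length: int) -> int:
    """Average spacing between sites (bp) under the same assumption."""
    return 4 ** site_length


if __name__ == "__main__":
    for n in (4, 6, 8):
        print(f"{n}-cutter: one site per {mean_fragment_size(n)} bp on average")
    # Hypothetical 50,000 bp sequence digested with a 6-cutter.
    print(expected_cuts(50_000, 6))
```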

1.2 Cloning Vectors

A vector is a DNA molecule used to carry foreign genetic material into a host cell for replication and/or expression. The choice of vector depends on the insert size, the host organism, and whether gene expression is required.

Plasmid Vectors (0.1-15 kb inserts)

The most commonly used vectors. Key features include:

Origin of replication (ori): ColE1 ori (pUC series, high copy ~500-700/cell) or p15A ori (lower copy ~15-20/cell). Copy number affects yield and stability.
Selectable markers: Antibiotic resistance genes (ampicillin/bla, kanamycin/kan, chloramphenicol/cat) allow selection of transformed cells.
Multiple cloning site (MCS/polylinker): Cluster of unique restriction sites (typically 10-20) within a reporter gene for convenient insertion.
pBR322: Classic plasmid (4,361 bp) with ampR and tetR markers. Insertional inactivation used for clone identification.
pUC series: High-copy derivative with lacZ-alpha MCS enabling blue-white screening. Smaller size (~2.7 kb) improves transformation efficiency.

Large-Insert Vectors

Bacteriophage lambda (up to 23 kb)

Central "stuffer" fragment replaced with insert. Packaged into phage particles in vitro for high-efficiency infection. Replacement and insertion vectors available.

Cosmids (up to 45 kb)

Hybrid plasmid/phage vector containing cos sites for lambda packaging. Propagates as plasmid in E. coli. Used for genomic libraries.

BACs (up to 300 kb)

Based on F-factor. Single copy per cell ensures stability. Critical for Human Genome Project. Contains parA/B/C partition genes.

YACs (up to 2 Mb)

Artificial yeast chromosomes with telomeres, centromere, ARS (ori), and selectable markers. Propagated in S. cerevisiae. Prone to chimerism.

Expression Vectors

Designed not just to carry DNA but to actively transcribe and translate the insert into protein:

T7 promoter system: Recognized by T7 RNA polymerase (supplied by DE3 lysogen in BL21(DE3) cells). Extremely strong, tightly regulated.
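IPTG induction: Isopropyl-beta-D-thiogalactopyranoside mimics allolactose, relieving LacI repression of the lacUV5 promoter that drives T7 RNA polymerase in DE3 lysogens (and of the T7lac promoter present on many pET vectors). Typical induction: 0.1-1 mM IPTG.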
Affinity tags: His-tag (6xHis, Ni-NTA purification), GST-tag (glutathione purification), MBP-tag (amylose purification). Tags can be N- or C-terminal with protease cleavage sites (TEV, thrombin).
pET system: Industry standard. pET-28a provides His-tag, T7 promoter, kanamycin resistance. Various derivatives for different tag configurations.

1.3 Ligation and Transformation

T4 DNA Ligase

Catalyzes phosphodiester bond formation between 3'-OH and 5'-phosphate ends in an ATP-dependent reaction. Key considerations:

Cohesive-end ligation: 100-1000x more efficient than blunt-end. Performed at 16 degrees C overnight or room temperature for 10 min (quick ligase).
Blunt-end ligation: Requires 10-100x more ligase and higher DNA concentrations. PEG 4000 (5-15%) can enhance efficiency.
Insert:vector ratio: Typically 3:1 to 5:1 molar ratio for optimal intermolecular ligation. Higher ratios favor concatemer formation.
Vector dephosphorylation: CIP or SAP removes 5'-phosphates from vector to prevent self-ligation, dramatically reducing background.
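The insert:vector molar ratio above translates into a mass of insert once the fragment lengths are known, because equal masses of shorter DNA contain more molecules. A minimal sketch of that conversion follows; the function name and the example quantities are illustrative assumptions.

```python
def insert_mass_ng(vector_ng: float, vector_bp: int, insert_bp: int,
                   molar_ratio: float = 3.0) -> float:
    """Mass of insert (ng) giving the requested insert:vector molar ratio.

    Moles of dsDNA are proportional to mass / length, so
    insert_ng = vector_ng * (insert_bp / vector_bp) * molar_ratio.
    """
    if vector_bp <= 0 or insert_bp <= 0:
        raise ValueError("fragment lengths must be positive")
    return vector_ng * (insert_bp / vector_bp) * molar_ratio


if __name__ == "__main__":
    # Hypothetical reaction: 50 ng of a 3,000 bp vector, 1,000 bp insert, 3:1 ratio.
    print(insert_mass_ng(50, 3000, 1000, 3))
```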

Blue-White Screening

A screening method based on alpha-complementation of beta-galactosidase:

Vector carries lacZ-alpha gene fragment with MCS inside
Host strain provides lacZ-omega (DH5-alpha, JM109)
Insert disrupts lacZ-alpha reading frame
Plates contain X-gal (chromogenic substrate) + IPTG (inducer)
Blue colonies: No insert (functional beta-gal cleaves X-gal)
White colonies: Contain insert (disrupted beta-gal)

Transformation Methods

Chemical Competence (CaCl2)

Divalent cations (Ca2+) neutralize DNA phosphate backbone charge, promoting DNA-membrane association. Heat shock (42 degrees C, 45-90 sec) creates transient pores. Efficiency: 10^6 - 10^8 CFU/microgram. TSS and Inoue methods further enhance competence.

Electroporation

Brief high-voltage pulses (1.8-2.5 kV, 5 ms) create transient pores in the membrane. 10-100x more efficient than chemical methods (up to 10^10 CFU/microgram). Essential for large DNA molecules and library construction. Requires desalted DNA.

2. Polymerase Chain Reaction (PCR)

2.1 Principle and Thermal Cycling

PCR is the single most transformative technique in molecular biology. Invented by Kary Mullis in 1983 (Nobel Prize 1993), it enables exponential amplification of a specific DNA target defined by two flanking oligonucleotide primers.

The Three-Step Thermal Cycle

1. Denaturation (94-98 degrees C)

Double-stranded DNA melts into single strands. Higher GC content requires higher temperatures. Initial denaturation: 2-5 min; cycle denaturation: 15-30 sec.

2. Annealing (50-65 degrees C)

Primers hybridize to complementary template strands. Temperature set 3-5 degrees C below primer Tm. Too low causes non-specific binding; too high reduces yield.

3. Extension (72 degrees C)

DNA polymerase synthesizes new strand from 3' end of each primer. Rate: ~1 kb/min (Taq) or ~0.5 kb/min (proofreading). Time depends on amplicon length.

Exponential Amplification

With perfect efficiency, each cycle doubles the number of target molecules:

\[ N_n = N_0 \cdot 2^n \]

In practice, efficiency (E) is less than 100%, so the general amplification equation is:

\[ N_n = N_0 \cdot (1 + E)^n \quad \text{where } 0 \leq E \leq 1 \]

Starting from 1 template molecule with E=0.95 over 30 cycles: N = 1 x (1.95)^30 = 5.02 x 10^8 copies. With perfect efficiency: 2^30 = 1.07 x 10^9. A 5% shortfall per cycle therefore compounds to roughly half the ideal yield.
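The amplification equation is easy to evaluate directly, which is a useful check on hand calculations. The sketch below assumes constant efficiency over all cycles (no plateau); the function name is illustrative.

```python
def pcr_copies(n0: float, efficiency: float, cycles: int) -> float:
    """Copy number after `cycles` rounds: N_n = N_0 * (1 + E)^n.

    Assumes the efficiency E (0 to 1) is the same in every cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return n0 * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    ideal = pcr_copies(1, 1.0, 30)
    real = pcr_copies(1, 0.95, 30)
    print(f"ideal:  {ideal:.3e}")
    print(f"E=0.95: {real:.3e}")
    print(f"fraction of ideal yield: {real / ideal:.2f}")
```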

Derivation: PCR Amplification Kinetics

Starting from the biochemistry of a single PCR cycle, we derive the exponential amplification equation and its efficiency-corrected form.

Step 1: Model a single ideal PCR cycle

In one cycle, each double-stranded template denatures into two single strands, primers anneal, and polymerase extends to create two new duplexes. With perfect efficiency, one molecule becomes two:

$$N_1 = N_0 \times 2 \qquad \text{(one cycle, 100\% efficiency)}$$

Step 2: Extend to n cycles (ideal case)

Each cycle doubles the copy number independently. After n cycles:

$$N_n = N_0 \times 2^n \qquad \text{(ideal exponential amplification)}$$

For N₀ = 1 and n = 30: N = 2³⁰ = 1.07 × 10⁹ copies (~1 billion-fold amplification).

Step 3: Introduce amplification efficiency E

In reality, not every template is copied in every cycle. Define efficiency E as the fraction of templates successfully replicated (0 ≤ E ≤ 1). Per cycle, each template produces (1 + E) copies:

$$N_1 = N_0 \times (1 + E) \qquad \text{where } E = 1 \text{ is perfect (doubling)}$$

Step 4: General amplification equation

Compounding over n cycles with constant efficiency:

$$N_n = N_0 \times (1 + E)^n$$

With E = 0.95 (typical for optimized reactions): (1.95)³⁰ = 5.02 × 10⁸, about 47% of the ideal yield.

Step 5: Model efficiency decline (realistic PCR)

Efficiency decreases as primers deplete and products accumulate. A simple model uses a logistic correction:

$$E(n) = E_{\max} \times \frac{P_{\text{remaining}}}{P_{\text{total}}} \times \left(1 - \frac{N_n}{N_{\text{plateau}}}\right)$$

This produces the characteristic three-phase PCR curve: exponential phase (constant E), linear phase (declining E), and plateau phase (E → 0).

Step 6: Derive the qPCR Ct relationship

The threshold cycle C_t is when N_n reaches a detection threshold N_T. Solving N_T = N₀(1+E)^Ct:

$$C_t = \frac{\ln(N_T/N_0)}{\ln(1+E)} \qquad C_t = \log_2 N_T - \log_2 N_0 \;\;\text{(at } E = 1\text{)}$$

Each 10-fold difference in initial template shifts C_t by ~3.32 cycles (when E = 1). This linear relationship between C_t and log(N₀) is the basis of qPCR quantification.
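The C_t relationship can be sketched in a few lines. The code below assumes constant efficiency up to the threshold; `efficiency_from_slope` uses the standard-curve relation slope = -1/log10(1 + E), which follows from the same equation when C_t is plotted against log10(N_0). Function names are illustrative.

```python
import math


def threshold_cycle(n0: float, n_threshold: float, efficiency: float = 1.0) -> float:
    """C_t = ln(N_T / N_0) / ln(1 + E), assuming constant efficiency."""
    return math.log(n_threshold / n0) / math.log(1.0 + efficiency)


def cycles_per_tenfold(efficiency: float = 1.0) -> float:
    """Shift in C_t caused by a 10-fold change in starting template."""
    return math.log(10.0) / math.log(1.0 + efficiency)


def efficiency_from_slope(slope: float) -> float:
    """Efficiency implied by the slope of C_t versus log10(N_0)."""
    return 10.0 ** (-1.0 / slope) - 1.0


if __name__ == "__main__":
    print(cycles_per_tenfold(1.0))       # cycles per decade at perfect efficiency
    print(cycles_per_tenfold(0.9))       # larger spacing at lower efficiency
    print(efficiency_from_slope(-3.32))  # close to 1 for a well-behaved assay
```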

2.2 Components and Polymerases

DNA Polymerases

Taq polymerase: From Thermus aquaticus. Thermostable (half-life ~40 min at 95 degrees C). No 3'-to-5' proofreading. Error rate: ~2 x 10^-4 per bp per cycle. Adds non-templated A to 3' ends (useful for TA cloning).
Pfu polymerase: From Pyrococcus furiosus. 3'-to-5' exonuclease (proofreading). Error rate: ~1.3 x 10^-6. Slower extension. Blunt ends.
Phusion polymerase: Engineered fusion of Pfu-like polymerase with processivity-enhancing domain. Error rate: ~4.4 x 10^-7. Fastest high-fidelity option. Industry standard for cloning.
Q5 polymerase: NEB's high-fidelity polymerase. Error rate: ~5.3 x 10^-7. Excellent for long amplicons and GC-rich templates.

Other Key Components

dNTPs: Typically 200 microM each (dATP, dCTP, dGTP, dTTP). Equimolar balance critical. Excess dNTPs chelate Mg2+ and reduce fidelity.
Mg2+ concentration: 1.5-2.5 mM optimal. Cofactor for polymerase. Higher Mg2+ increases yield but reduces specificity and fidelity.
Primers: 18-25 nt, 40-60% GC, Tm matched within 2 degrees C. Avoid 3'-end complementarity (primer dimers), repeats, and secondary structure.
Buffer: Tris-HCl pH 8.3-8.8, KCl (50 mM stabilizes primer annealing). Some buffers include betaine or DMSO for GC-rich templates.

Primer Melting Temperature (Tm) Calculation

Basic Wallace rule (for primers less than 20 nt):

\[ T_m = 2(A + T) + 4(G + C) \quad \text{(°C)} \]

Nearest-neighbor thermodynamic method (more accurate):

\[ T_m = \frac{\Delta H}{\Delta S + R \ln(C_T/4)} - 273.15 \]

where Delta-H and Delta-S are enthalpy and entropy sums of nearest-neighbor dinucleotide pairs, R is the gas constant (1.987 cal/mol/K), and C_T is the total primer concentration. Salt correction: Delta-S(adjusted) = Delta-S + 0.368 x (n-1) x ln([Na+]).
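The Wallace rule is simple enough to compute by counting bases. The sketch below implements only that rule (the nearest-neighbor method needs a table of thermodynamic parameters and is not attempted here); the example primer is a made-up sequence.

```python
def wallace_tm(primer: str) -> int:
    """Approximate Tm in degrees C: 2(A + T) + 4(G + C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("primer must contain only A, C, G, T")
    return 2 * at + 4 * gc


def gc_percent(primer: str) -> float:
    """GC content as a percentage of primer length."""
    seq = primer.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    primer = "ATGCGTACCTGAAGCTTG"  # hypothetical 18-nt primer
    print(len(primer), gc_percent(primer), wallace_tm(primer))
```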

2.3 PCR Variants

RT-PCR (Reverse Transcription PCR)

First converts RNA to cDNA using reverse transcriptase (MMLV, SuperScript), then amplifies by standard PCR. Two-step (separate RT and PCR) or one-step (single tube) protocols. Essential for gene expression analysis and RNA virus detection. Oligo(dT) primers for mRNA, random hexamers for total RNA, or gene-specific primers.

qPCR (Quantitative Real-Time PCR)

Monitors amplification in real-time via fluorescence. Two main chemistries:

SYBR Green: Intercalating dye; binds any dsDNA. Cheap but non-specific (detects primer dimers). Melt curve analysis verifies single product.
TaqMan probes: Dual-labeled hydrolysis probes (5'-FAM reporter, 3'-TAMRA quencher). Sequence-specific. Taq 5'-to-3' exonuclease releases reporter during extension.

qPCR Quantification

The Ct (threshold cycle) is the cycle at which fluorescence crosses a set threshold:

\[ \Delta\Delta C_t = (C_{t,\text{target}} - C_{t,\text{ref}})_{\text{treated}} - (C_{t,\text{target}} - C_{t,\text{ref}})_{\text{control}} \]

Fold change = 2^(-Delta-Delta-Ct). Assumes ~100% amplification efficiency. Standard curve method more robust for unequal efficiencies.
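A worked example of the Delta-Delta-Ct calculation is sketched below. It assumes ~100% efficiency for both target and reference assays, as stated above; the Ct values are invented for illustration and are not measurements.

```python
def fold_change_ddct(ct_target_treated: float, ct_ref_treated: float,
                     ct_target_control: float, ct_ref_control: float) -> float:
    """Relative expression by the 2^(-ddCt) method."""
    dct_treated = ct_target_treated - ct_ref_treated
    dct_control = ct_target_control - ct_ref_control
    ddct = dct_treated - dct_control
    return 2.0 ** (-ddct)


if __name__ == "__main__":
    # Hypothetical Ct values: the target crosses threshold 3 cycles earlier
    # in the treated sample, while the reference gene is unchanged.
    print(fold_change_ddct(22.0, 18.0, 25.0, 18.0))
```

With these illustrative inputs Delta-Delta-Ct is -3, corresponding to an 8-fold increase.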

Other PCR Variants

Digital PCR (dPCR): Partitions sample into thousands of nanoliter reactions. Absolute quantification by counting positive partitions (Poisson statistics). No standard curve needed.
Nested PCR: Two rounds with inner primers. Extreme sensitivity and specificity. Risk of contamination.
Overlap extension PCR: Joins/mutates sequences without restriction enzymes. Overlapping primers create chimeric products.
Inverse PCR: Outward-facing primers amplify unknown flanking sequences from a known internal region.
Multiplex PCR: Multiple primer pairs in one reaction. Used in pathogen panels, STR profiling, and genotyping.
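For the digital PCR entry in the list above, the Poisson correction can be sketched as follows: if a fraction p of partitions is positive, the mean number of copies per partition is -ln(1 - p), because a partition is negative only when it received zero templates. The partition counts and the 1 nL partition volume in the example are assumptions for illustration.

```python
import math


def mean_copies_per_partition(positive: int, total: int) -> float:
    """Poisson estimate: lambda = -ln(1 - positive/total)."""
    if total <= 0 or not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total")
    return -math.log(1.0 - positive / total)


def copies_per_microliter(positive: int, total: int,
                          partition_volume_nl: float) -> float:
    """Template concentration in the partitioned reaction volume."""
    lam = mean_copies_per_partition(positive, total)
    return lam / (partition_volume_nl * 1e-3)  # convert nL to microliters


if __name__ == "__main__":
    # Hypothetical run: 5,000 positive partitions out of 20,000, 1 nL each.
    print(mean_copies_per_partition(5000, 20000))
    print(copies_per_microliter(5000, 20000, 1.0))
```

The estimate breaks down when every partition is positive, which is why the function rejects that case.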

3. DNA Sequencing

3.1 Sanger Sequencing (First Generation)

Developed by Frederick Sanger in 1977 (Nobel Prize 1980), dideoxy chain termination sequencing remains the gold standard for accuracy and is still widely used for single-gene sequencing, clone verification, and variant confirmation.

Principle: Dideoxy Chain Termination

The reaction contains normal dNTPs plus a small proportion of dideoxynucleotides (ddNTPs) lacking the 3'-OH group. When a ddNTP is incorporated, chain extension terminates at that position. The result is a nested set of fragments of every possible length, each terminated with a fluorescently labeled ddNTP that identifies the terminal base.

Template: 3'-TAGCCATGCGAT...-5'

Primer extension with dNTPs + ddNTPs:

Fragment 1: 5'-A* (terminated at pos 1, ddATP)

Fragment 2: 5'-AT* (terminated at pos 2, ddTTP)

Fragment 3: 5'-ATC* (terminated at pos 3, ddCTP)

Fragment 4: 5'-ATCG* (terminated at pos 4, ddGTP)

Fragment 5: 5'-ATCGG* (terminated at pos 5, ddGTP)

... and so on for each position

Modern Sanger Protocol

Single reaction with 4 differently-colored fluorescent ddNTPs
Capillary electrophoresis (CE) separates fragments by size
Laser excitation and CCD detection of fluorescent labels
Automated base calling by software (Phred quality scores)
Read length: ~800-1000 bp per reaction
Accuracy: >99.99% per base (Phred Q40+)

Phred Quality Score

\[ Q = -10 \log_{10}(P_e) \]

where P_e is the probability of an incorrect base call. Q20 = 1% error rate, Q30 = 0.1% error, Q40 = 0.01% error. Modern Sanger routinely achieves Q40-Q60 for the middle portion of reads.
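The conversion between Phred scores and error probabilities is a direct application of the formula above; a minimal sketch with illustrative function names:

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability for a Phred score: P_e = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def error_to_phred(p_error: float) -> float:
    """Phred score for an error probability: Q = -10 log10(P_e)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(f"Q{q}: {phred_to_error(q):.4%} error per base")
```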

Derivation: Sanger Sequencing Read Length from Poisson ddNTP Incorporation

Starting from the probability of ddNTP incorporation at each position, we derive the theoretical distribution of fragment lengths and the practical read length limit.

Step 1: Define the termination probability per position

At each template position, the polymerase incorporates either a normal dNTP (continuing synthesis) or a ddNTP (terminating the chain). The probability of termination depends on the ddNTP:dNTP ratio:

$$p = \frac{[\text{ddNTP}]}{[\text{ddNTP}] + [\text{dNTP}]} \qquad \text{(typically } p \approx 0.01\text{)}$$

Step 2: Model fragment length as a geometric distribution

The probability that a fragment terminates at exactly position k (chain reaches length k before termination) follows a geometric distribution:

$$P(\text{length} = k) = (1-p)^{k-1} \cdot p$$

Step 3: Calculate the expected fragment length

The mean fragment length from a geometric distribution is:

$$\langle L \rangle = \frac{1}{p} \qquad \text{For } p = 0.01: \;\langle L \rangle = 100\;\text{nt}$$

Step 4: Derive the signal intensity at position k

The number of fragments of length k out of N total molecules determines the fluorescence signal. The expected count at position k is:

$$n(k) = N \cdot p \cdot (1-p)^{k-1} = N \cdot p \cdot e^{-(k-1)\ln(1/(1-p))} \approx N \cdot p \cdot e^{-kp}$$

Step 5: Determine the practical read length limit

Base calling fails when the signal-to-noise ratio drops below a threshold. The signal decays exponentially while noise remains roughly constant. The read length L_max where signal equals noise floor (n_min):

$$N \cdot p \cdot e^{-L_{\max} \cdot p} = n_{\min} \implies L_{\max} = \frac{1}{p}\ln\left(\frac{Np}{n_{\min}}\right)$$

Step 6: Optimize p for maximum read length

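For a target read length L_target, the signal at that position, n(L_target) ≈ N·p·e^(−L_target·p), is largest when dn/dp = 0:

$$\frac{dn}{dp} = N e^{-L_{\text{target}}\, p}\,(1 - L_{\text{target}}\, p) = 0 \implies p_{\text{opt}} = \frac{1}{L_{\text{target}}} \approx \frac{1}{1000} = 0.001$$

With typical values (N = 10⁶ template molecules, n_min = 10), the signal at position 1000 is then N·p·e⁻¹ ≈ 368 fragments, well above the noise floor.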

Modern Sanger sequencing optimizes the ddNTP:dNTP ratio to achieve 800–1000 bp read lengths. Beyond this, electrophoretic resolution of fragments differing by 1 bp (at size >1000) becomes the limiting factor rather than the Poisson statistics.
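The geometric model in this derivation can be explored numerically. The sketch below evaluates the expected fragment count n(k), the read-length limit L_max, and confirms that the signal at a chosen position k peaks at p = 1/k. It covers only the termination statistics, not electrophoretic resolution; N and n_min are the illustrative values used in the text.

```python
import math


def fragment_count(k: int, n_total: float, p: float) -> float:
    """Expected number of fragments of length k: N * p * (1 - p)^(k - 1)."""
    return n_total * p * (1.0 - p) ** (k - 1)


def mean_fragment_length(p: float) -> float:
    """Mean of the geometric length distribution, 1/p."""
    return 1.0 / p


def max_read_length(n_total: float, p: float, n_min: float) -> float:
    """L_max = (1/p) * ln(N * p / n_min), from the exponential approximation."""
    return math.log(n_total * p / n_min) / p


if __name__ == "__main__":
    n_total, n_min = 1e6, 10
    for p in (0.01, 0.001):
        print(p, mean_fragment_length(p), round(max_read_length(n_total, p, n_min)))
    # Signal at position 1000 for ratios around p = 1/1000.
    for p in (0.0005, 0.001, 0.002):
        print(p, round(fragment_count(1000, n_total, p), 1))
```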

3.2 Next-Generation Sequencing (NGS)

NGS technologies perform massively parallel sequencing of millions to billions of DNA fragments simultaneously, reducing cost per base by orders of magnitude compared to Sanger. The cost of sequencing a human genome dropped from ~$3 billion (2001) to under $200 (2024).

Illumina Sequencing (Sequencing by Synthesis)

Dominant NGS platform (~80% of all sequencing data generated worldwide).

Library preparation: Fragment DNA, ligate adaptors, size-select.
Bridge amplification: ssDNA binds flow cell oligos, bends to hybridize with adjacent oligo, polymerase creates complement. Repeated to form clonal clusters (~1000 copies each).
Sequencing by synthesis: Fluorescently labeled reversible terminators added one at a time. After each incorporation: image, cleave fluorophore + 3'-blocking group, repeat.
Output: Read length 2x75 to 2x300 bp (paired-end). NovaSeq 6000: ~6 Tb per run, 20 billion reads. NovaSeq X Plus: ~16 Tb.

Ion Torrent (Semiconductor Sequencing)

Detects H+ ions released during nucleotide incorporation. Each well contains a single clonal template on an ion-sensitive field-effect transistor (ISFET). Nucleotides flowed sequentially; pH change proportional to number of incorporated bases. No optics needed. Fast turnaround (~2 hours). Weakness: homopolymer errors (difficult to count >6 identical bases). Read length: ~200-400 bp. Ion GeneStudio S5: ~50 Gb per run.

3.3 Third-Generation Sequencing (Long-Read)

PacBio SMRT Sequencing

Single Molecule, Real-Time sequencing:

Zero-mode waveguides (ZMWs): Nanophotonic chambers (~70 nm diameter) confining observation volume to ~20 zeptoliters. Only the fluorescent nucleotide being incorporated by the polymerase is observed.
Mechanism: Single DNA polymerase molecule anchored at ZMW bottom. Fluorescent phospholinked dNTPs diffuse in, are held during incorporation (~ms), fluorophore detected, then released upon translocation.
Read length: Average 15-20 kb (HiFi mode), up to 50+ kb. Circular consensus sequencing (CCS/HiFi): multiple passes around SMRTbell template yields Q30+ accuracy.
Revio system: 25 million ZMWs and ~90 Gb HiFi data per SMRT cell; ~360 Gb per run with 4 SMRT cells.

Oxford Nanopore Sequencing

Direct measurement of nucleic acids threading through protein nanopores:

Protein pore: CsgG-derived nanopore embedded in synthetic membrane. Voltage applied across membrane drives ionic current through pore.
Signal: As each ~5-mer passes through the constriction, it modulates ionic current in a characteristic way. Motor protein (helicase) controls translocation speed (~450 bp/sec).
Read length: No theoretical limit. Routine reads 10-100 kb. Record: >4 Mb single read. Can sequence RNA directly without cDNA conversion.
Platforms: MinION (portable USB device, ~50 Gb), PromethION (desktop, ~290 Gb per flow cell x 48 flow cells). Real-time basecalling enables adaptive sampling.
Accuracy: Raw single-read ~95-98% (R10.4.1 chemistry). Consensus/duplex: >Q30.

Sequencing Error Rate Comparison

\[ \text{Error rate} = 10^{-Q/10} \quad \Rightarrow \quad \text{Q30} = 10^{-3} = 0.1\% \text{ error per base} \]

Platform	Read Length	Raw Accuracy	Throughput/Run	Cost/Gb
Sanger (CE)	~800 bp	99.99%	~0.8 kb	~$500,000
Illumina NovaSeq X	2x150 bp	>99.9% (Q30+)	~16 Tb	~$2
Ion Torrent S5	~200-400 bp	99.5%	~50 Gb	~$50
PacBio Revio (HiFi)	15-20 kb	>99.9% (Q30+)	~90 Gb	~$10
ONT PromethION	10-100+ kb	~95-99%	~290 Gb	~$5

3.4 Sequencing Applications

Genome & Transcriptome

WGS: Whole genome sequencing for variant calling, de novo assembly, structural variant detection.
WES: Whole exome sequencing (~1-2% of genome, ~22,000 genes). Cost-effective for clinical diagnostics.
RNA-seq: Transcriptome profiling. Reveals expression levels (FPKM/TPM), alternative splicing, novel transcripts, gene fusions.

Epigenome & Chromatin

ChIP-seq: Chromatin immunoprecipitation + sequencing maps protein-DNA interactions and histone modifications genome-wide.
ATAC-seq: Assay for Transposase-Accessible Chromatin. Maps open chromatin regions using Tn5 transposase insertion.
Bisulfite-seq: Maps DNA methylation (5mC) at single-base resolution by converting unmethylated C to U.

4. Gel Electrophoresis

Separation by Size and Charge

Electrophoresis separates macromolecules based on their migration through a porous gel matrix under an applied electric field. The fundamental equation governing electrophoretic mobility:

\[ \mu = \frac{v}{E} = \frac{q}{f} \]

where mu is electrophoretic mobility, v is velocity, E is electric field strength, q is net charge, and f is the frictional coefficient (depends on size and shape). For DNA, charge is uniformly negative (one phosphate per nucleotide), so separation is purely by size through the sieving matrix.

Agarose Gel (DNA/RNA)

Concentration: 0.5-2% depending on fragment size
0.5% agarose: resolves 1-30 kb fragments
1.0% agarose: resolves 0.5-10 kb (most common)
2.0% agarose: resolves 0.1-2 kb fragments
Staining: ethidium bromide (intercalator, UV fluorescent) or safer alternatives (SYBR Safe, GelRed)
Buffer: TAE (Tris-Acetate-EDTA) or TBE (Tris-Borate-EDTA)
Migration: log(mobility) decreases linearly with agarose concentration (Ferguson plot)

PAGE (Polyacrylamide Gel)

SDS-PAGE (denaturing): Sodium dodecyl sulfate denatures proteins and confers uniform negative charge proportional to mass (~1.4 g SDS/g protein). Separation by molecular weight alone.
Native PAGE: Separates by charge, size, and shape. Preserves protein activity and quaternary structure.
Acrylamide concentration: 6-15% (higher = better resolution of small proteins)
Staining: Coomassie blue (~10 ng sensitivity) or silver stain (~0.1 ng sensitivity)

Ferguson Plot Analysis

The relationship between electrophoretic mobility and gel concentration:

\[ \log(\mu) = \log(\mu_0) - K_R \cdot T \]

where mu_0 is the free mobility (zero gel concentration), K_R is the retardation coefficient (proportional to molecular size/Stokes radius), and T is the gel concentration. Plotting log(mu) vs. T for different species yields straight lines whose slopes indicate relative size.
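A Ferguson analysis amounts to a straight-line fit of log(mu) against T. The sketch below generates mobilities from the equation itself using assumed parameters (a K_R of 0.3 per % gel is an arbitrary choice) and then recovers them by ordinary least squares. It uses base-10 logarithms throughout; with natural logarithms K_R would differ by a factor of ln 10.

```python
import math
from typing import Sequence, Tuple


def ferguson_mobility(mu0: float, k_r: float, gel_percent: float) -> float:
    """Mobility predicted by log10(mu) = log10(mu0) - K_R * T."""
    return mu0 * 10.0 ** (-k_r * gel_percent)


def fit_ferguson(gel_percents: Sequence[float],
                 mobilities: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of log10(mu) against T; returns (mu0, K_R)."""
    ys = [math.log10(m) for m in mobilities]
    n = len(gel_percents)
    mean_t = sum(gel_percents) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in gel_percents)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(gel_percents, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return 10.0 ** intercept, -slope


if __name__ == "__main__":
    gels = [0.5, 1.0, 1.5, 2.0]  # gel concentrations in percent
    # Synthetic mobilities generated from the model, not measured data.
    mus = [ferguson_mobility(3.8e-4, 0.3, t) for t in gels]
    print(fit_ferguson(gels, mus))
```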

Derivation: Ferguson Plot — Gel Electrophoresis Mobility

Starting from the Ogston sieving model, we derive the log-linear relationship between electrophoretic mobility and gel concentration (the Ferguson plot).

Step 1: Define electrophoretic mobility

A charged macromolecule in an electric field experiences an electric force balanced by friction:

$$qE = fv \implies \mu = \frac{v}{E} = \frac{q}{f}$$

where q is net charge, E is electric field, f is frictional coefficient, and v is migration velocity.

Step 2: Free solution mobility (μ₀)

In free solution (no gel), DNA has uniform charge density (~2 phosphates per bp, each with charge −e). The free mobility μ₀ is nearly independent of DNA size because charge and friction both scale linearly with length:

$$\mu_0 = \frac{q}{f} = \frac{N \cdot q_{\text{bp}}}{N \cdot f_{\text{bp}}} = \frac{q_{\text{bp}}}{f_{\text{bp}}} \approx 3.8 \times 10^{-4}\;\text{cm}^2/\text{V·s}$$

Step 3: Ogston sieving model — gel as a random fiber network

The gel matrix creates pores of various sizes. The probability that a spherical particle of radius R_s fits through a pore in a random network of fibers at concentration T (g/100 mL) is (Ogston, 1958):

$$f_{\text{accessible}} = e^{-\pi (R_s + r_f)^2 \cdot \ell \cdot T} = e^{-K_R \cdot T}$$

where r_f is the fiber radius, ℓ is the fiber length per unit concentration, and K_R = π(R_s + r_f)²ℓ is the retardation coefficient.

Step 4: Derive the Ferguson equation

The effective mobility in the gel is the free mobility multiplied by the fraction of accessible pore space:

$$\mu = \mu_0 \cdot e^{-K_R \cdot T} \implies \log(\mu) = \log(\mu_0) - K_R \cdot T$$

Step 5: Interpret the Ferguson plot

Plotting log(μ) vs. gel concentration T for different molecules yields straight lines. The slope (−K_R) depends on molecular size, while the y-intercept gives log(μ₀), the free mobility:

$$K_R \propto R_s^2 \propto M_w^{2/3} \qquad \text{(for globular proteins)}$$

$$K_R \propto L \qquad \text{(for rod-like DNA, where } L \text{ is contour length)}$$

Step 6: SDS-PAGE size calibration

In SDS-PAGE, all proteins have the same charge-to-mass ratio (due to SDS coating), so μ₀ is constant. The mobility depends only on K_R (molecular sieving). At fixed gel concentration, log(μ) is approximately linear with log(M_w):

$$\log(\mu) = a - b \cdot \log(M_w) \qquad \text{(basis for MW determination by SDS-PAGE)}$$

This is why plotting log(M_w) of protein standards vs. relative mobility (R_f) yields a linear calibration curve for determining the molecular weight of unknown proteins.

Pulsed-Field Gel Electrophoresis (PFGE)

Standard agarose gels cannot resolve DNA fragments larger than ~50 kb (co-migration). PFGE alternates the electric field direction at defined intervals, forcing large DNA molecules to reorient before migrating. Resolves fragments from 50 kb to >10 Mb. Used for bacterial strain typing, chromosome-level karyotyping, and restriction mapping of large genomes. Switch times determine resolution range.

5. Blotting Techniques

Blotting techniques transfer separated macromolecules from a gel to a membrane (nitrocellulose or PVDF) for detection with specific probes or antibodies. Named for their targets:

Southern Blot (DNA)

Developed by Edwin Southern (1975).

Digest genomic DNA with restriction enzyme
Separate fragments by agarose gel electrophoresis
Denature DNA in gel (NaOH)
Transfer (capillary/vacuum) to nylon membrane
Hybridize with labeled probe (radioactive or DIG-labeled)
Wash at appropriate stringency, detect signal

Applications: RFLP analysis, gene copy number, transgene verification, viral integration mapping.

Northern Blot (RNA)

Named as a play on "Southern."

Extract total RNA, separate on denaturing agarose gel (formaldehyde)
Transfer to membrane
Hybridize with complementary probe
Detect specific mRNA species

Advantages over RT-qPCR: reveals transcript size, detects alternative splicing, shows degradation. Disadvantage: less sensitive, requires more RNA.

Western Blot (Protein)

Detects proteins with specific antibodies.

Separate proteins by SDS-PAGE
Transfer (wet/semi-dry) to PVDF or nitrocellulose
Block non-specific binding (BSA or milk)
Incubate with primary antibody (target-specific)
Wash, add secondary antibody (HRP or AP conjugated)
Detect by chemiluminescence (ECL), colorimetric, or fluorescence

Semi-quantitative. Loading controls: beta-actin, GAPDH, total protein stain.

Hybridization Stringency

The specificity of probe-target binding is controlled by wash stringency. Higher temperature and lower salt concentration increase stringency (more mismatches disrupted). The Tm of a probe-target hybrid depends on:

\[ T_m = 81.5 + 16.6\log_{10}[\text{Na}^+] + 0.41(\%\text{GC}) - \frac{600}{L} - 0.63(\%\text{formamide}) \]

High stringency washes (0.1x SSC, 65 degrees C) allow only perfectly matched hybrids. Low stringency (2x SSC, 42 degrees C) permits cross-species hybridization for homolog detection.
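The hybrid Tm formula can be evaluated as in the sketch below. GC content is passed as a fraction (0 to 1) and multiplied by 41, which is the same as multiplying the percentage by 0.41; [Na+] is in mol/L and formamide in percent (v/v). The 500 bp probe and the buffer conditions are illustrative assumptions.

```python
import math


def hybrid_tm(na_molar: float, gc_fraction: float, length_bp: int,
              formamide_percent: float = 0.0) -> float:
    """Tm (degrees C) of a probe-target hybrid from the empirical formula."""
    if not 0.0 <= gc_fraction <= 1.0:
        raise ValueError("gc_fraction must be between 0 and 1")
    return (81.5
            + 16.6 * math.log10(na_molar)
            + 41.0 * gc_fraction          # equals 0.41 * (%GC)
            - 600.0 / length_bp
            - 0.63 * formamide_percent)


if __name__ == "__main__":
    # Hypothetical 500 bp probe, 50% GC, 1 M Na+, with and without formamide.
    print(hybrid_tm(1.0, 0.5, 500))
    print(hybrid_tm(1.0, 0.5, 500, formamide_percent=50))
```

Lowering the salt concentration or adding formamide lowers the Tm, which is how a wash at fixed temperature becomes more stringent.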

6. CRISPR-Cas9 Gene Editing

6.1 Mechanism

CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) with Cas9 nuclease is the most revolutionary genetic tool since PCR. Adapted from a bacterial adaptive immune system, it enables precise, programmable DNA cleavage guided by a short RNA sequence. Doudna and Charpentier received the 2020 Nobel Prize for its development.

Single Guide RNA (sgRNA) Design

The sgRNA is a synthetic fusion of two natural RNA components:

Spacer (20 nt): Complementary to the target DNA. Determines specificity. Mismatches at the PAM-proximal end (seed region, positions 1-12) are less tolerated.
Scaffold (~80 nt): Derived from tracrRNA. Forms the stem-loops that bind Cas9. Invariant structure required for RNP assembly.

Target DNA: 5'-...NNNNNNNNNNNNNNNNNNNNNNGG...-3'

3'-...NNNNNNNNNNNNNNNNNNNNNNCC...-5'

^PAM

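sgRNA: 5'-NNNNNNNNNNNNNNNNNNNN-[scaffold]-3' (the 20-nt spacer is at the 5' end and base-pairs with the target strand, i.e., the strand drawn 3'-to-5' above)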

||||||||||||||||||||||

Cas9 cleaves both strands 3 bp upstream of PAM

Cas9 PAM Recognition and R-loop Formation

SpCas9 (from Streptococcus pyogenes) requires a 5'-NGG-3' PAM (protospacer adjacent motif) immediately 3' of the target sequence on the non-target strand. The mechanism proceeds:

Cas9-sgRNA complex scans DNA, recognizing PAM via the PAM-interacting domain (PID)
Local DNA unwinding creates an R-loop: sgRNA displaces the non-target strand and base-pairs with the target strand
Full R-loop formation triggers conformational change activating both nuclease domains
RuvC domain cleaves non-target strand; HNH domain cleaves target strand
Blunt-ended DSB created 3 bp upstream of PAM
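The targeting rules above (a 20-nt protospacer followed by NGG, with cleavage 3 bp upstream of the PAM) can be sketched as a simple sequence scan. This is a minimal illustration that searches only the strand given; sites on the opposite strand would be found by scanning the reverse complement. The names `CasSite` and `find_spcas9_sites` are hypothetical.

```python
import re
from typing import List, NamedTuple


class CasSite(NamedTuple):
    start: int        # 0-based index of the first protospacer base
    protospacer: str  # 20 nt, same sequence as the sgRNA spacer (with T for U)
    pam: str          # NGG
    cut_index: int    # blunt cut falls between cut_index - 1 and cut_index


def find_spcas9_sites(seq: str) -> List[CasSite]:
    """Find 20-nt protospacers followed by an NGG PAM on the given strand."""
    seq = seq.upper()
    sites = []
    # A lookahead lets overlapping candidate sites all be reported.
    for match in re.finditer(r"(?=([ACGT]{20})([ACGT]GG))", seq):
        start = match.start()
        # The PAM begins at start + 20; the cut is 3 bp upstream of it.
        sites.append(CasSite(start, match.group(1), match.group(2), start + 17))
    return sites


if __name__ == "__main__":
    demo = "A" * 20 + "TGG" + "CCC"  # made-up sequence with a single site
    for site in find_spcas9_sites(demo):
        print(site)
```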

CRISPR Efficiency Estimation

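On-target efficiency depends on guide RNA design. Sequence-based scores such as Doench et al. (2014) Rule Set 1 model cutting efficiency with a logistic regression (the later Rule Set 2 of Doench et al. (2016) uses gradient-boosted regression trees instead):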

\[ \text{Score} = \sigma\left(\sum_{i} w_i \cdot x_i + \sum_{j,k} w_{jk} \cdot x_j \cdot x_k + b\right) \]

where sigma is the logistic function, x_i are position-specific nucleotide features, w are learned weights, and the interaction terms capture dinucleotide effects. GC content between 40-70%, absence of poly-T tracts (Pol III terminator), and specific nucleotide preferences at positions -4 to +3 relative to the cut site all contribute to efficiency.

Derivation: CRISPR Guide RNA Binding Thermodynamics

Starting from the thermodynamics of RNA:DNA hybridization, we derive the free energy of Cas9-sgRNA binding to its target, explaining the seed region requirement and PAM specificity.

Step 1: Decompose the total binding free energy

The overall binding free energy of the Cas9-sgRNA complex to target DNA has four main components:

$$\Delta G_{\text{binding}} = \Delta G_{\text{PAM}} + \Delta G_{\text{seed}} + \Delta G_{\text{non-seed}} + \Delta G_{\text{unwinding}}$$

Step 2: PAM recognition energy

Cas9 first scans for the PAM (5'-NGG-3') through protein-DNA contacts (not base pairing). The PAM-interacting domain (PID) makes sequence-specific hydrogen bonds to the GG dinucleotide in the major groove:

$$\Delta G_{\text{PAM}} \approx -3\;\text{to}\;-5\;\text{kcal/mol} \qquad \text{(protein-DNA H-bonds)}$$

Step 3: Seed region hybridization (positions 1–12)

After PAM recognition, R-loop formation initiates at the PAM-proximal “seed” region. Using nearest-neighbor RNA:DNA hybridization parameters, each matched base pair contributes:

$$\Delta G_{\text{seed}} = \sum_{i=1}^{12} \Delta G_{\text{nn},i} \approx 12 \times (-1.5) = -18\;\text{kcal/mol}$$

Step 4: Cost of DNA unwinding

R-loop formation requires melting the target DNA duplex. The cost of unwinding 20 bp of dsDNA (to be replaced by the RNA:DNA hybrid) is:

$$\Delta G_{\text{unwinding}} = +\sum_{i=1}^{20}\Delta G_{\text{DNA:DNA},i} \approx +20 \times 1.5 = +30\;\text{kcal/mol}$$

Step 5: Non-seed region and full R-loop

Positions 13–20 (PAM-distal) contribute additional RNA:DNA base pairing. Mismatches here are better tolerated because the seed region has already committed the complex:

$$\Delta G_{\text{non-seed}} \approx 8 \times (-1.5) = -12\;\text{kcal/mol}$$

Step 6: Net binding energy and mismatch tolerance

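The total free energy for a perfectly matched target, including a conformational term for the Cas9 rearrangements that accompany full R-loop formation (not itemized in Step 1):

$$\Delta G_{\text{total}} = (-4) + (-18) + (-12) + (+30) + \Delta G_{\text{conformational}} \approx -12\;\text{kcal/mol}$$

The four terms from Steps 2–5 sum to −4 kcal/mol, so the quoted total corresponds to a conformational contribution of about −8 kcal/mol.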

Each mismatch in the seed region costs ~+3 kcal/mol (loss of one base pair plus disruption of stacking), while mismatches in the PAM-distal positions 13–20 cost only ~+1.5 kcal/mol. This explains why seed mismatches are far more detrimental to on-target efficiency (~100× reduction per mismatch) than PAM-distal mismatches (~3× per mismatch), and why off-target prediction algorithms weight seed positions more heavily.
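
The bookkeeping in this derivation can be checked numerically. The sketch below adds up the per-step estimates and converts a mismatch penalty into an equilibrium fold-reduction with a Boltzmann factor. The two-state equilibrium picture and the 37 °C temperature are assumptions added here. It reproduces the order of magnitude of the seed penalty but is only a rough guide to empirical fold-changes, which also reflect kinetics.

```python
import math

# Gas constant in kcal/(mol*K) times 37 degrees C in kelvin (assumed temperature).
RT_KCAL = 1.987e-3 * 310.15

# Per-step estimates taken from the derivation above.
terms = {
    "PAM": -4.0,             # midpoint of the -3 to -5 kcal/mol range
    "seed": 12 * -1.5,       # positions 1-12
    "non_seed": 8 * -1.5,    # positions 13-20
    "unwinding": 20 * 1.5,   # cost of melting 20 bp of dsDNA
}
subtotal = sum(terms.values())


def fold_reduction(ddg_kcal):
    """Equilibrium penalty exp(ddG / RT) for a destabilisation ddG > 0."""
    return math.exp(ddg_kcal / RT_KCAL)


print(f"Sum of the four listed terms: {subtotal:+.1f} kcal/mol")
for label, ddg in (("seed mismatch", 3.0), ("PAM-distal mismatch", 1.5)):
    print(f"{label}: +{ddg} kcal/mol -> ~{fold_reduction(ddg):.0f}x weaker binding")
```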

6.2 Repair Outcomes

NHEJ (Non-Homologous End Joining)

Error-prone repair that directly ligates broken ends. Active throughout the cell cycle.

Produces insertions and deletions (indels) at the cut site
Frameshift mutations typically destroy gene function (knockout)
Efficiency: 20-80% of alleles edited depending on guide and cell type
Detected by T7 Endonuclease I assay, TIDE analysis, or amplicon sequencing
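
Once indel sizes have been called from amplicon sequencing, the knockout-relevant fraction follows from simple arithmetic: an indel whose length is not a multiple of 3 shifts the reading frame. The sketch below assumes a list of per-allele indel sizes (hypothetical data) and ignores effects such as in-frame deletions of essential residues.

```python
def classify_indel(size_bp):
    """size_bp > 0 for insertions, < 0 for deletions, 0 for an unedited allele."""
    if size_bp == 0:
        return "unedited"
    return "in-frame" if size_bp % 3 == 0 else "frameshift"


def summarize(indel_sizes):
    """Editing efficiency and the frameshift share among edited alleles."""
    counts = {"unedited": 0, "in-frame": 0, "frameshift": 0}
    for size in indel_sizes:
        counts[classify_indel(size)] += 1
    edited = counts["in-frame"] + counts["frameshift"]
    return {
        "editing_efficiency": edited / len(indel_sizes),
        "frameshift_fraction_of_edited": counts["frameshift"] / edited if edited else 0.0,
    }


# Hypothetical allele calls from one amplicon-sequencing run.
alleles = [0, 0, -1, 1, -3, -7, 0, 2, -6, 0]
print(summarize(alleles))
```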

HDR (Homology-Directed Repair)

Template-guided precise repair. Requires a donor DNA with homology arms.

Restricted to S/G2 phase (requires sister chromatid or exogenous template)
Efficiency: typically 1-30% (much lower than NHEJ)
Donor templates: ssODN (for small edits, 60-200 nt) or plasmid (for large knock-ins, 500+ bp homology arms)
Enhanced by NHEJ inhibitors (SCR7, NU7441) or cell cycle synchronization

Base Editing

Precise single-nucleotide changes without DSBs:

CBE (Cytosine Base Editor): Cas9 nickase (D10A) fused to cytidine deaminase (APOBEC1) + UGI. Converts C-to-T (or G-to-A on opposite strand) within a ~5 nt editing window.
ABE (Adenine Base Editor): Cas9 nickase fused to evolved TadA adenosine deaminase. Converts A-to-G (T-to-C on opposite strand). No natural enzyme existed; created by directed evolution.
Efficiency: 20-80% at target positions. Lower off-target and indel rates than nuclease Cas9.
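
The editing-window idea can be illustrated with a short sketch. It assumes the window spans protospacer positions 4–8 counted from the PAM-distal end, a commonly quoted range that the text above does not specify, and it idealizes the outcome by converting every target base in the window (bystander edits included). The function names and example protospacer are hypothetical.

```python
def editable_positions(protospacer, target_base, window=(4, 8)):
    """1-based positions of target_base inside the editing window."""
    start, end = window
    return [
        pos
        for pos, base in enumerate(protospacer.upper(), start=1)
        if start <= pos <= end and base == target_base
    ]


def apply_edit(protospacer, target_base, new_base, window=(4, 8)):
    """Convert every target_base in the window to new_base (idealized outcome)."""
    seq = list(protospacer.upper())
    for pos in editable_positions(protospacer, target_base, window):
        seq[pos - 1] = new_base
    return "".join(seq)


guide = "GACCCATGCAAGTCTGACTG"  # hypothetical 20-nt protospacer
print("CBE targets C at:", editable_positions(guide, "C"))
print("CBE product:     ", apply_edit(guide, "C", "T"))
print("ABE product:     ", apply_edit(guide, "A", "G"))
```

Listing the positions before editing makes bystander bases visible: any additional C (or A) inside the window is a candidate for unintended conversion.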

Prime Editing

"Search-and-replace" editing enabling all 12 transition/transversion mutations, small insertions, and deletions without DSBs or donor template:

pegRNA: Extended guide RNA with a primer binding site (PBS) and RT template encoding the desired edit.
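PE2: Cas9 nickase (H840A) fused to M-MLV reverse transcriptase. Because H840A inactivates the HNH domain, the enzyme nicks the non-target (PAM-containing) strand; the freed 3' end anneals to the PBS, and RT copies the edit from the pegRNA into genomic DNA.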
PE3/PE3b: Additional nicking guide for the non-edited strand encourages mismatch repair to favor the edited strand. PE3b nicks only after editing, reducing indels.
Lower efficiency than base editing (~5-50%) but greater versatility.
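
The count of 12 point mutations, and the split between what the CBE and ABE described above can reach and what needs prime editing, can be enumerated directly. This is a small bookkeeping sketch. The variable names are illustrative.

```python
PURINES = {"A", "G"}


def classify_substitution(ref, alt):
    """Transition = purine<->purine or pyrimidine<->pyrimidine; otherwise transversion."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


# All ordered pairs of distinct bases: 4 x 3 = 12 point mutations.
substitutions = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]

# Changes reachable with the CBE (C>T, G>A) and ABE (A>G, T>C) described above.
base_editor_accessible = {("C", "T"), ("G", "A"), ("A", "G"), ("T", "C")}
prime_editing_only = [s for s in substitutions if s not in base_editor_accessible]

for ref, alt in substitutions:
    route = "base editor" if (ref, alt) in base_editor_accessible else "prime editing"
    print(f"{ref}>{alt}: {classify_substitution(ref, alt):12s} via {route}")
```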

6.3 Applications

Basic Research

Gene knockout libraries (genome-wide screens)
CRISPRi/CRISPRa (dCas9 for transcription repression/activation)
Lineage tracing (evolving barcodes)
Disease model generation (mouse, organoid, iPSC)

Gene Therapy

Sickle cell disease: Casgevy (exa-cel) - first approved CRISPR therapy (2023). Edits BCL11A enhancer in HSCs to reactivate fetal hemoglobin.
CAR-T cells: CRISPR knockout of PD-1, TRAC, B2M for universal allogeneic CAR-T. Reduced graft-vs-host disease.
In vivo editing: Liver-targeted LNP delivery for transthyretin amyloidosis (NTLA-2001, Intellia).

Gene Drives

Super-Mendelian inheritance (>50% transmission)
CRISPR-based gene drive converts heterozygotes to homozygotes
Potential: suppress malaria mosquitoes (Anopheles), invasive species control
Ethical concerns: irreversibility, ecosystem effects, biosecurity
Daisy-chain drives proposed as self-limiting safeguard
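
Super-Mendelian inheritance can be illustrated with a deterministic allele-frequency recursion. The model below is a deliberately simple sketch with assumptions not stated in the text: random mating, an effectively infinite population, no fitness cost, and no resistance alleles. With conversion efficiency c, heterozygotes transmit the drive allele with probability (1 + c)/2, which gives q' = q + c·q·(1 − q).

```python
def drive_frequency(q0, conversion, generations):
    """Drive-allele frequency per generation under q' = q + c*q*(1 - q).

    conversion = 0 recovers Mendelian inheritance (frequency stays constant);
    conversion = 1 means every heterozygote is converted to a homozygote.
    """
    freqs = [q0]
    for _ in range(generations):
        q = freqs[-1]
        freqs.append(q + conversion * q * (1.0 - q))
    return freqs


# Release at 1% frequency; compare Mendelian inheritance with a 90% efficient drive.
mendelian = drive_frequency(0.01, 0.0, 10)
drive = drive_frequency(0.01, 0.9, 10)
for gen, (m, d) in enumerate(zip(mendelian, drive)):
    print(f"generation {gen:2d}: Mendelian {m:.3f}   drive {d:.3f}")
```

Under these idealized assumptions the drive allele approaches fixation within roughly ten generations, which is the behavior behind both the promise and the irreversibility concern listed above.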

7. Protein Analysis and Structural Biology

7.1 Mass Spectrometry

Mass spectrometry (MS) is the premier technology for protein identification, quantification, and post-translational modification mapping. It measures the mass-to-charge ratio (m/z) of ionized molecules with extraordinary precision.

Ionization Methods

MALDI-TOF: Matrix-Assisted Laser Desorption/Ionization with Time-of-Flight analyzer. Sample co-crystallized with matrix (sinapinic acid, CHCA). Pulsed laser causes matrix ablation and analyte ionization. Predominantly singly-charged ions. Fast, tolerant of salts/detergents. Used for peptide mass fingerprinting, intact protein mass, microbial identification (MALDI Biotyper).
ESI: Electrospray Ionization. Continuous flow from liquid chromatography. Produces multiply-charged ions enabling analysis of very large proteins. Coupled to various analyzers (quadrupole, Orbitrap, TOF).
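
The effect of multiple charging in ESI is easy to see numerically: a protein of neutral mass M carrying z protons appears at m/z = (M + z·m_p)/z, so large proteins fall within the range of ordinary analyzers. The sketch below also recovers charge and mass from two adjacent charge-state peaks. The 16,951 Da example mass and the function names are hypothetical.

```python
PROTON_MASS = 1.007276  # Da


def mz(neutral_mass, charge):
    """m/z of an [M + zH]z+ ion."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def neutral_mass_from_adjacent_peaks(mz_low, mz_high):
    """Deconvolute two adjacent charge states.

    mz_high carries charge z and mz_low carries z + 1, which gives
    z = (mz_low - m_p) / (mz_high - mz_low) and M = z * (mz_high - m_p).
    """
    z = round((mz_low - PROTON_MASS) / (mz_high - mz_low))
    return z, z * (mz_high - PROTON_MASS)


mass = 16951.0  # hypothetical protein mass in Da
for z in range(8, 13):
    print(f"z = {z:2d}  m/z = {mz(mass, z):9.3f}")

z, recovered = neutral_mass_from_adjacent_peaks(mz(mass, 11), mz(mass, 10))
print(f"recovered charge {z}, neutral mass {recovered:.1f} Da")
```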

Tandem MS (MS/MS)

Two stages of mass analysis for protein sequencing:

First MS (MS1): Measures intact peptide ions (precursor ions)
Fragmentation: Collision-induced dissociation (CID), higher-energy collisional dissociation (HCD), or electron-transfer dissociation (ETD) breaks peptide bonds
Second MS (MS2): Measures fragment ions (b/y ion series from CID/HCD; c/z ion series from ETD)
Database search (Mascot, MaxQuant, Proteome Discoverer) matches spectra to peptide sequences
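
What a database search matches against can be sketched by computing the theoretical singly charged b and y ions of a peptide from monoisotopic residue masses. This minimal version assumes unmodified residues and charge 1+ only. Real search engines also handle modifications, higher charge states, and neutral losses.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def precursor_mh(peptide):
    """Monoisotopic [M+H]+ of the intact peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def fragment_ions(peptide):
    """Singly charged b ions (N-terminal) and y ions (C-terminal)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


peptide = "PEPTIDE"
b_ions, y_ions = fragment_ions(peptide)
print(f"[M+H]+ = {precursor_mh(peptide):.4f}")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:9.4f}    y{i} = {y:9.4f}")
```

Complementary fragments satisfy b_i + y_(n−i) = [M+H]+ + m_p, which is a handy consistency check on an annotated spectrum.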

Data-independent acquisition (DIA) fragments all peptides in defined m/z windows, enabling deeper and more reproducible proteome coverage than traditional data-dependent acquisition (DDA).

7.2 Structural Biology Methods

X-ray Crystallography

The workhorse of structural biology (>85% of all PDB structures). Requires protein crystallization, which remains the major bottleneck. X-rays (wavelength ~1 Angstrom) diffract off the electron cloud of atoms in the crystal lattice. Bragg's law governs diffraction:

\[ n\lambda = 2d\sin\theta \]

The "phase problem" (diffraction patterns record intensities but not phases) is solved by molecular replacement, isomorphous replacement (heavy atom derivatives), or anomalous dispersion (SAD/MAD). Resolution: typically 1.5-3.0 Angstrom. Synchrotron radiation and micro-focus beams enable structure determination from crystals as small as 5 micrometers.
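
Bragg's law can be rearranged to relate a reflection's scattering angle to the lattice spacing it reports on. The helper functions below are an illustrative sketch with hypothetical names. Wavelength and spacing share one length unit (Angstrom here) and the angle theta is in degrees.

```python
import math


def bragg_d_spacing(wavelength, theta_deg, order=1):
    """Lattice spacing d from n*lambda = 2*d*sin(theta)."""
    return order * wavelength / (2.0 * math.sin(math.radians(theta_deg)))


def bragg_angle(wavelength, d_spacing, order=1):
    """Angle theta in degrees at which spacing d diffracts."""
    return math.degrees(math.asin(order * wavelength / (2.0 * d_spacing)))


wavelength = 1.0  # Angstrom
for d in (3.0, 2.0, 1.5):
    print(f"d = {d:.1f} A diffracts at theta = {bragg_angle(wavelength, d):.2f} degrees")
```

Smaller spacings diffract to higher angles, which is why high-resolution data sit at the edge of the detector, and why d can never be resolved below half the wavelength (theta = 90 degrees).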

Cryo-Electron Microscopy (Cryo-EM)

The "resolution revolution" (Nobel Prize 2017 to Dubochet, Frank, Henderson). No crystallization required.

Single-particle analysis: Purified protein flash-frozen in vitreous ice. TEM images thousands of individual particles in random orientations. 2D class averaging and 3D reconstruction yield near-atomic resolution structures.
Resolution: Now routinely reaches 2-3 Angstrom, with exceptional cases below 1.5 Angstrom. Direct electron detectors (K3, Falcon 4) and energy filters were transformative.
Advantages: Works with heterogeneous samples, can resolve multiple conformational states, requires less protein (~1 microgram), no crystal packing artifacts.
Cryo-ET: Cryo-electron tomography images proteins in situ within cells. Tilt series reconstructed into 3D tomograms. Sub-tomogram averaging for molecular resolution.

NMR Spectroscopy

Nuclear Magnetic Resonance determines structure in solution, capturing dynamics and conformational exchange. Proteins must be isotopically labeled (15N, 13C). HSQC (Heteronuclear Single Quantum Coherence) spectra give one peak per amide N-H, fingerprinting the protein. NOE (Nuclear Overhauser Effect) provides distance restraints (<5 Angstrom). Limited to proteins <~40 kDa (TROSY extends to ~100 kDa). Uniquely reveals microsecond-millisecond dynamics, binding kinetics, and intrinsically disordered regions.

Circular Dichroism (CD)

Measures differential absorption of left- and right-circularly polarized UV light. Secondary structure elements have characteristic CD spectra: alpha-helices show minima at 208 and 222 nm; beta-sheets at 218 nm. Quick assessment of secondary structure content (~5 min), folding state, thermal stability (Tm determination by monitoring 222 nm ellipticity vs. temperature), and conformational changes upon ligand binding. Requires only ~0.1-0.5 mg protein.

8. Modern Omics Technologies

8.1 The Omics Landscape

Modern biology increasingly takes a systems-level view, profiling entire classes of biomolecules simultaneously rather than studying individual genes or proteins in isolation. These "omics" approaches generate massive datasets that require bioinformatics and machine learning for interpretation.

Genomics

Study of entire genomes. Enabled by NGS. Key applications include whole-genome sequencing for variant discovery (SNPs, indels, SVs, CNVs), genome-wide association studies (GWAS) linking variants to phenotypes, population genomics, pharmacogenomics (drug response prediction), and comparative genomics across species. The 1000 Genomes Project cataloged >88 million variants across 2,504 individuals from 26 populations.

Transcriptomics (RNA-seq)

Quantifies all RNA species in a sample. Workflow: RNA extraction, poly(A) selection or rRNA depletion, fragmentation, cDNA synthesis, library prep, sequencing, alignment (STAR/HISAT2), quantification (featureCounts/Salmon), differential expression (DESeq2/edgeR). Normalized as FPKM, TPM, or CPM. Reveals gene expression changes, alternative splicing, fusion transcripts, allele-specific expression, and non-coding RNA.
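
The difference between CPM and TPM comes down to whether counts are divided by transcript length before scaling to one million. The sketch below uses a hypothetical three-gene count table. The gene names and lengths are made up for illustration.

```python
def cpm(counts):
    """Counts per million: scale raw counts by library size only."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    scale = sum(rates.values())
    return {gene: r / scale * 1e6 for gene, r in rates.items()}


counts = {"geneA": 100, "geneB": 100, "geneC": 800}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 4000}

cpm_values = cpm(counts)
tpm_values = tpm(counts, lengths)
for gene in counts:
    print(f"{gene}: CPM = {cpm_values[gene]:9.1f}   TPM = {tpm_values[gene]:9.1f}")
```

geneA and geneB have identical counts and therefore identical CPM, but geneA's TPM is twice geneB's because the same number of reads comes from a transcript half as long.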

Proteomics

Large-scale study of all proteins in a sample. Bottom-up proteomics: digest proteins with trypsin, analyze peptides by LC-MS/MS. Top-down proteomics: analyze intact proteins. Quantification: label-free (intensity-based), TMT/iTRAQ (isobaric mass tags for multiplexed quantification), or SILAC (metabolic labeling with heavy amino acids). Post-translational modification (PTM) profiling: phosphoproteomics (TiO2/IMAC enrichment), ubiquitinomics, glycoproteomics, acetylomics.

Metabolomics

Comprehensive analysis of small molecule metabolites (<1500 Da). Platforms: LC-MS (broadest coverage), GC-MS (volatile metabolites), NMR (quantitative, non-destructive). Untargeted metabolomics discovers novel biomarkers; targeted metabolomics quantifies known pathways. Applications: disease diagnosis, drug metabolism (ADME), nutritional studies, microbiome metabolite profiling (short-chain fatty acids, bile acids).

8.2 Single-Cell and Spatial Omics

Traditional omics average signals across millions of cells, masking cellular heterogeneity. Single-cell and spatial technologies resolve individual cells and their spatial context.

Single-Cell RNA-seq (scRNA-seq)

Profiles the transcriptome of individual cells. Droplet-based methods (10x Genomics Chromium) encapsulate single cells with barcoded gel beads in nanoliter droplets. Each cell gets a unique barcode; each mRNA molecule gets a unique molecular identifier (UMI) so that PCR duplicates can be collapsed, correcting for amplification bias. Typical experiment: 5,000-50,000 cells, detecting 2,000-5,000 genes per cell.
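
The role of cell barcodes and UMIs can be shown with a toy counting step: reads sharing the same cell barcode, gene, and UMI are treated as PCR copies of one molecule and counted once. This sketch assumes error-free barcodes and UMIs. Production pipelines also correct sequencing errors in both. The read tuples are hypothetical.

```python
from collections import defaultdict


def umi_counts(reads):
    """Collapse reads into a {cell: {gene: unique_UMI_count}} matrix.

    reads is an iterable of (cell_barcode, umi, gene) tuples.
    """
    seen = defaultdict(set)
    for cell, umi, gene in reads:
        seen[(cell, gene)].add(umi)  # duplicates of a UMI collapse in the set
    matrix = defaultdict(dict)
    for (cell, gene), umis in seen.items():
        matrix[cell][gene] = len(umis)
    return dict(matrix)


reads = [
    ("AAAC", "UMI1", "ACTB"),
    ("AAAC", "UMI1", "ACTB"),   # PCR duplicate of the read above
    ("AAAC", "UMI2", "ACTB"),
    ("AAAC", "UMI1", "GAPDH"),
    ("TTTG", "UMI1", "ACTB"),
]
print(umi_counts(reads))
```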

Analysis pipeline: Cell Ranger (alignment/counting), Seurat or Scanpy (normalization, dimensionality reduction via PCA/UMAP, clustering, differential expression, trajectory inference, RNA velocity). Applications: cell atlas projects (Human Cell Atlas), tumor heterogeneity, developmental biology, immune cell profiling.

Single-Cell Multi-omics

Simultaneously measure multiple modalities from the same single cell:

CITE-seq: scRNA-seq + surface protein (antibody-derived tags). Combines transcriptome with >200 protein markers per cell.
10x Multiome: Joint scRNA-seq + scATAC-seq from the same nucleus. Links gene expression to chromatin accessibility.
SHARE-seq: Chromatin accessibility + gene expression. Reveals regulatory element-gene linkages.
scCUT&Tag: Single-cell histone modification profiling. Maps epigenomic landscapes per cell.
Perturb-seq: CRISPR perturbations + scRNA-seq readout. Genome-scale screens with rich phenotypic readout.

Spatial Transcriptomics

Preserves the spatial location of gene expression within tissue sections. Named "Method of the Year 2020" by Nature Methods.

MERFISH (Imaging-based)

Multiplexed Error-Robust FISH. Combinatorial barcoding of RNA species with sequential rounds of hybridization and imaging. Subcellular resolution. Detects 100-10,000 genes per experiment. Error-correcting Hamming codes enable reliable identification. Vizgen MERSCOPE platform.
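
Error-robust decoding can be illustrated with a toy codebook. The four 8-bit barcodes below are hypothetical and were chosen so that every pair differs in at least 4 bits. With that spacing a single misread bit still has a unique nearest codeword, while a double error is rejected and not misassigned. Real MERFISH codebooks are longer and larger.

```python
# Hypothetical codebook: pairwise Hamming distance >= 4.
CODEBOOK = {
    "GENE_A": "11110000",
    "GENE_B": "00001111",
    "GENE_C": "11001100",
    "GENE_D": "10101010",
}


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(barcode, codebook=CODEBOOK, max_errors=1):
    """Return the gene whose codeword is within max_errors bits, else None."""
    hits = [gene for gene, code in codebook.items() if hamming(barcode, code) <= max_errors]
    return hits[0] if len(hits) == 1 else None


for observed in ("11110000", "11110100", "11111100"):
    print(observed, "->", decode(observed))
```

Each bit corresponds to one hybridization round (signal on or off), so the three examples represent a clean read, one missed or spurious round, and two erroneous rounds.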

Visium (Sequencing-based)

10x Genomics Visium: tissue section placed on a slide with spatially barcoded oligo capture spots (55 micrometer diameter, ~5,000 spots). mRNA captured, reverse-transcribed, and sequenced with spatial barcodes. Whole transcriptome but ~5-10 cell resolution. Visium HD achieves 2-micrometer (single-cell) resolution.

Computational Lab: PCR Amplification Simulation

This Python simulation models PCR amplification cycle-by-cycle with realistic sub-100% efficiency, primer depletion, and product saturation. It computes the amplification curve, determines the qPCR Ct value, and performs Delta-Ct quantification analysis. The output demonstrates the three phases of PCR (exponential, linear, plateau) and how real-world amplification diverges from the ideal 2^n model.

Python: PCR Amplification & qPCR Ct Analysis

#!/usr/bin/env python3
"""pcr_simulation.py - PCR amplification curve with efficiency modeling and qPCR Ct analysis"""

import math

# ============================================================
# PCR Amplification Simulation with Primer Depletion
# ============================================================
# Model: N(c) = N0 * product(1 + E(c)) for each cycle c
# Efficiency drops as primers deplete and products accumulate

N0 = 100                  # Initial template copy number
total_primers = 2e9       # Total primer molecules available
primer_per_copy = 2       # Primers consumed per new copy
max_efficiency = 0.95     # Max amplification efficiency
n_cycles = 40             # Total PCR cycles
plateau_copies = 1e10     # Carrying capacity (reagent limitation)

# Ct threshold for qPCR
threshold = 1e6           # Fluorescence threshold (proportional to copy number)

copies = [N0]
efficiencies = [0.0]
primers_remaining = total_primers

for cycle in range(1, n_cycles + 1):
    current = copies[-1]
    # Efficiency decreases with primer depletion and product accumulation
    primer_factor = primers_remaining / total_primers
    product_factor = 1.0 - (current / plateau_copies)
    if product_factor < 0:
        product_factor = 0.0
    eff = max_efficiency * primer_factor * product_factor
    if eff < 0:
        eff = 0.0

    new_copies = current * (1.0 + eff)
    primers_used = (new_copies - current) * primer_per_copy
    primers_remaining = max(0, primers_remaining - primers_used)

    copies.append(new_copies)
    efficiencies.append(eff)

# Determine Ct value (cycle where copies cross threshold)
ct_value = None
for i, c in enumerate(copies):
    if c >= threshold:
        ct_value = i
        break

# Ideal amplification (100% efficiency) for comparison
ideal_copies = [N0 * (2 ** c) for c in range(n_cycles + 1)]
# Cap ideal at plateau
ideal_copies = [min(c, plateau_copies * 2) for c in ideal_copies]

# ============================================================
# Output Results
# ============================================================
print("=" * 65)
print("       PCR AMPLIFICATION SIMULATION RESULTS")
print("=" * 65)
print(f"  Initial template copies (N0):  {N0}")
print(f"  Max efficiency:                {max_efficiency:.0%}")
print(f"  Total cycles:                  {n_cycles}")
print(f"  Primer pool:                   {total_primers:.2e}")
print(f"  Plateau capacity:              {plateau_copies:.2e}")
print("-" * 65)

print("\n  Cycle-by-Cycle Amplification:")
print(f"  {'Cycle':>6}  {'Copies':>14}  {'Efficiency':>10}  {'Log10(copies)':>14}")
print(f"  {'-'*6}  {'-'*14}  {'-'*10}  {'-'*14}")
for i in range(0, n_cycles + 1, 2):
    log_c = math.log10(copies[i]) if copies[i] > 0 else 0
    print(f"  {i:>6}  {copies[i]:>14.2e}  {efficiencies[i]:>10.4f}  {log_c:>14.2f}")

print()
if ct_value is not None:
    print(f"  qPCR Ct value (threshold = {threshold:.0e}):  Cycle {ct_value}")
    # DeltaCt example
    ct_reference = 15  # hypothetical reference gene Ct
    delta_ct = ct_value - ct_reference
    fold_change = 2 ** (-delta_ct)
    print(f"  Reference gene Ct:             {ct_reference}")
    print(f"  Delta-Ct:                      {delta_ct}")
    print(f"  Fold change (2^-dCt):          {fold_change:.4f}")
else:
    print(f"  Threshold {threshold:.0e} NOT reached in {n_cycles} cycles.")

print()
print("  Amplification Curve (ASCII plot - log scale):")
print(f"  {'':>6}  {'|'}")
max_log = math.log10(max(copies))
for i in range(0, n_cycles + 1, 2):
    log_c = math.log10(copies[i]) if copies[i] > 0 else 0
    bar_len = int(40 * log_c / max_log) if max_log > 0 else 0
    # Mark the first displayed (even) cycle at or after Ct; works even if Ct is 0
    marker = " <-- Ct" if ct_value is not None and i in (ct_value, ct_value + 1) else ""
    print(f"  C{i:>3}  |{'#' * bar_len} {copies[i]:.1e}{marker}")
print(f"  {'':>6}  +{'-' * 50}")
print(f"  {'':>9}0{'':>19}log10(copies) -> {max_log:.1f}")

print()
print("  Phase Analysis:")
lag_end = 0
exp_end = 0
for i in range(1, len(efficiencies)):
    if efficiencies[i] > 0.5 * max_efficiency and lag_end == 0:
        lag_end = i
    if efficiencies[i] < 0.5 * max_efficiency and lag_end > 0 and exp_end == 0:
        exp_end = i
        break
if exp_end == 0:
    exp_end = n_cycles
print(f"  Exponential phase: Cycles {lag_end} - {exp_end}")
print(f"  Plateau reached at ~{copies[exp_end]:.2e} copies")
print(f"  Final copy number: {copies[-1]:.2e}")
print(f"  Theoretical max (N0 * 2^{n_cycles}): {N0 * 2**n_cycles:.2e}")
print(f"  Actual/Theoretical ratio: {copies[-1]/(N0 * 2**n_cycles):.6e}")
print("=" * 65)

Computational Lab: Sanger Sequencing Simulation

This Fortran program simulates the Sanger dideoxy chain-termination method. It generates a random DNA template sequence, performs four termination reactions (one per ddNTP), collects all terminated fragments, sorts them by length (simulating capillary electrophoresis), and reads off the terminal bases to reconstruct the complement strand. The simulation demonstrates how Sanger sequencing determines DNA sequence from a nested set of terminated fragments.

Fortran: Sanger Sequencing Simulation

Chain-termination sequencing with fragment sorting and sequence reconstruction

program sanger_sequencing_simulation
  ! ================================================================
  ! Sanger Sequencing Simulation
  ! Generates a random DNA template, produces chain-terminated
  ! fragments of different lengths, then sorts them to reconstruct
  ! the sequence - mimicking capillary electrophoresis readout.
  ! ================================================================
  implicit none

  integer, parameter :: TEMPLATE_LEN = 120
  integer, parameter :: MAX_FRAGMENTS = 500
  integer, parameter :: NUM_REACTIONS = 4  ! A, C, G, T terminators

  character(len=TEMPLATE_LEN) :: template_seq, complement_seq
  character(len=TEMPLATE_LEN) :: reconstructed
  character(len=1) :: bases(4)
  character(len=1) :: terminator_base

  ! Fragment storage: length and terminating base
  integer :: frag_lengths(MAX_FRAGMENTS)
  character(len=1) :: frag_terminators(MAX_FRAGMENTS)
  integer :: num_frags, total_frags

  ! Sorting arrays
  integer :: sorted_lengths(MAX_FRAGMENTS)
  character(len=1) :: sorted_bases(MAX_FRAGMENTS)

  integer :: i, j, k, pos, temp_len, seed_val
  character(len=1) :: temp_base
  real :: r
  integer :: counts(4)

  bases(1) = 'A'
  bases(2) = 'C'
  bases(3) = 'G'
  bases(4) = 'T'

  ! Initialize random seed
  seed_val = 42
  call simple_seed(seed_val)

  ! Generate random template sequence
  do i = 1, TEMPLATE_LEN
    call simple_random(r)
    j = int(r * 4.0) + 1
    if (j > 4) j = 4
    template_seq(i:i) = bases(j)
  end do

  ! Generate complement
  do i = 1, TEMPLATE_LEN
    select case(template_seq(i:i))
      case('A'); complement_seq(i:i) = 'T'
      case('T'); complement_seq(i:i) = 'A'
      case('C'); complement_seq(i:i) = 'G'
      case('G'); complement_seq(i:i) = 'C'
    end select
  end do

  write(*,'(A)') '================================================================'
  write(*,'(A)') '         SANGER SEQUENCING SIMULATION'
  write(*,'(A)') '================================================================'
  write(*,'(A)') ''
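  ! The template is labeled 3'->5' so that the newly synthesized strand,
  ! built position by position below, reads 5'->3' (strands are antiparallel)
  write(*,'(A,I4,A)') '  Template (', TEMPLATE_LEN, ' nt):'
  write(*,'(A,A)')    '  3-', template_seq(1:60)
  write(*,'(A,A,A)')  '    ', template_seq(61:TEMPLATE_LEN), '-5'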
  write(*,'(A)') ''

  ! Simulate 4 termination reactions (ddATP, ddCTP, ddGTP, ddTTP)
  total_frags = 0

  write(*,'(A)') '  Chain Termination Reactions:'
  write(*,'(A)') '  ----------------------------'

  do k = 1, NUM_REACTIONS
    terminator_base = bases(k)
    num_frags = 0

    ! For each position matching the terminator base, create a fragment
    do i = 1, TEMPLATE_LEN
      ! The new strand is complement of template, so termination
      ! occurs where the complement matches the ddNTP
      if (complement_seq(i:i) == terminator_base) then
        total_frags = total_frags + 1
        if (total_frags <= MAX_FRAGMENTS) then
          frag_lengths(total_frags) = i
          frag_terminators(total_frags) = terminator_base
          num_frags = num_frags + 1
        end if
      end if
    end do

    write(*,'(A,A,A,I4,A)') '  dd', terminator_base, &
      'TP reaction: ', num_frags, ' fragments generated'
  end do

  write(*,'(A,I4,A)') '  Total fragments: ', total_frags, ''
  write(*,'(A)') ''

  ! Sort fragments by length (bubble sort - simulating gel separation)
  write(*,'(A)') '  Capillary Electrophoresis (sorting by size)...'

  ! Copy to sorting arrays
  do i = 1, total_frags
    sorted_lengths(i) = frag_lengths(i)
    sorted_bases(i) = frag_terminators(i)
  end do

  ! Bubble sort by fragment length (ascending = smallest migrates fastest)
  do i = 1, total_frags - 1
    do j = 1, total_frags - i
      if (sorted_lengths(j) > sorted_lengths(j+1)) then
        temp_len = sorted_lengths(j)
        sorted_lengths(j) = sorted_lengths(j+1)
        sorted_lengths(j+1) = temp_len
        temp_base = sorted_bases(j)
        sorted_bases(j) = sorted_bases(j+1)
        sorted_bases(j+1) = temp_base
      end if
    end do
  end do

  ! Reconstruct sequence from sorted fragments
  write(*,'(A)') ''
  write(*,'(A)') '  Electropherogram Readout (first 60 bases):'
  write(*,'(A)') '  Length | Terminator | Reconstructed Base'
  write(*,'(A)') '  -------+------------+-------------------'

  reconstructed = ' '
  do i = 1, min(total_frags, TEMPLATE_LEN)
    pos = sorted_lengths(i)
    if (pos <= TEMPLATE_LEN) then
      reconstructed(pos:pos) = sorted_bases(i)
      if (pos <= 60) then
        ! A1 (not A4) keeps the base adjacent to the "dd" prefix, e.g. ddA
        write(*,'(A,I5,A,A1,A,A,A)') '  ', pos, &
          '    |    dd', sorted_bases(i), &
          '     | ', sorted_bases(i), ''
      end if
    end if
  end do

  write(*,'(A)') ''
  write(*,'(A)') '  Reconstructed Sequence (new strand, first 80 bp):'
  write(*,'(A,A)') '  5-', reconstructed(1:80)
  write(*,'(A)') ''
  write(*,'(A)') '  Expected Complement   (first 80 bp):'
  write(*,'(A,A)') '  5-', complement_seq(1:80)
  write(*,'(A)') ''

  ! Verify accuracy
  counts = 0
  j = 0  ! matches
  k = 0  ! gaps (positions not covered)
  do i = 1, TEMPLATE_LEN
    if (reconstructed(i:i) == ' ') then
      k = k + 1
    else if (reconstructed(i:i) == complement_seq(i:i)) then
      j = j + 1
    end if
    ! Count base composition
    select case(complement_seq(i:i))
      case('A'); counts(1) = counts(1) + 1
      case('C'); counts(2) = counts(2) + 1
      case('G'); counts(3) = counts(3) + 1
      case('T'); counts(4) = counts(4) + 1
    end select
  end do

  write(*,'(A)') '  Sequence Statistics:'
  write(*,'(A,I4,A,I4)') '  Correct bases:    ', j, ' / ', TEMPLATE_LEN
  write(*,'(A,I4)')       '  Uncovered gaps:   ', k
  write(*,'(A,F6.1,A)')   '  Coverage:         ', &
    100.0 * real(TEMPLATE_LEN - k) / real(TEMPLATE_LEN), '%'
  write(*,'(A,F6.1,A)')   '  Accuracy:         ', &
    100.0 * real(j) / real(TEMPLATE_LEN - k), '%'
  write(*,'(A)') ''
  write(*,'(A)') '  Base Composition (complement strand):'
  write(*,'(A,I4,A,F5.1,A)') '    A: ', counts(1), &
    ' (', 100.0*real(counts(1))/real(TEMPLATE_LEN), '%)'
  write(*,'(A,I4,A,F5.1,A)') '    C: ', counts(2), &
    ' (', 100.0*real(counts(2))/real(TEMPLATE_LEN), '%)'
  write(*,'(A,I4,A,F5.1,A)') '    G: ', counts(3), &
    ' (', 100.0*real(counts(3))/real(TEMPLATE_LEN), '%)'
  write(*,'(A,I4,A,F5.1,A)') '    T: ', counts(4), &
    ' (', 100.0*real(counts(4))/real(TEMPLATE_LEN), '%)'
  write(*,'(A,F5.1,A)') '    GC content: ', &
    100.0*real(counts(2)+counts(3))/real(TEMPLATE_LEN), '%'
  write(*,'(A)') ''
  write(*,'(A)') '================================================================'

contains

  subroutine simple_seed(s)
    integer, intent(in) :: s
    seed_val = s
  end subroutine

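  subroutine simple_random(r)
    real, intent(out) :: r
    integer, parameter :: i8 = selected_int_kind(18)
    ! Linear congruential generator; the 64-bit intermediate avoids
    ! signed overflow of the default-integer product
    seed_val = int(mod(int(seed_val, i8) * 1103515245_i8 + 12345_i8, 2147483647_i8))
    r = real(mod(seed_val, 10000)) / 10000.0
  end subroutine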

end program sanger_sequencing_simulation

Key Equations Summary

PCR Amplification (with efficiency E):

\[ N_n = N_0 (1+E)^n, \quad 0 \le E \le 1 \]

Primer Tm (nearest-neighbor):

\[ T_m = \frac{\Delta H}{\Delta S + R\ln(C_T/4)} - 273.15 \]

Electrophoretic Mobility (Ferguson relation; here T is the total gel concentration, not temperature):

\[ \mu = \frac{q}{f}, \quad \log\mu = \log\mu_0 - K_R T \]

Phred Quality Score / Sequencing Error:

\[ Q = -10\log_{10}(P_e), \quad P_e = 10^{-Q/10} \]

qPCR Relative Quantification:

\[ \text{Fold change} = 2^{-\Delta\Delta C_t} \]

Restriction Site Frequency:

\[ \text{Expected frequency} = \frac{1}{4^n} \text{ per bp} \]

Bragg's Law (X-ray Crystallography):

\[ n\lambda = 2d\sin\theta \]
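
Several of the summary equations are one-liners in code, which makes them easy to sanity-check. The helper names below are illustrative, and the restriction-site estimate assumes a random sequence with equal base frequencies.

```python
import math


def pcr_copies(n0, efficiency, cycles):
    """N_n = N_0 * (1 + E)^n with 0 <= E <= 1."""
    return n0 * (1 + efficiency) ** cycles


def phred_to_error(q):
    """P_e = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p_error):
    """Q = -10 * log10(P_e)."""
    return -10 * math.log10(p_error)


def expected_site_spacing(site_length):
    """Average distance in bp between occurrences of an n-bp site: 4^n."""
    return 4 ** site_length


print(f"10 cycles at E = 1.0 from 100 copies: {pcr_copies(100, 1.0, 10):.0f}")
print(f"10 cycles at E = 0.9 from 100 copies: {pcr_copies(100, 0.9, 10):.0f}")
print(f"Q30 error probability: {phred_to_error(30):.4f}")
print(f"6-bp cutter, expected spacing: {expected_site_spacing(6)} bp")
```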

Congratulations!

You have completed the Molecular Biology course. You now have a comprehensive foundation in DNA structure, replication, repair, transcription, translation, gene regulation, recombination, and the full spectrum of laboratory techniques from classical cloning to CRISPR genome editing and single-cell multi-omics. These tools and concepts form the backbone of modern biomedical research, clinical diagnostics, and biotechnology.
