Part 10: Molecular Biology Techniques
Tools of the Trade
Modern molecular biology is built upon a powerful arsenal of techniques that allow researchers to isolate, manipulate, amplify, sequence, and edit nucleic acids and proteins. From the earliest restriction enzyme digests in the 1970s to today's single-cell multi-omics and CRISPR-based gene editing, these methods have transformed our understanding of life at the molecular level and enabled breakthroughs in medicine, agriculture, forensics, and biotechnology.
This chapter provides a comprehensive survey of the major techniques in molecular biology, covering the principles, reagents, applications, and quantitative foundations behind each method. We progress from classical cloning and electrophoresis through PCR and sequencing to modern genome editing and omics technologies.
1. DNA Cloning
1.1 Restriction Enzymes
Restriction endonucleases are bacterial defense enzymes that cleave DNA at specific recognition sequences, typically 4-8 bp palindromes. Type II restriction enzymes are the workhorses of molecular cloning because they cut at defined positions within or adjacent to their recognition sites, producing reproducible fragments.
Sticky-End Cutters
Generate single-stranded overhangs (cohesive ends) that facilitate directional cloning through complementary base pairing.
EcoRI: 5'-G|AATTC-3' (4-nt 5' overhang)
3'-CTTAA|G-5'
BamHI: 5'-G|GATCC-3' (4-nt 5' overhang)
3'-CCTAG|G-5'
HindIII: 5'-A|AGCTT-3' (4-nt 5' overhang)
3'-TTCGA|A-5'
Blunt-End Cutters
Cut both strands at the same position, producing flush ends without overhangs. Less efficient ligation but allows joining of any two blunt-ended fragments.
SmaI: 5'-CCC|GGG-3'
3'-GGG|CCC-5'
EcoRV: 5'-GAT|ATC-3'
3'-CTA|TAG-5'
Key Concepts
- Palindromic recognition: Most Type II enzymes recognize sequences that read the same on both strands in the 5'-to-3' direction (e.g., EcoRI: 5'-GAATTC-3' / 3'-CTTAAG-5').
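- Isoschizomers: Different enzymes recognizing the same sequence (e.g., HpaII and MspI both recognize CCGG). Neoschizomers recognize the same sequence but cut at different positions (e.g., SmaI and XmaI at CCCGGG). Enzymes with different recognition sites that produce compatible cohesive ends, such as BamHI and BglII, are isocaudomers.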
- Methylation sensitivity: Many restriction enzymes are blocked by DNA methylation (Dam: GATC, Dcm: CCWGG). This is exploited in bacterial defense (R-M systems) and must be considered when cloning from methylation-positive hosts.
- Star activity: Under suboptimal conditions (low ionic strength, high glycerol, excessive enzyme), some enzymes exhibit relaxed specificity, cutting at non-canonical sites.
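The palindromic-recognition concept can be checked mechanically: a site is palindromic in this sense when it equals its own reverse complement. The following is a minimal sketch; the helper names `reverse_complement` and `is_palindromic` are illustrative, not from any particular library.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(site: str) -> bool:
    """True if the site reads the same 5'->3' on both strands."""
    return site.upper() == reverse_complement(site)


# Recognition sites quoted in this section.
for name, site in [("EcoRI", "GAATTC"), ("BamHI", "GGATCC"),
                   ("HindIII", "AAGCTT"), ("SmaI", "CCCGGG"),
                   ("EcoRV", "GATATC")]:
    print(name, site, is_palindromic(site))
```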
The expected number of restriction sites in a random DNA sequence follows:
$$N_{\text{sites}} \approx \frac{L}{4^n}$$
where L is the sequence length and n is the recognition site length, assuming all four bases are equally frequent. A 6-cutter (n=6) cuts on average every 4,096 bp; a 4-cutter every 256 bp; an 8-cutter every 65,536 bp.
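As a quick numerical check of the cutting frequencies quoted above, the sketch below assumes a random sequence with equal base frequencies and independent positions, so a specific n-bp site occurs with probability 1/4^n at each position. The function names are illustrative.

```python
def expected_cut_spacing(site_length: int) -> int:
    """Average distance (bp) between occurrences of a specific n-bp site."""
    return 4 ** site_length


def expected_site_count(seq_length: int, site_length: int) -> float:
    """Expected number of sites in a random sequence of the given length."""
    return seq_length / 4 ** site_length


for n in (4, 6, 8):
    print(f"{n}-cutter: one site every {expected_cut_spacing(n):,} bp on average")

# Example: a 48,502 bp genome (the size of phage lambda) and a 6-cutter.
print(round(expected_site_count(48_502, 6), 1), "expected sites")
```

Real genomes deviate from this estimate because base composition is uneven and some motifs are under- or over-represented.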
1.2 Cloning Vectors
A vector is a DNA molecule used to carry foreign genetic material into a host cell for replication and/or expression. The choice of vector depends on the insert size, the host organism, and whether gene expression is required.
Plasmid Vectors (0.1-15 kb inserts)
The most commonly used vectors. Key features include:
- Origin of replication (ori): ColE1 ori (pUC series, high copy ~500-700/cell) or p15A ori (lower copy ~15-20/cell). Copy number affects yield and stability.
- Selectable markers: Antibiotic resistance genes (ampicillin/bla, kanamycin/kan, chloramphenicol/cat) allow selection of transformed cells.
- Multiple cloning site (MCS/polylinker): Cluster of unique restriction sites (typically 10-20) within a reporter gene for convenient insertion.
- pBR322: Classic plasmid (4,361 bp) with ampR and tetR markers. Insertional inactivation used for clone identification.
- pUC series: High-copy derivative with lacZ-alpha MCS enabling blue-white screening. Smaller size (~2.7 kb) improves transformation efficiency.
Large-Insert Vectors
Bacteriophage lambda (up to 23 kb)
Central "stuffer" fragment replaced with insert. Packaged into phage particles in vitro for high-efficiency infection. Replacement and insertion vectors available.
Cosmids (up to 45 kb)
Hybrid plasmid/phage vector containing cos sites for lambda packaging. Propagates as plasmid in E. coli. Used for genomic libraries.
BACs (up to 300 kb)
Based on F-factor. Single copy per cell ensures stability. Critical for Human Genome Project. Contains parA/B/C partition genes.
YACs (up to 2 Mb)
Artificial yeast chromosomes with telomeres, centromere, ARS (ori), and selectable markers. Propagated in S. cerevisiae. Prone to chimerism.
Expression Vectors
Designed not just to carry DNA but to actively transcribe and translate the insert into protein:
- T7 promoter system: Recognized by T7 RNA polymerase (supplied by DE3 lysogen in BL21(DE3) cells). Extremely strong, tightly regulated.
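- IPTG induction: Isopropyl-beta-D-thiogalactopyranoside mimics allolactose and relieves LacI repression of the lacUV5 promoter that drives T7 RNA polymerase expression in DE3 lysogens (and of the T7lac promoter present on many pET vectors). Typical induction: 0.1-1 mM IPTG.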
- Affinity tags: His-tag (6xHis, Ni-NTA purification), GST-tag (glutathione purification), MBP-tag (amylose purification). Tags can be N- or C-terminal with protease cleavage sites (TEV, thrombin).
- pET system: Industry standard. pET-28a provides His-tag, T7 promoter, kanamycin resistance. Various derivatives for different tag configurations.
1.3 Ligation and Transformation
T4 DNA Ligase
Catalyzes phosphodiester bond formation between 3'-OH and 5'-phosphate ends in an ATP-dependent reaction. Key considerations:
- Cohesive-end ligation: 100-1000x more efficient than blunt-end. Performed at 16 degrees C overnight or room temperature for 10 min (quick ligase).
- Blunt-end ligation: Requires 10-100x more ligase and higher DNA concentrations. PEG 4000 (5-15%) can enhance efficiency.
- Insert:vector ratio: Typically 3:1 to 5:1 molar ratio for optimal intermolecular ligation. Higher ratios favor concatemer formation.
- Vector dephosphorylation: CIP or SAP removes 5'-phosphates from vector to prevent self-ligation, dramatically reducing background.
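Because the insert:vector ratio is molar, the mass of insert to add depends on the relative lengths of the two fragments. The sketch below uses the common approximation that moles of dsDNA scale as mass divided by length; the function name and the example amounts are hypothetical.

```python
def insert_mass_ng(vector_ng: float, vector_bp: int, insert_bp: int,
                   molar_ratio: float = 3.0) -> float:
    """Mass of insert (ng) giving the requested insert:vector molar ratio.

    Assumes both fragments are double-stranded DNA, so that moles are
    proportional to mass / length.
    """
    if vector_bp <= 0 or insert_bp <= 0:
        raise ValueError("fragment lengths must be positive")
    return vector_ng * molar_ratio * insert_bp / vector_bp


# Hypothetical ligation: 50 ng of a 5,000 bp vector with a 1,000 bp insert.
for ratio in (1, 3, 5):
    print(f"{ratio}:1 -> {insert_mass_ng(50, 5000, 1000, ratio):.0f} ng insert")
```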
Blue-White Screening
A screening method based on alpha-complementation of beta-galactosidase:
- Vector carries lacZ-alpha gene fragment with MCS inside
- Host strain provides lacZ-omega (DH5-alpha, JM109)
- Insert disrupts lacZ-alpha reading frame
- Plates contain X-gal (chromogenic substrate) + IPTG (inducer)
- Blue colonies: No insert (functional beta-gal cleaves X-gal)
- White colonies: Contain insert (disrupted beta-gal)
Transformation Methods
Chemical Competence (CaCl2)
Divalent cations (Ca2+) neutralize DNA phosphate backbone charge, promoting DNA-membrane association. Heat shock (42 degrees C, 45-90 sec) creates transient pores. Efficiency: 10^6 - 10^8 CFU/microgram. TSS and Inoue methods further enhance competence.
Electroporation
Brief high-voltage pulses (1.8-2.5 kV, 5 ms) create transient pores in the membrane. 10-100x more efficient than chemical methods (up to 10^10 CFU/microgram). Essential for large DNA molecules and library construction. Requires desalted DNA.
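The efficiencies quoted above are expressed as colony-forming units per microgram of DNA. A minimal sketch of that calculation follows, assuming a known amount of plasmid was transformed and only a fraction of the recovery culture was plated; the function name and the example numbers are hypothetical.

```python
def transformation_efficiency(colonies: int, dna_ng: float,
                              fraction_plated: float) -> float:
    """Return transformation efficiency in CFU per microgram of DNA."""
    if dna_ng <= 0 or not 0 < fraction_plated <= 1:
        raise ValueError("dna_ng must be > 0 and fraction_plated in (0, 1]")
    # Micrograms of DNA represented by the cells that were actually plated.
    dna_plated_ug = (dna_ng / 1000.0) * fraction_plated
    return colonies / dna_plated_ug


# Hypothetical plate: 0.01 ng plasmid transformed, 10% of the culture plated,
# 200 colonies counted.
print(f"{transformation_efficiency(200, 0.01, 0.1):.1e} CFU/ug")
```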
2. Polymerase Chain Reaction (PCR)
2.1 Principle and Thermal Cycling
PCR is the single most transformative technique in molecular biology. Invented by Kary Mullis in 1983 (Nobel Prize 1993), it enables exponential amplification of a specific DNA target defined by two flanking oligonucleotide primers.
The Three-Step Thermal Cycle
1. Denaturation
Double-stranded DNA melts into single strands. Higher GC content requires higher temperatures. Initial denaturation: 2-5 min; cycle denaturation: 15-30 sec.
2. Annealing
Primers hybridize to complementary template strands. Temperature set 3-5 degrees C below primer Tm. Too low causes non-specific binding; too high reduces yield.
3. Extension
DNA polymerase synthesizes new strand from 3' end of each primer. Rate: ~1 kb/min (Taq) or ~0.5 kb/min (proofreading). Time depends on amplicon length.
Exponential Amplification
With perfect efficiency, each cycle doubles the number of target molecules:
$$N_n = N_0 \times 2^n$$
In practice, efficiency (E) is less than 100%, so the general amplification equation is:
$$N_n = N_0 \times (1 + E)^n$$
Starting from 1 template molecule with E=0.95 over 30 cycles: N = 1 x (1.95)^30 = 5.0 x 10^8 copies. With perfect efficiency: 2^30 = 1.07 x 10^9. The efficiency gap compounds exponentially: a 5% shortfall per cycle leaves roughly half the ideal yield after 30 cycles.
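The compounding effect of sub-ideal efficiency is easy to explore numerically. The sketch below assumes a constant per-cycle efficiency E (no plateau), which only holds in the exponential phase; `pcr_copies` is an illustrative name.

```python
def pcr_copies(n0: float, efficiency: float, cycles: int) -> float:
    """Copies after `cycles` rounds, assuming constant efficiency E in [0, 1]."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return n0 * (1.0 + efficiency) ** cycles


ideal = pcr_copies(1, 1.0, 30)
for e in (1.0, 0.95, 0.90, 0.80):
    n = pcr_copies(1, e, 30)
    print(f"E = {e:.2f}: {n:.2e} copies ({100 * n / ideal:.0f}% of ideal)")
```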
Derivation: PCR Amplification Kinetics
Starting from the biochemistry of a single PCR cycle, we derive the exponential amplification equation and its efficiency-corrected form.
Step 1: Model a single ideal PCR cycle
In one cycle, each double-stranded template denatures into two single strands, primers anneal, and polymerase extends to create two new duplexes. With perfect efficiency, one molecule becomes two:
$$N_1 = N_0 \times 2 \qquad \text{(one cycle, 100\% efficiency)}$$
Step 2: Extend to n cycles (ideal case)
Each cycle doubles the copy number independently. After n cycles:
$$N_n = N_0 \times 2^n \qquad \text{(ideal exponential amplification)}$$
For N_0 = 1 and n = 30: N = 2^30 = 1.07 × 10^9 copies (~1 billion-fold amplification).
Step 3: Introduce amplification efficiency E
In reality, not every template is copied in every cycle. Define efficiency E as the fraction of templates successfully replicated (0 ≤ E ≤ 1). Per cycle, each template produces (1 + E) copies:
$$N_1 = N_0 \times (1 + E) \qquad \text{where } E = 1 \text{ is perfect (doubling)}$$
Step 4: General amplification equation
Compounding over n cycles with constant efficiency:
$$N_n = N_0 \times (1 + E)^n$$
With E = 0.95 (typical for optimized reactions): (1.95)^30 = 5.0 × 10^8, about 47% of the ideal yield.
Step 5: Model efficiency decline (realistic PCR)
Efficiency decreases as primers deplete and products accumulate. A simple model uses a logistic correction:
$$E(n) = E_{\max} \times \frac{P_{\text{remaining}}}{P_{\text{total}}} \times \left(1 - \frac{N_n}{N_{\text{plateau}}}\right)$$
This produces the characteristic three-phase PCR curve: exponential phase (constant E), linear phase (declining E), and plateau phase (E → 0).
Step 6: Derive the qPCR Ct relationship
The threshold cycle C_t is the cycle at which N_n reaches a detection threshold N_T. Solving N_T = N_0(1+E)^{C_t} for C_t:
$$C_t = \frac{\ln(N_T/N_0)}{\ln(1+E)} \qquad C_t = \log_2(N_T) - \log_2(N_0) \;\;\text{(at } E = 1\text{)}$$
Each 10-fold difference in initial template shifts Ct by ~3.32 cycles (when E = 1). This linear relationship between Ct and log(N0) is the basis of qPCR quantification.
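The Ct relationship also works in reverse: the slope of a standard curve (Ct plotted against log10 of input copies) yields the amplification efficiency, because the slope equals -1/log10(1+E). The following is a small sketch of both directions; the function names are illustrative.

```python
import math


def ct_shift_per_tenfold(efficiency: float) -> float:
    """Cycles separating two samples that differ 10-fold in template."""
    return math.log(10) / math.log(1.0 + efficiency)


def efficiency_from_slope(slope: float) -> float:
    """Efficiency E from the slope of Ct versus log10(N0) (slope is negative)."""
    if slope >= 0:
        raise ValueError("a standard-curve slope must be negative")
    return 10 ** (-1.0 / slope) - 1.0


print(f"E = 1.0 -> {ct_shift_per_tenfold(1.0):.2f} cycles per 10-fold")
print(f"E = 0.9 -> {ct_shift_per_tenfold(0.9):.2f} cycles per 10-fold")
print(f"slope -3.32 -> E = {efficiency_from_slope(-3.32):.2f}")
```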
2.2 Components and Polymerases
DNA Polymerases
- Taq polymerase: From Thermus aquaticus. Thermostable (half-life ~40 min at 95 degrees C). No 3'-to-5' proofreading. Error rate: ~2 x 10^-4 per bp per cycle. Adds non-templated A to 3' ends (useful for TA cloning).
- Pfu polymerase: From Pyrococcus furiosus. 3'-to-5' exonuclease (proofreading). Error rate: ~1.3 x 10^-6. Slower extension. Blunt ends.
- Phusion polymerase: Engineered fusion of Pfu-like polymerase with processivity-enhancing domain. Error rate: ~4.4 x 10^-7. Fastest high-fidelity option. Industry standard for cloning.
- Q5 polymerase: NEB's high-fidelity polymerase. Error rate: ~5.3 x 10^-7. Excellent for long amplicons and GC-rich templates.
Other Key Components
- dNTPs: Typically 200 microM each (dATP, dCTP, dGTP, dTTP). Equimolar balance critical. Excess dNTPs chelate Mg2+ and reduce fidelity.
- Mg2+ concentration: 1.5-2.5 mM optimal. Cofactor for polymerase. Higher Mg2+ increases yield but reduces specificity and fidelity.
- Primers: 18-25 nt, 40-60% GC, Tm matched within 2 degrees C. Avoid 3'-end complementarity (primer dimers), repeats, and secondary structure.
- Buffer: Tris-HCl pH 8.3-8.8, KCl (50 mM stabilizes primer annealing). Some buffers include betaine or DMSO for GC-rich templates.
Primer Melting Temperature (Tm) Calculation
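Basic Wallace rule (for primers less than 20 nt):
$$T_m = 2\,^{\circ}\text{C} \times (A + T) + 4\,^{\circ}\text{C} \times (G + C)$$
Nearest-neighbor thermodynamic method (more accurate), written here in its usual form for non-self-complementary duplexes:
$$T_m = \frac{\Delta H}{\Delta S + R \ln(C_T/4)} - 273.15$$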
where Delta-H and Delta-S are enthalpy and entropy sums of nearest-neighbor dinucleotide pairs, R is the gas constant (1.987 cal/mol/K), and C_T is the total primer concentration. Salt correction: Delta-S(adjusted) = Delta-S + 0.368 x (n-1) x ln([Na+]).
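For short primers, the Wallace rule (2 degrees C per A or T, 4 degrees C per G or C) gives a rough estimate that is simple to compute. The sketch below implements only that rule; it does not attempt the nearest-neighbor method, which needs a full parameter table. `wallace_tm` is an illustrative name.

```python
def wallace_tm(primer: str) -> int:
    """Estimate Tm (degrees C) with the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("primer must contain only A, C, G and T")
    return 2 * at + 4 * gc


# Hypothetical 16-nt primer with 50% GC content.
print(wallace_tm("ATGCATGCATGCATGC"), "degrees C")
```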
2.3 PCR Variants
RT-PCR (Reverse Transcription PCR)
First converts RNA to cDNA using reverse transcriptase (MMLV, SuperScript), then amplifies by standard PCR. Two-step (separate RT and PCR) or one-step (single tube) protocols. Essential for gene expression analysis and RNA virus detection. Oligo(dT) primers for mRNA, random hexamers for total RNA, or gene-specific primers.
qPCR (Quantitative Real-Time PCR)
Monitors amplification in real-time via fluorescence. Two main chemistries:
- SYBR Green: Intercalating dye; binds any dsDNA. Cheap but non-specific (detects primer dimers). Melt curve analysis verifies single product.
- TaqMan probes: Dual-labeled hydrolysis probes (5'-FAM reporter, 3'-TAMRA quencher). Sequence-specific. Taq 5'-to-3' exonuclease releases reporter during extension.
qPCR Quantification
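The Ct (threshold cycle) is the cycle at which fluorescence crosses a set threshold. Relative quantification compares a target gene with a reference gene in treated and control samples:
$$\Delta C_t = C_t^{\text{target}} - C_t^{\text{reference}} \qquad \Delta\Delta C_t = \Delta C_t^{\text{treated}} - \Delta C_t^{\text{control}}$$
Fold change = 2^(-Delta-Delta-Ct). Assumes ~100% amplification efficiency. Standard curve method more robust for unequal efficiencies.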
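A worked example of the 2^(-Delta-Delta-Ct) calculation is sketched below. It assumes the target and reference assays both amplify at about 100% efficiency; the function name and the Ct values are hypothetical.

```python
def fold_change_ddct(ct_target_treated: float, ct_ref_treated: float,
                     ct_target_control: float, ct_ref_control: float) -> float:
    """Relative expression by the 2^(-ddCt) method (assumes E close to 100%)."""
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_treated - delta_control)


# Hypothetical data: the target crosses threshold 3 cycles earlier after
# treatment, while the reference gene is unchanged.
print(fold_change_ddct(22.0, 18.0, 25.0, 18.0))
```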
Other PCR Variants
- Digital PCR (dPCR): Partitions sample into thousands of nanoliter reactions. Absolute quantification by counting positive partitions (Poisson statistics). No standard curve needed.
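- Nested PCR: Two successive rounds; the second uses primers that anneal inside the first-round amplicon. Extreme sensitivity and specificity. Risk of carry-over contamination when transferring first-round product.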
- Overlap extension PCR: Joins/mutates sequences without restriction enzymes. Overlapping primers create chimeric products.
- Inverse PCR: Outward-facing primers amplify unknown flanking sequences from a known internal region.
- Multiplex PCR: Multiple primer pairs in one reaction. Used in pathogen panels, STR profiling, and genotyping.
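The Poisson step in digital PCR can be made concrete. If template molecules distribute randomly across partitions, the fraction of negative partitions is e^(-lambda), so the mean copies per partition is lambda = -ln(1 - positive/total). The sketch below assumes uniform partition volumes; the function names and the 1 nL partition volume are hypothetical.

```python
import math


def copies_per_partition(positive: int, total: int) -> float:
    """Mean template copies per partition from the positive fraction."""
    if total <= 0 or not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total (saturated runs are not quantifiable)")
    return -math.log(1.0 - positive / total)


def copies_per_microliter(positive: int, total: int,
                          partition_volume_nl: float) -> float:
    """Concentration in the partitioned reaction volume (copies per microliter)."""
    lam = copies_per_partition(positive, total)
    return lam / (partition_volume_nl * 1e-3)


# Hypothetical run: 6,321 of 10,000 partitions positive, 1 nL partitions.
print(f"lambda = {copies_per_partition(6321, 10000):.3f} copies/partition")
print(f"{copies_per_microliter(6321, 10000, 1.0):.0f} copies/uL")
```

The correction matters most at high occupancy, where many positive partitions contain more than one template molecule.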
3. DNA Sequencing
3.1 Sanger Sequencing (First Generation)
Developed by Frederick Sanger in 1977 (Nobel Prize 1980), dideoxy chain termination sequencing remains the gold standard for accuracy and is still widely used for single-gene sequencing, clone verification, and variant confirmation.
Principle: Dideoxy Chain Termination
The reaction contains normal dNTPs plus a small proportion of dideoxynucleotides (ddNTPs) lacking the 3'-OH group. When a ddNTP is incorporated, chain extension terminates at that position. The result is a nested set of fragments of every possible length, each terminated with a fluorescently labeled ddNTP that identifies the terminal base.
Template: 3'-TAGCCATGCGAT...-5'
Primer extension with dNTPs + ddNTPs:
Fragment 1: 5'-A* (terminated at pos 1, ddATP)
Fragment 2: 5'-AT* (terminated at pos 2, ddTTP)
Fragment 3: 5'-ATC* (terminated at pos 3, ddCTP)
Fragment 4: 5'-ATCG* (terminated at pos 4, ddGTP)
Fragment 5: 5'-ATCGG* (terminated at pos 5, ddGTP)
... and so on for each position
Modern Sanger Protocol
- Single reaction with 4 differently-colored fluorescent ddNTPs
- Capillary electrophoresis (CE) separates fragments by size
- Laser excitation and CCD detection of fluorescent labels
- Automated base calling by software (Phred quality scores)
- Read length: ~800-1000 bp per reaction
- Accuracy: >99.99% per base (Phred Q40+)
Phred Quality Score
$$Q = -10 \log_{10}(P_e)$$
where P_e is the probability of an incorrect base call. Q20 = 1% error rate, Q30 = 0.1% error, Q40 = 0.01% error. Modern Sanger routinely achieves Q40-Q60 for the middle portion of reads.
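The Phred scale is defined as Q = -10 log10(P_e), so converting between quality scores and error probabilities is a one-line calculation in each direction. A minimal sketch with illustrative function names:

```python
import math


def phred_from_error(p_error: float) -> float:
    """Phred quality score for a given base-call error probability."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Base-call error probability for a given Phred quality score."""
    return 10.0 ** (-q / 10.0)


for q in (20, 30, 40):
    print(f"Q{q}: error probability {error_from_phred(q):.4%}")
```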
Derivation: Sanger Sequencing Read Length from Poisson ddNTP Incorporation
Starting from the probability of ddNTP incorporation at each position, we derive the theoretical distribution of fragment lengths and the practical read length limit.
Step 1: Define the termination probability per position
At each template position, the polymerase incorporates either a normal dNTP (continuing synthesis) or a ddNTP (terminating the chain). The probability of termination depends on the ddNTP:dNTP ratio:
$$p = \frac{[\text{ddNTP}]}{[\text{ddNTP}] + [\text{dNTP}]} \qquad \text{(typically } p \approx 0.01\text{)}$$
Step 2: Model fragment length as a geometric distribution
The probability that a fragment terminates at exactly position k (chain reaches length k before termination) follows a geometric distribution:
$$P(\text{length} = k) = (1-p)^{k-1} \cdot p$$
Step 3: Calculate the expected fragment length
The mean fragment length from a geometric distribution is:
$$\langle L \rangle = \frac{1}{p} \qquad \text{For } p = 0.01: \;\langle L \rangle = 100\;\text{nt}$$
Step 4: Derive the signal intensity at position k
The number of fragments of length k out of N total molecules determines the fluorescence signal. The expected count at position k is:
$$n(k) = N \cdot p \cdot (1-p)^{k-1} = N \cdot p \cdot e^{-(k-1)\ln(1/(1-p))} \approx N \cdot p \cdot e^{-kp}$$
Step 5: Determine the practical read length limit
Base calling fails when the signal-to-noise ratio drops below a threshold. The signal decays exponentially while noise remains roughly constant. The read length Lmax where signal equals noise floor (nmin):
$$N \cdot p \cdot e^{-L_{\max} \cdot p} = n_{\min} \implies L_{\max} = \frac{1}{p}\ln\left(\frac{Np}{n_{\min}}\right)$$
Step 6: Optimize p for maximum read length
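Setting dL_max/dp = 0 gives ln(Np/n_min) = 1, so the signal-limited optimum is p = e·n_min/N, at which L_max = 1/p. With typical values (N = 10^6 template molecules, n_min = 10) this optimum lies near p ≈ 2.7 × 10^-5, a read length far beyond what electrophoresis can resolve. In practice p is therefore matched to the target read length so that the mean fragment length 1/p equals L_target:
$$p_{\text{opt}} \approx \frac{1}{L_{\text{target}}} \approx \frac{1}{1000} = 0.001$$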
Modern Sanger sequencing optimizes the ddNTP:dNTP ratio to achieve 800-1000 bp read lengths. Beyond this, electrophoretic resolution of fragments differing by 1 bp (at size >1000) becomes the limiting factor rather than the Poisson statistics.
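The geometric termination model from Steps 2 to 5 can be evaluated directly. The sketch below is an idealized calculation that assumes a single termination probability p at every position and a constant noise floor; the function names are illustrative.

```python
import math


def termination_fraction(k: int, p: float) -> float:
    """Fraction of chains that terminate at exactly position k."""
    return p * (1.0 - p) ** (k - 1)


def mean_fragment_length(p: float) -> float:
    """Mean of the geometric fragment-length distribution."""
    return 1.0 / p


def signal_limited_read_length(n_templates: float, p: float, n_min: float) -> float:
    """L_max = ln(N p / n_min) / p, the position where signal meets the noise floor."""
    ratio = n_templates * p / n_min
    if ratio <= 1.0:
        return 0.0  # signal never rises above the noise floor
    return math.log(ratio) / p


for p in (0.01, 0.001):
    print(f"p = {p}: mean length {mean_fragment_length(p):.0f} nt, "
          f"signal-limited L_max {signal_limited_read_length(1e6, p, 10):.0f} nt")
```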
3.2 Next-Generation Sequencing (NGS)
NGS technologies perform massively parallel sequencing of millions to billions of DNA fragments simultaneously, reducing cost per base by orders of magnitude compared to Sanger. The cost of sequencing a human genome dropped from ~$3 billion (2001) to under $200 (2024).
Illumina Sequencing (Sequencing by Synthesis)
Dominant NGS platform (~80% of all sequencing data generated worldwide).
- Library preparation: Fragment DNA, ligate adaptors, size-select.
- Bridge amplification: ssDNA binds flow cell oligos, bends to hybridize with adjacent oligo, polymerase creates complement. Repeated to form clonal clusters (~1000 copies each).
- Sequencing by synthesis: Fluorescently labeled reversible terminators added one at a time. After each incorporation: image, cleave fluorophore + 3'-blocking group, repeat.
- Output: Read length 2x75 to 2x300 bp (paired-end). NovaSeq 6000: ~6 Tb per run, 20 billion reads. NovaSeq X Plus: ~16 Tb.
Ion Torrent (Semiconductor Sequencing)
Detects H+ ions released during nucleotide incorporation. Each well contains a single clonal template on an ion-sensitive field-effect transistor (ISFET). Nucleotides flowed sequentially; pH change proportional to number of incorporated bases. No optics needed. Fast turnaround (~2 hours). Weakness: homopolymer errors (difficult to count >6 identical bases). Read length: ~200-400 bp. Ion GeneStudio S5: ~50 Gb per run.
3.3 Third-Generation Sequencing (Long-Read)
PacBio SMRT Sequencing
Single Molecule, Real-Time sequencing:
- Zero-mode waveguides (ZMWs): Nanophotonic chambers (~70 nm diameter) confining observation volume to ~20 zeptoliters. Only the fluorescent nucleotide being incorporated by the polymerase is observed.
- Mechanism: Single DNA polymerase molecule anchored at ZMW bottom. Fluorescent phospholinked dNTPs diffuse in, are held during incorporation (~ms), fluorophore detected, then released upon translocation.
- Read length: Average 15-20 kb (HiFi mode), up to 50+ kb. Circular consensus sequencing (CCS/HiFi): multiple passes around SMRTbell template yields Q30+ accuracy.
- Revio system: 25 million ZMWs, ~90 Gb HiFi data per SMRT Cell run (~360 Gb with 4 SMRT Cells run in parallel).
Oxford Nanopore Sequencing
Direct measurement of nucleic acids threading through protein nanopores:
- Protein pore: CsgG-derived nanopore embedded in synthetic membrane. Voltage applied across membrane drives ionic current through pore.
- Signal: As each ~5-mer passes through the constriction, it modulates ionic current in a characteristic way. Motor protein (helicase) controls translocation speed (~450 bp/sec).
- Read length: No theoretical limit. Routine reads 10-100 kb. Record: >4 Mb single read. Can sequence RNA directly without cDNA conversion.
- Platforms: MinION (portable USB device, ~50 Gb), PromethION (desktop, ~290 Gb per flow cell x 48 flow cells). Real-time basecalling enables adaptive sampling.
- Accuracy: Raw single-read ~95-98% (R10.4.1 chemistry). Consensus/duplex: >Q30.
Sequencing Error Rate Comparison
| Platform | Read Length | Raw Accuracy | Throughput/Run | Cost/Gb |
|---|---|---|---|---|
| Sanger (CE) | ~800 bp | 99.99% | ~0.8 kb | ~$500,000 |
| Illumina NovaSeq X | 2x150 bp | >99.9% (Q30+) | ~16 Tb | ~$2 |
| Ion Torrent S5 | ~200-400 bp | 99.5% | ~50 Gb | ~$50 |
| PacBio Revio (HiFi) | 15-20 kb | >99.9% (Q30+) | ~90 Gb | ~$10 |
| ONT PromethION | 10-100+ kb | ~95-99% | ~290 Gb | ~$5 |
3.4 Sequencing Applications
Genome & Transcriptome
- WGS: Whole genome sequencing for variant calling, de novo assembly, structural variant detection.
- WES: Whole exome sequencing (~1-2% of genome, ~22,000 genes). Cost-effective for clinical diagnostics.
- RNA-seq: Transcriptome profiling. Reveals expression levels (FPKM/TPM), alternative splicing, novel transcripts, gene fusions.
Epigenome & Chromatin
- ChIP-seq: Chromatin immunoprecipitation + sequencing maps protein-DNA interactions and histone modifications genome-wide.
- ATAC-seq: Assay for Transposase-Accessible Chromatin. Maps open chromatin regions using Tn5 transposase insertion.
- Bisulfite-seq: Maps DNA methylation (5mC) at single-base resolution by converting unmethylated C to U.
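The logic of bisulfite sequencing can be illustrated in silico: unmethylated cytosines are deaminated to uracil and read as T after amplification, while methylated cytosines still read as C. The sketch below is a toy model for a single strand with complete conversion; the function names and the example sequence are hypothetical.

```python
def bisulfite_convert(seq: str, methylated: set[int]) -> str:
    """Simulate the read obtained after bisulfite treatment and PCR.

    `methylated` holds 0-based positions of 5mC, which are protected.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")  # unmethylated C -> U, read as T
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference: str, read: str) -> list[int]:
    """Positions where a reference C is still read as C (inferred 5mC)."""
    return [i for i, (r, b) in enumerate(zip(reference.upper(), read.upper()))
            if r == "C" and b == "C"]


reference = "ACGTCCG"
read = bisulfite_convert(reference, methylated={1})
print(read, call_methylation(reference, read))
```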
4. Gel Electrophoresis
Separation by Size and Charge
Electrophoresis separates macromolecules based on their migration through a porous gel matrix under an applied electric field. The fundamental equation governing electrophoretic mobility:
$$\mu = \frac{v}{E} = \frac{q}{f}$$
where mu is electrophoretic mobility, v is velocity, E is electric field strength, q is net charge, and f is the frictional coefficient (depends on size and shape). For DNA, charge is uniformly negative (one phosphate per nucleotide), so separation is purely by size through the sieving matrix.
Agarose Gel (DNA/RNA)
- Concentration: 0.5-2% depending on fragment size
- 0.5% agarose: resolves 1-30 kb fragments
- 1.0% agarose: resolves 0.5-10 kb (most common)
- 2.0% agarose: resolves 0.1-2 kb fragments
- Staining: ethidium bromide (intercalator, UV fluorescent) or safer alternatives (SYBR Safe, GelRed)
- Buffer: TAE (Tris-Acetate-EDTA) or TBE (Tris-Borate-EDTA)
- Migration: log(mobility) decreases linearly with agarose concentration (Ferguson plot)
PAGE (Polyacrylamide Gel)
- SDS-PAGE (denaturing): Sodium dodecyl sulfate denatures proteins and confers uniform negative charge proportional to mass (~1.4 g SDS/g protein). Separation by molecular weight alone.
- Native PAGE: Separates by charge, size, and shape. Preserves protein activity and quaternary structure.
- Acrylamide concentration: 6-15% (higher = better resolution of small proteins)
- Staining: Coomassie blue (~10 ng sensitivity) or silver stain (~0.1 ng sensitivity)
Ferguson Plot Analysis
The relationship between electrophoretic mobility and gel concentration:
$$\log(\mu) = \log(\mu_0) - K_R \cdot T$$
where mu_0 is the free mobility (zero gel concentration), K_R is the retardation coefficient (proportional to molecular size/Stokes radius), and T is the gel concentration. Plotting log(mu) vs. T for different species yields straight lines whose slopes indicate relative size.
Derivation: Ferguson Plot (Gel Electrophoresis Mobility)
Starting from the Ogston sieving model, we derive the log-linear relationship between electrophoretic mobility and gel concentration (the Ferguson plot).
Step 1: Define electrophoretic mobility
A charged macromolecule in an electric field experiences an electrostatic force balanced by friction:
$$qE = fv \implies \mu = \frac{v}{E} = \frac{q}{f}$$
where q is net charge, E is electric field, f is frictional coefficient, and v is migration velocity.
Step 2: Free solution mobility (μ_0)
In free solution (no gel), DNA has uniform charge density (~2 phosphates per bp, each with charge −e). The free mobility μ_0 is nearly independent of DNA size because charge and friction both scale linearly with length:
$$\mu_0 = \frac{q}{f} = \frac{N \cdot q_{\text{bp}}}{N \cdot f_{\text{bp}}} = \frac{q_{\text{bp}}}{f_{\text{bp}}} \approx 3.8 \times 10^{-4}\;\text{cm}^2/(\text{V}\cdot\text{s})$$
Step 3: Ogston sieving model (gel as a random fiber network)
The gel matrix creates pores of various sizes. The probability that a spherical particle of radius Rs fits through a pore in a random network of fibers at concentration T (g/100 mL) is (Ogston, 1958):
$$f_{\text{accessible}} = e^{-\pi (R_s + r_f)^2 \cdot \ell \cdot T} = e^{-K_R \cdot T}$$
where r_f is the fiber radius, ℓ is the fiber length per unit concentration, and K_R = π(R_s + r_f)^2 ℓ is the retardation coefficient.
Step 4: Derive the Ferguson equation
The effective mobility in the gel is the free mobility multiplied by the fraction of accessible pore space:
$$\mu = \mu_0 \cdot e^{-K_R \cdot T} \implies \log(\mu) = \log(\mu_0) - K_R \cdot T$$
Step 5: Interpret the Ferguson plot
Plotting log(μ) vs. gel concentration T for different molecules yields straight lines. The slope (−K_R) depends on molecular size, while the y-intercept gives log(μ_0), the free mobility:
$$K_R \propto R_s^2 \propto M_w^{2/3} \qquad \text{(for globular proteins)}$$
$$K_R \propto L \qquad \text{(for rod-like DNA, where } L \text{ is contour length)}$$
Step 6: SDS-PAGE size calibration
In SDS-PAGE, all proteins have the same charge-to-mass ratio (due to SDS coating), so μ_0 is constant. The mobility depends only on K_R (molecular sieving). At fixed gel concentration, log(μ) is approximately linear with log(M_w):
$$\log(\mu) = a - b \cdot \log(M_w) \qquad \text{(basis for MW determination by SDS-PAGE)}$$
This is why plotting log(Mw) of protein standards vs. relative mobility (Rf) yields a linear calibration curve for determining the molecular weight of unknown proteins.
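The calibration described above amounts to a straight-line fit of log10(M_w) against relative mobility, followed by interpolation for the unknown band. The sketch below uses an ordinary least-squares fit in plain Python; the ladder values are hypothetical illustration data, not measurements, and the function names are illustrative.

```python
import math


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_mw_kda(rf: float, slope: float, intercept: float) -> float:
    """Molecular weight (kDa) predicted from relative mobility Rf."""
    return 10 ** (intercept + slope * rf)


# Hypothetical ladder: (molecular weight in kDa, relative mobility Rf).
ladder = [(250, 0.10), (100, 0.30), (50, 0.48), (25, 0.70), (10, 0.95)]
slope, intercept = fit_line([rf for _, rf in ladder],
                            [math.log10(mw) for mw, _ in ladder])
print(f"unknown band at Rf 0.60 is about {estimate_mw_kda(0.60, slope, intercept):.0f} kDa")
```

The linear relationship holds only over a limited size range for a given acrylamide percentage, so unknowns should fall between the ladder bands used for the fit.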
Pulsed-Field Gel Electrophoresis (PFGE)
Standard agarose gels cannot resolve DNA fragments larger than ~50 kb (co-migration). PFGE alternates the electric field direction at defined intervals, forcing large DNA molecules to reorient before migrating. Resolves fragments from 50 kb to >10 Mb. Used for bacterial strain typing, chromosome-level karyotyping, and restriction mapping of large genomes. Switch times determine resolution range.
5. Blotting Techniques
Blotting techniques transfer separated macromolecules from a gel to a membrane (nitrocellulose or PVDF) for detection with specific probes or antibodies. Named for their targets:
Southern Blot (DNA)
Developed by Edwin Southern (1975).
- Digest genomic DNA with restriction enzyme
- Separate fragments by agarose gel electrophoresis
- Denature DNA in gel (NaOH)
- Transfer (capillary/vacuum) to nylon membrane
- Hybridize with labeled probe (radioactive or DIG-labeled)
- Wash at appropriate stringency, detect signal
Applications: RFLP analysis, gene copy number, transgene verification, viral integration mapping.
Northern Blot (RNA)
Named as a play on "Southern."
- Extract total RNA, separate on denaturing agarose gel (formaldehyde)
- Transfer to membrane
- Hybridize with complementary probe
- Detect specific mRNA species
Advantages over RT-qPCR: reveals transcript size, detects alternative splicing, shows degradation. Disadvantage: less sensitive, requires more RNA.
Western Blot (Protein)
Detects proteins with specific antibodies.
- Separate proteins by SDS-PAGE
- Transfer (wet/semi-dry) to PVDF or nitrocellulose
- Block non-specific binding (BSA or milk)
- Incubate with primary antibody (target-specific)
- Wash, add secondary antibody (HRP or AP conjugated)
- Detect by chemiluminescence (ECL), colorimetric, or fluorescence
Semi-quantitative. Loading controls: beta-actin, GAPDH, total protein stain.
Hybridization Stringency
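The specificity of probe-target binding is controlled by wash stringency. Higher temperature and lower salt concentration increase stringency (more mismatches disrupted). The Tm of a probe-target hybrid depends on salt concentration, GC content, formamide concentration, and hybrid length; a widely used empirical estimate for long DNA:DNA hybrids is:
$$T_m = 81.5 + 16.6\log_{10}[\text{Na}^+] + 0.41(\%\text{GC}) - 0.61(\%\text{formamide}) - \frac{500}{L}$$
where [Na+] is the molar monovalent cation concentration and L is the hybrid length in bp.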
High stringency washes (0.1x SSC, 65 degrees C) allow only perfectly matched hybrids. Low stringency (2x SSC, 42 degrees C) permits cross-species hybridization for homolog detection.
6. CRISPR-Cas9 Gene Editing
6.1 Mechanism
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) with Cas9 nuclease is the most revolutionary genetic tool since PCR. Adapted from a bacterial adaptive immune system, it enables precise, programmable DNA cleavage guided by a short RNA sequence. Doudna and Charpentier received the 2020 Nobel Prize for its development.
Single Guide RNA (sgRNA) Design
The sgRNA is a synthetic fusion of two natural RNA components:
- Spacer (20 nt): Complementary to the target DNA. Determines specificity. Mismatches at the PAM-proximal end (seed region, positions 1-12) are less tolerated.
- Scaffold (~80 nt): Derived from tracrRNA. Forms the stem-loops that bind Cas9. Invariant structure required for RNP assembly.
Target DNA: 5'-...NNNNNNNNNNNNNNNNNNNN-NGG...-3' (non-target strand: 20-nt protospacer + PAM)
            3'-...NNNNNNNNNNNNNNNNNNNN-NCC...-5' (target strand)
                  ||||||||||||||||||||
sgRNA:         5'-NNNNNNNNNNNNNNNNNNNN-[scaffold]-3' (spacer pairs with the target strand)
Cas9 cleaves both strands 3 bp upstream of PAM
Cas9 PAM Recognition and R-loop Formation
SpCas9 (from Streptococcus pyogenes) requires a 5'-NGG-3' PAM (protospacer adjacent motif) immediately 3' of the target sequence on the non-target strand. The mechanism proceeds:
- Cas9-sgRNA complex scans DNA, recognizing PAM via the PAM-interacting domain (PID)
- Local DNA unwinding creates an R-loop: sgRNA displaces the non-target strand and base-pairs with the target strand
- Full R-loop formation triggers conformational change activating both nuclease domains
- RuvC domain cleaves non-target strand; HNH domain cleaves target strand
- Blunt-ended DSB created 3 bp upstream of PAM
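The targeting rules above (a 20-nt protospacer followed by an NGG PAM, with the cut 3 bp upstream of the PAM) translate into a simple sequence scan. The sketch below searches only the strand it is given; sites on the opposite strand would be found by scanning the reverse complement. The function name and the returned dictionary keys are illustrative.

```python
import re

# Lookahead so that overlapping candidate sites are all reported.
SITE_PATTERN = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")


def find_spcas9_sites(seq: str) -> list[dict]:
    """Return candidate SpCas9 sites (protospacer + NGG) on the given strand.

    `cut_index` is the 0-based index of the first base 3' of the blunt cut,
    i.e. the break lies between protospacer positions 17 and 18.
    """
    sites = []
    for match in SITE_PATTERN.finditer(seq.upper()):
        start = match.start()
        sites.append({
            "start": start,
            "protospacer": match.group(1),
            "pam": match.group(2),
            "cut_index": start + 17,
        })
    return sites


# Hypothetical sequence containing one site.
example = "TTGACGTCATGCAGTCAGCTAGCATAGGTT"
for site in find_spcas9_sites(example):
    print(site)
```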
CRISPR Efficiency Estimation
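On-target efficiency depends on guide RNA design. Sequence-based scoring models such as Doench et al. (2016) Rule Set 2 predict cutting efficiency from guide features. Rule Set 2 itself is a gradient-boosted regression tree model; a simplified logistic form that conveys the idea is:
$$\text{Score} = \sigma\left(w_0 + \sum_i w_i x_i + \sum_{i<j} w_{ij}\, x_i x_j\right)$$
where sigma is the logistic function, x_i are position-specific nucleotide features, w are learned weights, and the interaction terms capture dinucleotide effects. GC content between 40-70%, absence of poly-T tracts (Pol III terminator), and specific nucleotide preferences at positions -4 to +3 relative to the cut site all contribute to efficiency.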
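Two of the design heuristics mentioned above, GC content within 40-70% and no poly-T tract, are easy to apply as a pre-filter before any learned score is computed. The sketch below is not a scoring model; treating four or more consecutive T's as a Pol III terminator is an assumption made here, and the function names are illustrative.

```python
def gc_fraction(spacer: str) -> float:
    """Fraction of G and C bases in the spacer."""
    seq = spacer.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_basic_filters(spacer: str, min_gc: float = 0.40,
                         max_gc: float = 0.70, poly_t_length: int = 4) -> bool:
    """Apply the GC-content and poly-T heuristics to a candidate spacer.

    poly_t_length = 4 is an assumed cutoff for a Pol III terminator-like tract.
    """
    seq = spacer.upper()
    if "T" * poly_t_length in seq:
        return False
    return min_gc <= gc_fraction(seq) <= max_gc


# Hypothetical candidate spacers.
for spacer in ("GACGTCATGCAGTCAGCTAG", "GACGTTTTGCAGTCAGCTAG", "ATATATATATATATATATAT"):
    print(spacer, f"GC={gc_fraction(spacer):.0%}", passes_basic_filters(spacer))
```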
Derivation: CRISPR Guide RNA Binding Thermodynamics
Starting from the thermodynamics of RNA:DNA hybridization, we derive the free energy of Cas9-sgRNA binding to its target, explaining the seed region requirement and PAM specificity.
Step 1: Decompose the total binding free energy
The overall binding free energy of the Cas9-sgRNA complex to target DNA has four main components:
$$\Delta G_{\text{binding}} = \Delta G_{\text{PAM}} + \Delta G_{\text{seed}} + \Delta G_{\text{non-seed}} + \Delta G_{\text{DNA unwinding}}$$
Step 2: PAM recognition energy
Cas9 first scans for the PAM (5'-NGG-3') through protein-DNA contacts (not base pairing). The PAM-interacting domain (PID) makes sequence-specific hydrogen bonds with the GG dinucleotide in the major groove:
$$\Delta G_{\text{PAM}} \approx -3\;\text{to}\;-5\;\text{kcal/mol} \qquad \text{(protein-DNA H-bonds)}$$
Step 3: Seed region hybridization (positions 1-12)
After PAM recognition, R-loop formation initiates at the PAM-proximal "seed" region. Using nearest-neighbor RNA:DNA hybridization parameters, each matched base pair contributes:
$$\Delta G_{\text{seed}} = \sum_{i=1}^{12} \Delta G_{\text{nn},i} \approx 12 \times (-1.5) = -18\;\text{kcal/mol}$$
Step 4: Cost of DNA unwinding
R-loop formation requires melting the target DNA duplex. The cost of unwinding 20 bp of dsDNA (to be replaced by the RNA:DNA hybrid) is:
$$\Delta G_{\text{unwinding}} = +\sum_{i=1}^{20}\Delta G_{\text{DNA:DNA},i} \approx +20 \times 1.5 = +30\;\text{kcal/mol}$$
Step 5: Non-seed region and full R-loop
Positions 13–20 (PAM-distal) contribute additional RNA:DNA base pairing. Mismatches here are better tolerated because the seed region has already committed the complex:
$$\Delta G_{\text{non-seed}} \approx 8 \times (-1.5) = -12\;\text{kcal/mol}$$
Step 6: Net binding energy and mismatch tolerance
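The total free energy for a perfectly matched target adds a conformational term (Delta G_conformational) to the terms from Step 1:
$$\Delta G_{\text{total}} = (-4) + (-18) + (-12) + (+30) + \Delta G_{\text{conformational}} \approx -12\;\text{kcal/mol}$$
The four explicit terms sum to -4 kcal/mol, so the quoted total corresponds to a conformational contribution of about -8 kcal/mol.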
Each mismatch in the seed region costs ~+3 kcal/mol (loss of one base pair plus disruption of stacking), while mismatches in positions 15–20 cost only ~+1.5 kcal/mol. This explains why seed mismatches are far more detrimental to on-target efficiency (~100× reduction per mismatch) than PAM-distal mismatches (~3× per mismatch), and why off-target prediction algorithms weight seed positions more heavily.
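The bookkeeping in Steps 1-6 can be checked numerically. The following is a minimal sketch using the round numbers from the derivation. The conformational term of -8 kcal/mol is an assumption inferred from the quoted total, and converting a free-energy penalty into a fold-change with a simple Boltzmann factor at 37 °C assumes equilibrium binding, which real Cas9 kinetics need not obey.

```python
import math

# Gas constant (kcal/mol/K) times temperature (310.15 K = 37 C)
RT_KCAL = 0.001987 * 310.15

def binding_energy(seed_mismatches: int = 0,
                   distal_mismatches: int = 0,
                   dg_conformational: float = -8.0) -> float:
    """Total binding free energy (kcal/mol) from the round numbers in the text."""
    dg_pam = -4.0               # protein-DNA contacts at the PAM
    dg_seed = 12 * -1.5         # positions 1-12
    dg_non_seed = 8 * -1.5      # positions 13-20
    dg_unwinding = 20 * 1.5     # cost of melting the DNA duplex
    # Mismatch penalties quoted in the text: +3 (seed), +1.5 (PAM-distal)
    penalty = 3.0 * seed_mismatches + 1.5 * distal_mismatches
    return (dg_pam + dg_seed + dg_non_seed + dg_unwinding
            + dg_conformational + penalty)

def fold_reduction(ddg: float) -> float:
    """Equilibrium fold-reduction in occupancy for a penalty ddg (kcal/mol)."""
    return math.exp(ddg / RT_KCAL)

if __name__ == "__main__":
    print(f"perfect match: {binding_energy():.1f} kcal/mol")
    print(f"one seed mismatch: {binding_energy(seed_mismatches=1):.1f} kcal/mol")
    print(f"fold reduction for +3.0 kcal/mol: {fold_reduction(3.0):.0f}")
    print(f"fold reduction for +1.5 kcal/mol: {fold_reduction(1.5):.0f}")
```

Under this equilibrium assumption a +3 kcal/mol penalty gives a reduction on the order of 100-fold, in line with the seed figure quoted above, while +1.5 kcal/mol gives roughly a tenfold reduction, somewhat larger than the ~3× quoted for PAM-distal mismatches. That gap suggests the quoted fold-changes are empirical rather than purely thermodynamic.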
6.2 Repair Outcomes
NHEJ (Non-Homologous End Joining)
Error-prone repair that directly ligates broken ends. Active throughout the cell cycle.
- Produces insertions and deletions (indels) at the cut site
- Frameshift mutations typically destroy gene function (knockout)
- Efficiency: 20-80% of alleles edited depending on guide and cell type
- Detected by T7 Endonuclease I assay, TIDE analysis, or amplicon sequencing
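Whether an NHEJ indel knocks out a gene depends largely on whether it shifts the reading frame. A minimal sketch of that classification, assuming indels are summarized as a net length change in base pairs (the function names and input format are illustrative, not those of TIDE or any amplicon-sequencing tool):

```python
def classify_indel(length_change: int) -> str:
    """Classify a net length change (bp): negative = deletion, positive = insertion."""
    if length_change == 0:
        return "no indel"
    # A multiple of 3 preserves the reading frame; anything else shifts it.
    return "in-frame" if length_change % 3 == 0 else "frameshift"

def frameshift_fraction(length_changes) -> float:
    """Fraction of edited alleles (non-zero length change) that are frameshifts."""
    indels = [d for d in length_changes if d != 0]
    if not indels:
        return 0.0
    return sum(d % 3 != 0 for d in indels) / len(indels)
```

For example, `frameshift_fraction([-1, 2, -3, 0, 6])` ignores the unedited allele and counts two frameshifts among four indels.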
HDR (Homology-Directed Repair)
Template-guided precise repair. Requires a donor DNA with homology arms.
- Restricted to S/G2 phase (requires sister chromatid or exogenous template)
- Efficiency: typically 1-30% (much lower than NHEJ)
- Donor templates: ssODN (for small edits, 60-200 nt) or plasmid (for large knock-ins, 500+ bp homology arms)
- Enhanced by NHEJ inhibitors (SCR7, NU7441) or cell cycle synchronization
Base Editing
Precise single-nucleotide changes without DSBs:
- CBE (Cytosine Base Editor): Cas9 nickase (D10A) fused to cytidine deaminase (APOBEC1) + UGI. Converts C-to-T (or G-to-A on opposite strand) within a ~5 nt editing window.
- ABE (Adenine Base Editor): Cas9 nickase fused to evolved TadA adenosine deaminase. Converts A-to-G (T-to-C on opposite strand). No natural enzyme existed; created by directed evolution.
- Efficiency: 20-80% at target positions. Lower off-target and indel rates than nuclease Cas9.
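The editing-window idea can be made concrete with a short sketch. It assumes the window spans protospacer positions 4-8, counted from the PAM-distal 5' end (a commonly cited choice, used here as an assumption), and it idealizes editing by converting every eligible base in the window, whereas real editors convert each position with some probability.

```python
def editable_positions(protospacer: str, base: str = "C", window=(4, 8)):
    """1-based positions inside the window that carry the target base."""
    start, end = window
    return [pos for pos, nt in enumerate(protospacer.upper(), start=1)
            if start <= pos <= end and nt == base]

def base_edit(protospacer: str, source: str = "C", target: str = "T",
              window=(4, 8)) -> str:
    """Idealized outcome: every `source` base in the window becomes `target`."""
    hits = set(editable_positions(protospacer, source, window))
    return "".join(target if pos in hits else nt
                   for pos, nt in enumerate(protospacer.upper(), start=1))

if __name__ == "__main__":
    guide = "ACGCCCTCAGGCATGCATGC"
    print(editable_positions(guide))         # candidate (and bystander) C's
    print(base_edit(guide))                  # CBE: C-to-T
    print(base_edit(guide, "A", "G"))        # ABE: A-to-G
```

Listing every editable position is useful in practice because any additional C (or A) inside the window is a potential bystander edit.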
Prime Editing
"Search-and-replace" editing enabling all 12 transition/transversion mutations, small insertions, and deletions without DSBs or donor template:
- pegRNA: Extended guide RNA with a primer binding site (PBS) and RT template encoding the desired edit.
- PE2: Cas9 nickase (H840A) fused to M-MLV reverse transcriptase. Nicks the PAM-containing (non-target) strand; the RT then copies the edit from the pegRNA into genomic DNA.
- PE3/PE3b: Additional nicking guide for the non-edited strand encourages mismatch repair to favor the edited strand. PE3b nicks only after editing, reducing indels.
- Lower efficiency than base editing (~5-50%) but greater versatility.
6.3 Applications
Basic Research
- Gene knockout libraries (genome-wide screens)
- CRISPRi/CRISPRa (dCas9 for transcription repression/activation)
- Lineage tracing (evolving barcodes)
- Disease model generation (mouse, organoid, iPSC)
Gene Therapy
- Sickle cell disease: Casgevy (exa-cel) - first approved CRISPR therapy (2023). Edits BCL11A enhancer in HSCs to reactivate fetal hemoglobin.
- CAR-T cells: CRISPR knockout of PD-1, TRAC, B2M for universal allogeneic CAR-T. Reduced graft-vs-host disease.
- In vivo editing: Liver-targeted LNP delivery for transthyretin amyloidosis (NTLA-2001, Intellia).
Gene Drives
- Super-Mendelian inheritance (>50% transmission)
- CRISPR-based gene drive converts heterozygotes to homozygotes
- Potential: suppress malaria mosquitoes (Anopheles), invasive species control
- Ethical concerns: irreversibility, ecosystem effects, biosecurity
- Daisy-chain drives proposed as self-limiting safeguard
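Super-Mendelian inheritance can be illustrated with a deterministic allele-frequency recursion. This is a minimal sketch under strong assumptions: random mating, an infinite population, no fitness cost, no resistance alleles, and a single `conversion` parameter (hypothetical name) for the fraction of heterozygote germlines converted to drive homozygotes.

```python
def drive_frequency_trajectory(q0: float, conversion: float, generations: int):
    """Drive-allele frequency over time.

    A heterozygote transmits the drive with probability (1 + conversion) / 2,
    so conversion = 0 recovers Mendelian inheritance (50% transmission).
    """
    q = q0
    trajectory = [q]
    for _ in range(generations):
        # Homozygotes contribute q^2; heterozygotes (frequency 2q(1-q))
        # transmit the drive at rate (1 + conversion) / 2.
        q = q * q + q * (1.0 - q) * (1.0 + conversion)
        trajectory.append(q)
    return trajectory

if __name__ == "__main__":
    for c in (0.0, 0.5, 1.0):
        final = drive_frequency_trajectory(0.1, c, 10)[-1]
        print(f"conversion={c:.1f}: frequency after 10 generations = {final:.3f}")
```

With no conversion the frequency stays where it started, while any positive conversion rate pushes the allele toward fixation in this idealized model, which is the behavior behind both the promise and the irreversibility concern listed above.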
7. Protein Analysis and Structural Biology
7.1 Mass Spectrometry
Mass spectrometry (MS) is the premier technology for protein identification, quantification, and post-translational modification mapping. It measures the mass-to-charge ratio (m/z) of ionized molecules with extraordinary precision.
Ionization Methods
- MALDI-TOF: Matrix-Assisted Laser Desorption/Ionization with Time-of-Flight analyzer. Sample co-crystallized with matrix (sinapinic acid, CHCA). Pulsed laser causes matrix ablation and analyte ionization. Predominantly singly-charged ions. Fast, tolerant of salts/detergents. Used for peptide mass fingerprinting, intact protein mass, microbial identification (MALDI Biotyper).
- ESI: Electrospray Ionization. Continuous flow from liquid chromatography. Produces multiply-charged ions enabling analysis of very large proteins. Coupled to various analyzers (quadrupole, Orbitrap, TOF).
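Because ESI produces a ladder of charge states, the neutral mass can be recovered from two adjacent peaks. The sketch below assumes positive-mode ions formed purely by protonation, so that m/z = (M + z·m_p)/z; the function names are illustrative.

```python
PROTON_MASS = 1.007276  # Da

def mz(neutral_mass: float, charge: int) -> float:
    """m/z of an [M + zH]^z+ ion."""
    return (neutral_mass + charge * PROTON_MASS) / charge

def mass_from_adjacent_peaks(mz_high: float, mz_low: float):
    """Infer charge and neutral mass from two adjacent charge-state peaks.

    `mz_high` carries charge z and `mz_low` carries charge z + 1.
    Solving the two m/z equations gives z = (mz_low - m_p) / (mz_high - mz_low).
    """
    z = round((mz_low - PROTON_MASS) / (mz_high - mz_low))
    mass = z * (mz_high - PROTON_MASS)
    return z, mass

if __name__ == "__main__":
    protein_mass = 16951.5  # Da, an arbitrary example value
    peaks = [mz(protein_mass, z) for z in (10, 11)]
    print(peaks)
    print(mass_from_adjacent_peaks(*peaks))
```

This is why multiple charging lets an analyzer with a limited m/z range measure proteins far heavier than that range would suggest.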
Tandem MS (MS/MS)
Two stages of mass analysis for protein sequencing:
- First MS (MS1): Measures intact peptide ions (precursor ions)
- Fragmentation: Collision-induced dissociation (CID), higher-energy collisional dissociation (HCD), or electron-transfer dissociation (ETD) breaks peptide bonds
- Second MS (MS2): Measures fragment ions (b/y ion series)
- Database search (Mascot, MaxQuant, Proteome Discoverer) matches spectra to peptide sequences
Data-independent acquisition (DIA) fragments all peptides in defined m/z windows, enabling deeper and more reproducible proteome coverage than traditional data-dependent acquisition (DDA).
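The b/y ion series that a database search matches against can be computed directly from a peptide sequence. A minimal sketch, assuming singly charged fragments, unmodified residues, and standard monoisotopic residue masses; it is not how any particular search engine is implemented.

```python
# Monoisotopic residue masses (Da)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

def fragment_ions(peptide: str):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_i holds the first i residues (N-terminal fragment);
    y_i holds the last i residues plus water (C-terminal fragment).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide.upper()]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions

if __name__ == "__main__":
    b, y = fragment_ions("PEPTIDE")
    for i, (b_mz, y_mz) in enumerate(zip(b, y), start=1):
        print(f"b{i} = {b_mz:.4f}    y{i} = {y_mz:.4f}")
```

A useful sanity check is that each complementary pair, b_i and y_(n-i), sums to the neutral peptide mass plus two protons.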
7.2 Structural Biology Methods
X-ray Crystallography
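The workhorse of structural biology (>85% of all PDB structures). Requires protein crystallization, which remains the major bottleneck. X-rays (wavelength ~1 Angstrom) diffract off the electron cloud of atoms in the crystal lattice. Bragg's law governs diffraction:
$$n\lambda = 2d\sin\theta$$
where n is the diffraction order, lambda is the X-ray wavelength, d is the spacing between lattice planes, and theta is the angle between the incident beam and those planes.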
The "phase problem" (diffraction patterns record intensities but not phases) is solved by molecular replacement, isomorphous replacement (heavy atom derivatives), or anomalous dispersion (SAD/MAD). Resolution: typically 1.5-3.0 Angstrom. Synchrotron radiation and micro-focus beams enable structure determination from crystals as small as 5 micrometers.
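Bragg's law, n·lambda = 2·d·sin(theta), links the quoted resolution to a diffraction angle. A small sketch of the conversion in both directions (function names are illustrative; wavelength and spacing are in the same length unit, typically Angstrom):

```python
import math

def bragg_angle_deg(wavelength: float, d_spacing: float, order: int = 1) -> float:
    """Diffraction angle theta (degrees) for lattice spacing d."""
    s = order * wavelength / (2.0 * d_spacing)
    if not 0.0 < s <= 1.0:
        raise ValueError("no diffraction possible: n * wavelength exceeds 2d")
    return math.degrees(math.asin(s))

def bragg_d_spacing(wavelength: float, theta_deg: float, order: int = 1) -> float:
    """Lattice spacing d that diffracts at angle theta (degrees)."""
    return order * wavelength / (2.0 * math.sin(math.radians(theta_deg)))

if __name__ == "__main__":
    # 1 Angstrom X-rays and a 2 Angstrom spacing
    theta = bragg_angle_deg(1.0, 2.0)
    print(f"theta = {theta:.2f} degrees")
    print(f"d = {bragg_d_spacing(1.0, theta):.2f} Angstrom")
```

Smaller spacings diffract to larger angles, which is why high-resolution reflections appear at the outer edge of the detector.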
Cryo-Electron Microscopy (Cryo-EM)
The "resolution revolution" (Nobel Prize 2017 to Dubochet, Frank, Henderson). No crystallization required.
- Single-particle analysis: Purified protein flash-frozen in vitreous ice. TEM images thousands of individual particles in random orientations. 2D class averaging and 3D reconstruction yield near-atomic resolution structures.
- Resolution: Now routinely reaches 2-3 Angstrom, with exceptional cases below 1.5 Angstrom. Direct electron detectors (K3, Falcon 4) and energy filters were transformative.
- Advantages: Works with heterogeneous samples, can resolve multiple conformational states, requires less protein (~1 microgram), no crystal packing artifacts.
- Cryo-ET: Cryo-electron tomography images proteins in situ within cells. Tilt series reconstructed into 3D tomograms. Sub-tomogram averaging for molecular resolution.
NMR Spectroscopy
Nuclear Magnetic Resonance determines structure in solution, capturing dynamics and conformational exchange. Proteins must be isotopically labeled (15N, 13C). HSQC (Heteronuclear Single Quantum Coherence) spectra give one peak per amide N-H, fingerprinting the protein. NOE (Nuclear Overhauser Effect) provides distance restraints (<5 Angstrom). Limited to proteins <~40 kDa (TROSY extends to ~100 kDa). Uniquely reveals microsecond-millisecond dynamics, binding kinetics, and intrinsically disordered regions.
Circular Dichroism (CD)
Measures differential absorption of left- and right-circularly polarized UV light. Secondary structure elements have characteristic CD spectra: alpha-helices show minima at 208 and 222 nm; beta-sheets at 218 nm. Quick assessment of secondary structure content (~5 min), folding state, thermal stability (Tm determination by monitoring 222 nm ellipticity vs. temperature), and conformational changes upon ligand binding. Requires only ~0.1-0.5 mg protein.
8. Modern Omics Technologies
8.1 The Omics Landscape
Modern biology increasingly takes a systems-level view, profiling entire classes of biomolecules simultaneously rather than studying individual genes or proteins in isolation. These "omics" approaches generate massive datasets that require bioinformatics and machine learning for interpretation.
Genomics
Study of entire genomes. Enabled by NGS. Key applications include whole-genome sequencing for variant discovery (SNPs, indels, SVs, CNVs), genome-wide association studies (GWAS) linking variants to phenotypes, population genomics, pharmacogenomics (drug response prediction), and comparative genomics across species. The 1000 Genomes Project cataloged >88 million variants across 2,504 individuals from 26 populations.
Transcriptomics (RNA-seq)
Quantifies all RNA species in a sample. Workflow: RNA extraction, poly(A) selection or rRNA depletion, fragmentation, cDNA synthesis, library prep, sequencing, alignment (STAR/HISAT2), quantification (featureCounts/Salmon), differential expression (DESeq2/edgeR). Normalized as FPKM, TPM, or CPM. Reveals gene expression changes, alternative splicing, fusion transcripts, allele-specific expression, and non-coding RNA.
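The normalization units mentioned above differ in what they correct for. A minimal sketch of CPM and TPM from raw counts, assuming one count and one effective transcript length per gene (tools such as Salmon estimate effective lengths themselves; here they are plain inputs):

```python
def cpm(counts):
    """Counts per million: corrects for sequencing depth only."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def tpm(counts, lengths_bp):
    """Transcripts per million: corrects for transcript length, then depth."""
    # Reads per kilobase of transcript
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

if __name__ == "__main__":
    counts = [10, 20, 30]
    lengths = [1000, 2000, 1000]
    print(cpm(counts))
    print(tpm(counts, lengths))
```

In the example, the second gene has twice the reads of the first but is also twice as long, so the two receive the same TPM even though their CPM values differ.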
Proteomics
Large-scale study of all proteins in a sample. Bottom-up proteomics: digest proteins with trypsin, analyze peptides by LC-MS/MS. Top-down proteomics: analyze intact proteins. Quantification: label-free (intensity-based), TMT/iTRAQ (isobaric mass tags for multiplexed quantification), or SILAC (metabolic labeling with heavy amino acids). Post-translational modification (PTM) profiling: phosphoproteomics (TiO2/IMAC enrichment), ubiquitinomics, glycoproteomics, acetylomics.
Metabolomics
Comprehensive analysis of small molecule metabolites (<1500 Da). Platforms: LC-MS (broadest coverage), GC-MS (volatile metabolites), NMR (quantitative, non-destructive). Untargeted metabolomics discovers novel biomarkers; targeted metabolomics quantifies known pathways. Applications: disease diagnosis, drug metabolism (ADME), nutritional studies, microbiome metabolite profiling (short-chain fatty acids, bile acids).
8.2 Single-Cell and Spatial Omics
Traditional omics average signals across millions of cells, masking cellular heterogeneity. Single-cell and spatial technologies resolve individual cells and their spatial context.
Single-Cell RNA-seq (scRNA-seq)
Profiles the transcriptome of individual cells. Droplet-based methods (10x Genomics Chromium) encapsulate single cells with barcoded gel beads in nanoliter droplets. Each cell gets a unique barcode; each mRNA molecule gets a unique molecular identifier (UMI), so that PCR duplicates can be collapsed and amplification bias corrected. Typical experiment: 5,000-50,000 cells, detecting 2,000-5,000 genes per cell.
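The role of the cell barcode and UMI can be shown with a toy counting step. This sketch assumes reads have already been aligned and reduced to (cell barcode, gene, UMI) tuples, and it collapses only exact UMI duplicates; production pipelines typically also merge UMIs that differ by a sequencing error.

```python
from collections import defaultdict

def count_umis(reads):
    """Count distinct UMIs per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, gene, umi) tuples.
    PCR duplicates share all three fields and are counted once.
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}

if __name__ == "__main__":
    reads = [
        ("CELL1", "GAPDH", "AAAC"),
        ("CELL1", "GAPDH", "AAAC"),  # PCR duplicate of the read above
        ("CELL1", "GAPDH", "GGTA"),
        ("CELL2", "GAPDH", "AAAC"),  # same UMI, different cell
    ]
    print(count_umis(reads))
```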
Analysis pipeline: Cell Ranger (alignment/counting), Seurat or Scanpy (normalization, dimensionality reduction via PCA/UMAP, clustering, differential expression, trajectory inference, RNA velocity). Applications: cell atlas projects (Human Cell Atlas), tumor heterogeneity, developmental biology, immune cell profiling.
Single-Cell Multi-omics
Simultaneously measure multiple modalities from the same single cell:
- CITE-seq: scRNA-seq + surface protein (antibody-derived tags). Combines transcriptome with >200 protein markers per cell.
- 10x Multiome: Joint scRNA-seq + scATAC-seq from the same nucleus. Links gene expression to chromatin accessibility.
- SHARE-seq: Chromatin accessibility + gene expression. Reveals regulatory element-gene linkages.
- scCUT&Tag: Single-cell histone modification profiling. Maps epigenomic landscapes per cell.
- Perturb-seq: CRISPR perturbations + scRNA-seq readout. Genome-scale screens with rich phenotypic readout.
Spatial Transcriptomics
Preserves the spatial location of gene expression within tissue sections. Named "Method of the Year 2020" by Nature Methods.
MERFISH (Imaging-based)
Multiplexed Error-Robust FISH. Combinatorial barcoding of RNA species with sequential rounds of hybridization and imaging. Subcellular resolution. Detects 100-10,000 genes per experiment. Error-correcting Hamming codes enable reliable identification. Vizgen MERSCOPE platform.
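Error-robust decoding can be sketched as nearest-codeword lookup. The example assumes a toy 8-bit codebook whose codewords are at least Hamming distance 4 apart, so any single-bit read error still has a unique nearest codeword. The gene names and codewords are made up for illustration and are not taken from a real MERFISH panel.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(observed: str, codebook: dict, max_distance: int = 1):
    """Return the gene whose codeword is within `max_distance` of the
    observed barcode, or None if there is no match or an ambiguous one."""
    matches = [gene for gene, code in codebook.items()
               if hamming(observed, code) <= max_distance]
    return matches[0] if len(matches) == 1 else None

# Hypothetical codebook with pairwise Hamming distance >= 4
CODEBOOK = {
    "geneA": "11110000",
    "geneB": "00001111",
    "geneC": "11001100",
}

if __name__ == "__main__":
    print(decode("11110000", CODEBOOK))  # exact match
    print(decode("11110001", CODEBOOK))  # one bit flipped, still decodable
    print(decode("11111111", CODEBOOK))  # too far from every codeword
```

Each bit corresponds to one round of hybridization and imaging, so a single missed or spurious spot in one round is corrected rather than misassigned.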
Visium (Sequencing-based)
10x Genomics Visium: tissue section placed on a slide with spatially barcoded oligo capture spots (55 micrometer diameter, ~5,000 spots). mRNA captured, reverse-transcribed, and sequenced with spatial barcodes. Whole transcriptome but ~5-10 cell resolution. Visium HD achieves 2-micrometer (single-cell) resolution.
Computational Lab: PCR Amplification Simulation
This Python simulation models PCR amplification cycle-by-cycle with realistic sub-100% efficiency, primer depletion, and product saturation. It computes the amplification curve, determines the qPCR Ct value, and performs Delta-Ct quantification analysis. The output demonstrates the three phases of PCR (exponential, linear, plateau) and how real-world amplification diverges from the ideal 2^n model.
Python: PCR Amplification & qPCR Ct Analysis
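A minimal sketch of a simulation of the kind described above, written as an illustration rather than as the course's own listing. The model is an assumption: per-cycle efficiency falls in proportion to the remaining primer pool and to the approach to a product plateau, each new strand consumes one primer, and all parameter names and default values are hypothetical.

```python
import math

def simulate_pcr(n0, cycles=40, max_efficiency=0.95,
                 primers=1e12, plateau=1e12):
    """Return the copy number after each cycle (index 0 = starting template)."""
    n, primers_left = float(n0), float(primers)
    copies = [n]
    for _ in range(cycles):
        # Efficiency drops as primers are consumed and product saturates.
        efficiency = (max_efficiency * (primers_left / primers)
                      * max(0.0, 1.0 - n / plateau))
        new_strands = min(n * efficiency, primers_left)
        n += new_strands
        primers_left -= new_strands
        copies.append(n)
    return copies

def ct_value(copies, threshold):
    """Fractional cycle at which copies cross the threshold (log interpolation)."""
    if copies[0] >= threshold:
        return 0.0
    for cycle in range(1, len(copies)):
        if copies[cycle] >= threshold:
            lo, hi = copies[cycle - 1], copies[cycle]
            frac = (math.log(threshold) - math.log(lo)) / (math.log(hi) - math.log(lo))
            return (cycle - 1) + frac
    return None  # threshold never reached

def relative_quantity(ct_sample, ct_reference, efficiency=1.0):
    """Delta-Ct quantification: (1 + E) ** (Ct_reference - Ct_sample)."""
    return (1.0 + efficiency) ** (ct_reference - ct_sample)

if __name__ == "__main__":
    for n0 in (1e3, 1e4, 1e5):
        curve = simulate_pcr(n0)
        print(f"N0 = {n0:.0e}: Ct = {ct_value(curve, 1e9):.2f}, "
              f"final copies = {curve[-1]:.2e}")
```

Because the efficiency is below 100%, a tenfold dilution shifts Ct by slightly more than the 3.32 cycles expected from the ideal 2^n model, and the late cycles flatten toward the plateau instead of continuing to double.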
Computational Lab: Sanger Sequencing Simulation
This Fortran program simulates the Sanger dideoxy chain-termination method. It generates a random DNA template sequence, performs four termination reactions (one per ddNTP), collects all terminated fragments, sorts them by length (simulating capillary electrophoresis), and reads off the terminal bases to reconstruct the complement strand. The simulation demonstrates how Sanger sequencing determines DNA sequence from a nested set of terminated fragments.
Fortran: Sanger Sequencing Simulation
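The same chain-termination logic can be sketched compactly in Python (used here for consistency with the other examples, not the Fortran program itself). Assumptions: the template is written 3'-to-5' so the new strand is its position-wise complement read 5'-to-3', each ddNTP reaction terminates with a fixed probability at every matching position, and molecules that never terminate are discarded. All names and parameter values are illustrative.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def run_reaction(template_3to5, ddntp, rng, molecules=500, p_terminate=0.2):
    """One ddNTP reaction: return the set of (fragment length, terminal base)."""
    fragments = set()
    for _ in range(molecules):
        for length, template_base in enumerate(template_3to5, start=1):
            incorporated = COMPLEMENT[template_base]
            # Termination is only possible where the ddNTP matches.
            if incorporated == ddntp and rng.random() < p_terminate:
                fragments.add((length, ddntp))
                break
    return fragments

def sanger_read(template_3to5, seed=0):
    """Pool four reactions, sort fragments by length, read terminal bases."""
    rng = random.Random(seed)
    fragments = set()
    for ddntp in "ACGT":
        fragments |= run_reaction(template_3to5, ddntp, rng)
    # Sorting by length stands in for capillary electrophoresis.
    return "".join(base for _, base in sorted(fragments))

if __name__ == "__main__":
    template = "ATGCGTACCTGA"
    print("template (3'->5'):", template)
    print("read     (5'->3'):", sanger_read(template))
```

If too few molecules are simulated, some fragment lengths may never be sampled and the read silently skips those positions, which mirrors the signal dropout seen toward the end of real Sanger traces.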
Key Equations Summary
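PCR Amplification (with efficiency E):
$$N_n = N_0\,(1 + E)^n$$
Primer Tm (nearest-neighbor):
$$T_m = \frac{\Delta H^\circ}{\Delta S^\circ + R\ln(C_T/4)} - 273.15\;^\circ\text{C}$$
Electrophoretic Mobility:
$$\mu = \frac{v}{E_{\text{field}}} = \frac{q}{f}$$
Phred Quality Score / Sequencing Error:
$$Q = -10\log_{10}P_{\text{error}} \qquad\Longleftrightarrow\qquad P_{\text{error}} = 10^{-Q/10}$$
qPCR Relative Quantification:
$$\text{Fold change} = 2^{-\Delta\Delta C_t}$$
Restriction Site Frequency:
$$P(\text{site}) = \left(\tfrac{1}{4}\right)^{n} \qquad \text{(one site per } 4^{n} \text{ bp on average for an } n\text{-bp recognition sequence)}$$
Bragg's Law (X-ray Crystallography):
$$n\lambda = 2d\sin\theta$$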
Congratulations!
You have completed the Molecular Biology course. You now have a comprehensive foundation in DNA structure, replication, repair, transcription, translation, gene regulation, recombination, and the full spectrum of laboratory techniques from classical cloning to CRISPR genome editing and single-cell multi-omics. These tools and concepts form the backbone of modern biomedical research, clinical diagnostics, and biotechnology.