DNA Sequencing Technologies
From Sanger to nanopores: the evolution of reading the genetic code
Sanger Sequencing (Chain Termination Method)
Developed by Frederick Sanger and colleagues in 1977, the chain termination method (also called dideoxy sequencing) was the gold standard of DNA sequencing for nearly three decades. It earned Sanger his second Nobel Prize in Chemistry (1980) and formed the technological backbone of the Human Genome Project.
Principle of Chain Termination
The method exploits the structural difference between normal deoxynucleotides (dNTPs) and 2',3'-dideoxynucleotides (ddNTPs). While dNTPs possess a 3'-OH group essential for phosphodiester bond formation, ddNTPs lack this hydroxyl group at both the 2' and 3' positions of the ribose sugar. When a ddNTP is incorporated into a growing DNA strand by DNA polymerase, the chain cannot be extended further because no 3'-OH is available for the next nucleotide addition.
Reaction Components
- Template DNA: Single-stranded DNA to be sequenced
- Primer: Short oligonucleotide complementary to a known region flanking the target
- DNA Polymerase: Modified T7 (Sequenase) or Taq polymerase for cycle sequencing
- dNTPs: dATP, dCTP, dGTP, dTTP at normal concentrations
- ddNTPs: ddATP, ddCTP, ddGTP, ddTTP -- each labeled with a different fluorescent dye (in modern automated versions)
In the original method, four separate reactions were performed, each containing one type of ddNTP. Modern automated Sanger sequencing uses a single reaction with all four ddNTPs, each labeled with a distinct fluorophore. The fragments are separated by capillary electrophoresis, where a laser excites the fluorescent labels as fragments migrate past a detector window, producing a four-color chromatogram (electropherogram). Each peak corresponds to a nucleotide, and the sequence is read from shortest to longest fragment.
Sanger Sequencing Characteristics
- Read length: 700-1,000 bp
- Accuracy: 99.999% per base
- Throughput: ~115 kb/run (96 capillaries)
- Error type: Primarily substitution errors
- Run time: ~2 hours per run
- Cost: ~$500/Mb (2001 era)
Despite being superseded by high-throughput methods for large-scale projects, Sanger sequencing remains widely used for validation of variants, sequencing of PCR amplicons, plasmid verification, and clinical diagnostic applications where extremely high per-base accuracy is essential.
Next-Generation Sequencing: Illumina Platform
Next-generation sequencing (NGS), also called massively parallel sequencing or second-generation sequencing, revolutionized genomics by enabling millions to billions of short DNA fragments to be sequenced simultaneously. The Illuminaplatform (based on Solexa technology, acquired in 2007) dominates the NGS market with an estimated 80-90% share of all sequencing data generated worldwide.
Library Preparation
Genomic DNA is fragmented (by sonication or enzymatic treatment) to a target size range (typically 200-800 bp). Platform-specific adapters are ligated to both ends of each fragment. These adapters contain: (1) sequences complementary to the flow cell oligonucleotides, (2) primer binding sites for sequencing, and (3) index sequences (barcodes) for sample multiplexing.
Bridge Amplification & Cluster Generation
The adapter-ligated library is loaded onto a glass flow cell coated with a dense lawn of two types of oligonucleotides complementary to the adapter sequences. Each fragment hybridizes to a surface oligo and is copied by a surface-bound polymerase. The original template is denatured and washed away, and the newly synthesized strand bends over to hybridize with an adjacent complementary oligo -- forming a "bridge."
Repeated cycles of denaturation, annealing, and extension (bridge amplification) generate dense clonal clusters of approximately 1,000 identical copies of each original fragment. This amplification is critical because the fluorescent signal from a single molecule would be too faint to detect reliably.
Sequencing by Synthesis (SBS)
After cluster generation, the reverse strands are cleaved and washed away, leaving forward strands. A sequencing primer hybridizes to the adapter region, and sequencing by synthesis proceeds one nucleotide at a time:
- Fluorescently labeled, 3'-O-azidomethyl-blocked nucleotides (reversible terminators) are flowed across the surface
- DNA polymerase incorporates the complementary nucleotide at each cluster
- Unincorporated nucleotides are washed away
- The entire flow cell is imaged -- each cluster fluoresces with a color corresponding to the incorporated base
- The fluorescent dye and 3' blocking group are chemically cleaved, regenerating a free 3'-OH for the next cycle
Illumina Platform Specifications
| Platform | Read Length | Output/Run | Reads/Run |
|---|---|---|---|
| MiSeq | 2 x 300 bp | ~15 Gb | ~25 million |
| NextSeq 2000 | 2 x 150 bp | ~360 Gb | ~1.2 billion |
| NovaSeq 6000 | 2 x 150 bp | ~6,000 Gb | ~20 billion |
| NovaSeq X Plus | 2 x 150 bp | ~16,000 Gb | ~52 billion |
The paired-end sequencing mode reads each fragment from both ends, generating two reads per cluster (e.g., 2 x 150 bp). The known insert size between paired reads provides valuable structural information for genome assembly and detection of structural variants.
Third-Generation Sequencing
Third-generation (or long-read) sequencing platforms overcome a fundamental limitation of short-read technologies: the inability to span repetitive regions, structural variants, and complex genomic rearrangements. These platforms sequence single molecules in real time, producing reads that are orders of magnitude longer than Illumina reads.
PacBio SMRT Sequencing
Pacific Biosciences (PacBio) developed Single Molecule, Real-Time (SMRT) sequencing. The technology employs zero-mode waveguides (ZMWs)-- nanophotonic confinement structures that are cylindrical holes (approximately 70 nm in diameter and 100 nm deep) fabricated in a thin metal film deposited on a glass substrate. Each ZMW confines the observation volume to the zeptoliter range ($\sim 10^{-21}$ liters), allowing detection of a single DNA polymerase molecule immobilized at the bottom.
A circular DNA template (SMRTbell) -- formed by ligating hairpin adapters to both ends of a double-stranded insert -- threads through the polymerase. As each fluorescently labeled nucleotide is incorporated, its fluorophore emits a characteristic pulse of light that is detected in real time. The fluorophore is attached to the terminal phosphate group and is cleaved upon incorporation, so no modification remains on the DNA.
PacBio Specifications
- Read length: Average 10-25 kb; HiFi reads 15-20 kb; ultra-long reads >100 kb possible
- Raw error rate: ~10-15% (random errors, predominantly insertions/deletions)
- HiFi mode: Circular consensus sequencing (CCS) achieves >99.9% accuracy by reading the same molecule multiple times (typically 10+ passes)
- Base modifications: Directly detects DNA methylation (5mC, 6mA, 4mC) from polymerase kinetics (interpulse duration)
- Latest platform: Revio system with ~25 million ZMWs per SMRT Cell
Oxford Nanopore Technology (ONT)
Oxford Nanopore Technologies takes a fundamentally different approach: instead of using polymerase-mediated synthesis, it directly reads the nucleotide sequence by threading a single strand of DNA through a biological nanopore -- a protein channel (typically the modified Mycobacterium smegmatis porin A, MspA, or CsgG from E. coli) embedded in an electrically resistant synthetic membrane.
A constant voltage (~180 mV) is applied across the membrane, driving ionic current through the pore. As each nucleotide passes through the narrowest constriction of the pore (the "reader head"), it partially blocks the ionic current. The magnitude of current blockade is characteristic of the specific nucleotide (or k-mer of approximately 5-6 bases) within the sensing region. A motor protein (helicase) controls the translocation speed to ~450 bases per second.
Nanopore Platform Comparison
| Device | Pores | Output | Use Case |
|---|---|---|---|
| Flongle | 126 | ~2.8 Gb | Small genomes, amplicons, rapid tests |
| MinION | 512 | ~50 Gb | Portable field sequencing, bacterial genomes |
| PromethION | ~3,000 per cell | ~290 Gb/cell | Human whole-genome sequencing, population studies |
The longest single nanopore read recorded exceeds 4 Mb, enabling the resolution of highly repetitive regions and complex structural variants in a single contiguous read.
Quality Scores & Error Profiles
Sequencing quality assessment is essential for downstream analysis. The universally adopted metric is the Phred quality score, originally developed for the Phred base-calling program used in Sanger sequencing and now applied across all platforms.
Phred Quality Score
The Phred score $Q$ is logarithmically related to the probability of an incorrect base call $P_e$:
Conversely, the error probability can be calculated from the quality score:
| Phred Score (Q) | Error Probability | Base Call Accuracy | Errors per kb |
|---|---|---|---|
| Q10 | 1 in 10 | 90% | 100 |
| Q20 | 1 in 100 | 99% | 10 |
| Q30 | 1 in 1,000 | 99.9% | 1 |
| Q40 | 1 in 10,000 | 99.99% | 0.1 |
| Q50 | 1 in 100,000 | 99.999% | 0.01 |
Error Profiles by Platform
Different sequencing technologies exhibit characteristic error profiles that must be accounted for in bioinformatic analyses:
- Illumina: Predominantly substitution errors (especially A$\rightarrow$C and G$\rightarrow$T). Error rate increases toward the 3' end of reads due to phasing/pre-phasing (desynchronization of clusters). Typical Q30 >85% of bases.
- PacBio (CLR): Predominantly insertion/deletion (indel) errors, especially single-base insertions. Errors are largely random, so consensus accuracy improves rapidly with coverage.
- PacBio (HiFi/CCS): Multiple passes around the circular template yield consensus accuracy of Q30-Q40+, with residual errors primarily in homopolymer regions.
- Oxford Nanopore: Systematic errors in homopolymer stretches (runs of identical bases, e.g., AAAAA), as the current levels for consecutive identical nucleotides are difficult to distinguish. Newer basecallers and pore chemistries (R10.4.1) have significantly improved homopolymer resolution.
Quality Control with FastQC
The most widely used tool for sequencing quality assessment is FastQC, which produces reports on per-base quality scores, per-sequence quality distribution, GC content, sequence duplication levels, adapter contamination, and overrepresented sequences. A typical Illumina run should show median per-base quality of Q30 or above across most of the read length, with the possible exception of the terminal 10-20 bases.
Sequencing Depth, Coverage & the Lander-Waterman Model
Two critical parameters in sequencing experimental design are sequencing depth (or coverage) and the related concept of breadth of coverage. Understanding these concepts is essential for determining how much sequencing is required for a given application.
Coverage (Depth of Sequencing)
The average coverage (or depth) $C$ represents the average number of times each base in the genome is sequenced. It is calculated as:
Where:
- $N$ = total number of reads
- $L$ = average read length (bp)
- $G$ = genome size (bp)
For example, sequencing a human genome ($G = 3.2 \times 10^9$ bp) with 1 billion 150 bp Illumina reads:
The Lander-Waterman Model
The Lander-Waterman equation (1988) provides a mathematical framework for predicting the expected coverage distribution, assuming reads are placed randomly and uniformly across the genome (a Poisson process).
Key Lander-Waterman Equations
The probability that a specific base is not covered by any read (i.e., falls in a sequencing gap) follows the Poisson distribution:
The expected fraction of the genome covered by at least one read:
More generally, the probability that a given position is covered by exactly $k$ reads at mean coverage $C$:
| Coverage (C) | % Genome Covered | Gap Probability | Application |
|---|---|---|---|
| 1x | 63.2% | 36.8% | Skim sequencing, basic survey |
| 5x | 99.3% | 0.67% | Low-pass WGS, imputation-based genotyping |
| 10x | 99.995% | 0.005% | Variant calling (low sensitivity) |
| 30x | >99.9999% | $9.4 \times 10^{-14}$ | Standard WGS (germline variant calling) |
| 100x | ~100% | $\sim 0$ | Somatic variant detection in tumors |
Cost per Genome: Moore's Law and Beyond
The cost of sequencing a human genome has dropped at a rate exceeding Moore's Law. From approximately $2.7 billion for the first human genome (HGP, 2003), costs fell to ~$10 million in 2007 (pre-NGS), ~$10,000 in 2011 (early Illumina HiSeq era), ~$1,000 in 2014 (Illumina HiSeq X), and reached approximately $200 by 2023 (Illumina NovaSeq X). The dramatic reduction in 2008 corresponds to the commercial introduction of NGS platforms, creating a discontinuity in the cost curve that outpaced even semiconductor scaling.
Choosing the Right Platform
Platform selection depends on the biological question. Short reads (Illumina) excel at SNP/indel detection, quantitative applications (RNA-seq, ChIP-seq), and population-scale projects due to high accuracy and throughput. Long reads (PacBio, ONT) are essential for de novo assembly, structural variant detection, repeat resolution, haplotype phasing, and direct epigenetic modification detection. Many modern projects employ hybrid approaches -- combining short reads for base-level accuracy with long reads for structural completeness.