Module 3
Genome Assembly & Comparative Genomics
Assembly reconstructs a genome from shotgun reads using overlap-layout-consensus (for long reads) or de Bruijn graphs (for short reads). Long-read sequencing (ONT, PacBio HiFi) has transformed assembly quality. Comparative genomics then aligns assembled genomes to reveal synteny and rearrangements.
1. De Bruijn Graphs
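A de Bruijn graph is built from k-mers: every k-mer observed in the reads becomes an edge from its (k−1)-mer prefix X to its (k−1)-mer suffix Y. An edge X → Y therefore requires that the last k−2 characters of X equal the first k−2 characters of Y, and it is labelled by the joined k-mer. Assembly becomes the problem of finding an Eulerian path, that is, a path that traverses every edge exactly once.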
\[ \text{Nodes}: (k-1)\text{-mers}\quad \text{Edges}: k\text{-mers},\ X \to Y\ \text{if}\ X[1:] = Y[:-1] \]
Short-read assemblers (SPAdes, ABySS, Velvet) build de Bruijn graphs, collapse bubbles from sequencing errors, resolve repeats with paired-end information, and emit contigs. Pevzner 2001 demonstrated the approach; modern variants use succinct graph representations for human-scale assemblies.
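The construction above can be sketched in a few lines of Python. This is a toy illustration, not how SPAdes, ABySS, or Velvet are implemented: it assumes error-free reads, keeps one edge per distinct k-mer, and assumes an Eulerian path exists. The function names (`de_bruijn_graph`, `eulerian_path`, `spell`) and the example reads are made up for this sketch.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes it connects to.

    One edge is kept per distinct k-mer (multiplicity is ignored).
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for node in targets:
            in_deg[node] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(iter(graph))
    for node in graph:
        if len(graph[node]) - in_deg[node] == 1:
            start = node
            break
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Join consecutive (k-1)-mers that overlap by k-2 characters."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical error-free reads sampled from ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = spell(eulerian_path(graph))
print(contig)  # ATGGCGTGCA
```

With repeats longer than k−1 the graph admits several Eulerian paths, which is why real assemblers need paired-end or long-range information to choose among them.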
2. Long-Read Assembly (OLC)
Long reads (ONT 10–100 kb, PacBio HiFi 15–25 kb with 99%+ accuracy) favour overlap-layout-consensus: align all reads pairwise, build an overlap graph, lay out paths through it, then compute a consensus sequence for each path. Flye, Canu, hifiasm, and Verkko are current tools. Near-telomere-to-telomere human assemblies (T2T; Nurk et al. 2022, Science) now span each chromosome end-to-end, closing the ~8% of the genome that short-read assembly had left unresolved.
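As a rough illustration of the overlap and layout steps, the sketch below finds exact suffix-prefix overlaps and greedily merges the best-overlapping pair until none remain. It is a simplification under strong assumptions: reads are error-free and on the same strand, the all-vs-all comparison is quadratic, and the consensus step is skipped entirely. Tools such as Flye or hifiasm use approximate alignment and far more careful graph handling. `overlap_len`, `greedy_olc`, and the reads are hypothetical names and data for this example.

```python
def overlap_len(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 if no overlap of at least min_len exists (min_len >= 1).
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_olc(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: the rest stay separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


# Hypothetical reads sampled from ATGGCGTGCAATC.
reads = ["ATGGCGTG", "GCGTGCAA", "TGCAATC"]
contigs = greedy_olc(reads)
print(contigs)  # ['ATGGCGTGCAATC']
```

Greedy merging can mis-join reads across repeats, which is one reason production assemblers build an explicit overlap graph and simplify it before choosing a layout.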
3. Assembly Quality & N50
N50 is the contig length such that 50% of the assembled bases reside in contigs of that length or longer. A higher N50 means larger contigs and therefore a more contiguous assembly. BUSCO scores assess completeness by checking for expected single-copy orthologs. Together, the two metrics summarise the contiguity and completeness of an assembly.
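The N50 definition translates directly into code: sort contigs from longest to shortest and accumulate lengths until half of the total is reached. The function name `n50` and the contig lengths below are illustrative only.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths (in kb): total 290, half is 145.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))  # 70, since 80 + 70 = 150 >= 145
```

Because N50 depends only on lengths, it rewards long contigs even when they are mis-joined, which is why it is usually read alongside a completeness measure such as BUSCO.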
4. Comparative Genomics
Once assembled, genomes are compared with whole-genome aligners: MUMmer (exact-match-based), LAST, minimap2. Synteny blocks reveal conserved gene orders; rearrangement breakpoints mark genome evolution events. Tools like Mauve, progressiveCactus, and SynTracker integrate whole-clade alignment. Pan-genome construction adds presence/absence variation to the reference picture.
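To make the idea of synteny blocks and breakpoints concrete, the sketch below works on gene orders rather than nucleotide alignments. It is not how MUMmer, Mauve, or progressiveCactus operate; it assumes each gene is a single-copy ortholog appearing at most once per genome, and it treats a block as a run of genes that are adjacent in both genomes, in the same (+) or reversed (−) order. The function `synteny_blocks` and the gene labels are hypothetical.

```python
def synteny_blocks(order_a, order_b, min_genes=2):
    """Find runs of genes adjacent in both genomes (toy model, min_genes >= 2).

    Returns a list of (genes, strand) with strand "+" for conserved order
    and "-" for an inverted run, scanning genome A left to right.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current, direction = [], [], 0
    for gene in order_a:
        if current and gene in pos_b:
            step = pos_b[gene] - pos_b[current[-1]]
            if step in (1, -1) and direction in (0, step):
                current.append(gene)
                direction = step
                continue
        # Adjacency broken: this is a breakpoint, so close the open block.
        if len(current) >= min_genes:
            blocks.append((current, "+" if direction == 1 else "-"))
        current, direction = ([gene], 0) if gene in pos_b else ([], 0)
    if len(current) >= min_genes:
        blocks.append((current, "+" if direction == 1 else "-"))
    return blocks


# Hypothetical gene orders: genome B carries an inversion of g4..g6.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
genome_b = ["g1", "g2", "g3", "g6", "g5", "g4", "g7"]
blocks = synteny_blocks(genome_a, genome_b)
for genes, strand in blocks:
    print(strand, genes)
```

In this example the boundaries between the two reported blocks correspond to the breakpoints of the inversion.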
Key References
• Pevzner, P. A. et al. (2001). “An Eulerian path approach to DNA fragment assembly.” Proc. Natl. Acad. Sci., 98, 9748–9753.
• Bankevich, A. et al. (2012). “SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.” J. Comput. Biol., 19, 455–477.
• Nurk, S. et al. (2022). “The complete sequence of a human genome.” Science, 376, 44–53.
• Marçais, G. & Kingsford, C. (2011). “A fast, lock-free approach for efficient parallel counting of k-mers.” Bioinformatics, 27, 764–770.