Part I: Genomics | Chapter 4

Comparative & Functional Genomics

From genome comparisons to functional screens: linking sequence variation to biological function

Comparative Genomics: Synteny, Orthology & Phylogenomics

Comparative genomics systematically analyzes the similarities and differences between genomes of different species to understand evolutionary relationships, identify conserved functional elements, and infer the genomic changes that underlie phenotypic diversity. With thousands of genomes now sequenced, comparative approaches have become one of the most powerful tools in modern biology.

Orthology and Paralogy

Homologous genes are classified based on the evolutionary event that produced them:

  • Orthologs: Genes in different species derived from a single ancestral gene through speciation. Orthologs typically retain the same function across species (e.g., human TP53 and mouse Trp53). Ortholog identification is the basis for functional annotation transfer between species.
  • Paralogs: Genes within the same species derived from a gene duplication event. Paralogs often diverge in function over time through subfunctionalization (division of ancestral functions) or neofunctionalization (acquisition of new functions). Example: the human $\alpha$-globin and $\beta$-globin genes.
  • Xenologs: Homologs acquired through horizontal gene transfer, particularly common in prokaryotes (e.g., antibiotic resistance genes).

Synteny

Synteny refers to the conservation of gene order and orientation between chromosomal regions of different species. Macro-synteny (conservation of large chromosomal blocks) is observed even between distantly related species -- for example, large syntenic blocks are shared between human and chicken genomes despite ~310 million years of divergence. Micro-synteny (conservation of local gene neighborhoods of 2-10 genes) is even more widely preserved.

Synteny analysis uses tools such as MCScanX, SynMap (in CoGe), and Satsuma. Synteny maps are visualized as dot plots or Circos diagrams. Breaks in synteny correspond to chromosomal rearrangements (inversions, translocations, fusions, fissions) that have occurred since the last common ancestor.

The Molecular Clock

The molecular clock hypothesis, proposed by Zuckerkandl and Pauling (1962), posits that the number of amino acid or nucleotide substitutions between homologous sequences of two species is approximately proportional to the time since their divergence:

$$d = 2rt$$

Where:

  • $d$ = observed number of substitutions per site between two sequences
  • $r$ = substitution rate per site per year
  • $t$ = divergence time (years since the last common ancestor)
  • The factor of 2 accounts for substitutions accumulating independently in both lineages

In practice, substitution rates vary across lineages, genes, and genomic regions. Modern phylogenomics uses relaxed molecular clock models (implemented in BEAST, MCMCtree) that allow rate variation and are calibrated with fossil evidence.

Phylogenomics

Phylogenomics extends traditional phylogenetics by using genome-scale data to reconstruct evolutionary relationships. Rather than relying on a single gene (e.g., 16S rRNA), phylogenomic analyses use hundreds or thousands of orthologous genes, either through concatenation (supermatrix approach) or by inferring individual gene trees and reconciling them (supertree/coalescent-based approaches such as ASTRAL). Whole-genome alignments and gene content comparisons provide additional phylogenetic signals.

Genome-Wide Association Studies (GWAS)

Genome-wide association studies (GWAS) represent one of the most impactful applications of genomics to human health. GWAS systematically scan the genomes of large populations (typically thousands to millions of individuals) to identify genetic variants -- primarily single nucleotide polymorphisms (SNPs) -- that are statistically associated with a particular trait or disease.

Study Design & SNP Arrays

In a typical case-control GWAS design, hundreds of thousands of common SNPs (minor allele frequency > 1-5%) are genotyped simultaneously using SNP microarrays(e.g., Illumina Global Screening Array, Affymetrix Axiom). Modern arrays interrogate 500,000 to over 2 million SNP positions, and ungenotyped variants can be statistically inferred through genotype imputation using reference panels (e.g., the 1000 Genomes Project or the TOPMed panel), bringing the effective number of tested variants to 10-40 million.

Hardy-Weinberg Equilibrium

A fundamental quality control step in GWAS is testing each SNP for deviation from Hardy-Weinberg equilibrium (HWE). For a biallelic locus with alleles$A$ (frequency $p$) and $a$ (frequency $q = 1 - p$), the expected genotype frequencies under random mating are:

$$\text{freq}(AA) = p^2, \quad \text{freq}(Aa) = 2pq, \quad \text{freq}(aa) = q^2$$
$$p^2 + 2pq + q^2 = 1$$

Significant deviation from HWE in controls (tested by chi-squared goodness-of-fit) may indicate genotyping errors, population stratification, or selection, and such SNPs are typically excluded from analysis.

Statistical Testing & Manhattan Plots

For each SNP, the association with the trait is tested, typically using logistic regression for binary traits or linear regression for quantitative traits. The key output is a p-valuefor each SNP, representing the probability of observing an association as strong as (or stronger than) the observed one under the null hypothesis of no association.

Odds Ratio

For case-control studies, the effect size is commonly expressed as the odds ratio (OR):

$$\text{OR} = \frac{p_{\text{case}} / (1 - p_{\text{case}})}{p_{\text{control}} / (1 - p_{\text{control}})}$$

Where $p_{\text{case}}$ and $p_{\text{control}}$ are the frequencies of the risk allele in cases and controls, respectively. An OR > 1 indicates increased risk; OR < 1 indicates a protective effect. Most common GWAS variants have modest effect sizes (OR = 1.05-1.3).

Results are visualized in a Manhattan plot, where each dot represents a SNP, plotted with genomic position on the x-axis and $-\log_{10}(p\text{-value})$on the y-axis. Peaks rising above the genome-wide significance threshold appear as "skyscrapers" in the Manhattan skyline, indicating genomic loci harboring trait-associated variants.

Multiple Testing Correction: Bonferroni Threshold

Testing millions of SNPs simultaneously creates a massive multiple testing burden. To maintain a family-wise error rate of $\alpha = 0.05$, the Bonferroni correctionadjusts the significance threshold:

$$p_{\text{threshold}} = \frac{\alpha}{m}$$

For a typical GWAS testing $m \approx 10^6$ independent common variants (accounting for linkage disequilibrium), this yields:

$$p_{\text{threshold}} = \frac{0.05}{10^6} = 5 \times 10^{-8}$$

This genome-wide significance threshold of $5 \times 10^{-8}$ has become the standard in the field. A less conservative alternative is the False Discovery Rate (FDR)correction using the Benjamini-Hochberg procedure.

Linkage Disequilibrium (LD)

Linkage disequilibrium is the non-random association of alleles at different loci. When two SNPs are in high LD, knowing the genotype at one SNP provides strong prediction of the genotype at the other. LD is quantified by $r^2$ (ranging from 0 to 1) or $D'$. GWAS exploits LD because the genotyped tag SNPs serve as proxies for ungenotyped causal variants -- a significant GWAS signal therefore implicates a genomic region (locus) rather than pinpointing the exact causal variant. Fine-mapping methods (e.g., FINEMAP, SuSiE) and functional follow-up experiments are needed to identify the causal variant(s).

Epigenomics: DNA Methylation & Histone Modifications

Epigenomics studies heritable changes in gene expression that do not involve alterations to the DNA sequence itself. The two major epigenetic mechanisms are DNA methylation and histone modifications, both of which alter chromatin accessibility and thereby regulate transcription.

DNA Methylation

In mammals, DNA methylation occurs predominantly at the 5-carbon position of cytosine in CpG dinucleotides, catalyzed by DNA methyltransferases (DNMTs). DNMT3A and DNMT3B are de novo methyltransferases that establish new methylation patterns, while DNMT1 is the maintenance methyltransferase that copies methylation patterns to the newly synthesized strand during replication. Approximately 70-80% of CpG sites in the human genome are methylated.

Techniques for Studying DNA Methylation

  • Bisulfite Sequencing (BS-seq): The gold standard. Sodium bisulfite converts unmethylated cytosines to uracil (read as thymine after PCR), while methylated cytosines are protected. Whole-genome bisulfite sequencing (WGBS) provides single-base resolution methylation maps but requires deep sequencing (~30x coverage).
  • Reduced Representation (RRBS): Uses restriction enzyme digestion (MspI) to enrich for CpG-rich regions, reducing sequencing costs while capturing most CpG islands and promoters.
  • Methylation Arrays: Illumina Infinium arrays (EPIC array covers ~850,000 CpG sites) provide cost-effective methylation profiling for large cohorts.
  • Nanopore Sequencing: Direct detection of 5-methylcytosine (5mC) and N6-methyladenine (6mA) from native DNA without bisulfite conversion, preserving long-range phasing information.

Histone Modifications & ChIP-seq

The N-terminal tails of core histones are subject to a vast array of covalent post-translational modifications that constitute the histone code. Key modifications include:

ModificationLocationAssociationFunction
H3K4me3Active promotersActivationRecruits TFIID; marks transcription start sites
H3K4me1EnhancersPoised/ActiveMarks enhancer regions (with H3K27ac = active)
H3K27acActive promoters & enhancersActivationDistinguishes active from poised enhancers
H3K36me3Gene bodiesTranscription elongationMarks actively transcribed gene bodies
H3K9me3Constitutive heterochromatinRepressionRecruits HP1; silences repeats and pericentromeric regions
H3K27me3Facultative heterochromatinRepressionDeposited by Polycomb (PRC2); silences developmental genes

ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is the standard method for mapping histone modifications, transcription factor binding sites, and other chromatin-associated proteins genome-wide. The protocol involves: (1) crosslinking proteins to DNA with formaldehyde, (2) sonicating chromatin to 200-500 bp fragments, (3) immunoprecipitation with an antibody specific to the target modification or protein, (4) reversing crosslinks and purifying DNA, and (5) sequencing the enriched fragments. Peak-calling algorithms (MACS2, HOMER) identify genomic regions significantly enriched over input controls. The ENCODE and Roadmap Epigenomics projects have generated comprehensive ChIP-seq catalogs across hundreds of cell types and tissues.

Functional Genomics: CRISPR Screens, RNAi & Gene Knockouts

While comparative and association-based approaches identify candidate genes and variants, understanding gene function ultimately requires experimental perturbation. Functional genomics encompasses the technologies used to systematically disrupt, modify, or modulate gene expression and measure the phenotypic consequences at a genome-wide scale.

CRISPR-Cas9 Technology

The CRISPR-Cas9 system (Clustered Regularly Interspaced Short Palindromic Repeats and CRISPR-associated protein 9) has revolutionized functional genomics since its adaptation as a genome editing tool by Doudna and Charpentier (2012). The system consists of two components:

  • Cas9: An RNA-guided endonuclease (from Streptococcus pyogenes) that introduces double-strand breaks (DSBs) at target sites
  • sgRNA (single guide RNA): A ~100 nt synthetic RNA combining a 20 nt target-complementary spacer sequence with a scaffold that binds Cas9. The spacer determines the genomic target, which must be adjacent to a PAM (protospacer adjacent motif)sequence (NGG for SpCas9)

CRISPR Functional Screens

Genome-wide CRISPR screens use libraries of tens of thousands of sgRNAs (4-10 guides per gene, covering ~18,000-20,000 genes) to systematically knock out every gene in a population of cells:

  • Knockout screens (CRISPRko): Wild-type Cas9 creates DSBs repaired by error-prone NHEJ, causing frameshift mutations that disrupt gene function
  • Inhibition screens (CRISPRi): Catalytically dead Cas9 (dCas9) fused to a repressor domain (KRAB) silences transcription without altering DNA sequence
  • Activation screens (CRISPRa): dCas9 fused to transcriptional activators (VP64, p65, Rta) upregulates gene expression

After selection (e.g., drug treatment, growth advantage, reporter expression), enriched or depleted sgRNAs are identified by next-generation sequencing, revealing genes essential for the phenotype under study. Analysis tools include MAGeCK and BAGEL.

RNA Interference (RNAi)

Before CRISPR, RNA interference (RNAi) was the primary tool for genome-wide loss-of-function screens. RNAi uses small interfering RNAs (siRNAs, ~21 nt duplexes) or short hairpin RNAs (shRNAs) to trigger post-transcriptional degradation of complementary mRNAs through the RISC (RNA-induced silencing complex) pathway. Unlike CRISPR knockouts, RNAi produces knockdowns (partial reduction of expression) rather than complete gene ablation, which can be advantageous for studying essential genes. However, RNAi is limited by significant off-target effects (seed-region complementarity) and incomplete knockdown efficiency.

Traditional Gene Knockouts

In model organisms, targeted gene disruption through homologous recombination has been used extensively. The International Knockout Mouse Consortium (IKMC) aimed to create knockout alleles for every protein-coding gene in the mouse genome (~20,000 genes). Gene trap and conditional knockout technologies (Cre-loxP system) enable tissue-specific and temporally controlled gene inactivation, overcoming the embryonic lethality associated with ~30% of constitutive knockouts.

Gene Ontology Enrichment & Pharmacogenomics

Gene Set Enrichment Analysis

A common challenge in functional genomics is interpreting long lists of genes identified by GWAS, differential expression, or functional screens. Gene Ontology (GO) enrichment analysis asks whether a particular GO term is overrepresented in the gene list compared to what would be expected by chance. The statistical test is typically a hypergeometric test (Fisher's exact test):

$$P(X \geq k) = \sum_{i=k}^{\min(K,n)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$$

Where:

  • $N$ = total number of genes in the genome (background)
  • $K$ = number of genes annotated with the GO term of interest
  • $n$ = number of genes in the input list (foreground)
  • $k$ = number of genes in the input list annotated with the GO term

Enrichment Analysis Tools & Methods

Several approaches exist for pathway and functional enrichment:

  • Over-Representation Analysis (ORA): Tests a predefined gene list against GO/KEGG terms (e.g., DAVID, g:Profiler, clusterProfiler)
  • Gene Set Enrichment Analysis (GSEA): Uses a ranked list of all genes (by fold change or p-value) and tests whether genes in a set tend to cluster at the top or bottom of the ranked list, avoiding the need for an arbitrary cutoff
  • Network-based methods: Integrate protein-protein interaction networks (STRING, BioGRID) to identify functional modules enriched in the gene list

Multiple testing correction (Benjamini-Hochberg FDR) is essential when testing hundreds of GO terms simultaneously. Adjusted p-values (q-values) < 0.05 are typically required for significance.

Pharmacogenomics

Pharmacogenomics applies genomic information to understand individual variability in drug response -- encompassing drug efficacy, adverse reactions, and optimal dosing. It represents one of the most clinically actionable areas of genomics.

Key Pharmacogenomic Examples

GeneDrugClinical Impact
CYP2D6Codeine, TamoxifenPoor metabolizers: no analgesic effect from codeine; ultra-rapid: toxicity risk
CYP2C19ClopidogrelPoor metabolizers: reduced anti-platelet effect; increased cardiovascular risk
HLA-B*57:01Abacavir (HIV)Carriers at high risk for severe hypersensitivity reaction; pre-screening required
VKORC1WarfarinPolymorphisms affect warfarin sensitivity; guides dosing algorithms
DPYD5-FluorouracilDPD deficiency causes severe/fatal toxicity; pre-treatment testing recommended

The Clinical Pharmacogenetics Implementation Consortium (CPIC)provides evidence-based guidelines for translating pharmacogenomic test results into prescribing decisions. The PharmGKB database curates gene-drug-phenotype relationships and clinical annotations. As whole-genome sequencing becomes routine in clinical settings, pre-emptive pharmacogenomic testing (genotyping a panel of pharmacogenes before any drug is prescribed) is increasingly being integrated into electronic health records for clinical decision support.

From Genomics to Precision Medicine

The convergence of GWAS discoveries, epigenomic profiling, functional screen results, and pharmacogenomic knowledge is driving the paradigm shift toward precision medicine -- tailoring disease prevention, diagnosis, and treatment strategies to an individual's genomic profile. Polygenic risk scores (PRS), which aggregate the effects of thousands of GWAS variants into a single risk metric, are increasingly being evaluated for clinical utility in predicting disease susceptibility for conditions such as coronary artery disease, type 2 diabetes, and breast cancer.