Part I: Genomics | Chapter 1

Genome Organization & Architecture

From nucleotides to chromosomes: genome size, chromatin structure, repetitive elements, and the blueprint of life

Genome Size & the C-Value Paradox

The C-value refers to the total amount of DNA contained within a haploid genome of a species, measured in picograms (pg) or base pairs (bp). One might intuitively expect that more complex organisms would possess larger genomes. However, genome size does not correlate linearly with organismal complexity -- a phenomenon famously known as the C-value paradox (or C-value enigma).

Definition: C-Value

The C-value is defined as the mass of DNA in the haploid genome:

$$C = m_{\text{haploid DNA}} \;\text{(in pg)}, \qquad 1\,\text{pg} \approx 978\,\text{Mb}$$

The conversion factor between picograms and megabases follows from Avogadro's number and the mean mass of a nucleotide pair: about 616 Da per pair yields ~978 Mb per pg (the rounder figure of ~650 Da that is sometimes quoted would give ~926 Mb).
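The conversion can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming a mean mass of about 615.9 Da per nucleotide pair (the value that reproduces 978 Mb per pg; a heavier ~650 Da figure gives about 926 Mb). The function names are illustrative and not taken from any library.

```python
AVOGADRO = 6.02214076e23  # particles per mole (exact SI definition)


def mb_per_picogram(da_per_bp: float = 615.9) -> float:
    """Megabase pairs contained in 1 pg of double-stranded DNA.

    da_per_bp is the assumed mean mass of one nucleotide pair in daltons
    (numerically equal to grams per mole).
    """
    grams = 1e-12                                # 1 pg expressed in grams
    base_pairs = grams * AVOGADRO / da_per_bp    # moles of bp -> count of bp
    return base_pairs / 1e6                      # bp -> Mb


def c_value_pg(genome_mb: float, da_per_bp: float = 615.9) -> float:
    """Convert a haploid genome size in Mb to a C-value in picograms."""
    return genome_mb / mb_per_picogram(da_per_bp)


print(f"Mb per pg at 615.9 Da/bp: {mb_per_picogram():.0f}")
print(f"Mb per pg at 650 Da/bp:   {mb_per_picogram(650):.0f}")
print(f"Human C-value (3,200 Mb): {c_value_pg(3200):.2f} pg")
```

The choice of the per-pair mass is the only real parameter here, which is why the two commonly seen figures lead to noticeably different conversion factors.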

For example, the single-celled amoeba Amoeba dubia has a genome of approximately 670 billion base pairs -- over 200 times larger than the human genome (3.2 billion bp). Meanwhile, the pufferfish Tetraodon nigroviridis possesses a compact vertebrate genome of only ~340 Mb, yet it has a gene count comparable to humans. This dramatic variation is largely explained by differences in the amount of non-coding DNA, repetitive sequences, and transposable elements.

| Organism | Genome Size (Mb) | Estimated Genes | Gene Density (genes/Mb) |
|---|---|---|---|
| Escherichia coli | 4.6 | ~4,300 | ~935 |
| Saccharomyces cerevisiae | 12.1 | ~6,000 | ~496 |
| Drosophila melanogaster | 180 | ~14,000 | ~78 |
| Homo sapiens | 3,200 | ~20,000 | ~6.3 |
| Triticum aestivum (wheat) | 17,000 | ~107,000 | ~6.3 |
| Paris japonica | 149,000 | Unknown | Very low |
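The gene density column is simply the estimated gene count divided by the genome size in Mb. A short sketch that recomputes it from the figures in the table (the dictionary and function names are illustrative):

```python
# (genome size in Mb, estimated gene count), taken from the table above
GENOMES = {
    "Escherichia coli": (4.6, 4_300),
    "Saccharomyces cerevisiae": (12.1, 6_000),
    "Drosophila melanogaster": (180, 14_000),
    "Homo sapiens": (3_200, 20_000),
    "Triticum aestivum": (17_000, 107_000),
}


def gene_density(size_mb: float, genes: int) -> float:
    """Genes per megabase of genome."""
    return genes / size_mb


for name, (size_mb, genes) in GENOMES.items():
    print(f"{name:26s} {gene_density(size_mb, genes):8.1f} genes/Mb")
```

Paris japonica is left out because the table gives no gene count for it.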

Gene density falls dramatically from prokaryotes to large-genome eukaryotes. For eukaryotes, the relationship between genome size $G$ (in Mb) and approximate gene count $n$ can be roughly modeled as logarithmic, where $k$ and $c$ are empirical constants fitted to the data:

$$n \approx k \cdot \ln(G) + c$$

This illustrates that gene number increases far more slowly than genome size, due primarily to the accumulation of non-coding and repetitive sequences in larger genomes.
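As an illustration of how the two constants in the logarithmic model could be estimated, the sketch below fits $n = k \ln(G) + c$ by ordinary least squares to the four eukaryotic rows of the table. Four points are far too few for a meaningful fit, so this only shows the mechanics; the names `EUKARYOTES` and `fit_log_model` are hypothetical.

```python
import math

# (genome size in Mb, estimated genes) for the eukaryotes in the table
EUKARYOTES = [(12.1, 6_000), (180, 14_000), (3_200, 20_000), (17_000, 107_000)]


def fit_log_model(points):
    """Least-squares estimates (k, c) for n = k * ln(G) + c."""
    xs = [math.log(g) for g, _ in points]   # predictor: ln(genome size)
    ys = [n for _, n in points]             # response: gene count
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                 # slope
    c = mean_y - k * mean_x       # intercept
    return k, c


k, c = fit_log_model(EUKARYOTES)
print(f"k = {k:.0f} genes per unit ln(Mb), c = {c:.0f}")
```

A positive slope on a log axis means each doubling of genome size adds roughly the same number of genes, rather than doubling the gene count.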

Chromatin Structure & Nucleosome Organization

In eukaryotic cells, genomic DNA does not exist as naked double-stranded molecules. Instead, it is packaged into a highly organized nucleoprotein complex known as chromatin. The fundamental repeating unit of chromatin is the nucleosome, which achieves the first level of DNA compaction.

The Nucleosome Core Particle

Each nucleosome consists of approximately 147 base pairs of DNA wrapped in 1.67 left-handed superhelical turns around a histone octamer. The octamer is composed of two copies each of the four core histone proteins: H2A, H2B, H3, and H4. These histones are small, highly basic proteins rich in lysine and arginine residues, which facilitate electrostatic interactions with the negatively charged phosphate backbone of DNA.

Histone Octamer Assembly

The octamer assembles in a stepwise manner. First, an $(\text{H3-H4})_2$ tetramer forms, followed by the addition of two $\text{H2A-H2B}$ dimers:

$$\text{Octamer} = (\text{H3})_2(\text{H4})_2(\text{H2A})_2(\text{H2B})_2$$

The linker histone H1 binds to the DNA entering and exiting the nucleosome, sealing approximately 20 bp of additional DNA and stabilizing higher-order structures. The resulting complex with H1 is called the chromatosome (~167 bp total).

Nucleosome Repeat Length

Nucleosomes are connected by stretches of linker DNA whose length varies between species and cell types. The nucleosome repeat length (NRL) is defined as:

$$\text{NRL} = \text{Core DNA} + \text{Linker DNA} = 147\,\text{bp} + L$$

Typical NRL values range from ~165 bp in yeast (with very short linker DNA) to ~200 bp in most mammalian cells, and up to ~260 bp in some sea urchin sperm. The average human NRL is approximately 185-200 bp.
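The NRL definition can be rearranged to give the linker length, and dividing a genome by the NRL gives a rough nucleosome count. A minimal sketch under the 147 bp core assumption stated above (function names are illustrative):

```python
CORE_DNA_BP = 147  # DNA wrapped around one histone octamer


def linker_length(nrl_bp: int) -> int:
    """Linker DNA length L implied by NRL = 147 bp + L."""
    if nrl_bp < CORE_DNA_BP:
        raise ValueError("NRL cannot be shorter than the core particle DNA")
    return nrl_bp - CORE_DNA_BP


def nucleosome_count(genome_bp: int, nrl_bp: int) -> int:
    """Approximate number of nucleosomes if the genome is uniformly packaged."""
    return genome_bp // nrl_bp


for label, nrl in [("yeast", 165), ("mammalian", 200), ("sea urchin sperm", 260)]:
    print(f"{label}: linker of about {linker_length(nrl)} bp")

print(nucleosome_count(3_200_000_000, 200), "nucleosomes per haploid human genome")
```

The count assumes uniform spacing across the whole genome, which real chromatin does not have, so it should be read as an order-of-magnitude estimate.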

Higher-Order Chromatin Structure

The "beads-on-a-string" 10 nm fiber (nucleosomal array) undergoes further compaction into the 30 nm chromatin fiber. Two competing models describe this structure: the solenoid model (a one-start helix with six nucleosomes per turn) and the zigzag model (a two-start helix in which consecutive nucleosomes alternate between two stacks, with linker DNA crossing the fiber). Recent cryo-EM and Hi-C data suggest the in vivo structure may be more irregular, with local variations in packing depending on histone modifications and chromatin remodeling activities.

Beyond the 30 nm fiber, chromatin forms looped domains (300 nm), which are organized into rosettes and further compressed into the highly condensed metaphase chromosome structure (1,400 nm wide).

DNA Packaging Ratio

The packaging ratio quantifies the degree of DNA compaction at each level of chromatin organization:

$$\text{Packaging Ratio} = \frac{\text{Length of DNA}}{\text{Length of structure containing it}}$$

Approximate values at each level:

  • Nucleosome (10 nm fiber): ~6-7 fold compaction
  • 30 nm fiber: ~40 fold compaction
  • Looped domains (300 nm): ~680 fold compaction
  • Metaphase chromosome: ~10,000 fold compaction
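The first two values in the list can be checked with the packaging ratio formula. The sketch below assumes a rise of 0.34 nm per base pair for B-form DNA, one 200 bp repeat per ~10 nm of the beads-on-a-string fiber, and a solenoid pitch of ~11 nm with six nucleosomes per turn. These geometric figures are standard textbook approximations, not values given in this chapter.

```python
RISE_PER_BP_NM = 0.34  # assumed axial rise per base pair of B-form DNA


def contour_length_nm(base_pairs: int) -> float:
    """Length of naked, fully extended B-form DNA."""
    return base_pairs * RISE_PER_BP_NM


def packaging_ratio(dna_length_nm: float, structure_length_nm: float) -> float:
    """Length of DNA divided by the length of the structure containing it."""
    return dna_length_nm / structure_length_nm


# 10 nm fiber: one 200 bp nucleosome repeat occupies ~10 nm of fiber (assumed)
ten_nm_fiber = packaging_ratio(contour_length_nm(200), 10.0)

# 30 nm solenoid: six 200 bp repeats per turn, assumed pitch of ~11 nm
thirty_nm_fiber = packaging_ratio(contour_length_nm(6 * 200), 11.0)

print(f"10 nm fiber: about {ten_nm_fiber:.1f}-fold")     # about 6.8
print(f"30 nm fiber: about {thirty_nm_fiber:.1f}-fold")  # about 37.1
```

Both results fall within the ranges listed above, which suggests those figures come from this kind of geometric estimate.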

For the entire human genome, the overall compaction can be estimated by comparing the length of naked B-form DNA in a diploid cell (~2 meters stretched end-to-end, i.e. $2 \times 10^6\,\mu$m) with the ~6 $\mu$m diameter of the nucleus that contains it:

$$\text{Overall Ratio} = \frac{2 \times 10^6\,\mu\text{m}}{6\,\mu\text{m}} \approx 3.3 \times 10^5$$

Euchromatin, Heterochromatin & Chromosome Territories

Interphase chromatin exists in two cytologically distinguishable states: euchromatin and heterochromatin. This distinction has profound implications for gene expression and genome regulation.

Euchromatin

  • Less condensed, more open conformation
  • Transcriptionally active or poised for activation
  • Enriched in acetylated histones (H3K9ac, H3K27ac)
  • Associated with trimethylated H3K4 (H3K4me3)
  • Higher gene density regions
  • Replicates early in S phase
  • GC-rich isochores (in mammals)

Heterochromatin

  • Highly condensed, compact conformation
  • Generally transcriptionally silent
  • Enriched in methylated histones (H3K9me3, H3K27me3)
  • Associated with HP1 (Heterochromatin Protein 1)
  • Lower gene density; repeat-rich
  • Replicates late in S phase
  • AT-rich sequences predominate

Heterochromatin is further subdivided into constitutive heterochromatin, which remains permanently condensed in all cell types (found at centromeres, telomeres, and pericentromeric regions, composed largely of satellite DNA), and facultative heterochromatin, which can interconvert with euchromatin depending on developmental stage, cell type, or environmental signals. The classic example of facultative heterochromatin is the Barr body -- the inactivated X chromosome in mammalian females, silenced by the long non-coding RNA XIST.

Chromosome Territories

Fluorescence in situ hybridization (FISH) and chromosome conformation capture (Hi-C) experiments have revealed that each chromosome occupies a distinct, largely non-overlapping region within the interphase nucleus called a chromosome territory. Gene-rich chromosomes tend to localize toward the nuclear interior, while gene-poor chromosomes are positioned near the nuclear periphery (associated with the nuclear lamina).

Within chromosome territories, chromatin is organized into topologically associating domains (TADs) -- megabase-scale regions where DNA sequences interact more frequently with each other than with sequences outside the domain. TADs are bounded by insulator elements often occupied by the architectural protein CTCF and the cohesin complex, which mediate loop extrusion to establish domain boundaries. TAD disruption by structural variants can lead to gene misregulation and developmental disorders.

Key Concept: A/B Compartments

Hi-C data further subdivides the genome into A compartments (active, euchromatic, gene-rich, early-replicating) and B compartments (inactive, heterochromatic, gene-poor, late-replicating). These compartments can switch between states during differentiation, reflecting the dynamic nature of three-dimensional genome organization.

Repetitive Elements & Non-Coding DNA

A striking feature of eukaryotic genomes is that the majority of DNA does not encode proteins. In the human genome, only approximately 1.5% of the 3.2 billion base pairs code for proteins. The remaining ~98.5% includes regulatory sequences, introns, non-coding RNAs, and vast stretches of repetitive DNA. Much of this was historically (and misleadingly) labeled "junk DNA," though we now understand that many non-coding regions serve important regulatory and structural roles.

Transposable Elements

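Transposable elements (TEs) constitute approximately 45% of the human genome. These mobile genetic elements fall into two major classes: Class I retrotransposons (LINEs, SINEs, and LTR retrotransposons), which move by copy-and-paste through an RNA intermediate, and Class II DNA transposons, which move by cut-and-paste: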

| Element Type | Mechanism | % of Human Genome | Key Features |
|---|---|---|---|
| LINEs (Long Interspersed Nuclear Elements) | Copy-and-paste (retrotransposition) | ~21% | LINE-1 (L1) is ~6 kb; autonomous; encodes ORF1p and ORF2p (with RT and endonuclease) |
| SINEs (Short Interspersed Nuclear Elements) | Copy-and-paste (non-autonomous) | ~13% | Alu elements (~300 bp); depend on LINE machinery for retrotransposition |
| LTR Retrotransposons | Copy-and-paste (via RNA intermediate) | ~8% | Endogenous retroviruses (HERVs); flanked by long terminal repeats |
| DNA Transposons | Cut-and-paste (via transposase) | ~3% | Flanked by terminal inverted repeats (TIRs); most are now inactive fossils |
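The percentages in the table add up to the ~45% total quoted for transposable elements. A short sketch converting them into approximate megabases, assuming a 3,200 Mb genome (variable and function names are illustrative):

```python
HUMAN_GENOME_MB = 3_200

# Approximate share of the human genome, in percent, from the table above
TE_PERCENT = {
    "LINEs": 21,
    "SINEs": 13,
    "LTR retrotransposons": 8,
    "DNA transposons": 3,
}

# Elements that move through an RNA intermediate (copy-and-paste)
RETROTRANSPOSONS = ("LINEs", "SINEs", "LTR retrotransposons")


def te_megabases(genome_mb: float = HUMAN_GENOME_MB) -> dict:
    """Approximate Mb of sequence contributed by each element type."""
    return {name: genome_mb * pct / 100 for name, pct in TE_PERCENT.items()}


total_percent = sum(TE_PERCENT.values())
retro_percent = sum(TE_PERCENT[name] for name in RETROTRANSPOSONS)

print(f"All TEs: {total_percent}% of the genome")
print(f"Retrotransposons alone: {retro_percent}%")
for name, mb in te_megabases().items():
    print(f"{name}: about {mb:.0f} Mb")
```

Nearly all of the TE-derived sequence comes from copy-and-paste elements, with DNA transposons a small remainder.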

Tandem Repeats & Satellite DNA

In addition to interspersed repeats, genomes contain tandem repeats -- sequences repeated head-to-tail at a single locus. These include:

  • Satellite DNA: Very large arrays (up to megabases) of short repeat units (5-200 bp), found primarily at centromeres and pericentromeric regions. Alpha-satellite DNA (171 bp monomer) is the major component of human centromeres.
  • Minisatellites (VNTRs): Repeat units of 10-60 bp, often found near telomeres. Used in early DNA fingerprinting methods.
  • Microsatellites (STRs/SSRs): Very short repeat units (1-6 bp), e.g., (CA)$_n$ or (CAG)$_n$. Widely used as genetic markers due to their high polymorphism. Expansion of trinucleotide repeats causes diseases such as Huntington's disease and Fragile X syndrome.
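Perfect microsatellite tracts of the kind described above can be located with a backreference regular expression. This is a simplified sketch rather than a production repeat finder: it reports only uninterrupted repeats, and the `min_repeats` threshold of 4 is an arbitrary choice for illustration.

```python
import re


def find_microsatellites(seq: str, min_repeats: int = 4, max_unit: int = 6):
    """Return (start, unit, copies) for each perfect tandem repeat.

    The lazy quantifier makes the regex try the shortest repeat unit first,
    so (CA)n is reported with unit 'CA' rather than 'CACA'.
    """
    pattern = re.compile(
        rf"([ACGT]{{1,{max_unit}}}?)\1{{{min_repeats - 1},}}"
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


print(find_microsatellites("TTCAGCAGCAGCAGTT"))  # [(2, 'CAG', 4)]
print(find_microsatellites("GGCACACACACAGG"))    # [(2, 'CA', 5)]
```

Real tools also tolerate mismatches and indels within a tract, which this pattern deliberately does not attempt.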

CpG Islands

CpG islands are regions of at least 200 bp with a GC content greater than 50% and an observed-to-expected CpG ratio greater than 0.6. While CpG dinucleotides are generally underrepresented in the genome (due to spontaneous deamination of 5-methylcytosine to thymine), CpG islands are exceptions. Approximately 70% of human gene promoters are associated with CpG islands. Their methylation status is a critical epigenetic mark:

  • Unmethylated CpG islands are generally associated with active transcription
  • Methylated CpG islands recruit methyl-CpG-binding domain (MBD) proteins that promote chromatin compaction and gene silencing
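The three criteria in the CpG island definition (length, GC content, observed-to-expected CpG ratio) can be checked directly on a sequence. The sketch below assumes the commonly used formula O/E = (number of CpG × N) / (number of C × number of G), where N is the sequence length; the function names are illustrative.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed-to-expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible without both bases
    return seq.count("CG") * len(seq) / (c_count * g_count)


def is_cpg_island(seq: str, min_len: int = 200,
                  min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the length, GC-content, and O/E thresholds from the definition."""
    return (len(seq) >= min_len
            and gc_content(seq) > min_gc
            and cpg_obs_exp(seq) > min_ratio)


print(is_cpg_island("CG" * 100))  # True: 200 bp, all G/C, O/E = 2.0
print(is_cpg_island("AT" * 100))  # False: no G or C at all
```

In practice these tests are applied in a sliding window along a chromosome rather than to a single pre-cut sequence.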

The Human Genome Project & Key Statistics

The Human Genome Project (HGP) was an international collaborative effort launched in 1990 and completed in 2003, with the goal of determining the complete nucleotide sequence of the human genome and identifying all human genes. It remains one of the most ambitious and transformative undertakings in the history of biology.

Human Genome: Key Statistics

  • Genome size: ~3.2 billion base pairs (3,200 Mb)
  • Chromosomes: 22 autosomes + X/Y = 24 distinct
  • Protein-coding genes: ~20,000-25,000
  • Coding sequences: ~1.5% of genome
  • Introns: ~25% of genome
  • Average gene size: ~27 kb (including introns)
  • Repetitive DNA: ~45% transposable elements
  • GC content: ~41%
  • Average exon size: ~170 bp
  • Average intron size: ~6,000 bp
  • SNP frequency: ~1 per 300 bp across the population (~1 per 1,000 bp between any two individuals)
  • CpG islands: ~29,000

One of the most surprising findings of the HGP was the relatively low number of protein-coding genes -- approximately 20,000, far fewer than the 100,000+ originally predicted. This number is comparable to organisms like Caenorhabditis elegans (~19,000 genes) and Drosophila melanogaster (~14,000 genes). The biological complexity of humans thus arises not simply from gene count but from:

  • Alternative splicing: A single gene can produce multiple mRNA variants and protein isoforms. The average human gene has ~8.8 exons, and it is estimated that over 95% of multi-exon genes undergo alternative splicing.
  • Post-translational modifications: Phosphorylation, glycosylation, ubiquitination, and other modifications vastly expand the functional proteome.
  • Non-coding RNAs: miRNAs, lncRNAs, and other regulatory RNAs modulate gene expression at multiple levels.
  • Combinatorial regulation: Complex regulatory networks involving transcription factors, enhancers, and three-dimensional genome topology.

Centromeres, Telomeres & Chromosomal Architecture

Beyond local chromatin organization, chromosomes possess specialized structural elements that are essential for faithful DNA replication, segregation during cell division, and protection of chromosomal ends.

Centromeres

The centromere is the chromosomal region where the kinetochore assembles -- the protein complex that mediates attachment to spindle microtubules during mitosis and meiosis. In humans, centromeres span 0.3-5 Mb and are composed primarily of alpha-satellite DNA (171 bp AT-rich monomeric repeats organized into higher-order repeats). The centromere-specific histone variant CENP-A (replacing H3 in centromeric nucleosomes) serves as the epigenetic mark that defines centromere identity -- centromere positioning is determined epigenetically, not solely by DNA sequence (demonstrated by the existence of neocentromeres).

Telomeres

Telomeres are the protective caps at the ends of linear chromosomes. In humans, telomeric DNA consists of tandem repeats of the hexanucleotide sequence 5'-TTAGGG-3', extending for 5-15 kb. The 3' end features a single-stranded G-rich overhang of 100-200 nucleotides that loops back and invades the double-stranded telomeric region to form the T-loop, creating a protective structure along with the shelterin protein complex (TRF1, TRF2, POT1, TIN2, TPP1, RAP1).

The End Replication Problem & Telomerase

Because DNA polymerases require an RNA primer and synthesize DNA only in the 5'-to-3' direction, the lagging strand cannot be completed opposite the very 3' end of its template: once the terminal primer is removed, the resulting gap cannot be filled. As a result, chromosomes shorten by approximately 50-200 bp per cell division. This is the end replication problem. The enzyme telomerase, a ribonucleoprotein reverse transcriptase (with catalytic subunit TERT and RNA template TERC), counteracts this by adding telomeric repeats to the 3' end:

$$(\text{TTAGGG})_n + \text{dNTPs} \xrightarrow{\text{Telomerase}} (\text{TTAGGG})_{n+k}$$

Telomerase is active in stem cells, germ cells, and most cancer cells, but is silenced in most somatic cells, contributing to cellular senescence and the Hayflick limit (~50-70 divisions).
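The link between per-division shortening and a finite division limit can be sketched with simple arithmetic. The critical telomere length used below (5 kb) is a hypothetical threshold chosen only for illustration; the chapter gives the starting length (5-15 kb) and the loss per division (50-200 bp), not the point at which senescence is triggered.

```python
def divisions_until_critical(initial_bp: int, critical_bp: int,
                             loss_per_division_bp: int) -> int:
    """Whole divisions before telomere length drops below critical_bp,
    assuming constant loss per division and no telomerase activity."""
    if loss_per_division_bp <= 0:
        raise ValueError("loss per division must be positive")
    if initial_bp <= critical_bp:
        return 0
    return (initial_bp - critical_bp) // loss_per_division_bp


# Hypothetical scenario: 10 kb telomeres, senescence assumed at 5 kb
for loss in (50, 100, 200):
    n = divisions_until_critical(10_000, 5_000, loss)
    print(f"loss of {loss} bp per division: {n} divisions")
```

With these assumed inputs the result spans 25 to 100 divisions, the same order of magnitude as the Hayflick limit quoted above.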

Origins of Replication

Eukaryotic chromosomes contain multiple origins of replication (ORIs) to ensure that the entire genome can be duplicated within the time constraints of S phase. The human genome contains an estimated 30,000-50,000 replication origins, spaced on average every 50-100 kb. Each origin initiates a bidirectional replication fork, creating a replicon. Origins fire in a temporally regulated program, with euchromatic regions replicating early and heterochromatic regions replicating late.
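The quoted spacing follows from dividing the genome size by the number of origins. A quick check, assuming the 3.2 billion bp haploid genome size used elsewhere in this chapter (the function name is illustrative):

```python
HUMAN_GENOME_BP = 3_200_000_000  # haploid genome size


def mean_origin_spacing_kb(genome_bp: int, n_origins: int) -> float:
    """Average distance between neighboring origins, in kilobases."""
    return genome_bp / n_origins / 1_000


for n_origins in (30_000, 50_000):
    spacing = mean_origin_spacing_kb(HUMAN_GENOME_BP, n_origins)
    print(f"{n_origins} origins: about {spacing:.0f} kb apart")
```

The two bounds give roughly 64 kb and 107 kb, broadly in line with the 50-100 kb range stated above.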

Key Equation: DNA Content in the Cell Cycle

The DNA content per cell varies through the cell cycle, expressed in terms of C (the haploid genome mass):

$$\text{G}_1: 2C \quad \xrightarrow{\text{S phase}} \quad \text{G}_2: 4C \quad \xrightarrow{\text{M phase}} \quad \text{G}_1: 2C$$

Flow cytometry using DNA-binding fluorescent dyes (e.g., propidium iodide) can distinguish cell cycle phases based on the characteristic 2C and 4C peaks.
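The gating logic behind such a measurement can be sketched as a simple classifier on DNA content expressed in units of C. The tolerance of 0.2 C around each peak is a hypothetical value for illustration; real analyses fit the peak shapes instead of using fixed cutoffs.

```python
def classify_phase(dna_content_c: float, tolerance: float = 0.2) -> str:
    """Assign a cell-cycle phase from DNA content given in units of C.

    Cells near 2C are G0/G1, cells near 4C are G2/M, and cells with
    intermediate content are in S phase. Anything else is left unassigned.
    """
    if abs(dna_content_c - 2.0) <= tolerance:
        return "G0/G1"
    if abs(dna_content_c - 4.0) <= tolerance:
        return "G2/M"
    if 2.0 + tolerance < dna_content_c < 4.0 - tolerance:
        return "S"
    return "unassigned"


for content in (2.0, 2.9, 4.1, 1.0):
    print(f"{content:.1f}C -> {classify_phase(content)}")
```

DNA content alone cannot separate G0 from G1 or G2 from M, which is why those pairs share a label here.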