Part II: Transcriptomics | Chapter 5

Gene Expression & the Transcriptome

From DNA to RNA: understanding the complete landscape of gene expression, transcript diversity, and quantification

1. The Central Dogma & the Transcriptome

The central dogma of molecular biology, first articulated by Francis Crick in 1958, describes the flow of genetic information within a biological system: DNA is transcribed into RNA, and RNA is translated into protein. While this framework has been refined over the decades, notably with the discovery of reverse transcriptase and RNA-dependent RNA polymerases, it remains the foundational paradigm for understanding gene expression.

Definition: The Transcriptome

The transcriptome is the complete set of all RNA molecules (transcripts) produced by the genome of a cell, tissue, or organism at a specific time point and under specific conditions. Unlike the genome, which is relatively static, the transcriptome is highly dynamic and varies across cell types, developmental stages, environmental conditions, and disease states. It includes messenger RNAs (mRNAs), ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), and a diverse array of non-coding RNAs.

In eukaryotes, protein-coding genes are transcribed by RNA polymerase II (Pol II). The process involves three stages: initiation, where Pol II is recruited to the promoter through interactions with general transcription factors (GTFs) such as TFIID and TFIIB, assisted by the Mediator coactivator complex; elongation, where the enzyme synthesizes the nascent RNA chain in the 5' to 3' direction; and termination, where the polymerase dissociates from the template and the transcript is released.

Eukaryotic transcription differs fundamentally from prokaryotic transcription in its complexity. In eukaryotes, chromatin structure imposes an additional layer of regulation: DNA must be accessible for transcription factors to bind. Nucleosome remodeling complexes such as SWI/SNF and histone-modifying enzymes play critical roles in modulating this accessibility. Additionally, the spatial organization of chromatin within the nucleus, including topologically associating domains (TADs) and chromatin loops, influences which genes are expressed.

The rate of transcription can be modeled as: $$R_{txn} = k_{on} \cdot [TF] \cdot P_{open} \cdot k_{elong}$$ where \(k_{on}\) is the rate constant for transcription factor (TF) binding, \([TF]\) is the concentration of the relevant transcription factor, \(P_{open}\) is the probability that the chromatin is in an open, accessible state, and \(k_{elong}\) is the elongation rate constant.
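The multiplicative form of this model can be explored numerically. The following is a minimal sketch: the function name and all parameter values are hypothetical and in arbitrary units, chosen only to show how chromatin accessibility scales the rate.

```python
def transcription_rate(k_on: float, tf_conc: float, p_open: float, k_elong: float) -> float:
    """Evaluate R_txn = k_on * [TF] * P_open * k_elong."""
    if not 0.0 <= p_open <= 1.0:
        raise ValueError("p_open is a probability and must lie in [0, 1]")
    return k_on * tf_conc * p_open * k_elong


# Hypothetical values: same TF level and kinetics, different chromatin accessibility.
mostly_closed = transcription_rate(k_on=0.5, tf_conc=2.0, p_open=0.1, k_elong=4.0)
mostly_open = transcription_rate(k_on=0.5, tf_conc=2.0, p_open=0.8, k_elong=4.0)

print(f"closed chromatin: {mostly_closed:.2f}")
print(f"open chromatin:   {mostly_open:.2f}")
print(f"fold change:      {mostly_open / mostly_closed:.1f}")
```

Because every term enters as a product, an eight-fold increase in \(P_{open}\) gives an eight-fold increase in the modeled rate, and a value of zero for any single factor silences the gene entirely.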

2. mRNA Processing: From Pre-mRNA to Mature Transcript

In eukaryotes, the primary transcript (pre-mRNA) undergoes three major co-transcriptional and post-transcriptional processing events before it can be exported to the cytoplasm and translated. These modifications are essential for mRNA stability, nuclear export, and efficient translation.

5' Capping

The 5' cap is added co-transcriptionally when the nascent transcript is approximately 20-30 nucleotides long. The capping process involves three enzymatic steps: (1) RNA triphosphatase removes the terminal (gamma) phosphate from the 5' end, (2) guanylyltransferase adds a GMP residue via a 5'-5' triphosphate linkage, and (3) methyltransferase adds a methyl group to the N-7 position of guanine, forming the m7G cap (Cap 0). Additional methylations at the 2'-O position of the first and second ribose sugars yield Cap 1 and Cap 2 structures, respectively. The cap is recognized by the cap-binding complex (CBC) in the nucleus and by eIF4E during translation initiation, and it protects the mRNA from 5'-to-3' exonuclease degradation.

Splicing

Pre-mRNA splicing removes intronic sequences and joins exons to form the mature mRNA. This process is catalyzed by the spliceosome, a large ribonucleoprotein complex composed of five small nuclear ribonucleoproteins (snRNPs: U1, U2, U4, U5, and U6) and numerous associated proteins. Splicing occurs through two sequential transesterification reactions. In the first step, the 2'-OH of a conserved branch-point adenosine attacks the 5' splice site, forming a lariat intermediate. In the second step, the 3'-OH of the upstream exon attacks the 3' splice site, joining the two exons and releasing the lariat intron.
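The net result of splicing (introns out, exons joined) can be mimicked on a sequence string. This sketch is only an in-silico analogy, not a model of spliceosome chemistry; the function names, the toy sequences, and the 0-based half-open coordinate convention are assumptions made for illustration.

```python
def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> str:
    """Remove introns given as 0-based, half-open (start, end) intervals."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        if start < position or end > len(pre_mrna):
            raise ValueError("introns must be non-overlapping and inside the transcript")
        exons.append(pre_mrna[position:start])  # exon segment before this intron
        position = end
    exons.append(pre_mrna[position:])  # final exon
    return "".join(exons)


def has_canonical_ends(pre_mrna: str, intron: tuple[int, int]) -> bool:
    """Check that an intron starts with GU and ends with AG."""
    start, end = intron
    segment = pre_mrna[start:end]
    return segment.startswith("GU") and segment.endswith("AG")


# Toy pre-mRNA: two exons flanking one intron (made-up sequences).
exon1, intron_seq, exon2 = "AUGGCAG", "GUAAGUCCUUACUAACUUUUUCCCAG", "GAUCUAA"
pre_mrna = exon1 + intron_seq + exon2
intron = (len(exon1), len(exon1) + len(intron_seq))

print(has_canonical_ends(pre_mrna, intron))
print(splice(pre_mrna, [intron]))
```

Passing several intervals removes several introns in one call, which is how a multi-exon transcript would be assembled from an annotation.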

Consensus Splice Site Sequences

| Element | Consensus Sequence (/ marks the splice junction) | Function |
|---|---|---|
| 5' splice site (donor) | AG/GURAGU | Recognized by U1 snRNP |
| Branch point | YNYURAY | Nucleophile for first transesterification |
| Polypyrimidine tract | Y(n) | Bound by U2AF65 |
| 3' splice site (acceptor) | YAG/G | Recognized by U2AF35 |
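Consensus sequences written with IUPAC ambiguity codes (R = purine, Y = pyrimidine, N = any base) translate directly into regular expressions. The sketch below assumes exact matching to the consensus, which is stricter than biology: real splice sites often deviate from it, so this is an illustration rather than a splice-site predictor. The helper names and the toy sequence are hypothetical.

```python
import re

# IUPAC codes used in the consensus table, restricted to the RNA alphabet.
IUPAC_RNA = {"A": "A", "C": "C", "G": "G", "U": "U",
             "R": "[AG]", "Y": "[CU]", "N": "[ACGU]"}


def consensus_to_regex(consensus: str) -> re.Pattern:
    """Translate an IUPAC consensus string into a compiled regular expression."""
    return re.compile("".join(IUPAC_RNA[symbol] for symbol in consensus))


DONOR = consensus_to_regex("AGGURAGU")       # 5' splice site; exon ends after the leading AG
BRANCH_POINT = consensus_to_regex("YNYURAY")


def find_donor_junctions(rna: str) -> list[int]:
    """Return the 0-based index of the first intronic base for each donor match."""
    return [match.start() + 2 for match in DONOR.finditer(rna)]


toy = "CCAGGUAAGUCCUACUAACGG"  # made-up sequence containing one donor and one branch point
print(find_donor_junctions(toy))
print([match.start() for match in BRANCH_POINT.finditer(toy)])
```

A position weight matrix would score near-matches instead of rejecting them, which is closer to how degenerate sites are usually handled.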

3' Polyadenylation

The 3' end of most eukaryotic mRNAs is formed by endonucleolytic cleavage followed by the addition of a poly(A) tail of approximately 200-250 adenosine residues. The polyadenylation signal (AAUAAA) is recognized by the cleavage and polyadenylation specificity factor (CPSF), while the downstream GU-rich element is bound by the cleavage stimulation factor (CstF). Cleavage occurs 10-30 nucleotides downstream of the AAUAAA signal, and poly(A) polymerase (PAP) then adds the poly(A) tail in a template-independent manner. The poly(A) tail, bound by poly(A)-binding proteins (PABPs), promotes mRNA stability, nuclear export, and translation efficiency.
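The positional rule described above (cleavage 10-30 nucleotides downstream of AAUAAA) can be sketched as a simple scan. The assumptions here are that distances are measured from the last base of the signal, that windows are reported as 0-based half-open intervals, and that only the canonical hexamer is searched; the function name and the toy 3' UTR are hypothetical.

```python
def predicted_cleavage_windows(rna: str, signal: str = "AAUAAA",
                               min_offset: int = 10, max_offset: int = 30):
    """Return (signal_start, window_start, window_end) for every signal occurrence.

    The window is clipped to the end of the transcript and skipped if empty.
    """
    windows = []
    start = rna.find(signal)
    while start != -1:
        signal_end = start + len(signal)
        window_start = signal_end + min_offset
        window_end = min(signal_end + max_offset, len(rna))
        if window_start < window_end:
            windows.append((start, window_start, window_end))
        start = rna.find(signal, start + 1)  # allow overlapping occurrences
    return windows


# Toy 3' UTR: filler, the poly(A) signal, a spacer, a GU-rich element, more filler.
utr = "GC" * 10 + "AAUAAA" + "C" * 15 + "GUGUGUGU" + "C" * 20
print(predicted_cleavage_windows(utr))
```

In this toy sequence the GU-rich stretch sits just downstream of the reported window start, mirroring the arrangement of the CPSF and CstF binding sites around the cleavage site.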

3. Alternative Splicing & Transcript Isoforms

Alternative splicing is a major source of transcriptome and proteome diversity in eukaryotes. It is estimated that over 95% of multi-exon human genes undergo alternative splicing, generating multiple transcript isoforms from a single gene. This mechanism enables cells to produce proteins with distinct structural and functional properties from a limited number of genes, vastly expanding the coding capacity of the genome.

Major Types of Alternative Splicing Events

Exon Skipping (Cassette Exon)

The most common type in vertebrates. An entire exon is either included in or excluded from the mature transcript. A related mode is mutually exclusive exon selection, in which exactly one exon from a cluster is included; the Drosophila Dscam gene can generate over 38,000 isoforms in this way.

Intron Retention

An intron remains in the mature mRNA rather than being excised. Most common in plants and fungi. Often leads to premature stop codons and mRNA degradation via nonsense-mediated decay (NMD), serving as a regulatory mechanism.

Alternative 5' Splice Site

Two or more splice sites are recognized at the 5' end of an intron, resulting in exons of different lengths. This alters the coding sequence at the 3' end of the upstream exon.

Alternative 3' Splice Site

Two or more splice sites are recognized at the 3' end of an intron, leading to variation at the 5' end of the downstream exon. Both alternative 5' and 3' splice sites can produce in-frame or out-of-frame variants.

Alternative splicing is regulated by a complex interplay of cis-acting elements and trans-acting factors. Cis-regulatory elements include exonic splicing enhancers (ESEs), exonic splicing silencers (ESSs), intronic splicing enhancers (ISEs), and intronic splicing silencers (ISSs). Trans-acting factors include SR proteins (e.g., SRSF1, SRSF2), which generally promote exon inclusion by binding to ESEs, and heterogeneous nuclear ribonucleoproteins (hnRNPs, e.g., hnRNP A1), which often antagonize SR protein function and promote exon skipping.

Aberrant alternative splicing is implicated in numerous human diseases. Mutations that disrupt splice sites or splicing regulatory elements can lead to exon skipping, intron retention, or activation of cryptic splice sites, resulting in dysfunctional proteins. Examples include spinal muscular atrophy (SMA), which is caused by loss of SMN1 and in which a C-to-T transition in exon 7 of the paralog SMN2 weakens an ESE, so that most SMN2 transcripts skip exon 7 and cannot fully compensate; and certain forms of beta-thalassemia caused by mutations in beta-globin splice sites.

4. Non-Coding RNAs: Beyond mRNA

The transcriptome encompasses far more than protein-coding mRNAs. Non-coding RNAs (ncRNAs) constitute the vast majority of transcribed sequences in complex organisms and play critical roles in gene regulation, chromatin remodeling, and cellular homeostasis. The ENCODE project reported that while only about 1.5% of the human genome encodes proteins, roughly three-quarters of it is covered by primary transcripts in at least one of the cell types surveyed, and about 80% shows some form of biochemical activity, underscoring the prevalence and potential importance of ncRNAs.

Classes of Non-Coding RNA

| ncRNA Class | Size | Function | Mechanism |
|---|---|---|---|
| miRNA | ~22 nt | Post-transcriptional gene silencing | Binds 3' UTR via RISC complex; mRNA degradation or translational repression |
| siRNA | 20-25 nt | RNA interference (RNAi) | Double-stranded precursor processed by Dicer; guides RISC to target mRNA for cleavage |
| lncRNA | >200 nt | Diverse regulatory roles | Chromatin remodeling (e.g., Xist), transcriptional regulation, scaffolding |
| rRNA | 120-5000 nt | Ribosome structural & catalytic component | Forms ribosome; catalyzes peptide bond formation (ribozyme activity) |
| tRNA | 76-90 nt | Amino acid delivery to ribosome | Anticodon pairs with mRNA codon; aminoacyl-tRNA synthetase charges tRNA |

MicroRNAs (miRNAs) are among the best-characterized small ncRNAs. They are transcribed as primary miRNAs (pri-miRNAs) by RNA Pol II, processed in the nucleus by the Drosha-DGCR8 complex into precursor miRNAs (pre-miRNAs), exported to the cytoplasm by Exportin-5, and further processed by Dicer into mature ~22 nt duplexes. The guide strand is loaded into the RNA-induced silencing complex (RISC), where Argonaute proteins mediate target recognition through base-pairing between the miRNA seed region (nucleotides 2-8 of the guide strand) and complementary sites, typically in the 3' UTR of target mRNAs. A single miRNA can regulate hundreds of target genes, and a single mRNA can be targeted by multiple miRNAs, creating complex regulatory networks.
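Seed-based target recognition can be sketched as a string search: take nucleotides 2-8 of the miRNA, reverse-complement them, and look for that heptamer in a 3' UTR. This is a deliberately simplified illustration that ignores wobble pairing, pairing outside the seed, and site context. The function names and the UTR fragment are hypothetical; the miRNA string is used as an illustrative let-7a-like input.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed(mirna: str, first: int = 2, last: int = 8) -> str:
    """Return the seed region using 1-based, inclusive nucleotide positions."""
    return mirna[first - 1:last]


def reverse_complement(rna: str) -> str:
    return rna.translate(COMPLEMENT)[::-1]


def seed_match_sites(mirna: str, utr: str) -> list[int]:
    """Return 0-based start positions in the UTR that perfectly pair with the seed."""
    site = reverse_complement(seed(mirna))
    return [i for i in range(len(utr) - len(site) + 1) if utr.startswith(site, i)]


mirna = "UGAGGUAGUAGGUUGUAUAGUU"   # illustrative guide strand, written 5' to 3'
utr = "AAGCUACCUCAGGGA"            # made-up 3' UTR fragment

print(seed(mirna))
print(reverse_complement(seed(mirna)))
print(seed_match_sites(mirna, utr))
```

Because the site is only seven nucleotides long, matches occur by chance in long UTRs, which is one reason a single miRNA can have hundreds of candidate targets.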

Long non-coding RNAs (lncRNAs) represent a diverse class with functions spanning transcriptional regulation (e.g., HOTAIR recruits PRC2 to silence HOX gene clusters), X-chromosome inactivation (Xist coats the inactive X chromosome), nuclear architecture (NEAT1 is essential for paraspeckle formation), and post-transcriptional regulation (MALAT1 modulates alternative splicing). The functional annotation of lncRNAs remains a major challenge in modern genomics.

5. Transcription Factors & Gene Expression Regulation

Transcription factors (TFs) are proteins that bind to specific DNA sequences to regulate gene transcription. The human genome encodes approximately 1,600 transcription factors, which can be classified by their DNA-binding domains into families including zinc finger proteins, helix-turn-helix proteins, leucine zippers, and helix-loop-helix proteins. Transcription factors interact with promoter and enhancer elements to activate or repress gene expression.

Promoters and Regulatory Elements

The core promoter, which spans the transcription start site (TSS), contains elements such as the TATA box (consensus TATAAA, positioned at approximately -30), the initiator element (Inr) at the TSS itself, and the downstream promoter element (DPE). Proximal promoter elements lie within a few hundred base pairs of the TSS, while distal regulatory elements, including enhancers and silencers, can be located hundreds of kilobases away and exert their effects through chromatin looping.

Layers of Gene Expression Regulation

1. Epigenetic Control: DNA methylation at CpG dinucleotides (catalyzed by DNMTs) typically represses transcription. Histone modifications including H3K4me3 (active promoters), H3K27me3 (polycomb-repressed), H3K36me3 (active gene bodies), and H3K9me3 (heterochromatin) define the chromatin state.
2. Enhancers: Distal regulatory elements that increase transcription rates by 10-1000 fold. Active enhancers are marked by H3K4me1 and H3K27ac. Super-enhancers are clusters of enhancers that drive expression of cell-identity genes and are often associated with oncogenes in cancer.
3. Silencers: Regulatory elements that recruit repressor proteins to inhibit transcription. Polycomb repressive complexes (PRC1, PRC2) catalyze H3K27 methylation and histone H2A ubiquitination to establish and maintain silenced chromatin states.
4. Post-transcriptional: mRNA stability (determined by 3' UTR elements, AU-rich elements), translational regulation (4E-BP, mTOR pathway), and RNA modifications (m6A methylation by METTL3/METTL14) provide additional regulatory layers.
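The histone-mark associations listed above amount to a small lookup table. The sketch below encodes only the pairings named in this section and treats each mark as simply present or absent, which is an assumption made for illustration; real chromatin-state annotation works from quantitative signal, and the function name is hypothetical.

```python
# Single-mark associations taken from the list above.
MARK_STATES = {
    "H3K4me3": "active promoter",
    "H3K27me3": "Polycomb-repressed",
    "H3K36me3": "active gene body",
    "H3K9me3": "heterochromatin",
}

# Active enhancers are described by the co-occurrence of two marks.
ACTIVE_ENHANCER_MARKS = {"H3K4me1", "H3K27ac"}


def annotate_region(marks: set[str]) -> list[str]:
    """Return the chromatin-state labels suggested by a set of histone marks."""
    labels = [state for mark, state in MARK_STATES.items() if mark in marks]
    if ACTIVE_ENHANCER_MARKS <= marks:
        labels.append("active enhancer")
    return labels or ["unclassified"]


print(annotate_region({"H3K4me3"}))
print(annotate_region({"H3K4me1", "H3K27ac"}))
print(annotate_region({"H3K4me1"}))
```

A region carrying H3K4me1 alone is left unclassified here because the text only characterizes the combination with H3K27ac.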

Housekeeping vs. Tissue-Specific Genes

Housekeeping genes are expressed constitutively across virtually all cell types and are essential for basic cellular maintenance. They typically have CpG island-associated promoters, broad H3K4me3 domains, and are often located in regions of open chromatin. Examples include GAPDH, beta-actin, and ribosomal protein genes. In contrast, tissue-specific genes are expressed in restricted cell populations and are regulated by lineage-determining transcription factors. For instance, MyoD drives skeletal muscle gene expression, PAX6 controls eye development genes, and GATA1 regulates erythroid differentiation. The interplay between pioneer transcription factors, which can bind nucleosomal DNA, and tissue-specific TFs determines cell fate decisions during development.

6. Expression Quantification Units

Accurate quantification of gene expression requires normalization to account for systematic biases introduced during library preparation and sequencing. The three most widely used normalization metrics, RPKM, FPKM, and TPM, each address different aspects of the quantification problem. Understanding the mathematical foundations and practical implications of each metric is essential for proper interpretation of transcriptomic data.

RPKM (Reads Per Kilobase per Million mapped reads)

RPKM was introduced for single-end RNA-Seq data and normalizes for both sequencing depth and gene length. It measures the number of reads mapping to a gene, normalized first by the total number of mapped reads (in millions) and then by the gene length (in kilobases).

$$\text{RPKM}_i = \frac{X_i}{\left(\frac{N}{10^6}\right) \cdot \left(\frac{L_i}{10^3}\right)} = \frac{X_i \cdot 10^9}{N \cdot L_i}$$ where \(X_i\) is the number of reads mapped to gene \(i\), \(N\) is the total number of mapped reads in the sample, and \(L_i\) is the length of gene \(i\) in base pairs.
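As a worked check of the formula, the following sketch computes RPKM for a single gene both in the compact form and in the two-step form (per million reads, then per kilobase). The function name and the numbers are hypothetical.

```python
def rpkm(reads_on_gene: float, gene_length_bp: float, total_mapped_reads: float) -> float:
    """RPKM_i = X_i * 1e9 / (N * L_i)."""
    return reads_on_gene * 1e9 / (total_mapped_reads * gene_length_bp)


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 20 million mapped reads.
x_i, l_i, n = 500, 2_000, 20_000_000

compact = rpkm(x_i, l_i, n)
two_step = x_i / ((n / 1e6) * (l_i / 1e3))  # 500 / (20 million-read units * 2 kb)

print(compact, two_step)
```

Both routes give 12.5, confirming that the factor of \(10^9\) is just the product of the per-million and per-kilobase scalings.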

FPKM (Fragments Per Kilobase per Million mapped fragments)

FPKM is the paired-end sequencing analogue of RPKM. In paired-end sequencing, two reads correspond to a single cDNA fragment. FPKM counts fragments rather than individual reads to avoid double-counting.

$$\text{FPKM}_i = \frac{F_i \cdot 10^9}{N_f \cdot L_i}$$ where \(F_i\) is the number of fragments mapped to gene \(i\), \(N_f\) is the total number of mapped fragments, and \(L_i\) is the effective gene length in base pairs. For perfectly concordant paired-end reads, \(F_i\) equals the number of read pairs mapping to the gene.
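The difference between counting reads and counting fragments can be made concrete by collapsing mates that share a fragment identifier. This is a minimal sketch: the input format of (fragment ID, gene) tuples is a hypothetical simplification of alignment records, and the total fragment count is taken to be the sum over the listed genes.

```python
from collections import defaultdict


def fragments_per_gene(alignments):
    """Count distinct fragments per gene from (fragment_id, gene) pairs, one per aligned mate."""
    fragments = defaultdict(set)
    for fragment_id, gene in alignments:
        fragments[gene].add(fragment_id)  # both mates of a pair collapse to one entry
    return {gene: len(ids) for gene, ids in fragments.items()}


def fpkm(fragment_counts, lengths_bp):
    """FPKM_i = F_i * 1e9 / (N_f * L_i), with N_f taken as the sum of the given counts."""
    total_fragments = sum(fragment_counts.values())
    return {gene: count * 1e9 / (total_fragments * lengths_bp[gene])
            for gene, count in fragment_counts.items()}


# Toy alignments: f1 and f3 have both mates aligned, f2 has only one.
alignments = [("f1", "geneA"), ("f1", "geneA"), ("f2", "geneA"),
              ("f3", "geneB"), ("f3", "geneB")]
counts = fragments_per_gene(alignments)
print(counts)
print(fpkm(counts, {"geneA": 1000, "geneB": 500}))
```

Counting reads instead would give geneA three counts and geneB two, so the ratio between genes would depend on how many mates happened to align rather than on how many cDNA fragments were sequenced.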

TPM (Transcripts Per Million)

TPM has become the preferred normalization method because, unlike RPKM/FPKM, the sum of all TPM values is the same across samples (always equal to 10^6), making cross-sample comparisons more straightforward. TPM first normalizes by gene length, then normalizes by sequencing depth.

First, compute the rate for each gene: $$r_i = \frac{X_i}{L_i}$$ Then normalize so that all rates sum to one million: $$\text{TPM}_i = \frac{r_i}{\sum_{j} r_j} \times 10^6 = \frac{\frac{X_i}{L_i}}{\sum_{j} \frac{X_j}{L_j}} \times 10^6$$ The key advantage is that \(\sum_i \text{TPM}_i = 10^6\) for every sample, meaning TPM values can be interpreted as the relative molar concentration of each transcript in the pool.
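The two-step TPM calculation is short enough to write out directly. In this sketch the gene names, counts, and lengths are made up, and the function name is hypothetical; it follows the formula above literally rather than reproducing any particular quantification tool.

```python
def tpm(counts: dict[str, float], lengths_bp: dict[str, float]) -> dict[str, float]:
    """Compute TPM: length-normalize first (r_i = X_i / L_i), then scale rates to sum to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

shallow = tpm(counts, lengths)
deep = tpm({gene: 2 * c for gene, c in counts.items()}, lengths)  # same sample, twice the depth

print(shallow)
print(sum(shallow.values()))
```

geneA and geneB receive the same TPM even though geneB has three times as many reads, because it is also three times as long. Doubling every count leaves the TPM values unchanged, since the depth factor cancels in the ratio.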

Comparison of Normalization Units

| Property | RPKM | FPKM | TPM |
|---|---|---|---|
| Read type | Single-end | Paired-end | Either |
| Length normalization | Yes | Yes | Yes (first step) |
| Depth normalization | Yes | Yes | Yes (second step) |
| Consistent sum across samples | No | No | Yes (= 10^6) |
| Cross-sample comparability | Limited | Limited | Improved |

Key Concept: Why TPM Is Preferred

The fundamental issue with RPKM and FPKM is that the normalization factor (total mapped reads) can differ between samples due to compositional differences, not just sequencing depth. If a small number of highly expressed genes consume a disproportionate fraction of reads in one sample, all other genes will appear to have lower RPKM/FPKM values even if their true expression has not changed. TPM partially addresses this by ensuring the sum is constant, though for rigorous differential expression analysis, model-based normalization methods (e.g., DESeq2 size factors, TMM normalization) are preferred because they explicitly account for compositional biases.
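The compositional effect described above can be reproduced with a toy calculation. The sketch assumes an idealized sequencer in which expected read counts are proportional to transcript abundance times length at a fixed depth; gene names, abundances, and function names are hypothetical, and no model-based normalization (DESeq2 or TMM) is implemented here.

```python
def expected_counts(abundance, lengths_bp, depth):
    """Expected reads per gene when reads are sampled in proportion to abundance x length."""
    weights = {gene: abundance[gene] * lengths_bp[gene] for gene in abundance}
    total_weight = sum(weights.values())
    return {gene: depth * weight / total_weight for gene, weight in weights.items()}


def rpkm(counts, lengths_bp):
    total_reads = sum(counts.values())
    return {gene: count * 1e9 / (total_reads * lengths_bp[gene])
            for gene, count in counts.items()}


lengths = {"geneA": 1000, "geneB": 2000, "geneX": 1000}
control = {"geneA": 10.0, "geneB": 10.0, "geneX": 10.0}
treated = {"geneA": 10.0, "geneB": 10.0, "geneX": 100.0}  # only geneX truly changes
depth = 1_000_000  # identical sequencing depth in both samples

rpkm_control = rpkm(expected_counts(control, lengths, depth), lengths)
rpkm_treated = rpkm(expected_counts(treated, lengths, depth), lengths)

for gene in lengths:
    print(gene, round(rpkm_control[gene]), round(rpkm_treated[gene]))
```

geneA and geneB have identical true abundance in both samples, yet their RPKM values fall in the treated sample because geneX absorbs a larger share of a fixed read budget. This is the situation that size-factor and TMM-style methods are designed to correct.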

7. Transcriptome Complexity & Dynamics

The complexity of the transcriptome far exceeds what might be predicted from gene number alone. The human genome contains approximately 20,000 protein-coding genes, yet the GENCODE project (v44) has annotated over 250,000 transcript isoforms. This complexity arises from multiple sources: alternative splicing, alternative promoter usage, alternative polyadenylation, RNA editing, and allele-specific expression.

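The theoretical number of transcript isoforms from a gene with \(n\) independently spliced cassette exons is: $$N_{isoforms} = 2^n$$ Mutually exclusive exon clusters follow a different rule: exactly one exon is chosen from each cluster, so the number of isoforms is the product of the cluster sizes. For example, the Drosophila Dscam gene has 95 alternatively spliced exons organized in four clusters of mutually exclusive exons, yielding a theoretical maximum of: $$N_{Dscam} = 12 \times 48 \times 33 \times 2 = 38{,}016 \text{ isoforms}$$ which exceeds the total number of genes in the Drosophila genome.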
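The two counting rules used here (a factor of two per independent cassette exon, and a product of cluster sizes when exactly one exon is chosen per cluster) can be checked numerically and, for small cases, by explicit enumeration. The function names and exon labels are hypothetical; the cluster sizes are those given in the text.

```python
from itertools import product
from math import prod


def cassette_isoforms(n_cassette_exons: int) -> int:
    """Each cassette exon is independently included or skipped: 2^n combinations."""
    return 2 ** n_cassette_exons


def mutually_exclusive_isoforms(cluster_sizes: list[int]) -> int:
    """Exactly one exon is chosen from each cluster: multiply the cluster sizes."""
    return prod(cluster_sizes)


print(cassette_isoforms(3))
print(mutually_exclusive_isoforms([12, 48, 33, 2]))

# Explicit enumeration for a small case: three cassette exons, each in or out.
for included in product([True, False], repeat=3):
    kept = [name for name, keep in zip(["E2", "E3", "E4"], included) if keep]
    print("-".join(["E1", *kept, "E5"]))
```

The enumeration prints eight exon chains, matching \(2^3\), while the product over the four clusters reproduces the figure of 38,016.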

Transcriptome dynamics can be studied at multiple timescales. Rapid transcriptional responses to environmental stimuli occur within minutes (e.g., immediate-early genes such as c-Fos and c-Jun in response to growth factors). Circadian gene expression programs oscillate with 24-hour periodicity, affecting approximately 10-20% of the transcriptome in any given tissue. Developmental transcriptome changes unfold over days to weeks, involving coordinated activation and repression of gene regulatory networks. Single-cell technologies have further revealed that even genetically identical cells within the same tissue exhibit substantial transcriptomic heterogeneity, driven by stochastic gene expression (transcriptional bursting) and differences in cell state.

Chapter Summary

  • โ—The transcriptome is the complete set of RNA molecules in a cell and is highly dynamic, varying with cell type, condition, and time.
  • โ—Pre-mRNA undergoes 5' capping, splicing, and 3' polyadenylation to become mature mRNA.
  • โ—Alternative splicing vastly expands proteome diversity; over 95% of human multi-exon genes are alternatively spliced.
  • โ—Non-coding RNAs, including miRNAs, lncRNAs, and others, play crucial regulatory roles in gene expression.
  • โ—Gene expression is regulated at multiple levels: epigenetic, transcriptional, post-transcriptional, and translational.
  • โ—TPM is the preferred expression unit because it maintains a constant sum across samples, enabling meaningful cross-sample comparisons.