Part 5: Transcription
Overview: DNA to RNA
Transcription is the synthesis of an RNA molecule from a DNA template. It is the first major step in gene expression, the process by which the information stored in a gene's nucleotide sequence is converted into a functional product (protein or functional RNA). RNA polymerase (RNAP) reads the template strand of DNA in the 3′ → 5′ direction while synthesizing the RNA transcript in the 5′ → 3′ direction, adding ribonucleotides (supplied as ribonucleoside triphosphates, NTPs) complementary to the template.
The newly synthesized RNA is identical in sequence to the coding strand (also called the non-template or sense strand), except that uracil (U) replaces thymine (T). Unlike DNA replication, transcription does not require a primer and copies only selected portions of the genome: specific genes or operons rather than the entire chromosome.
The Three Phases of Transcription
1. Initiation
RNAP binds the promoter, melts the DNA duplex to form the transcription bubble, and begins RNA synthesis. This is the most highly regulated phase.
2. Elongation
RNAP moves along the template, extending the RNA chain at ~40-80 nt/s (prokaryotes) or ~20-30 nt/s (eukaryotes), maintaining the transcription bubble.
3. Termination
Specific signals cause RNAP to halt, release the completed RNA transcript, and dissociate from the DNA template.
Template vs. Coding Strand
5′-...A T G C C G T A A G G C T T A...-3′  (Coding / Sense strand)
3′-...T A C G G C A T T C C G A A T...-5′  (Template / Antisense strand)
5′-...A U G C C G U A A G G C U U A...-3′  (mRNA transcript)
The mRNA sequence matches the coding strand (with U replacing T). RNAP reads the template strand.
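The strand relationships above can be checked with a short Python sketch. The function names `transcribe` and `coding_to_mrna` are illustrative, not from any library, and the sketch assumes the template is written 3′→5′ so that the transcript comes out left to right in the 5′→3′ direction.

```python
# Base-pairing rules for copying a DNA template into RNA
# (A in the template pairs with U in the RNA).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5: str) -> str:
    """Return the mRNA (5'->3') made from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5.upper())


def coding_to_mrna(coding_5to3: str) -> str:
    """Shortcut: the mRNA equals the coding strand with T replaced by U."""
    return coding_5to3.upper().replace("T", "U")


coding = "ATGCCGTAAGGCTTA"    # 5'->3', from the example above
template = "TACGGCATTCCGAAT"  # 3'->5', from the example above

mrna = transcribe(template)
print(mrna)
print(mrna == coding_to_mrna(coding))
```

Both routes give the same transcript, which is the point of the coding/template distinction: RNAP reads the template, but the product reads like the coding strand.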
Prokaryotic Transcription
RNA Polymerase: Structure and Function
Bacteria possess a single RNA polymerase responsible for synthesizing all classes of RNA (mRNA, rRNA, tRNA). The enzyme is a large multi-subunit complex with a distinctive "crab claw" shape, enclosing an internal channel that accommodates ~17 bp of DNA.
Core Enzyme ($\alpha_2\beta\beta'\omega$)
- $\alpha$ subunits (x2): Assembly platform; $\alpha$-CTD contacts UP element upstream of promoter
- $\beta$ subunit: Contains catalytic center; binds NTP substrates; rifampicin binding site
- $\beta'$ subunit: Binds DNA template; contains the two catalytic Mg$^{2+}$ ions
- $\omega$ subunit: Structural role; assists $\beta'$ folding and assembly
Holoenzyme ($\alpha_2\beta\beta'\omega\sigma$)
- $\sigma$ factor: Required for promoter recognition; dissociates after initiation
- $\sigma^{70}$: Housekeeping genes (most common in E. coli)
- $\sigma^{32}$: Heat shock response genes
- $\sigma^{54}$: Nitrogen metabolism genes (unique mechanism)
- $\sigma^{38}$ (RpoS): Stationary phase / stress response
Two-Metal-Ion Catalytic Mechanism
The active site contains two Mg$^{2+}$ ions coordinated by conserved aspartate residues in the $\beta'$ subunit. Metal A activates the 3′-OH of the growing RNA chain for nucleophilic attack on the $\alpha$-phosphate of the incoming NTP. Metal B stabilizes the pyrophosphate leaving group and helps position the NTP substrate.
$$\text{RNA}_{n} + \text{NTP} \xrightarrow{2\text{Mg}^{2+}} \text{RNA}_{n+1} + \text{PP}_i$$
Pyrophosphate (PP$_i$) is subsequently hydrolyzed by pyrophosphatase, making the reaction effectively irreversible ($\Delta G \approx -7$ kcal/mol per nucleotide addition when coupled with PP$_i$ hydrolysis).
Promoter Recognition
Prokaryotic promoters are defined by conserved sequence elements upstream of the transcription start site (+1). The $\sigma$ factor of the holoenzyme makes specific contacts with two key elements:
Consensus Promoter Elements ($\sigma^{70}$)
Position:  -35                           -10            +1
           TTGACA   ...17 bp spacer...   TATAAT   ...   +1
- -35 element (TTGACA): Recognized by $\sigma$ region 4.2. Initial contact point for holoenzyme. Critical for promoter strength.
- -10 element / Pribnow box (TATAAT): Recognized by $\sigma$ region 2.4. AT-rich to facilitate strand separation (only 2 H-bonds per A-T vs 3 for G-C). The site of initial DNA melting.
- Spacer (17 bp optimal): Distance between -35 and -10 elements is critical. Deviations from 17 bp reduce promoter strength because both $\sigma$ contacts must be on the same face of the DNA helix.
- UP element (-40 to -60): AT-rich sequence contacted by $\alpha$-CTD. Enhances transcription up to 30-fold (e.g., rrnB P1 promoter for ribosomal RNA).
- Extended -10 (TGn): Some promoters lack a recognizable -35 element and instead have a TGn motif immediately upstream of the -10 box, contacted by $\sigma$ region 3.0.
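As a rough illustration of how closeness to consensus and spacer length might be combined, here is a toy scoring sketch. The scoring rule (one point per matching hexamer position, minus one point per bp of spacer deviation from 17) is an assumption made for illustration only; it is not a measured or published model, and the second promoter is a hypothetical example.

```python
MINUS_35_CONSENSUS = "TTGACA"
MINUS_10_CONSENSUS = "TATAAT"
OPTIMAL_SPACER_BP = 17


def count_matches(hexamer: str, consensus: str) -> int:
    """Number of positions at which the hexamer agrees with the consensus."""
    return sum(a == b for a, b in zip(hexamer.upper(), consensus))


def promoter_score(minus35: str, minus10: str, spacer_bp: int) -> int:
    """Toy score: matched positions (0-12) minus spacer deviation from 17 bp."""
    matched = (count_matches(minus35, MINUS_35_CONSENSUS)
               + count_matches(minus10, MINUS_10_CONSENSUS))
    return matched - abs(spacer_bp - OPTIMAL_SPACER_BP)


# Perfect consensus promoter with the optimal spacer.
print(promoter_score("TTGACA", "TATAAT", 17))
# Hypothetical weaker promoter: three mismatches and an 18 bp spacer.
print(promoter_score("TTTACA", "TATGTT", 18))
```

A real strength prediction would weight positions unequally and include the UP element and extended -10 motif; the sketch only makes the "match to consensus plus spacing" idea concrete.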
Promoter Strength
Promoter strength varies over a 1000-fold range and is determined by how closely the sequence matches the consensus. The strongest promoters (like rrnB P1) initiate transcription every ~1-2 seconds, while weak promoters may fire only once every 10-30 minutes. This can be quantified as a catalytic efficiency:
$$\text{Promoter strength} \propto \frac{k_{\text{cat}}}{K_m} = \frac{k_{\text{init}}}{K_d^{\text{RNAP-promoter}}}$$
where $k_{\text{init}}$ is the rate of productive initiation and $K_d$ is the dissociation constant for RNAP-promoter binding.
Initiation
Transcription initiation is a multi-step process that converts the initial RNAP-promoter encounter into a productively elongating complex. It is the primary point of regulation for most genes.
Step 1: Closed Complex (RP$_c$)
The holoenzyme ($\alpha_2\beta\beta'\omega\sigma$) binds the double-stranded promoter DNA. $\sigma$ region 4.2 contacts the -35 element and $\sigma$ region 2.4 contacts the -10 element. DNA remains fully base-paired. This is a reversible, relatively rapid step. The holoenzyme slides along DNA and can scan ~2000 bp in a single binding event.
Step 2: Open Complex (RP$_o$)
The DNA duplex melts (unwinds) over a region of approximately 14 bp from roughly -11 to +3, forming the transcription bubble. The template strand descends into the active site channel. This isomerization is essentially irreversible at most promoters (for $\sigma^{70}$) and is the rate-limiting step. The AT-rich Pribnow box facilitates melting. Aromatic residues in $\sigma$ region 2.3 stabilize the single-stranded non-template DNA.
Step 3: Abortive Initiation
RNAP begins synthesizing short RNA transcripts (2-12 nt) but repeatedly releases them and reinitiates without leaving the promoter. This "scrunching" mechanism involves pulling downstream DNA into the enzyme, storing energy as stressed single-stranded DNA. Typically 5-10 abortive cycles occur before productive elongation begins.
Step 4: Promoter Escape
Once the RNA reaches ~10-12 nt, the accumulated scrunching stress is released, breaking the $\sigma$-promoter contacts. The $\sigma$ factor is released (or repositioned), converting the initiation complex into a stable elongation complex (EC). The RNA-DNA hybrid in the EC is ~8-9 bp. This transition is the commitment point: the enzyme is now processive and will transcribe the entire gene.
Kinetic Scheme
$$R + P \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} RP_c \xrightarrow{k_2} RP_o \xrightarrow{k_3} \text{EC} + \sigma$$
R = RNAP holoenzyme, P = promoter, RP$_c$ = closed complex, RP$_o$ = open complex, EC = elongation complex. The rate of transcript production depends on the slowest step: $k_2$ (isomerization) for most promoters.
Derivation: Transcription Initiation Rate from Thermodynamic Promoter Binding
Starting from the thermodynamic model of RNAP-promoter interaction, we derive the rate of productive transcription initiation as a function of RNAP concentration and promoter binding affinity.
Step 1: Equilibrium binding of RNAP to promoter (closed complex)
The formation of the closed complex follows a binding equilibrium:
$$R + P \rightleftharpoons RP_c, \quad K_B = \frac{[RP_c]}{[R][P]} = \frac{k_1}{k_{-1}}$$
The fraction of promoters occupied by RNAP at equilibrium, with $K_d = 1/K_B$:
$$f_{\text{bound}} = \frac{[R]}{K_d + [R]} = \frac{K_B[R]}{1 + K_B[R]}$$
Step 2: Irreversible isomerization to open complex
The closed complex undergoes an essentially irreversible isomerization (for strong promoters) to the open complex with rate constant $k_2$. This is the rate-limiting step for most promoters:
$$RP_c \xrightarrow{k_2} RP_o$$
The free energy for melting ~14 bp of DNA at the Pribnow box is provided by favorable RNAP-ssDNA contacts.
Step 3: Apply the steady-state approximation to RP$_c$
Setting $d[RP_c]/dt = 0$:
$$k_1[R][P] = (k_{-1} + k_2)[RP_c]$$
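The effective initiation flux follows from $J_{\text{init}} = k_2[RP_c]$, where $[P]_{\text{free}}$ is the free-promoter concentration written as $[P]$ above; dividing by $[P]_{\text{free}}$ gives transcripts per free promoter per second, the form used in Step 4: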
$$J_{\text{init}} = k_2[RP_c] = \frac{k_1 k_2 [R]}{k_{-1} + k_2} \cdot [P]_{\text{free}}$$
Step 4: Simplify for different promoter strengths
Strong promoters ($k_2 \gg k_{-1}$): isomerization is fast once RNAP binds, so binding is rate-limiting:
$$J_{\text{init}} \approx k_1[R] \quad \text{(diffusion-limited)}$$
Weak promoters ($k_{-1} \gg k_2$): RNAP frequently dissociates before isomerization, so:
$$J_{\text{init}} \approx K_B \cdot k_2 \cdot [R] = \frac{k_1 k_2}{k_{-1}}[R]$$
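The two limits can be checked numerically against the full steady-state expression. In the sketch below the function name `initiation_rate` and all rate constants are hypothetical values chosen only to put the system clearly into each regime.

```python
def initiation_rate(k1: float, k_minus1: float, k2: float, rnap: float) -> float:
    """Steady-state rate per free promoter: k1*k2*[R] / (k_-1 + k2)."""
    return k1 * k2 * rnap / (k_minus1 + k2)


rnap = 30e-9   # free RNAP concentration in M (assumed)
k1 = 1e8       # association rate constant in M^-1 s^-1 (assumed)

# Strong-promoter regime: k2 >> k_-1, so the rate approaches k1*[R].
strong = initiation_rate(k1, k_minus1=0.01, k2=10.0, rnap=rnap)
print(strong, k1 * rnap)

# Weak-promoter regime: k_-1 >> k2, so the rate approaches (k1*k2/k_-1)*[R].
weak = initiation_rate(k1, k_minus1=10.0, k2=0.01, rnap=rnap)
print(weak, k1 * 0.01 / 10.0 * rnap)
```

In both cases the full expression lands within about 0.1% of the corresponding limit, because the neglected rate constant is a thousand times smaller than the dominant one.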
Step 5: Thermodynamic model with multiple competing promoters
In a cell with $N$ promoters competing for a limited pool of RNAP, the probability of occupancy at promoter $i$ is given by the Boltzmann weight of its binding energy $\Delta G_i$:
$$p_i = \frac{e^{-\Delta G_i / k_B T}}{1 + \sum_{j=1}^{N} e^{-\Delta G_j / k_B T}}$$
This thermodynamic framework explains the 1000-fold range of promoter strengths observed in E. coli.
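A minimal sketch of the occupancy formula, with binding energies expressed in units of $k_B T$ so that no temperature constant is needed. The function name `occupancies` and the three energies are illustrative assumptions, not measured values.

```python
import math


def occupancies(delta_g_in_kT):
    """Occupancy p_i for each promoter from its binding energy (units of k_B*T).

    The leading 1.0 in the partition sum is the weight of the unbound state.
    """
    weights = [math.exp(-dg) for dg in delta_g_in_kT]
    partition = 1.0 + sum(weights)
    return [w / partition for w in weights]


# Hypothetical binding energies: more negative means tighter binding.
energies = [-3.0, -1.0, 0.0]
for dg, p in zip(energies, occupancies(energies)):
    print(f"dG = {dg:+.1f} kT  ->  p = {p:.3f}")
```

Because each occupancy depends exponentially on its energy, a difference of about 7 $k_B T$ between two promoters already corresponds to a roughly 1000-fold difference in Boltzmann weight.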
Step 6: Numerical example
For the lacUV5 promoter: $K_B \approx 10^7$ M$^{-1}$, $k_2 \approx 0.1$ s$^{-1}$. With [RNAP] $\approx 30$ nM free in E. coli:
$$J_{\text{init}} = \frac{K_B \cdot k_2 \cdot [R]}{1 + K_B[R]} = \frac{10^7 \times 0.1 \times 3 \times 10^{-8}}{1 + 10^7 \times 3 \times 10^{-8}} \approx 0.023 \text{ s}^{-1}$$
This corresponds to ~1 transcript initiated every 43 seconds, consistent with measured rates.
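The arithmetic in this example can be reproduced in a few lines. The function name `saturating_rate` is illustrative; the parameter values are the ones quoted above for lacUV5.

```python
def saturating_rate(K_B: float, k2: float, rnap: float) -> float:
    """Initiation rate J = K_B*k2*[R] / (1 + K_B*[R]) in s^-1."""
    return K_B * k2 * rnap / (1.0 + K_B * rnap)


K_B = 1e7      # M^-1
k2 = 0.1       # s^-1
rnap = 30e-9   # M

j_init = saturating_rate(K_B, k2, rnap)
print(f"J_init = {j_init:.4f} s^-1")
print(f"one initiation every {1.0 / j_init:.1f} s")
```

With $K_B[R] = 0.3$ the promoter is far from saturated, so the rate is still close to linear in RNAP concentration.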
Elongation
The elongation complex (EC) is remarkably stable and processive, capable of transcribing genes spanning hundreds of kilobases without dissociating. It maintains a characteristic structural organization:
Transcription Bubble Architecture
- Transcription bubble: ~17 bp of unwound DNA
- RNA-DNA hybrid: ~8 bp at the active site; peeled off by the lid domain
- RNA exit channel: Single-stranded RNA exits through a channel between the $\beta$ flap domain and the lid
- Downstream DNA: Double-stranded DNA enters through the main channel (jaws/clamp)
NTP Addition Cycle
Each nucleotide addition follows a conserved cycle:
- NTP binding: Complementary NTP enters through the secondary channel and binds at the active site (trigger loop open)
- Active site closure: Trigger loop folds, positioning the NTP for catalysis and checking base-pair complementarity (fidelity checkpoint)
- Phosphodiester bond formation: Nucleophilic attack of the 3′-OH on the $\alpha$-phosphate; PP$_i$ release
- Translocation: RNAP moves forward by 1 bp, driven by a Brownian ratchet mechanism; the trigger loop reopens
Elongation Speed
$$v_{\text{elong}} = k_{\text{cat}} \approx 40\text{-}80 \text{ nt/s (prokaryotes)}$$
Speed varies with sequence context, NTP concentrations, and pausing. The net transcription rate over an operon can be expressed as:
$$\langle v \rangle = \frac{L}{t_{\text{pause}} + L/v_{\text{max}}}$$
where L = gene length, $t_{\text{pause}}$ = cumulative pause time
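A small sketch of the average-speed formula. The function name `mean_velocity` and the example operon (3000 nt, 25 s of cumulative pausing) are assumptions for illustration.

```python
def mean_velocity(length_nt: float, v_max: float, t_pause: float) -> float:
    """Average speed <v> = L / (t_pause + L / v_max), in nt/s."""
    return length_nt / (t_pause + length_nt / v_max)


# Hypothetical 3000 nt operon transcribed at 80 nt/s between pauses.
print(mean_velocity(3000, 80, 0))    # no pausing: equals v_max
print(mean_velocity(3000, 80, 25))   # 25 s of cumulative pausing
```

In this example 25 s of pausing on top of 37.5 s of active elongation lowers the average from 80 to 48 nt/s.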
Derivation: RNA Polymerase Elongation from the Brownian Ratchet Model
We start from the idea that RNAP does not actively "walk" along DNA but instead undergoes thermal diffusion (Brownian motion) between pre- and post-translocated states, rectified by NTP binding, and derive the resulting elongation rate.
Step 1: Define the two translocation states
After phosphodiester bond formation, RNAP fluctuates between the pre-translocated state (active site occupied by the 3′ end of the RNA) and the post-translocated state (active site empty, ready for next NTP). The forward and backward rates are governed by thermal energy:
$$\text{Pre} \underset{k_-}{\overset{k_+}{\rightleftharpoons}} \text{Post}$$
In a simple Brownian ratchet, $k_+ \approx k_-$ (no intrinsic directional bias).
Step 2: NTP binding rectifies the ratchet
When RNAP is in the post-translocated state, the incoming NTP binds to the open active site with rate $k_{\text{bind}}[\text{NTP}]$. This locks RNAP in the forward position, preventing backward translocation. The NTP acts as a "pawl" on the ratchet:
$$\text{Post} + \text{NTP} \xrightarrow{k_{\text{bind}}} \text{Post-NTP} \xrightarrow{k_{\text{chem}}} \text{Pre}_{n+1} + \text{PP}_i$$
Step 3: Derive the steady-state elongation rate
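Without NTP, the steady-state fraction of time in the post-translocated state is $f_{\text{post}} = k_+/(k_+ + k_-)$. With NTP binding draining the post-translocated state, solving the steady-state balance for the three-state cycle (Pre, Post, Post-NTP) gives the rate of NTP incorporation:
$$v_{\text{elong}} = \frac{k_+ \cdot k_{\text{bind}}[\text{NTP}] \cdot k_{\text{chem}}}{k_{\text{chem}}(k_+ + k_- + k_{\text{bind}}[\text{NTP}]) + k_+ \cdot k_{\text{bind}}[\text{NTP}]}$$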
Step 4: Simplify at saturating NTP
When $[\text{NTP}] \to \infty$, NTP binding is instantaneous and the rate is limited by forward translocation and chemistry acting in series:
$$v_{\max} = \frac{k_+ \cdot k_{\text{chem}}}{k_+ + k_{\text{chem}}}$$
If chemistry is fast ($k_{\text{chem}} \gg k_+$): $v_{\max} \approx k_+$ (translocation-limited). If translocation is fast ($k_+ \gg k_{\text{chem}}$): $v_{\max} \approx k_{\text{chem}}$ (chemistry-limited).
Step 5: Backtracking as a competing pathway
RNAP can also slide backward by multiple base pairs (backtracking), extruding the 3′ RNA end through the secondary channel. This is favored at pause/arrest sites and acts as an off-pathway state:
$$\text{Pre} \xrightarrow{k_{\text{bt}}} \text{Backtracked} \xrightarrow{k_{\text{cleavage}}} \text{Pre}$$
The effective elongation rate including backtracking pauses:
$$\langle v \rangle = v_{\text{elong}} \cdot \frac{k_{\text{cleavage}}}{k_{\text{cleavage}} + k_{\text{bt}} \cdot f_{\text{pause}}}$$
Step 6: Numerical estimates
For E. coli RNAP: $k_+ \approx 200$ s$^{-1}$, $k_- \approx 100$ s$^{-1}$ (slight forward bias from RNA-DNA hybrid stability), $k_{\text{chem}} \approx 300$ s$^{-1}$. This gives $v_{\max} \approx 120$ nt/s for pure elongation. The observed rate of ~40-80 nt/s reflects frequent pausing events (ubiquitous pausing at ~1-5% of template positions for 1-10 seconds each).
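The quoted $v_{\max}$ can be checked with a short sketch. The function `elongation_rate` evaluates the three-state cycle of Steps 1-2 (Pre ⇌ Post → Post-NTP → Pre) at steady state; its name, and the choice to pass $k_{\text{bind}}[\text{NTP}]$ as a single pseudo-first-order rate, are assumptions made for this illustration.

```python
def elongation_rate(k_plus: float, k_minus: float, k_chem: float,
                    k_bind_ntp: float) -> float:
    """Steady-state turnover of the cycle Pre <-> Post -> Post-NTP -> Pre.

    k_bind_ntp is the product k_bind*[NTP], in s^-1.
    """
    b = k_bind_ntp
    return k_plus * b * k_chem / (k_chem * (k_plus + k_minus + b) + k_plus * b)


def v_max(k_plus: float, k_chem: float) -> float:
    """Saturating-NTP limit: translocation and chemistry in series."""
    return k_plus * k_chem / (k_plus + k_chem)


k_plus, k_minus, k_chem = 200.0, 100.0, 300.0   # s^-1, values from the text

print(v_max(k_plus, k_chem))
for b in (10.0, 100.0, 1000.0, 1e6):            # hypothetical k_bind*[NTP] values
    print(b, round(elongation_rate(k_plus, k_minus, k_chem, b), 1))
```

The rate rises with NTP availability and levels off at 120 nt/s, the value quoted above; $k_-$ matters only at sub-saturating NTP, where backward steps compete with NTP binding.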
Proofreading
RNAP lacks the 3′→5′ exonuclease of DNA polymerase, but achieves an error rate of ~10$^{-4}$ to 10$^{-5}$ through: (1) Kinetic selection: mismatched NTPs have slower trigger loop closure; (2) Backtracking: RNAP slides backward, extruding the 3′ end through the secondary channel, where it can be cleaved by intrinsic endonuclease activity, stimulated by the GreA/GreB factors (analogous to eukaryotic TFIIS).
Termination
Prokaryotic transcription termination occurs by two distinct mechanisms. Both ultimately destabilize the elongation complex, causing release of the RNA transcript and dissociation of RNAP from the DNA template.
Rho-Independent (Intrinsic) Termination
Accounts for ~50% of E. coli terminators. Requires two elements in the RNA:
- GC-rich hairpin: A palindromic sequence forms a stable stem-loop structure ($\Delta G \approx -10$ to $-20$ kcal/mol) in the RNA exit channel. The hairpin physically disrupts the RNA-DNA hybrid and destabilizes the elongation complex by pulling on the RNA within the exit channel.
- Poly-U tract: A run of ~7-9 uridines immediately follows the hairpin. The rU-dA hybrid is the weakest of all base-pair combinations ($T_m$ depressed), providing minimal resistance to transcript dissociation. Together, the hairpin + poly-U cause RNAP to pause and release.
Rho-Dependent Termination
Requires the Rho ($\rho$) protein, an ATP-dependent RNA translocase/helicase:
- Rho structure: Homohexameric ring (6 identical subunits), encircles single-stranded RNA, translocates 5′→3′ along the transcript using ATP hydrolysis
- rut site (Rho utilization): A C-rich, G-poor, unstructured region (~70-80 nt) in the RNA where Rho initially binds. Lack of secondary structure allows Rho loading.
- Mechanism: Rho binds rut site, threads RNA through its central pore, translocates toward RNAP. When it catches a paused RNAP at a termination site, Rho's helicase activity unwinds the RNA-DNA hybrid, releasing the transcript.
- Bicyclomycin: Antibiotic that specifically inhibits Rho, demonstrating the essential nature of Rho-dependent termination.
Eukaryotic Transcription
Three Nuclear RNA Polymerases
Unlike prokaryotes, eukaryotes employ three distinct nuclear RNA polymerases (Pol I, II, III) for different classes of RNA, each with 12-17 subunits. Plants have two additional polymerases, Pol IV and Pol V, which function in siRNA-directed DNA methylation.
| Polymerase | Location | RNA Products | % Transcription | $\alpha$-amanitin sensitivity |
|---|---|---|---|---|
| Pol I | Nucleolus | 18S, 5.8S, 28S rRNA (45S precursor) | ~50-70% | Resistant |
| Pol II | Nucleoplasm | mRNA, snRNA, miRNA, lncRNA | ~20-40% | Highly sensitive (low [nM]) |
| Pol III | Nucleoplasm | tRNA, 5S rRNA, U6 snRNA, 7SL RNA | ~10% | Sensitive (high [$\mu$M]) |
$\alpha$-amanitin (from Amanita phalloides mushroom) is the classic tool for distinguishing the polymerases. It binds the bridge helix/trigger loop of Pol II, blocking translocation.
RNA Polymerase II Promoter Architecture
Pol II promoters are far more complex and variable than prokaryotic promoters. Core promoter elements span roughly -40 to +40 relative to the transcription start site (+1). No single element is universal; different combinations are found at different genes.
Core Promoter Elements
- TATA box (consensus: TATA(A/T)A(A/T)): Located at -25 to -30. Bound by TBP (TATA-binding protein), a subunit of TFIID. Present in only ~10-20% of human promoters (mostly tissue-specific, highly regulated genes). Determines the position of +1.
- Initiator (Inr, consensus: Py-Py-A+1-N-T/A-Py-Py): Spans the +1 start site. Bound by TAF1/TAF2 subunits of TFIID. Can function independently of TATA box.
- Downstream Promoter Element (DPE): Located at +28 to +34. Bound by TAF6/TAF9. Common in TATA-less promoters (Drosophila). Works in strict spacing with Inr.
- BRE (TFIIB Recognition Element): Flanks the TATA box (BRE$^u$ upstream, BRE$^d$ downstream). Directly contacts TFIIB. Can positively or negatively regulate transcription.
- MTE (Motif Ten Element): Located at +18 to +27. Works synergistically with Inr, DPE, and TATA box.
- CpG islands: ~70% of human gene promoters are associated with CpG islands, GC-rich regions of ~1 kb that are typically TATA-less with dispersed transcription start sites.
General Transcription Factors and PIC Assembly
Unlike the bacterial $\sigma$ factor (a single polypeptide), eukaryotic Pol II requires a large set of general transcription factors (GTFs) to recognize the promoter and initiate transcription. Together with Pol II, they form the Pre-Initiation Complex (PIC).
TFIID (TBP + 13-14 TAFs)
TBP binds the TATA box minor groove, inducing a dramatic ~80$^\circ$ bend in the DNA. TBP has a saddle-shaped structure. TAFs (TBP-Associated Factors) recognize other promoter elements (Inr, DPE, MTE) and provide a platform for activator contacts. TAF1 has histone acetyltransferase (HAT) activity; TAF1/TAF7 has kinase activity. TFIID is the first GTF to bind the promoter in the stepwise assembly model.
TFIIA
Stabilizes TBP-DNA interaction; counteracts transcriptional repressors that compete for TBP binding. Not always required in vitro.
TFIIB
Bridges TBP and Pol II. Binds BRE elements and positions Pol II at the correct start site. The B-reader and B-linker domains extend into the Pol II active site cleft, helping to stabilize the initially transcribing complex and select the +1 nucleotide. TFIIB is the functional analog of bacterial $\sigma$ region 3.2.
TFIIF (RAP30/RAP74)
Escorts Pol II to the promoter; stabilizes Pol II-TFIIB-DNA interaction. RAP74 has a role in start-site selection. Reduces nonspecific binding of Pol II to DNA. Enters as a Pol II-TFIIF complex.
TFIIE
Recruits TFIIH to the PIC. Stimulates the helicase and kinase activities of TFIIH. Helps stabilize the open complex by binding single-stranded DNA in the transcription bubble.
TFIIH (10 subunits, ~500 kDa)
The most complex GTF, with two enzymatic activities critical for initiation: (1) XPB helicase (3′→5′): uses ATP hydrolysis to melt ~11 bp of DNA around the start site, forming the open complex. XPB is also essential for nucleotide excision repair (NER), linking transcription and repair. (2) CDK7/Cyclin H kinase (CAK module): phosphorylates Ser5 of the Pol II CTD heptad repeat, triggering promoter escape and co-transcriptional capping. CDK7 also phosphorylates cell-cycle CDKs.
Stepwise PIC Assembly Order
TFIID (TBP) → TFIIA → TFIIB → Pol II/TFIIF → TFIIE → TFIIH → PIC
In vivo, much of this may arrive as a preassembled "Pol II holoenzyme" containing Pol II, GTFs, and the Mediator complex. Scaffold complexes with TFIID, TFIIA, TFIIB, Mediator can persist at active promoters through multiple rounds of reinitiation.
The CTD Code: Phosphorylation as a Regulatory Switch
The C-terminal domain (CTD) of the Rpb1 subunit of Pol II consists of tandem heptad repeats with the consensus sequence Y$_1$S$_2$P$_3$T$_4$S$_5$P$_6$S$_7$. Humans have 52 repeats; yeast have 26. The CTD acts as a landing platform whose phosphorylation state recruits different processing and regulatory factors at different stages of transcription.
Ser5-P (Initiation / Early Elongation)
- Kinase: CDK7 (TFIIH subunit)
- Recruits: Capping enzyme (guanylyltransferase)
- Function: Triggers promoter escape; 5′ capping of nascent RNA
- Peak: Near the promoter (+1 to +500 bp)
- Removed by: Rtr1/RPAP2 phosphatase
Ser2-P (Productive Elongation)
- Kinase: P-TEFb (CDK9/Cyclin T)
- Recruits: Splicing factors, 3′ processing machinery (CPSF/CstF)
- Function: Marks productively elongating Pol II
- Peak: Gene body and 3′ end
- Removed by: Fcp1 phosphatase (CTD recycling)
The CTD Cycle
$$\text{Pol II (unphosphorylated)} \xrightarrow{\text{CDK7}} \text{Ser5-P} \xrightarrow{\text{CDK9}} \text{Ser2-P + Ser5-P} \xrightarrow{\text{Fcp1}} \text{Recycled Pol II}$$
Only hypophosphorylated Pol II (IIA form) is recruited to the PIC. The hyperphosphorylated form (IIO) is the elongating species. After termination, Fcp1 dephosphorylates the CTD to regenerate the IIA form for reinitiation.
Elongation Regulation: Promoter-Proximal Pausing
A major regulatory checkpoint in eukaryotic transcription occurs after initiation: Pol II pauses ~30-60 bp downstream of the TSS and must be actively released into productive elongation. This is called promoter-proximal pausing and is a widespread mechanism for rapid gene activation.
Pausing Factors
- DSIF (DRB Sensitivity-Inducing Factor): Spt4/Spt5 heterodimer; travels with Pol II; contributes to pausing in unphosphorylated state; becomes a positive elongation factor when phosphorylated by P-TEFb
- NELF (Negative Elongation Factor): 4-subunit complex; cooperates with DSIF to stabilize the paused Pol II. Absent in yeast and C. elegans (no pausing in these organisms).
Pause Release: P-TEFb
P-TEFb (CDK9/Cyclin T1) is the master switch for pause release. It phosphorylates: (1) Ser2 of the Pol II CTD, (2) Spt5 (converting DSIF to a positive factor), and (3) the RD subunit of NELF (causing NELF dissociation). P-TEFb activity is itself tightly regulated: the majority is sequestered in the inactive 7SK snRNP complex and must be released by signals such as BRD4 binding to acetylated histones or by super-enhancers.
Other Elongation Factors
- FACT (Facilitates Chromatin Transcription): Histone chaperone; disassembles nucleosomes ahead of Pol II and reassembles them behind
- Elongin (SIII): Stimulates elongation rate by suppressing transient pausing
- TFIIS (SII): Stimulates intrinsic Pol II endonuclease to cleave backtracked RNA (rescues arrested complexes)
- PAF1 complex: Couples elongation to histone modifications (H2B ubiquitination, H3K4/K36 methylation)
Eukaryotic RNA Processing
Eukaryotic pre-mRNAs undergo three co-transcriptional modifications before export as mature mRNA: 5′ capping, splicing, and 3′ cleavage/polyadenylation. These processing events are coordinated through the Pol II CTD, which serves as a scaffold for recruiting processing machinery.
5′ Capping
The 5′ cap is added co-transcriptionally when the nascent RNA is only ~20-30 nucleotides long. It is the first processing event and is coupled to transcription through Ser5-P of the CTD, which recruits the capping enzymes.
Three Enzymatic Steps
- RNA triphosphatase: Removes the $\gamma$-phosphate from the 5′ end of the nascent RNA (5′-pppN → 5′-ppN)
- Guanylyltransferase: Transfers GMP (from GTP) via an unusual 5′-5′ triphosphate linkage (GpppN). This backward-facing G is the defining feature of the cap.
- Methyltransferase: SAM-dependent methylation at N7 of the guanine, yielding the m$^7$GpppN (Cap 0) structure. Further 2′-O-methylation of the first and second nucleotides gives Cap 1 and Cap 2.
Functions of the m$^7$G Cap
- Stability: Protects mRNA from 5′→3′ exonuclease degradation (Xrn1/Xrn2)
- Nuclear export: Recognized by CBC (Cap-Binding Complex: CBP20/CBP80) for transport through nuclear pores
- Translation: Recognized by eIF4E in the cytoplasm for ribosome recruitment
- Splicing: Enhances splicing of the first (cap-proximal) intron
- Self vs. non-self: Cap modifications (2′-O-methylation) help distinguish self mRNA from viral RNA, avoiding innate immune activation (RIG-I, IFIT1)
Pre-mRNA Splicing
Most eukaryotic genes are interrupted by non-coding introns that must be precisely removed to join the coding exons. In humans, the average gene has ~8.8 exons and ~7.8 introns, with introns comprising ~95% of pre-mRNA length. Splicing is catalyzed by the spliceosome, a massive (~3 MDa) ribonucleoprotein machine.
Splice Site Consensus Sequences
- 5′ splice site (donor): exon-GU-intron (consensus: AG|GURAGU, where R = purine; the GU dinucleotide is almost invariant)
- 3′ splice site (acceptor): intron-AG-exon (consensus: YYYAG|G, preceded by a polypyrimidine tract)
- Branch point: Located 18-40 nt upstream of the 3′ SS. Contains a conserved adenosine (consensus: YNCURAY in mammals, UACUAAC in yeast; the branch-point nucleotide is the A in YNCURAY and the 3′-most A in UACUAAC)
- Polypyrimidine tract: A pyrimidine-rich region between the branch point and 3′ SS, bound by U2AF65
The Two Transesterification Reactions
Splicing proceeds through two sequential transesterification (phosphoryl transfer) reactions. No energy input (ATP) is required for the chemistry itself; the reactions are isoenergetic. ATP is consumed only for spliceosome assembly and rearrangements.
Step 1: Branching Reaction
The 2′-OH of the branch-point adenosine performs a nucleophilic attack on the phosphodiester bond at the 5′ splice site. This generates: (a) a free 5′ exon with a 3′-OH, and (b) a lariat intermediate: the intron forms a loop via the 2′-5′ phosphodiester bond to the branch-point A, still connected to the 3′ exon.
Step 2: Exon Ligation
The free 3′-OH of the 5′ exon attacks the phosphodiester bond at the 3′ splice site. This ligates the two exons and releases the lariat intron (which is debranched and degraded). The spliced mRNA retains the original reading frame.
Spliceosome Assembly (Ordered snRNP Recruitment)
The major (U2-dependent) spliceosome consists of five snRNPs (U1, U2, U4, U5, U6) plus ~150 associated proteins. Assembly follows a defined pathway:
- E complex (early): U1 snRNP base-pairs with the 5′ SS; SF1/BBP recognizes the branch point; U2AF65 binds polypyrimidine tract; U2AF35 binds the 3′ SS AG
- A complex: U2 snRNP replaces SF1, base-pairing with the branch-point sequence (the branch-point A is bulged out, activating its 2′-OH). ATP-dependent step requiring the helicase activity of UAP56.
- B complex: U4/U5/U6 tri-snRNP joins. U5 contacts the 5′ exon. U6 displaces U1 at the 5′ SS. Massive conformational rearrangements mediated by DExD/H-box RNA helicases (Brr2, Prp28).
- B* (activated): U1 and U4 are released. U6 base-pairs with U2 to form the catalytic center. U6 coordinates catalytic metal ions (Mg$^{2+}$). The spliceosome is now a ribozyme.
- C complex: Step 1 catalysis (branching reaction) occurs. The lariat intermediate is formed.
- C* complex: Rearrangement; step 2 catalysis (exon ligation). Exons are joined; lariat is released.
- P complex (post-spliceosomal): Mature mRNA released with exon junction complex (EJC) deposited ~20-24 nt upstream of the splice junction. EJC marks the splice site for NMD surveillance and enhances nuclear export and translation.
Splicing Efficiency
The fraction of transcripts correctly spliced can be modeled as:
$$\eta_{\text{splice}} = \prod_{i=1}^{n} \frac{k_{\text{splice},i}}{k_{\text{splice},i} + k_{\text{skip},i} + k_{\text{retain},i}}$$
where $k_{\text{splice},i}$ is the splicing rate for intron $i$, $k_{\text{skip},i}$ is the exon-skipping rate, and $k_{\text{retain},i}$ is the intron-retention rate. For a gene with $n$ introns, the overall fidelity is the product of individual intron efficiencies.
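A minimal sketch of this product, assuming each intron is described by a tuple of three rates. The function name `splicing_efficiency` and the example rates are hypothetical.

```python
import math


def splicing_efficiency(introns) -> float:
    """Product over introns of k_splice / (k_splice + k_skip + k_retain).

    introns: iterable of (k_splice, k_skip, k_retain) tuples in the same units.
    """
    return math.prod(
        k_splice / (k_splice + k_skip + k_retain)
        for k_splice, k_skip, k_retain in introns
    )


# Hypothetical gene with three introns, each spliced correctly 90% of the time.
gene = [(0.90, 0.05, 0.05)] * 3
print(round(splicing_efficiency(gene), 3))
```

Because the per-intron efficiencies multiply, even high individual fidelity erodes quickly as the intron count grows: 90% per intron leaves about 73% correctly spliced transcripts after three introns.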
Alternative Splicing
Over 95% of human multi-exon genes undergo alternative splicing, producing multiple mRNA isoforms from a single gene. This is a major source of proteomic diversity: the ~20,000 human genes can produce over 100,000 distinct transcript isoforms.
Modes of Alternative Splicing
- Exon skipping (cassette exon): Most common in mammals (~40%). An internal exon is either included or excluded.
- Alternative 5′ splice site: Two or more donor sites compete for the same acceptor. Changes the 3′ boundary of the upstream exon.
- Alternative 3′ splice site: Two or more acceptor sites compete for the same donor. Changes the 5′ boundary of the downstream exon.
- Intron retention: Most common in plants/fungi. An intron is retained in the mature mRNA. Often introduces a premature stop codon, leading to NMD.
- Mutually exclusive exons: Exactly one of two or more exons is included, never both and never neither.
Regulatory Proteins
- SR proteins (SRSF1-12): Contain RS (arginine-serine) domain and RRM. Generally promote exon inclusion by binding exonic splicing enhancers (ESEs) and recruiting U1/U2 snRNPs.
- hnRNP proteins: Generally antagonize SR proteins. Bind exonic/intronic splicing silencers (ESS/ISS) and promote exon skipping. Examples: hnRNP A1 (promotes skipping), PTB/hnRNP I (represses splicing).
- Tissue-specific regulators: NOVA (brain), RBFOX (muscle/brain), MBNL (muscle), ESRP1/2 (epithelial). Create tissue-specific splicing programs.
Record Holder: Drosophila Dscam
The Drosophila Down syndrome cell adhesion molecule (Dscam) gene can potentially produce 38,016 distinct mRNA isoforms through mutually exclusive alternative splicing of four exon clusters: exon 4 (12 variants) $\times$ exon 6 (48 variants) $\times$ exon 9 (33 variants) $\times$ exon 17 (2 variants) = 12 $\times$ 48 $\times$ 33 $\times$ 2 = 38,016. This exceeds the total number of genes in the Drosophila genome (~14,000) and plays a critical role in neuronal self-recognition and innate immunity.
Derivation: Splicing Combinatorics (Number of Possible mRNA Isoforms)
Starting from a gene with $N$ exons, we calculate the theoretical number of distinct mRNA isoforms under different splicing models.
Step 1: Cassette exon model (independent exon skipping)
Assume the first and last exons are always included (constitutive), while each of the $N - 2$ internal exons is independently included or excluded. Each internal exon has 2 choices:
$$I_{\text{skip}} = 2^{N-2}$$
For a 10-exon gene: $2^8 = 256$ possible isoforms.
Step 2: Mutually exclusive exon clusters
If exon position $j$ has a cluster of $m_j$ mutually exclusive variants (exactly one must be chosen), the total number of isoforms from $k$ such clusters is the product:
$$I_{\text{ME}} = \prod_{j=1}^{k} m_j$$
For Drosophila Dscam: $I = 12 \times 48 \times 33 \times 2 = 38{,}016$ isoforms.
Step 3: Include alternative 5′ and 3′ splice sites
If each internal exon can be skipped, included with the canonical site, or included with an alternative 5′ or 3′ splice site, there are 4 choices per internal exon:
$$I_{\text{alt}} = 4^{N-2}$$
For a 10-exon gene: $4^8 = 65{,}536$ theoretical isoforms.
Step 4: Add intron retention
Each of the $N - 1$ introns can be retained or removed. Combined with exon skipping, the upper bound becomes:
$$I_{\text{max}} = 2^{N-2} \times 2^{N-1} = 2^{2N-3}$$
Step 5: General combinatorial formula
For a gene with $n_s$ cassette exons, $k$ mutually exclusive clusters of sizes $m_1, \ldots, m_k$, $n_a$ exons with alternative splice sites (each with $a_i$ sites), and $n_r$ retainable introns:
$$I_{\text{total}} = 2^{n_s} \times \prod_{j=1}^{k} m_j \times \prod_{i=1}^{n_a} a_i \times 2^{n_r}$$
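The counting formulas in Steps 1-5 can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and not part of the original material, and the counts are theoretical upper bounds under the independence assumptions stated in each step.

```python
from math import prod


def cassette_isoforms(n_exons):
    """Step 1: each of the N-2 internal exons is independently included or skipped."""
    return 2 ** (n_exons - 2)


def mutually_exclusive_isoforms(cluster_sizes):
    """Step 2: exactly one variant is chosen from each mutually exclusive cluster."""
    return prod(cluster_sizes)


def alt_site_isoforms(n_exons):
    """Step 3: four choices (skip, canonical, alt 5' site, alt 3' site) per internal exon."""
    return 4 ** (n_exons - 2)


def retention_upper_bound(n_exons):
    """Step 4: exon skipping combined with retention of any of the N-1 introns."""
    return 2 ** (n_exons - 2) * 2 ** (n_exons - 1)


def total_isoforms(n_s, cluster_sizes, alt_site_counts, n_r):
    """Step 5: general product formula; empty lists contribute a factor of 1."""
    return 2 ** n_s * prod(cluster_sizes) * prod(alt_site_counts) * 2 ** n_r


print(cassette_isoforms(10))                          # 256
print(mutually_exclusive_isoforms([12, 48, 33, 2]))   # 38016 (Dscam)
print(alt_site_isoforms(10))                          # 65536
print(retention_upper_bound(10))                      # 131072 = 2**17
```

Setting `n_s = 0`, `n_r = 0` and passing no alternative-site counts reduces `total_isoforms` to the Dscam product of Step 2.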
Step 6: Biological constraints reduce complexity
In practice, most theoretical isoforms are never produced. SR proteins and hnRNPs create regulatory networks that couple exon choices, and many combinations would introduce frameshifts or premature stop codons, making the transcripts targets of nonsense-mediated decay (NMD). The human transcriptome has ~20,000 genes producing ~100,000-200,000 distinct transcripts, an average of 5-10 isoforms per gene, far below even the cassette-exon count of $2^{N-2}$ from Step 1.
3′ End Processing: Cleavage and Polyadenylation
The 3′ end of most eukaryotic mRNAs is generated not by termination of transcription but by endonucleolytic cleavage of the pre-mRNA followed by addition of a poly(A) tail. This processing is coupled to Pol II termination.
Signal Sequences
- Polyadenylation signal: AAUAAA (or close variant AUUAAA). Located 10-30 nt upstream of the cleavage site. Recognized by CPSF (Cleavage and Polyadenylation Specificity Factor), specifically by the CPSF-30 subunit (via its zinc finger domains) together with WDR33. The AAUAAA hexamer is the most conserved element in the 3′ processing signal.
- Downstream element (DSE): A GU-rich or U-rich element located 20-40 nt downstream of the cleavage site. Recognized by CstF (Cleavage Stimulation Factor), specifically CstF-64.
- Cleavage site: Usually after a CA dinucleotide, between the AAUAAA and DSE elements.
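The spacing rules in this list can be turned into a simple sequence scan. The sketch below is illustrative only: the function name, the convention of measuring the 10-30 nt gap from the end of the hexamer to the cut position, and the test sequence are assumptions, and real poly(A) site prediction also weighs the DSE and other auxiliary elements.

```python
import re

# Lookahead pattern so that overlapping hexamers are all reported.
PAS_PATTERN = re.compile(r"(?=(AAUAAA|AUUAAA))")


def find_polya_candidates(seq, min_gap=10, max_gap=30):
    """Return (hexamer_start, hexamer, [cut positions]) for each signal found.

    A cut position c means cleavage between rna[c-1] and rna[c], i.e. directly
    after the A of a CA dinucleotide, with c lying min_gap..max_gap nt
    downstream of the hexamer end (assumed distance convention).
    """
    rna = seq.upper().replace("T", "U")   # accept DNA or RNA input
    results = []
    for match in PAS_PATTERN.finditer(rna):
        start = match.start()
        hexamer_end = start + 6
        cuts = []
        for cut in range(hexamer_end + min_gap, hexamer_end + max_gap + 1):
            if cut <= len(rna) and rna[cut - 2:cut] == "CA":
                cuts.append(cut)
        results.append((start, match.group(1), cuts))
    return results


# Hypothetical test sequence: signal, 15-nt spacer, CA, then a GU-rich stretch.
example = "GG" + "AAUAAA" + "U" * 15 + "CA" + "G" * 10 + "UGUGUUGU"
print(find_polya_candidates(example))   # [(2, 'AAUAAA', [25])]
```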
Processing Steps
- Assembly: CPSF binds AAUAAA; CstF binds DSE; additional factors (CFI, CFII) join. CPSF is recruited to the elongating Pol II via the Ser2-P CTD.
- Cleavage: CPSF-73 (a metallo-$\beta$-lactamase) performs the endonucleolytic cleavage. This generates a free 3′-OH on the upstream (mRNA) fragment.
- Polyadenylation: Poly(A) polymerase (PAP) adds ~200 adenylate residues (in mammals) to the 3′-OH in a template-independent manner. PABPN1 (nuclear poly(A) binding protein) stimulates PAP processivity and controls tail length.
- Termination coupling: The downstream RNA fragment (still attached to Pol II) is degraded 5′→3′ by Xrn2 (torpedo model). When Xrn2 catches up to Pol II, it triggers release of the polymerase and thus termination.
Functions of the Poly(A) Tail
- mRNA stability: Bound by PABPC1 in the cytoplasm; protects from 3′→5′ exosome degradation. Deadenylation is typically the first step in mRNA decay.
- Nuclear export: PABPN1 + CBC promote mRNP export through the nuclear pore complex.
- Translation: PABPC1 interacts with eIF4G, forming the mRNA closed-loop structure that promotes translation initiation and ribosome recycling.
Mathematical Framework
Transcription Rate Equations
Steady-State mRNA Level
At steady state, the rate of mRNA production equals the rate of degradation:
$$\frac{d[\text{mRNA}]}{dt} = k_{\text{txn}} - k_{\text{deg}}[\text{mRNA}] = 0$$
$$[\text{mRNA}]_{\text{ss}} = \frac{k_{\text{txn}}}{k_{\text{deg}}}$$
where $k_{\text{txn}}$ is the transcription rate (mRNA/time) and $k_{\text{deg}}$ is the first-order degradation rate constant. The mRNA half-life is $t_{1/2} = \ln 2 / k_{\text{deg}}$.
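As a worked illustration of these relationships, the sketch below evaluates the steady state, the half-life, and the time course $m(t) = m_{\text{ss}} + (m_0 - m_{\text{ss}})e^{-k_{\text{deg}} t}$ that solves the rate equation. The parameter values (2 mRNA/min, 0.1 per min) and function names are illustrative assumptions.

```python
import math


def steady_state_mrna(k_txn, k_deg):
    """Steady-state level: production rate divided by degradation rate constant."""
    return k_txn / k_deg


def half_life(k_deg):
    """mRNA half-life for first-order decay."""
    return math.log(2) / k_deg


def mrna_at_time(t, k_txn, k_deg, m0=0.0):
    """Solution of dm/dt = k_txn - k_deg * m with m(0) = m0."""
    m_ss = k_txn / k_deg
    return m_ss + (m0 - m_ss) * math.exp(-k_deg * t)


k_txn, k_deg = 2.0, 0.1                      # assumed: mRNA/min and 1/min
print(steady_state_mrna(k_txn, k_deg))       # 20.0 molecules
print(round(half_life(k_deg), 2))            # 6.93 min
# Starting from zero, the level reaches half of steady state after one half-life.
print(round(mrna_at_time(half_life(k_deg), k_txn, k_deg), 6))   # 10.0
```

The last line shows that the half-life also sets the response time: the approach to a new steady state is governed by $k_{\text{deg}}$, not by $k_{\text{txn}}$.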
Transcriptional Bursting (Two-State Model)
Gene expression is inherently stochastic. The two-state (telegraph) model captures transcriptional bursting:
$$G_{\text{OFF}} \underset{k_{\text{off}}}{\overset{k_{\text{on}}}{\rightleftharpoons}} G_{\text{ON}} \xrightarrow{k_{\text{init}}} G_{\text{ON}} + \text{mRNA}$$
$$\text{Burst size} = \frac{k_{\text{init}}}{k_{\text{off}}}, \quad \text{Burst frequency} = k_{\text{on}}$$
$$\text{Fano factor} = \frac{\sigma^2}{\mu} = 1 + \frac{k_{\text{init}}}{k_{\text{off}} + k_{\text{deg}}} > 1 \quad \text{(super-Poissonian; valid for } k_{\text{on}} \ll k_{\text{off}}\text{)}$$
Derivation: Transcriptional Bursting (Two-State Telegraph Model)
Starting from the two-state (telegraph) model of gene expression, we derive the steady-state mean, variance, and mRNA distribution for a bursty promoter.
Step 1: Define the two-state promoter model
A gene switches stochastically between an inactive (OFF) state and an active (ON) state. mRNA is only produced in the ON state and degrades with first-order kinetics:
$$G_{\text{OFF}} \underset{k_{\text{off}}}{\overset{k_{\text{on}}}{\rightleftharpoons}} G_{\text{ON}} \xrightarrow{k_{\text{init}}} G_{\text{ON}} + \text{mRNA}, \quad \text{mRNA} \xrightarrow{k_{\text{deg}}} \varnothing$$
Step 2: Determine the mean ON fraction
At steady state, the fraction of time the promoter is ON:
$$\langle f_{\text{ON}} \rangle = \frac{k_{\text{on}}}{k_{\text{on}} + k_{\text{off}}}$$
The effective mean transcription rate is $k_{\text{eff}} = k_{\text{init}} \cdot \langle f_{\text{ON}} \rangle$.
Step 3: Define burst size and burst frequency
A "burst" is the number of mRNAs produced during a single ON episode. The ON duration is exponentially distributed with mean $1/k_{\text{off}}$. During this time, mRNAs are initiated at rate $k_{\text{init}}$:
$$b = \frac{k_{\text{init}}}{k_{\text{off}}} \quad \text{(mean burst size)}$$
Burst frequency = rate of switching ON: $k_{\text{on}}$.
Step 4: Derive the mean mRNA level
At steady state, production equals degradation:
$$\langle n \rangle = \frac{k_{\text{eff}}}{k_{\text{deg}}} = \frac{k_{\text{init}} \cdot k_{\text{on}}}{(k_{\text{on}} + k_{\text{off}}) \cdot k_{\text{deg}}}$$
Step 5: Derive the variance and Fano factor
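The full analytical solution of the two-state master equation (Peccoud and Ycart, 1995) gives the variance exactly; the corresponding Fano factor is $1 + k_{\text{init}} k_{\text{off}} / [(k_{\text{on}} + k_{\text{off}})(k_{\text{on}} + k_{\text{off}} + k_{\text{deg}})]$. In the bursty limit where the promoter is OFF most of the time ($k_{\text{on}} \ll k_{\text{off}}$), the variance reduces to: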
$$\sigma^2 = \langle n \rangle + \langle n \rangle \cdot \frac{k_{\text{init}}}{k_{\text{off}} + k_{\text{deg}}}$$
The Fano factor (variance/mean) measures deviation from Poisson statistics:
$$F = \frac{\sigma^2}{\langle n \rangle} = 1 + \frac{b}{1 + k_{\text{deg}}/k_{\text{off}}} = 1 + \frac{k_{\text{init}}}{k_{\text{off}} + k_{\text{deg}}}$$
$F = 1$ is Poissonian (constitutive expression). $F > 1$ is super-Poissonian (bursty). The excess noise comes entirely from promoter switching.
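To see how the approximate Fano factor compares with the full two-state result, the sketch below evaluates both. The exact expression used here, $F = 1 + k_{\text{init}} k_{\text{off}} / [(k_{\text{on}} + k_{\text{off}})(k_{\text{on}} + k_{\text{off}} + k_{\text{deg}})]$, is the standard telegraph-model result and is brought in as an assumption; the rate values and function names are illustrative.

```python
def mean_mrna(k_on, k_off, k_init, k_deg):
    """Steady-state mean from Step 4."""
    return k_init * k_on / ((k_on + k_off) * k_deg)


def fano_exact(k_on, k_off, k_init, k_deg):
    """Fano factor of the two-state model without approximation (assumed standard form)."""
    s = k_on + k_off
    return 1.0 + k_init * k_off / (s * (s + k_deg))


def fano_bursty(k_off, k_init, k_deg):
    """Approximation 1 + k_init / (k_off + k_deg), accurate when k_on << k_off."""
    return 1.0 + k_init / (k_off + k_deg)


# Illustrative rates (per minute): the promoter is OFF about 99% of the time.
k_on, k_off, k_init, k_deg = 0.01, 1.0, 20.0, 0.1
print(round(fano_exact(k_on, k_off, k_init, k_deg), 2))   # 18.84
print(round(fano_bursty(k_off, k_init, k_deg), 2))        # 19.18
# A promoter that never switches OFF is constitutive, so F = 1 (Poisson).
print(fano_exact(1.0, 0.0, k_init, k_deg))                # 1.0
```

The two values agree to within about 2% here; the gap widens as $k_{\text{on}}$ approaches $k_{\text{off}}$.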
Step 6: Steady-state mRNA distribution
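The full distribution $P(n)$ is in general a beta-Poisson mixture. In the bursty limit of short, intense ON episodes ($k_{\text{off}} \gg k_{\text{on}}, k_{\text{deg}}$, with $k_{\text{init}}/k_{\text{off}}$ held fixed), it reduces to a negative binomial distribution: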
$$P(n) = \frac{\Gamma(a + n)}{\Gamma(a) \cdot n!} \cdot \frac{b^n}{(1+b)^{a+n}}$$
where $a = k_{\text{on}}/k_{\text{deg}}$ is the normalized burst frequency and $b = k_{\text{init}}/k_{\text{off}}$ is the burst size. This distribution is broader than Poisson, with a characteristic "bursty" shape featuring a peak near zero and a long tail, frequently observed in single-molecule FISH experiments.
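The distribution above can be evaluated directly. The sketch below computes $P(n)$ in log space (to avoid overflow in the gamma functions) and checks numerically that it is normalized, with mean $ab$ and Fano factor $1 + b$. The values $a = 0.5$, $b = 10$ and the function name are illustrative assumptions.

```python
import math


def nb_pmf(n, a, b):
    """P(n) = Gamma(a+n) / (Gamma(a) n!) * b**n / (1+b)**(a+n), computed via logs."""
    log_p = (math.lgamma(a + n) - math.lgamma(a) - math.lgamma(n + 1)
             + n * math.log(b) - (a + n) * math.log(1.0 + b))
    return math.exp(log_p)


a, b = 0.5, 10.0                       # assumed burst frequency and burst size
probs = [nb_pmf(n, a, b) for n in range(3000)]   # tail beyond 3000 is negligible

total = sum(probs)
mean = sum(n * p for n, p in enumerate(probs))
var = sum(n * n * p for n, p in enumerate(probs)) - mean ** 2

print(round(total, 6))                 # 1.0
print(round(mean, 6))                  # 5.0  (= a * b)
print(round(var / mean, 6))            # 11.0 (= 1 + b)
print(round(probs[0], 4))              # 0.3015: a large fraction of cells have no mRNA
```

With $a < 1$ the mode sits at $n = 0$, which is the peak-near-zero, long-tailed shape described in the text.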
Elongation Time and Polymerase Density
$$t_{\text{elong}} = \frac{L}{v_{\text{elong}}}, \quad \rho_{\text{RNAP}} = \frac{k_{\text{init}}}{v_{\text{elong}}}$$
where $L$ is gene length (nt), $v_{\text{elong}}$ is elongation speed (nt/s), and $\rho_{\text{RNAP}}$ is the linear density of polymerases on the gene (RNAP/nt). For rRNA genes, $\rho$ can be as high as 1 RNAP per ~80 nt.
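A short numerical illustration of these two relations follows. The gene length, elongation speed, and initiation rate are assumed values chosen only to make the arithmetic concrete; the function names are likewise illustrative.

```python
def elongation_time(length_nt, v_elong):
    """Time for one polymerase to traverse the gene (s)."""
    return length_nt / v_elong


def rnap_density(k_init, v_elong):
    """Linear polymerase density (RNAP per nt) at steady state."""
    return k_init / v_elong


def rnap_spacing(k_init, v_elong):
    """Average centre-to-centre spacing between polymerases (nt per RNAP)."""
    return v_elong / k_init


# Assumed example: a 10,000-nt gene transcribed at 25 nt/s.
print(elongation_time(10_000, 25.0))             # 400.0 s

# Assumed example: one initiation every 2 s at 40 nt/s gives 80-nt spacing,
# the order of magnitude quoted above for densely loaded rRNA genes.
print(rnap_spacing(0.5, 40.0))                   # 80.0 nt per RNAP
print(rnap_density(0.5, 40.0) * 10_000)          # 125.0 polymerases on a 10,000-nt gene
```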
Transcription Regulation Preview
Transcription is the most frequently regulated step in gene expression. A full treatment follows in Part 7 (Gene Regulation), but the key players are introduced here:
Activators
Sequence-specific DNA-binding proteins that enhance transcription by recruiting Pol II/GTFs to the promoter or stimulating PIC assembly. They bind upstream activating sequences (UAS in yeast) or enhancers (metazoans). Activation domains (ADs) are typically acidic, glutamine-rich, or proline-rich intrinsically disordered regions. Examples: GAL4 (yeast), p53, NF-$\kappa$B, SP1.
Repressors
Inhibit transcription by: (1) competing with activators for DNA binding, (2) masking activation domains, (3) recruiting corepressors/HDACs to create repressive chromatin, (4) directly inhibiting PIC assembly. Examples: lac repressor (prokaryotic), REST/NRSF (neuronal genes), Rb (cell cycle).
Enhancers
Cis-regulatory DNA elements that can activate transcription from distances of 10-1000 kb (even on different chromosomes in rare cases). They function orientation-independently and are thought to contact promoters via DNA looping, facilitated by cohesin and CTCF. Enhancers are marked by H3K4me1 and H3K27ac. Super-enhancers are large clusters that drive cell-identity genes.
Mediator Complex
A ~30-subunit coactivator complex that bridges activators/repressors and Pol II/GTFs. The Mediator is essential for virtually all Pol II transcription in vivo. It consists of head (contacts Pol II), middle (scaffold), tail (contacts activators), and kinase (CDK8, regulatory) modules. Mediator integrates diverse regulatory signals into a unified transcriptional output.
Types of RNA
Python: Transcription Kinetics Simulation
This simulation models stochastic transcription with a two-state (ON/OFF) promoter, capturing the phenomenon of transcriptional bursting. The promoter stochastically switches between active and inactive states. When ON, RNA polymerase initiates at a Poisson-distributed rate. Each transcript requires a finite elongation time before becoming mature mRNA, which then decays with first-order kinetics.
Stochastic Transcription Kinetics with Bursting
Python: Simulate two-state promoter model with initiation, elongation delay, and mRNA degradation
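A minimal sketch of such a simulation is given below. It is a Gillespie-style implementation written under stated assumptions and is not the page's original listing: rates are per minute, the elongation delay is a fixed $L / v_{\text{elong}}$ per transcript, all parameter values are illustrative, and the function name is hypothetical.

```python
import heapq
import random


def simulate_bursting(k_on=0.05, k_off=0.5, k_init=5.0, k_deg=0.1,
                      gene_length=3000.0, v_elong=1500.0,
                      t_end=50000.0, seed=1):
    """Two-state promoter with a fixed elongation delay (assumed parameters).

    Returns the time-averaged mean and Fano factor of mature mRNA.
    """
    rng = random.Random(seed)
    tau = gene_length / v_elong          # elongation delay per transcript (min)
    t, gene_on, n = 0.0, False, 0
    nascent = []                         # min-heap of transcript completion times
    w = s1 = s2 = 0.0                    # time-weighted sums of 1, n, n^2

    while t < t_end:
        rates = [
            0.0 if gene_on else k_on,    # OFF -> ON
            k_off if gene_on else 0.0,   # ON -> OFF
            k_init if gene_on else 0.0,  # initiation (ON state only)
            k_deg * n,                   # degradation of mature mRNA
        ]
        total = sum(rates)
        t_next = t + rng.expovariate(total)

        # A nascent transcript finishing first pre-empts the sampled reaction;
        # exponential waiting times are memoryless, so redrawing is valid.
        finishes_first = bool(nascent) and nascent[0] <= t_next
        t_new = nascent[0] if finishes_first else t_next

        dt = t_new - t
        w += dt
        s1 += n * dt
        s2 += n * n * dt
        t = t_new

        if finishes_first:
            heapq.heappop(nascent)
            n += 1                       # transcript becomes mature mRNA
            continue

        r = rng.random() * total
        if r < rates[0]:
            gene_on = True
        elif r < rates[0] + rates[1]:
            gene_on = False
        elif r < rates[0] + rates[1] + rates[2]:
            heapq.heappush(nascent, t + tau)   # start elongating a new transcript
        elif n > 0:
            n -= 1                       # one mature mRNA is degraded

    mean = s1 / w
    fano = (s2 / w - mean ** 2) / mean
    return mean, fano


mean, fano = simulate_bursting()
print(f"mean mRNA = {mean:.2f}, Fano factor = {fano:.2f}")
```

Because the delay is the same for every transcript, it shifts mRNA production in time without changing the steady-state statistics, so the simulated mean can be compared with $k_{\text{init}} k_{\text{on}} / [(k_{\text{on}} + k_{\text{off}}) k_{\text{deg}}] \approx 4.5$ for these values, and the Fano factor should come out well above 1.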
Fortran: Splicing Combinatorics Calculator
This Fortran program calculates the theoretical number of possible mRNA isoforms from a gene with N exons under different models of alternative splicing: independent exon skipping, mutually exclusive clusters (Dscam-style), combined alternative splice sites, intron retention, and full combinatorial complexity.
Alternative Splicing Combinatorics
Fortran: Calculate possible mRNA isoforms under different splicing models
Key Concepts Summary
Prokaryotic Transcription
- Single RNAP ($\alpha_2\beta\beta'\omega$) + $\sigma$ factor
- Promoter: -35 (TTGACA) + -10 (TATAAT)
- Closed complex → open complex → elongation
- Termination: intrinsic (hairpin + poly-U) or Rho-dependent
- Speed: ~40-80 nt/s; no post-transcriptional processing
Eukaryotic Transcription
- Three polymerases: Pol I (rRNA), Pol II (mRNA), Pol III (tRNA)
- GTFs: TFIID, TFIIB, TFIIF, TFIIE, TFIIH → PIC
- CTD code: Ser5-P (capping) → Ser2-P (elongation)
- Promoter-proximal pausing: DSIF/NELF → P-TEFb release
- Speed: ~20-30 nt/s; coupled to RNA processing
RNA Processing
- 5′ cap: m$^7$GpppN via guanylyltransferase
- Splicing: spliceosome (U1, U2, U4, U5, U6 snRNPs), lariat intermediate
- Two transesterification reactions (branch-point A attack)
- Alternative splicing: exon skip, alt SS, intron retention
- 3′ processing: AAUAAA, CPSF/CstF, ~200 nt poly(A) tail
Quantitative Relationships
- $[\text{mRNA}]_{\text{ss}} = k_{\text{txn}}/k_{\text{deg}}$
- $t_{1/2} = \ln 2 / k_{\text{deg}}$
- Burst size = $k_{\text{init}}/k_{\text{off}}$
- Fano factor > 1 indicates bursty expression
- Dscam: 12 $\times$ 48 $\times$ 33 $\times$ 2 = 38,016 isoforms