Part 5: Transcription

Overview: DNA to RNA

Transcription is the synthesis of an RNA molecule from a DNA template. It is the first major step in gene expression — the process by which the information stored in a gene's nucleotide sequence is converted into a functional product (protein or functional RNA). RNA polymerase (RNAP) reads the template strand of DNA in the 3′ → 5′ direction while synthesizing the RNA transcript in the 5′ → 3′ direction, adding ribonucleoside triphosphates (NTPs) complementary to the template.

The newly synthesized RNA is identical in sequence to the coding strand (also called the non-template or sense strand), except that uracil (U) replaces thymine (T). Unlike DNA replication, transcription does not require a primer and copies only selected portions of the genome — specific genes or operons rather than the entire chromosome.

The Three Phases of Transcription

1. Initiation

RNAP binds the promoter, melts the DNA duplex to form the transcription bubble, and begins RNA synthesis. This is the most highly regulated phase.

2. Elongation

RNAP moves along the template, extending the RNA chain at ~40-80 nt/s (prokaryotes) or ~20-30 nt/s (eukaryotes), maintaining the transcription bubble.

3. Termination

Specific signals cause RNAP to halt, release the completed RNA transcript, and dissociate from the DNA template.

Template vs. Coding Strand

5′-...A T G C C G T A A G G C T T A...-3′ (Coding / Sense strand)

3′-...T A C G G C A T T C C G A A T...-5′ (Template / Antisense strand)

5′-...A U G C C G U A A G G C U U A...-3′ (mRNA transcript)

The mRNA sequence matches the coding strand (with U replacing T). RNAP reads the template strand.
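The base-pairing rule can be checked with a short Python sketch. The function names are illustrative, and the template is assumed to be written 3′→5′ as in the diagram, so the transcript comes out 5′→3′ without reversing the string.

```python
# Complement map from DNA template bases to the RNA bases they specify.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe_template(template_3to5: str) -> str:
    """Return the mRNA (5'->3') for a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5.upper())

def coding_to_mrna(coding_5to3: str) -> str:
    """Return the mRNA (5'->3') implied by the coding strand: T becomes U."""
    return coding_5to3.upper().replace("T", "U")

template = "TACGGCATTCCGAAT"  # 3'->5', from the diagram
coding = "ATGCCGTAAGGCTTA"    # 5'->3', from the diagram
mrna = transcribe_template(template)
print(mrna)
```

Both routes give the same transcript, which is the point of the diagram: reading the template and copying the coding strand (with U for T) are equivalent descriptions.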

Prokaryotic Transcription

RNA Polymerase: Structure and Function

Bacteria possess a single RNA polymerase responsible for synthesizing all classes of RNA (mRNA, rRNA, tRNA). The enzyme is a large multi-subunit complex with a distinctive "crab claw" shape, enclosing an internal channel that accommodates ~17 bp of DNA.

Core Enzyme ($\alpha_2\beta\beta'\omega$)

$\alpha$ subunits (x2): Assembly platform; $\alpha$-CTD contacts UP element upstream of promoter
$\beta$ subunit: Contains catalytic center; binds NTP substrates; rifampicin binding site
$\beta'$ subunit: Binds DNA template; contains the two catalytic Mg$^{2+}$ ions
$\omega$ subunit: Structural role; assists $\beta'$ folding and assembly

Holoenzyme ($\alpha_2\beta\beta'\omega\sigma$)

$\sigma$ factor: Required for promoter recognition; dissociates after initiation
$\sigma^{70}$: Housekeeping genes (most common in E. coli)
$\sigma^{32}$: Heat shock response genes
$\sigma^{54}$: Nitrogen metabolism genes (unique mechanism)
$\sigma^{38}$ (RpoS): Stationary phase / stress response

Two-Metal-Ion Catalytic Mechanism

The active site contains two Mg$^{2+}$ ions coordinated by conserved aspartate residues in the $\beta'$ subunit. Metal A activates the 3′-OH of the growing RNA chain for nucleophilic attack on the $\alpha$-phosphate of the incoming NTP. Metal B stabilizes the pyrophosphate leaving group and helps position the NTP substrate.

$$\text{RNA}_{n} + \text{NTP} \xrightarrow{2\text{Mg}^{2+}} \text{RNA}_{n+1} + \text{PP}_i$$

Pyrophosphate (PP$_i$) is subsequently hydrolyzed by pyrophosphatase, making the reaction effectively irreversible ($\Delta G \approx -7$ kcal/mol per nucleotide addition when coupled with PP$_i$ hydrolysis).

Promoter Recognition

Prokaryotic promoters are defined by conserved sequence elements upstream of the transcription start site (+1). The $\sigma$ factor of the holoenzyme makes specific contacts with two key elements:

Consensus Promoter Elements ($\sigma^{70}$)

Position: -35 -10 +1

TTGACA ...17 bp spacer... TATAAT ... +1

-35 element (TTGACA): Recognized by $\sigma$ region 4.2. Initial contact point for holoenzyme. Critical for promoter strength.
-10 element / Pribnow box (TATAAT): Recognized by $\sigma$ region 2.4. AT-rich to facilitate strand separation (only 2 H-bonds per A-T vs 3 for G-C). The site of initial DNA melting.
Spacer (17 bp optimal): Distance between -35 and -10 elements is critical. Deviations from 17 bp reduce promoter strength because both $\sigma$ contacts must be on the same face of the DNA helix.
UP element (-40 to -60): AT-rich sequence contacted by $\alpha$-CTD. Enhances transcription up to 30-fold (e.g., rrnB P1 promoter for ribosomal RNA).
Extended -10 (TGn): Some promoters lack a recognizable -35 element and instead have a TGn motif immediately upstream of the -10 box, contacted by $\sigma$ region 3.0.
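As a rough illustration of how closeness to consensus might be scored, the sketch below counts matches to the -35 and -10 hexamers over a range of spacer lengths. The function names, the 15-19 bp spacer window, and the equal weighting of all 12 positions are assumptions made for this example; real promoter strength also depends on the UP element, extended -10 motifs, and position-specific effects.

```python
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"

def matches(site: str, consensus: str) -> int:
    """Count positions where the site agrees with the consensus."""
    return sum(a == b for a, b in zip(site, consensus))

def best_promoter_hit(seq: str, spacers=range(15, 20)):
    """Scan seq (non-template strand, 5'->3') for the best -35/-10 pair.

    Returns (score, start_of_minus35, spacer), where score is the number
    of consensus matches out of 12. Ties are broken toward a 17 bp spacer.
    """
    seq = seq.upper()
    best = (-1, -1, -1)
    for i in range(len(seq) - 12 - min(spacers) + 1):
        for spacer in spacers:
            j = i + 6 + spacer  # start of the candidate -10 hexamer
            if j + 6 > len(seq):
                continue
            score = matches(seq[i:i + 6], MINUS_35) + matches(seq[j:j + 6], MINUS_10)
            key = (score, -abs(spacer - 17))
            if key > (best[0], -abs(best[2] - 17)):
                best = (score, i, spacer)
    return best

# Toy sequence: perfect consensus hexamers separated by 17 unspecified bases.
consensus_promoter = "GG" + MINUS_35 + "N" * 17 + MINUS_10 + "GGGGGGA"
print(best_promoter_hit(consensus_promoter))
```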

Promoter Strength

Promoter strength varies over a 1000-fold range and is determined by how closely the sequence matches the consensus. The strongest promoters (like rrnB P1) initiate transcription every ~1-2 seconds, while weak promoters may fire only once every 10-30 minutes. This can be quantified as a catalytic efficiency:

$$\text{Promoter strength} \propto \frac{k_{\text{cat}}}{K_m} = \frac{k_{\text{init}}}{K_d^{\text{RNAP-promoter}}}$$

where $k_{\text{init}}$ is the rate of productive initiation and $K_d$ is the dissociation constant for RNAP-promoter binding.

Initiation

Transcription initiation is a multi-step process that converts the initial RNAP-promoter encounter into a productively elongating complex. It is the primary point of regulation for most genes.

Step 1: Closed Complex (RP$_c$)

The holoenzyme ($\alpha_2\beta\beta'\omega\sigma$) binds the double-stranded promoter DNA. $\sigma$ region 4.2 contacts the -35 element and $\sigma$ region 2.4 contacts the -10 element. DNA remains fully base-paired. This is a reversible, relatively rapid step. The holoenzyme slides along DNA and can scan ~2000 bp in a single binding event.

Step 2: Open Complex (RP$_o$)

The DNA duplex melts (unwinds) over a region of approximately 14 bp from roughly -11 to +3, forming the transcription bubble. The template strand descends into the active site channel. This isomerization is essentially irreversible at most promoters (for $\sigma^{70}$) and is the rate-limiting step. The AT-rich Pribnow box facilitates melting. Aromatic residues in $\sigma$ region 2.3 stabilize the single-stranded non-template DNA.

Step 3: Abortive Initiation

RNAP begins synthesizing short RNA transcripts (2-12 nt) but repeatedly releases them and reinitiates without leaving the promoter. Synthesis during these cycles proceeds by "scrunching": RNAP stays anchored at the promoter and pulls downstream DNA into the enzyme, storing energy as stressed single-stranded DNA. Typically 5-10 abortive cycles occur before productive elongation begins.

Step 4: Promoter Escape

Once the RNA reaches ~10-12 nt, the accumulated scrunching stress is released, breaking the $\sigma$-promoter contacts. The $\sigma$ factor is released (or repositioned), converting the initiation complex into a stable elongation complex (EC). The RNA-DNA hybrid in the EC is ~8-9 bp. This transition is the commitment point — the enzyme is now processive and will transcribe the entire gene.

Kinetic Scheme

$$R + P \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} RP_c \xrightarrow{k_2} RP_o \xrightarrow{k_3} \text{EC} + \sigma$$

R = RNAP holoenzyme, P = promoter, RP$_c$ = closed complex, RP$_o$ = open complex, EC = elongation complex. The rate of transcript production depends on the slowest step: $k_2$ (isomerization) for most promoters.

Derivation: Transcription Initiation Rate from Thermodynamic Promoter Binding

Starting from the thermodynamic model of RNAP-promoter interaction, we derive the rate of productive transcription initiation as a function of RNAP concentration and promoter binding affinity.

Step 1: Equilibrium binding of RNAP to promoter (closed complex)

The formation of the closed complex follows a binding equilibrium:

$$R + P \rightleftharpoons RP_c, \quad K_B = \frac{[RP_c]}{[R][P]} = \frac{k_1}{k_{-1}}$$

The fraction of promoters occupied by RNAP at equilibrium:

$$f_{\text{bound}} = \frac{[R]}{K_d + [R]} = \frac{K_B[R]}{1 + K_B[R]}$$

Step 2: Irreversible isomerization to open complex

The closed complex undergoes an essentially irreversible isomerization (for strong promoters) to the open complex with rate constant $k_2$. This is the rate-limiting step for most promoters:

$$RP_c \xrightarrow{k_2} RP_o$$

The free energy for melting ~14 bp of DNA at the Pribnow box is provided by favorable RNAP-ssDNA contacts.

Step 3: Apply the steady-state approximation to RP$_c$

Setting $d[RP_c]/dt = 0$:

$$k_1[R][P] = (k_{-1} + k_2)[RP_c]$$

The effective initiation rate, with $[P]$ the free promoter concentration as in the balance above (dividing by $[P]$ gives transcripts per free promoter per second):

$$J_{\text{init}} = k_2[RP_c] = \frac{k_1 k_2 [R]}{k_{-1} + k_2} \cdot [P]$$

Step 4: Simplify for different promoter strengths

Strong promoters ($k_2 \gg k_{-1}$): isomerization is fast once RNAP binds, so binding is rate-limiting:

$$J_{\text{init}} \approx k_1[R] \quad \text{(diffusion-limited)}$$

Weak promoters ($k_{-1} \gg k_2$): RNAP frequently dissociates before isomerization, so:

$$J_{\text{init}} \approx K_B \cdot k_2 \cdot [R] = \frac{k_1 k_2}{k_{-1}}[R]$$
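The two limits can be checked numerically against the full steady-state expression. This is a minimal sketch: the function name and all rate constants are illustrative assumptions, chosen only so that one case has $k_2 \gg k_{-1}$ and the other $k_{-1} \gg k_2$.

```python
def initiation_rate(k1: float, k_minus1: float, k2: float, rnap: float) -> float:
    """Steady-state initiation rate per free promoter (s^-1).

    k1 in M^-1 s^-1; k_minus1 and k2 in s^-1; rnap is free [R] in M.
    """
    return k1 * k2 * rnap / (k_minus1 + k2)

R = 30e-9   # free RNAP concentration in M (illustrative)
K1 = 1e8    # association rate constant in M^-1 s^-1 (illustrative)

# Strong-promoter limit (k2 >> k_-1): the rate approaches k1*[R].
strong = initiation_rate(K1, k_minus1=0.01, k2=10.0, rnap=R)
strong_limit = K1 * R

# Weak-promoter limit (k_-1 >> k2): the rate approaches K_B*k2*[R].
weak = initiation_rate(K1, k_minus1=10.0, k2=0.01, rnap=R)
weak_limit = (K1 / 10.0) * 0.01 * R

print(strong, strong_limit)
print(weak, weak_limit)
```

With a 1000-fold separation between $k_2$ and $k_{-1}$, each limiting formula agrees with the full expression to within about 0.1%.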

Step 5: Thermodynamic model with multiple competing promoters

In a cell with $N$ promoters competing for a limited pool of RNAP, the probability of occupancy at promoter $i$ is given by the Boltzmann weight of its binding energy $\Delta G_i$:

$$p_i = \frac{e^{-\Delta G_i / k_B T}}{1 + \sum_{j=1}^{N} e^{-\Delta G_j / k_B T}}$$

This thermodynamic framework explains the 1000-fold range of promoter strengths observed in E. coli.
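A small sketch of the occupancy formula, with binding energies expressed in units of $k_B T$. The function name and the example energies are assumptions for illustration. Because the occupancy ratio of two promoters is $e^{-(\Delta G_i - \Delta G_j)/k_B T}$, a difference of $\ln(1000) \approx 6.9\,k_B T$ is enough to give a 1000-fold ratio.

```python
import math

def occupancy_probabilities(delta_g_kT):
    """Occupancy p_i for binding energies given in units of k_B*T.

    The leading 1 in the partition sum is the weight of the unbound state,
    so the probabilities sum to less than 1.
    """
    weights = [math.exp(-g) for g in delta_g_kT]
    z = 1.0 + sum(weights)
    return [w / z for w in weights]

# Two hypothetical promoters whose binding energies differ by ln(1000) k_B*T.
strong_dg = -math.log(1000.0)
weak_dg = 0.0
p_strong, p_weak = occupancy_probabilities([strong_dg, weak_dg])
print(p_strong / p_weak)
```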

Step 6: Numerical example

For the lacUV5 promoter: $K_B \approx 10^7$ M$^{-1}$, $k_2 \approx 0.1$ s$^{-1}$. With [RNAP] $\approx 30$ nM free in E. coli:

$$J_{\text{init}} = \frac{K_B \cdot k_2 \cdot [R]}{1 + K_B[R]} = \frac{10^7 \times 0.1 \times 3 \times 10^{-8}}{1 + 10^7 \times 3 \times 10^{-8}} \approx 0.023 \text{ s}^{-1}$$

This corresponds to ~1 transcript initiated every 43 seconds, consistent with measured rates.
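The arithmetic can be reproduced in a few lines. Note that the expression used here is the equilibrium occupancy $f_{\text{bound}}$ from Step 1 multiplied by $k_2$, which assumes binding equilibrates quickly relative to isomerization. The function name is illustrative.

```python
def occupancy_limited_rate(K_B: float, k2: float, rnap: float) -> float:
    """Initiation rate (s^-1) as f_bound * k2, with f_bound = K_B[R]/(1 + K_B[R])."""
    return K_B * k2 * rnap / (1.0 + K_B * rnap)

# lacUV5 values quoted in the text.
rate = occupancy_limited_rate(K_B=1e7, k2=0.1, rnap=30e-9)
interval = 1.0 / rate  # mean time between initiations, in seconds
print(rate, interval)
```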

Elongation

The elongation complex (EC) is remarkably stable and processive, capable of transcribing genes spanning hundreds of kilobases without dissociating. It maintains a characteristic structural organization:

Transcription Bubble Architecture

Transcription bubble: ~17 bp of unwound DNA
RNA-DNA hybrid: ~8 bp at the active site; peeled off by the lid domain
RNA exit channel: Single-stranded RNA exits through a channel between the $\beta$ flap domain and the lid
Downstream DNA: Double-stranded DNA enters through the main channel (jaws/clamp)

NTP Addition Cycle

Each nucleotide addition follows a conserved cycle:

NTP binding: Complementary NTP enters through the secondary channel and binds at the active site (trigger loop open)
Active site closure: Trigger loop folds, positioning the NTP for catalysis and checking base-pair complementarity (fidelity checkpoint)
Phosphodiester bond formation: Nucleophilic attack of the 3′-OH on the $\alpha$-phosphate; PP$_i$ release
Translocation: RNAP moves forward by 1 bp, driven by a Brownian ratchet mechanism; the trigger loop reopens

Elongation Speed

$$v_{\text{elong}} = k_{\text{cat}} \approx 40\text{-}80 \text{ nt/s (prokaryotes)}$$

Speed varies with sequence context, NTP concentrations, and pausing. The net transcription rate over an operon can be expressed as:

$$\langle v \rangle = \frac{L}{t_{\text{pause}} + L/v_{\text{max}}}$$

where L = gene length, $t_{\text{pause}}$ = cumulative pause time
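A minimal sketch of the pause-corrected speed. The function name and the example values (a 3000 nt transcript, 80 nt/s maximal speed, 37.5 s of cumulative pausing) are assumptions chosen so that pausing exactly doubles the transit time.

```python
def mean_velocity(length_nt: float, v_max: float, t_pause: float) -> float:
    """Average elongation speed (nt/s) over a transcript of length_nt.

    v_max is the pause-free speed in nt/s; t_pause is the cumulative
    pause time in seconds.
    """
    return length_nt / (t_pause + length_nt / v_max)

# Pause-free transit takes 3000/80 = 37.5 s; equal time spent pausing halves the speed.
v_avg = mean_velocity(length_nt=3000, v_max=80.0, t_pause=37.5)
print(v_avg)
```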

Derivation: RNA Polymerase Elongation from the Brownian Ratchet Model

We start from the idea that RNAP does not actively “walk” along DNA but instead undergoes thermal diffusion (Brownian motion) between pre- and post-translocated states, with the motion rectified by NTP binding, and derive the resulting elongation rate.

Step 1: Define the two translocation states

After phosphodiester bond formation, RNAP fluctuates between the pre-translocated state (active site occupied by the 3′ end of the RNA) and the post-translocated state (active site empty, ready for next NTP). The forward and backward rates are governed by thermal energy:

$$\text{Pre} \underset{k_-}{\overset{k_+}{\rightleftharpoons}} \text{Post}$$

In a simple Brownian ratchet, $k_+ \approx k_-$ (no intrinsic directional bias).

Step 2: NTP binding rectifies the ratchet

When RNAP is in the post-translocated state, the incoming NTP binds to the open active site with rate $k_{\text{bind}}[\text{NTP}]$. This locks RNAP in the forward position, preventing backward translocation. The NTP acts as a “pawl” on the ratchet:

$$\text{Post} + \text{NTP} \xrightarrow{k_{\text{bind}}} \text{Post-NTP} \xrightarrow{k_{\text{chem}}} \text{Pre}_{n+1} + \text{PP}_i$$

Step 3: Derive the steady-state elongation rate

In the absence of NTP, the equilibrium fraction of time spent in the post-translocated state is $f_{\text{post}} = k_+/(k_+ + k_-)$. With NTP binding and chemistry included, solving the steady state of the three-state cycle (Pre, Post, Post-NTP) gives the rate of NTP incorporation:

$$v_{\text{elong}} = \frac{k_+ \cdot k_{\text{bind}}[\text{NTP}] \cdot k_{\text{chem}}}{k_{\text{chem}}\,(k_+ + k_- + k_{\text{bind}}[\text{NTP}]) + k_+ \cdot k_{\text{bind}}[\text{NTP}]}$$

Step 4: Simplify at saturating NTP

When $[\text{NTP}] \to \infty$, NTP binding is effectively instantaneous and the rate is set by forward translocation and chemistry acting in series:

$$v_{\max} = \frac{k_+ \cdot k_{\text{chem}}}{k_+ + k_{\text{chem}}}$$

If chemistry is fast ($k_{\text{chem}} \gg k_+$): $v_{\max} \approx k_+$ (translocation-limited). If translocation is fast ($k_+ \gg k_{\text{chem}}$): $v_{\max} \approx k_{\text{chem}}$ (chemistry-limited).

Step 5: Backtracking as a competing pathway

RNAP can also slide backward by multiple base pairs (backtracking), extruding the 3′ RNA end through the secondary channel. This is favored at pause/arrest sites and acts as an off-pathway state:

$$\text{Pre} \xrightarrow{k_{\text{bt}}} \text{Backtracked} \xrightarrow{k_{\text{cleavage}}} \text{Pre}$$

The effective elongation rate including backtracking pauses:

$$\langle v \rangle = v_{\text{elong}} \cdot \frac{k_{\text{cleavage}}}{k_{\text{cleavage}} + k_{\text{bt}} \cdot f_{\text{pause}}}$$

Step 6: Numerical estimates

For E. coli RNAP: $k_+ \approx 200$ s$^{-1}$, $k_- \approx 100$ s$^{-1}$ (slight forward bias from RNA-DNA hybrid stability), $k_{\text{chem}} \approx 300$ s$^{-1}$. This gives $v_{\max} \approx 120$ nt/s for pure elongation. The observed rate of ~40-80 nt/s reflects frequent pausing events (ubiquitous pausing at ~1-5% of template positions for 1-10 seconds each).
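The estimate can be reproduced with a short sketch that solves the three-state cycle from Steps 1 and 2 (Pre, Post, Post-NTP) at steady state, treating NTP binding and chemistry as irreversible as in the scheme. The function names are illustrative, and `k_bind_ntp` stands for the pseudo-first-order product $k_{\text{bind}}[\text{NTP}]$, for which the text gives no value; the value below is an arbitrary large number used to approach saturation.

```python
def elongation_rate(k_plus: float, k_minus: float, k_chem: float,
                    k_bind_ntp: float) -> float:
    """Steady-state flux (nt/s) through Pre <-> Post -> Post-NTP -> Pre(n+1).

    All arguments are in s^-1; k_bind_ntp = k_bind * [NTP].
    """
    numerator = k_plus * k_bind_ntp * k_chem
    denominator = k_chem * (k_plus + k_minus + k_bind_ntp) + k_plus * k_bind_ntp
    return numerator / denominator

def saturating_rate(k_plus: float, k_chem: float) -> float:
    """Limit of elongation_rate as k_bind_ntp goes to infinity."""
    return k_plus * k_chem / (k_plus + k_chem)

v_max = saturating_rate(k_plus=200.0, k_chem=300.0)
v_near_saturation = elongation_rate(200.0, 100.0, 300.0, k_bind_ntp=1e9)
print(v_max, v_near_saturation)
```

At saturating NTP the backward rate $k_-$ drops out, which is the ratchet picture in miniature: once NTP binding is fast enough, every forward fluctuation is captured.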

Proofreading

RNAP lacks the 3′→5′ exonuclease of DNA polymerase, but achieves ~10$^{-4}$ to 10$^{-5}$ error rate through: (1) Kinetic selection — mismatched NTPs have slower trigger loop closure; (2) Backtracking — RNAP slides backward, extruding the 3′ end through the secondary channel, which can be cleaved by intrinsic endonuclease activity or the GreA/GreB factors (analogous to eukaryotic TFIIS).

Termination

Prokaryotic transcription termination occurs by two distinct mechanisms. Both ultimately destabilize the elongation complex, causing release of the RNA transcript and dissociation of RNAP from the DNA template.

Rho-Independent (Intrinsic) Termination

Accounts for ~50% of E. coli terminators. Requires two elements in the RNA:

GC-rich hairpin: A palindromic sequence forms a stable stem-loop structure ($\Delta G \approx -10$ to $-20$ kcal/mol) in the RNA exit channel. The hairpin physically disrupts the RNA-DNA hybrid and destabilizes the elongation complex by pulling on the RNA within the exit channel.
Poly-U tract: A run of ~7-9 uridines immediately follows the hairpin. The rU-dA hybrid is the weakest of all base-pair combinations ($T_m$ depressed), providing minimal resistance to transcript dissociation. Together, the hairpin + poly-U cause RNAP to pause and release.

Rho-Dependent Termination

Requires the Rho ($\rho$) protein, an ATP-dependent RNA translocase/helicase:

Rho structure: Homohexameric ring (6 identical subunits), encircles single-stranded RNA, translocates 5′→3′ along the transcript using ATP hydrolysis
rut site (Rho utilization): A C-rich, G-poor, unstructured region (~70-80 nt) in the RNA where Rho initially binds. Lack of secondary structure allows Rho loading.
Mechanism: Rho binds rut site, threads RNA through its central pore, translocates toward RNAP. When it catches a paused RNAP at a termination site, Rho's helicase activity unwinds the RNA-DNA hybrid, releasing the transcript.
Bicyclomycin: Antibiotic that specifically inhibits Rho, demonstrating the essential nature of Rho-dependent termination.

Eukaryotic Transcription

Three Nuclear RNA Polymerases

Unlike prokaryotes, eukaryotes employ three distinct nuclear RNA polymerases (Pol I, II, III) for different classes of RNA, each with 12-17 subunits. Plants have two additional polymerases, Pol IV and Pol V, which function in siRNA-directed DNA methylation.

Polymerase	Location	RNA products	Share of total transcription	$\alpha$-amanitin sensitivity
Pol I	Nucleolus	18S, 5.8S, 28S rRNA (45S precursor)	~50-70%	Resistant
Pol II	Nucleoplasm	mRNA, snRNA, miRNA, lncRNA	~20-40%	Highly sensitive (inhibited at low nM concentrations)
Pol III	Nucleoplasm	tRNA, 5S rRNA, U6 snRNA, 7SL RNA	~10%	Moderately sensitive (inhibited at high $\mu$M concentrations)

$\alpha$-amanitin (from Amanita phalloides mushroom) is the classic tool for distinguishing the polymerases. It binds the bridge helix/trigger loop of Pol II, blocking translocation.

RNA Polymerase II Promoter Architecture

Pol II promoters are far more complex and variable than prokaryotic promoters. Core promoter elements span roughly -40 to +40 relative to the transcription start site (+1). No single element is universal — different combinations are found at different genes.

Core Promoter Elements

TATA box (consensus: TATA(A/T)A(A/T)): Located at -25 to -30. Bound by TBP (TATA-binding protein), a subunit of TFIID. Present in only ~10-20% of human promoters (mostly tissue-specific, highly regulated genes). Determines the position of +1.
Initiator (Inr, consensus: Py-Py-A+1-N-T/A-Py-Py): Spans the +1 start site. Bound by TAF1/TAF2 subunits of TFIID. Can function independently of TATA box.
Downstream Promoter Element (DPE): Located at +28 to +34. Bound by TAF6/TAF9. Common in TATA-less promoters (Drosophila). Works in strict spacing with Inr.
BRE (TFIIB Recognition Element): Flanks the TATA box (BRE$^u$ upstream, BRE$^d$ downstream). Directly contacts TFIIB. Can positively or negatively regulate transcription.
MTE (Motif Ten Element): Located at +18 to +27. Works synergistically with Inr, DPE, and TATA box.
CpG islands: ~70% of human gene promoters are associated with CpG islands — GC-rich, ~1 kb regions that are typically TATA-less with dispersed transcription start sites.
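The TATA-box consensus TATA(A/T)A(A/T) maps directly onto a regular expression. The sketch below is only a string search under that assumption; the function name is illustrative, and it ignores the other core elements and the position of a match relative to +1.

```python
import re

# Lookahead form so that overlapping matches are all reported.
TATA_PATTERN = re.compile(r"(?=(TATA[AT]A[AT]))")

def find_tata_boxes(seq: str):
    """Return 0-based start indices of TATA(A/T)A(A/T) matches in a DNA string."""
    return [m.start() for m in TATA_PATTERN.finditer(seq.upper())]

print(find_tata_boxes("GGCTATAAAAGGC"))
```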

General Transcription Factors and PIC Assembly

Unlike the bacterial $\sigma$ factor (a single polypeptide), eukaryotic Pol II requires a large set of general transcription factors (GTFs) to recognize the promoter and initiate transcription. Together with Pol II, they form the Pre-Initiation Complex (PIC).

TFIID (TBP + 13-14 TAFs)

TBP binds the TATA box minor groove, inducing a dramatic ~80$^\circ$ bend in the DNA. TBP has a saddle-shaped structure. TAFs (TBP-Associated Factors) recognize other promoter elements (Inr, DPE, MTE) and provide a platform for activator contacts. TAF1 has histone acetyltransferase (HAT) activity; TAF1/TAF7 has kinase activity. TFIID is the first GTF to bind the promoter in the stepwise assembly model.

TFIIA

Stabilizes TBP-DNA interaction; counteracts transcriptional repressors that compete for TBP binding. Not always required in vitro.

TFIIB

Bridges TBP and Pol II. Binds BRE elements and positions Pol II at the correct start site. The B-reader and B-linker domains extend into the Pol II active site cleft, helping to stabilize the initially transcribing complex and select the +1 nucleotide. TFIIB is the functional analog of bacterial $\sigma$ region 3.2.

TFIIF (RAP30/RAP74)

Escorts Pol II to the promoter; stabilizes Pol II-TFIIB-DNA interaction. RAP74 has a role in start-site selection. Reduces nonspecific binding of Pol II to DNA. Enters as a Pol II-TFIIF complex.

TFIIE

Recruits TFIIH to the PIC. Stimulates the helicase and kinase activities of TFIIH. Helps stabilize the open complex by binding single-stranded DNA in the transcription bubble.

TFIIH (10 subunits, ~500 kDa)

The most complex GTF, with two enzymatic activities critical for initiation: (1) XPB helicase (3′→5′) — uses ATP hydrolysis to melt ~11 bp of DNA around the start site, forming the open complex. XPB is also essential for nucleotide excision repair (NER), linking transcription and repair. (2) CDK7/Cyclin H kinase (CAK module) — phosphorylates Ser5 of the Pol II CTD heptad repeat, triggering promoter escape and co-transcriptional capping. CDK7 also phosphorylates cell-cycle CDKs.

Stepwise PIC Assembly Order

TFIID (TBP) → TFIIA → TFIIB → Pol II/TFIIF → TFIIE → TFIIH → PIC

In vivo, much of this may arrive as a preassembled "Pol II holoenzyme" containing Pol II, GTFs, and the Mediator complex. Scaffold complexes with TFIID, TFIIA, TFIIB, Mediator can persist at active promoters through multiple rounds of reinitiation.

The CTD Code: Phosphorylation as a Regulatory Switch

The C-terminal domain (CTD) of the Rpb1 subunit of Pol II consists of tandem heptad repeats with the consensus sequence Y$_1$S$_2$P$_3$T$_4$S$_5$P$_6$S$_7$. Humans have 52 repeats; yeast have 26. The CTD acts as a landing platform whose phosphorylation state recruits different processing and regulatory factors at different stages of transcription.

Ser5-P (Initiation / Early Elongation)

Kinase: CDK7 (TFIIH subunit)
Recruits: Capping enzyme (guanylyltransferase)
Function: Triggers promoter escape; 5′ capping of nascent RNA
Peak: Near the promoter (+1 to +500 bp)
Removed by: Rtr1/RPAP2 phosphatase

Ser2-P (Productive Elongation)

Kinase: P-TEFb (CDK9/Cyclin T)
Recruits: Splicing factors, 3′ processing machinery (CPSF/CstF)
Function: Marks productively elongating Pol II
Peak: Gene body and 3′ end
Removed by: Fcp1 phosphatase (CTD recycling)

The CTD Cycle

$$\text{Pol II (unphosphorylated)} \xrightarrow{\text{CDK7}} \text{Ser5-P} \xrightarrow{\text{CDK9}} \text{Ser2-P + Ser5-P} \xrightarrow{\text{Fcp1}} \text{Recycled Pol II}$$

Only hypophosphorylated Pol II (IIA form) is recruited to the PIC. The hyperphosphorylated form (IIO) is the elongating species. After termination, Fcp1 dephosphorylates the CTD to regenerate the IIA form for reinitiation.

Elongation Regulation: Promoter-Proximal Pausing

A major regulatory checkpoint in eukaryotic transcription occurs after initiation: Pol II pauses ~30-60 bp downstream of the TSS and must be actively released into productive elongation. This is called promoter-proximal pausing and is a widespread mechanism for rapid gene activation.

Pausing Factors

DSIF (DRB Sensitivity-Inducing Factor): Spt4/Spt5 heterodimer; travels with Pol II; contributes to pausing in unphosphorylated state; becomes a positive elongation factor when phosphorylated by P-TEFb
NELF (Negative Elongation Factor): 4-subunit complex; cooperates with DSIF to stabilize the paused Pol II. Absent in yeast and C. elegans (no pausing in these organisms).

Pause Release: P-TEFb

P-TEFb (CDK9/Cyclin T1) is the master switch for pause release. It phosphorylates: (1) Ser2 of the Pol II CTD, (2) Spt5 (converting DSIF to a positive factor), and (3) the RD subunit of NELF (causing NELF dissociation). P-TEFb activity is itself tightly regulated — the majority is sequestered in the inactive 7SK snRNP complex and must be released by signals such as BRD4 binding to acetylated histones or by super-enhancers.

Other Elongation Factors

FACT (Facilitates Chromatin Transcription): Histone chaperone; disassembles nucleosomes ahead of Pol II and reassembles them behind
Elongin (SIII): Stimulates elongation rate by suppressing transient pausing
TFIIS (SII): Stimulates intrinsic Pol II endonuclease to cleave backtracked RNA (rescues arrested complexes)
PAF1 complex: Couples elongation to histone modifications (H2B ubiquitination, H3K4/K36 methylation)

Eukaryotic RNA Processing

Eukaryotic pre-mRNAs undergo three co-transcriptional modifications before export as mature mRNA: 5′ capping, splicing, and 3′ cleavage/polyadenylation. These processing events are coordinated through the Pol II CTD, which serves as a scaffold for recruiting processing machinery.

5′ Capping

The 5′ cap is added co-transcriptionally when the nascent RNA is only ~20-30 nucleotides long. It is the first processing event and is coupled to transcription through Ser5-P of the CTD, which recruits the capping enzymes.

Three Enzymatic Steps

RNA triphosphatase: Removes the $\gamma$-phosphate from the 5′ end of the nascent RNA (5′-pppN → 5′-ppN)
Guanylyltransferase: Transfers GMP (from GTP) via an unusual 5′-5′ triphosphate linkage (GpppN). This backward-facing G is the defining feature of the cap.
Methyltransferase: SAM-dependent methylation at N7 of the guanine, yielding the m$^7$GpppN (Cap 0) structure. Further 2′-O-methylation of the first and second nucleotides gives Cap 1 and Cap 2.

Functions of the m$^7$G Cap

Stability: Protects mRNA from 5′→3′ exonuclease degradation (Xrn1/Xrn2)
Nuclear export: Recognized by CBC (Cap-Binding Complex: CBP20/CBP80) for transport through nuclear pores
Translation: Recognized by eIF4E in the cytoplasm for ribosome recruitment
Splicing: Enhances splicing of the first (cap-proximal) intron
Self vs. non-self: Cap modifications (2′-O-methylation) help distinguish self mRNA from viral RNA, avoiding innate immune activation (RIG-I, IFIT1)

Pre-mRNA Splicing

Most eukaryotic genes are interrupted by non-coding introns that must be precisely removed to join the coding exons. In humans, the average gene has ~8.8 exons and ~7.8 introns, with introns comprising ~95% of pre-mRNA length. Splicing is catalyzed by the spliceosome, a massive (~3 MDa) ribonucleoprotein machine.

Splice Site Consensus Sequences

5′ splice site (donor): exon-GU-intron (consensus: AG|GURAGU, where R = purine; the GU dinucleotide is almost invariant)
3′ splice site (acceptor): intron-AG-exon (consensus: YYYAG|G, preceded by a polypyrimidine tract)
Branch point: Located 18-40 nt upstream of the 3′ SS. Contains a conserved adenosine (consensus: YNCURAY in mammals, UACUAAC in yeast; the branch-point nucleotide is the A of URAY in the mammalian consensus and the final A of UACUAAC in yeast)
Polypyrimidine tract: A pyrimidine-rich region between the branch point and 3′ SS, bound by U2AF65

The Two Transesterification Reactions

Splicing proceeds through two sequential transesterification (phosphoryl transfer) reactions. No energy input (ATP) is required for the chemistry itself — the reactions are isoenergetic. ATP is consumed only for spliceosome assembly and rearrangements.

Step 1: Branching Reaction

The 2′-OH of the branch-point adenosine performs a nucleophilic attack on the phosphodiester bond at the 5′ splice site. This generates: (a) a free 5′ exon with a 3′-OH, and (b) a lariat intermediate — the intron forms a loop via the 2′-5′ phosphodiester bond to the branch-point A, still connected to the 3′ exon.

Step 2: Exon Ligation

The free 3′-OH of the 5′ exon attacks the phosphodiester bond at the 3′ splice site. This ligates the two exons and releases the lariat intron (which is debranched and degraded). The spliced mRNA retains the original reading frame.
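At the sequence level, the net result of the two steps is removal of the intron and joining of the flanking exons. The sketch below shows only that bookkeeping on a toy pre-mRNA. The function name and the sequence are invented for illustration, the intron is far shorter than real introns, and the lariat topology is not modeled.

```python
def splice(pre_mrna: str, intron_start: int, intron_end: int):
    """Remove pre_mrna[intron_start:intron_end] (0-based, end-exclusive).

    Returns (spliced_mrna, excised_intron). Raises ValueError if the
    intron does not begin with GU and end with AG.
    """
    excised = pre_mrna[intron_start:intron_end]
    if not (excised.startswith("GU") and excised.endswith("AG")):
        raise ValueError("intron does not have GU...AG boundaries")
    five_prime_exon = pre_mrna[:intron_start]   # freed with a 3'-OH in step 1
    three_prime_exon = pre_mrna[intron_end:]    # joined to the 5' exon in step 2
    return five_prime_exon + three_prime_exon, excised

# Toy pre-mRNA: exon, intron (5' SS, branch point, polypyrimidine tract, 3' SS), exon.
exon1 = "AUGGCAG"
intron = "GUAAGU" + "CCUACUAAC" + "UUUUUUU" + "CAG"
exon2 = "GCUUAA"
pre = exon1 + intron + exon2

mrna, lariat = splice(pre, len(exon1), len(exon1) + len(intron))
print(mrna)
```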

Spliceosome Assembly (Ordered snRNP Recruitment)

The major (U2-dependent) spliceosome consists of five snRNPs (U1, U2, U4, U5, U6) plus ~150 associated proteins. Assembly follows a defined pathway:

E complex (early): U1 snRNP base-pairs with the 5′ SS; SF1/BBP recognizes the branch point; U2AF65 binds polypyrimidine tract; U2AF35 binds the 3′ SS AG
A complex: U2 snRNP replaces SF1, base-pairing with the branch-point sequence (the branch-point A is bulged out, activating its 2′-OH). ATP-dependent step requiring the helicase activity of UAP56.
B complex: U4/U5/U6 tri-snRNP joins. U5 contacts the 5′ exon. U6 displaces U1 at the 5′ SS. Massive conformational rearrangements mediated by DExD/H-box RNA helicases (Brr2, Prp28).
B* (activated): U1 and U4 are released. U6 base-pairs with U2 to form the catalytic center. U6 coordinates catalytic metal ions (Mg$^{2+}$). The spliceosome is now a ribozyme.
C complex: Step 1 catalysis (branching reaction) occurs. The lariat intermediate is formed.
C* complex: Rearrangement; step 2 catalysis (exon ligation). Exons are joined; lariat is released.
P complex (post-spliceosomal): Mature mRNA released with exon junction complex (EJC) deposited ~20-24 nt upstream of the splice junction. EJC marks the splice site for NMD surveillance and enhances nuclear export and translation.

Splicing Efficiency

The fraction of transcripts correctly spliced can be modeled as:

$$\eta_{\text{splice}} = \prod_{i=1}^{n} \frac{k_{\text{splice},i}}{k_{\text{splice},i} + k_{\text{skip},i} + k_{\text{retain},i}}$$

where $k_{\text{splice},i}$ is the splicing rate for intron $i$, $k_{\text{skip},i}$ is the exon-skipping rate, and $k_{\text{retain},i}$ is the intron-retention rate. For a gene with $n$ introns, the overall fidelity is the product of individual intron efficiencies.
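Because the efficiencies multiply, small per-intron losses compound. The sketch below evaluates the product for hypothetical rate constants; the function name and the numbers are assumptions, and `math.prod` requires Python 3.8 or later.

```python
import math

def splicing_efficiency(introns) -> float:
    """Fraction of transcripts with every intron correctly spliced.

    introns is an iterable of (k_splice, k_skip, k_retain) tuples,
    all in the same rate units.
    """
    return math.prod(ks / (ks + kk + kr) for ks, kk, kr in introns)

# Eight introns, each spliced correctly 98% of the time (hypothetical rates).
eta = splicing_efficiency([(98.0, 1.0, 1.0)] * 8)
print(eta)
```

In this example a 98% per-intron efficiency leaves only about 85% of transcripts fully and correctly spliced.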

Alternative Splicing

Over 95% of human multi-exon genes undergo alternative splicing, producing multiple mRNA isoforms from a single gene. This is a major source of proteomic diversity — the ~20,000 human genes can produce over 100,000 distinct transcript isoforms.

Modes of Alternative Splicing

Exon skipping (cassette exon): Most common in mammals (~40%). An internal exon is either included or excluded.
Alternative 5′ splice site: Two or more donor sites compete for the same acceptor. Changes the 3′ boundary of the upstream exon.
Alternative 3′ splice site: Two or more acceptor sites compete for the same donor. Changes the 5′ boundary of the downstream exon.
Intron retention: Most common in plants/fungi. An intron is retained in the mature mRNA. Often introduces a premature stop codon → NMD.
Mutually exclusive exons: Exactly one of two or more exons is included, never both and never neither.

Regulatory Proteins

SR proteins (SRSF1-12): Contain RS (arginine-serine) domain and RRM. Generally promote exon inclusion by binding exonic splicing enhancers (ESEs) and recruiting U1/U2 snRNPs.
hnRNP proteins: Generally antagonize SR proteins. Bind exonic/intronic splicing silencers (ESS/ISS) and promote exon skipping. Examples: hnRNP A1 (promotes skipping), PTB/hnRNP I (represses splicing).
Tissue-specific regulators: NOVA (brain), RBFOX (muscle/brain), MBNL (muscle), ESRP1/2 (epithelial). Create tissue-specific splicing programs.

Record Holder: Drosophila Dscam

The Drosophila Down syndrome cell adhesion molecule (Dscam) gene can potentially produce 38,016 distinct mRNA isoforms through mutually exclusive alternative splicing of four exon clusters: exon 4 (12 variants) $\times$ exon 6 (48 variants) $\times$ exon 9 (33 variants) $\times$ exon 17 (2 variants) = 12 $\times$ 48 $\times$ 33 $\times$ 2 = 38,016. This exceeds the total number of genes in the Drosophila genome (~14,000) and plays a critical role in neuronal self-recognition and innate immunity.

Derivation: Splicing Combinatorics — Number of Possible mRNA Isoforms

Starting from a gene with $N$ exons, we calculate the theoretical number of distinct mRNA isoforms under different splicing models.

Step 1: Cassette exon model (independent exon skipping)

Assume the first and last exons are always included (constitutive), while each of the $N - 2$ internal exons is independently included or excluded. Each internal exon has 2 choices:

$$I_{\text{skip}} = 2^{N-2}$$

For a 10-exon gene: $2^8 = 256$ possible isoforms.

Step 2: Mutually exclusive exon clusters

If exon position $j$ has a cluster of $m_j$ mutually exclusive variants (exactly one must be chosen), the total number of isoforms from $k$ such clusters is the product:

$$I_{\text{ME}} = \prod_{j=1}^{k} m_j$$

For Drosophila Dscam: $I = 12 \times 48 \times 33 \times 2 = 38{,}016$ isoforms.

Step 3: Include alternative 5′ and 3′ splice sites

If each internal exon can be skipped, included with the canonical site, or included with an alternative 5′ or 3′ splice site, there are 4 choices per internal exon:

$$I_{\text{alt}} = 4^{N-2}$$

For a 10-exon gene: $4^8 = 65{,}536$ theoretical isoforms.

Step 4: Add intron retention

Each of the $N - 1$ introns can be retained or removed. Combined with exon skipping, the upper bound becomes:

$$I_{\text{max}} = 2^{N-2} \times 2^{N-1} = 2^{2N-3}$$

Step 5: General combinatorial formula

For a gene with $n_s$ cassette exons, $k$ mutually exclusive clusters of sizes $m_1, \ldots, m_k$, $n_a$ exons with alternative splice sites (each with $a_i$ sites), and $n_r$ retainable introns:

$$I_{\text{total}} = 2^{n_s} \times \prod_{j=1}^{k} m_j \times \prod_{i=1}^{n_a} a_i \times 2^{n_r}$$
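The general formula can be checked against the earlier steps with a short sketch. The function name `isoform_count` and its keyword arguments are illustrative choices, not part of the original material; the numbers are the ones quoted in Steps 1, 2, and 4.

```python
from math import prod

def isoform_count(n_cassette=0, me_cluster_sizes=(), alt_site_counts=(),
                  n_retainable_introns=0):
    """Theoretical isoform count: 2^n_s * prod(m_j) * prod(a_i) * 2^n_r.

    Assumes every splicing decision is independent, which is the
    upper-bound assumption of the derivation, not a biological claim.
    """
    return (2 ** n_cassette
            * prod(me_cluster_sizes)      # prod(()) == 1 for "no clusters"
            * prod(alt_site_counts)
            * 2 ** n_retainable_introns)

# Step 1: 10-exon gene, 8 internal cassette exons
print(isoform_count(n_cassette=8))                           # 256
# Step 2: Drosophila Dscam, four mutually exclusive clusters
print(isoform_count(me_cluster_sizes=(12, 48, 33, 2)))       # 38016
# Step 4: 10-exon gene, 8 cassette exons plus 9 retainable introns
print(isoform_count(n_cassette=8, n_retainable_introns=9))   # 131072
```

Each factor defaults to 1, so the special cases in Steps 1, 2, and 4 fall out by leaving the other arguments unset.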

Step 6: Biological constraints reduce complexity

In practice, most theoretical isoforms are never produced. SR proteins and hnRNPs create regulatory networks that couple exon choices, and many combinations would introduce frameshifts or premature stop codons (subject to NMD). The human transcriptome has ~20,000 genes producing ~100,000-200,000 distinct transcripts — an average of 5-10 isoforms per gene, far below the theoretical maximum of $2^{N-2}$.

3′ End Processing: Cleavage and Polyadenylation

The 3′ end of most eukaryotic mRNAs is generated not by termination of transcription but by endonucleolytic cleavage of the pre-mRNA followed by addition of a poly(A) tail. This is coupled to Pol II termination.

Signal Sequences

Polyadenylation signal: AAUAAA (or close variant AUUAAA). Located 10-30 nt upstream of the cleavage site. Recognized by CPSF (Cleavage and Polyadenylation Specificity Factor), specifically the CPSF-30 subunit via its zinc finger domains. The AAUAAA hexamer is the most conserved element in the 3′ processing signal.
Downstream element (DSE): A GU-rich or U-rich element located 20-40 nt downstream of the cleavage site. Recognized by CstF (Cleavage Stimulation Factor), specifically CstF-64.
Cleavage site: Usually after a CA dinucleotide, between the AAUAAA and DSE elements.
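The spacing rules above can be turned into a toy scanner. This is only a sketch: the function name `candidate_cleavage_sites`, the toy sequence, and the choice to measure the 10-30 nt gap from the 3′ end of the hexamer are assumptions made for illustration, and the downstream element is not checked at all.

```python
import re

def candidate_cleavage_sites(rna, min_gap=10, max_gap=30):
    """List CA dinucleotides lying 10-30 nt downstream of a poly(A) signal.

    Returns one dict per AAUAAA/AUUAAA hit; 'cleave_after' holds 0-based
    slice boundaries, so rna[:boundary] is the upstream fragment ending in CA.
    """
    results = []
    # Lookahead so that overlapping hexamers are all reported
    for match in re.finditer(r"(?=(AAUAAA|AUUAAA))", rna):
        hexamer_end = match.start() + 6
        lo = hexamer_end + min_gap
        hi = min(hexamer_end + max_gap, len(rna))
        ca_sites = [i + 2 for i in range(lo, hi - 1) if rna[i:i + 2] == "CA"]
        results.append({"signal": match.group(1),
                        "signal_start": match.start(),
                        "cleave_after": ca_sites})
    return results

# Hypothetical toy transcript: signal, spacer, CA, spacer, GU-rich region
toy_rna = "GGCU" + "AAUAAA" + "GCUGCUGCUGCUGCU" + "CA" + "GGGG" + "UGUGUUGUGUU"
for hit in candidate_cleavage_sites(toy_rna):
    print(hit)
```

Real poly(A) site prediction also weighs the downstream element and auxiliary motifs, so hits from a scan like this are candidates only.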

Processing Steps

Assembly: CPSF binds AAUAAA; CstF binds DSE; additional factors (CFI, CFII) join. CPSF is recruited to the elongating Pol II via the Ser2-P CTD.
Cleavage: CPSF-73 (a metallo-$\beta$-lactamase) performs the endonucleolytic cleavage. This generates a free 3′-OH on the upstream (mRNA) fragment.
Polyadenylation: Poly(A) polymerase (PAP) adds ~200 adenylate residues (in mammals) to the 3′-OH in a template-independent manner. PABPN1 (nuclear poly(A) binding protein) stimulates PAP processivity and controls tail length.
Termination coupling: The downstream RNA fragment (still attached to Pol II) is degraded 5′→3′ by Xrn2 (torpedo model). When Xrn2 catches up to Pol II, it triggers release — termination.

Functions of the Poly(A) Tail

mRNA stability: Bound by PABPC1 in the cytoplasm; protects from 3′→5′ exosome degradation. Deadenylation is typically the first step in mRNA decay.
Nuclear export: PABPN1 + CBC promote mRNP export through the nuclear pore complex.
Translation: PABPC1 interacts with eIF4G, forming the mRNA closed-loop structure that promotes translation initiation and ribosome recycling.

Mathematical Framework

Transcription Rate Equations

Steady-State mRNA Level

At steady state, the rate of mRNA production equals the rate of degradation:

$$\frac{d[\text{mRNA}]}{dt} = k_{\text{txn}} - k_{\text{deg}}[\text{mRNA}] = 0$$

$$[\text{mRNA}]_{\text{ss}} = \frac{k_{\text{txn}}}{k_{\text{deg}}}$$

where $k_{\text{txn}}$ is the transcription rate (mRNA/time) and $k_{\text{deg}}$ is the first-order degradation rate constant. The mRNA half-life is $t_{1/2} = \ln 2 / k_{\text{deg}}$.
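A small numerical sketch of these relationships follows. The rate values are made-up illustrative numbers, and `mrna_timecourse` uses the standard solution of the linear equation above, $m(t) = m_{\text{ss}} + (m_0 - m_{\text{ss}})e^{-k_{\text{deg}}t}$, which is not written out in the text.

```python
import math

def steady_state_mrna(k_txn, k_deg):
    """Steady-state level where production balances degradation."""
    return k_txn / k_deg

def half_life(k_deg):
    """mRNA half-life for first-order decay."""
    return math.log(2) / k_deg

def mrna_timecourse(t, k_txn, k_deg, m0=0.0):
    """Solution of dm/dt = k_txn - k_deg * m with m(0) = m0."""
    m_ss = k_txn / k_deg
    return m_ss + (m0 - m_ss) * math.exp(-k_deg * t)

# Illustrative (assumed) rates: 0.5 mRNA/min made, 0.05 /min degraded
k_txn, k_deg = 0.5, 0.05
print(f"steady state: {steady_state_mrna(k_txn, k_deg):.1f} mRNA")
print(f"half-life:    {half_life(k_deg):.2f} min")
# Starting from zero, the level reaches half of steady state after one half-life
print(f"m(t_1/2):     {mrna_timecourse(half_life(k_deg), k_txn, k_deg):.2f} mRNA")
```

Note that the approach to steady state is set by $k_{\text{deg}}$ alone: a change in $k_{\text{txn}}$ shifts the final level but not how quickly it is reached.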

Transcriptional Bursting (Two-State Model)

Gene expression is inherently stochastic. The two-state (telegraph) model captures transcriptional bursting:

$$G_{\text{OFF}} \underset{k_{\text{off}}}{\overset{k_{\text{on}}}{\rightleftharpoons}} G_{\text{ON}} \xrightarrow{k_{\text{init}}} G_{\text{ON}} + \text{mRNA}$$

$$\text{Burst size} = \frac{k_{\text{init}}}{k_{\text{off}}}, \quad \text{Burst frequency} = k_{\text{on}}$$

$$\text{Fano factor} = \frac{\sigma^2}{\mu} = 1 + \frac{k_{\text{init}}}{k_{\text{off}} + k_{\text{deg}}} > 1 \quad \text{(super-Poissonian)}$$

Derivation: Transcriptional Bursting — Two-State Telegraph Model

Starting from the two-state (telegraph) model of gene expression, we derive the steady-state mean, variance, and mRNA distribution for a bursty promoter.

Step 1: Define the two-state promoter model

A gene switches stochastically between an inactive (OFF) state and an active (ON) state. mRNA is only produced in the ON state and degrades with first-order kinetics:

$$G_{\text{OFF}} \underset{k_{\text{off}}}{\overset{k_{\text{on}}}{\rightleftharpoons}} G_{\text{ON}} \xrightarrow{k_{\text{init}}} G_{\text{ON}} + \text{mRNA}, \quad \text{mRNA} \xrightarrow{k_{\text{deg}}} \varnothing$$

Step 2: Determine the mean ON fraction

At steady state, the fraction of time the promoter is ON:

$$\langle f_{\text{ON}} \rangle = \frac{k_{\text{on}}}{k_{\text{on}} + k_{\text{off}}}$$

The effective mean transcription rate is $k_{\text{eff}} = k_{\text{init}} \cdot \langle f_{\text{ON}} \rangle$.

Step 3: Define burst size and burst frequency

A “burst” is the number of mRNAs produced during a single ON episode. The ON duration is exponentially distributed with mean $1/k_{\text{off}}$. During this time, mRNAs are initiated at rate $k_{\text{init}}$:

$$b = \frac{k_{\text{init}}}{k_{\text{off}}} \quad \text{(mean burst size)}$$

Burst frequency = rate of switching ON: $k_{\text{on}}$.

Step 4: Derive the mean mRNA level

At steady state, production equals degradation:

$$\langle n \rangle = \frac{k_{\text{eff}}}{k_{\text{deg}}} = \frac{k_{\text{init}} \cdot k_{\text{on}}}{(k_{\text{on}} + k_{\text{off}}) \cdot k_{\text{deg}}}$$
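Steps 2 to 4 amount to a few lines of arithmetic, sketched below. The helper name `telegraph_mean` is an illustrative choice; the rate values are the ones used in this section's simulation script (`k_on = 0.05`, `k_off = 0.02`, `k_init = 0.8`, `k_degrad = 0.002`, all per second).

```python
def telegraph_mean(k_on, k_off, k_init, k_deg):
    """Mean-level quantities of the two-state model (Steps 2-4)."""
    f_on = k_on / (k_on + k_off)      # fraction of time the promoter is ON
    k_eff = k_init * f_on             # effective mean transcription rate
    burst_size = k_init / k_off       # mean mRNAs per ON episode
    mean_mrna = k_eff / k_deg         # steady-state mean copy number
    return {"f_on": f_on, "k_eff": k_eff,
            "burst_size": burst_size, "mean_mrna": mean_mrna}

stats = telegraph_mean(k_on=0.05, k_off=0.02, k_init=0.8, k_deg=0.002)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With these rates the promoter is ON about 71% of the time, so the gene behaves closer to a constitutive promoter than to a strongly bursty one, even though the formal burst size is 40.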

Step 5: Derive the variance and Fano factor

The full analytical solution of the two-state master equation (Peccoud and Ycart, 1995) gives the variance. In the bursty limit, where the promoter is OFF most of the time and ON episodes are rare ($k_{\text{on}} \ll k_{\text{off}}$), it reduces to:

$$\sigma^2 = \langle n \rangle + \langle n \rangle \cdot \frac{k_{\text{init}}}{k_{\text{off}} + k_{\text{deg}}}$$

The Fano factor (variance/mean) measures deviation from Poisson statistics:

$$F = \frac{\sigma^2}{\langle n \rangle} = 1 + \frac{b}{1 + k_{\text{deg}}/k_{\text{off}}} = 1 + \frac{k_{\text{init}}}{k_{\text{off}} + k_{\text{deg}}}$$

$F = 1$ is Poissonian (constitutive expression). $F > 1$ is super-Poissonian (bursty). The excess noise comes entirely from promoter switching.
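To see when the compact Fano expression is a good approximation, the sketch below compares it with the closed form usually quoted for the full telegraph model, $F = 1 + \frac{k_{\text{init}} k_{\text{off}}}{(k_{\text{on}} + k_{\text{off}})(k_{\text{on}} + k_{\text{off}} + k_{\text{deg}})}$. That exact expression is not derived in the text and should be read as an assumption brought in for comparison; the function names and the "bursty" parameter set are likewise illustrative.

```python
def fano_exact(k_on, k_off, k_init, k_deg):
    """Fano factor of the telegraph model (commonly quoted closed form)."""
    s = k_on + k_off
    return 1.0 + k_init * k_off / (s * (s + k_deg))

def fano_bursty(k_off, k_init, k_deg):
    """Compact approximation: 1 + k_init / (k_off + k_deg)."""
    return 1.0 + k_init / (k_off + k_deg)

# Strongly bursty promoter (assumed values): rarely ON, k_on << k_off
bursty = dict(k_on=0.001, k_off=0.1, k_init=0.8, k_deg=0.002)
# Rates from this section's simulation script: k_on > k_off, mostly ON
script = dict(k_on=0.05, k_off=0.02, k_init=0.8, k_deg=0.002)

for label, p in (("bursty", bursty), ("script", script)):
    exact = fano_exact(**p)
    approx = fano_bursty(p["k_off"], p["k_init"], p["k_deg"])
    print(f"{label}: exact F = {exact:.2f}, approximation = {approx:.2f}")
```

The two agree closely when $k_{\text{on}} \ll k_{\text{off}}$, while for a promoter that is ON most of the time the compact formula overestimates the noise considerably.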

Step 6: Steady-state mRNA distribution

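The full distribution $P(n)$ is in general a beta-Poisson mixture. In the bursty limit of short, intense ON episodes ($k_{\text{off}} \gg k_{\text{on}}, k_{\text{deg}}$, with the burst size $k_{\text{init}}/k_{\text{off}}$ held fixed), it reduces to a negative binomial: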

$$P(n) = \frac{\Gamma(a + n)}{\Gamma(a) \cdot n!} \cdot \frac{b^n}{(1+b)^{a+n}}$$

where $a = k_{\text{on}}/k_{\text{deg}}$ is the normalized burst frequency and $b = k_{\text{init}}/k_{\text{off}}$ is the burst size. This distribution is broader than Poisson, with a characteristic “bursty” shape featuring a peak near zero and a long tail, frequently observed in single-molecule FISH experiments.
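The distribution can be evaluated directly from the formula using log-gamma functions, which avoids overflow for large $n$. The sketch below uses assumed illustrative values $a = 0.5$ and $b = 40$ and checks numerically that the probabilities sum to 1, that the mean is $ab$, and that the Fano factor is $1 + b$ (standard negative binomial properties).

```python
import math

def neg_binomial_pmf(n, a, b):
    """P(n) = Gamma(a+n) / (Gamma(a) n!) * b^n / (1+b)^(a+n), via logs."""
    log_p = (math.lgamma(a + n) - math.lgamma(a) - math.lgamma(n + 1)
             + n * math.log(b) - (a + n) * math.log(1.0 + b))
    return math.exp(log_p)

a, b = 0.5, 40.0                      # assumed burst frequency and burst size
probs = [neg_binomial_pmf(n, a, b) for n in range(5000)]  # tail beyond is negligible

total = sum(probs)
mean = sum(n * p for n, p in enumerate(probs))
var = sum((n - mean) ** 2 * p for n, p in enumerate(probs))

print(f"sum of P(n): {total:.6f}")
print(f"mean:        {mean:.3f}  (a*b = {a * b:.3f})")
print(f"Fano factor: {var / mean:.3f}  (1 + b = {1 + b:.3f})")
print(f"P(0):        {probs[0]:.3f}")
```

For $a < 1$ the mode sits at $n = 0$, which is the peak-near-zero, long-tailed shape described above.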

Elongation Time and Polymerase Density

$$t_{\text{elong}} = \frac{L}{v_{\text{elong}}}, \quad \rho_{\text{RNAP}} = \frac{k_{\text{init}}}{v_{\text{elong}}}$$

where $L$ is gene length (nt), $v_{\text{elong}}$ is elongation speed (nt/s), and $\rho_{\text{RNAP}}$ is the linear density of polymerases on the gene (RNAP/nt). For rRNA genes, $\rho$ can be as high as 1 RNAP per ~80 nt.
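As a quick worked example, the sketch below evaluates both expressions for the gene used in this section's simulation script (3000 nt, 50 nt/s, 0.8 initiations/s while the promoter is ON). The function names are illustrative, and the density refers to a continuously active promoter, which is an assumption; averaging over OFF periods would lower it.

```python
def elongation_time(gene_length_nt, v_elong):
    """Seconds needed for one polymerase to traverse the gene."""
    return gene_length_nt / v_elong

def rnap_density(k_init, v_elong):
    """Polymerases per nucleotide on a continuously active gene."""
    return k_init / v_elong

L, v, k_init = 3000, 50.0, 0.8
t_elong = elongation_time(L, v)
rho = rnap_density(k_init, v)

print(f"elongation time:   {t_elong:.0f} s")
print(f"RNAP density:      {rho:.3f} per nt (one every {1 / rho:.1f} nt)")
print(f"RNAPs on the gene: {rho * L:.0f}")
```

The last line is just $k_{\text{init}} \cdot t_{\text{elong}}$: every polymerase that initiated within the last elongation time is still on the gene.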

Transcription Regulation Preview

Transcription is the most frequently regulated step in gene expression. A full treatment follows in Part 7 (Gene Regulation), but the key players are introduced here:

Activators

Sequence-specific DNA-binding proteins that enhance transcription by recruiting Pol II/GTFs to the promoter or stimulating PIC assembly. They bind upstream activating sequences (UAS in yeast) or enhancers (metazoans). Activation domains (ADs) are typically acidic, glutamine-rich, or proline-rich intrinsically disordered regions. Examples: GAL4 (yeast), p53, NF-$\kappa$B, SP1.

Repressors

Inhibit transcription by: (1) competing with activators for DNA binding, (2) masking activation domains, (3) recruiting corepressors/HDACs to create repressive chromatin, (4) directly inhibiting PIC assembly. Examples: lac repressor (prokaryotic), REST/NRSF (neuronal genes), Rb (cell cycle).

Enhancers

Cis-regulatory DNA elements that can activate transcription over distances ranging from under 1 kb to more than 1 Mb (even on different chromosomes in rare cases). They function orientation-independently and are thought to contact promoters via DNA looping, facilitated by cohesin and CTCF. Active enhancers are marked by H3K4me1 and H3K27ac. Super-enhancers are large clusters that drive cell-identity genes.

Mediator Complex

A ~30-subunit coactivator complex that bridges activators/repressors and Pol II/GTFs. The Mediator is essential for virtually all Pol II transcription in vivo. It consists of head (contacts Pol II), middle (scaffold), tail (contacts activators), and kinase (CDK8, regulatory) modules. Mediator integrates diverse regulatory signals into a unified transcriptional output.

Types of RNA

mRNA

Messenger RNA—Carries genetic code for protein synthesis; 1-5% of total RNA

tRNA

Transfer RNA—Brings amino acids to ribosome; ~76 nt cloverleaf structure; anticodon recognition

rRNA

Ribosomal RNA—Structural and catalytic component of ribosomes; ~80% of total RNA; peptidyl transferase

snRNA

Small nuclear RNA—Core of spliceosome (U1, U2, U4, U5, U6); U6 is the catalytic RNA

snoRNA

Small nucleolar RNA—Guides rRNA modifications (2'-O-methylation, pseudouridylation) in the nucleolus

miRNA

MicroRNA—Gene silencing; ~22 nt; loaded into RISC/Argonaute; targets mRNA 3' UTR

lncRNA

Long non-coding RNA—Diverse regulatory roles: Xist (X-inactivation), HOTAIR (chromatin remodeling), MALAT1

piRNA

Piwi-interacting RNA—Transposon silencing in germline; 24-31 nt; Piwi/Argonaute family

Python: Transcription Kinetics Simulation

This simulation models stochastic transcription with a two-state (ON/OFF) promoter, capturing the phenomenon of transcriptional bursting. The promoter switches stochastically between active and inactive states. When it is ON, the number of initiation events in each fixed time step is drawn from a Poisson distribution. Each transcript requires a finite elongation time before becoming mature mRNA, which then decays with first-order kinetics.

Stochastic Transcription Kinetics with Bursting

Simulate two-state promoter model with initiation, elongation delay, and mRNA degradation

transcription_kinetics.py

import numpy as np

# ============================================================
# Transcription Kinetics Simulator with Transcriptional Bursting
# ============================================================
# Models stochastic mRNA production using a two-state promoter
# (ON/OFF) with Poisson-distributed initiation events.

np.random.seed(42)

# Parameters
k_on = 0.05        # Rate of switching from OFF -> ON (per second)
k_off = 0.02       # Rate of switching from ON -> OFF (per second)
k_init = 0.8       # Initiation rate when promoter is ON (transcripts/s)
k_elong = 50.0     # Elongation speed (nt/s)
gene_length = 3000  # Gene length in nucleotides
k_degrad = 0.002   # mRNA degradation rate (per second)
t_max = 3600.0     # Simulate 1 hour
dt = 0.5           # Time step (seconds)

elongation_time = gene_length / k_elong  # Time to complete one transcript
print(f"=== Transcription Kinetics Simulation ===")
print(f"Gene length: {gene_length} nt")
print(f"Elongation speed: {k_elong} nt/s")
print(f"Time to complete one transcript: {elongation_time:.1f} s")
print(f"Promoter ON rate: {k_on}/s, OFF rate: {k_off}/s")
print(f"Fraction time ON (expected): {k_on/(k_on+k_off):.2f}")
print(f"Initiation rate (when ON): {k_init} transcripts/s")
print(f"mRNA degradation rate: {k_degrad}/s")
print(f"mRNA half-life: {np.log(2)/k_degrad:.1f} s")
print()

# Simulation arrays
steps = int(t_max / dt)
time = np.arange(steps) * dt  # time points spaced exactly dt apart
promoter_state = np.zeros(steps, dtype=int)  # 0=OFF, 1=ON
nascent_transcripts = []  # list of (start_time,) for in-progress transcripts
mature_mRNA = np.zeros(steps)
nascent_count = np.zeros(steps)
initiation_events = np.zeros(steps)

# Initial state
state = 0  # start OFF
mRNA_count = 0.0

for i in range(steps):
    # Promoter switching: fixed-time-step approximation
    # (switch probability per step ~ rate * dt, valid for rate * dt << 1)
    if state == 0:
        if np.random.random() < k_on * dt:
            state = 1
    else:
        if np.random.random() < k_off * dt:
            state = 0
    promoter_state[i] = state

    # Initiation (only when ON)
    if state == 1:
        n_init = np.random.poisson(k_init * dt)
        for _ in range(n_init):
            nascent_transcripts.append(time[i])
            initiation_events[i] += 1

    # Check which nascent transcripts have finished elongation
    completed = 0
    still_nascent = []
    for start_t in nascent_transcripts:
        if time[i] - start_t >= elongation_time:
            completed += 1
        else:
            still_nascent.append(start_t)
    nascent_transcripts = still_nascent

    # Update mature mRNA count (stochastic production, deterministic Euler decay)
    mRNA_count += completed - k_degrad * mRNA_count * dt
    if mRNA_count < 0:
        mRNA_count = 0

    # Record this step's state inside the loop
    mature_mRNA[i] = mRNA_count
    nascent_count[i] = len(nascent_transcripts)

# Steady-state analytics
frac_on = k_on / (k_on + k_off)
mean_init_rate = frac_on * k_init
steady_state_mRNA = mean_init_rate / k_degrad
burst_size = k_init / k_off
burst_freq = k_on

print(f"--- Analytical Predictions ---")
print(f"Promoter ON fraction: {frac_on:.3f}")
print(f"Mean initiation rate: {mean_init_rate:.3f} transcripts/s")
print(f"Steady-state mRNA (deterministic): {steady_state_mRNA:.1f}")
print(f"Burst size (k_init/k_off): {burst_size:.1f} transcripts/burst")
print(f"Burst frequency (k_on): {burst_freq:.3f} bursts/s")
print()

# Report simulation results
print(f"--- Simulation Results (t = {t_max:.0f}s) ---")
print(f"Final mature mRNA count: {mature_mRNA[-1]:.1f}")
print(f"Mean mature mRNA (last 50%): {np.mean(mature_mRNA[steps//2:]):.1f}")
print(f"Std dev mRNA (last 50%): {np.std(mature_mRNA[steps//2:]):.1f}")
print(f"Fano factor (var/mean): {np.var(mature_mRNA[steps//2:])/max(np.mean(mature_mRNA[steps//2:]),1e-9):.2f}")
print(f"  (Fano > 1 indicates super-Poissonian / bursty expression)")
print()

# Time-binned burst analysis
bin_size = 60  # 1 minute bins
n_bins = int(t_max / bin_size)
print(f"--- Transcription Bursts (1-min bins) ---")
print(f"{'Time (min)':>10} {'Promoter ON%':>12} {'Initiations':>12} {'Mature mRNA':>12}")
for b in range(min(n_bins, 20)):
    s = int(b * bin_size / dt)
    e = int((b + 1) * bin_size / dt)
    on_frac = np.mean(promoter_state[s:e]) * 100
    inits = np.sum(initiation_events[s:e])
    mrna = mature_mRNA[min(e, steps-1)]
    print(f"{b+1:>10} {on_frac:>11.1f}% {inits:>12.0f} {mrna:>12.1f}")

print()
print(f"Total initiation events: {np.sum(initiation_events):.0f}")
print(f"Mean initiations per minute: {np.sum(initiation_events)/n_bins:.1f}")


Fortran: Splicing Combinatorics Calculator

This Fortran program calculates the theoretical number of possible mRNA isoforms from a gene with N exons under different models of alternative splicing: independent exon skipping, mutually exclusive clusters (Dscam-style), combined alternative splice sites, intron retention, and full combinatorial complexity.

Alternative Splicing Combinatorics

Calculate possible mRNA isoforms under different splicing models

splicing_combinatorics.f90

program splicing_combinatorics
  implicit none
  ! ============================================================
  ! Alternative Splicing Combinatorics Calculator
  ! ============================================================
  ! Calculates the number of possible mRNA isoforms from a gene
  ! with N exons under different splicing models.

  ! Only the variables actually used below are declared
  integer :: n
  real(8) :: total_skip, total_alt_ss, dscam_isoforms

  print *, '========================================================'
  print *, ' Alternative Splicing Combinatorics Calculator'
  print *, '========================================================'
  print *, ''

  ! --- Model 1: Exon Skipping (Cassette Exons) ---
  ! Each internal exon can be included or excluded independently
  ! For N exons: 2 terminal (always included) + (N-2) internal
  ! Number of isoforms = 2^(N-2)

  print *, '--- Model 1: Independent Exon Skipping ---'
  print *, 'Each internal exon included or excluded independently.'
  print *, 'Terminal exons always included.'
  print *, ''
  print '(A10, A15, A20)', 'N exons', 'Internal', 'Isoforms'
  print '(A10, A15, A20)', '-------', '--------', '--------'

  do n = 3, 20
    total_skip = 2.0d0 ** (n - 2)
    print '(I10, I15, ES20.4)', n, n-2, total_skip
  end do

  print *, ''

  ! --- Model 2: Mutually Exclusive Exon Clusters ---
  ! Like Dscam: 4 clusters with 12, 48, 33, 2 variants
  print *, '--- Model 2: Dscam-style Mutually Exclusive Clusters ---'
  print *, 'Drosophila Dscam gene:'
  print *, '  Exon 4 cluster:  12 variants'
  print *, '  Exon 6 cluster:  48 variants'
  print *, '  Exon 9 cluster:  33 variants'
  print *, '  Exon 17 cluster:  2 variants'
  dscam_isoforms = 12.0d0 * 48.0d0 * 33.0d0 * 2.0d0
  print '(A, F12.0)', '  Total isoforms = 12 x 48 x 33 x 2 = ', &
        dscam_isoforms
  print *, '  (About 2.7x the ~14,000 genes in the Drosophila genome!)'
  print *, ''

  ! --- Model 3: Combinatorial with Alt 5p/3p Splice Sites ---
  ! Each internal exon has four options: skip, include (canonical sites),
  ! include with an alt 5-prime site, or include with an alt 3-prime site
  print *, '--- Model 3: Combined Skipping + Alt Splice Sites ---'
  print *, 'Each internal exon: skip, include, alt-5p, or alt-3p'
  print *, '(4 choices per internal exon)'
  print *, ''
  print '(A10, A20)', 'N exons', 'Possible isoforms'
  print '(A10, A20)', '-------', '-----------------'

  do n = 3, 15
    total_alt_ss = 4.0d0 ** (n - 2)
    print '(I10, ES20.4)', n, total_alt_ss
  end do

  print *, ''

  ! --- Model 4: Intron Retention ---
  ! Any subset of introns can be retained
  ! N exons have N-1 introns, each retained or not: 2^(N-1)
  print *, '--- Model 4: Intron Retention ---'
  print *, 'Each of N-1 introns can be retained or spliced out.'
  print *, ''
  print '(A10, A12, A20)', 'N exons', 'Introns', 'Isoforms'
  print '(A10, A12, A20)', '-------', '-------', '--------'

  do n = 3, 15
    total_skip = 2.0d0 ** (n - 1)
    print '(I10, I12, ES20.4)', n, n-1, total_skip
  end do

  print *, ''

  ! --- Model 5: Full combinatorial (skipping + intron retention) ---
  print *, '--- Model 5: Full Combinatorial Complexity ---'
  print *, 'Exon skipping AND intron retention combined.'
  print *, 'Upper bound = 2^(N-2) x 2^(N-1) = 2^(2N-3)'
  print *, ''
  print '(A10, A20)', 'N exons', 'Max isoforms'
  print '(A10, A20)', '-------', '------------'

  do n = 3, 12
    total_skip = 2.0d0 ** (2*n - 3)
    print '(I10, ES20.4)', n, total_skip
  end do

  print *, ''

  ! --- Summary for a typical human gene ---
  print *, '========================================================'
  print *, ' Summary: Typical Human Gene (~10 exons)'
  print *, '========================================================'
  n = 10
  print '(A, I6)',    '  Exons:                           ', n
  print '(A, I6)',    '  Internal exons:                  ', n-2
  print '(A, F12.0)', '  Exon-skip isoforms (2^8):        ', 2.0d0**8
  print '(A, F12.0)', '  With alt splice sites (4^8):     ', 4.0d0**8
  print '(A, F12.0)', '  With intron retention (2^17):    ', 2.0d0**17
  print *, ''
  print *, '  In practice, most genes produce 2-10 major isoforms'
  print *, '  due to regulatory constraints (SR proteins, hnRNPs).'
  print *, '  The human genome: ~20,000 genes -> ~100,000+ transcripts'
  print *, ''

end program splicing_combinatorics


Key Concepts Summary

Prokaryotic Transcription

Single RNAP ($\alpha_2\beta\beta'\omega$) + $\sigma$ factor
Promoter: -35 (TTGACA) + -10 (TATAAT)
Closed complex → open complex → elongation
Termination: intrinsic (hairpin + poly-U) or Rho-dependent
Speed: ~40-80 nt/s; mRNA generally needs no post-transcriptional processing

Eukaryotic Transcription

Three polymerases: Pol I (rRNA), Pol II (mRNA), Pol III (tRNA)
GTFs: TFIID, TFIIB, TFIIF, TFIIE, TFIIH → PIC
CTD code: Ser5-P (capping) → Ser2-P (elongation)
Promoter-proximal pausing: DSIF/NELF → P-TEFb release
Speed: ~20-30 nt/s; coupled to RNA processing

RNA Processing

5′ cap: m$^7$GpppN via guanylyltransferase
Splicing: spliceosome (U1-U6 snRNPs), lariat intermediate
Two transesterification reactions (branch-point A attack)
Alternative splicing: exon skip, alt SS, intron retention
3′ processing: AAUAAA, CPSF/CstF, ~200 nt poly(A) tail

Quantitative Relationships

$[\text{mRNA}]_{\text{ss}} = k_{\text{txn}}/k_{\text{deg}}$
$t_{1/2} = \ln 2 / k_{\text{deg}}$
Burst size = $k_{\text{init}}/k_{\text{off}}$
Fano factor > 1 indicates bursty expression
Dscam: 12 $\times$ 48 $\times$ 33 $\times$ 2 = 38,016 isoforms
