Part II: Transcriptomics | Chapter 8

Single-Cell & Spatial Transcriptomics

Resolving transcriptomic heterogeneity at single-cell resolution and mapping gene expression in spatial context

1. The Need for Single-Cell Resolution

Bulk RNA-Seq measures the average gene expression across millions of cells, obscuring the substantial heterogeneity that exists within seemingly homogeneous populations. This averaging effect is particularly problematic in tissues composed of multiple cell types (e.g., the brain contains hundreds of neuronal subtypes, glia, and vascular cells), in tumors exhibiting intratumoral heterogeneity, and in dynamic processes such as cellular differentiation where individual cells occupy different states along a continuum.

Simpson's Paradox in Bulk RNA-Seq

Bulk measurements can produce misleading results when cell composition changes between conditions. For example, if a treatment causes depletion of cell type A (which highly expresses gene X), bulk RNA-Seq would report gene X as downregulated, even if its expression within each cell type is unchanged. This compositional confounding is a form of Simpson's paradox that can only be resolved at single-cell resolution. Furthermore, bimodal gene expression distributions (where a gene is either fully ON or fully OFF in individual cells) appear as intermediate expression in bulk assays, masking the discrete, switch-like regulatory mechanisms governing many biological decisions.
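The compositional effect described above can be checked with a few lines of arithmetic. The sketch below uses made-up expression values and cell-type fractions (all numbers are illustrative assumptions, not measurements): per-cell expression of gene X is held constant, yet the bulk average drops because cell type A is depleted.

```python
# Hypothetical numbers chosen only to illustrate compositional confounding.
# Per-cell expression of gene X is identical in both conditions.
EXPR_X = {"A": 100.0, "B": 5.0}   # mean UMIs of gene X per cell, by cell type

def bulk_mean(fractions):
    """Bulk signal = composition-weighted average of per-type expression."""
    return sum(fractions[ct] * EXPR_X[ct] for ct in EXPR_X)

control = {"A": 0.50, "B": 0.50}   # cell-type fractions before treatment
treated = {"A": 0.10, "B": 0.90}   # type A depleted by the treatment

bulk_control = bulk_mean(control)
bulk_treated = bulk_mean(treated)
fold_change = bulk_treated / bulk_control

print(f"bulk control: {bulk_control:.1f}")
print(f"bulk treated: {bulk_treated:.1f}")
print(f"apparent fold change: {fold_change:.2f}")
```

In this toy setting the bulk assay reports a roughly 3.6-fold decrease even though no individual cell changed its expression of gene X.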

Single-cell RNA sequencing (scRNA-seq) addresses these limitations by profiling the transcriptome of individual cells. Since its first application by Tang et al. in 2009, the technology has undergone remarkable scaling: early studies profiled dozens of cells, while modern high-throughput platforms routinely capture tens of thousands to millions of cells per experiment. This scaling has been driven by innovations in microfluidics, combinatorial barcoding, and droplet-based cell partitioning.

Key Applications of scRNA-seq

Cell Atlas Projects: The Human Cell Atlas aims to create comprehensive reference maps of all human cell types and states using scRNA-seq data.

Tumor Heterogeneity: Identifying drug-resistant subclones, characterizing the tumor microenvironment, and predicting therapy response.

Developmental Biology: Mapping cell fate decisions, lineage hierarchies, and signaling gradients during embryogenesis.

Immunology: Characterizing immune cell diversity, clonal expansion of T cells, and immune responses to infection or vaccination.

2. Single-Cell RNA-Seq Technologies

Droplet-Based Methods

Droplet-based platforms encapsulate individual cells in nanoliter-scale aqueous droplets along with barcoded gel beads in an oil emulsion. Each gel bead carries millions of copies of an oligonucleotide containing: (1) a PCR handle for downstream amplification, (2) a cell barcode (16 nt in 10x Genomics) that uniquely identifies the cell, (3) a unique molecular identifier (UMI, 12 nt) for PCR duplicate identification, and (4) a poly(dT) sequence for mRNA capture.

Major scRNA-seq Platforms

Platform	Type	Throughput	Coverage	Sensitivity
10x Chromium	Droplet	500-10,000 cells	3' or 5' end	~1,500-3,000 genes/cell
Drop-seq	Droplet	1,000-10,000 cells	3' end	~500-1,500 genes/cell
Smart-seq2	Plate	96-384 cells	Full-length	~4,000-8,000 genes/cell
MARS-seq	Plate (FACS)	1,000-10,000 cells	3' end	~1,000-3,000 genes/cell
inDrop	Droplet	1,000-10,000 cells	3' end	~1,000-2,500 genes/cell

Unique Molecular Identifiers (UMIs)

UMIs are short random nucleotide sequences (typically 8-12 nt) attached to each captured mRNA molecule before PCR amplification. Because each original mRNA molecule receives its own UMI, all PCR duplicates of that molecule share the same combination of UMI, cell barcode, and gene. During analysis, reads sharing the same UMI and gene within the same cell barcode are collapsed to a single count, giving a digital count of the mRNA molecules that were actually captured rather than of their amplified copies. This largely removes PCR amplification bias, which is particularly severe in scRNA-seq because of the extremely low input amounts (5-20 pg of total RNA per cell).
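The collapsing step can be sketched as a set operation over (cell barcode, gene, UMI) triples. The reads below are invented for illustration, and the sketch assumes error-free UMIs; production tools also merge UMIs that differ only by sequencing errors, which is not attempted here.

```python
from collections import defaultdict

# Hypothetical aligned reads: (cell_barcode, gene, umi). Sequences are made up.
reads = [
    ("AAAC", "CD3E", "ACGTACGT"),
    ("AAAC", "CD3E", "ACGTACGT"),  # PCR duplicate of the read above
    ("AAAC", "CD3E", "TTGACCAA"),  # a second original molecule
    ("AAAC", "LYZ",  "ACGTACGT"),  # same UMI, different gene: separate molecule
    ("CCGT", "CD3E", "ACGTACGT"),  # same UMI, different cell: separate molecule
    ("CCGT", "CD3E", "ACGTACGT"),
    ("CCGT", "CD3E", "ACGTACGT"),
]

def count_umis(reads):
    """Collapse reads to unique (barcode, gene, UMI) triples, then count per cell and gene."""
    seen = defaultdict(set)
    for barcode, gene, umi in reads:
        seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}

counts = count_umis(reads)
for (barcode, gene), n_molecules in sorted(counts.items()):
    print(barcode, gene, n_molecules)
```

Seven reads collapse to four molecules here: two for CD3E and one for LYZ in the first cell, and one for CD3E in the second.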

The probability of a UMI collision (two different molecules receiving the same UMI) can be estimated by the birthday problem approximation: $$P(\text{collision}) \approx 1 - e^{-n^2 / (2 \cdot 4^k)}$$ where $n$ is the number of captured mRNA molecules per gene per cell (typically 1-100) and $k$ is the UMI length in nucleotides. For a 12-nt UMI ($4^{12} = 16{,}777{,}216$ possible sequences) and $n = 100$ molecules: $$P(\text{collision}) \approx 1 - e^{-100^2 / (2 \times 16{,}777{,}216)} \approx 0.0003$$ This extremely low collision rate ensures that UMI counting accurately reflects true molecular counts.
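As a quick check of the approximation above, the following sketch evaluates the same formula for several UMI lengths at $n = 100$. The function name is an illustrative choice.

```python
import math

def umi_collision_probability(n_molecules, umi_length):
    """Birthday-problem approximation: 1 - exp(-n^2 / (2 * 4^k))."""
    n_possible = 4 ** umi_length
    return 1.0 - math.exp(-(n_molecules ** 2) / (2.0 * n_possible))

for k in (8, 10, 12):
    p = umi_collision_probability(100, k)
    print(f"k={k:2d}  4^k={4 ** k:>10,}  P(collision)={p:.6f}")
```

Under this approximation an 8-nt UMI gives a collision probability of roughly 7% at 100 molecules, compared with about 0.03% for 12 nt, which illustrates why longer UMIs are preferred for highly expressed genes.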

Doublet Rate Estimation

A major quality concern in droplet-based scRNA-seq is the formation of doublets (or multiplets), where two or more cells are captured in the same droplet and assigned the same cell barcode. Doublets appear as artificial hybrid transcriptomes and can confound clustering and cell type identification.

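The doublet rate follows Poisson loading statistics. If $\lambda$ is the mean number of cells per droplet and $k$ is the number of cells in a given droplet, the fraction of cell-containing droplets that hold more than one cell is: $$P(\text{doublet}) = P(k \geq 2 \mid k \geq 1) = \frac{P(k \geq 2)}{P(k \geq 1)} = \frac{1 - P(0) - P(1)}{1 - P(0)}$$ with $P(0) = e^{-\lambda}$ and $P(1) = \lambda e^{-\lambda}$. For 10x Genomics, a commonly used rule of thumb is: $$\text{Doublet rate} \approx 0.8\% \times \frac{N_{cells}}{1000}$$ where $N_{cells}$ is the number of cells recovered. At a typical target of 10,000 recovered cells, the expected doublet rate is approximately 8%. Computational tools such as Scrublet and DoubletFinder use simulated doublets to identify and remove putative doublets from the dataset.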
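Both estimates can be evaluated directly. The sketch below computes the Poisson conditional probability $P(k \geq 2 \mid k \geq 1)$ for a given $\lambda$ and the 0.8%-per-1,000-cells approximation quoted in the text; the function names and the example value $\lambda = 0.1$ are assumptions for illustration.

```python
import math

def poisson_doublet_fraction(lam):
    """P(k >= 2 | k >= 1) when cells per droplet follow Poisson(lam)."""
    p0 = math.exp(-lam)          # empty droplet
    p1 = lam * math.exp(-lam)    # exactly one cell
    return (1.0 - p0 - p1) / (1.0 - p0)

def rule_of_thumb_doublet_rate(n_cells):
    """Empirical approximation from the text: about 0.8% per 1,000 cells."""
    return 0.008 * n_cells / 1000.0

print(f"lambda=0.1 -> {poisson_doublet_fraction(0.1):.4f}")
print(f"10,000 cells -> {rule_of_thumb_doublet_rate(10_000):.2%}")
```

For small $\lambda$ the Poisson expression is close to $\lambda / 2$, so halving the loading density roughly halves the doublet fraction at the cost of fewer captured cells.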

3. scRNA-seq Analysis Pipeline

The standard scRNA-seq analysis workflow begins with raw sequencing reads and proceeds through preprocessing, quality filtering, normalization, feature selection, dimensionality reduction, clustering, and biological interpretation. Two major software ecosystems dominate the field: Seurat (R) and Scanpy (Python/AnnData). Both implement similar algorithmic pipelines but differ in their data structures and syntax.

Preprocessing: Cell Ranger

For 10x Genomics data, Cell Ranger performs demultiplexing, barcode processing, alignment (using STAR), and UMI counting to produce a cell-by-gene count matrix. It distinguishes cell-containing droplets from empty droplets using a cell-calling algorithm based on EmptyDrops, which tests whether the RNA profile of each barcode differs significantly from the ambient RNA background (estimated from the barcodes with very low total UMI counts).

Quality Filtering

Quality control removes low-quality cells and potential artifacts. Three primary metrics are used:

QC Metric	Low-Quality Signature	Typical Threshold	Biological Interpretation
nFeature (genes detected)	Very low or very high	200-5000 genes	Low: empty droplet or dying cell; High: doublet
nCount (total UMIs)	Very low or very high	500-25,000 UMIs	Low: poor capture; High: doublet
percent.mt (mitochondrial %)	High (>10-20%)	<10-20%	High: cell lysis (cytoplasmic mRNA lost, mito mRNA retained)
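The three metrics in the table translate into a simple per-cell filter. The sketch below applies them to invented per-cell summaries; the field names and the specific cutoffs (including 15% mitochondrial reads) are assumptions picked from within the ranges above and would normally be tuned per tissue and chemistry.

```python
# Hypothetical per-cell QC summaries; thresholds follow the table above.
cells = [
    {"id": "c1", "n_genes": 1800, "n_umis": 6000,  "mt_umis": 240},
    {"id": "c2", "n_genes": 150,  "n_umis": 300,   "mt_umis": 15},    # too few genes and UMIs
    {"id": "c3", "n_genes": 6200, "n_umis": 40000, "mt_umis": 1200},  # likely doublet
    {"id": "c4", "n_genes": 900,  "n_umis": 2000,  "mt_umis": 700},   # 35% mitochondrial
]

def passes_qc(cell, min_genes=200, max_genes=5000,
              min_umis=500, max_umis=25000, max_pct_mt=15.0):
    """Return True if the cell lies inside all three QC windows."""
    pct_mt = 100.0 * cell["mt_umis"] / cell["n_umis"]
    return (min_genes <= cell["n_genes"] <= max_genes
            and min_umis <= cell["n_umis"] <= max_umis
            and pct_mt <= max_pct_mt)

kept = [c["id"] for c in cells if passes_qc(c)]
print("kept:", kept)
```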

Normalization & Batch Correction

scRNA-seq count data are normalized to remove cell-to-cell differences in sequencing depth. Seurat's default log-normalization divides each count by the total counts in that cell, multiplies by a scale factor (default 10,000), and applies a log(1 + x) transform. scran uses pool-based size factors computed by deconvolution of summed counts from cell pools, which is more robust for sparse, zero-rich data. SCTransform (implemented in Seurat v3+) uses regularized negative binomial regression to directly model the relationship between mean expression and variance, performing normalization, variance stabilization, and feature selection in a single step.
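The log-normalization steps can be sketched for a single cell as follows, assuming a log(1 + x) transform and the default scale factor of 10,000. The function name and the two example cells are illustrative.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Divide by the cell's total counts, scale, then apply log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c / total * scale_factor) for c in counts]

# Two hypothetical cells with the same composition but 10x different depth
cell_shallow = [10, 0, 90]
cell_deep = [100, 0, 900]

norm_shallow = log_normalize(cell_shallow)
norm_deep = log_normalize(cell_deep)
print([round(v, 3) for v in norm_shallow])
print([round(v, 3) for v in norm_deep])
```

Both cells end up with identical normalized profiles, which is the intended effect: the depth difference is removed while relative expression is retained.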

When integrating data from multiple batches, experiments, or technologies, batch correction methods such as Harmony, Scanorama, BBKNN, and Seurat CCA/RPCA integration align shared cell populations across batches while preserving biological differences. Harmony operates in PCA space, iteratively adjusting the principal component embeddings to remove batch effects while retaining biologically meaningful variation.

4. Dimensionality Reduction & Clustering

scRNA-seq datasets typically have thousands of cells and 2,000-5,000 highly variable genes. Direct analysis of this high-dimensional space is computationally expensive and statistically unreliable due to the curse of dimensionality. Dimensionality reduction and graph-based clustering are therefore central to scRNA-seq analysis.

PCA for Initial Reduction

Principal Component Analysis is typically the first dimensionality reduction step, reducing the data from ~2,000 highly variable genes to 10-50 principal components. The elbow plot (variance explained vs. PC number) or the JackStraw procedure (in Seurat) helps determine the optimal number of PCs to retain. These PCs serve as input for both non-linear visualization methods and graph-based clustering.

t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE converts pairwise similarities between data points in high-dimensional space to probabilities, then seeks a low-dimensional (typically 2D) embedding that preserves these relationships. It excels at revealing local cluster structure but does not reliably preserve global relationships (i.e., distances between clusters are not meaningful). The perplexity parameter (typically 30-50) controls the effective number of neighbors and significantly affects the visualization.

UMAP (Uniform Manifold Approximation and Projection)

UMAP has largely replaced t-SNE as the preferred visualization method because it better preserves global structure, is faster, and provides a more consistent embedding across runs. UMAP is based on Riemannian geometry and algebraic topology, constructing a high-dimensional graph representation and then optimizing a low-dimensional embedding to match this topology.

UMAP minimizes the cross-entropy between the high-dimensional and low-dimensional fuzzy topological representations. The cost function is: $$C = \sum_{i \neq j} \left[ v_{ij} \log\left(\frac{v_{ij}}{w_{ij}}\right) + (1 - v_{ij}) \log\left(\frac{1 - v_{ij}}{1 - w_{ij}}\right) \right]$$ where $v_{ij}$ represents the membership strength of the edge between points $i$ and $j$ in the high-dimensional fuzzy simplicial set: $$v_{ij} = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)$$ with $\rho_i$ being the distance to the nearest neighbor and $\sigma_i$ chosen such that each point has approximately $k$ effective neighbors (controlled by the n_neighbors parameter). In the low-dimensional space: $$w_{ij} = \left(1 + a \|y_i - y_j\|^{2b}\right)^{-1}$$ Key parameters: n_neighbors (typically 15-30) controls local vs. global structure balance; min_dist (typically 0.1-0.5) controls how tightly points are packed.
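To make the terms of the cost function concrete, the sketch below evaluates the high-dimensional weight, the low-dimensional weight, and the per-edge cross-entropy for single values. It is not the umap-learn implementation: the defaults a = 1 and b = 1 are placeholder assumptions (in practice these are fitted from min_dist), and the clamping constant eps is added only to keep the logarithms finite.

```python
import math

def high_dim_weight(d, rho, sigma):
    """v = exp(-(d - rho) / sigma), with distances below rho treated as weight 1."""
    return math.exp(-max(0.0, d - rho) / sigma)

def low_dim_weight(dist, a=1.0, b=1.0):
    """w = 1 / (1 + a * dist^(2b)); a and b are placeholder values here."""
    return 1.0 / (1.0 + a * dist ** (2 * b))

def edge_cross_entropy(v, w, eps=1e-12):
    """Attractive plus repulsive term of the cost for one pair (i, j)."""
    w = min(max(w, eps), 1.0 - eps)
    attract = v * math.log(v / w) if v > 0 else 0.0
    repel = (1.0 - v) * math.log((1.0 - v) / (1.0 - w)) if v < 1 else 0.0
    return attract + repel

v = high_dim_weight(d=1.5, rho=1.0, sigma=0.5)   # a strong high-dimensional edge
for dist in (0.5, 2.0, 5.0):
    w = low_dim_weight(dist)
    print(f"embedding distance {dist}: w={w:.3f}, cost={edge_cross_entropy(v, w):.3f}")
```

The cost for this strongly connected pair grows as the two points are placed further apart in the embedding, which is the attractive force that pulls neighbors together during optimization.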

Graph-Based Clustering: Leiden & Louvain

Both algorithms operate on a k-nearest neighbor (KNN) graph constructed in PC space, where cells are nodes and edges connect cells with similar expression profiles (weighted by Jaccard similarity in the shared nearest neighbor variant). The Louvain algorithm partitions the graph by iteratively optimizing modularity, a measure of how densely connected nodes within communities are compared to a random graph. The Leiden algorithm (an improvement over Louvain) guarantees well-connected communities by introducing a refinement phase that avoids arbitrarily poorly connected clusters. The resolution parameter controls cluster granularity: higher values yield more clusters.
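The graph-construction step that precedes community detection can be sketched in pure Python. The example builds a KNN graph from made-up two-dimensional "PC scores" and weights each edge by the Jaccard overlap of the two neighborhoods. Including each cell in its own neighborhood is an assumption of this sketch, and the brute-force neighbor search is only suitable for toy data.

```python
import math

def knn(points, k):
    """For each point, return the set of indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def snn_jaccard(neighbors):
    """Weight each KNN edge by the Jaccard overlap of the two neighborhoods."""
    weights = {}
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            a = neighbors[i] | {i}
            b = neighbors[j] | {j}
            weights[(min(i, j), max(i, j))] = len(a & b) / len(a | b)
    return weights

# Hypothetical PC scores: two tight groups of three cells each
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
weights = snn_jaccard(knn(points, k=2))
for edge, w in sorted(weights.items()):
    print(edge, round(w, 2))
```

All edges stay within a group and none connect the two groups, so a modularity-based method such as Louvain or Leiden run on this weighted graph would be expected to recover the two groups as separate communities.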

Cell Type Annotation Strategies

Marker Gene Approach: Identify differentially expressed marker genes for each cluster (Wilcoxon rank-sum test) and match against known cell-type markers from literature or databases (CellMarker, PanglaoDB).

Reference-Based Transfer: Transfer cell type labels from annotated reference datasets using tools like SingleR, scmap, or Seurat label transfer (based on anchor identification in shared PCA space).

Machine Learning Classifiers: Train supervised models (e.g., CellTypist, scANVI) on large reference atlases to automatically classify cells in query datasets with high accuracy and scalability.
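The marker gene approach can be reduced to a small scoring function. The sketch below averages the expression of a few well-known lineage markers per candidate cell type and picks the highest score. The marker lists are deliberately minimal, the cluster averages are invented, and real annotation would use curated marker databases and statistical testing as described above.

```python
# Minimal marker lists using gene names from this chapter's examples.
MARKERS = {
    "T-cell": ["CD3E"],
    "B-cell": ["CD19", "MS4A1", "CD79A"],
    "Monocyte": ["CD14", "LYZ"],
}

def annotate_cluster(mean_expr):
    """Score each cell type by the mean expression of its markers; return the best label."""
    scores = {
        cell_type: sum(mean_expr.get(g, 0.0) for g in genes) / len(genes)
        for cell_type, genes in MARKERS.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical cluster-average log expression values
cluster_means = {"CD3E": 0.3, "CD19": 4.6, "MS4A1": 5.0,
                 "CD79A": 4.4, "CD14": 0.2, "LYZ": 0.5}
label, scores = annotate_cluster(cluster_means)
print(label, {k: round(v, 2) for k, v in scores.items()})
```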

Interactive Example: PCA Dimensionality Reduction on Simulated Single-Cell Data

This Python script simulates a single-cell gene expression matrix with three distinct cell types, performs PCA from scratch (using eigendecomposition of the covariance matrix), and shows how cells separate in PC space. It also computes the variance explained by each component.

Python

import numpy as np

np.random.seed(42)

# Simulate 3 cell types with distinct expression profiles
n_cells_per_type = 30
n_genes = 8
gene_names = ["CD3E", "CD19", "CD14", "GZMB", "MS4A1", "LYZ", "NKG7", "CD79A"]
cell_types = ["T-cell", "B-cell", "Monocyte"]

# Mean expression per gene per cell type (log-scale)
profiles = {
    "T-cell":   [5.2, 0.3, 0.2, 4.1, 0.1, 0.5, 3.8, 0.2],
    "B-cell":   [0.4, 4.8, 0.3, 0.2, 5.1, 0.4, 0.3, 4.5],
    "Monocyte": [0.3, 0.2, 5.5, 0.1, 0.3, 5.8, 0.2, 0.1],
}

# Generate simulated expression data
data = []
labels = []
for ct in cell_types:
    for _ in range(n_cells_per_type):
        expr = [max(0, np.random.normal(m, 0.8)) for m in profiles[ct]]
        data.append(expr)
        labels.append(ct)

X = np.array(data)
n_total = X.shape[0]

print(f"Simulated dataset: {n_total} cells x {n_genes} genes")
print(f"Cell types: {', '.join(cell_types)} ({n_cells_per_type} each)")
print()

# Show first few cells
print("=== Sample of Raw Expression Matrix ===")
print(f"{'Cell':<12}{'Type':<12}" + "".join([f"{g:>8}" for g in gene_names]))
print("-" * (24 + 8 * n_genes))
for i in [0, 1, 30, 31, 60, 61]:
    vals = "".join([f"{X[i,j]:>8.2f}" for j in range(n_genes)])
    print(f"{'Cell_'+str(i+1):<12}{labels[i]:<12}{vals}")
print("...")

# Step 1: Center the data
X_centered = X - X.mean(axis=0)

# Step 2: Covariance matrix
cov_matrix = np.cov(X_centered, rowvar=False)

# Step 3: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort descending
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Variance explained
var_explained = eigenvalues / eigenvalues.sum() * 100

print("\n=== PCA Results ===")
print(f"{'PC':<6}{'Eigenvalue':>12}{'Var Explained':>15}{'Cumulative':>12}")
print("-" * 45)
cum = 0
for i in range(min(6, n_genes)):
    cum += var_explained[i]
    print(f"PC{i+1:<4}{eigenvalues[i]:>12.3f}{var_explained[i]:>14.1f}%{cum:>11.1f}%")

# Project data onto PC1 and PC2
pc_scores = X_centered @ eigenvectors[:, :2]

print("\n=== Cell Positions in PC Space (PC1 vs PC2) ===")
print(f"{'Type':<12}{'PC1 mean':>10}{'PC1 std':>10}{'PC2 mean':>10}{'PC2 std':>10}")
print("-" * 52)
for ct in cell_types:
    mask = np.array(labels) == ct  # boolean mask selecting cells of this type
    pc1_vals = pc_scores[mask, 0]
    pc2_vals = pc_scores[mask, 1]
    print(f"{ct:<12}{pc1_vals.mean():>10.3f}{pc1_vals.std():>10.3f}"
          f"{pc2_vals.mean():>10.3f}{pc2_vals.std():>10.3f}")

# Top loading genes for PC1 and PC2
print("\n=== Top Gene Loadings ===")
print(f"{'Gene':<10}{'PC1 loading':>14}{'PC2 loading':>14}")
print("-" * 38)
for i in range(n_genes):
    print(f"{gene_names[i]:<10}{eigenvectors[i,0]:>14.4f}{eigenvectors[i,1]:>14.4f}")

print("\nInterpretation: Cell types form distinct clusters in PC space.")
print(f"PC1 ({var_explained[0]:.1f}%) and PC2 ({var_explained[1]:.1f}%) capture the major variation.")

Interactive Example: K-Means Cell Clustering by Gene Expression (Fortran)

This Fortran program implements k-means clustering on simulated single-cell gene expression data with 3 cell populations. It iterates until convergence and reports cluster assignments, centroids, and within-cluster sum of squares.

Fortran

program kmeans_cell_clustering
  implicit none
  integer, parameter :: ncells = 30, ngenes = 4, k = 3, max_iter = 50
  double precision :: data(ncells, ngenes)
  double precision :: centroids(k, ngenes), new_centroids(k, ngenes)
  integer :: assignment(ncells), counts(k)
  double precision :: dist, min_dist, wcss(k), total_wcss
  integer :: i, j, g, iter, min_idx, changed
  logical :: converged

  ! Simulated expression data: 3 clusters of 10 cells each
  ! Cluster 1 (T-cells): high gene1,gene2; low gene3,gene4
  do i = 1, 10
    data(i,1) = 5.0d0 + 0.5d0 * (i - 5.5d0) / 5.0d0
    data(i,2) = 4.5d0 + 0.4d0 * (i - 5.5d0) / 5.0d0
    data(i,3) = 0.5d0 + 0.2d0 * (i - 5.5d0) / 5.0d0
    data(i,4) = 0.3d0 + 0.15d0 * (i - 5.5d0) / 5.0d0
  end do
  ! Cluster 2 (B-cells): low gene1; high gene3,gene4
  do i = 11, 20
    data(i,1) = 0.4d0 + 0.3d0 * (i - 15.5d0) / 5.0d0
    data(i,2) = 0.6d0 + 0.2d0 * (i - 15.5d0) / 5.0d0
    data(i,3) = 5.2d0 + 0.6d0 * (i - 15.5d0) / 5.0d0
    data(i,4) = 4.8d0 + 0.5d0 * (i - 15.5d0) / 5.0d0
  end do
  ! Cluster 3 (Monocytes): medium gene1; high gene2; low others
  do i = 21, 30
    data(i,1) = 2.5d0 + 0.4d0 * (i - 25.5d0) / 5.0d0
    data(i,2) = 5.8d0 + 0.3d0 * (i - 25.5d0) / 5.0d0
    data(i,3) = 1.0d0 + 0.3d0 * (i - 25.5d0) / 5.0d0
    data(i,4) = 0.8d0 + 0.2d0 * (i - 25.5d0) / 5.0d0
  end do

  write(*,'(A)') '============================================================'
  write(*,'(A)') '   K-Means Clustering of Single-Cell Expression Data'
  write(*,'(A,I3,A,I2,A,I2,A)') '   (', ncells, ' cells, ', ngenes, &
    ' genes, k=', k, ')'
  write(*,'(A)') '============================================================'
  write(*,'(A)')

  ! Print sample data
  write(*,'(A)') '--- Sample Input Data ---'
  write(*,'(A8,A10,A10,A10,A10)') 'Cell', 'Gene1', 'Gene2', 'Gene3', 'Gene4'
  write(*,'(A)') '------------------------------------------------'
  do i = 1, ncells, 5
    write(*,'(A4,I4,4F10.3)') 'C', i, (data(i,g), g=1,ngenes)
  end do
  write(*,'(A)') '...'
  write(*,'(A)')

  ! Initialize centroids from one cell of each simulated group (cells 1, 11, 21)
  do g = 1, ngenes
    centroids(1,g) = data(1,g)
    centroids(2,g) = data(11,g)
    centroids(3,g) = data(21,g)
  end do

  ! K-means iteration
  assignment = 0   ! no cell is assigned yet, so iteration 1 counts every cell as changed
  converged = .false.
  do iter = 1, max_iter
    changed = 0

    ! Assignment step
    do i = 1, ncells
      min_dist = 1.0d10
      min_idx = 1
      do j = 1, k
        dist = 0.0d0
        do g = 1, ngenes
          dist = dist + (data(i,g) - centroids(j,g))**2
        end do
        if (dist < min_dist) then
          min_dist = dist
          min_idx = j
        end if
      end do
      if (assignment(i) /= min_idx) changed = changed + 1
      assignment(i) = min_idx
    end do

    ! Update step
    new_centroids = 0.0d0
    counts = 0
    do i = 1, ncells
      j = assignment(i)
      counts(j) = counts(j) + 1
      do g = 1, ngenes
        new_centroids(j,g) = new_centroids(j,g) + data(i,g)
      end do
    end do

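    do j = 1, k
      if (counts(j) > 0) then
        do g = 1, ngenes
          new_centroids(j,g) = new_centroids(j,g) / counts(j)
        end do
      else
        ! Empty cluster: keep the previous centroid instead of resetting it to zero
        new_centroids(j,:) = centroids(j,:)
      end if
    end do

    centroids = new_centroids

    ! One progress line per iteration; format items match the output list one-to-one
    write(*,'(A,I3,A,I3,A,I3,2(A,I3),A)') 'Iter ', iter, &
      ':  changed=', changed, '  sizes=[', &
      counts(1), ',', counts(2), ',', counts(3), ']'

    if (changed == 0) then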
      converged = .true.
      exit
    end if
  end do

  write(*,'(A)')
  if (converged) then
    write(*,'(A,I3,A)') 'Converged after ', iter, ' iterations.'
  else
    write(*,'(A)') 'Did not converge within max iterations.'
  end if
  write(*,'(A)')

  ! Report final centroids
  write(*,'(A)') '--- Final Cluster Centroids ---'
  write(*,'(A10,A10,A10,A10,A10,A8)') 'Cluster', 'Gene1', 'Gene2', &
    'Gene3', 'Gene4', 'Size'
  write(*,'(A)') '------------------------------------------------------'
  do j = 1, k
    write(*,'(A6,I4,4F10.3,I8)') 'Cl', j, &
      (centroids(j,g), g=1,ngenes), counts(j)
  end do
  write(*,'(A)')

  ! Compute WCSS
  total_wcss = 0.0d0
  do j = 1, k
    wcss(j) = 0.0d0
  end do
  do i = 1, ncells
    j = assignment(i)
    do g = 1, ngenes
      wcss(j) = wcss(j) + (data(i,g) - centroids(j,g))**2
    end do
  end do
  do j = 1, k
    total_wcss = total_wcss + wcss(j)
  end do

  write(*,'(A)') '--- Within-Cluster Sum of Squares (WCSS) ---'
  do j = 1, k
    write(*,'(A,I2,A,F10.4)') 'Cluster ', j, ': WCSS = ', wcss(j)
  end do
  write(*,'(A,F10.4)') 'Total WCSS: ', total_wcss
  write(*,'(A)')

  ! Show cell assignments
  write(*,'(A)') '--- Cell Assignments ---'
  write(*,'(A8,A10,A20)') 'Cell', 'Cluster', 'True Group'
  write(*,'(A)') '--------------------------------------'
  do i = 1, ncells
    if (i <= 10) then
      write(*,'(A4,I4,I10,A20)') 'C', i, assignment(i), 'Group 1 (T-cell)'
    else if (i <= 20) then
      write(*,'(A4,I4,I10,A20)') 'C', i, assignment(i), 'Group 2 (B-cell)'
    else
      write(*,'(A4,I4,I10,A20)') 'C', i, assignment(i), 'Group 3 (Monocyte)'
    end if
  end do
end program kmeans_cell_clustering

5. Trajectory Analysis & RNA Velocity

Biological processes such as differentiation, activation, and cell cycle progression are continuous rather than discrete. Trajectory analysis (pseudotime inference) orders cells along these continuous processes based on their transcriptomic similarity, reconstructing the temporal dynamics of gene expression without requiring time-series experiments.

Pseudotime Methods

Monocle 3 constructs a principal graph (tree or graph structure) through the UMAP embedding using the SimplePPT algorithm and assigns each cell a pseudotime value based on its geodesic distance from a specified root cell. Slingshot builds a minimum spanning tree over cluster centroids in reduced-dimensional space to identify lineages, then fits simultaneous principal curves (one per lineage) to handle bifurcating trajectories. PAGA (Partition-based Graph Abstraction, in Scanpy) provides a coarse-grained map of connectivity between cell clusters, which can guide more detailed trajectory inference.
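The idea of pseudotime as geodesic distance from a root can be illustrated with a shortest-path computation on a small graph. This is a generic Dijkstra sketch rather than the procedure of any particular package; the node names and edge weights are invented.

```python
import heapq

def geodesic_pseudotime(graph, root):
    """Shortest-path (Dijkstra) distance from the root over a weighted cell graph."""
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical branching trajectory: stem -> progenitor -> (fate A | fate B)
graph = {
    "stem": [("progenitor", 1.0)],
    "progenitor": [("stem", 1.0), ("fateA", 2.0), ("fateB", 1.5)],
    "fateA": [("progenitor", 2.0)],
    "fateB": [("progenitor", 1.5)],
}
pseudotime = geodesic_pseudotime(graph, "stem")
for node, t in sorted(pseudotime.items(), key=lambda kv: kv[1]):
    print(f"{node:<12}{t:.1f}")
```

Because distances are measured along the graph, cells on the two branches each receive a pseudotime relative to the shared root, even though the branches themselves are not ordered relative to one another.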

RNA Velocity

RNA velocity, introduced by La Manno et al. (2018), infers the future transcriptional state of each cell by exploiting the ratio of unspliced (nascent) to spliced (mature) mRNA molecules. Standard scRNA-seq captures both spliced and unspliced reads, distinguished by the presence of intronic sequences. The key insight is that for a gene being upregulated, unspliced mRNA accumulates before the corresponding increase in spliced mRNA, and vice versa for downregulation.

The RNA velocity model describes the temporal dynamics of spliced ($s$) and unspliced ($u$) mRNA for each gene: $$\frac{du}{dt} = \alpha(t) - \beta u$$ $$\frac{ds}{dt} = \beta u - \gamma s$$ where $\alpha(t)$ is the transcription rate, $\beta$ is the splicing rate, and $\gamma$ is the degradation rate. At steady state ($du/dt = 0, ds/dt = 0$): $$s_{steady} = \frac{\beta}{\gamma} u_{steady} = \frac{\alpha}{\gamma}$$ The RNA velocity for each gene in each cell is estimated as the deviation from this steady-state relationship: $$v_s = \frac{ds}{dt} = \beta u - \gamma s$$ Positive velocity indicates the gene is being upregulated; negative velocity indicates downregulation. The velocity vector for each cell is projected onto the UMAP or t-SNE embedding to visualize the predicted direction of transcriptional change, producing the characteristic arrow plots.
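The two differential equations can be integrated numerically to see how the sign of the velocity tracks induction and repression. The sketch below uses forward Euler with constant $\alpha$ and arbitrary rate constants (all parameter values are assumptions for illustration); it is not the estimation procedure used by velocity software, which must infer these rates from data.

```python
def velocity(u, s, beta, gamma):
    """RNA velocity of spliced mRNA: ds/dt = beta * u - gamma * s."""
    return beta * u - gamma * s

def simulate(alpha, beta, gamma, u0=0.0, s0=0.0, dt=0.01, steps=2000):
    """Forward-Euler integration of du/dt = alpha - beta*u, ds/dt = beta*u - gamma*s."""
    u, s = u0, s0
    trajectory = []
    for _ in range(steps):
        du = alpha - beta * u
        ds = beta * u - gamma * s
        u += dt * du
        s += dt * ds
        trajectory.append((u, s))
    return trajectory

beta, gamma = 1.0, 0.5
# Induction: gene switched on from zero expression
induction = simulate(alpha=10.0, beta=beta, gamma=gamma)
# Repression: transcription switched off, starting from the induced steady state
repression = simulate(alpha=0.0, beta=beta, gamma=gamma, u0=10.0, s0=20.0)

u_end, s_end = induction[-1]
print(f"induced steady state: u={u_end:.2f}, s={s_end:.2f}")
print(f"velocity during induction:  {velocity(*induction[100], beta, gamma):+.2f}")
print(f"velocity during repression: {velocity(*repression[100], beta, gamma):+.2f}")
```

With these values the simulation approaches $u = \alpha/\beta = 10$ and $s = \alpha/\gamma = 20$, and the velocity is positive while the gene is being induced and negative after transcription is switched off.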

Shannon Entropy for Cell State Diversity

Shannon entropy can quantify the transcriptomic diversity within a cell population or the differentiation potential of individual cells. Higher entropy indicates a more uniform distribution of gene expression across many genes (characteristic of progenitor/stem cells), while lower entropy indicates specialization (characteristic of differentiated cells).

For a cell $c$ with expression profile normalized to proportions $p_i = x_i / \sum_j x_j$: $$H(c) = -\sum_{i=1}^{G} p_i \log_2(p_i)$$ where $G$ is the number of genes and terms with $p_i = 0$ contribute zero. The maximum entropy $H_{max} = \log_2(G)$ occurs when all genes are equally expressed. A normalized specificity score (one minus the normalized entropy, not to be confused with the Kullback-Leibler relative entropy) is: $$S(c) = 1 - \frac{H(c)}{H_{max}}$$ Similarly, mutual information between gene expression and cell type can identify genes most informative for distinguishing cell types: $$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log_2 \frac{p(x,y)}{p(x)p(y)}$$
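These quantities are straightforward to compute. The sketch below implements $H(c)$, $S(c)$, and $I(X; Y)$ for small hand-made inputs; the expression vectors and the ON/OFF-by-cell-type contingency counts are invented for illustration.

```python
import math

def shannon_entropy(expr):
    """H = -sum p_i log2 p_i over genes with non-zero expression."""
    total = sum(expr)
    return -sum((x / total) * math.log2(x / total) for x in expr if x > 0)

def specificity(expr):
    """S = 1 - H / log2(G), where G is the number of genes."""
    return 1.0 - shannon_entropy(expr) / math.log2(len(expr))

def mutual_information(joint):
    """I(X;Y) in bits from a dict {(x, y): count}."""
    total = sum(joint.values())
    px, py = {}, {}
    for (x, y), c in joint.items():
        px[x] = px.get(x, 0.0) + c / total
        py[y] = py.get(y, 0.0) + c / total
    return sum((c / total) * math.log2((c / total) / (px[x] * py[y]))
               for (x, y), c in joint.items() if c > 0)

uniform_cell = [5, 5, 5, 5]        # expression spread evenly over 4 genes
specialized_cell = [97, 1, 1, 1]   # dominated by a single gene
print(f"uniform:     H={shannon_entropy(uniform_cell):.3f}  S={specificity(uniform_cell):.3f}")
print(f"specialized: H={shannon_entropy(specialized_cell):.3f}  S={specificity(specialized_cell):.3f}")

perfect_marker = {("on", "T"): 50, ("off", "B"): 50}
uninformative = {("on", "T"): 25, ("off", "T"): 25, ("on", "B"): 25, ("off", "B"): 25}
print(f"MI perfect marker: {mutual_information(perfect_marker):.3f} bits")
print(f"MI uninformative:  {mutual_information(uninformative):.3f} bits")
```

A gene that is ON in exactly one of two equally sized cell types carries one full bit of information about cell type, while a gene whose ON/OFF state is independent of cell type carries none.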

6. Spatial Transcriptomics

While scRNA-seq provides single-cell resolution, it requires tissue dissociation, which destroys spatial context. Spatial transcriptomics technologies measure gene expression while preserving the original tissue architecture, enabling the study of cell-cell communication, tissue organization, and spatial gene expression gradients. The field has expanded rapidly, with Nature Methods naming spatially resolved transcriptomics the 2020 Method of the Year.

Spatial Transcriptomics Technologies

10x Visium

Sequencing-based approach where tissue sections are placed on slides containing arrays of barcoded oligonucleotide spots, each ~55 micrometers in diameter (capturing roughly 1-10 cells per spot). mRNA is captured, reverse-transcribed with spatial barcodes, and sequenced. Provides whole-transcriptome data at multicellular (spot-level) rather than single-cell resolution. Visium HD narrows this gap with 2-micrometer barcoded squares, which are typically aggregated into larger bins for analysis.

MERFISH (Multiplexed Error-Robust FISH)

Imaging-based approach using combinatorial FISH with error-robust binary barcoding. Each RNA species is assigned a unique binary barcode across multiple rounds of hybridization and imaging. Achieves subcellular resolution for hundreds to thousands of genes simultaneously. Developed by Xiaowei Zhuang's group and commercialized by Vizgen (MERSCOPE platform).

seqFISH / seqFISH+

Sequential fluorescence in situ hybridization uses multiple rounds of probe hybridization, imaging, and probe removal to assign barcodes to individual RNA molecules. seqFISH+ can profile over 10,000 genes at single-molecule resolution in individual cells within intact tissue sections. Similar in principle to MERFISH but uses a different barcoding strategy.

Slide-seq / Slide-seqV2

Uses a dense array of DNA-barcoded beads (10 micrometer diameter) on a rubber-coated glass surface. Tissue sections are transferred onto the bead array, and mRNA is captured and sequenced with spatial barcode information. Achieves near-cellular resolution across the entire transcriptome.

Deconvolution & Integration

For sequencing-based spatial methods where each spot contains multiple cells, deconvolution methods estimate the cell-type composition of each spot by integrating spatial data with a reference scRNA-seq dataset. Tools such as cell2location, RCTD (Robust Cell Type Decomposition), SPOTlight, and Tangram use different statistical frameworks (Bayesian models, NMF, optimal transport) to map single-cell reference profiles onto spatial spots. These methods approximate single-cell resolution computationally from multi-cell spatial measurements; the result is an estimate of cell-type proportions per spot, not a direct measurement of individual cells.
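The underlying idea, that a spot's profile is modeled as a mixture of reference cell-type profiles, can be shown with a deliberately simplified two-cell-type case solved by least squares. This toy is not how cell2location, RCTD, SPOTlight, or Tangram work internally; the reference profiles, the cell-type names, and the noise-free spot are assumptions made for illustration.

```python
def mixing_fraction(spot, ref_a, ref_b):
    """Least-squares f in spot ~= f*ref_a + (1 - f)*ref_b, clipped to [0, 1]."""
    diff = [a - b for a, b in zip(ref_a, ref_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("reference profiles are identical")
    f = sum((y - b) * d for y, b, d in zip(spot, ref_b, diff)) / denom
    return min(1.0, max(0.0, f))

# Hypothetical mean expression of three genes in two reference cell types
ref_neuron = [10.0, 0.0, 2.0]
ref_glia = [0.0, 8.0, 2.0]

# A noise-free spot made of 30% neuron and 70% glia signal
spot = [0.3 * a + 0.7 * b for a, b in zip(ref_neuron, ref_glia)]

f_neuron = mixing_fraction(spot, ref_neuron, ref_glia)
print(f"estimated neuron fraction: {f_neuron:.2f}")
```

Real methods generalize this to many cell types and thousands of genes, and add noise models, platform-effect terms, and non-negativity constraints.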

Multi-Omics Integration

Emerging technologies combine single-cell transcriptomics with other modalities: CITE-seq simultaneously measures mRNA and surface protein (via antibody-derived tags, ADTs) using oligonucleotide-conjugated antibodies; SHARE-seq and 10x Multiome jointly profile chromatin accessibility (ATAC-seq) and gene expression; scM&T-seq combines single-cell bisulfite sequencing with transcriptomics. Computational integration methods like WNN (Weighted Nearest Neighbor) in Seurat v4 and MOFA+ identify shared and modality-specific sources of variation, providing a comprehensive view of cellular identity and regulation.

7. Key Concepts & Chapter Summary

Comparing scRNA-seq Approaches

Property	Droplet-Based	Plate-Based	Spatial
Cell throughput	High (10^3-10^6)	Low (10^2-10^3)	Medium-High
Sensitivity (genes/cell)	1,500-3,000	4,000-8,000	Variable (100-10,000)
Transcript coverage	3' or 5' end	Full-length	Targeted or whole
Spatial information	Lost	Lost	Preserved
Cost per cell	$0.01-0.10	$1-10	Variable

Chapter Summary

● Single-cell RNA-seq resolves transcriptomic heterogeneity invisible to bulk methods, revealing rare cell types, continuous transitions, and cell-to-cell variability.
● Droplet-based methods (10x Chromium) offer high throughput with 3'/5' end coverage, while plate-based methods (Smart-seq2) provide full-length transcripts at lower throughput.
● UMIs largely remove PCR amplification bias by enabling digital counting of captured mRNA molecules; doublet detection is essential for data quality.
● The standard analysis pipeline encompasses QC filtering, normalization, PCA, UMAP visualization, graph-based clustering (Leiden/Louvain), and cell type annotation.
● RNA velocity leverages the ratio of unspliced to spliced mRNA to predict future cell states, adding temporal directionality to snapshot data.
● Spatial transcriptomics (Visium, MERFISH, seqFISH) preserves tissue context; deconvolution methods integrate scRNA-seq references to estimate the cell-type composition of multi-cell spots.
● Multi-omics integration (CITE-seq, Multiome, WNN) combines transcriptomics with protein, chromatin, and epigenomic measurements for comprehensive cellular profiling.
