Single-Cell & Spatial Transcriptomics
Resolving transcriptomic heterogeneity at single-cell resolution and mapping gene expression in spatial context
1. The Need for Single-Cell Resolution
Bulk RNA-Seq measures the average gene expression across millions of cells, obscuring the substantial heterogeneity that exists within seemingly homogeneous populations. This averaging effect is particularly problematic in tissues composed of multiple cell types (e.g., the brain contains hundreds of neuronal subtypes, glia, and vascular cells), in tumors exhibiting intratumoral heterogeneity, and in dynamic processes such as cellular differentiation where individual cells occupy different states along a continuum.
Simpson's Paradox in Bulk RNA-Seq
Bulk measurements can produce misleading results when cell composition changes between conditions. For example, if a treatment causes depletion of cell type A (which highly expresses gene X), bulk RNA-Seq would report gene X as downregulated, even if its expression within each cell type is unchanged. This compositional confounding is a form of Simpson's paradox that can only be resolved at single-cell resolution. Furthermore, bimodal gene expression distributions (where a gene is either fully ON or fully OFF in individual cells) appear as intermediate expression in bulk assays, masking the discrete, switch-like regulatory mechanisms governing many biological decisions.
Single-cell RNA sequencing (scRNA-seq) addresses these limitations by profiling the transcriptome of individual cells. Since its first application by Tang et al. in 2009, the technology has undergone remarkable scaling: early studies profiled dozens of cells, while modern high-throughput platforms routinely capture tens of thousands to millions of cells per experiment. This scaling has been driven by innovations in microfluidics, combinatorial barcoding, and droplet-based cell partitioning.
Key Applications of scRNA-seq
Cell Atlas Projects: The Human Cell Atlas aims to create comprehensive reference maps of all human cell types and states using scRNA-seq data.
Tumor Heterogeneity: Identifying drug-resistant subclones, characterizing the tumor microenvironment, and predicting therapy response.
Developmental Biology: Mapping cell fate decisions, lineage hierarchies, and signaling gradients during embryogenesis.
Immunology: Characterizing immune cell diversity, clonal expansion of T cells, and immune responses to infection or vaccination.
2. Single-Cell RNA-Seq Technologies
Droplet-Based Methods
Droplet-based platforms encapsulate individual cells in nanoliter-scale aqueous droplets along with barcoded gel beads in an oil emulsion. Each gel bead carries millions of copies of an oligonucleotide containing: (1) a PCR handle for downstream amplification, (2) a cell barcode (16 nt in 10x Genomics) that uniquely identifies the cell, (3) a unique molecular identifier (UMI, 12 nt) for PCR duplicate identification, and (4) a poly(dT) sequence for mRNA capture.
Major scRNA-seq Platforms
| Platform | Type | Throughput | Coverage | Sensitivity |
|---|---|---|---|---|
| 10x Chromium | Droplet | 500-10,000 cells | 3' or 5' end | ~1,500-3,000 genes/cell |
| Drop-seq | Droplet | 1,000-10,000 cells | 3' end | ~500-1,500 genes/cell |
| Smart-seq2 | Plate | 96-384 cells | Full-length | ~4,000-8,000 genes/cell |
| MARS-seq | Plate (FACS) | 1,000-10,000 cells | 3' end | ~1,000-3,000 genes/cell |
| inDrop | Droplet | 1,000-10,000 cells | 3' end | ~1,000-2,500 genes/cell |
Unique Molecular Identifiers (UMIs)
UMIs are short random nucleotide sequences (typically 8-12 nt) attached to each captured mRNA molecule before PCR amplification. Because each original mRNA molecule receives a distinct UMI, all PCR duplicates of the same molecule share the same UMI + cell barcode + gene identity combination. During analysis, reads sharing the same UMI within the same cell barcode are collapsed to a single count, providing an accurate digital measure of absolute transcript numbers. This eliminates PCR amplification bias, which is particularly severe in scRNA-seq due to the extremely low input amounts (5-20 pg of total RNA per cell).
The probability of a UMI collision (two different molecules receiving the same UMI) can be estimated by the birthday problem approximation: $$P(\text{collision}) \approx 1 - e^{-n^2 / (2 \cdot 4^k)}$$ where \(n\) is the number of captured mRNA molecules per gene per cell (typically 1-100) and \(k\) is the UMI length in nucleotides. For a 12-nt UMI (\(4^{12} = 16{,}777{,}216\) possible sequences) and \(n = 100\) molecules: $$P(\text{collision}) \approx 1 - e^{-100^2 / (2 \times 16{,}777{,}216)} \approx 0.0003$$ This extremely low collision rate ensures that UMI counting accurately reflects true molecular counts.
Doublet Rate Estimation
A major quality concern in droplet-based scRNA-seq is the formation of doublets (or multiplets), where two or more cells are captured in the same droplet and assigned the same cell barcode. Doublets appear as artificial hybrid transcriptomes and can confound clustering and cell type identification.
The doublet rate follows Poisson loading statistics. If \(\lambda\) is the mean number of cells per droplet: $$P(\text{doublet}) = \frac{P(k \geq 2 | k \geq 1)}{P(k \geq 1)} = \frac{1 - P(0) - P(1)}{1 - P(0)}$$ For 10x Genomics: $$\text{Doublet rate} \approx 0.8\% \times \frac{N_{cells}}{1000}$$ At typical loading of 10,000 cells, the expected doublet rate is approximately 8%. Computational tools such as Scrublet and DoubletFinder use simulated doublets to identify and remove putative doublets from the dataset.
3. scRNA-seq Analysis Pipeline
The standard scRNA-seq analysis workflow begins with raw sequencing reads and proceeds through preprocessing, quality filtering, normalization, feature selection, dimensionality reduction, clustering, and biological interpretation. Two major software ecosystems dominate the field: Seurat (R/Bioconductor) and Scanpy (Python/AnnData). Both implement similar algorithmic pipelines but differ in their data structures and syntax.
Preprocessing: Cell Ranger
For 10x Genomics data, Cell Ranger performs demultiplexing, barcode processing, alignment (using STAR), and UMI counting to produce a cell-by-gene count matrix. It distinguishes cell-containing droplets from empty droplets using the EmptyDrops algorithm, which tests whether the RNA profile of each barcode differs significantly from the ambient RNA background (the set of barcodes with very low total UMI counts).
Quality Filtering
Quality control removes low-quality cells and potential artifacts. Three primary metrics are used:
| QC Metric | Low-Quality Signature | Typical Threshold | Biological Interpretation |
|---|---|---|---|
| nFeature (genes detected) | Very low or very high | 200-5000 genes | Low: empty droplet or dying cell; High: doublet |
| nCount (total UMIs) | Very low or very high | 500-25,000 UMIs | Low: poor capture; High: doublet |
| percent.mt (mitochondrial %) | High (>10-20%) | <10-20% | High: cell lysis (cytoplasmic mRNA lost, mito mRNA retained) |
Normalization & Batch Correction
scRNA-seq count data are normalized to remove cell-to-cell differences in sequencing depth. Seurat uses log-normalization: divide by total counts per cell, multiply by a scale factor (default 10,000), and log-transform. scran uses pool-based size factors computed by deconvolution of summed counts from cell pools, which is more robust for zero-inflated data. SCTransform(implemented in Seurat v3+) uses regularized negative binomial regression to directly model the relationship between mean expression and variance, performing normalization, variance stabilization, and feature selection in a single step.
When integrating data from multiple batches, experiments, or technologies, batch correction methods such as Harmony, Scanorama, BBKNN, and Seurat CCA/RPCA integration align shared cell populations across batches while preserving biological differences. Harmony operates in PCA space, iteratively adjusting the principal component embeddings to remove batch effects while retaining biologically meaningful variation.
4. Dimensionality Reduction & Clustering
scRNA-seq datasets typically have thousands of cells and 2,000-5,000 highly variable genes. Direct analysis of this high-dimensional space is computationally expensive and statistically unreliable due to the curse of dimensionality. Dimensionality reduction and graph-based clustering are therefore central to scRNA-seq analysis.
PCA for Initial Reduction
Principal Component Analysis is typically the first dimensionality reduction step, reducing the data from ~2,000 highly variable genes to 10-50 principal components. The elbow plot (variance explained vs. PC number) or the JackStraw procedure (in Seurat) helps determine the optimal number of PCs to retain. These PCs serve as input for both non-linear visualization methods and graph-based clustering.
t-SNE (t-distributed Stochastic Neighbor Embedding)
t-SNE converts pairwise similarities between data points in high-dimensional space to probabilities, then seeks a low-dimensional (typically 2D) embedding that preserves these relationships. It excels at revealing local cluster structure but does not reliably preserve global relationships (i.e., distances between clusters are not meaningful). The perplexity parameter (typically 30-50) controls the effective number of neighbors and significantly affects the visualization.
UMAP (Uniform Manifold Approximation and Projection)
UMAP has largely replaced t-SNE as the preferred visualization method because it better preserves global structure, is faster, and provides a more consistent embedding across runs. UMAP is based on Riemannian geometry and algebraic topology, constructing a high-dimensional graph representation and then optimizing a low-dimensional embedding to match this topology.
UMAP minimizes the cross-entropy between the high-dimensional and low-dimensional fuzzy topological representations. The cost function is: $$C = \sum_{i \neq j} \left[ v_{ij} \log\left(\frac{v_{ij}}{w_{ij}}\right) + (1 - v_{ij}) \log\left(\frac{1 - v_{ij}}{1 - w_{ij}}\right) \right]$$ where \(v_{ij}\) represents the membership strength of the edge between points \(i\) and \(j\) in the high-dimensional fuzzy simplicial set: $$v_{ij} = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)$$ with \(\rho_i\) being the distance to the nearest neighbor and \(\sigma_i\) chosen such that each point has approximately \(k\) effective neighbors (controlled by the n_neighbors parameter). In the low-dimensional space: $$w_{ij} = \left(1 + a \|y_i - y_j\|^{2b}\right)^{-1}$$ Key parameters: n_neighbors (typically 15-30) controls local vs. global structure balance; min_dist (typically 0.1-0.5) controls how tightly points are packed.
Graph-Based Clustering: Leiden & Louvain
Both algorithms operate on a k-nearest neighbor (KNN) graph constructed in PC space, where cells are nodes and edges connect cells with similar expression profiles (weighted by Jaccard similarity in the shared nearest neighbor variant). The Louvain algorithmpartitions the graph by iteratively optimizing modularity, a measure of how densely connected nodes within communities are compared to a random graph. The Leiden algorithm (an improvement over Louvain) guarantees well-connected communities by introducing a refinement phase that avoids arbitrarily poorly connected clusters. The resolution parameter controls cluster granularity: higher values yield more clusters.
Cell Type Annotation Strategies
5. Trajectory Analysis & RNA Velocity
Biological processes such as differentiation, activation, and cell cycle progression are continuous rather than discrete. Trajectory analysis (pseudotime inference) orders cells along these continuous processes based on their transcriptomic similarity, reconstructing the temporal dynamics of gene expression without requiring time-series experiments.
Pseudotime Methods
Monocle 3 constructs a principal graph (tree or graph structure) through the UMAP embedding using the SimplePPT algorithm and assigns each cell a pseudotime value based on its geodesic distance from a specified root cell. Slingshot fits principal curves through cluster centroids in reduced-dimensional space, handling bifurcating trajectories by identifying lineage-specific curves. PAGA (Partition-based Graph Abstraction, in Scanpy) provides a coarse-grained map of connectivity between cell clusters, which can guide more detailed trajectory inference.
RNA Velocity
RNA velocity, introduced by La Manno et al. (2018), infers the future transcriptional state of each cell by exploiting the ratio of unspliced (nascent) to spliced (mature) mRNA molecules. Standard scRNA-seq captures both spliced and unspliced reads, distinguished by the presence of intronic sequences. The key insight is that for a gene being upregulated, unspliced mRNA accumulates before the corresponding increase in spliced mRNA, and vice versa for downregulation.
The RNA velocity model describes the temporal dynamics of spliced (\(s\)) and unspliced (\(u\)) mRNA for each gene: $$\frac{du}{dt} = \alpha(t) - \beta u$$ $$\frac{ds}{dt} = \beta u - \gamma s$$ where \(\alpha(t)\) is the transcription rate, \(\beta\) is the splicing rate, and \(\gamma\) is the degradation rate. At steady state (\(du/dt = 0, ds/dt = 0\)): $$s_{steady} = \frac{\beta}{\gamma} u_{steady} = \frac{\alpha}{\gamma}$$ The RNA velocity for each gene in each cell is estimated as the deviation from this steady-state relationship: $$v_s = \frac{ds}{dt} = \beta u - \gamma s$$ Positive velocity indicates the gene is being upregulated; negative velocity indicates downregulation. The velocity vector for each cell is projected onto the UMAP or t-SNE embedding to visualize the predicted direction of transcriptional change, producing the characteristic arrow plots.
Shannon Entropy for Cell State Diversity
Shannon entropy can quantify the transcriptomic diversity within a cell population or the differentiation potential of individual cells. Higher entropy indicates a more uniform distribution of gene expression across many genes (characteristic of progenitor/stem cells), while lower entropy indicates specialization (characteristic of differentiated cells).
For a cell \(c\) with expression profile normalized to proportions \(p_i = x_i / \sum_j x_j\): $$H(c) = -\sum_{i=1}^{G} p_i \log_2(p_i)$$ where \(G\) is the number of genes. The maximum entropy \(H_{max} = \log_2(G)\) occurs when all genes are equally expressed. The relative entropy (or specificity) is: $$S(c) = 1 - \frac{H(c)}{H_{max}}$$ Similarly, mutual information between gene expression and cell type can identify genes most informative for distinguishing cell types: $$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log_2 \frac{p(x,y)}{p(x)p(y)}$$
6. Spatial Transcriptomics
While scRNA-seq provides single-cell resolution, it requires tissue dissociation, which destroys spatial context. Spatial transcriptomics technologies measure gene expression while preserving the original tissue architecture, enabling the study of cell-cell communication, tissue organization, and spatial gene expression gradients. The field has expanded rapidly, with Nature Methods naming spatially resolved transcriptomics the 2020 Method of the Year.
Spatial Transcriptomics Technologies
10x Visium
Sequencing-based approach where tissue sections are placed on slides containing arrays of barcoded oligonucleotides at ~55 micrometer spot diameter (capturing 1-10 cells per spot). mRNA is captured, reverse-transcribed with spatial barcodes, and sequenced. Provides whole-transcriptome data with near-cellular resolution. Visium HD achieves 2-micrometer resolution through smaller bin sizes.
MERFISH (Multiplexed Error-Robust FISH)
Imaging-based approach using combinatorial FISH with error-robust binary barcoding. Each RNA species is assigned a unique binary barcode across multiple rounds of hybridization and imaging. Achieves subcellular resolution for hundreds to thousands of genes simultaneously. Developed by Xiaowei Zhuang's group and commercialized by Vizgen (MERSCOPE platform).
seqFISH / seqFISH+
Sequential fluorescence in situ hybridization uses multiple rounds of probe hybridization, imaging, and probe removal to assign barcodes to individual RNA molecules. seqFISH+ can profile over 10,000 genes at single-molecule resolution in individual cells within intact tissue sections. Similar in principle to MERFISH but uses a different barcoding strategy.
Slide-seq / Slide-seqV2
Uses a dense array of DNA-barcoded beads (10 micrometer diameter) on a rubber-coated glass surface. Tissue sections are transferred onto the bead array, and mRNA is captured and sequenced with spatial barcode information. Achieves near-cellular resolution across the entire transcriptome.
Deconvolution & Integration
For sequencing-based spatial methods where each spot contains multiple cells, deconvolution methods estimate the cell-type composition of each spot by integrating spatial data with a reference scRNA-seq dataset. Tools such as cell2location, RCTD (Robust Cell Type Decomposition), SPOTlight, and Tangram use different statistical frameworks (Bayesian models, NMF, optimal transport) to map single-cell reference profiles onto spatial spots. These methods effectively achieve virtual single-cell resolution from multi-cell spatial measurements.
Multi-Omics Integration
Emerging technologies combine single-cell transcriptomics with other modalities: CITE-seq simultaneously measures mRNA and surface protein (ADTs) via oligonucleotide-conjugated antibodies; SHARE-seqand 10x Multiome jointly profile chromatin accessibility (ATAC-seq) and gene expression; scBS-seq combines single-cell bisulfite sequencing with transcriptomics. Computational integration methods like WNN (Weighted Nearest Neighbor) in Seurat v4 and MOFA+ identify shared and modality-specific sources of variation, providing a comprehensive view of cellular identity and regulation.
7. Key Concepts & Chapter Summary
Comparing scRNA-seq Approaches
| Property | Droplet-Based | Plate-Based | Spatial |
|---|---|---|---|
| Cell throughput | High (10^3-10^6) | Low (10^2-10^3) | Medium-High |
| Sensitivity (genes/cell) | 1,500-3,000 | 4,000-8,000 | Variable (100-10,000) |
| Transcript coverage | 3' or 5' end | Full-length | Targeted or whole |
| Spatial information | Lost | Lost | Preserved |
| Cost per cell | $0.01-0.10 | $1-10 | Variable |
Chapter Summary
- โSingle-cell RNA-seq resolves transcriptomic heterogeneity invisible to bulk methods, revealing rare cell types, continuous transitions, and cell-to-cell variability.
- โDroplet-based methods (10x Chromium) offer high throughput with 3'/5' end coverage, while plate-based methods (Smart-seq2) provide full-length transcripts at lower throughput.
- โUMIs eliminate PCR amplification bias by enabling digital counting of original mRNA molecules; doublet detection is essential for data quality.
- โThe standard analysis pipeline encompasses QC filtering, normalization, PCA, UMAP visualization, graph-based clustering (Leiden/Louvain), and cell type annotation.
- โRNA velocity leverages the ratio of unspliced to spliced mRNA to predict future cell states, adding temporal directionality to snapshot data.
- โSpatial transcriptomics (Visium, MERFISH, seqFISH) preserves tissue context; deconvolution methods integrate scRNA-seq references to achieve virtual single-cell spatial resolution.
- โMulti-omics integration (CITE-seq, Multiome, WNN) combines transcriptomics with protein, chromatin, and epigenomic measurements for comprehensive cellular profiling.