Microarray Technology
Hybridization-based transcriptome profiling: principles, experimental design, and statistical analysis
1. DNA Microarray Principles
DNA microarrays exploit nucleic acid hybridization to measure the expression levels of thousands of genes simultaneously. The technology relies on complementary base pairing between DNA probes immobilized on a solid surface (the array) and fluorescently labeled target nucleic acids derived from the biological sample of interest. Within the linear range of the assay, the fluorescence intensity at each probe location is approximately proportional to the abundance of the corresponding transcript in the sample.
Major Microarray Platforms
Spotted cDNA Arrays
PCR-amplified cDNA clones (typically 200-2000 bp) are mechanically deposited (spotted) onto glass slides using robotic pin-based or inkjet printing systems. Each spot carries the probe for a single gene. These arrays can be fabricated in academic laboratories and offer flexibility in probe design. They typically employ two-color (competitive) hybridization, where the experimental and reference samples are labeled with different fluorescent dyes (usually Cy5/red and Cy3/green).
Affymetrix GeneChip (Oligonucleotide Arrays)
Short oligonucleotides (25-mers) are synthesized directly on the chip surface by photolithography, using a series of masks to direct light-dependent deprotection at each synthesis step. Each gene is represented by a probe set consisting of 11-20 perfect match (PM) probes and corresponding mismatch (MM) probes, in which the central (13th) nucleotide is substituted. PM-MM signal differences are used to estimate specific hybridization. These are single-color arrays: only one sample is hybridized per chip.
Hybridization Thermodynamics
The specificity and efficiency of microarray hybridization are governed by the thermodynamics of nucleic acid duplex formation. The hybridization reaction between a target molecule (T) and an immobilized probe (P) follows the equilibrium:
$$P + T \rightleftharpoons PT$$ The equilibrium dissociation constant is: $$K_d = \frac{[P][T]}{[PT]}$$ The Gibbs free energy of hybridization determines the stability of the duplex: $$\Delta G^\circ = \Delta H^\circ - T \Delta S^\circ = -RT \ln K_a$$ where \(\Delta H^\circ\) is the enthalpy change (primarily due to hydrogen bonding and base stacking), \(\Delta S^\circ\) is the entropy change (loss of translational and rotational freedom), \(T\) is the absolute temperature, and \(K_a = 1/K_d\) is the association constant.
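As a rough numerical illustration of the relations above, the following Python sketch converts assumed values of \(\Delta H^\circ\) and \(\Delta S^\circ\) into \(\Delta G^\circ\), \(K_a\), and \(K_d\). The thermodynamic values and the 45°C hybridization temperature are hypothetical placeholders chosen only to give a plausible order of magnitude; they are not measurements for any real probe.

```python
import math

R_CAL = 1.987  # universal gas constant in cal/(mol*K)

def gibbs_free_energy(delta_h, delta_s, temp_k):
    """Delta G (cal/mol) from Delta H (cal/mol) and Delta S (cal/(mol*K))."""
    return delta_h - temp_k * delta_s

def association_constant(delta_g, temp_k):
    """Solve Delta G = -R*T*ln(K_a) for K_a."""
    return math.exp(-delta_g / (R_CAL * temp_k))

# Hypothetical duplex parameters (assumed for illustration only).
delta_h = -180_000.0   # cal/mol
delta_s = -500.0       # cal/(mol*K)
temp_k = 318.15        # 45 degrees C in kelvin

delta_g = gibbs_free_energy(delta_h, delta_s, temp_k)
k_a = association_constant(delta_g, temp_k)
k_d = 1.0 / k_a        # dissociation constant is the reciprocal of K_a

print(f"Delta G = {delta_g:.0f} cal/mol, K_a = {k_a:.3g}, K_d = {k_d:.3g}")
```

A more negative \(\Delta G^\circ\) gives a larger \(K_a\) (smaller \(K_d\)), meaning a more stable probe-target duplex at that temperature.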
The melting temperature (Tm) at which 50% of the duplexes are dissociated can be estimated using the nearest-neighbor thermodynamic model. For oligonucleotide probes under standard conditions:
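$$T_m = \frac{\Delta H^\circ}{\Delta S^\circ + R \ln(C_T / 4)} - 273.15$$ where \(T_m\) is in °C, \(\Delta H^\circ\) is in cal/mol, \(\Delta S^\circ\) is in cal/(mol·K), \(C_T\) is the total strand concentration (the factor of 4 applies to non-self-complementary duplexes), and \(R\) is the universal gas constant (1.987 cal/(mol·K)). A common practical adjustment accounts for salt concentration: $$T_m^{adj} = T_m + 16.6 \log_{10}[\text{Na}^+]$$ with \([\text{Na}^+]\) in mol/L, so the correction vanishes at 1 M. Probe design aims for uniform \(T_m\) values across the array (typically 60-70°C) to ensure consistent hybridization stringency.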
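The two melting-temperature formulas can be evaluated directly. The sketch below is a minimal Python version under stated assumptions: the \(\Delta H^\circ\), \(\Delta S^\circ\), and strand concentration values are hypothetical, and the function names are introduced here for illustration only.

```python
import math

R_CAL = 1.987  # universal gas constant in cal/(mol*K)

def melting_temp_c(delta_h, delta_s, strand_conc_molar):
    """T_m in degrees C for a non-self-complementary duplex.

    delta_h in cal/mol, delta_s in cal/(mol*K), concentration in mol/L.
    """
    tm_kelvin = delta_h / (delta_s + R_CAL * math.log(strand_conc_molar / 4.0))
    return tm_kelvin - 273.15

def salt_adjusted_tm(tm_c, sodium_molar):
    """Apply the 16.6*log10([Na+]) correction; zero at 1 M Na+."""
    return tm_c + 16.6 * math.log10(sodium_molar)

# Hypothetical probe parameters (assumed for illustration only).
tm = melting_temp_c(delta_h=-180_000.0, delta_s=-500.0, strand_conc_molar=1e-6)
tm_low_salt = salt_adjusted_tm(tm, sodium_molar=0.1)

print(f"T_m = {tm:.1f} C, adjusted for 0.1 M Na+ = {tm_low_salt:.1f} C")
```

Lowering the sodium concentration tenfold reduces the estimated \(T_m\) by 16.6°C, which is why wash buffers with less salt are more stringent.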
2. Experimental Design: Two-Color vs. Single-Color
The choice between two-color and single-color microarray platforms has profound implications for experimental design, statistical analysis, and the types of biological questions that can be addressed.
Platform Comparison
| Feature | Two-Color (cDNA) | Single-Color (Affymetrix) |
|---|---|---|
| Samples per array | 2 (competitive hybridization) | 1 |
| Dye labels | Cy3 (green, 532 nm) & Cy5 (red, 635 nm) | Biotin-streptavidin-PE |
| Primary measurement | Log-ratio (M = log2(R/G)) | Absolute intensity |
| Dye bias | Yes (requires dye-swap) | Not applicable |
| Probe length | 200-2000 bp | 25 nt |
| Reproducibility | Moderate | High |
Image Analysis & Spot Quantification
After hybridization and washing, the array is scanned using a confocal laser scanner that excites fluorophores at their respective wavelengths. For two-color arrays, the scanner produces two images (one for each channel). Image analysis software (e.g., GenePix, ScanArray) identifies spot boundaries through segmentation algorithms, quantifies the mean or median fluorescence intensity within each spot, and estimates local background intensity from surrounding pixels. The raw signal for each spot is then calculated as the foreground intensity minus the background intensity.
Background Correction Methods
Background correction is critical because non-specific hybridization, autofluorescence of the slide surface, and optical noise contribute to the measured signal. Common approaches include: (1) local background subtraction, where the median intensity of pixels surrounding each spot is subtracted from the spot signal; (2) morphological opening, which estimates background using image processing techniques; and (3) model-based correction such as the RMA (Robust Multi-array Average) convolution model, which assumes the observed signal is a sum of exponentially distributed true signal and normally distributed background noise.
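Approach (1), local background subtraction, is simple enough to sketch in a few lines of Python. The pixel values below are hypothetical, and the function name is introduced here for illustration; real image analysis software works on segmented image regions rather than plain lists.

```python
from statistics import median

def background_corrected_signal(spot_pixels, surrounding_pixels):
    """Median foreground intensity minus median local background for one spot."""
    return median(spot_pixels) - median(surrounding_pixels)

# Hypothetical pixel intensities; 900 mimics a dust speck inside the spot.
spot = [520, 540, 510, 530, 900]
surround = [100, 110, 90, 105, 95]

signal = background_corrected_signal(spot, surround)
print(signal)
```

Using the median rather than the mean keeps a single aberrant pixel from dominating the estimate. For weak spots the difference can be zero or negative, which cannot be log-transformed; this is one motivation for the model-based corrections.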
3. Normalization Methods
Normalization is essential for removing systematic technical variation that could confound biological signal. The goal is to make expression measurements comparable across arrays while preserving true biological differences. Different normalization strategies are appropriate for different array platforms and experimental designs.
LOESS (LOcally Estimated Scatterplot Smoothing)
LOESS normalization is the standard method for two-color arrays. It corrects intensity-dependent dye bias by fitting a local regression curve to the MA plot, where M = log2(R/G) is plotted against A = (1/2)log2(R * G). The assumption is that most genes are not differentially expressed, so any departure of the fitted curve from M = 0 is attributed to technical bias and subtracted, leaving the normalized log-ratios centered on M = 0 across the intensity range.
For each probe \(i\) with log-ratio \(M_i\) and average log-intensity \(A_i\): $$M_i^{norm} = M_i - \hat{c}(A_i)$$ where \(\hat{c}(A_i)\) is the LOESS fit evaluated at intensity \(A_i\). The LOESS estimate at a point \(A_0\) is obtained from a local linear fit by weighted least squares: $$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} w_i(A_0) \left(M_i - \beta_0 - \beta_1(A_i - A_0)\right)^2, \qquad \hat{c}(A_0) = \hat{\beta}_0$$ where the weights \(w_i(A_0)\) are determined by the tricube kernel function, giving higher weight to probes with intensity values closer to \(A_0\).
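The local fit can be sketched in plain Python as follows. This is a simplified illustration, not the routine used by any particular package: it assumes a nearest-neighbor window set by a `span` fraction (0.4 here is an arbitrary choice), a local linear fit, and no robustness iterations. All function names are introduced here for illustration.

```python
import math

def ma_values(red, green):
    """Convert background-corrected red/green intensities to (M, A) lists."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_at(a0, a, m, span=0.4):
    """Weighted local linear fit around a0; returns the intercept c_hat(a0)."""
    k = min(len(a), max(3, math.ceil(span * len(a))))   # window size
    bandwidth = sorted(abs(ai - a0) for ai in a)[k - 1] or 1e-12
    sw = sx = sy = sxx = sxy = 0.0
    for ai, mi in zip(a, m):
        w = tricube((ai - a0) / bandwidth)
        x = ai - a0
        sw += w
        sx += w * x
        sy += w * mi
        sxx += w * x * x
        sxy += w * x * mi
    denom = sw * sxx - sx * sx
    if abs(denom) < 1e-12:          # degenerate window: fall back to weighted mean
        return sy / sw
    beta1 = (sw * sxy - sx * sy) / denom
    return (sy - beta1 * sx) / sw   # beta0, the fitted value at a0

def loess_normalize(m, a, span=0.4):
    """Subtract the fitted curve from each log-ratio."""
    return [mi - loess_at(ai, a, m, span) for mi, ai in zip(m, a)]
```

For example, if the dye bias were exactly linear in A, `loess_normalize` would return log-ratios that are numerically zero, since a local linear fit reproduces a straight line. The fit is evaluated once per probe, so this sketch scales quadratically and is suitable only for small illustrations.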
Quantile Normalization
Quantile normalization forces the distribution of probe intensities to be identical across all arrays. This is the standard normalization method used in the RMA (Robust Multi-array Average) preprocessing pipeline for Affymetrix arrays. The algorithm works as follows: (1) sort the intensity values on each array, (2) compute the mean intensity at each rank across arrays, (3) replace each intensity value with the mean for its rank, and (4) reorder to the original probe order.
Algorithm: Quantile Normalization
- Given a matrix X with n probes (rows) and m arrays (columns).
- Sort each column independently to obtain sorted matrix X_sort.
- Compute row means of X_sort: each row mean becomes the target quantile.
- Replace each entry in X_sort with the corresponding row mean.
- Unsort each column to restore original probe order, yielding normalized matrix X_norm.
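The steps above can be sketched in plain Python, with each array stored as a list of probe intensities. This is a minimal illustration: ties are broken by original position rather than averaged, which is a simplification assumed here, and the function name is introduced for illustration.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (each a list of n probe intensities).

    Returns new lists in the original probe order.
    """
    n = len(arrays[0])
    m = len(arrays)
    # orders[j][r] is the probe index holding rank r on array j (sort step).
    orders = [sorted(range(n), key=col.__getitem__) for col in arrays]
    # Target quantile for rank r: mean of the r-th smallest value across arrays.
    targets = [
        sum(arrays[j][orders[j][r]] for j in range(m)) / m
        for r in range(n)
    ]
    # Write each target back to the probe that held that rank (unsort step).
    normalized = [[0.0] * n for _ in range(m)]
    for j in range(m):
        for rank, probe in enumerate(orders[j]):
            normalized[j][probe] = targets[rank]
    return normalized

# Two hypothetical arrays with three probes each.
result = quantile_normalize([[2.0, 4.0, 6.0], [10.0, 30.0, 20.0]])
print(result)
```

After normalization every array contains the same set of values; only the assignment of values to probes differs between arrays.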
MAS5 (Affymetrix Statistical Algorithm)
The MAS5 algorithm, developed by Affymetrix, uses a robust average (Tukey biweight) of PM-MM differences within each probe set to estimate expression. It performs per-chip normalization by scaling to a target intensity value (typically 500). MAS5 also provides presence/absence calls (Present, Marginal, Absent) based on a Wilcoxon signed-rank test comparing PM and MM intensities within each probe set. While largely superseded by RMA and GCRMA for expression estimation, MAS5 detection calls remain useful for filtering unexpressed genes.
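The robust average at the core of MAS5 can be illustrated with a one-step Tukey biweight. The sketch below is a minimal Python version under stated assumptions: the tuning constant `c = 5` and the small `epsilon` guarding against division by zero follow the commonly cited description of the algorithm but should be treated as assumptions here, and the input values are hypothetical stand-ins for the log-scale probe-level values of one probe set.

```python
from statistics import median

def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight location estimate.

    Values far from the median (relative to the median absolute
    deviation) receive weight 0, so outliers do not affect the result.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    weighted_sum = 0.0
    weight_total = 0.0
    for v in values:
        u = (v - center) / (c * mad + epsilon)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        weighted_sum += w * v
        weight_total += w
    return weighted_sum / weight_total

# Hypothetical probe-level values with one gross outlier.
robust = tukey_biweight([1.0, 2.0, 3.0, 4.0, 100.0])
plain_mean = sum([1.0, 2.0, 3.0, 4.0, 100.0]) / 5
print(robust, plain_mean)
```

The outlier pulls the arithmetic mean to 22, while the biweight estimate stays close to the bulk of the data.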
4. Differential Expression Analysis
The primary goal of most microarray experiments is to identify genes whose expression levels differ significantly between experimental conditions. This requires both an effect size measure (fold change) and a statistical test of significance.
The t-Statistic and Moderated t-Test
The standard two-sample t-test compares mean expression levels between two groups. However, with the small sample sizes typical of microarray experiments (often n = 3-5 per group), the gene-wise variance estimates are unreliable. The empirical Bayes moderated t-test, implemented in the limma R package, addresses this by borrowing information across genes to stabilize variance estimates.
The ordinary t-statistic for gene \(g\) comparing two groups: $$t_g = \frac{\bar{x}_{g1} - \bar{x}_{g2}}{s_g \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$ where \(s_g\) is the pooled standard deviation for gene \(g\). In the limma framework, the moderated t-statistic replaces \(s_g^2\) with the posterior variance: $$\tilde{t}_g = \frac{\bar{x}_{g1} - \bar{x}_{g2}}{\tilde{s}_g \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$ where: $$\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g}$$ Here \(s_0^2\) and \(d_0\) are the prior variance and prior degrees of freedom estimated from the distribution of all gene-wise variances, and \(d_g\) is the residual degrees of freedom for gene \(g\) (\(n_1 + n_2 - 2\) in the two-group case). Under the null hypothesis, \(\tilde{t}_g\) follows a t-distribution with \(d_0 + d_g\) degrees of freedom, so the shrinkage also increases the effective degrees of freedom.
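The effect of variance shrinkage can be seen in a small numerical sketch. The Python code below is not limma and does not estimate the prior: \(d_0\) and \(s_0^2\) are passed in as hypothetical, assumed values, whereas limma estimates them from all genes. The expression values and function names are likewise illustrative.

```python
import math
from statistics import mean, variance

def pooled_variance(x1, x2):
    """Pooled sample variance s_g^2 for a two-group comparison."""
    n1, n2 = len(x1), len(x2)
    return ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)

def ordinary_t(x1, x2):
    """Classical two-sample t-statistic with pooled variance."""
    se = math.sqrt(pooled_variance(x1, x2) * (1 / len(x1) + 1 / len(x2)))
    return (mean(x1) - mean(x2)) / se

def moderated_t(x1, x2, d0, s0_sq):
    """t-statistic using the posterior variance (d0*s0^2 + dg*sg^2)/(d0 + dg)."""
    d_g = len(x1) + len(x2) - 2
    s_tilde_sq = (d0 * s0_sq + d_g * pooled_variance(x1, x2)) / (d0 + d_g)
    se = math.sqrt(s_tilde_sq * (1 / len(x1) + 1 / len(x2)))
    return (mean(x1) - mean(x2)) / se

# Hypothetical log2 expression values for one gene, n = 3 per group.
group1 = [8.0, 8.2, 8.1]
group2 = [7.0, 7.1, 6.9]

t_ord = ordinary_t(group1, group2)
t_mod = moderated_t(group1, group2, d0=4.0, s0_sq=0.04)  # assumed prior
print(f"ordinary t = {t_ord:.2f}, moderated t = {t_mod:.2f}")
```

Here the gene's own variance (0.01) happens to be smaller than the assumed prior variance (0.04), so the posterior variance is pulled upward and the moderated statistic is less extreme than the ordinary one. This is how shrinkage guards against genes that look significant only because their variance was underestimated by chance.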
Multiple Testing Correction
When testing thousands of genes simultaneously, the expected number of false positives grows with the number of tests. If 20,000 genes are tested at a significance threshold of p < 0.05 and none is truly differentially expressed, we would expect about 1,000 false positives (20,000 × 0.05) by chance alone. Multiple testing correction methods control the error rate at an acceptable level.
Multiple Testing Correction Methods
| Method | Controls | Formula | Stringency |
|---|---|---|---|
| Bonferroni | FWER | p_adj = min(m * p_i, 1) | Very conservative |
| Holm | FWER | Step-down Bonferroni | Conservative |
| Benjamini-Hochberg | FDR | q_(i) = min(1, min over j >= i of m * p_(j) / j) | Moderate |
The Benjamini-Hochberg (BH) procedure controls the False Discovery Rate (FDR): $$\text{FDR} = E\left[\frac{V}{R}\right]$$ where \(V\) is the number of false positives and \(R\) is the total number of rejections (with \(V/R\) taken as 0 when \(R = 0\)). The procedure: (1) rank p-values in ascending order: \(p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}\); (2) find the largest \(k\) such that: $$p_{(k)} \leq \frac{k}{m} \cdot \alpha$$ (3) reject all hypotheses \(H_{(1)}, \dots, H_{(k)}\). The BH-adjusted p-value (often reported as a q-value) for the \(i\)-th ranked test is: $$q_{(i)} = \min\left(1, \min_{j \geq i} \frac{m \cdot p_{(j)}}{j}\right)$$ Rejecting every hypothesis with \(q_{(i)} \leq \alpha\) is equivalent to steps (1)-(3).
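The Bonferroni and BH adjustments can be sketched in a few lines of Python. This is a minimal illustration with hypothetical p-values; the function names are introduced here, and the BH adjustment is computed by walking from the largest p-value down while tracking the running minimum, which enforces the \(\min_{j \geq i}\) term.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: min(m * p, 1)."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]

def benjamini_hochberg(pvalues):
    """BH-adjusted p-values, returned in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=pvalues.__getitem__)  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0                                  # also caps values at 1
    for rank in range(m, 0, -1):                       # largest p-value first
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

With these inputs, all four genes pass a 0.05 threshold after BH adjustment, whereas only two pass after Bonferroni, illustrating the difference in stringency between FDR and FWER control.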
5. Visualization & Gene Clustering
Volcano Plots
Volcano plots are the standard visualization for differential expression results. The x-axis shows the log2(fold change) and the y-axis shows the -log10(adjusted p-value). Genes that are both statistically significant (above the horizontal threshold) and biologically meaningful (beyond the vertical fold change thresholds) appear in the upper corners of the plot. This dual filtering approach ensures that identified genes exhibit both statistical confidence and a biologically relevant effect size. Typical thresholds are |log2FC| > 1 (2-fold change) and FDR < 0.05.
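The dual filter can be expressed as a short classification rule. The Python sketch below uses the typical thresholds quoted above as defaults; the gene names, values, and function names are hypothetical and serve only to illustrate the logic behind the plot.

```python
import math

def volcano_coordinates(log2_fc, adj_p):
    """Return the (x, y) position of a gene on the volcano plot."""
    return log2_fc, -math.log10(adj_p)

def classify_gene(log2_fc, adj_p, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Label a gene using both the fold-change and the FDR threshold."""
    if adj_p < fdr_cutoff and log2_fc > fc_cutoff:
        return "up"
    if adj_p < fdr_cutoff and log2_fc < -fc_cutoff:
        return "down"
    return "not significant"

# Hypothetical results: (gene, log2 fold change, adjusted p-value).
results = [
    ("geneA", 2.3, 0.001),    # large effect, significant
    ("geneB", -1.8, 0.02),    # large negative effect, significant
    ("geneC", 0.4, 0.0001),   # significant but small effect
    ("geneD", 3.0, 0.30),     # large effect but not significant
]
labels = {name: classify_gene(fc, p) for name, fc, p in results}
print(labels)
```

Genes C and D show why both criteria are needed: one is statistically convincing but biologically small, the other is large but too noisy to trust.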
MA Plots
MA plots display the log-ratio M = log2(R/G) versus the average log-intensity A = (1/2)log2(R * G) for two-color arrays, or equivalent measures for single-color arrays. They are particularly useful for assessing the success of normalization: after proper normalization, the cloud of points should be centered on M = 0 across all intensity levels, with no systematic intensity-dependent bias.
Gene Clustering Methods
Clustering algorithms group genes with similar expression patterns across experimental conditions, enabling the identification of co-regulated gene modules and the discovery of functional relationships.
Clustering Approaches
Hierarchical Clustering
Agglomerative (bottom-up) hierarchical clustering begins with each gene as a singleton cluster and iteratively merges the two most similar clusters until a single cluster remains. The result is a dendrogram that can be cut at different heights to define clusters. Distance metrics include Euclidean distance, Pearson correlation distance (1 - r), and Spearman rank correlation distance. Linkage methods include single linkage (minimum distance), complete linkage (maximum distance), average linkage (UPGMA), and Ward's method (minimum variance). The choice of distance metric and linkage method significantly affects clustering results.
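A naive version of the agglomerative procedure fits in a short Python sketch. It assumes Pearson correlation distance with average linkage, and it stops merging once a requested number of clusters remains, which is equivalent to cutting the dendrogram at that level. The expression profiles and function names are hypothetical; a constant profile would cause a division by zero and is not handled.

```python
import math
from statistics import mean

def pearson_distance(x, y):
    """Correlation distance 1 - r between two expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between members of two clusters."""
    return mean(
        pearson_distance(profiles[i], profiles[j]) for i in c1 for j in c2
    )

def agglomerate(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]   # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]      # merge b into a
        del clusters[b]
    return clusters

# Hypothetical profiles: genes 0 and 1 rise over time, genes 2 and 3 fall.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.0, 4.0, 2.2],
]
print(agglomerate(profiles, n_clusters=2))
```

Because correlation distance ignores magnitude, genes 0 and 1 cluster together despite a twofold difference in scale; with Euclidean distance the grouping could differ, which illustrates why the choice of metric matters.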
K-Means Clustering
K-means partitions genes into k pre-specified clusters by minimizing the within-cluster sum of squares. The algorithm iterates between assigning genes to the nearest centroid and recomputing centroids until convergence. Advantages include computational efficiency for large datasets and well-defined cluster assignments. Disadvantages include sensitivity to initialization (addressed by multiple random starts or k-means++), requirement to pre-specify k (assessable via the elbow method, silhouette analysis, or gap statistic), and assumption of spherical clusters.
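The assign-and-update iteration can be sketched in plain Python. This minimal version assumes squared Euclidean distance and initializes centroids by sampling k data points at random with a fixed seed; it does not implement k-means++ or multiple restarts. The data points and function names are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100, seed=0):
    """Return (assignments, centroids) after iterating to convergence."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    assignments = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:                # converged
            break
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                                   # keep old centroid if empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical two-dimensional profiles forming two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Cluster labels are arbitrary (which group is called 0 depends on the initialization), so results should be compared by membership rather than by label value.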
6. Data Standards & Limitations
MIAME Standard
The Minimum Information About a Microarray Experiment (MIAME) standard, developed by the Microarray Gene Expression Data (MGED) Society, specifies the minimum information required to unambiguously interpret and independently verify microarray results. MIAME requires documentation of six components: (1) experimental design, (2) array design, (3) samples used, (4) hybridization procedures, (5) measurements (images, quantification, specifications), and (6) normalization controls (types, values, specifications). Many journals require MIAME-compliant data submission to public repositories such as GEO (Gene Expression Omnibus) and ArrayExpress.
Microarray Limitations vs. RNA-Seq
| Limitation | Microarray | RNA-Seq |
|---|---|---|
| Dynamic range | ~\(10^3\) (signal saturation) | >\(10^5\) (count-based) |
| Novel transcript detection | No (requires pre-designed probes) | Yes (de novo assembly possible) |
| Isoform resolution | Limited (exon arrays help) | Full (with long reads) |
| Cross-hybridization | Significant issue | Multimapping (analogous) |
| Background noise | High (non-specific binding) | Low |
| Cost per sample (2024) | $150-300 | $200-500 |
Chapter Summary
- DNA microarrays exploit nucleic acid hybridization to measure thousands of gene expression levels simultaneously.
- Two-color arrays (spotted cDNA) measure relative expression via Cy3/Cy5 ratios, while single-color arrays (Affymetrix) measure absolute intensities.
- LOESS normalization corrects intensity-dependent dye bias in two-color arrays; quantile normalization ensures identical intensity distributions across Affymetrix arrays.
- The moderated t-test (limma) stabilizes variance estimates by borrowing information across genes, improving statistical power with small sample sizes.
- The Benjamini-Hochberg procedure controls the False Discovery Rate, which is the expected proportion of false positives among all rejected hypotheses.
- Despite being largely supplanted by RNA-Seq, microarrays remain relevant in clinical diagnostics and for studies with extensive historical data.