Machine Learning in Omics
From classical statistical learning to deep neural networks for biological discovery and clinical prediction
19.1 Supervised Learning for Omics
Supervised learning algorithms learn a mapping from input features (e.g., gene expression profiles) to output labels (e.g., disease status, drug response) using labeled training data. In omics, the defining challenge is the "large p, small n" problem: datasets typically contain thousands to millions of features ($p$) but only tens to hundreds of samples ($n$). This high dimensionality creates severe risks of overfitting and demands careful regularization, feature selection, and validation strategies.
Support Vector Machines (SVMs)
SVMs find the maximum-margin hyperplane that separates classes in feature space. For linearly separable data with labels $y_i \in \{-1, +1\}$ and features $\mathbf{x}_i \in \mathbb{R}^p$, the soft-margin SVM solves:
SVM Primal Objective
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Where $\mathbf{w}$ is the weight vector, $b$ is the bias, $\xi_i$ are slack variables allowing misclassification, and $C > 0$ controls the trade-off between margin width and training error. The kernel trick maps data into a higher-dimensional space via $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$, enabling non-linear decision boundaries. Common kernels include the radial basis function (RBF): $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$.
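As a concrete illustration, here is a minimal sketch of an RBF-kernel SVM evaluated with 5-fold cross-validation on a synthetic expression-like matrix; the data shape, $C$, and the use of `StandardScaler` are illustrative assumptions, not recommended settings.

```python
# Minimal sketch: RBF-kernel SVM on synthetic "expression-like" data.
# Shapes and hyperparameters are illustrative assumptions, not tuned values.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))             # n=100 samples, p=2000 "genes"
y = (X[:, :10].sum(axis=1) > 0).astype(int)  # signal confined to 10 features

# Standardizing matters for the RBF kernel, since gamma acts on squared
# Euclidean distances between samples.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```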
Random Forests
Random forests are ensemble methods that aggregate predictions from many decision trees, each trained on a bootstrap sample of the data with a random subset of features at each split. They handle high-dimensional data naturally, provide built-in feature importance measures, and are relatively robust to hyperparameter choices. For classification, the predicted class is the majority vote; for regression, it is the average prediction. Out-of-bag (OOB) error provides an unbiased estimate of generalization performance without a separate validation set.
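The sketch below shows the OOB estimate and built-in importance ranking in scikit-learn; the synthetic data and hyperparameter values are illustrative assumptions.

```python
# Minimal sketch: random forest with out-of-bag (OOB) error and feature
# importances; data and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(
    n_estimators=500,      # more trees stabilize importance estimates
    max_features="sqrt",   # random feature subset considered at each split
    oob_score=True,        # OOB accuracy: no separate validation set needed
    random_state=0,
).fit(X, y)

print(f"OOB accuracy: {rf.oob_score_:.2f}")
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top-ranked features:", top)   # should recover features 0 and 1
```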
Neural Networks
Feedforward neural networks (multilayer perceptrons) learn non-linear mappings through compositions of affine transformations and non-linear activation functions. A layer computes $\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$, where $\sigma$ is an activation function (ReLU, sigmoid, tanh). For multi-class classification with $K$ classes, the output layer uses the softmax function:
Softmax Function
$$\hat{p}_{ik} = \frac{\exp(z_{ik})}{\sum_{k'=1}^{K} \exp(z_{ik'})}$$
where $z_{ik}$ is the $k$-th output logit for sample $i$.
Training minimizes the cross-entropy loss:
$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$$
Where $y_{ik}$ is the one-hot encoded true label and $\hat{p}_{ik}$ is the predicted probability from softmax. Backpropagation computes gradients, and optimizers (Adam, SGD with momentum) update weights iteratively.
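A NumPy sketch of the two formulas above makes the computation concrete; the batch of logits and labels is an illustrative assumption.

```python
# Sketch of softmax + cross-entropy for a batch of logits (NumPy).
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, p_hat, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    return -np.mean(np.sum(y_onehot * np.log(p_hat + eps), axis=1))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.3]])   # n=2 samples, K=3 classes
y = np.array([[1, 0, 0],
              [0, 0, 1]])              # one-hot labels y_ik
print(f"loss: {cross_entropy(y, softmax(logits)):.3f}")
```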
Comparison of Supervised Methods for Omics
| Method | Strengths | Weaknesses | Best For |
|---|---|---|---|
| SVM | Effective in high-dim; kernel flexibility | Scaling $O(n^2)$; limited interpretability | Small-to-medium sample sizes |
| Random Forest | Feature importance; robust; handles mixed types | Correlated features share importance | Exploratory analysis; biomarker ranking |
| Neural Networks | Arbitrary non-linearities; representation learning | Data-hungry; black-box; overfitting risk | Large datasets; multi-modal inputs |
| Logistic Regression | Interpretable coefficients; probabilistic output | Linear decision boundary; needs regularization | Baseline model; clinical risk scores |
19.2 Unsupervised Learning & Dimensionality Reduction
Unsupervised methods discover structure in omics data without predefined labels. They are essential for exploratory analysis: identifying patient subtypes, revealing batch effects, detecting co-regulated gene modules, and visualizing high-dimensional data in two or three dimensions.
Principal Component Analysis (PCA)
PCA finds orthogonal directions of maximum variance by computing the eigendecomposition of the covariance matrix. Given centered data matrix $X \in \mathbb{R}^{n \times p}$:
PCA Eigenvalue Decomposition
$$\Sigma = \frac{1}{n-1} X^\top X, \qquad \Sigma = V \Lambda V^\top$$
Where $\Sigma$ is the $p \times p$ sample covariance matrix, $V = [\mathbf{v}_1, \dots, \mathbf{v}_p]$ contains the eigenvectors (principal components), and $\Lambda = \text{diag}(\lambda_1, \dots, \lambda_p)$ with $\lambda_1 \geq \lambda_2 \geq \dots$ are the eigenvalues (variance explained). The proportion of variance explained by the first $k$ components is $\sum_{i=1}^k \lambda_i / \sum_{i=1}^p \lambda_i$. In practice, singular value decomposition (SVD) of $X$ is preferred for numerical stability.
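The sketch below computes PCA via SVD, as the text recommends; the toy matrix dimensions are illustrative assumptions.

```python
# Sketch: PCA via SVD of the centered data matrix (NumPy).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))        # n=50 samples, p=1000 features
Xc = X - X.mean(axis=0)                # center each column first

# Singular values relate to covariance eigenvalues: lambda_i = s_i^2 / (n-1).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lam = s**2 / (X.shape[0] - 1)

scores = Xc @ Vt[:2].T                 # sample coordinates on PC1 and PC2
explained = lam[:2].sum() / lam.sum()  # proportion of variance explained
print(f"PC1+PC2 explain {explained:.1%} of total variance")
```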
t-SNE & UMAP
While PCA preserves global variance structure, non-linear methods are better at revealing local neighborhood relationships in complex datasets, particularly single-cell data.
t-SNE
t-distributed Stochastic Neighbor Embedding converts high-dimensional pairwise distances into conditional probabilities using Gaussian kernels in high-D and Student-t distributions in low-D. Minimizes KL divergence between the two distributions. The perplexity parameter (typically 5–50) balances local vs. global structure. Note: t-SNE does not preserve global distances—cluster separations and sizes are not meaningful.
UMAP
Uniform Manifold Approximation and Projection is grounded in Riemannian geometry and algebraic topology. Constructs a fuzzy simplicial complex from high-D data and optimizes a low-D representation to preserve its topological structure. Faster than t-SNE, better preservation of global structure, and supports embedding new data points. Default parameters: n_neighbors=15, min_dist=0.1.
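The usage sketch below contrasts the two APIs; it assumes the third-party umap-learn package is installed, and the input is taken to be a PCA-reduced matrix, a common preprocessing assumption for single-cell data.

```python
# Usage sketch: t-SNE (scikit-learn) vs. UMAP (umap-learn) on PCA-reduced data.
import numpy as np
from sklearn.manifold import TSNE
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))  # e.g., 300 cells x 50 principal components

# t-SNE: perplexity sets the effective neighborhood size (local vs. global).
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# UMAP with the defaults quoted above; min_dist controls cluster tightness.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0)
X_umap = reducer.fit_transform(X)

# Unlike t-SNE, a fitted UMAP model can embed previously unseen points.
X_new = reducer.transform(rng.normal(size=(10, 50)))
```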
Clustering Methods
Clustering assigns samples or features to groups based on similarity. In omics, clustering identifies patient subtypes, cell populations, or co-expressed gene modules.
- k-means: Partitions data into $k$ clusters by iteratively minimizing within-cluster sum of squares: $\min \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$. Requires specifying $k$ (use silhouette scores or the gap statistic for selection; see the sketch after this list). Fast but assumes spherical clusters.
- Hierarchical clustering: Builds a tree (dendrogram) of nested clusters using agglomerative (bottom-up) or divisive (top-down) approaches. Linkage criteria (Ward, complete, average) determine merge distances. Widely used for heatmaps of gene expression. Does not require specifying $k$ a priori.
- Leiden / Louvain: Graph-based community detection used extensively in single-cell analysis. Constructs a k-nearest-neighbor graph and optimizes modularity. Leiden improves upon Louvain by guaranteeing well-connected communities. Resolution parameter controls granularity.
- Gaussian Mixture Models (GMM): Probabilistic clustering assuming data arise from a mixture of Gaussians. Fit via EM algorithm. Provides soft cluster assignments (probabilities). BIC or AIC for model selection.
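The sketch referenced in the k-means bullet: silhouette scores over a range of candidate $k$ values, on synthetic data with three planted groups (an illustrative assumption).

```python
# Sketch: choosing k for k-means via silhouette scores.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three synthetic "subtypes", shifted apart in a 20-dimensional space.
X = np.vstack([rng.normal(loc=mu, size=(50, 20)) for mu in (0.0, 3.0, 6.0)])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # highest score suggests k=3
```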
19.3 Feature Selection & Regularization
With omics datasets containing thousands of features and relatively few samples, feature selection is critical for building interpretable and generalizable models. Reducing the feature space mitigates the curse of dimensionality, reduces computational cost, and improves model performance by eliminating noisy or redundant variables.
LASSO Regression ($\ell_1$ Regularization)
The Least Absolute Shrinkage and Selection Operator adds an $\ell_1$ penalty to the ordinary least squares objective, inducing sparsity by shrinking many coefficients exactly to zero:
LASSO Objective
$$\hat{\boldsymbol{\beta}}_{\text{lasso}} = \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2n} \|\mathbf{y} - X\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1 \right\}$$
Where $\lambda > 0$ is the regularization strength. As $\lambda$ increases, more coefficients become zero, yielding a sparser model. The non-differentiability of the $\ell_1$ norm at zero is what produces exact sparsity. LASSO can select at most $\min(n, p)$ features and tends to arbitrarily select one from groups of correlated features.
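A sketch with $\lambda$ chosen by cross-validation (scikit-learn exposes it as `alpha_`); the sparse ground truth is an illustrative assumption.

```python
# Sketch: LASSO with lambda chosen by cross-validation (LassoCV).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # p >> n
beta = np.zeros(1000)
beta[:5] = 2.0                     # only 5 truly informative features
y = X @ beta + rng.normal(size=100)

lasso = LassoCV(cv=5).fit(X, y)    # note: scikit-learn calls lambda "alpha"
selected = np.flatnonzero(lasso.coef_)
print(f"lambda={lasso.alpha_:.3f}, nonzero coefficients: {selected.size}")
```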
Elastic Net
Elastic net combines $\ell_1$ and $\ell_2$ penalties to overcome LASSO's limitations with correlated features:
Elastic Net Objective
$$\hat{\boldsymbol{\beta}}_{\text{enet}} = \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2n} \|\mathbf{y} - X\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2 \right\}$$
The $\ell_2$ penalty encourages grouping of correlated features (selecting all or none together), while the $\ell_1$ penalty maintains sparsity. The mixing parameter $\alpha = \lambda_1 / (\lambda_1 + \lambda_2)$ balances between ridge ($\alpha = 0$) and LASSO ($\alpha = 1$).
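In scikit-learn's parameterization, `l1_ratio` plays the role of the mixing parameter $\alpha$ and `alpha` sets the overall penalty strength; the duplicated-feature setup below is an illustrative assumption used to show the grouping effect.

```python
# Sketch: elastic net keeping a group of nearly identical features together.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
x0 = rng.normal(size=(100, 1))
X = np.hstack([x0 + 0.01 * rng.normal(size=(100, 5)),  # 5 near-copies of x0
               rng.normal(size=(100, 495))])           # 495 noise features
y = 2.0 * x0.ravel() + rng.normal(size=100, scale=0.5)

enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)
print("nonzero coefs in the correlated group:",
      np.flatnonzero(enet.coef_[:5]).size)  # tends to keep several, not just one
```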
Other Feature Selection Approaches
Mutual Information-Based Selection
Ranks features by their mutual information with the target variable, $I(X_j; Y)$. Captures non-linear dependencies, unlike correlation-based methods. mRMR (minimum Redundancy Maximum Relevance) extends this by also penalizing redundancy among selected features: $\max \left[ I(X_j; Y) - \frac{1}{|S|} \sum_{X_s \in S} I(X_j; X_s) \right]$.
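A sketch of MI-based ranking with scikit-learn's nearest-neighbor MI estimator; the non-linear signal is an illustrative assumption.

```python
# Sketch: ranking features by estimated mutual information with the label.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, SelectKBest

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 300))
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 > 1).astype(int)  # non-linear dependence

mi = mutual_info_classif(X, y, random_state=0)
print("top features by MI:", np.argsort(mi)[::-1][:5])

# The same criterion as a transformer, e.g. inside a pipeline:
X_sel = SelectKBest(mutual_info_classif, k=20).fit_transform(X, y)
```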
Recursive Feature Elimination (RFE)
Wrapper method that iteratively trains a model (e.g., SVM or RF), ranks features by importance, removes the least important, and repeats until a desired number of features remain. RFE-CV uses cross-validation to select the optimal feature subset size. Computationally expensive but accounts for feature interactions during selection.
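A sketch of RFE with a cross-validated subset size, using a linear SVM whose coefficient magnitudes provide the ranking; the step size is an illustrative assumption.

```python
# Sketch: recursive feature elimination with CV-selected subset size (RFECV).
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

selector = RFECV(SVC(kernel="linear"), step=10, cv=5).fit(X, y)
print("optimal subset size:", selector.n_features_)
print("kept features:", np.flatnonzero(selector.support_)[:10])
```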
19.4 Cross-Validation & Model Evaluation
Reliable estimation of model performance is crucial in omics, where small sample sizes make overfitting a constant threat. Cross-validation (CV) provides an unbiased estimate of how well a model generalizes to unseen data by systematically partitioning the dataset into training and validation folds.
Cross-Validation Strategies
- k-Fold CV: Data split into $k$ folds; each fold serves as validation once while remaining $k-1$ folds are used for training. Typically $k = 5$ or $k = 10$. Bias-variance trade-off: larger $k$ reduces bias but increases variance and computation.
- Leave-One-Out CV (LOOCV): Special case where $k = n$. Nearly unbiased but high variance. Practical only for small datasets or computationally cheap models.
- Stratified CV: Ensures each fold has approximately the same class distribution as the full dataset. Essential for imbalanced classes (common in clinical omics: few disease cases vs. many controls).
- Nested CV: Outer loop estimates generalization performance; inner loop tunes hyperparameters. Prevents optimistic bias from tuning on the test fold. Critical for honest reporting in omics studies; a minimal pattern is sketched after this list.
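The nested-CV pattern referenced in the last bullet: `GridSearchCV` is itself an estimator, so wrapping it in `cross_val_score` yields inner tuning and outer evaluation. The parameter grid and fold counts are illustrative assumptions.

```python
# Sketch: nested CV -- inner loop tunes C, outer loop estimates performance.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 500))
y = (X[:, 0] > 0).astype(int)

inner = GridSearchCV(SVC(kernel="linear"),
                     param_grid={"C": [0.01, 0.1, 1, 10]},
                     cv=StratifiedKFold(5))                  # hyperparameter tuning
outer = cross_val_score(inner, X, y, cv=StratifiedKFold(5))  # honest estimate
print(f"nested CV accuracy: {outer.mean():.2f} +/- {outer.std():.2f}")
```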
Evaluation Metrics
The choice of evaluation metric depends on the clinical context. Accuracy can be misleading for imbalanced datasets; a model predicting "no cancer" for every patient achieves 99% accuracy if only 1% have cancer.
| Metric | Formula | Interpretation |
|---|---|---|
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Overall correctness; misleading if classes imbalanced |
| Precision (PPV) | $\frac{TP}{TP + FP}$ | Of predicted positives, how many are true? |
| Recall (Sensitivity) | $\frac{TP}{TP + FN}$ | Of actual positives, how many were detected? |
| F1 Score | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Harmonic mean; balances precision and recall |
| AUC-ROC | $\int_0^1 \text{TPR}(t) \, d\text{FPR}(t)$ | Threshold-independent; probability that a random positive ranks higher than a random negative |
Common Pitfalls in Omics ML
- Data leakage: Feature selection or normalization performed before the train/test split allows information from the test set to influence the model, producing overly optimistic estimates (see the pipeline sketch after this list).
- Confounding batch effects: If disease status is correlated with processing batch, a model may learn batch signatures rather than biological signal.
- No external validation: Performance assessed only on the same cohort is inflated. Independent validation cohorts are essential for clinical translatability.
- Publication bias: Reporting only the best model from many trials produces the "winner's curse"—true performance is likely lower.
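The leakage-safe pattern referenced in the first bullet: normalization and feature selection live inside a `Pipeline`, so each CV fold refits them on its own training portion only. The pure-noise labels are an illustrative assumption; honest CV accuracy should hover near chance.

```python
# Sketch: leakage-safe CV -- preprocessing lives inside the Pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)        # pure noise: true accuracy is 0.5

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=50)),  # refit within each training fold
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
# Selecting the 50 features on ALL of X before splitting would instead report
# a misleadingly high accuracy on these noise labels.
```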
19.5 Deep Learning for Omics
Deep learning has achieved remarkable successes in omics, particularly for tasks involving raw sequence data, images, and large-scale multi-modal datasets where hand-crafted features are insufficient. The key advantage is automatic feature learning: deep networks learn hierarchical representations directly from data.
Autoencoders for Denoising & Imputation
Autoencoders learn compressed representations (bottleneck) of input data through an encoder-decoder architecture. Variational autoencoders (VAEs) add a probabilistic framework, enforcing a structured latent space. In single-cell analysis, scVI uses a deep generative model to account for library size, batch effects, and dropout noise simultaneously.
Variational Autoencoder (VAE) Loss
$$\mathcal{L}(\theta, \phi; \mathbf{x}) = -\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] + D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)$$
The first term is the reconstruction loss (how well the decoder recovers the input). The KL divergence term regularizes the latent space to approximate the prior $p(\mathbf{z}) = \mathcal{N}(0, I)$, ensuring smooth and interpretable latent representations.
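A PyTorch sketch of this loss: the closed-form KL term below is exact for a diagonal-Gaussian encoder against the standard normal prior, while the MSE reconstruction term corresponds to an assumed Gaussian decoder (scVI instead uses count-based likelihoods).

```python
# Sketch: VAE loss with a diagonal-Gaussian encoder (PyTorch).
import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, logvar):
    """Reconstruction term + KL(q(z|x) || N(0, I)), summed over the batch."""
    recon = F.mse_loss(x_recon, x, reduction="sum")      # Gaussian decoder
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def reparameterize(mu, logvar):
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. mu, logvar."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)
```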
CNNs for Sequence Data
Convolutional neural networks excel at detecting local patterns (motifs) in DNA and protein sequences. DeepBind and DeepSEA pioneered the use of 1D CNNs for predicting transcription factor binding sites, chromatin accessibility, and variant effects from sequence alone. The convolution operation scans learned filters across the sequence, and deeper layers capture higher-order combinatorial patterns.
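A minimal PyTorch sketch of this idea for one-hot DNA (4 channels by sequence length); the layer sizes and motif width are illustrative assumptions, not the published DeepBind or DeepSEA architectures.

```python
# Sketch: 1D CNN over one-hot DNA -- filters act as learned motif scanners.
import torch
import torch.nn as nn

class MotifCNN(nn.Module):
    def __init__(self, n_filters=64, motif_len=12):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size=motif_len)
        self.pool = nn.AdaptiveMaxPool1d(1)  # max over positions: best match
        self.fc = nn.Linear(n_filters, 1)    # e.g., binding vs. non-binding

    def forward(self, x):                    # x: (batch, 4, seq_len), one-hot
        h = torch.relu(self.conv(x))         # motif match scores per position
        h = self.pool(h).squeeze(-1)         # (batch, n_filters)
        return self.fc(h)                    # one logit per sequence

logits = MotifCNN()(torch.randn(8, 4, 200)) # 8 sequences of length 200
```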
Transformers & Protein Language Models
Transformer architectures, originally developed for natural language processing, have revolutionized protein structure prediction (AlphaFold2) and protein function annotation. Models such as ESM (Evolutionary Scale Modeling) and ProtTrans are trained on hundreds of millions of protein sequences using masked language modeling. They learn contextual embeddings that capture evolutionary and structural information, enabling zero-shot prediction of mutation effects, secondary structure, and protein-protein interactions.
Model Interpretation: SHAP & Attention
Interpretability is essential for biological discovery and clinical trust. SHAP (SHapley Additive exPlanations) values assign each feature an additive contribution to a prediction, based on cooperative game theory. For deep learning models, attention weights reveal which input positions (e.g., amino acid residues, genomic loci) the model focuses on when making predictions. Integrated gradients and saliency maps provide alternative attribution methods.
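A usage sketch for SHAP on a tree ensemble; it assumes the third-party shap package is installed, and the exact return shape of `shap_values` varies across shap versions.

```python
# Usage sketch: SHAP values for a random forest (assumes `pip install shap`).
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (X[:, 0] - X[:, 3] > 0).astype(int)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # fast, exact for tree ensembles
shap_values = explainer.shap_values(X)  # per-sample, per-feature contributions
shap.summary_plot(shap_values, X)       # global importance + effect direction
```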
Transfer Learning in Omics
Transfer learning leverages knowledge from a model pre-trained on a large dataset to improve performance on a smaller target dataset. This is particularly valuable in omics where labeled clinical datasets are expensive to generate.
- Domain adaptation: Pre-train on cell lines, fine-tune on patient samples
- Cross-species transfer: Pre-train on mouse data, adapt to human
- Foundation models: Large pre-trained models (Geneformer for single-cell, ESM for proteins) serve as general-purpose feature extractors that can be fine-tuned for specific downstream tasks
- Multi-task learning: Jointly predict multiple related outcomes (e.g., drug sensitivity across cancer types) to share statistical strength
19.6 Overfitting, Regularization & Benchmarking
Overfitting occurs when a model captures noise in the training data rather than the underlying biological signal, resulting in excellent training performance but poor generalization. In omics, the extreme $p \gg n$ regime makes this an ever-present concern. Regularization techniques constrain model complexity to combat overfitting.
Regularization Techniques
| Technique | Mechanism | Application |
|---|---|---|
| $\ell_1$ (LASSO) | Drives weights to zero (sparsity) | Feature selection in linear models |
| $\ell_2$ (Ridge / Weight decay) | Shrinks weights toward zero | All models; prevents extreme weights |
| Dropout | Randomly zeroes neuron activations during training | Neural networks; implicit ensemble |
| Early stopping | Halts training when validation loss stops improving | Neural networks; gradient boosting |
| Batch normalization | Normalizes layer inputs; smooths loss landscape | Deep networks; stabilizes training |
Benchmarking Best Practices
Rigorous benchmarking is essential for comparing ML methods in omics. The DREAM challenges and MAQC consortium have established gold standards for evaluation. Key principles include:
- Use standardized datasets: Published benchmarks with known ground truth enable fair comparison across studies
- Report multiple metrics: AUC-ROC, AUC-PR, F1, and calibration plots provide complementary views of performance
- Statistical comparison: Use paired tests (Wilcoxon signed-rank) or corrected resampled t-tests across CV folds to assess significance of performance differences; see the sketch after this list
- Include baselines: Always compare to simple baselines (logistic regression, random classifier) to quantify the value added by complex models
- Report computational cost: Training time, memory, and inference speed matter for clinical deployment
- Code and data availability: Full reproducibility requires sharing preprocessing code, model architectures, hyperparameters, and random seeds
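The paired test from the "Statistical comparison" bullet, sketched on illustrative fold scores; note that CV folds overlap, so the resulting p-value is approximate (hence the corrected resampled t-test alternative).

```python
# Sketch: Wilcoxon signed-rank test on paired per-fold scores (SciPy).
import numpy as np
from scipy.stats import wilcoxon

# Accuracies of two models on the SAME 10 CV folds (paired observations).
model_a = np.array([0.81, 0.78, 0.84, 0.80, 0.79, 0.83, 0.82, 0.77, 0.80, 0.85])
model_b = np.array([0.78, 0.76, 0.80, 0.79, 0.75, 0.81, 0.79, 0.74, 0.78, 0.82])

stat, p = wilcoxon(model_a, model_b)  # H0: median paired difference is zero
print(f"Wilcoxon statistic={stat}, p={p:.3f}")
```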