AI & Machine Learning in Biology

From AlphaFold's revolution in protein structure prediction to generative models designing novel proteins. Explore how deep learning is transforming our understanding of molecular biology.

1. AlphaFold: Solving Protein Structure

AlphaFold2 (Jumper et al., Nature 2021) achieved near-experimental accuracy in protein structure prediction, solving a 50-year grand challenge. Its architecture combines evolutionary information from multiple sequence alignments (MSAs) with geometric reasoning in an SE(3)-equivariant structure module.

The Evoformer Module

The core of AlphaFold2 is the Evoformer, a stack of 48 blocks that jointly processes the MSA and pair representations. Row-wise attention over the MSA captures correlations between residue positions within each sequence:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + b\right)V$$

where the bias term $b$ incorporates pair representation information, allowing structural knowledge to guide MSA processing. Triangle attention updates encourage geometric consistency: each pair feature $z_{ij}$ is updated from the third edges $z_{ik}$ and $z_{kj}$, biasing the network toward distance predictions that satisfy the triangle inequality:

$$z_{ij} \leftarrow z_{ij} + \sum_k \text{TriangleAttn}(z_{ik}, z_{kj}, z_{ij})$$
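To make the biased attention concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention with an additive pair bias. The dimensions and random inputs are illustrative, not AlphaFold2's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pair_biased_attention(Q, K, V, bias):
    """Scaled dot-product attention with an additive pair bias.

    Q, K, V: (n, d_k) query/key/value matrices for one MSA row.
    bias:    (n, n) bias derived from the pair representation z_ij.
    """
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + bias   # pair info steers attention
    return softmax(logits, axis=-1) @ V

rng = np.random.default_rng(0)
n, d_k = 5, 8
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
bias = rng.standard_normal((n, n))
out = pair_biased_attention(Q, K, V, bias)
print(out.shape)  # (5, 8)
```

The bias is added to the attention logits before the softmax, so a strong pair feature between positions $i$ and $j$ directly increases how much row position $i$ attends to $j$.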

Structure Module & IPA

The Structure Module converts pair representations into 3D coordinates using Invariant Point Attention (IPA), which operates on SE(3)-equivariant frames attached to each residue. The loss function uses Frame Aligned Point Error (FAPE):

$$\mathcal{L}_{\text{FAPE}} = \frac{1}{N^2} \sum_{i,j} \left\| T_i^{-1} \circ \mathbf{x}_j^{\text{pred}} - T_i^{-1} \circ \mathbf{x}_j^{\text{true}} \right\|$$
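A toy, unclamped version of FAPE can be sketched by representing each frame $T_i$ as a rotation-translation pair $(R_i, t_i)$, so that $T_i^{-1} \circ \mathbf{x} = R_i^T(\mathbf{x} - t_i)$ expresses a point in residue $i$'s local frame. The published loss additionally clamps large errors; this sketch omits that detail.

```python
import numpy as np

def fape(frames, x_pred, x_true):
    """Unclamped Frame Aligned Point Error (simplified sketch).

    frames: list of (R, t) with R a 3x3 rotation, t a 3-vector, one per residue.
    x_pred, x_true: (N, 3) predicted and true atom positions.
    """
    errs = []
    for R, t in frames:
        local_pred = (x_pred - t) @ R   # each row is R^T (x_j - t)
        local_true = (x_true - t) @ R
        errs.append(np.linalg.norm(local_pred - local_true, axis=-1))
    return float(np.mean(errs))        # average over frames i and points j

# identical structures give zero error
N = 4
frames = [(np.eye(3), np.zeros(3)) for _ in range(N)]
x = np.arange(N * 3, dtype=float).reshape(N, 3)
print(fape(frames, x, x))  # 0.0
```

Because every point is measured in every local frame, the loss is invariant to a global rigid transform applied to the prediction together with its frames, which is exactly the symmetry a structure loss should have.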

2. Protein Language Models

Protein language models treat amino acid sequences as "sentences" in a biological language, learning grammar (biochemical rules) and semantics (structural and functional properties) from millions of evolved sequences.

ESM-2 (Meta AI)

- Up to 15 billion parameters in the largest model
- Trained on ~65 million UniRef sequences
- Masked language modeling objective (15% of tokens masked)
- Attention maps encode residue contact information
- Zero-shot structure prediction capability
- Embeddings capture evolutionary conservation

The prediction head maps the final-layer embedding $h_i^{(L)}$ to a distribution over amino acids:

$$P(\text{aa}_i | \text{context}) = \text{softmax}(W \cdot h_i^{(L)} + b)$$

ProtTrans Family

- ProtBERT, ProtAlbert, ProtXLNet, ProtT5
- Trained on UniRef and BFD (>2B sequences)
- Transfer learning for downstream tasks
- Secondary structure prediction (~82% Q3)
- Subcellular localization prediction
- Function annotation from embeddings

The BERT-style members of the family are pretrained with a masked language modeling loss over the masked positions:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \text{masked}} \log P(\text{aa}_i | \mathbf{x}_{\backslash i})$$
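As a sketch of how the prediction head and the MLM loss fit together, the following toy NumPy code computes the per-position amino-acid distribution and sums the negative log-likelihood over masked positions. The dimensions and weights are illustrative, not taken from any released model.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def mlm_loss(h, W, b, targets, masked_idx):
    """Masked language-model loss over 20 amino-acid classes (sketch).

    h: (L, d) final-layer embeddings; W: (20, d); b: (20,).
    targets: (L,) integer amino-acid labels; masked_idx: masked positions.
    """
    probs = softmax(h @ W.T + b)                      # P(aa_i | context)
    nll = -np.log(probs[masked_idx, targets[masked_idx]])
    return float(nll.sum())

rng = np.random.default_rng(1)
L, d = 10, 16
h = rng.standard_normal((L, d))
W, b = rng.standard_normal((20, d)), np.zeros(20)
targets = rng.integers(0, 20, size=L)
masked = np.array([2, 5, 7])                          # 15%-style masking
print(mlm_loss(h, W, b, targets, masked) > 0)  # True
```

Only the masked positions contribute to the loss, which forces the model to reconstruct hidden residues from their sequence context, the mechanism by which evolutionary constraints are absorbed into the embeddings.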

3. Generative Models for Protein Design

Diffusion models have emerged as powerful tools for de novo protein design, generating novel backbone structures that can then be sequence-designed to fold into the intended shape and perform specified functions.

RFDiffusion

RFDiffusion (Watson et al., Nature 2023) applies denoising diffusion to protein backbone generation, using a modified RoseTTAFold architecture as the denoising network. The forward diffusion process adds noise to protein coordinates:

$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$$

The reverse (denoising) process learns to reconstruct protein structure:

$$p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$$
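A minimal NumPy sketch of the forward (noising) process above follows. The linear noise schedule is an illustrative assumption, and RFDiffusion actually diffuses rigid residue frames rather than bare coordinates; this sketch noises plain 3D points.

```python
import numpy as np

def forward_diffuse(x0, betas, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) step by step (DDPM forward process).

    x0: (N, 3) clean coordinates; betas: noise schedule beta_1..beta_T.
    """
    x = x0.copy()
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 3))          # toy "backbone" coordinates
betas = np.linspace(1e-4, 0.05, 100)      # linear schedule (assumption)
xT = forward_diffuse(x0, betas, rng)
# after many steps x_T approaches an isotropic Gaussian
print(xT.shape)  # (8, 3)
```

The denoising network is trained to invert these steps; at sampling time, starting from pure Gaussian noise and repeatedly applying the learned reverse kernel $p_\theta$ yields a novel backbone.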

Graph Neural Networks for Molecular Properties

GNNs represent molecules as graphs where atoms are nodes and bonds are edges. Message passing neural networks (MPNNs) update node features through iterative aggregation:

$$\mathbf{h}_v^{(k+1)} = \text{UPDATE}^{(k)}\left(\mathbf{h}_v^{(k)}, \sum_{u \in \mathcal{N}(v)} \text{MSG}^{(k)}(\mathbf{h}_v^{(k)}, \mathbf{h}_u^{(k)}, \mathbf{e}_{vu})\right)$$

Applications include predicting binding affinities, solubility, toxicity, and reaction outcomes. SchNet, DimeNet, and PaiNN incorporate 3D geometric information for molecular property prediction.
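The update rule above can be sketched with sum aggregation and toy MSG/UPDATE functions. The weight shapes and tanh nonlinearities are illustrative choices, not taken from any specific MPNN architecture.

```python
import numpy as np

def mpnn_layer(h, edges, edge_feats, W_msg, W_upd):
    """One sum-aggregation message-passing step (sketch).

    h: (n, d) node (atom) features; edges: list of directed (v, u) pairs;
    edge_feats: dict (v, u) -> (d_e,) bond features.
    MSG concatenates [h_v, h_u, e_vu]; UPDATE concatenates [h_v, m_v].
    """
    n, d = h.shape
    m = np.zeros((n, d))
    for v, u in edges:
        inp = np.concatenate([h[v], h[u], edge_feats[(v, u)]])
        m[v] += np.tanh(W_msg @ inp)          # aggregate messages by sum
    return np.tanh(W_upd @ np.concatenate([h, m], axis=1).T).T

rng = np.random.default_rng(0)
n, d, d_e = 3, 4, 2
h = rng.standard_normal((n, d))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]      # undirected bonds, both directions
edge_feats = {e: rng.standard_normal(d_e) for e in edges}
W_msg = rng.standard_normal((d, 2 * d + d_e))
W_upd = rng.standard_normal((d, 2 * d))
h_new = mpnn_layer(h, edges, edge_feats, W_msg, W_upd)
print(h_new.shape)  # (3, 4)
```

Stacking $k$ such layers lets each atom's representation depend on its $k$-hop chemical neighborhood, after which a readout over all nodes predicts the molecular property.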

4. ML-Enhanced Molecular Dynamics

Machine learning potentials bridge the accuracy of quantum mechanics with the speed of classical force fields. Neural network potentials (NNPs) learn the potential energy surface from quantum chemical data.

Neural Network Potentials

The total energy is decomposed into atomic contributions:

$$E = \sum_i E_i(\{\mathbf{r}_j : j \in \mathcal{N}_i\})$$

ANI, SchNet, NequIP, and MACE achieve DFT-level accuracy at a fraction of the cost.
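The key property of this decomposition is locality: each atomic energy depends only on neighbors inside a cutoff, which is what makes NNPs linear-scaling. The `atomic_energy` function below is a toy stand-in for a trained network, chosen purely for illustration.

```python
import numpy as np

def total_energy(positions, cutoff, atomic_energy):
    """Sum of atomic contributions E = sum_i E_i(local environment).

    positions: (n, 3) Cartesian coordinates.
    atomic_energy: maps an array of neighbour distances to a scalar E_i.
    """
    E = 0.0
    for i in range(len(positions)):
        d = np.linalg.norm(positions - positions[i], axis=1)
        neighbours = d[(d > 0) & (d < cutoff)]   # local environment of atom i
        E += atomic_energy(neighbours)
    return E

# toy "network": a smooth pair term summed over the neighbourhood
toy = lambda dists: float(np.sum(np.exp(-dists)))
pos = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]])
print(total_energy(pos, cutoff=3.0, atomic_energy=toy))
```

In a real NNP the neighbor distances (or full local geometries) are first encoded into symmetry functions or equivariant features before being fed to the network; the summation structure is the same.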

Enhanced Sampling with ML

ML identifies collective variables for enhanced sampling:

$$\xi = f_\theta(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$$

Autoencoders and variational approaches discover slow degrees of freedom automatically.
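Under a strong simplifying assumption (a linear encoder/decoder with squared reconstruction loss), the optimal one-dimensional bottleneck reduces to the top principal component of the coordinate fluctuations, so a PCA-style sketch illustrates the idea of discovering a slow collective variable from trajectory data:

```python
import numpy as np

# Toy "trajectory": one slow degree of freedom buried in fast noise.
rng = np.random.default_rng(0)
slow = rng.standard_normal(500)                 # slow mode amplitude per frame
fast = 0.1 * rng.standard_normal((500, 3))      # fast thermal noise
X = np.outer(slow, [1.0, 0.5, 0.0]) + fast      # (frames, coordinates)

# Linear-autoencoder optimum = top principal component of the fluctuations.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
w = eigvecs[:, -1]                              # top PC = learned CV direction
xi = Xc @ w                                     # xi = f(r): the collective variable

# The discovered CV should track the hidden slow mode almost perfectly.
print(abs(np.corrcoef(xi, slow)[0, 1]) > 0.9)
```

Nonlinear autoencoders and variational approaches (e.g. time-lagged objectives) generalize this by replacing the linear projection with a neural network $f_\theta$, recovering slow modes that no single linear direction can capture.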

Python Simulation: Secondary Structure Prediction

This simulation implements a simplified Chou-Fasman-like secondary structure prediction algorithm using amino acid propensities. It also provides an overview of AlphaFold2 architecture and protein language model approaches, with sequence feature analysis including hydrophobicity profiling.

Protein Secondary Structure Prediction & Feature Analysis

Python

Chou-Fasman propensity-based structure prediction with hydrophobicity profiling

script.py (179 lines)
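The full script is not reproduced on this page; a minimal sketch of the propensity-based idea, using a small subset of the published Chou-Fasman propensities, might look like the following. The window size and decision thresholds are illustrative choices.

```python
# Chou-Fasman-style helix/sheet propensities for a subset of residues
# (values from the published tables; unknown residues default to 1.0).
HELIX = {'A': 1.42, 'L': 1.21, 'E': 1.51, 'M': 1.45, 'G': 0.57,
         'P': 0.57, 'K': 1.16, 'V': 1.06, 'S': 0.77, 'T': 0.83}
SHEET = {'A': 0.83, 'L': 1.30, 'E': 0.37, 'M': 1.05, 'G': 0.75,
         'P': 0.55, 'K': 0.74, 'V': 1.70, 'S': 0.75, 'T': 1.19}

def predict_ss(seq, window=5):
    """Assign H (helix), E (sheet), or C (coil) by windowed propensity averages."""
    half = window // 2
    states = []
    for i in range(len(seq)):
        win = seq[max(0, i - half): i + half + 1]
        p_h = sum(HELIX.get(aa, 1.0) for aa in win) / len(win)
        p_e = sum(SHEET.get(aa, 1.0) for aa in win) / len(win)
        if p_h > 1.03 and p_h >= p_e:
            states.append('H')       # helix former dominates the window
        elif p_e > 1.03:
            states.append('E')       # sheet former dominates
        else:
            states.append('C')       # neither: coil
    return ''.join(states)

ss = predict_ss("MAELKEKLAVVVTTS")
print(ss)
```

Propensity methods like this reach roughly 50-60% per-residue accuracy, which is why the deep learning methods discussed above (reaching ~82% Q3 and beyond) displaced them.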

Fortran Computation: Gradient Descent on Lennard-Jones

This Fortran program demonstrates numerical gradient descent optimization applied to finding the equilibrium distance of the Lennard-Jones potential — a fundamental building block in molecular mechanics force fields used by ML-enhanced simulations.

Gradient Descent Optimizer for Lennard-Jones Potential

Fortran

Numerical optimization finding the equilibrium separation in a Lennard-Jones pair potential

gradient_descent_lj.f90 (114 lines)
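The Fortran source is not shown on this page; an equivalent sketch in Python, using the same central-difference numerical gradient idea, follows. For $V(r) = 4\epsilon[(\sigma/r)^{12} - (\sigma/r)^6]$ the analytic equilibrium is $r = 2^{1/6}\sigma \approx 1.1225\,\sigma$, which the optimizer should recover; the starting point, learning rate, and step count are illustrative.

```python
def lj(r, eps=1.0, sigma=1.0):
    """Lennard-Jones pair potential V(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    s = sigma / r
    return 4.0 * eps * (s**12 - s**6)

def minimize_lj(r0=1.5, lr=0.01, h=1e-5, steps=2000):
    """Gradient descent using a central-difference numerical derivative."""
    r = r0
    for _ in range(steps):
        grad = (lj(r + h) - lj(r - h)) / (2.0 * h)  # dV/dr, no analytic form used
        r -= lr * grad                              # step downhill
    return r

r_min = minimize_lj()
print(round(r_min, 4))   # 1.1225, matching the analytic 2^(1/6)
```

The central difference makes the method derivative-free with respect to the potential, the same trick that lets force fields or ML potentials without analytic gradients be optimized numerically.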

Video Lectures

AlphaFold and the Future of Biology

DeepMind's presentation on AlphaFold2, its architecture, training methodology, and implications for structural biology and drug discovery.

Machine Learning for Molecular Science

Overview of machine learning approaches in molecular science, including neural network potentials, generative models, and property prediction.

Key Concepts Summary

AlphaFold2

Evoformer + Structure Module achieves near-experimental accuracy in protein structure prediction

Protein Language Models

ESM-2 and ProtTrans learn structural and functional embeddings from evolutionary sequences

Diffusion Models

RFDiffusion generates novel protein backbones through learned denoising of 3D coordinates

ML Potentials

Neural network potentials achieve quantum accuracy at classical speed for molecular simulations