AI & Machine Learning in Biology
From AlphaFold's revolution in protein structure prediction to generative models that design novel proteins, explore how deep learning is transforming our understanding of molecular biology.
1. AlphaFold: Solving Protein Structure
AlphaFold2 (Jumper et al., Nature 2021) achieved near-experimental accuracy in protein structure prediction, solving a 50-year grand challenge. Its architecture combines evolutionary information from multiple sequence alignments (MSAs) with geometric reasoning through equivariant neural networks.
The Evoformer Module
The core of AlphaFold2 is the Evoformer, a stack of 48 blocks that jointly process MSA and pair representations. The attention mechanism in the MSA rows captures within-sequence correlations:
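$$a_{ij} = \operatorname{softmax}_j\!\left(\frac{\mathbf{q}_i^{\top}\mathbf{k}_j}{\sqrt{c}} + b_{ij}\right), \qquad \mathbf{o}_i = \sum_j a_{ij}\,\mathbf{v}_j$$

(simplified form; the published block additionally applies a learned gate to the attention output)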
where the bias term $b$ incorporates pair representation information, allowing structural knowledge to guide MSA processing. Triangle attention updates enforce geometric consistency by ensuring that the predicted distances satisfy the triangle inequality:
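$$d_{ij} \le d_{ik} + d_{kj} \quad \text{for all residues } i, j, k$$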
Structure Module & IPA
The Structure Module converts pair representations into 3D coordinates using Invariant Point Attention (IPA), which operates on SE(3)-equivariant frames attached to each residue. The loss function uses Frame Aligned Point Error (FAPE):
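$$\mathrm{FAPE} = \frac{1}{N_{\text{frames}}\,N_{\text{atoms}}} \sum_{i,j} \min\!\left(d_{\text{clamp}},\ \left\lVert T_i^{-1} \circ \mathbf{x}_j - \left(T_i^{\text{true}}\right)^{-1} \circ \mathbf{x}_j^{\text{true}} \right\rVert\right)$$

(simplified form: predicted atom positions $\mathbf{x}_j$ are compared in the local frame $T_i$ of each residue, with distances clamped at $d_{\text{clamp}} = 10$ Å in the paper)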
2. Protein Language Models
Protein language models treat amino acid sequences as "sentences" in a biological language, learning grammar (biochemical rules) and semantics (structural and functional properties) from millions of evolved sequences.
ESM-2 (Meta AI)
- Model family from 8M to 15B parameters (largest: 15B)
- Trained on ~65 million UniRef sequences
- Masked language modeling (15% masking)
- Attention maps encode contact information
- Zero-shot structure prediction capability
- Embeddings capture evolutionary conservation
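The 15% masked-language-modeling objective can be illustrated with a toy sketch. The sequence, mask token, and masking scheme below are illustrative only; the real ESM-2 pipeline uses its own tokenizer and a BERT-style replacement scheme rather than pure masking.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"

def mask_sequence(seq, rate=0.15, seed=0):
    """Replace ~`rate` of residues with a mask token; the hidden
    residues and their positions become the MLM training targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {i: tokens[i] for i in positions}  # what the model must recover
    for i in positions:
        tokens[i] = MASK
    return tokens, targets

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative sequence
tokens, targets = mask_sequence(seq)
print(sum(t == MASK for t in tokens), "of", len(seq), "residues masked")
# 5 of 33 residues masked
```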
ProtTrans Family
- ProtBERT, ProtAlbert, ProtXLNet, ProtT5
- Trained on UniRef and BFD (>2B sequences)
- Transfer learning for downstream tasks
- Secondary structure prediction (~82% Q3)
- Subcellular localization prediction
- Function annotation from embeddings
3. Generative Models for Protein Design
Diffusion models have emerged as powerful tools for de novo protein design, generating novel backbone structures that fold as intended and can be tailored to perform specified functions.
RFDiffusion
RFDiffusion (Watson et al., Nature 2023) applies denoising diffusion to protein backbone generation, using a modified RoseTTAFold architecture as the denoising network. The forward diffusion process adds noise to protein coordinates:
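$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right)$$

(written here for Euclidean coordinates; RFDiffusion in fact diffuses full residue frames, handling the rotational components on SO(3) separately)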
The reverse (denoising) process learns to reconstruct protein structure:
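$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \sigma_t^2 \mathbf{I}\right)$$

with the mean $\boldsymbol{\mu}_\theta$ predicted by the RoseTTAFold-based denoising network.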
Graph Neural Networks for Molecular Properties
GNNs represent molecules as graphs where atoms are nodes and bonds are edges. Message passing neural networks (MPNNs) update node features through iterative aggregation:
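$$\mathbf{m}_v^{(t+1)} = \sum_{w \in \mathcal{N}(v)} M_t\!\left(\mathbf{h}_v^{(t)}, \mathbf{h}_w^{(t)}, \mathbf{e}_{vw}\right), \qquad \mathbf{h}_v^{(t+1)} = U_t\!\left(\mathbf{h}_v^{(t)}, \mathbf{m}_v^{(t+1)}\right)$$

where $\mathbf{h}_v$ are node (atom) features, $\mathbf{e}_{vw}$ are edge (bond) features, and $M_t$, $U_t$ are learned message and update functions (notation of Gilmer et al., 2017).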
Applications include predicting binding affinities, solubility, toxicity, and reaction outcomes. SchNet, DimeNet, and PaiNN incorporate 3D geometric information for molecular property prediction.
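The message-passing update above can be sketched in a few lines of NumPy. This is a toy illustration, not any published architecture: messages are a linear map of neighbor features summed via the adjacency matrix, and the update is a single tanh layer.

```python
import numpy as np

def mpnn_layer(h, adj, W_msg, W_upd):
    """One message-passing round: each atom sums messages from bonded
    neighbours (adj @ ...), then updates its own feature vector."""
    msgs = adj @ (h @ W_msg)          # m_v = sum_{w in N(v)} M(h_w)
    return np.tanh(h @ W_upd + msgs)  # h_v <- U(h_v, m_v)

rng = np.random.default_rng(0)
# Toy molecule: 4 atoms in a chain (bonds 0-1, 1-2, 2-3), feature dim 8
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h = rng.standard_normal((4, 8))
W_msg = rng.standard_normal((8, 8)) * 0.1
W_upd = rng.standard_normal((8, 8)) * 0.1

for _ in range(3):                    # three message-passing rounds
    h = mpnn_layer(h, adj, W_msg, W_upd)

graph_embedding = h.mean(axis=0)      # readout: mean-pool atom features
print(graph_embedding.shape)
```

A readout such as the mean-pool here feeds a final regressor for graph-level properties (binding affinity, solubility, and so on).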
4. ML-Enhanced Molecular Dynamics
Machine learning potentials bridge the accuracy of quantum mechanics with the speed of classical force fields. Neural network potentials (NNPs) learn the potential energy surface from quantum chemical data.
Neural Network Potentials
The total energy is decomposed into atomic contributions:
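$$E_{\text{total}} = \sum_{i=1}^{N} E_i\!\left(\{\mathbf{r}_j\}_{j \in \mathcal{N}(i)}\right)$$

where each atomic energy $E_i$ depends only on atom $i$'s local chemical environment (the Behler-Parrinello decomposition).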
ANI, SchNet, NequIP, and MACE achieve DFT-level accuracy at a fraction of the cost.
Enhanced Sampling with ML
ML identifies collective variables for enhanced sampling:
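$$s = f_\theta(\mathbf{x}), \qquad \mathcal{L}(\theta, \phi) = \frac{1}{M} \sum_{m=1}^{M} \left\lVert \mathbf{x}_m - g_\phi\!\big(f_\theta(\mathbf{x}_m)\big) \right\rVert^2$$

(one common construction: an autoencoder is trained to reconstruct configurations $\mathbf{x}_m$, and the bottleneck output $s$ serves as the learned collective variable)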
Autoencoders and variational approaches discover slow degrees of freedom automatically.
Python Simulation: Secondary Structure Prediction
This simulation implements a simplified Chou-Fasman-style secondary structure prediction algorithm using amino acid propensities. It also surveys the AlphaFold2 architecture and protein language model approaches, with sequence feature analysis including hydrophobicity profiling.
Protein Secondary Structure Prediction & Feature Analysis
Python: Chou-Fasman propensity-based structure prediction with hydrophobicity profiling
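A minimal sketch of the predictor described above. The propensity tables are the published Chou-Fasman (1978) values and the Kyte-Doolittle (1982) hydrophobicity scale; the window sizes, the 1.0 threshold, and the demo sequence are illustrative choices, not the original algorithm's full nucleation/extension rules.

```python
AA = "ACDEFGHIKLMNPQRSTVWY"
# Chou-Fasman helix (P_alpha) and sheet (P_beta) propensities
HELIX = dict(zip(AA, [1.42, 0.70, 1.01, 1.51, 1.13, 0.57, 1.00, 1.08, 1.16,
                      1.21, 1.45, 0.67, 0.57, 1.11, 0.98, 0.77, 0.83, 1.06,
                      1.08, 0.69]))
SHEET = dict(zip(AA, [0.83, 1.19, 0.54, 0.37, 1.38, 0.75, 0.87, 1.60, 0.74,
                      1.30, 1.05, 0.89, 0.55, 1.10, 0.93, 0.75, 1.19, 1.70,
                      1.37, 1.47]))
# Kyte-Doolittle hydrophobicity scale
KD = dict(zip(AA, [1.8, 2.5, -3.5, -3.5, 2.8, -0.4, -3.2, 4.5, -3.9,
                   3.8, 1.9, -3.5, -1.6, -3.5, -4.5, -0.8, -0.7, 4.2,
                   -0.9, -1.3]))

def predict_ss(seq, window=5):
    """Assign H (helix), E (sheet), or C (coil) per residue from the
    windowed average propensity, calling whichever exceeds 1.0."""
    half = window // 2
    out = []
    for i in range(len(seq)):
        win = seq[max(0, i - half):i + half + 1]
        p_helix = sum(HELIX[a] for a in win) / len(win)
        p_sheet = sum(SHEET[a] for a in win) / len(win)
        if p_helix >= p_sheet and p_helix > 1.0:
            out.append("H")
        elif p_sheet > 1.0:
            out.append("E")
        else:
            out.append("C")
    return "".join(out)

def hydrophobicity(seq, window=9):
    """Kyte-Doolittle sliding-window hydrophobicity profile."""
    half = window // 2
    return [sum(KD[a] for a in seq[max(0, i - half):i + half + 1])
            / len(seq[max(0, i - half):i + half + 1])
            for i in range(len(seq))]

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative sequence
pred = predict_ss(seq)
prof = hydrophobicity(seq)
print(seq)
print(pred)
print("max window hydrophobicity: %.2f" % max(prof))
```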
Fortran Computation: Gradient Descent on Lennard-Jones
This Fortran program demonstrates numerical gradient descent optimization applied to finding the equilibrium distance of the Lennard-Jones potential — a fundamental building block in molecular mechanics force fields used by ML-enhanced simulations.
Gradient Descent Optimizer for Lennard-Jones Potential
Fortran: Numerical optimization finding the equilibrium separation in a Lennard-Jones pair potential
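The server-side program is Fortran; the same optimization can be sketched in Python. This uses a central-difference numerical gradient on $V(r) = 4\varepsilon[(\sigma/r)^{12} - (\sigma/r)^6]$, whose analytic minimum lies at $r = 2^{1/6}\sigma \approx 1.1225\,\sigma$; the starting point, learning rate, and tolerances are illustrative.

```python
def lj(r, eps=1.0, sigma=1.0):
    """Lennard-Jones pair potential V(r)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

def minimize(r0=1.5, lr=0.01, h=1e-6, steps=5000, tol=1e-10):
    """Gradient descent with a central-difference gradient estimate."""
    r = r0
    for _ in range(steps):
        grad = (lj(r + h) - lj(r - h)) / (2.0 * h)
        r_new = r - lr * grad
        if abs(r_new - r) < tol:      # converged: step size negligible
            break
        r = r_new
    return r

r_min = minimize()
print(f"equilibrium separation: {r_min:.6f} (exact: {2 ** (1 / 6):.6f})")
```

Starting from $r_0 = 1.5\sigma$ the iteration descends the attractive tail into the well; a too-small starting separation would instead put the optimizer on the steep repulsive wall, where the numerical gradient is large and a smaller learning rate would be needed.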
Video Lectures
AlphaFold and the Future of Biology
DeepMind's presentation on AlphaFold2, its architecture, training methodology, and implications for structural biology and drug discovery.
Machine Learning for Molecular Science
Overview of machine learning approaches in molecular science, including neural network potentials, generative models, and property prediction.
Key Concepts Summary
AlphaFold2
Evoformer + Structure Module achieves near-experimental accuracy in protein structure prediction
Protein Language Models
ESM-2 and ProtTrans learn structural and functional embeddings from evolutionary sequences
Diffusion Models
RFDiffusion generates novel protein backbones through learned denoising of 3D coordinates
ML Potentials
Neural network potentials achieve quantum accuracy at classical speed for molecular simulations