AI & Machine Learning in Biology

From AlphaFold's revolution in protein structure prediction to generative models designing novel proteins. Explore how deep learning is transforming our understanding of molecular biology.

1. AlphaFold: Solving Protein Structure

AlphaFold2 (Jumper et al., Nature 2021) achieved near-experimental accuracy in protein structure prediction, solving a 50-year grand challenge. Its architecture combines evolutionary information from multiple sequence alignments (MSAs) with geometric reasoning in an SE(3)-equivariant structure module.

The Evoformer Module

The core of AlphaFold2 is the Evoformer, a stack of 48 blocks that jointly processes the MSA and pair representations. Row-wise attention over the MSA captures correlations between residue positions within each sequence:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + b\right)V$$

where the bias term $b$ incorporates pair representation information, allowing structural knowledge to guide MSA processing. Triangle attention updates encourage geometric consistency: each pair feature $z_{ij}$ is updated from the third edges $z_{ik}$ and $z_{kj}$, biasing the network toward distance predictions that satisfy the triangle inequality:

$$z_{ij} \leftarrow z_{ij} + \sum_k \text{TriangleAttn}(z_{ik}, z_{kj}, z_{ij})$$
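To make the biased attention concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention with an additive pair bias. The dimensions and random inputs are illustrative, not AlphaFold2's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pair_biased_attention(Q, K, V, bias):
    """Scaled dot-product attention with an additive pair bias.

    Q, K, V: (n, d_k) query/key/value matrices for one MSA row.
    bias:    (n, n) bias derived from the pair representation z_ij.
    """
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + bias   # pair info steers attention
    return softmax(logits, axis=-1) @ V

rng = np.random.default_rng(0)
n, d_k = 5, 8
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
bias = rng.standard_normal((n, n))
out = pair_biased_attention(Q, K, V, bias)
print(out.shape)  # (5, 8)
```

The bias is added to the attention logits before the softmax, so a strong pair feature between positions $i$ and $j$ directly increases how much row position $i$ attends to $j$.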

Structure Module & IPA

The Structure Module converts pair representations into 3D coordinates using Invariant Point Attention (IPA), which operates on SE(3)-equivariant frames attached to each residue. The loss function uses Frame Aligned Point Error (FAPE):

$$\mathcal{L}_{\text{FAPE}} = \frac{1}{N^2} \sum_{i,j} \left\| T_i^{-1} \circ \mathbf{x}_j^{\text{pred}} - T_i^{-1} \circ \mathbf{x}_j^{\text{true}} \right\|$$
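A toy, unclamped version of FAPE can be sketched by representing each frame $T_i$ as a rotation-translation pair $(R_i, t_i)$, so that $T_i^{-1} \circ \mathbf{x} = R_i^T(\mathbf{x} - t_i)$ expresses a point in residue $i$'s local frame. The published loss additionally clamps large errors; this sketch omits that detail.

```python
import numpy as np

def fape(frames, x_pred, x_true):
    """Unclamped Frame Aligned Point Error (simplified sketch).

    frames: list of (R, t) with R a 3x3 rotation, t a 3-vector, one per residue.
    x_pred, x_true: (N, 3) predicted and true atom positions.
    """
    errs = []
    for R, t in frames:
        local_pred = (x_pred - t) @ R   # each row is R^T (x_j - t)
        local_true = (x_true - t) @ R
        errs.append(np.linalg.norm(local_pred - local_true, axis=-1))
    return float(np.mean(errs))        # average over frames i and points j

# identical structures give zero error
N = 4
frames = [(np.eye(3), np.zeros(3)) for _ in range(N)]
x = np.arange(N * 3, dtype=float).reshape(N, 3)
print(fape(frames, x, x))  # 0.0
```

Because every point is measured in every local frame, the loss is invariant to a global rigid transform applied to the prediction together with its frames, which is exactly the symmetry a structure loss should have.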

2. Protein Language Models

Protein language models treat amino acid sequences as "sentences" in a biological language, learning grammar (biochemical rules) and semantics (structural and functional properties) from millions of evolved sequences.

ESM-2 (Meta AI)

- Up to 15 billion parameters in the largest model
- Trained on ~65 million UniRef sequences
- Masked language modeling objective (15% of tokens masked)
- Attention maps encode residue contact information
- Zero-shot structure prediction capability
- Embeddings capture evolutionary conservation

The prediction head maps the final-layer embedding $h_i^{(L)}$ to a distribution over amino acids:

$$P(\text{aa}_i | \text{context}) = \text{softmax}(W \cdot h_i^{(L)} + b)$$

ProtTrans Family

- ProtBERT, ProtAlbert, ProtXLNet, ProtT5
- Trained on UniRef and BFD (>2B sequences)
- Transfer learning for downstream tasks
- Secondary structure prediction (~82% Q3)
- Subcellular localization prediction
- Function annotation from embeddings

The BERT-style members of the family are pretrained with a masked language modeling loss over the masked positions:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \text{masked}} \log P(\text{aa}_i | \mathbf{x}_{\backslash i})$$
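As a sketch of how the prediction head and the MLM loss fit together, the following toy NumPy code computes the per-position amino-acid distribution and sums the negative log-likelihood over masked positions. The dimensions and weights are illustrative, not taken from any released model.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def mlm_loss(h, W, b, targets, masked_idx):
    """Masked language-model loss over 20 amino-acid classes (sketch).

    h: (L, d) final-layer embeddings; W: (20, d); b: (20,).
    targets: (L,) integer amino-acid labels; masked_idx: masked positions.
    """
    probs = softmax(h @ W.T + b)                      # P(aa_i | context)
    nll = -np.log(probs[masked_idx, targets[masked_idx]])
    return float(nll.sum())

rng = np.random.default_rng(1)
L, d = 10, 16
h = rng.standard_normal((L, d))
W, b = rng.standard_normal((20, d)), np.zeros(20)
targets = rng.integers(0, 20, size=L)
masked = np.array([2, 5, 7])                          # 15%-style masking
print(mlm_loss(h, W, b, targets, masked) > 0)  # True
```

Only the masked positions contribute to the loss, which forces the model to reconstruct hidden residues from their sequence context, the mechanism by which evolutionary constraints are absorbed into the embeddings.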

3. Generative Models for Protein Design

Diffusion models have emerged as powerful tools for de novo protein design, generating novel backbone structures that can then be sequence-designed to fold into the intended shape and perform specified functions.

RFDiffusion

RFDiffusion (Watson et al., Nature 2023) applies denoising diffusion to protein backbone generation, using a modified RoseTTAFold architecture as the denoising network. The forward diffusion process adds noise to protein coordinates:

$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$$

The reverse (denoising) process learns to reconstruct protein structure:

$$p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$$
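A minimal NumPy sketch of the forward (noising) process above follows. The linear noise schedule is an illustrative assumption, and RFDiffusion actually diffuses rigid residue frames rather than bare coordinates; this sketch noises plain 3D points.

```python
import numpy as np

def forward_diffuse(x0, betas, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) step by step (DDPM forward process).

    x0: (N, 3) clean coordinates; betas: noise schedule beta_1..beta_T.
    """
    x = x0.copy()
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 3))          # toy "backbone" coordinates
betas = np.linspace(1e-4, 0.05, 100)      # linear schedule (assumption)
xT = forward_diffuse(x0, betas, rng)
# after many steps x_T approaches an isotropic Gaussian
print(xT.shape)  # (8, 3)
```

The denoising network is trained to invert these steps; at sampling time, starting from pure Gaussian noise and repeatedly applying the learned reverse kernel $p_\theta$ yields a novel backbone.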

Graph Neural Networks for Molecular Properties

GNNs represent molecules as graphs where atoms are nodes and bonds are edges. Message passing neural networks (MPNNs) update node features through iterative aggregation:

$$\mathbf{h}_v^{(k+1)} = \text{UPDATE}^{(k)}\left(\mathbf{h}_v^{(k)}, \sum_{u \in \mathcal{N}(v)} \text{MSG}^{(k)}(\mathbf{h}_v^{(k)}, \mathbf{h}_u^{(k)}, \mathbf{e}_{vu})\right)$$

Applications include predicting binding affinities, solubility, toxicity, and reaction outcomes. SchNet, DimeNet, and PaiNN incorporate 3D geometric information for molecular property prediction.
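The update rule above can be sketched with sum aggregation and toy MSG/UPDATE functions. The weight shapes and tanh nonlinearities are illustrative choices, not taken from any specific MPNN architecture.

```python
import numpy as np

def mpnn_layer(h, edges, edge_feats, W_msg, W_upd):
    """One sum-aggregation message-passing step (sketch).

    h: (n, d) node (atom) features; edges: list of directed (v, u) pairs;
    edge_feats: dict (v, u) -> (d_e,) bond features.
    MSG concatenates [h_v, h_u, e_vu]; UPDATE concatenates [h_v, m_v].
    """
    n, d = h.shape
    m = np.zeros((n, d))
    for v, u in edges:
        inp = np.concatenate([h[v], h[u], edge_feats[(v, u)]])
        m[v] += np.tanh(W_msg @ inp)          # aggregate messages by sum
    return np.tanh(W_upd @ np.concatenate([h, m], axis=1).T).T

rng = np.random.default_rng(0)
n, d, d_e = 3, 4, 2
h = rng.standard_normal((n, d))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]      # undirected bonds, both directions
edge_feats = {e: rng.standard_normal(d_e) for e in edges}
W_msg = rng.standard_normal((d, 2 * d + d_e))
W_upd = rng.standard_normal((d, 2 * d))
h_new = mpnn_layer(h, edges, edge_feats, W_msg, W_upd)
print(h_new.shape)  # (3, 4)
```

Stacking $k$ such layers lets each atom's representation depend on its $k$-hop chemical neighborhood, after which a readout over all nodes predicts the molecular property.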

4. ML-Enhanced Molecular Dynamics

Machine learning potentials bridge the accuracy of quantum mechanics with the speed of classical force fields. Neural network potentials (NNPs) learn the potential energy surface from quantum chemical data.

Neural Network Potentials

The total energy is decomposed into atomic contributions:

$$E = \sum_i E_i(\{\mathbf{r}_j : j \in \mathcal{N}_i\})$$

ANI, SchNet, NequIP, and MACE achieve DFT-level accuracy at a fraction of the cost.
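The key property of this decomposition is locality: each atomic energy depends only on neighbors inside a cutoff, which is what makes NNPs linear-scaling. The `atomic_energy` function below is a toy stand-in for a trained network, chosen purely for illustration.

```python
import numpy as np

def total_energy(positions, cutoff, atomic_energy):
    """Sum of atomic contributions E = sum_i E_i(local environment).

    positions: (n, 3) Cartesian coordinates.
    atomic_energy: maps an array of neighbour distances to a scalar E_i.
    """
    E = 0.0
    for i in range(len(positions)):
        d = np.linalg.norm(positions - positions[i], axis=1)
        neighbours = d[(d > 0) & (d < cutoff)]   # local environment of atom i
        E += atomic_energy(neighbours)
    return E

# toy "network": a smooth pair term summed over the neighbourhood
toy = lambda dists: float(np.sum(np.exp(-dists)))
pos = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]])
print(total_energy(pos, cutoff=3.0, atomic_energy=toy))
```

In a real NNP the neighbor distances (or full local geometries) are first encoded into symmetry functions or equivariant features before being fed to the network; the summation structure is the same.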

Enhanced Sampling with ML

ML identifies collective variables for enhanced sampling:

$$\xi = f_\theta(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$$

Autoencoders and variational approaches discover slow degrees of freedom automatically.
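Under a strong simplifying assumption (a linear encoder/decoder with squared reconstruction loss), the optimal one-dimensional bottleneck reduces to the top principal component of the coordinate fluctuations, so a PCA-style sketch illustrates the idea of discovering a slow collective variable from trajectory data:

```python
import numpy as np

# Toy "trajectory": one slow degree of freedom buried in fast noise.
rng = np.random.default_rng(0)
slow = rng.standard_normal(500)                 # slow mode amplitude per frame
fast = 0.1 * rng.standard_normal((500, 3))      # fast thermal noise
X = np.outer(slow, [1.0, 0.5, 0.0]) + fast      # (frames, coordinates)

# Linear-autoencoder optimum = top principal component of the fluctuations.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
w = eigvecs[:, -1]                              # top PC = learned CV direction
xi = Xc @ w                                     # xi = f(r): the collective variable

# The discovered CV should track the hidden slow mode almost perfectly.
print(abs(np.corrcoef(xi, slow)[0, 1]) > 0.9)
```

Nonlinear autoencoders and variational approaches (e.g. time-lagged objectives) generalize this by replacing the linear projection with a neural network $f_\theta$, recovering slow modes that no single linear direction can capture.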

Python Simulation: Secondary Structure Prediction

This simulation implements a simplified Chou-Fasman-like secondary structure prediction algorithm using amino acid propensities. It also provides an overview of AlphaFold2 architecture and protein language model approaches, with sequence feature analysis including hydrophobicity profiling.

Protein Secondary Structure Prediction & Feature Analysis

Python

Chou-Fasman propensity-based structure prediction with hydrophobicity profiling

script.py (179 lines)
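The full script is not reproduced on this page; a minimal sketch of the propensity-based idea, using a small subset of the published Chou-Fasman propensities, might look like the following. The window size and decision thresholds are illustrative choices.

```python
# Chou-Fasman-style helix/sheet propensities for a subset of residues
# (values from the published tables; unknown residues default to 1.0).
HELIX = {'A': 1.42, 'L': 1.21, 'E': 1.51, 'M': 1.45, 'G': 0.57,
         'P': 0.57, 'K': 1.16, 'V': 1.06, 'S': 0.77, 'T': 0.83}
SHEET = {'A': 0.83, 'L': 1.30, 'E': 0.37, 'M': 1.05, 'G': 0.75,
         'P': 0.55, 'K': 0.74, 'V': 1.70, 'S': 0.75, 'T': 1.19}

def predict_ss(seq, window=5):
    """Assign H (helix), E (sheet), or C (coil) by windowed propensity averages."""
    half = window // 2
    states = []
    for i in range(len(seq)):
        win = seq[max(0, i - half): i + half + 1]
        p_h = sum(HELIX.get(aa, 1.0) for aa in win) / len(win)
        p_e = sum(SHEET.get(aa, 1.0) for aa in win) / len(win)
        if p_h > 1.03 and p_h >= p_e:
            states.append('H')       # helix former dominates the window
        elif p_e > 1.03:
            states.append('E')       # sheet former dominates
        else:
            states.append('C')       # neither: coil
    return ''.join(states)

ss = predict_ss("MAELKEKLAVVVTTS")
print(ss)
```

Propensity methods like this reach roughly 50-60% per-residue accuracy, which is why the deep learning methods discussed above (reaching ~82% Q3 and beyond) displaced them.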

Fortran Computation: Gradient Descent on Lennard-Jones

This Fortran program demonstrates numerical gradient descent optimization applied to finding the equilibrium distance of the Lennard-Jones potential — a fundamental building block in molecular mechanics force fields used by ML-enhanced simulations.

Gradient Descent Optimizer for Lennard-Jones Potential

Fortran

Numerical optimization finding the equilibrium separation in a Lennard-Jones pair potential

gradient_descent_lj.f90 (114 lines)
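The Fortran source is not shown on this page; an equivalent sketch in Python, using the same central-difference numerical gradient idea, follows. For $V(r) = 4\epsilon[(\sigma/r)^{12} - (\sigma/r)^6]$ the analytic equilibrium is $r = 2^{1/6}\sigma \approx 1.1225\,\sigma$, which the optimizer should recover; the starting point, learning rate, and step count are illustrative.

```python
def lj(r, eps=1.0, sigma=1.0):
    """Lennard-Jones pair potential V(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    s = sigma / r
    return 4.0 * eps * (s**12 - s**6)

def minimize_lj(r0=1.5, lr=0.01, h=1e-5, steps=2000):
    """Gradient descent using a central-difference numerical derivative."""
    r = r0
    for _ in range(steps):
        grad = (lj(r + h) - lj(r - h)) / (2.0 * h)  # dV/dr, no analytic form used
        r -= lr * grad                              # step downhill
    return r

r_min = minimize_lj()
print(round(r_min, 4))   # 1.1225, matching the analytic 2^(1/6)
```

The central difference makes the method derivative-free with respect to the potential, the same trick that lets force fields or ML potentials without analytic gradients be optimized numerically.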

Video Lectures

AlphaFold and the Future of Biology

DeepMind's presentation on AlphaFold2, its architecture, training methodology, and implications for structural biology and drug discovery.

Machine Learning for Molecular Science

Overview of machine learning approaches in molecular science, including neural network potentials, generative models, and property prediction.

Key Concepts Summary

AlphaFold2

Evoformer + Structure Module achieves near-experimental accuracy in protein structure prediction

Protein Language Models

ESM-2 and ProtTrans learn structural and functional embeddings from evolutionary sequences

Diffusion Models

RFDiffusion generates novel protein backbones through learned denoising of 3D coordinates

ML Potentials

Neural network potentials achieve quantum accuracy at classical speed for molecular simulations