Module 2
Substitution Matrices & MSA
PAM (Dayhoff 1978) and BLOSUM (Henikoff & Henikoff 1992) substitution matrices encode, as log-odds scores, the amino-acid replacement frequencies observed in aligned proteins. Multiple sequence alignment (MSA) extends pairwise methods to entire families, underpinning profile HMMs (HMMER), phylogenetics, and homology modelling.
1. Log-Odds Score Matrices
For amino acids \(i\), \(j\) with background frequencies \(p_i\), \(p_j\) and observed joint frequency \(q_{ij}\):
\[ s(i, j) \;=\; \frac{1}{\lambda}\,\log\!\frac{q_{ij}}{p_i\,p_j} \]
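Positive scores indicate substitutions that occur more often than expected by chance; negative scores indicate substitutions that occur less often. The scaling factor λ fixes the unit of the scores, which are then rounded to integers: BLOSUM62 is conventionally distributed in half-bit units, while some other matrices use third-bit units. BLOSUM62 is the de facto default for protein alignment; BLOSUM45 is tuned to more distant homologues and BLOSUM80 to closer ones.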
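As a minimal sketch of the formula above, the function below computes a rounded log-odds score in half-bits or third-bits. The function name `log_odds_score`, the `units_per_bit` parameter, and the frequencies are illustrative assumptions, not values from any published matrix; taking the logarithm as natural, the unit choice corresponds to λ = ln 2 / `units_per_bit`.

```python
import math

def log_odds_score(q_ij, p_i, p_j, units_per_bit=2):
    """Return the rounded log-odds score s(i, j).

    units_per_bit=2 gives half-bits and units_per_bit=3 gives third-bits,
    i.e. lambda = ln(2) / units_per_bit when the formula uses a natural log.
    """
    if min(q_ij, p_i, p_j) <= 0:
        raise ValueError("frequencies must be positive")
    return round(units_per_bit * math.log2(q_ij / (p_i * p_j)))

# Hypothetical frequencies, chosen only to keep the arithmetic easy to follow.
enriched = log_odds_score(q_ij=0.04, p_i=0.1, p_j=0.1)    # 4x more often than chance
neutral = log_odds_score(q_ij=0.01, p_i=0.1, p_j=0.1)     # exactly as often as chance
depleted = log_odds_score(q_ij=0.0025, p_i=0.1, p_j=0.1)  # 4x less often than chance
print(enriched, neutral, depleted)  # 4 0 -4
```

A pair observed four times more often than chance gets log2(4) = 2 bits, which is 4 in half-bit units; the depleted pair gets the mirror-image negative score.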
2. PAM vs BLOSUM
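Dayhoff’s PAM matrices are estimated from closely related proteins and extrapolated to greater divergence: PAM1 models one accepted point mutation per 100 residues, and the PAM40, PAM120, and higher members of the series are obtained by raising the PAM1 mutation probability matrix to the corresponding power. BLOSUM matrices are instead counted directly from ungapped alignment blocks. For BLOSUM62, sequences within a block that are ≥62% identical are clustered and weighted as a single sequence, so the matrix reflects substitutions between more divergent sequences. In practice BLOSUM is generally reported to perform better in the 30–70% identity range and is the standard choice.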
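The extrapolation by matrix multiplication can be sketched with a toy example. The three-letter alphabet and the probabilities in `TOY_PAM1` are hypothetical, picked only to resemble a mostly-identity one-step mutation matrix; the real PAM1 matrix is 20×20 and estimated from data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result

# Hypothetical one-step mutation probabilities; row i is P(replacement | original i)
# and sums to 1. These are NOT Dayhoff's values.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

for n in (1, 40, 120):
    pam_n = mat_power(TOY_PAM1, n)
    # The diagonal is the probability that a residue is unchanged after n steps.
    print(n, [round(pam_n[i][i], 3) for i in range(3)])
```

The diagonal shrinks as n grows, which is how a single matrix fitted to close homologues is stretched to model distant ones. A log-odds scoring matrix would then be derived from each powered matrix and the background frequencies.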
Simulation: BLOSUM62 Visualisation
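The page's original simulation code is not reproduced here. As a stand-in, the sketch below renders a coarse text heatmap for an eight-residue excerpt of BLOSUM62. The scores are transcribed from the standard BLOSUM62 table; the choice of residues, the helper names, and the shading thresholds are assumptions made for illustration. For real work the full matrix would normally be loaded from a library such as Biopython rather than typed in.

```python
# Excerpt of BLOSUM62 (half-bit scores) for eight residues, upper triangle only.
RESIDUES = "ILVDEKRW"
UPPER = {
    "I": [4, 2, 3, -3, -3, -3, -3, -3],
    "L": [4, 1, -4, -3, -2, -2, -2],
    "V": [4, -3, -2, -2, -3, -3],
    "D": [6, 2, -1, -2, -4],
    "E": [5, 1, 0, -3],
    "K": [5, 2, -3],
    "R": [5, -3],
    "W": [11],
}

def build_matrix(residues, upper):
    """Expand upper-triangle rows into a symmetric dict keyed by residue pairs."""
    scores = {}
    for i, a in enumerate(residues):
        for offset, value in enumerate(upper[a]):
            b = residues[i + offset]
            scores[a, b] = scores[b, a] = value
    return scores

def shade(score):
    """Map a score to a two-character shade; denser means more favourable."""
    if score >= 4:
        return "##"
    if score >= 1:
        return "++"
    if score >= -1:
        return ".."
    return "  "

def render(residues, scores):
    """Return the heatmap as text, one row per residue."""
    lines = ["   " + " ".join(f"{b:>2}" for b in residues)]
    for a in residues:
        lines.append(f"{a:>2} " + " ".join(shade(scores[a, b]) for b in residues))
    return "\n".join(lines)

BLOSUM62_EXCERPT = build_matrix(RESIDUES, UPPER)
print(render(RESIDUES, BLOSUM62_EXCERPT))
```

The ordering groups hydrophobic (I, L, V), acidic (D, E), and basic (K, R) residues, so favourable substitutions appear as dense blocks near the diagonal, and the rare, bulky tryptophan stands out with the highest self-score.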
3. Multiple Sequence Alignment
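MSA generalises pairwise dynamic programming (DP) to many sequences. Exact MSA under sum-of-pairs scoring is NP-hard, and direct DP has a cost that grows exponentially with the number of sequences, so practical tools use heuristics: progressive (ClustalW, Clustal Omega), iterative refinement (MUSCLE, MAFFT), or consistency-based (T-Coffee, ProbCons). In progressive methods, alignment quality depends on the guide-tree order and on the column scoring scheme, because errors made in early merges are not revisited. Benchmarks such as BAliBASE and SABmark compare tools; MAFFT L-INS-i and Clustal Omega are common default choices.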
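Column scoring can be illustrated with the sum-of-pairs objective, which adds up a pairwise score for every pair of rows in every column. The sketch below assumes a toy scheme (match +1, mismatch -1, residue-gap -2, gap-gap 0); the function names and penalties are illustrative, and real tools use substitution matrices and affine gap costs instead.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols under a toy scheme."""
    if a == "-" and b == "-":
        return 0  # two gaps carry no information about each other
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment):
    """Sum pair_score over all row pairs in all columns of the alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(
        pair_score(a, b)
        for column in zip(*alignment)
        for a, b in combinations(column, 2)
    )

# Two candidate alignments of the same three toy sequences.
good = ["ACGT", "ACGT", "A-GT"]
poor = ["ACGT", "ACGT", "AG-T"]
print(sum_of_pairs(good), sum_of_pairs(poor))  # 6 2
```

Shifting the gap by one column turns a fully conserved G column into a mismatch column plus a gapped column, which drops the score from 6 to 2. Heuristic aligners search for the arrangement that maximises objectives of this kind.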
4. Profile HMMs & HMMER
Once aligned, a family can be summarised as a profile HMM: match, insert, and delete states for each consensus column, with emission and transition probabilities that encode position-specific conservation and gap propensity. HMMER (Eddy 2011) builds profile HMMs and searches sequence databases for family members, typically with greater sensitivity than BLAST for distant homologues. Pfam (now hosted within InterPro) is the canonical protein-family database built on profile HMMs.
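The idea of position-specific emissions can be sketched with a deliberately simplified profile: match-state emission probabilities only, estimated per column with a flat pseudocount and scored against a uniform background. This is not HMMER's method (it omits insert and delete states, transition probabilities, sequence weighting, and mixture priors); the toy family and all names below are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def column_emissions(msa, pseudocount=1.0):
    """Estimate per-column residue probabilities from an MSA, ignoring gaps."""
    profile = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in ALPHABET})
    return profile

def log_odds(sequence, profile, background=1 / len(ALPHABET)):
    """Score an ungapped sequence against the profile in bits (uniform background)."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must equal the number of profile columns")
    return sum(math.log2(col[aa] / background) for aa, col in zip(sequence, profile))

# Hypothetical five-column family alignment.
toy_family = ["MKVLA", "MKILA", "MRVLG", "MK-LA"]
profile = column_emissions(toy_family)

member_score = log_odds("MKVLA", profile)
unrelated_score = log_odds("WWWWW", profile)
print(round(member_score, 2), round(unrelated_score, 2))
```

A sequence that matches the conserved columns scores positive, while an unrelated sequence scores negative. A full profile HMM adds the states and transitions needed to score sequences of different lengths.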
Key References
• Dayhoff, M. O. et al. (1978). “A model of evolutionary change in proteins.” In Atlas of Protein Sequence and Structure, 5.
• Henikoff, S. & Henikoff, J. G. (1992). “Amino acid substitution matrices from protein blocks.” Proc. Natl. Acad. Sci., 89, 10915–10919.
• Katoh, K. & Standley, D. M. (2013). “MAFFT multiple sequence alignment software v7.” Mol. Biol. Evol., 30, 772–780.
• Eddy, S. R. (2011). “Accelerated profile HMM searches.” PLOS Comput. Biol., 7, e1002195.