Module 2

Substitution Matrices & MSA

PAM (Dayhoff 1978) and BLOSUM (Henikoff 1992) substitution matrices encode the amino-acid replacement frequencies observed in aligned proteins. Multiple sequence alignment (MSA) extends pairwise methods to entire families, underpinning profile HMMs (HMMER), phylogenetics, and homology modelling.

1. Log-Odds Score Matrices

For amino acids \(i\), \(j\) with background frequencies \(p_i\), \(p_j\) and observed joint frequency \(q_{ij}\):

\[ s(i, j) \;=\; \frac{1}{\lambda}\,\log\!\frac{q_{ij}}{p_i\,p_j} \]

Positive scores indicate substitutions observed more often than expected by chance; negative scores indicate substitutions observed less often. The scaling factor λ fixes the unit of the scores, typically half-bits or third-bits, and published matrices round the result to integers. BLOSUM62 is the de facto default for protein alignment; BLOSUM45 is tuned for distant homologues and BLOSUM80 for close ones.
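As a minimal sketch of the formula above, the snippet below computes log-odds scores in half-bit units. It uses a base-2 logarithm multiplied by 2, which is equivalent to taking the natural logarithm with λ = ln 2 / 2. The function name and the frequencies are hypothetical values chosen for illustration; they are not taken from any published matrix.

```python
import math

def log_odds_score(q_ij, p_i, p_j, units_per_bit=2):
    """Log-odds score in 1/units_per_bit bit units (2 gives half-bits, 3 third-bits)."""
    return units_per_bit * math.log2(q_ij / (p_i * p_j))

# Hypothetical frequencies for illustration only (not real BLOSUM or PAM data).
p_a, p_b = 0.10, 0.05    # background frequencies of two residues
q_same = 0.02            # observed twice as often as chance (0.10 * 0.10 = 0.01)
q_diff = 0.0025          # observed half as often as chance (0.10 * 0.05 = 0.005)

score_same = log_odds_score(q_same, p_a, p_a)   # odds ratio 2   -> about +2 half-bits
score_diff = log_odds_score(q_diff, p_a, p_b)   # odds ratio 0.5 -> about -2 half-bits

# Published matrices store rounded integers.
print(round(score_same), round(score_diff))
```

A pair seen exactly as often as chance predicts has an odds ratio of 1 and therefore a score of 0, whatever the unit.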

2. PAM vs BLOSUM

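Dayhoff’s PAM matrices are estimated from closely related proteins and extrapolated to larger evolutionary distances: the PAM1 mutation matrix is multiplied by itself n times to give the nested PAM40, PAM120, and higher series. BLOSUM matrices are instead counted directly from ungapped alignment blocks. For BLOSUM62, sequences within a block that are ≥62% identical are clustered and weighted as a single sequence, so the counted substitutions come from sequences that are less than 62% identical to each other. In practice BLOSUM matrices are generally reported to perform better for sequences in roughly the 30–70% identity range and are the standard choice.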
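The extrapolation step can be sketched with a toy example. The code below assumes a hypothetical three-letter alphabet and a made-up PAM1-like matrix in which each residue stays unchanged with probability 0.99; real PAM matrices are 20 × 20 and estimated from data. The helper names are illustrative.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Return m multiplied by itself so that the result is m to the n-th power (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result

# Hypothetical PAM1-like matrix: rows are probabilities and sum to 1,
# with 1% of residues replaced per step.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam40 = mat_power(toy_pam1, 40)
toy_pam120 = mat_power(toy_pam1, 120)

# The diagonal (probability of seeing the same residue) shrinks with distance.
for name, m in [("PAM1", toy_pam1), ("PAM40", toy_pam40), ("PAM120", toy_pam120)]:
    print(name, round(m[0][0], 3))
```

Because repeated and reverse substitutions at the same site are folded into the matrix power, the fraction of unchanged residues falls more slowly than 1% per step would naively suggest.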

3. Multiple Sequence Alignment

MSA extends dynamic programming (DP) from two sequences to many. Exact MSA under sum-of-pairs scoring is NP-hard, and the straightforward DP table grows exponentially with the number of sequences, so practical tools use progressive (ClustalW), iterative-refinement (MUSCLE, MAFFT), or consistency-based (T-Coffee) heuristics. Alignment quality depends on the guide-tree order and on how columns are scored. Benchmarks such as BAliBASE and SABmark compare tools; MAFFT L-INS-i and Clustal Omega are widely used defaults.
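One common way to score columns is the sum-of-pairs objective: every pair of rows in a column is scored and the results are added up. The sketch below assumes a simple match/mismatch/gap scheme with hypothetical values; in practice the match and mismatch terms would come from a substitution matrix such as BLOSUM62, and gap penalties are usually affine.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment (equal-length rows, '-' for gaps) column by column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue          # gap-gap pairs contribute nothing
            elif a == "-" or b == "-":
                total += gap      # residue aligned against a gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total

# Hypothetical three-sequence alignment for illustration.
toy_alignment = ["ACD-", "ACE-", "A-DK"]
print(sum_of_pairs(toy_alignment))   # column scores 3, -3, -1, -4 give -5
```

Scoring a given alignment is cheap; the hard part is searching over all possible alignments for the one that maximises this score, which is why the heuristics above are needed.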

4. Profile HMMs & HMMER

Once aligned, a family can be summarised as a profile HMM: per-column emission and transition probabilities that encode position-specific conservation. HMMER (Eddy 2011) builds profile HMMs and searches sequence databases for family members, and is more sensitive than BLAST for distant homology. Pfam (now hosted by InterPro) is the canonical protein-family database built on profile HMMs.
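The emission part of such a profile can be illustrated with a short sketch. It estimates per-column residue probabilities from an alignment using a flat pseudocount, under the simplifying assumptions that every column is a match state and gaps are ignored. This is not how HMMER works internally: real tools also model insert and delete states and use sequence weighting and more elaborate priors. The function name, alphabet, and alignment are hypothetical.

```python
from collections import Counter

def column_emissions(alignment, alphabet, pseudocount=1.0):
    """Per-column residue probabilities with a flat pseudocount; gaps are ignored."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile

# Hypothetical alignment over a reduced four-letter alphabet.
toy_alignment = ["ACD", "ACE", "A-D"]
toy_alphabet = "ACDE"

profile = column_emissions(toy_alignment, toy_alphabet)
for position, emissions in enumerate(profile, start=1):
    best = max(emissions, key=emissions.get)
    print(position, best, round(emissions[best], 3))
```

The pseudocount keeps unseen residues from receiving probability zero, so a family member with a rare substitution at a conserved position is penalised rather than excluded outright.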

Key References

• Dayhoff, M. O. et al. (1978). “A model of evolutionary change in proteins.” In Atlas of Protein Sequence and Structure, 5.

• Henikoff, S. & Henikoff, J. G. (1992). “Amino acid substitution matrices from protein blocks.” Proc. Natl. Acad. Sci., 89, 10915–10919.

• Katoh, K. & Standley, D. M. (2013). “MAFFT multiple sequence alignment software v7.” Mol. Biol. Evol., 30, 772–780.

• Eddy, S. R. (2011). “Accelerated profile HMM searches.” PLOS Comput. Biol., 7, e1002195.
