Module 2

Substitution Matrices & MSA

PAM (Dayhoff 1978) and BLOSUM (Henikoff 1992) substitution matrices encode the amino-acid replacement frequencies observed in aligned proteins. Multiple sequence alignment (MSA) extends pairwise methods to entire families, underpinning profile HMMs (HMMER), phylogenetics, and homology modelling.

1. Log-Odds Score Matrices

For amino acids i, j with background frequencies p_i, p_j and observed joint frequency q_ij:

\[ s(i, j) \;=\; \frac{1}{\lambda}\,\log\!\frac{q_{ij}}{p_i\,p_j} \]

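Positive scores mark substitutions observed more often than expected by chance; negative scores mark substitutions observed less often. The scale factor λ fixes the units in which the rounded integer scores are reported: BLOSUM62 is distributed in half-bit units (s = 2 log₂ of the odds ratio), while Dayhoff’s PAM scores use 10 log₁₀, roughly third-bit units. BLOSUM62 is the de-facto default for protein alignment; BLOSUM45 is tuned to more distant homologs and BLOSUM80 to closer ones.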
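As a rough illustration of the formula above, the sketch below turns frequencies into a rounded half-bit score. The function name `log_odds_half_bits` and all frequency values are hypothetical, chosen so the arithmetic is easy to follow; they are not taken from real BLOSUM counts.

```python
import math

def log_odds_half_bits(q_ij, p_i, p_j):
    """Rounded log-odds score in half-bit units: s = 2 * log2(q_ij / (p_i * p_j)).

    Equivalent to s = (1/lambda) * ln(...) with lambda = ln(2) / 2.
    """
    odds = q_ij / (p_i * p_j)
    return round(2 * math.log2(odds))

# Hypothetical frequencies (not real BLOSUM counts)
p_i, p_j = 0.1, 0.1  # background frequencies; chance pairing = 0.01
print(log_odds_half_bits(0.02, p_i, p_j))    # pair seen twice as often as chance -> 2
print(log_odds_half_bits(0.0025, p_i, p_j))  # pair seen four times less often than chance -> -4
```

A doubling of the odds ratio adds two half-bits to the score, which is why published integer matrices can be summed along an alignment and still read as a log-likelihood ratio.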

2. PAM vs BLOSUM

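Dayhoff’s PAM matrices are estimated from closely related proteins and extrapolated to larger distances: PAM1 corresponds to one accepted point mutation per 100 residues, and PAMn (PAM40, PAM120, and so on) is obtained by raising the PAM1 mutation probability matrix to the n-th power. BLOSUM matrices are instead counted directly from ungapped alignment blocks. For BLOSUM62, sequences within a block that are ≥62% identical are clustered and weighted as a single sequence, so the counted substitutions come from pairs below that identity threshold. In practice BLOSUM matrices tend to perform better in the 30–70% identity range and are the standard choice.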
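The extrapolation by matrix multiplication can be sketched on a toy example. Everything below is an assumption made for illustration: a hypothetical 3-letter alphabet, a made-up one-step matrix `M1` (not Dayhoff's estimates, and not calibrated to exactly 1% change), and the helper names `pam` and `log_odds`.

```python
import numpy as np

# Hypothetical 3-letter alphabet with a toy one-step mutation probability matrix.
# Row i gives P(i -> j); values are illustrative, not Dayhoff's estimates.
M1 = np.array([
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
])
# Background frequencies; uniform is stationary for this symmetric toy matrix.
p = np.full(3, 1 / 3)

def pam(n):
    """Mutation probabilities after n steps: the n-th power of the one-step matrix."""
    return np.linalg.matrix_power(M1, n)

def log_odds(n):
    """Dayhoff-style scores, 10 * log10(M^n[i, j] / p[j]), rounded to integers."""
    return np.rint(10 * np.log10(pam(n) / p)).astype(int)

for n in (1, 40, 120):
    print(f"PAM{n}: P(residue unchanged) = {np.diag(pam(n)).round(3)}")
print(log_odds(120))
```

The diagonal shrinks as n grows, which is the sense in which higher-numbered PAM matrices model more divergent sequences.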

Simulation: BLOSUM62 Visualisation

Python


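import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: render to a file, no display needed
import matplotlib.pyplot as plt

# BLOSUM62 subset: the first six residues in the standard matrix ordering
aa = ['A','R','N','D','C','Q']
# Integer BLOSUM62 scores (half-bit units) for these residues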
bl = np.array([
    [ 4,-1,-2,-2, 0,-1],
    [-1, 5, 0,-2,-3, 1],
    [-2, 0, 6, 1,-3, 0],
    [-2,-2, 1, 6,-3, 0],
    [ 0,-3,-3,-3, 9,-3],
    [-1, 1, 0, 0,-3, 5],
])

fig, ax = plt.subplots(figsize=(8, 7), facecolor='#0a0a1a')
ax.set_facecolor('#111827'); ax.tick_params(colors='#cbd5e1')
for s in ax.spines.values(): s.set_color('#334155')
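# Symmetric limits keep a score of 0 at the neutral midpoint of the diverging colormap
im = ax.imshow(bl, cmap='RdBu_r', vmin=-9, vmax=9)
ax.set_xticks(range(len(aa))); ax.set_xticklabels(aa)
ax.set_yticks(range(len(aa))); ax.set_yticklabels(aa)
# Annotate each cell: light text on saturated cells, dark text on pale ones
for i in range(len(aa)):
    for j in range(len(aa)):
        ax.text(j, i, str(bl[i, j]), ha='center', va='center',
                color='#fde68a' if abs(bl[i, j]) >= 5 else '#0a0a1a', fontweight='bold')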
ax.set_title('BLOSUM62 (subset): log-odds substitution scores',
             color='#5eead4', fontweight='bold')
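cbar = plt.colorbar(im, ax=ax, fraction=0.04)
cbar.ax.tick_params(colors='#cbd5e1')  # keep colorbar tick labels legible on the dark background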
plt.tight_layout()
plt.savefig('output.png', dpi=120, bbox_inches='tight', facecolor='#0a0a1a')
print('BLOSUM62: s(i,j) = (1/lambda) log(q_ij / (p_i * p_j))')
print('Positive score = substitution observed more than chance')


3. Multiple Sequence Alignment

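MSA extends pairwise dynamic programming to multiple sequences. Exact MSA under sum-of-pairs scoring is NP-hard, so practical tools rely on heuristics: progressive alignment along a guide tree (ClustalW), iterative refinement (MUSCLE, MAFFT), or consistency-based scoring (T-Coffee). Alignment quality depends on the guide-tree order, because errors made in early merges are carried forward, and on the column scoring scheme. Benchmarks such as BAliBASE and SABmark compare tools; MAFFT L-INS-i and Clustal Omega are common default choices.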
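One common way to score the columns of an alignment is the sum-of-pairs score, sketched below. The substitution values are the six-residue BLOSUM62 subset used in the script on this page; the gap penalty `GAP`, the toy alignment, and the function names are assumptions for illustration only.

```python
from itertools import combinations

AA = "ARNDCQ"
SCORES = [
    [ 4, -1, -2, -2,  0, -1],
    [-1,  5,  0, -2, -3,  1],
    [-2,  0,  6,  1, -3,  0],
    [-2, -2,  1,  6, -3,  0],
    [ 0, -3, -3, -3,  9, -3],
    [-1,  1,  0,  0, -3,  5],
]
IDX = {a: i for i, a in enumerate(AA)}
GAP = -4  # hypothetical penalty for a residue aligned to a gap

def pair_score(a, b):
    """Score one aligned pair; gap-gap pairs contribute nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return SCORES[IDX[a]][IDX[b]]

def sum_of_pairs(msa):
    """Sum pairwise scores over every pair of rows in every column."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_score(a, b)
        for column in zip(*msa)
        for a, b in combinations(column, 2)
    )

msa = ["ARND", "ARQD", "A-ND"]  # toy alignment (hypothetical)
print(sum_of_pairs(msa))  # 12 + (-3) + 6 + 18 = 33
```

Heuristic aligners do not enumerate alignments to maximise this score; it serves here only as an objective against which candidate alignments can be compared.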

4. Profile HMMs & HMMER

Once aligned, a family can be summarised as a profile HMM: per-column emission probabilities plus match, insert, and delete transition probabilities that together encode position-specific conservation and gap propensity. HMMER (Eddy 2011) builds profile HMMs and searches databases for family members, with greater sensitivity than BLAST for distant homology. Pfam (now hosted by InterPro) is the canonical protein-family database built on profile HMMs.
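The emission part of a profile can be sketched as per-column residue frequencies with pseudocounts. This is a deliberately simplified illustration, not HMMER's procedure: it uses a reduced six-letter alphabet, a flat pseudocount, a uniform background, and no insert or delete states, and the names `column_emissions` and `log_odds_score` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ARNDCQ"  # reduced alphabet for illustration

def column_emissions(msa, pseudocount=1.0):
    """Per-column emission probabilities with flat pseudocounts; gaps are ignored."""
    profile = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile

def log_odds_score(seq, profile, background=None):
    """Sum of log2(emission / background) over positions (match states only)."""
    bg = background or {a: 1 / len(ALPHABET) for a in ALPHABET}
    return sum(math.log2(col[a] / bg[a]) for a, col in zip(seq, profile))

msa = ["ARND", "ARQD", "A-ND"]  # toy alignment (hypothetical)
profile = column_emissions(msa)
print(round(log_odds_score("ARND", profile), 2))  # matches the conserved columns
print(round(log_odds_score("CCCC", profile), 2))  # residue never seen in the family
```

A sequence that matches the conserved columns scores higher than one that does not, which is the position-specific behaviour a single substitution matrix cannot express.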

Key References

• Dayhoff, M. O. et al. (1978). “A model of evolutionary change in proteins.” In Atlas of Protein Sequence and Structure, 5.

• Henikoff, S. & Henikoff, J. G. (1992). “Amino acid substitution matrices from protein blocks.” Proc. Natl. Acad. Sci., 89, 10915–10919.

• Katoh, K. & Standley, D. M. (2013). “MAFFT multiple sequence alignment software v7.” Mol. Biol. Evol., 30, 772–780.

• Eddy, S. R. (2011). “Accelerated profile HMM searches.” PLOS Comput. Biol., 7, e1002195.
