Module 1

Sequence Alignment & Dynamic Programming

Pairwise sequence alignment is the mathematical core of bioinformatics. This module presents Needleman-Wunsch global alignment, Smith-Waterman local alignment, affine gap penalties, BLAST seeding heuristics, and the Karlin-Altschul statistics that underpin E-values.

1. Needleman-Wunsch Global Alignment

Needleman & Wunsch 1970 introduced the dynamic-programming approach. For sequences s and t, fill matrix H:

\[ H_{i,j} = \max\begin{cases} H_{i-1,j-1} + \sigma(s_i, t_j) \\ H_{i-1,j} + g \\ H_{i,j-1} + g \end{cases} \]

with the score matrix σ (match/mismatch) and gap penalty g. Complexity is O(nm). Traceback from H_n,m recovers the optimal alignment. Hirschberg 1975 achieves linear space with divide-and-conquer at the same time complexity.

2. Smith-Waterman Local Alignment

Smith & Waterman 1981 modified the recurrence to allow H_i,j = 0, restarting alignment anywhere. The result is the highest-scoring local subsequence match — the natural analog for finding conserved domains in divergent proteins.

3. Affine Gaps & Gotoh

Biological gaps are bursty: inserting one base is cheap once the gap is opened. Gotoh 1982 extended DP to affine gap penalties — gap-opening cost plus gap-extension cost — by maintaining three matrices (match, I, D). This is the default scoring scheme in all modern aligners.

Simulation: Needleman-Wunsch DP Matrix

Python

script.py34 lines

import numpy as np, matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

# Needleman-Wunsch global alignment
s1 = 'GATTACA'
s2 = 'GCATGCU'
match, mismatch, gap = 1, -1, -1

n, m = len(s1), len(s2)
H = np.zeros((n+1, m+1), dtype=int)
for i in range(n+1): H[i, 0] = gap * i
for j in range(m+1): H[0, j] = gap * j
for i in range(1, n+1):
    for j in range(1, m+1):
        score = match if s1[i-1] == s2[j-1] else mismatch
        H[i, j] = max(H[i-1, j-1] + score, H[i-1, j] + gap, H[i, j-1] + gap)

fig, ax = plt.subplots(figsize=(9, 7), facecolor='#0a0a1a')
ax.set_facecolor('#111827'); ax.tick_params(colors='#cbd5e1')
for s in ax.spines.values(): s.set_color('#334155')
im = ax.imshow(H, cmap='viridis')
ax.set_xticks(range(m+1)); ax.set_xticklabels([' '] + list(s2), color='#cbd5e1')
ax.set_yticks(range(n+1)); ax.set_yticklabels([' '] + list(s1), color='#cbd5e1')
ax.set_title(f'Needleman-Wunsch DP matrix, score={H[-1,-1]}',
             color='#5eead4', fontweight='bold')
for i in range(n+1):
    for j in range(m+1):
        ax.text(j, i, H[i,j], ha='center', va='center',
                color='#fde68a', fontsize=11)
plt.colorbar(im, ax=ax, fraction=0.03)
plt.tight_layout()
plt.savefig('output.png', dpi=120, bbox_inches='tight', facecolor='#0a0a1a')
print(f'Optimal global alignment score: {H[-1,-1]}')

Click Run to execute the Python code

Code will be executed with Python 3 on the server

4. BLAST Heuristics & E-values

Exact DP against UniProt’s 250M sequences is infeasible. BLAST (Altschul 1990, 1997) uses word-level seeding: find short exact matches (k=3 protein, k=11 nucleotide), extend only productive seeds. Karlin-Altschul statistics produce the E-value:

\[ E \;=\; K\,m\,n\,e^{-\lambda S} \]

where S is raw score, m and n are query and database sizes, and λ, K are scoring-system parameters. E < 10^-5 is the conventional significance threshold.

Key References

• Needleman, S. B. & Wunsch, C. D. (1970). “A general method applicable to the search for similarities in the amino acid sequence of two proteins.” J. Mol. Biol., 48, 443–453.

• Smith, T. F. & Waterman, M. S. (1981). “Identification of common molecular subsequences.” J. Mol. Biol., 147, 195–197.

• Altschul, S. F. et al. (1997). “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” Nucleic Acids Res., 25, 3389–3402.

• Gotoh, O. (1982). “An improved algorithm for matching biological sequences.” J. Mol. Biol., 162, 705–708.

Share:X Reddit LinkedIn

← Module 0 Module 2: Substitution Matrices →