Module 1

Sequence Alignment & Dynamic Programming

Pairwise sequence alignment is the mathematical core of bioinformatics. This module presents Needleman-Wunsch global alignment, Smith-Waterman local alignment, affine gap penalties, BLAST seeding heuristics, and the Karlin-Altschul statistics that underpin E-values.

1. Needleman-Wunsch Global Alignment

Needleman & Wunsch 1970 introduced the dynamic-programming approach. For sequences s and t, fill matrix H:

\[ H_{i,j} = \max\begin{cases} H_{i-1,j-1} + \sigma(s_i, t_j) \\ H_{i-1,j} + g \\ H_{i,j-1} + g \end{cases} \]

with the score matrix σ (match/mismatch) and gap penalty g. Complexity is O(nm). Traceback from Hn,m recovers the optimal alignment. Hirschberg 1975 achieves linear space with divide-and-conquer at the same time complexity.

2. Smith-Waterman Local Alignment

Smith & Waterman 1981 modified the recurrence to allow Hi,j = 0, restarting alignment anywhere. The result is the highest-scoring local subsequence match — the natural analog for finding conserved domains in divergent proteins.

3. Affine Gaps & Gotoh

Biological gaps are bursty: inserting one base is cheap once the gap is opened. Gotoh 1982 extended DP to affine gap penalties — gap-opening cost plus gap-extension cost — by maintaining three matrices (match, I, D). This is the default scoring scheme in all modern aligners.

Simulation: Needleman-Wunsch DP Matrix

Python
script.py34 lines

Click Run to execute the Python code

Code will be executed with Python 3 on the server

4. BLAST Heuristics & E-values

Exact DP against UniProt’s 250M sequences is infeasible. BLAST (Altschul 1990, 1997) uses word-level seeding: find short exact matches (k=3 protein, k=11 nucleotide), extend only productive seeds. Karlin-Altschul statistics produce the E-value:

\[ E \;=\; K\,m\,n\,e^{-\lambda S} \]

where S is raw score, m and n are query and database sizes, and λ, K are scoring-system parameters. E < 10-5 is the conventional significance threshold.

Key References

• Needleman, S. B. & Wunsch, C. D. (1970). “A general method applicable to the search for similarities in the amino acid sequence of two proteins.” J. Mol. Biol., 48, 443–453.

• Smith, T. F. & Waterman, M. S. (1981). “Identification of common molecular subsequences.” J. Mol. Biol., 147, 195–197.

• Altschul, S. F. et al. (1997). “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” Nucleic Acids Res., 25, 3389–3402.

• Gotoh, O. (1982). “An improved algorithm for matching biological sequences.” J. Mol. Biol., 162, 705–708.

Share:XRedditLinkedIn