Module 1
Sequence Alignment & Dynamic Programming
Pairwise sequence alignment is the mathematical core of bioinformatics. This module presents Needleman-Wunsch global alignment, Smith-Waterman local alignment, affine gap penalties, BLAST seeding heuristics, and the Karlin-Altschul statistics that underpin E-values.
1. Needleman-Wunsch Global Alignment
Needleman & Wunsch 1970 introduced the dynamic-programming approach. For sequences s and t, fill matrix H:
\[ H_{i,j} = \max\begin{cases} H_{i-1,j-1} + \sigma(s_i, t_j) \\ H_{i-1,j} + g \\ H_{i,j-1} + g \end{cases} \]
with the score matrix σ (match/mismatch) and gap penalty g. Complexity is O(nm). Traceback from Hn,m recovers the optimal alignment. Hirschberg 1975 achieves linear space with divide-and-conquer at the same time complexity.
2. Smith-Waterman Local Alignment
Smith & Waterman 1981 modified the recurrence to allow Hi,j = 0, restarting alignment anywhere. The result is the highest-scoring local subsequence match — the natural analog for finding conserved domains in divergent proteins.
3. Affine Gaps & Gotoh
Biological gaps are bursty: inserting one base is cheap once the gap is opened. Gotoh 1982 extended DP to affine gap penalties — gap-opening cost plus gap-extension cost — by maintaining three matrices (match, I, D). This is the default scoring scheme in all modern aligners.
Simulation: Needleman-Wunsch DP Matrix
Click Run to execute the Python code
Code will be executed with Python 3 on the server
4. BLAST Heuristics & E-values
Exact DP against UniProt’s 250M sequences is infeasible. BLAST (Altschul 1990, 1997) uses word-level seeding: find short exact matches (k=3 protein, k=11 nucleotide), extend only productive seeds. Karlin-Altschul statistics produce the E-value:
\[ E \;=\; K\,m\,n\,e^{-\lambda S} \]
where S is raw score, m and n are query and database sizes, and λ, K are scoring-system parameters. E < 10-5 is the conventional significance threshold.
Key References
• Needleman, S. B. & Wunsch, C. D. (1970). “A general method applicable to the search for similarities in the amino acid sequence of two proteins.” J. Mol. Biol., 48, 443–453.
• Smith, T. F. & Waterman, M. S. (1981). “Identification of common molecular subsequences.” J. Mol. Biol., 147, 195–197.
• Altschul, S. F. et al. (1997). “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” Nucleic Acids Res., 25, 3389–3402.
• Gotoh, O. (1982). “An improved algorithm for matching biological sequences.” J. Mol. Biol., 162, 705–708.