Chapter 17: Attention & Transformers

The attention mechanism broke the bottleneck of fixed-length context vectors; the Transformer architecture replaced recurrence entirely with attention. “Attention Is All You Need” (Vaswani et al., 2017) is now the foundation of virtually every state-of-the-art NLP, vision, and multimodal system.

1. Motivation: Breaking the Information Bottleneck

In seq2seq RNNs, the encoder must compress an entire sentence into one fixed-length vector — information loss is inevitable for long sequences. Attention lets the decoder look directly at all encoder hidden states, computing a weighted sum based on relevance:

\[ \mathbf{c}_t = \sum_{s=1}^{T_{\rm enc}} \alpha_{ts}\,\mathbf{h}_s^{\rm enc}, \qquad \alpha_{ts} = \frac{\exp(e_{ts})}{\sum_{s'}\exp(e_{ts'})}, \quad e_{ts} = \text{score}(\mathbf{h}_{t-1}^{\rm dec}, \mathbf{h}_s^{\rm enc}) \]
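This weighted-sum step is easy to sketch in NumPy. The sketch below uses a simple dot-product score and random arrays standing in for real encoder/decoder hidden states; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: T_enc encoder steps, hidden size d
T_enc, d = 5, 8
h_enc = rng.standard_normal((T_enc, d))   # encoder hidden states h_s
h_dec = rng.standard_normal(d)            # previous decoder state h_{t-1}

# Score e_{ts} (dot-product variant), then softmax to weights alpha_{ts}
e = h_enc @ h_dec
alpha = np.exp(e - e.max())               # subtract max for numerical stability
alpha /= alpha.sum()

# Context vector c_t: relevance-weighted sum of all encoder states
c_t = alpha @ h_enc
```

Because the decoder recomputes `alpha` at every step, it can focus on different source positions for each output token, rather than relying on one fixed vector.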

2. Scaled Dot-Product Attention: Full Derivation

The Transformer generalises attention to three distinct roles. Given input matrix \(\mathbf{X} \in \mathbb{R}^{T \times d}\), project to queries, keys, and values:

\[ \mathbf{Q} = \mathbf{X}W_Q \in \mathbb{R}^{T \times d_k}, \quad \mathbf{K} = \mathbf{X}W_K \in \mathbb{R}^{T \times d_k}, \quad \mathbf{V} = \mathbf{X}W_V \in \mathbb{R}^{T \times d_v} \]

The attention output is:

\[ \text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V} \]
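This formula translates almost line-for-line into NumPy. The function below is a minimal sketch (function names are our own, not a library API):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (T, d_k), K: (T, d_k), V: (T, d_v) -> output (T, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (T, T) similarity matrix
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_k, d_v = 4, 16, 16
Q, K, V = (rng.standard_normal((T, s)) for s in (d_k, d_k, d_v))
out, w = scaled_dot_product_attention(Q, K, V)
```

Row `t` of `weights` is a probability distribution over all `T` positions: how much token `t`'s output draws from each value vector.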

2.1 Why scale by \(\sqrt{d_k}\)?

Suppose \(\mathbf{q}, \mathbf{k} \sim \mathcal{N}(\mathbf{0}, I)\) independently. The dot product is:

\[ \mathbf{q} \cdot \mathbf{k} = \sum_{i=1}^{d_k} q_i k_i \]

Each term \(q_i k_i\) has mean 0 and variance 1, so \(\mathrm{Var}(\mathbf{q}\cdot\mathbf{k}) = d_k\). Without scaling, for large \(d_k\) (e.g., 512), the dot products have standard deviation \(\sqrt{d_k} \approx 22\). These large values push softmax into saturation regions where gradients are near zero.

Scaling by \(1/\sqrt{d_k}\) restores unit variance: \(\mathrm{Var}(\mathbf{q}\cdot\mathbf{k}/\sqrt{d_k}) = 1\). The softmax operates in a stable gradient region, and training converges reliably.
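The variance argument can be checked empirically. The following sketch draws many random query/key pairs and compares the standard deviation of raw versus scaled dot products:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 512, 100_000

# n independent (q, k) pairs with unit-variance components
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

dots = (q * k).sum(axis=1)        # n raw dot products
print(dots.std())                 # close to sqrt(512) ~ 22.6
print((dots / np.sqrt(d_k)).std())  # close to 1.0 after scaling
```

The raw dot products are spread over tens of units, exactly the regime where a softmax saturates; dividing by \(\sqrt{d_k}\) brings them back to unit scale.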

3. Multi-Head Attention

Rather than computing one attention function, project to \(h\) different \((Q,K,V)\) subspaces and concatenate:

\[ \text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1,\ldots,\text{head}_h)\,W_O \]
\[ \text{head}_i = \text{Attention}(\mathbf{Q}W_Q^{(i)},\; \mathbf{K}W_K^{(i)},\; \mathbf{V}W_V^{(i)}) \]

With \(d_k = d_v = d_{\rm model}/h\), the total computation cost matches a single-head model. Different heads learn to attend to different types of relationships simultaneously (syntactic, semantic, positional).
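A minimal multi-head sketch in NumPy, looping over heads for clarity (a real implementation would batch the heads into one tensor operation; all names here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (T, d_model); W_Q/W_K/W_V: (h, d_model, d_k); W_O: (d_model, d_model)."""
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]   # project into head i's subspace
        d_k = Q.shape[-1]
        w = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(w @ V)
    # Concatenate the h outputs of width d_k back to d_model, then mix with W_O
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
T, d_model, h = 6, 32, 4
d_k = d_model // h                                     # per-head width
W_Q, W_K, W_V = (rng.standard_normal((h, d_model, d_k)) * 0.1 for _ in range(3))
W_O = rng.standard_normal((d_model, d_model)) * 0.1
out = multi_head_attention(rng.standard_normal((T, d_model)), W_Q, W_K, W_V, W_O, h)
```

With `d_k = d_model // h`, the total projected width across heads equals `d_model`, which is why the cost matches a single full-width head.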

4. Sinusoidal Positional Encoding

Self-attention is permutation-equivariant: permuting the input tokens permutes the outputs in exactly the same way, so attention by itself has no notion of order and treats the sequence as a bag of tokens. Positional information is injected by adding positional encodings to the token embeddings:

\[ PE(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\rm model}}}\right) \]
\[ PE(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\rm model}}}\right) \]

This choice has a key property: \(PE(\text{pos}+k)\) can be expressed as a linear function of \(PE(\text{pos})\) for any fixed offset \(k\), enabling the model to easily learn relative positions. Additionally, sinusoidal encodings generalise to sequence lengths unseen during training.
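Both formulas above vectorise neatly. The sketch below builds the encoding table and a cosine-similarity matrix; the "nearby positions are more similar" property can be read off the latter:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding table of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # frequency index, (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(50, 64)

# Cosine similarity between all pairs of positions
norms = np.linalg.norm(pe, axis=1, keepdims=True)
sim = (pe @ pe.T) / (norms @ norms.T)
```

Because the function is defined for any `pos`, the same code produces encodings for positions beyond those seen in training, which is the length-generalisation property mentioned above.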

5. Transformer Architecture

Each Transformer encoder layer consists of:

  1. Multi-head self-attention: each token attends to all others
  2. Add & LayerNorm: residual connection + layer normalisation
  3. Position-wise FFN: \(\text{FFN}(\mathbf{x}) = \max(0, \mathbf{x}W_1+\mathbf{b}_1)W_2+\mathbf{b}_2\) applied independently to each position
  4. Add & LayerNorm again

Each decoder layer adds a third sub-layer: cross-attention over the encoder output.
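The four encoder sub-steps above can be wired together in a few lines. This sketch uses the original post-norm ordering and an identity "attention" stub so the layer plumbing stays in focus (all names are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta

def ffn(x, W1, b1, W2, b2):
    # Position-wise FFN: the same weights are applied to every token independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, self_attn, ffn_params, d_model):
    """Post-norm encoder layer: sub-layer -> residual add -> LayerNorm."""
    g, b = np.ones(d_model), np.zeros(d_model)
    x = layer_norm(x + self_attn(x), g, b)           # steps 1-2
    x = layer_norm(x + ffn(x, *ffn_params), g, b)    # steps 3-4
    return x

rng = np.random.default_rng(0)
T, d_model, d_ff = 4, 16, 64
params = (rng.standard_normal((d_model, d_ff)) * 0.1, np.zeros(d_ff),
          rng.standard_normal((d_ff, d_model)) * 0.1, np.zeros(d_model))
# Identity stub in place of multi-head self-attention
y = encoder_layer(rng.standard_normal((T, d_model)), lambda x: x, params, d_model)
```

Swapping the stub for a real multi-head attention function (and, in the decoder, inserting a cross-attention sub-layer between steps 2 and 3) recovers the full architecture.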

5.1 Layer Normalisation

LayerNorm normalises across the feature dimension (not the batch dimension), so it is independent of batch size:

\[ \text{LN}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sigma + \varepsilon} \odot \boldsymbol{\gamma} + \boldsymbol{\beta} \]

Pre-norm (normalise before each sub-layer, as in GPT-2) trains more stably than the original post-norm arrangement for very deep models.
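The batch-independence is visible directly in a NumPy sketch: the statistics are computed per token, over the feature axis only:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Mean and std over the LAST axis: each token (row) is normalised
    # on its own, so the result never depends on what else is in the batch.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta

x = np.random.default_rng(0).standard_normal((3, 8))   # 3 tokens, 8 features
y = layer_norm(x, np.ones(8), np.zeros(8))
```

With `gamma = 1` and `beta = 0`, every output row has mean 0 and standard deviation approximately 1; the learnable `gamma`/`beta` then let the network undo the normalisation where useful.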

6. Transformer Encoder-Decoder Diagram

[Figure: Transformer encoder-decoder. Encoder: input embeddings + PE, then ×N layers of multi-head self-attention → Add & LayerNorm → feed-forward → Add & LayerNorm; the encoder output supplies K, V for cross-attention. Decoder: target embeddings + PE, then ×N layers of masked multi-head self-attention → Add & LayerNorm → cross-attention (Q from decoder, K/V from encoder) → Add & LayerNorm → feed-forward → Add & LayerNorm. A final Linear + Softmax produces output probabilities.]

7. Python: Self-Attention from Scratch

We implement scaled dot-product attention and multi-head attention from scratch using NumPy. Visualisations show: (1) full attention weight matrix, (2) causal masked attention, (3) multi-head diversity, (4) the scaling variance argument, (5) sinusoidal positional encoding heatmap, and (6) PE cosine similarity showing nearby positions are more similar.
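The full script is not reproduced here, but the causal-masking piece, item (2) in the list above, can be sketched as follows (helper names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Masked self-attention: position t may attend only to positions <= t."""
    T, d_k = Q.shape[0], Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -1e9, scores)              # block future positions
    w = softmax(scores)
    return w @ V, w

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = rng.standard_normal((3, T, d))
out, w = causal_attention(Q, K, V)
# The upper triangle of w is (numerically) zero: no information flows
# from the future, which is what makes autoregressive decoding valid.
```

Plotting `w` as a heatmap gives exactly the lower-triangular pattern the visualisation in this section refers to.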
