Chapter 18: Large Language Models
Large language models are Transformer-based autoregressive models trained on massive text corpora. They exhibit surprising capabilities that emerge at scale: abilities not present in small models and not explicitly programmed. This chapter covers the mathematical foundations, training pipeline, and evaluation of LLMs.
1. Autoregressive Language Modelling
A language model defines a probability distribution over token sequences. Using the chain rule of probability:\[ P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1}) \]
Training minimises the cross-entropy loss (equivalently, maximises log-likelihood):\[ \mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t}) \]
Perplexity = \(\exp(\mathcal{L})\) measures how "surprised" the model is on average. A perplexity of 20 means the model is roughly as uncertain as choosing uniformly among 20 tokens.
2. GPT Architecture (Decoder-Only Transformer)
GPT uses a decoder-only Transformer: the full Transformer decoder of Chapter 17, but with the cross-attention layer removed. A causal mask ensures each position only attends to previous tokens, maintaining the autoregressive property.
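The causal mask can be sketched in a few lines of NumPy: masked positions are set to \(-\infty\) before the softmax, so they receive exactly zero attention weight (the attention scores here are random stand-ins, not from a real model):

```python
import numpy as np

def causal_mask(T):
    """Lower-triangular boolean mask: position t may attend only to positions <= t."""
    return np.tril(np.ones((T, T), dtype=bool))

T = 4
scores = np.random.randn(T, T)          # toy attention scores (stand-in values)
scores = np.where(causal_mask(T), scores, -np.inf)

# Row-wise softmax; exp(-inf) = 0, so future positions get zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(3))
```

Each row of `weights` sums to 1, and everything strictly above the diagonal is zero, which is exactly the autoregressive property.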
GPT (Decoder-only)
- Causal (left-to-right) attention mask
- Trained on next-token prediction
- Natural for generation tasks
- Examples: GPT-2, GPT-3, GPT-4, LLaMA, Gemini
BERT (Encoder-only)
- Bidirectional: attends to all positions
- Trained on masked language modelling (MLM)
- Strong for understanding/classification
- Examples: BERT, RoBERTa, DeBERTa
BERT's Masked Language Modelling: randomly mask 15% of tokens and predict them from the bidirectional context. This is a denoising objective: \(P(x_{\rm masked} \mid x_{\rm unmasked})\). BERT cannot generate text naturally, since it is trained with access to future tokens; GPT, conversely, cannot use future context for understanding.
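The corruption step of the MLM objective is easy to sketch. A simplified version (real BERT additionally replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the cat sat on the mat and purred quietly".split()

# Select ~15% of positions (at least one) to corrupt.
n_mask = max(1, int(0.15 * len(tokens)))
masked_positions = rng.choice(len(tokens), size=n_mask, replace=False)

corrupted = list(tokens)
for i in masked_positions:
    corrupted[i] = "[MASK]"

# The model sees `corrupted` with full bidirectional context and is trained
# to predict the original tokens at the masked positions.
print(" ".join(corrupted))
```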
3. Scaling Laws
Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla") showed that language model loss follows a power law in the number of parameters \(N\), training tokens \(D\), and compute \(C\):\[ \mathcal{L}(N, D) \approx \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + L_\infty \]where \(A, B, \alpha, \beta\) are fitted constants and \(L_\infty\) is the irreducible loss of the data distribution.
Chinchilla finding: for a fixed compute budget, it is better to train a smaller model on more data than a larger model on fewer tokens. Optimal allocation: \(D \approx 20 \times N\) tokens.
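The Chinchilla allocation can be turned into a small calculator. This sketch uses the standard approximation \(C \approx 6ND\) FLOPs for training a dense Transformer, combined with the \(D \approx 20N\) rule (the \(10^{23}\) FLOP budget below is a hypothetical example):

```python
import math

def chinchilla_allocation(compute_flops):
    """Compute-optimal split under C ~= 6*N*D and the rule D ~= 20*N.

    Substituting D = 20N into C = 6ND gives C = 120*N**2.
    """
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Hypothetical budget of 1e23 FLOPs.
N, D = chinchilla_allocation(1e23)
print(f"params ~ {N:.2e}, tokens ~ {D:.2e}")
```

For reference, Chinchilla itself (70B parameters, ~1.4T tokens) sits almost exactly on the \(D = 20N\) line.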
3.1 Emergent Abilities
Some capabilities (e.g., multi-step arithmetic, chain-of-thought reasoning, in-context learning) appear suddenly above a threshold model size: they are absent in small models and present in large ones without explicit training. This is called emergence and is an active area of research.
4. Training Pipeline: Pre-training, Fine-tuning, RLHF
- Pre-training: Train on massive text corpus \(\mathcal{D}_{\rm pre}\) by minimising cross-entropy. Produces a model \(\pi_{\rm PT}\) that can complete text.
- Supervised Fine-tuning (SFT): Fine-tune on high-quality demonstrations \(\{(\mathrm{prompt}_i, \mathrm{response}_i)\}\) of desired behaviour.
- Reward Model Training: Collect human preference data: pairs of responses ranked by quality. Train a reward model \(r_\phi(x, y)\) using the Bradley-Terry preference model:\[ P(y_1 \succ y_2) = \sigma(r_\phi(x, y_1) - r_\phi(x, y_2)) \]
- RLHF with PPO: Optimise the language model as a policy to maximise the reward model score, subject to a KL penalty from the SFT policy:\[ \mathcal{L}_{\rm RLHF}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\!\left[r_\phi(x,y) - \beta\,\mathrm{KL}(\pi_\theta(\cdot|x) \| \pi_{\rm SFT}(\cdot|x))\right] \]The KL term prevents the model from exploiting the reward model with nonsensical outputs that score highly.
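The two training signals in the last two steps can be computed on toy numbers. A minimal sketch with made-up reward scores and a made-up KL value standing in for \(\mathrm{KL}(\pi_\theta \| \pi_{\rm SFT})\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry: P(y1 preferred over y2) from scalar reward-model scores.
# r1, r2 are hypothetical scores for two responses to the same prompt.
r1, r2 = 2.1, 0.4
p_prefer = sigmoid(r1 - r2)

# Reward-model training loss on this labelled pair: -log P(y1 > y2).
rm_loss = -np.log(p_prefer)

# Per-sample RLHF objective: reward minus the KL penalty beta * KL.
beta, kl = 0.1, 3.0
rlhf_objective = r1 - beta * kl
print(p_prefer, rm_loss, rlhf_objective)
```

If the reward model ranks the pair correctly (\(r_1 > r_2\)), then \(P(y_1 \succ y_2) > 0.5\) and the loss is small; the KL term shrinks the effective reward whenever the policy drifts far from \(\pi_{\rm SFT}\).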
5. Tokenisation: Byte-Pair Encoding (BPE)
LLMs operate on tokens, not characters. BPE builds a vocabulary by iteratively merging the most frequent pair of adjacent tokens:
- Start with character vocabulary.
- Count all adjacent pairs in the corpus.
- Merge the most frequent pair into a new token.
- Repeat until vocabulary reaches target size (e.g., 50,257 for GPT-2).
BPE balances vocabulary size and sequence length. Common words become single tokens; rare words decompose into sub-word tokens, maintaining coverage without an infinite vocabulary.
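The merge loop above fits in a short function. A minimal sketch on a tiny made-up corpus (real BPE implementations work on byte sequences over a full training corpus):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules on a tiny corpus; each word is a tuple of symbols."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged token.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)
```

On this corpus the first merges build up the common stem: `('l','o')`, then `('lo','w')`, so "low" quickly becomes a single token while the rarer suffixes stay decomposed.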
6. In-Context Learning & Prompting
A surprising property of large LMs: they can learn new tasks from just a few examples in the prompt, without any gradient updates. Given \(k\) examples in context, the model predicts the output for a new query:\[ P_\theta(y \mid x_1, y_1, x_2, y_2, \dots, x_k, y_k, x_{\rm query}) \]
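Concretely, a few-shot prompt is nothing more than the \(k\) demonstrations serialised as text, followed by the new query. A sketch for a hypothetical sentiment task (the example reviews are invented):

```python
# k = 2 demonstrations followed by a new query, serialised as plain text.
examples = [("great movie!", "positive"), ("terrible plot", "negative")]
query = "what a delightful film"

prompt = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in examples)
prompt += f"\nReview: {query}\nSentiment:"
print(prompt)
```

The model is simply asked to continue this text; completing the final `Sentiment:` line with the right label is what "learning the task in context" means.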
Chain-of-Thought (CoT)
Prompting with intermediate reasoning steps dramatically improves performance on multi-step problems. The model generates a scratchpad before the final answer.
Zero-shot Prompting
Large enough models can follow instructions without any examples, simply from the instruction text. Instruction-tuning amplifies this ability.
7. GPT Architecture Diagram
8. Python: Tiny Character-Level Transformer
We implement a 2-layer character-level Transformer from scratch in NumPy, train it on a small text, and visualise: training loss, attention patterns, perplexity, scaling law curves, character frequency (BPE motivation), and temperature sampling effects.