Chapter 16: Recurrent Networks
Recurrent neural networks process sequences by maintaining a hidden state that accumulates information from all past inputs. The vanishing gradient problem plagued early RNNs; the LSTM's gating mechanism was the breakthrough that made deep sequence learning practical.
1. The Vanilla RNN
At each time step \(t\), the RNN reads input \(\mathbf{x}_t\) and updates its hidden state:
\[
\mathbf{h}_t = \tanh\!\left(W_{hh}\mathbf{h}_{t-1} + W_{xh}\mathbf{x}_t + \mathbf{b}_h\right)
\]
The same weights \(W_{hh}, W_{xh}\) are reused at every step — this is parameter sharing, which makes RNNs applicable to sequences of arbitrary length with a fixed number of parameters.
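The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the weight names \(W_{hh}, W_{xh}\) follow the text, while the sizes (hidden size 4, input size 3) and random initialization are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 4, 3  # hidden and input sizes (illustrative)
W_hh = rng.standard_normal((H, H)) * 0.1  # hidden-to-hidden weights
W_xh = rng.standard_normal((H, D)) * 0.1  # input-to-hidden weights
b_h = np.zeros(H)

def rnn_step(h_prev, x):
    """One step of the recurrence: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x + b_h)

# The same weights are reused at every step, so any sequence length works.
h = np.zeros(H)
for x in rng.standard_normal((10, D)):  # a sequence of 10 input vectors
    h = rnn_step(h, x)
```

Because `rnn_step` closes over a single set of weights, running it over 10 steps or 10,000 steps uses the same parameters.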
2. Backpropagation Through Time (BPTT)
For a loss \(\mathcal{L} = \sum_t \mathcal{L}_t\) over all steps, the gradient with respect to \(W_{hh}\) requires the chain rule unrolled through time:
\[
\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_t \sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial \mathbf{h}_t}\,\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k}\,\frac{\partial \mathbf{h}_k}{\partial W_{hh}}
\]
The gradient of \(\mathbf{h}_t\) with respect to \(\mathbf{h}_k\) (for \(k < t\)) is:
\[
\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} = \prod_{i=k+1}^{t} \frac{\partial \mathbf{h}_i}{\partial \mathbf{h}_{i-1}} = \prod_{i=k+1}^{t} \operatorname{diag}\!\left(1 - \mathbf{h}_i^{2}\right) W_{hh}
\]
This is a product of \(t-k\) Jacobian matrices. If the largest singular value of \(W_{hh}\) is \(\lambda_1 < 1\), the norm of the product shrinks exponentially in \(t-k\) — vanishing gradients. If \(\lambda_1 > 1\), the product can grow exponentially — exploding gradients. Either way, learning long-range dependencies becomes effectively impossible.
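The exponential behavior is easy to demonstrate numerically. This sketch (not from the text) uses a symmetric \(W_{hh}\), for which the largest singular value equals the spectral radius, so the norm of a 50-step Jacobian product behaves exactly like \(\lambda_1^{50}\); the \(\tanh'\) diagonal factors (each \(\le 1\)) are omitted, as they only shrink the product further. The sizes and \(\lambda_1\) values 0.9 and 1.1 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))
W = (W + W.T) / 2  # symmetric, so largest singular value = spectral radius

norms = {}
for lam, label in [(0.9, "vanishing"), (1.1, "exploding")]:
    Ws = W * (lam / np.linalg.norm(W, 2))  # rescale largest singular value to lam
    P = np.eye(8)
    for _ in range(50):                    # product over 50 BPTT steps
        P = Ws.T @ P
    norms[label] = np.linalg.norm(P, 2)    # ~0.9^50 ≈ 5e-3 vs ~1.1^50 ≈ 117

print(norms)
```

After only 50 steps the two products differ by more than four orders of magnitude, despite the singular values differing by just 0.2.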
3. Long Short-Term Memory (LSTM)
The LSTM (Hochreiter & Schmidhuber, 1997) adds a separate cell state \(\mathbf{c}_t\) that flows through time with only multiplicative interactions — no squashing — enabling gradients to flow over hundreds of steps.
3.1 Full Gate Equations
\[
\begin{aligned}
\mathbf{f}_t &= \sigma\!\left(W_f[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_f\right) && \text{(forget gate)}\\
\mathbf{i}_t &= \sigma\!\left(W_i[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_i\right) && \text{(input gate)}\\
\tilde{\mathbf{c}}_t &= \tanh\!\left(W_c[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_c\right) && \text{(candidate cell)}\\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t && \text{(cell update)}\\
\mathbf{o}_t &= \sigma\!\left(W_o[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_o\right) && \text{(output gate)}\\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t)
\end{aligned}
\]
3.2 Why LSTMs Mitigate Vanishing Gradients
The gradient through the cell-state pathway (holding the gates fixed) is \(\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \operatorname{diag}(\mathbf{f}_t)\). This is an elementwise multiplication (no matrix product, no squashing nonlinearity), so the gradient along this path is determined only by the forget-gate values. When the forget gate is near 1, gradients flow through nearly unchanged; the network learns to keep gradients alive when needed.
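The contrast can be seen with simple arithmetic: over \(T\) steps, the cell-path gradient is the elementwise product of the forget gates \(\mathbf{f}_1 \odot \cdots \odot \mathbf{f}_T\). The gate values below (0.999 vs. 0.5) are illustrative choices, not outputs of a trained model.

```python
import numpy as np

T = 100
f_remember = np.full(T, 0.999)  # a gate that learned to retain information
f_forget = np.full(T, 0.5)      # a gate that discards information each step

grad_remember = np.prod(f_remember)  # 0.999^100 ~ 0.90: gradient survives
grad_forget = np.prod(f_forget)      # 0.5^100   ~ 8e-31: gradient vanishes
print(grad_remember, grad_forget)
```

A forget gate saturated near 1 preserves roughly 90% of the gradient even across 100 steps, whereas a gate at 0.5 annihilates it — this is the mechanism by which the LSTM learns *when* to keep gradients alive.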
4. LSTM Cell Diagram
[Figure: the LSTM cell — forget, input, and output gates regulating flow into and out of the cell state \(\mathbf{c}_t\).]
5. GRU: Simplified Gating
The Gated Recurrent Unit (Cho et al., 2014) merges the forget and input gates into one update gate and eliminates the cell state, using only the hidden state:
\[
\begin{aligned}
\mathbf{z}_t &= \sigma\!\left(W_z[\mathbf{h}_{t-1}; \mathbf{x}_t]\right) && \text{(update gate)}\\
\mathbf{r}_t &= \sigma\!\left(W_r[\mathbf{h}_{t-1}; \mathbf{x}_t]\right) && \text{(reset gate)}\\
\tilde{\mathbf{h}}_t &= \tanh\!\left(W_h[\mathbf{r}_t \odot \mathbf{h}_{t-1}; \mathbf{x}_t]\right)\\
\mathbf{h}_t &= (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t
\end{aligned}
\]
The GRU has fewer parameters than the LSTM (three gated transformations instead of four) and is often preferred for smaller datasets; in practice the two architectures frequently achieve comparable performance.
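A single GRU step is short enough to write out directly. This is a minimal sketch following the gate equations above; the sizes (hidden 4, input 3), the random initialization, and the omission of bias terms are illustrative simplifications, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(2)
H, D = 4, 3
# One weight matrix per gated transformation, acting on [h_{t-1}; x_t]
Wz, Wr, Wh = (rng.standard_normal((H, H + D)) * 0.1 for _ in range(3))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x):
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)                                  # update gate
    r = sigmoid(Wr @ hx)                                  # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * h_tilde                 # gated interpolation

h = np.zeros(H)
for x in rng.standard_normal((5, D)):
    h = gru_step(h, x)
```

Note the last line: the new state is a convex combination of the old state and the candidate, which is the GRU's analogue of the LSTM's additive cell update.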
5.1 Bidirectional RNNs
A bidirectional RNN runs one RNN left-to-right and another right-to-left, concatenating the two hidden states at each position: \(\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t]\). This gives access to both past and future context at every position — essential for NLP tasks such as named entity recognition and machine translation encoders.
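The bidirectional scheme can be sketched as two independent vanilla RNNs whose per-position states are concatenated. All sizes and weights below are illustrative; the key detail is reversing the backward pass's outputs so both directions align at the same time index.

```python
import numpy as np

rng = np.random.default_rng(3)
H, D, T = 4, 3, 6
Wf_hh, Wb_hh = rng.standard_normal((2, H, H)) * 0.1  # forward / backward weights
Wf_xh, Wb_xh = rng.standard_normal((2, H, D)) * 0.1
xs = rng.standard_normal((T, D))  # an input sequence of length T

def run(W_hh, W_xh, seq):
    """Run a vanilla RNN over seq, returning the hidden state at each step."""
    h, out = np.zeros(H), []
    for x in seq:
        h = np.tanh(W_hh @ h + W_xh @ x)
        out.append(h)
    return out

fwd = run(Wf_hh, Wf_xh, xs)                # left-to-right states
bwd = run(Wb_hh, Wb_xh, xs[::-1])[::-1]    # right-to-left, re-aligned to t
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # [h_fwd; h_bwd]
```

Each entry of `states` has dimension \(2H\) and summarizes both the prefix and the suffix of the sequence at that position.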
6. Python: RNN vs LSTM for Sequence Prediction
We implement both a vanilla RNN and an LSTM from scratch using NumPy. The task is one-step-ahead prediction of a noisy sine wave. We also measure gradient norms as a function of BPTT steps to show the vanishing gradient effect directly.
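The experiment can be compressed into the sketch below. Rather than a full training loop, it runs untrained vanilla RNN and LSTM cells over the noisy sine input and measures how the gradient norm decays with BPTT depth — \(\|\partial \mathbf{h}_T / \partial \mathbf{h}_{T-k}\|\) for the RNN (product of per-step Jacobians) versus the LSTM's cell-state pathway (product of forget gates). All sizes, seeds, weight scales, and the forget-gate bias of 1.0 are illustrative choices, so this isolates the architectural effect rather than reproducing a trained model.

```python
import numpy as np

rng = np.random.default_rng(42)
H, T = 16, 60
# Noisy sine wave input, one scalar per time step
xs = np.sin(np.linspace(0, 6 * np.pi, T)) + 0.05 * rng.standard_normal(T)

# --- Vanilla RNN forward, recording per-step Jacobians dh_t/dh_{t-1} ---
W_hh = rng.standard_normal((H, H)) * 0.4
W_xh = rng.standard_normal((H, 1)) * 0.4
h, rnn_jacs = np.zeros(H), []
for t in range(T):
    h = np.tanh(W_hh @ h + W_xh @ np.array([xs[t]]))
    rnn_jacs.append(np.diag(1 - h**2) @ W_hh)  # diag(tanh') W_hh

# --- LSTM forward, recording forget gates (cell path: dc_t/dc_{t-1} = f_t) ---
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

Wg = rng.standard_normal((4 * H, H + 1)) * 0.4  # stacked f, i, o, g weights
h, c, forget_gates = np.zeros(H), np.zeros(H), []
for t in range(T):
    z = Wg @ np.concatenate([h, [xs[t]]])
    f = sigmoid(z[:H] + 1.0)          # forget gate (bias +1: common init trick)
    i = sigmoid(z[H:2 * H])           # input gate
    o = sigmoid(z[2 * H:3 * H])       # output gate
    g = np.tanh(z[3 * H:])            # candidate cell
    c = f * c + i * g
    h = o * np.tanh(c)
    forget_gates.append(f)

# --- Gradient norm as a function of BPTT depth k ---
def norms_back(factors, matmul):
    """Accumulate the backward product of per-step factors, deepest last."""
    acc = np.eye(H) if matmul else np.ones(H)
    out = []
    for F in reversed(factors):
        acc = acc @ F if matmul else acc * F
        out.append(np.linalg.norm(acc))
    return out

rnn_norms = norms_back(rnn_jacs, matmul=True)       # Jacobian matrix products
lstm_norms = norms_back(forget_gates, matmul=False)  # elementwise gate products
print("k=30  RNN:", rnn_norms[29], " LSTM cell path:", lstm_norms[29])
```

The LSTM's cell-path norm shrinks only as fast as its forget gates dictate, while the RNN's norm is at the mercy of repeated matrix products through \(\tanh'\) and \(W_{hh}\); plotting `rnn_norms` and `lstm_norms` against \(k\) (e.g. with matplotlib on a log scale) makes the vanishing-gradient effect visible directly.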