Deep Neural Networks
Depth brings representational power but also new failure modes: vanishing and exploding gradients. This chapter derives the engineering innovations (Batch Normalisation, Dropout, residual connections, and principled weight initialisation) that make training very deep networks reliable.
1. Activation Functions and Their Derivatives
The choice of activation function determines gradient flow, expressivity, and training stability.
ReLU (Rectified Linear Unit)
\[\mathrm{ReLU}(z) = \max(0, z), \qquad \mathrm{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}\]
No vanishing gradient for positive inputs. Dead-neuron problem: if \(z < 0\) always, the gradient is permanently zero.
Leaky ReLU
\[f(z) = \begin{cases} z & z > 0 \\ \alpha z & z \leq 0 \end{cases}, \qquad f'(z) = \begin{cases} 1 & z > 0 \\ \alpha & z \leq 0 \end{cases}\]
Typical \(\alpha = 0.01\). Fixes dead neurons: the gradient is always non-zero.
GELU (Gaussian Error Linear Unit)
\[\mathrm{GELU}(z) = z\,\Phi(z), \qquad \Phi(z) = \mathbb{P}(Z \leq z),\; Z \sim \mathcal{N}(0,1)\]
Used in BERT and GPT. Smooth, non-monotone near zero, differentiable everywhere.
Swish (Ramachandran et al. 2017)
\[\mathrm{swish}(z) = z\,\sigma(\beta z)\]
Self-gated; discovered via neural architecture search. Used in EfficientNet.
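These four activations and their derivatives can be sketched in a few lines of NumPy. This is a minimal illustration; the GELU here uses the tanh approximation common in transformer codebases rather than the exact \(z\,\Phi(z)\):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)        # exactly 0 for z < 0: the dead-neuron regime

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)  # never zero, so units cannot die

def gelu(z):
    # tanh approximation of z * Phi(z), as used in many transformer implementations
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

def swish(z, beta=1.0):
    return z / (1.0 + np.exp(-beta * z))  # z * sigmoid(beta * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (relu, leaky_relu, gelu, swish):
    print(f.__name__, np.round(f(z), 4))
```

Note that all four agree with the identity for large positive \(z\); they differ only in how they treat negative inputs.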
2. Vanishing & Exploding Gradients
Consider a deep network with sigmoid activations and identical weights \(w\) per layer. The gradient of the loss with respect to layer \(l\)'s pre-activation is:
\[\frac{\partial \mathcal{L}}{\partial z^{(l)}} = \frac{\partial \mathcal{L}}{\partial z^{(L)}} \prod_{k=l}^{L-1} w\,\sigma'(z^{(k)})\]
The sigmoid derivative satisfies \(\sigma'(z) \leq \frac{1}{4}\) for all \(z\). Therefore, for the \(L - l\) layers above layer \(l\):
\[\left|\frac{\partial \mathcal{L}}{\partial z^{(l)}}\right| \leq \left|\frac{\partial \mathcal{L}}{\partial z^{(L)}}\right| \left(\frac{|w|}{4}\right)^{L-l}\]
If \(|w| < 4\), this factor decays exponentially with depth: the vanishing gradient problem. If \(|w| > 4\) and the activations stay near zero, where \(\sigma'(z) \approx 1/4\), the per-layer factor \(|w|\,\sigma'(z)\) exceeds 1 and gradients grow exponentially: the exploding gradient problem. ReLU has derivative 1 for positive inputs, eliminating the \(1/4\) factor, which is why it became the default.
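The decay and growth are easy to tabulate numerically. The sketch below takes the most favourable case for the gradient, \(\sigma'(z) = 1/4\) (the sigmoid derivative's maximum, attained at \(z = 0\)), so the per-layer factor is exactly \(|w|/4\):

```python
depth = 50  # layers between layer l and the loss
# sigma'(z) <= 1/4, so these are the largest the cumulative factors can be
factors = {w: (w * 0.25) ** depth for w in (1.0, 3.9, 5.0)}
for w, f in factors.items():
    print(f"|w| = {w}: gradient factor over {depth} layers = {f:.3e}")
```

Even at \(|w| = 3.9\), just below the threshold, the factor still shrinks; at \(|w| = 1\) it is astronomically small, and at \(|w| = 5\) it explodes.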
3. Batch Normalisation: Full Derivation
Batch Normalisation (Ioffe & Szegedy 2015) normalises the pre-activations within each mini-batch, then applies a learnable affine transformation to restore representational capacity.
Forward pass (training)
For a mini-batch \(\mathcal{B} = \{z_1,\dots,z_m\}\) at a given layer and feature dimension:
\[\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^m z_i, \qquad \sigma^2_\mathcal{B} = \frac{1}{m}\sum_{i=1}^m (z_i - \mu_\mathcal{B})^2\]
\[\hat{z}_i = \frac{z_i - \mu_\mathcal{B}}{\sqrt{\sigma^2_\mathcal{B} + \varepsilon}}, \qquad y_i = \gamma\,\hat{z}_i + \beta\]
\(\gamma, \beta\) are learned parameters. Running statistics \(\mu_\mathrm{run}, \sigma^2_\mathrm{run}\) are maintained with momentum for inference.
Backward pass through BatchNorm
Given upstream gradient \(d\mathbf{y} = \partial\mathcal{L}/\partial\mathbf{y}\):
\[\frac{\partial\mathcal{L}}{\partial\gamma} = \sum_{i=1}^m \frac{\partial\mathcal{L}}{\partial y_i}\,\hat{z}_i, \qquad \frac{\partial\mathcal{L}}{\partial\beta} = \sum_{i=1}^m \frac{\partial\mathcal{L}}{\partial y_i}\]
\[\frac{\partial\mathcal{L}}{\partial z_i} = \frac{\gamma}{m\sqrt{\sigma^2_\mathcal{B} + \varepsilon}}\left(m\,\frac{\partial\mathcal{L}}{\partial y_i} - \sum_{j=1}^m \frac{\partial\mathcal{L}}{\partial y_j} - \hat{z}_i \sum_{j=1}^m \frac{\partial\mathcal{L}}{\partial y_j}\,\hat{z}_j\right)\]
Why does BatchNorm help?
It reduces internal covariate shift: the distribution of each layer's inputs stays roughly \(\mathcal{N}(0,1)\) before the affine correction, keeping activations in the linear regime of saturating activations and allowing larger learning rates.
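The forward and backward passes can be implemented directly in NumPy and verified against a finite-difference gradient. A minimal sketch, assuming arrays of shape (batch, features) and the biased batch variance, matching the derivation:

```python
import numpy as np

def bn_forward(z, gamma, beta, eps=1e-5):
    """z: (m, d) pre-activations for one mini-batch."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)                    # biased estimate, as in the derivation
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta, (z_hat, var, eps, gamma)

def bn_backward(dy, cache):
    z_hat, var, eps, gamma = cache
    m = dy.shape[0]
    dgamma = (dy * z_hat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    # closed-form gradient through the normalisation, per feature
    dz = gamma / (m * np.sqrt(var + eps)) * (m * dy - dbeta - z_hat * dgamma)
    return dz, dgamma, dbeta

# finite-difference check of dz on a random batch, with loss L = sum(r * y)
rng = np.random.default_rng(1)
z = rng.normal(size=(8, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)
r = rng.normal(size=(8, 3))
_, cache = bn_forward(z, gamma, beta)
dz, _, _ = bn_backward(r, cache)

h = 1e-5
zp, zm = z.copy(), z.copy()
zp[0, 0] += h
zm[0, 0] -= h
num = ((bn_forward(zp, gamma, beta)[0] - bn_forward(zm, gamma, beta)[0]) * r).sum() / (2 * h)
print(abs(num - dz[0, 0]))   # small: analytic gradient matches the numerical one
```

The non-obvious part is that every \(z_i\) influences \(\mu_\mathcal{B}\) and \(\sigma^2_\mathcal{B}\), so each \(\partial\mathcal{L}/\partial z_i\) contains batch-wide sums; the finite-difference check catches exactly this coupling.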
4. Dropout
Dropout (Srivastava et al. 2014) randomly zeroes activations during training using a Bernoulli mask:
\[m_i \sim \mathrm{Bernoulli}(1-p) \text{ i.i.d.}, \qquad \tilde{\mathbf{a}} = \frac{\mathbf{m} \odot \mathbf{a}}{1-p}\]
The division by \((1-p)\) is inverted dropout: it rescales the surviving activations so that the expected value of \(\tilde{\mathbf{a}}\) equals \(\mathbf{a}\) regardless of \(p\). At inference time, no mask is applied and no scaling is needed. Dropout acts as an approximate ensemble of \(2^n\) sub-networks, reducing overfitting.
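Inverted dropout is a few lines; the check below confirms that the \(1/(1-p)\) rescaling keeps the expected activation unchanged (a sketch; the keep mask is drawn as Bernoulli\((1-p)\)):

```python
import numpy as np

def dropout(a, p, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return a                          # inference: identity, no rescaling needed
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(a.shape) >= p       # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(100_000)
out = dropout(a, p=0.5, rng=rng)
print(out.mean())                         # close to 1.0: expectation is preserved
```

With \(p = 0.5\) every surviving unit is scaled to 2.0, so the mean over many units stays near 1.0, and the inference path is a plain identity.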
5. Residual Connections & Gradient Flow
A residual (skip) connection (He et al. 2016) adds the input directly to the output of a block:
\[\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}\]
where \(\mathcal{F}\) is the residual mapping (e.g. two conv layers + BN + ReLU). The gradient of the loss with respect to \(\mathbf{x}\) becomes:
\[\frac{\partial\mathcal{L}}{\partial\mathbf{x}} = \frac{\partial\mathcal{L}}{\partial\mathbf{y}}\left(\mathbf{I} + \frac{\partial\mathcal{F}}{\partial\mathbf{x}}\right)\]
The identity term \(\mathbf{I}\) provides a gradient highway: even if \(\partial\mathcal{F}/\partial\mathbf{x} \approx 0\), the gradient still flows back essentially undiminished, with magnitude close to \(\|\partial\mathcal{L}/\partial\mathbf{y}\|\). This is why ResNets can be trained with 100+ layers.
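The gradient-highway effect can be seen numerically: with a residual branch whose weights are near zero, the downstream gradient is almost exactly the upstream one. A small sketch, using a one-layer tanh branch as an illustrative stand-in for \(\mathcal{F}\):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = 1e-3 * rng.normal(size=(d, d))       # near-zero residual branch: F(x) = tanh(Wx)

x = rng.normal(size=d)
g_up = rng.normal(size=d)                # upstream gradient dL/dy

# Jacobian of y = F(x) + x is I + diag(1 - tanh^2(Wx)) W
J = np.eye(d) + (1.0 - np.tanh(W @ x) ** 2)[:, None] * W
g_down = J.T @ g_up                      # dL/dx

print(np.linalg.norm(g_down - g_up))     # tiny: the identity term carries the gradient
```

Without the skip connection the Jacobian would be just \(\mathrm{diag}(1 - \tanh^2(W\mathbf{x}))\,W\), and with \(W \approx 0\) the downstream gradient would be nearly zero.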
6. Weight Initialisation: Xavier & He
Poor initialisation causes gradients to vanish or explode at the start of training. The goal is to choose the initial variance of weights so that the variance of activations is preserved from layer to layer.
Xavier / Glorot initialisation (linear / tanh)
Consider a linear layer with \(n_\mathrm{in}\) inputs. Assume \(\mathbb{E}[w_i] = 0\), \(\mathrm{Var}[w_i] = \sigma_w^2\), and inputs have variance 1. Then:
\[\mathrm{Var}[z] = \mathrm{Var}\left[\sum_{i=1}^{n_\mathrm{in}} w_i x_i\right] = n_\mathrm{in}\,\sigma_w^2\,\mathrm{Var}[x_i] = n_\mathrm{in}\,\sigma_w^2\]
Setting \(\mathrm{Var}[z] = 1\) requires \(\sigma_w^2 = 1/n_\mathrm{in}\). Accounting for both forward and backward pass (Glorot & Bengio 2010):
\[\sigma_w^2 = \frac{2}{n_\mathrm{in} + n_\mathrm{out}}\]
He initialisation (ReLU)
With ReLU, only half the neurons are active on average, so the effective variance is halved. Compensating (He et al. 2015):
\[\sigma_w^2 = \frac{2}{n_\mathrm{in}}\]
In practice: \(\mathbf{W} \sim \mathcal{N}(0,\, 2/n_\mathrm{in})\) or \(\mathrm{Uniform}(-\sqrt{6/n_\mathrm{in}},\, \sqrt{6/n_\mathrm{in}})\).
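The effect of the initialisation variance can be measured by pushing activations through a deep stack of random ReLU layers. A sketch with square layers (so the forward-only Xavier condition reduces to \(1/n\)); depth and width are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 128, 40
x = rng.normal(size=(500, n))

results = {}
for name, var_w in [("Xavier 1/n", 1.0 / n), ("He 2/n", 2.0 / n)]:
    a = x.copy()
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(var_w), size=(n, n))
        a = np.maximum(0.0, a @ W)       # ReLU halves the variance on average
    results[name] = a.std()
    print(f"{name}: activation std after {depth} ReLU layers = {results[name]:.3e}")
```

With \(1/n\) the activation scale collapses by roughly \((1/2)^{\mathrm{depth}/2}\); with \(2/n\) it stays of order one, which is exactly the He correction for the halved ReLU variance.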
7. Python: Deep Networks with & without BatchNorm
We train a \([2 \to 32 \to 32 \to 16 \to 1]\) network on a concentric rings dataset, comparing training curves with and without Batch Normalisation and visualising the learned decision boundary.
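The experiment can be sketched in plain NumPy. The condensed version below uses a single hidden layer rather than the full \([2 \to 32 \to 32 \to 16 \to 1]\) stack, trains by full-batch gradient descent, and writes the BatchNorm forward and backward steps out by hand; all hyperparameters are illustrative assumptions, not the chapter's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rings(n=500):
    """Concentric rings: inner disc = class 0, outer annulus = class 1."""
    r = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(1.5, 2.5, n)])
    t = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
    X = np.stack([r * np.cos(t), r * np.sin(t)], axis=1)
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, use_bn, hidden=32, lr=0.3, steps=3000, eps=1e-5):
    m, n_in = X.shape
    W1 = rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, hidden))  # He init
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=np.sqrt(2.0 / hidden), size=(hidden, 1))
    b2 = np.zeros(1)
    gamma, beta = np.ones(hidden), np.zeros(hidden)
    for _ in range(steps):
        # ---- forward ----
        z1 = X @ W1 + b1
        if use_bn:
            mu, var = z1.mean(0), z1.var(0)
            zh = (z1 - mu) / np.sqrt(var + eps)
            h = gamma * zh + beta
        else:
            h = z1
        a1 = np.maximum(0.0, h)                     # ReLU
        p = sigmoid(a1 @ W2 + b2).ravel()           # P(class 1)
        # ---- backward (binary cross-entropy) ----
        dlogit = ((p - y) / m).reshape(-1, 1)
        dW2, db2 = a1.T @ dlogit, dlogit.sum(0)
        dh = (dlogit @ W2.T) * (h > 0)              # through ReLU
        if use_bn:
            dgamma, dbeta = (dh * zh).sum(0), dh.sum(0)
            dz1 = gamma / (m * np.sqrt(var + eps)) * (m * dh - dbeta - zh * dgamma)
            gamma -= lr * dgamma
            beta -= lr * dbeta
        else:
            dz1 = dh
        W1 -= lr * (X.T @ dz1)
        b1 -= lr * dz1.sum(0)
        W2 -= lr * dW2
        b2 -= lr * db2
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, np.mean((p > 0.5) == y)

X, y = make_rings()
for use_bn in (False, True):
    loss, acc = train(X, y, use_bn)
    print(f"BatchNorm={use_bn}: final loss {loss:.3f}, train accuracy {acc:.3f}")
```

For a visual check of the learned decision boundary, evaluate the trained network on a grid over \([-3, 3]^2\) and contour the predicted probability at 0.5.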