Deep Neural Networks
Depth brings representational power but also new failure modes: vanishing and exploding gradients. This chapter derives the engineering innovations (Batch Normalisation, Dropout, residual connections, and principled weight initialisation) that make training very deep networks reliable.
1. Activation Functions and Their Derivatives
The choice of activation function determines gradient flow, expressivity, and training stability.
ReLU (Rectified Linear Unit)
\[\mathrm{ReLU}(z) = \max(0, z), \qquad \mathrm{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}\]
No vanishing gradient for positive inputs. Dead-neuron problem: if \(z < 0\) always, the gradient is permanently zero.
Leaky ReLU
\[f(z) = \begin{cases} z & z > 0 \\ \alpha z & z \leq 0 \end{cases}, \qquad f'(z) = \begin{cases} 1 & z > 0 \\ \alpha & z \leq 0 \end{cases}\]
Typical \(\alpha = 0.01\). Fixes dead neurons: the gradient is always non-zero.
GELU (Gaussian Error Linear Unit)
\[\mathrm{GELU}(z) = z\,\Phi(z), \qquad \Phi(z) = \mathbb{P}(Z \leq z),\; Z \sim \mathcal{N}(0,1)\]
Used in BERT and GPT. Smooth, non-monotone near zero, differentiable everywhere.
Swish (Ramachandran et al. 2017)
\[\mathrm{swish}(z) = z\,\sigma(\beta z)\]
Self-gated; discovered via neural architecture search. Used in EfficientNet.
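These four activations and their derivatives can be sketched in a few lines of NumPy. This is a minimal illustration; the GELU here uses the tanh approximation common in transformer codebases rather than the exact \(z\,\Phi(z)\):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)        # exactly 0 for z < 0: the dead-neuron regime

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)  # never zero, so units cannot die

def gelu(z):
    # tanh approximation of z * Phi(z), as used in many transformer implementations
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

def swish(z, beta=1.0):
    return z / (1.0 + np.exp(-beta * z))  # z * sigmoid(beta * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (relu, leaky_relu, gelu, swish):
    print(f.__name__, np.round(f(z), 4))
```

Note that all four agree with the identity for large positive \(z\); they differ only in how they treat negative inputs.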
2. Vanishing & Exploding Gradients
Consider a deep network with sigmoid activations and identical weights \(w\) per layer. The gradient of the loss with respect to layer \(l\)'s pre-activation is:
\[\frac{\partial \mathcal{L}}{\partial z^{(l)}} = \frac{\partial \mathcal{L}}{\partial z^{(L)}} \prod_{k=l}^{L-1} w\,\sigma'(z^{(k)})\]
The sigmoid derivative satisfies \(\sigma'(z) \leq \frac{1}{4}\) for all \(z\). Therefore, for the \(L - l\) layers above layer \(l\):
\[\left|\frac{\partial \mathcal{L}}{\partial z^{(l)}}\right| \leq \left|\frac{\partial \mathcal{L}}{\partial z^{(L)}}\right| \left(\frac{|w|}{4}\right)^{L-l}\]
If \(|w| < 4\), this factor decays exponentially with depth: the vanishing gradient problem. If \(|w| > 4\) and the activations stay near zero, where \(\sigma'(z) \approx 1/4\), the per-layer factor \(|w|\,\sigma'(z)\) exceeds 1 and gradients grow exponentially: the exploding gradient problem. ReLU has derivative 1 for positive inputs, eliminating the \(1/4\) factor, which is why it became the default.
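The decay and growth are easy to tabulate numerically. The sketch below takes the most favourable case for the gradient, \(\sigma'(z) = 1/4\) (the sigmoid derivative's maximum, attained at \(z = 0\)), so the per-layer factor is exactly \(|w|/4\):

```python
depth = 50  # layers between layer l and the loss
# sigma'(z) <= 1/4, so these are the largest the cumulative factors can be
factors = {w: (w * 0.25) ** depth for w in (1.0, 3.9, 5.0)}
for w, f in factors.items():
    print(f"|w| = {w}: gradient factor over {depth} layers = {f:.3e}")
```

Even at \(|w| = 3.9\), just below the threshold, the factor still shrinks; at \(|w| = 1\) it is astronomically small, and at \(|w| = 5\) it explodes.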
3. Batch Normalisation: Full Derivation
Batch Normalisation (Ioffe & Szegedy 2015) normalises the pre-activations within each mini-batch, then applies a learnable affine transformation to restore representational capacity.
Forward pass (training)
For a mini-batch \(\mathcal{B} = \{z_1,\dots,z_m\}\) at a given layer and feature dimension:
\[\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^m z_i, \qquad \sigma^2_\mathcal{B} = \frac{1}{m}\sum_{i=1}^m (z_i - \mu_\mathcal{B})^2\]
\[\hat{z}_i = \frac{z_i - \mu_\mathcal{B}}{\sqrt{\sigma^2_\mathcal{B} + \varepsilon}}, \qquad y_i = \gamma\,\hat{z}_i + \beta\]
\(\gamma, \beta\) are learned parameters. Running statistics \(\mu_\mathrm{run}, \sigma^2_\mathrm{run}\) are maintained with momentum for inference.
Backward pass through BatchNorm
Given upstream gradient \(d\mathbf{y} = \partial\mathcal{L}/\partial\mathbf{y}\):
\[\frac{\partial\mathcal{L}}{\partial\gamma} = \sum_{i=1}^m \frac{\partial\mathcal{L}}{\partial y_i}\,\hat{z}_i, \qquad \frac{\partial\mathcal{L}}{\partial\beta} = \sum_{i=1}^m \frac{\partial\mathcal{L}}{\partial y_i}\]
\[\frac{\partial\mathcal{L}}{\partial z_i} = \frac{\gamma}{m\sqrt{\sigma^2_\mathcal{B} + \varepsilon}}\left(m\,\frac{\partial\mathcal{L}}{\partial y_i} - \sum_{j=1}^m \frac{\partial\mathcal{L}}{\partial y_j} - \hat{z}_i \sum_{j=1}^m \frac{\partial\mathcal{L}}{\partial y_j}\,\hat{z}_j\right)\]
Why does BatchNorm help?
It reduces internal covariate shift: the distribution of each layer's inputs stays roughly \(\mathcal{N}(0,1)\) before the affine correction, keeping activations in the linear regime of saturating activations and allowing larger learning rates.
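The forward and backward passes can be implemented directly in NumPy and verified against a finite-difference gradient. A minimal sketch, assuming arrays of shape (batch, features) and the biased batch variance, matching the derivation:

```python
import numpy as np

def bn_forward(z, gamma, beta, eps=1e-5):
    """z: (m, d) pre-activations for one mini-batch."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)                    # biased estimate, as in the derivation
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta, (z_hat, var, eps, gamma)

def bn_backward(dy, cache):
    z_hat, var, eps, gamma = cache
    m = dy.shape[0]
    dgamma = (dy * z_hat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    # closed-form gradient through the normalisation, per feature
    dz = gamma / (m * np.sqrt(var + eps)) * (m * dy - dbeta - z_hat * dgamma)
    return dz, dgamma, dbeta

# finite-difference check of dz on a random batch, with loss L = sum(r * y)
rng = np.random.default_rng(1)
z = rng.normal(size=(8, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)
r = rng.normal(size=(8, 3))
_, cache = bn_forward(z, gamma, beta)
dz, _, _ = bn_backward(r, cache)

h = 1e-5
zp, zm = z.copy(), z.copy()
zp[0, 0] += h
zm[0, 0] -= h
num = ((bn_forward(zp, gamma, beta)[0] - bn_forward(zm, gamma, beta)[0]) * r).sum() / (2 * h)
print(abs(num - dz[0, 0]))   # small: analytic gradient matches the numerical one
```

The non-obvious part is that every \(z_i\) influences \(\mu_\mathcal{B}\) and \(\sigma^2_\mathcal{B}\), so each \(\partial\mathcal{L}/\partial z_i\) contains batch-wide sums; the finite-difference check catches exactly this coupling.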
4. Dropout
Dropout (Srivastava et al. 2014) randomly zeroes activations during training using a Bernoulli mask:
\[m_i \sim \mathrm{Bernoulli}(1-p) \text{ i.i.d.}, \qquad \tilde{\mathbf{a}} = \frac{\mathbf{m} \odot \mathbf{a}}{1-p}\]
The division by \((1-p)\) is inverted dropout: it rescales the surviving activations so that the expected value of \(\tilde{\mathbf{a}}\) equals \(\mathbf{a}\) regardless of \(p\). At inference time, no mask is applied and no scaling is needed. Dropout acts as an approximate ensemble of \(2^n\) sub-networks, reducing overfitting.
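Inverted dropout is a few lines; the check below confirms that the \(1/(1-p)\) rescaling keeps the expected activation unchanged (a sketch; the keep mask is drawn as Bernoulli\((1-p)\)):

```python
import numpy as np

def dropout(a, p, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return a                          # inference: identity, no rescaling needed
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(a.shape) >= p       # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(100_000)
out = dropout(a, p=0.5, rng=rng)
print(out.mean())                         # close to 1.0: expectation is preserved
```

With \(p = 0.5\) every surviving unit is scaled to 2.0, so the mean over many units stays near 1.0, and the inference path is a plain identity.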
5. Residual Connections & Gradient Flow
A residual (skip) connection (He et al. 2016) adds the input directly to the output of a block:
\[\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}\]
where \(\mathcal{F}\) is the residual mapping (e.g. two conv layers + BN + ReLU). The gradient of the loss with respect to \(\mathbf{x}\) becomes:
\[\frac{\partial\mathcal{L}}{\partial\mathbf{x}} = \frac{\partial\mathcal{L}}{\partial\mathbf{y}}\left(\mathbf{I} + \frac{\partial\mathcal{F}}{\partial\mathbf{x}}\right)\]
The identity term \(\mathbf{I}\) provides a gradient highway: even if \(\partial\mathcal{F}/\partial\mathbf{x} \approx 0\), the gradient still flows back essentially undiminished, with magnitude close to \(\|\partial\mathcal{L}/\partial\mathbf{y}\|\). This is why ResNets can be trained with 100+ layers.
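The gradient-highway effect can be seen numerically: with a residual branch whose weights are near zero, the downstream gradient is almost exactly the upstream one. A small sketch, using a one-layer tanh branch as an illustrative stand-in for \(\mathcal{F}\):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = 1e-3 * rng.normal(size=(d, d))       # near-zero residual branch: F(x) = tanh(Wx)

x = rng.normal(size=d)
g_up = rng.normal(size=d)                # upstream gradient dL/dy

# Jacobian of y = F(x) + x is I + diag(1 - tanh^2(Wx)) W
J = np.eye(d) + (1.0 - np.tanh(W @ x) ** 2)[:, None] * W
g_down = J.T @ g_up                      # dL/dx

print(np.linalg.norm(g_down - g_up))     # tiny: the identity term carries the gradient
```

Without the skip connection the Jacobian would be just \(\mathrm{diag}(1 - \tanh^2(W\mathbf{x}))\,W\), and with \(W \approx 0\) the downstream gradient would be nearly zero.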
6. Weight Initialisation: Xavier & He
Poor initialisation causes gradients to vanish or explode at the start of training. The goal is to choose the initial variance of weights so that the variance of activations is preserved from layer to layer.
Xavier / Glorot initialisation (linear / tanh)
Consider a linear layer with \(n_\mathrm{in}\) inputs. Assume \(\mathbb{E}[w_i] = 0\), \(\mathrm{Var}[w_i] = \sigma_w^2\), and inputs have variance 1. Then:
\[\mathrm{Var}[z] = \mathrm{Var}\left[\sum_{i=1}^{n_\mathrm{in}} w_i x_i\right] = n_\mathrm{in}\,\sigma_w^2\,\mathrm{Var}[x_i] = n_\mathrm{in}\,\sigma_w^2\]
Setting \(\mathrm{Var}[z] = 1\) requires \(\sigma_w^2 = 1/n_\mathrm{in}\). Accounting for both forward and backward pass (Glorot & Bengio 2010):
\[\sigma_w^2 = \frac{2}{n_\mathrm{in} + n_\mathrm{out}}\]
He initialisation (ReLU)
With ReLU, only half the neurons are active on average, so the effective variance is halved. Compensating (He et al. 2015):
\[\sigma_w^2 = \frac{2}{n_\mathrm{in}}\]
In practice: \(\mathbf{W} \sim \mathcal{N}(0,\, 2/n_\mathrm{in})\) or \(\mathrm{Uniform}(-\sqrt{6/n_\mathrm{in}},\, \sqrt{6/n_\mathrm{in}})\).
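The effect of the initialisation variance can be measured by pushing activations through a deep stack of random ReLU layers. A sketch with square layers (so the forward-only Xavier condition reduces to \(1/n\)); depth and width are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 128, 40
x = rng.normal(size=(500, n))

results = {}
for name, var_w in [("Xavier 1/n", 1.0 / n), ("He 2/n", 2.0 / n)]:
    a = x.copy()
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(var_w), size=(n, n))
        a = np.maximum(0.0, a @ W)       # ReLU halves the variance on average
    results[name] = a.std()
    print(f"{name}: activation std after {depth} ReLU layers = {results[name]:.3e}")
```

With \(1/n\) the activation scale collapses by roughly \((1/2)^{\mathrm{depth}/2}\); with \(2/n\) it stays of order one, which is exactly the He correction for the halved ReLU variance.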
7. Python: Deep Networks with & without BatchNorm
We train a \([2 \to 32 \to 32 \to 16 \to 1]\) network on a concentric rings dataset, comparing training curves with and without Batch Normalisation and visualising the learned decision boundary.
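The experiment can be sketched in plain NumPy. The condensed version below uses a single hidden layer rather than the full \([2 \to 32 \to 32 \to 16 \to 1]\) stack, trains by full-batch gradient descent, and writes the BatchNorm forward and backward steps out by hand; all hyperparameters are illustrative assumptions, not the chapter's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rings(n=500):
    """Concentric rings: inner disc = class 0, outer annulus = class 1."""
    r = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(1.5, 2.5, n)])
    t = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
    X = np.stack([r * np.cos(t), r * np.sin(t)], axis=1)
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, use_bn, hidden=32, lr=0.3, steps=3000, eps=1e-5):
    m, n_in = X.shape
    W1 = rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, hidden))  # He init
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=np.sqrt(2.0 / hidden), size=(hidden, 1))
    b2 = np.zeros(1)
    gamma, beta = np.ones(hidden), np.zeros(hidden)
    for _ in range(steps):
        # ---- forward ----
        z1 = X @ W1 + b1
        if use_bn:
            mu, var = z1.mean(0), z1.var(0)
            zh = (z1 - mu) / np.sqrt(var + eps)
            h = gamma * zh + beta
        else:
            h = z1
        a1 = np.maximum(0.0, h)                     # ReLU
        p = sigmoid(a1 @ W2 + b2).ravel()           # P(class 1)
        # ---- backward (binary cross-entropy) ----
        dlogit = ((p - y) / m).reshape(-1, 1)
        dW2, db2 = a1.T @ dlogit, dlogit.sum(0)
        dh = (dlogit @ W2.T) * (h > 0)              # through ReLU
        if use_bn:
            dgamma, dbeta = (dh * zh).sum(0), dh.sum(0)
            dz1 = gamma / (m * np.sqrt(var + eps)) * (m * dh - dbeta - zh * dgamma)
            gamma -= lr * dgamma
            beta -= lr * dbeta
        else:
            dz1 = dh
        W1 -= lr * (X.T @ dz1)
        b1 -= lr * dz1.sum(0)
        W2 -= lr * dW2
        b2 -= lr * db2
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, np.mean((p > 0.5) == y)

X, y = make_rings()
for use_bn in (False, True):
    loss, acc = train(X, y, use_bn)
    print(f"BatchNorm={use_bn}: final loss {loss:.3f}, train accuracy {acc:.3f}")
```

For a visual check of the learned decision boundary, evaluate the trained network on a grid over \([-3, 3]^2\) and contour the predicted probability at 0.5.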