Perceptrons & Backpropagation
We build neural networks from first principles: starting with the single perceptron, assembling multilayer feedforward networks, and deriving the backpropagation algorithm rigorously from the chain rule. Every gradient formula is proved, not assumed.
1. The Single Perceptron
A single perceptron takes a real-valued input vector \(\mathbf{x} \in \mathbb{R}^d\), computes a linear combination, then passes the result through a nonlinear activation function:
\[ z = \mathbf{w}^\top \mathbf{x} + b, \qquad a = \sigma(z), \]
where \(\mathbf{w} \in \mathbb{R}^d\) are the weights, \(b \in \mathbb{R}\) is the bias, and \(\sigma\) is the activation (e.g. sigmoid, ReLU). For binary classification we predict \(\hat{y} = \mathbf{1}[a \geq 0.5]\).
Perceptron learning rule (Rosenblatt 1958)
Update \(\mathbf{w} \leftarrow \mathbf{w} + \eta\,(y - \hat{y})\,\mathbf{x}\) after each sample. This is a precursor to gradient descent: for a linearly separable dataset the algorithm converges in a finite number of steps (perceptron convergence theorem).
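The update rule can be sketched in a few lines of NumPy; the AND-gate dataset and learning rate below are illustrative choices, not from the text:

```python
import numpy as np

# Toy linearly separable dataset (AND gate); eta and the data are illustrative.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b, eta = np.zeros(2), 0.0, 1.0

for epoch in range(10):
    for xi, yi in zip(X, y):
        y_hat = 1 if xi @ w + b >= 0 else 0   # sigma(z) >= 0.5 iff z >= 0
        w += eta * (yi - y_hat) * xi          # Rosenblatt update
        b += eta * (yi - y_hat)

preds = [1 if xi @ w + b >= 0 else 0 for xi in X]
print(preds)   # [0, 0, 0, 1]
```

Because AND is linearly separable, the convergence theorem guarantees the loop stops making updates after finitely many passes.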
Universal Approximation Theorem
Theorem (Cybenko 1989; Hornik 1991). Let \(\sigma\) be any continuous, bounded, non-constant function (e.g. sigmoid). For any continuous function \(f : [0,1]^d \to \mathbb{R}\) and any \(\varepsilon > 0\), there exist weights \(\mathbf{W}, \mathbf{v}, \mathbf{b}\) and a width \(N\) such that the two-layer network
\[ g(\mathbf{x}) = \sum_{i=1}^{N} v_i\, \sigma(\mathbf{w}_i^\top \mathbf{x} + b_i), \]
where \(\mathbf{w}_i\) are the rows of \(\mathbf{W}\), satisfies \(\sup_{\mathbf{x}} |f(\mathbf{x}) - g(\mathbf{x})| < \varepsilon\). This guarantees that a sufficiently wide shallow network can represent any continuous function, but it gives no guidance on how to find the weights or on whether depth helps.
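As a quick numerical illustration (not a proof): fix random hidden weights and biases, then solve for the output weights \(\mathbf{v}\) by least squares; the maximum error on a grid shrinks as the width \(N\) grows. The target function and weight scales below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)[:, None]       # grid on [0, 1]
f = np.sin(2 * np.pi * x[:, 0])           # target function (illustrative)

def shallow_fit(N):
    """Width-N shallow net: random sigmoid features + least-squares output weights."""
    W = rng.normal(0, 10, (N, 1))         # hidden weights (random, then frozen)
    b = rng.uniform(-10, 10, N)           # hidden biases
    H = 1 / (1 + np.exp(-(x @ W.T + b)))  # hidden activations, shape (200, N)
    v, *_ = np.linalg.lstsq(H, f, rcond=None)
    return np.max(np.abs(H @ v - f))      # max error on the grid

errs = {N: shallow_fit(N) for N in (5, 50, 500)}
for N, e in errs.items():
    print(N, float(e))                    # error typically shrinks as N grows
```

Only the output weights are optimized here; the theorem permits choosing the hidden weights too, so this understates what width \(N\) can achieve.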
2. Feedforward Network: Forward Pass
Stack \(L\) layers. Denote activations as \(\mathbf{a}^{(0)} = \mathbf{x}\) (the input). For each layer \(l = 1, \dots, L\):
\[ \mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)}), \]
where \(\mathbf{W}^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}\) and \(\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}\). The network output is \(\hat{\mathbf{y}} = \mathbf{a}^{(L)}\).
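A minimal sketch of this recursion in NumPy (the layer widths and random weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, params):
    """Run the forward pass; params is a list of (W, b) pairs, one per layer."""
    a = x                      # a^(0) = x
    for W, b in params:
        z = W @ a + b          # z^(l) = W^(l) a^(l-1) + b^(l)
        a = sigmoid(z)         # a^(l) = sigma(z^(l))
    return a                   # a^(L) = y_hat

rng = np.random.default_rng(0)
sizes = [3, 4, 2]              # n_0 = 3 inputs, one hidden layer, 2 outputs
params = [(rng.normal(0, 1, (m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y_hat = forward(rng.normal(0, 1, 3), params)
print(y_hat.shape)             # (2,)
```

For training, the intermediate \(\mathbf{z}^{(l)}\) and \(\mathbf{a}^{(l)}\) would also be cached, since backpropagation reuses them.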
Loss Function
For binary classification with sigmoid output, the binary cross-entropy over \(m\) samples is:
\[ \mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big) \Big]. \]
For multiclass we use softmax output and categorical cross-entropy: \(\mathcal{L} = -\frac{1}{m}\sum_i \sum_k y_k^{(i)} \log \hat{y}_k^{(i)}\).
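Both losses in a few lines (a sketch; the small `eps` clip guarding against \(\log 0\) is our addition, as is the max-shift in the softmax for numerical stability):

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy, averaged over m samples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def categorical_ce(Y, Y_hat, eps=1e-12):
    """Categorical cross-entropy; Y is one-hot with shape (m, K)."""
    return -np.mean(np.sum(Y * np.log(np.clip(Y_hat, eps, 1.0)), axis=1))

y = np.array([1.0, 0.0, 1.0])
print(bce(y, np.array([0.9, 0.1, 0.8])))   # small loss for confident, correct outputs
```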
3. Backpropagation: Full Derivation
We want to compute \(\partial \mathcal{L}/\partial \mathbf{W}^{(l)}\) and \(\partial \mathcal{L}/\partial \mathbf{b}^{(l)}\) for all \(l\). The key idea is to define an error signal (delta) at each layer, then propagate it backwards.
Step 1: Output layer error \(\boldsymbol{\delta}^{(L)}\)
Define \(\delta_j^{(L)} := \partial \mathcal{L} / \partial z_j^{(L)}\). By the chain rule:
\[ \delta_j^{(L)} = \frac{\partial \mathcal{L}}{\partial a_j^{(L)}}\, \sigma'\big(z_j^{(L)}\big). \]
For BCE loss with sigmoid: \(\partial \mathcal{L}/\partial a_j^{(L)} = -(y_j/a_j - (1-y_j)/(1-a_j))\) and \(\sigma'(z) = \sigma(z)(1-\sigma(z))\), which simplifies to \(\boldsymbol{\delta}^{(L)} = \mathbf{a}^{(L)} - \mathbf{y}\).
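The simplification can be sanity-checked numerically: a finite-difference estimate of \(\partial\mathcal{L}/\partial z\) should match \(a - y\) (the test point below is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(y, a):
    """Per-sample binary cross-entropy."""
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, h = 0.7, 1.0, 1e-6                 # arbitrary test point
# Central finite-difference estimate of dL/dz through the sigmoid:
num = (bce(y, sigmoid(z + h)) - bce(y, sigmoid(z - h))) / (2 * h)
ana = sigmoid(z) - y                     # claimed: delta^(L) = a - y
print(abs(num - ana))                    # close to zero
```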
Step 2: Backpropagate error: \(\boldsymbol{\delta}^{(l)}\) from \(\boldsymbol{\delta}^{(l+1)}\)
\(z_j^{(l)}\) affects \(\mathcal{L}\) through all neurons \(z_k^{(l+1)}\) in the next layer. By the chain rule:
\[ \delta_j^{(l)} = \sum_k \frac{\partial \mathcal{L}}{\partial z_k^{(l+1)}} \frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}} = \sum_k \delta_k^{(l+1)}\, \frac{\partial z_k^{(l+1)}}{\partial a_j^{(l)}}\, \sigma'\big(z_j^{(l)}\big). \]
Since \(z_k^{(l+1)} = \sum_j W_{kj}^{(l+1)} a_j^{(l)} + b_k^{(l+1)}\), we have \(\partial z_k^{(l+1)}/\partial a_j^{(l)} = W_{kj}^{(l+1)}\). Therefore:
\[ \delta_j^{(l)} = \sigma'\big(z_j^{(l)}\big) \sum_k W_{kj}^{(l+1)} \delta_k^{(l+1)}, \qquad \text{i.e.}\quad \boldsymbol{\delta}^{(l)} = \big(\mathbf{W}^{(l+1)}\big)^\top \boldsymbol{\delta}^{(l+1)} \odot \sigma'\big(\mathbf{z}^{(l)}\big), \]
where \(\odot\) denotes elementwise (Hadamard) product. This is the recursive rule that propagates error backwards through all layers.
Step 3: Gradients with respect to weights and biases
Since \(z_j^{(l)} = \sum_k W_{jk}^{(l)} a_k^{(l-1)} + b_j^{(l)}\):
\[ \frac{\partial \mathcal{L}}{\partial W_{jk}^{(l)}} = \delta_j^{(l)}\, a_k^{(l-1)}, \qquad \frac{\partial \mathcal{L}}{\partial b_j^{(l)}} = \delta_j^{(l)}. \]
In matrix form (averaging over the batch of size \(m\)):
\[ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{1}{m}\, \boldsymbol{\delta}^{(l)} \big(\mathbf{a}^{(l-1)}\big)^\top, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \frac{1}{m} \sum_{i=1}^{m} \boldsymbol{\delta}^{(l,i)}. \]
Backpropagation Algorithm Summary
- Forward pass: compute \(\mathbf{z}^{(l)}, \mathbf{a}^{(l)}\) for \(l=1,\dots,L\), cache all values.
- Output delta: compute \(\boldsymbol{\delta}^{(L)} = \partial\mathcal{L}/\partial\mathbf{z}^{(L)}\).
- Backward pass: for \(l = L-1, \dots, 1\): \(\boldsymbol{\delta}^{(l)} = (\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)} \odot \sigma'(\mathbf{z}^{(l)})\).
- Weight gradients: \(\partial\mathcal{L}/\partial\mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^\top / m\).
- Update: \(\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta\,\partial\mathcal{L}/\partial\mathbf{W}^{(l)}\).
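The steps above can be sketched end to end, with a finite-difference check on a single weight to confirm the analytic gradient (layer sizes, data, and the checked index are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, params):
    """Cache pre-activations z and activations a for every layer; columns are samples."""
    zs, As = [], [X]
    for W, b in params:
        z = W @ As[-1] + b[:, None]
        zs.append(z)
        As.append(sigmoid(z))
    return zs, As

def backprop(X, Y, params):
    """Gradients of the mean BCE loss for an all-sigmoid network."""
    m = X.shape[1]
    _, As = forward(X, params)
    delta = As[-1] - Y                                # output delta (sigmoid + BCE)
    grads = []
    for l in reversed(range(len(params))):
        grads.append((delta @ As[l].T / m,            # dL/dW^(l)
                      delta.mean(axis=1)))            # dL/db^(l)
        if l > 0:                                     # recursion: delta^(l) from delta^(l+1)
            delta = (params[l][0].T @ delta) * As[l] * (1 - As[l])
    return grads[::-1]

def loss(X, Y, params):
    _, As = forward(X, params)
    a = np.clip(As[-1], 1e-12, 1 - 1e-12)
    return -np.mean(Y * np.log(a) + (1 - Y) * np.log(1 - a))

rng = np.random.default_rng(1)
sizes = [2, 3, 1]
params = [[rng.normal(0, 1, (mo, mi)), np.zeros(mo)]
          for mi, mo in zip(sizes[:-1], sizes[1:])]
X = rng.normal(0, 1, (2, 5))
Y = rng.integers(0, 2, (1, 5)).astype(float)

grads = backprop(X, Y, params)
h = 1e-6                                              # finite-difference check on W^(1)[0,0]
params[0][0][0, 0] += h; Lp = loss(X, Y, params)
params[0][0][0, 0] -= 2 * h; Lm = loss(X, Y, params)
params[0][0][0, 0] += h
err = abs((Lp - Lm) / (2 * h) - grads[0][0][0, 0])
print(err)                                            # close to zero
```

This kind of gradient check is the standard way to validate a hand-written backward pass before trusting it for training.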
Computational Graph Perspective
The network defines a directed acyclic graph (DAG) where each node computes a differentiable operation. Backpropagation is simply reverse-mode automatic differentiation on this DAG: it propagates \(\partial \mathcal{L}/\partial \text{output}\) backwards, multiplying by local Jacobians at each node. Modern frameworks (PyTorch, JAX) implement this automatically; understanding the manual derivation above is the key to debugging gradient flow.
4. Neural Network Diagram
Three-layer feedforward network. Purple arrows: forward activations. Pink dashed arc: backward error signal \(\boldsymbol{\delta}\).
5. Python: Backprop from Scratch on XOR
We implement a \([2 \to 4 \to 4 \to 1]\) network with sigmoid activations and binary cross-entropy loss, solving XOR entirely from scratch in NumPy. We plot loss convergence and the learned decision boundary.
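A minimal NumPy version of the described \([2 \to 4 \to 4 \to 1]\) network (plotting omitted; the learning rate, epoch count, and seed are our choices):

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# XOR dataset: columns are samples (shape: features x m).
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float)

sizes = [2, 4, 4, 1]
W = [rng.normal(0, 1, (mo, mi)) for mi, mo in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((mo, 1)) for mo in sizes[1:]]
eta, m = 1.0, X.shape[1]
losses = []

for epoch in range(5000):
    # Forward pass, caching all activations.
    As = [X]
    for Wl, bl in zip(W, b):
        As.append(sigmoid(Wl @ As[-1] + bl))
    a = np.clip(As[-1], 1e-12, 1 - 1e-12)
    losses.append(-np.mean(Y * np.log(a) + (1 - Y) * np.log(1 - a)))
    # Backward pass: delta^(L) = a^(L) - Y for sigmoid + BCE.
    delta = As[-1] - Y
    for l in reversed(range(len(W))):
        dW = delta @ As[l].T / m
        db = delta.sum(axis=1, keepdims=True) / m
        if l > 0:                         # propagate using W[l] before updating it
            delta = (W[l].T @ delta) * As[l] * (1 - As[l])
        W[l] -= eta * dW
        b[l] -= eta * db

print("final loss:", losses[-1])
print(np.round(As[-1], 2))   # predictions; should approach [[0, 1, 1, 0]]
```

The `losses` list is what the described loss-convergence plot would show; evaluating the trained network on a grid of points would give the decision boundary.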