Perceptrons & Backpropagation
We build neural networks from first principles: starting with the single perceptron, assembling multilayer feedforward networks, and deriving the backpropagation algorithm rigorously from the chain rule. Every gradient formula is proved, not assumed.
1. The Single Perceptron
A single perceptron takes a real-valued input vector \(\mathbf{x} \in \mathbb{R}^d\), computes a linear combination, then passes the result through a nonlinear activation function:
\[ z = \mathbf{w}^\top \mathbf{x} + b, \qquad a = \sigma(z), \]
where \(\mathbf{w} \in \mathbb{R}^d\) are the weights, \(b \in \mathbb{R}\) is the bias, and \(\sigma\) is the activation (e.g. sigmoid, ReLU). For binary classification we predict \(\hat{y} = \mathbf{1}[a \geq 0.5]\).
Perceptron learning rule (Rosenblatt 1958)
Update \(\mathbf{w} \leftarrow \mathbf{w} + \eta\,(y - \hat{y})\,\mathbf{x}\) after each sample. This is a precursor to gradient descent: for a linearly separable dataset the algorithm converges in a finite number of steps (perceptron convergence theorem).
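The update rule can be sketched in a few lines of NumPy; the AND-gate dataset and learning rate below are illustrative choices, not from the text:

```python
import numpy as np

# Toy linearly separable dataset (AND gate); eta and the data are illustrative.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b, eta = np.zeros(2), 0.0, 1.0

for epoch in range(10):
    for xi, yi in zip(X, y):
        y_hat = 1 if xi @ w + b >= 0 else 0   # sigma(z) >= 0.5 iff z >= 0
        w += eta * (yi - y_hat) * xi          # Rosenblatt update
        b += eta * (yi - y_hat)

preds = [1 if xi @ w + b >= 0 else 0 for xi in X]
print(preds)   # [0, 0, 0, 1]
```

Because AND is linearly separable, the convergence theorem guarantees the loop stops making updates after finitely many passes.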
Universal Approximation Theorem
Theorem (Cybenko 1989; Hornik 1991). Let \(\sigma\) be any continuous, bounded, non-constant function (e.g. sigmoid). For any continuous function \(f : [0,1]^d \to \mathbb{R}\) and any \(\varepsilon > 0\), there exist weights \(\mathbf{W}, \mathbf{v}, \mathbf{b}\) and a width \(N\) such that the two-layer network
\[ g(\mathbf{x}) = \sum_{i=1}^{N} v_i\, \sigma(\mathbf{w}_i^\top \mathbf{x} + b_i), \]
where \(\mathbf{w}_i\) are the rows of \(\mathbf{W}\), satisfies \(\sup_{\mathbf{x}} |f(\mathbf{x}) - g(\mathbf{x})| < \varepsilon\). This guarantees that a sufficiently wide shallow network can represent any continuous function, but it gives no guidance on how to find the weights or on whether depth helps.
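As a quick numerical illustration (not a proof): fix random hidden weights and biases, then solve for the output weights \(\mathbf{v}\) by least squares; the maximum error on a grid shrinks as the width \(N\) grows. The target function and weight scales below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)[:, None]       # grid on [0, 1]
f = np.sin(2 * np.pi * x[:, 0])           # target function (illustrative)

def shallow_fit(N):
    """Width-N shallow net: random sigmoid features + least-squares output weights."""
    W = rng.normal(0, 10, (N, 1))         # hidden weights (random, then frozen)
    b = rng.uniform(-10, 10, N)           # hidden biases
    H = 1 / (1 + np.exp(-(x @ W.T + b)))  # hidden activations, shape (200, N)
    v, *_ = np.linalg.lstsq(H, f, rcond=None)
    return np.max(np.abs(H @ v - f))      # max error on the grid

errs = {N: shallow_fit(N) for N in (5, 50, 500)}
for N, e in errs.items():
    print(N, float(e))                    # error typically shrinks as N grows
```

Only the output weights are optimized here; the theorem permits choosing the hidden weights too, so this understates what width \(N\) can achieve.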
2. Feedforward Network: Forward Pass
Stack \(L\) layers. Denote activations as \(\mathbf{a}^{(0)} = \mathbf{x}\) (the input). For each layer \(l = 1, \dots, L\):
\[ \mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)}), \]
where \(\mathbf{W}^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}\) and \(\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}\). The network output is \(\hat{\mathbf{y}} = \mathbf{a}^{(L)}\).
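A minimal sketch of this recursion in NumPy (the layer widths and random weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, params):
    """Run the forward pass; params is a list of (W, b) pairs, one per layer."""
    a = x                      # a^(0) = x
    for W, b in params:
        z = W @ a + b          # z^(l) = W^(l) a^(l-1) + b^(l)
        a = sigmoid(z)         # a^(l) = sigma(z^(l))
    return a                   # a^(L) = y_hat

rng = np.random.default_rng(0)
sizes = [3, 4, 2]              # n_0 = 3 inputs, one hidden layer, 2 outputs
params = [(rng.normal(0, 1, (m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y_hat = forward(rng.normal(0, 1, 3), params)
print(y_hat.shape)             # (2,)
```

For training, the intermediate \(\mathbf{z}^{(l)}\) and \(\mathbf{a}^{(l)}\) would also be cached, since backpropagation reuses them.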
Loss Function
For binary classification with sigmoid output, the binary cross-entropy over \(m\) samples is:
\[ \mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big) \Big]. \]
For multiclass we use softmax output and categorical cross-entropy: \(\mathcal{L} = -\frac{1}{m}\sum_i \sum_k y_k^{(i)} \log \hat{y}_k^{(i)}\).
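Both losses in a few lines (a sketch; the small `eps` clip guarding against \(\log 0\) is our addition, as is the max-shift in the softmax for numerical stability):

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy, averaged over m samples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def categorical_ce(Y, Y_hat, eps=1e-12):
    """Categorical cross-entropy; Y is one-hot with shape (m, K)."""
    return -np.mean(np.sum(Y * np.log(np.clip(Y_hat, eps, 1.0)), axis=1))

y = np.array([1.0, 0.0, 1.0])
print(bce(y, np.array([0.9, 0.1, 0.8])))   # small loss for confident, correct outputs
```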
3. Backpropagation: Full Derivation
We want to compute \(\partial \mathcal{L}/\partial \mathbf{W}^{(l)}\) and \(\partial \mathcal{L}/\partial \mathbf{b}^{(l)}\) for all \(l\). The key idea is to define an error signal (delta) at each layer, then propagate it backwards.
Step 1: Output layer error \(\boldsymbol{\delta}^{(L)}\)
Define \(\delta_j^{(L)} := \partial \mathcal{L} / \partial z_j^{(L)}\). By the chain rule:
\[ \delta_j^{(L)} = \frac{\partial \mathcal{L}}{\partial a_j^{(L)}}\, \sigma'\big(z_j^{(L)}\big). \]
For BCE loss with sigmoid: \(\partial \mathcal{L}/\partial a_j^{(L)} = -(y_j/a_j - (1-y_j)/(1-a_j))\) and \(\sigma'(z) = \sigma(z)(1-\sigma(z))\), which simplifies to \(\boldsymbol{\delta}^{(L)} = \mathbf{a}^{(L)} - \mathbf{y}\).
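The simplification can be sanity-checked numerically: a finite-difference estimate of \(\partial\mathcal{L}/\partial z\) should match \(a - y\) (the test point below is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(y, a):
    """Per-sample binary cross-entropy."""
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, h = 0.7, 1.0, 1e-6                 # arbitrary test point
# Central finite-difference estimate of dL/dz through the sigmoid:
num = (bce(y, sigmoid(z + h)) - bce(y, sigmoid(z - h))) / (2 * h)
ana = sigmoid(z) - y                     # claimed: delta^(L) = a - y
print(abs(num - ana))                    # close to zero
```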
Step 2: Backpropagate error: \(\boldsymbol{\delta}^{(l)}\) from \(\boldsymbol{\delta}^{(l+1)}\)
\(z_j^{(l)}\) affects \(\mathcal{L}\) through all neurons \(z_k^{(l+1)}\) in the next layer. By the chain rule:
\[ \delta_j^{(l)} = \sum_k \frac{\partial \mathcal{L}}{\partial z_k^{(l+1)}} \frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}} = \sum_k \delta_k^{(l+1)}\, \frac{\partial z_k^{(l+1)}}{\partial a_j^{(l)}}\, \sigma'\big(z_j^{(l)}\big). \]
Since \(z_k^{(l+1)} = \sum_j W_{kj}^{(l+1)} a_j^{(l)} + b_k^{(l+1)}\), we have \(\partial z_k^{(l+1)}/\partial a_j^{(l)} = W_{kj}^{(l+1)}\). Therefore:
\[ \delta_j^{(l)} = \sigma'\big(z_j^{(l)}\big) \sum_k W_{kj}^{(l+1)} \delta_k^{(l+1)}, \qquad \text{i.e.}\quad \boldsymbol{\delta}^{(l)} = \big(\mathbf{W}^{(l+1)}\big)^\top \boldsymbol{\delta}^{(l+1)} \odot \sigma'\big(\mathbf{z}^{(l)}\big), \]
where \(\odot\) denotes elementwise (Hadamard) product. This is the recursive rule that propagates error backwards through all layers.
Step 3: Gradients with respect to weights and biases
Since \(z_j^{(l)} = \sum_k W_{jk}^{(l)} a_k^{(l-1)} + b_j^{(l)}\):
\[ \frac{\partial \mathcal{L}}{\partial W_{jk}^{(l)}} = \delta_j^{(l)}\, a_k^{(l-1)}, \qquad \frac{\partial \mathcal{L}}{\partial b_j^{(l)}} = \delta_j^{(l)}. \]
In matrix form (averaging over the batch of size \(m\)):
\[ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{1}{m}\, \boldsymbol{\delta}^{(l)} \big(\mathbf{a}^{(l-1)}\big)^\top, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \frac{1}{m} \sum_{i=1}^{m} \boldsymbol{\delta}^{(l,i)}. \]
Backpropagation Algorithm Summary
- Forward pass: compute \(\mathbf{z}^{(l)}, \mathbf{a}^{(l)}\) for \(l=1,\dots,L\), cache all values.
- Output delta: compute \(\boldsymbol{\delta}^{(L)} = \partial\mathcal{L}/\partial\mathbf{z}^{(L)}\).
- Backward pass: for \(l = L-1, \dots, 1\): \(\boldsymbol{\delta}^{(l)} = (\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)} \odot \sigma'(\mathbf{z}^{(l)})\).
- Weight gradients: \(\partial\mathcal{L}/\partial\mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^\top / m\).
- Update: \(\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta\,\partial\mathcal{L}/\partial\mathbf{W}^{(l)}\).
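The steps above can be sketched end to end, with a finite-difference check on a single weight to confirm the analytic gradient (layer sizes, data, and the checked index are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, params):
    """Cache pre-activations z and activations a for every layer; columns are samples."""
    zs, As = [], [X]
    for W, b in params:
        z = W @ As[-1] + b[:, None]
        zs.append(z)
        As.append(sigmoid(z))
    return zs, As

def backprop(X, Y, params):
    """Gradients of the mean BCE loss for an all-sigmoid network."""
    m = X.shape[1]
    _, As = forward(X, params)
    delta = As[-1] - Y                                # output delta (sigmoid + BCE)
    grads = []
    for l in reversed(range(len(params))):
        grads.append((delta @ As[l].T / m,            # dL/dW^(l)
                      delta.mean(axis=1)))            # dL/db^(l)
        if l > 0:                                     # recursion: delta^(l) from delta^(l+1)
            delta = (params[l][0].T @ delta) * As[l] * (1 - As[l])
    return grads[::-1]

def loss(X, Y, params):
    _, As = forward(X, params)
    a = np.clip(As[-1], 1e-12, 1 - 1e-12)
    return -np.mean(Y * np.log(a) + (1 - Y) * np.log(1 - a))

rng = np.random.default_rng(1)
sizes = [2, 3, 1]
params = [[rng.normal(0, 1, (mo, mi)), np.zeros(mo)]
          for mi, mo in zip(sizes[:-1], sizes[1:])]
X = rng.normal(0, 1, (2, 5))
Y = rng.integers(0, 2, (1, 5)).astype(float)

grads = backprop(X, Y, params)
h = 1e-6                                              # finite-difference check on W^(1)[0,0]
params[0][0][0, 0] += h; Lp = loss(X, Y, params)
params[0][0][0, 0] -= 2 * h; Lm = loss(X, Y, params)
params[0][0][0, 0] += h
err = abs((Lp - Lm) / (2 * h) - grads[0][0][0, 0])
print(err)                                            # close to zero
```

This kind of gradient check is the standard way to validate a hand-written backward pass before trusting it for training.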
Computational Graph Perspective
The network defines a directed acyclic graph (DAG) where each node computes a differentiable operation. Backpropagation is simply reverse-mode automatic differentiation on this DAG: it propagates \(\partial \mathcal{L}/\partial \text{output}\) backwards, multiplying by local Jacobians at each node. Modern frameworks (PyTorch, JAX) implement this automatically; understanding the manual derivation above is the key to debugging gradient flow.
4. Neural Network Diagram
Three-layer feedforward network. Purple arrows: forward activations. Pink dashed arc: backward error signal \(\boldsymbol{\delta}\).
5. Python: Backprop from Scratch on XOR
We implement a \([2 \to 4 \to 4 \to 1]\) network with sigmoid activations and binary cross-entropy loss, solving XOR entirely from scratch in NumPy. We plot loss convergence and the learned decision boundary.
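A minimal NumPy version of the described \([2 \to 4 \to 4 \to 1]\) network (plotting omitted; the learning rate, epoch count, and seed are our choices):

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# XOR dataset: columns are samples (shape: features x m).
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float)

sizes = [2, 4, 4, 1]
W = [rng.normal(0, 1, (mo, mi)) for mi, mo in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((mo, 1)) for mo in sizes[1:]]
eta, m = 1.0, X.shape[1]
losses = []

for epoch in range(5000):
    # Forward pass, caching all activations.
    As = [X]
    for Wl, bl in zip(W, b):
        As.append(sigmoid(Wl @ As[-1] + bl))
    a = np.clip(As[-1], 1e-12, 1 - 1e-12)
    losses.append(-np.mean(Y * np.log(a) + (1 - Y) * np.log(1 - a)))
    # Backward pass: delta^(L) = a^(L) - Y for sigmoid + BCE.
    delta = As[-1] - Y
    for l in reversed(range(len(W))):
        dW = delta @ As[l].T / m
        db = delta.sum(axis=1, keepdims=True) / m
        if l > 0:                         # propagate using W[l] before updating it
            delta = (W[l].T @ delta) * As[l] * (1 - As[l])
        W[l] -= eta * dW
        b[l] -= eta * db

print("final loss:", losses[-1])
print(np.round(As[-1], 2))   # predictions; should approach [[0, 1, 1, 0]]
```

The `losses` list is what the described loss-convergence plot would show; evaluating the trained network on a grid of points would give the decision boundary.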