Convolutional Neural Networks
Convolutional networks exploit the spatial structure of images through local connectivity and parameter sharing. This chapter derives the convolution operation rigorously, analyses translation equivariance, and traces the architectural evolution from LeNet to EfficientNet.
1. Discrete Convolution
The 1D discrete convolution of signal \(x\) with filter (kernel) \(h\) is:

\[ (x * h)[n] = \sum_{k} x[k]\, h[n-k] \]

In deep learning, we commonly use cross-correlation (the filter is not flipped), but the operation is still called convolution by convention:

\[ (x \star h)[n] = \sum_{k} x[n+k]\, h[k] \]
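The flip is the only difference between the two operations. A minimal NumPy sketch (the signal and kernel values are illustrative) makes this concrete:

```python
import numpy as np

def cross_corr1d(x, h):
    """Cross-correlation: slide h over x without flipping (what DL frameworks compute)."""
    n_out = len(x) - len(h) + 1
    return np.array([np.dot(x[i:i + len(h)], h) for i in range(n_out)])

def conv1d(x, h):
    """True convolution: flip the kernel, then cross-correlate."""
    return cross_corr1d(x, h[::-1])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([1.0, 0.0, -1.0])

print(cross_corr1d(x, h))  # matches np.correlate(x, h, mode='valid')
print(conv1d(x, h))        # matches np.convolve(x, h, mode='valid')
```

For a symmetric kernel the two operations coincide, which is one reason the naming convention causes little harm in practice.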
2D Convolution for Images
For a 2D input \(\mathbf{X} \in \mathbb{R}^{H \times W}\) and kernel \(\mathbf{K} \in \mathbb{R}^{K_h \times K_w}\):

\[ Y[i, j] = \sum_{a=0}^{K_h - 1} \sum_{b=0}^{K_w - 1} X[i s + a,\; j s + b]\, K[a, b], \]

where \(s\) is the stride. With zero-padding \(p\), the output size is:

\[ H_{\text{out}} = \left\lfloor \frac{H + 2p - K_h}{s} \right\rfloor + 1, \qquad W_{\text{out}} = \left\lfloor \frac{W + 2p - K_w}{s} \right\rfloor + 1. \]

Common choices: valid padding (\(p = 0\), output shrinks) and same padding (\(p = (K-1)/2\) for \(s = 1\), output size preserved).
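The output-size formula is worth encoding as a one-line helper; the example sizes below (including a 224-pixel input with a 7×7 stride-2 kernel) are illustrative:

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one spatial axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(5, 3))              # valid padding: 5 -> 3
print(conv_output_size(5, 3, p=1))         # same padding:  5 -> 5
print(conv_output_size(224, 7, p=3, s=2))  # strided conv: 224 -> 112
```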
2. Convolution Operation Illustrated
Applying a 3×3 vertical-edge (Sobel-x) kernel to a 5×5 input (valid padding, stride 1) yields a 3×3 feature map. Purple region in the figure: the receptive field for output element (0, 0).
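The figure's exact input values are not reproduced here, so the sketch below uses a synthetic 5×5 image with a vertical edge; the Sobel-x kernel responds strongly exactly where the intensity changes:

```python
import numpy as np

def cross_corr2d(X, K):
    """Valid 2D cross-correlation, stride 1."""
    Hk, Wk = K.shape
    Y = np.zeros((X.shape[0] - Hk + 1, X.shape[1] - Wk + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i + Hk, j:j + Wk] * K)
    return Y

# synthetic 5x5 input: dark on the left, bright on the right
X = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
print(cross_corr2d(X, sobel_x))  # strong response in the edge columns, zero elsewhere
```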
3. Parameter Sharing & Translation Equivariance
In a fully connected layer mapping an \(H \times W\) input to an \(H' \times W'\) output, the weight matrix has \(H \cdot W \cdot H' \cdot W'\) parameters. A convolutional layer with \(C\) filters of size \(K_h \times K_w\) has only \(C \cdot K_h \cdot K_w\) parameters, independent of the spatial size. The same kernel is applied at every spatial location: this is parameter sharing.
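The savings are easy to quantify. Taking an illustrative 32×32 input mapped to a 32×32 output, with 32 filters of size 3×3 (these sizes are assumptions, not from the text):

```python
# Fully connected 32x32 -> 32x32 mapping vs. a conv layer with 32 filters of 3x3
H = W = Hp = Wp = 32
fc_params = H * W * Hp * Wp   # one weight per (input pixel, output pixel) pair
conv_params = 32 * 3 * 3      # C * Kh * Kw, independent of spatial size
print(fc_params, conv_params, fc_params // conv_params)  # roughly 3600x fewer
```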
Translation Equivariance
A function \(f\) is equivariant to translation \(T_\delta\) if:

\[ f(T_\delta x) = T_\delta f(x) \]
Convolution satisfies this: shifting the input shifts the feature map by the same amount. This is why CNNs detect features wherever they appear in the image. Note: max-pooling introduces approximate translation invariance (small shifts don't change the output).
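Equivariance can be verified numerically. The sketch below uses circular cross-correlation so that shifts wrap around cleanly at the boundary (with valid-mode correlation the identity holds only in the interior); the signal values are illustrative:

```python
import numpy as np

def circ_corr1d(x, h):
    """Circular 1D cross-correlation: indices wrap modulo len(x)."""
    n = len(x)
    return np.array([sum(x[(i + k) % n] * h[k] for k in range(len(h)))
                     for i in range(n)])

x = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 0.0])
h = np.array([1.0, -1.0])

shifted = np.roll(x, 2)               # T_delta x  (shift input by 2)
lhs = circ_corr1d(shifted, h)         # f(T_delta x)
rhs = np.roll(circ_corr1d(x, h), 2)   # T_delta f(x)
print(np.allclose(lhs, rhs))          # -> True: the feature map shifts with the input
```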
Pooling Layers
Pooling reduces spatial dimensions, providing spatial compression and limited translation invariance.
Max Pooling
Selects the most activated feature in each region. Gradient flows only to the maximally activated unit.
Average Pooling
Smoother; gradient distributes uniformly. Global average pooling replaces flatten + FC layers in modern architectures.
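Both pooling variants reduce to the same sliding-window loop; only the reduction differs. A minimal sketch with an assumed 4×4 input:

```python
import numpy as np

def pool2d(X, size=2, stride=2, mode="max"):
    """2D pooling: take the max (or mean) over each size x size window."""
    Ho = (X.shape[0] - size) // stride + 1
    Wo = (X.shape[1] - size) // stride + 1
    Y = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            window = X[i * stride:i * stride + size, j * stride:j * stride + size]
            Y[i, j] = window.max() if mode == "max" else window.mean()
    return Y

X = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(X, mode="max"))  # [[ 5.  7.] [13. 15.]]
print(pool2d(X, mode="avg"))  # [[ 2.5  4.5] [10.5 12.5]]
```

Global average pooling is the `size = H = W` special case: one scalar per channel, which is what replaces the flatten + FC head in modern architectures.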
4. Backprop Through Convolution
Given upstream gradient \(\partial\mathcal{L}/\partial\mathbf{Y}\), we need gradients with respect to both the kernel and the input.
Gradient w.r.t. kernel \(\mathbf{K}\)

\[ \frac{\partial \mathcal{L}}{\partial K[a, b]} = \sum_{i, j} \frac{\partial \mathcal{L}}{\partial Y[i, j]}\, X[i s + a,\; j s + b] \]

This is itself a (strided) cross-correlation of the input with the upstream gradient.
Gradient w.r.t. input \(\mathbf{X}\)

For stride 1:

\[ \frac{\partial \mathcal{L}}{\partial X[m, n]} = \sum_{a, b} \frac{\partial \mathcal{L}}{\partial Y[m - a,\; n - b]}\, K[a, b] \]

This is a full convolution (with the flipped kernel) of the upstream gradient; equivalently, a transposed convolution (deconvolution).
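Both gradients can be verified against a finite-difference check. The sketch below assumes stride 1, a 5×5 input, a 3×3 kernel, and the scalar loss \(\mathcal{L} = \sum_{i,j} Y[i,j]\, \partial\mathcal{L}/\partial Y[i,j]\) (all values are random and illustrative):

```python
import numpy as np

def cross_corr2d(X, K):
    """Valid 2D cross-correlation, stride 1."""
    Hk, Wk = K.shape
    Y = np.zeros((X.shape[0] - Hk + 1, X.shape[1] - Wk + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i + Hk, j:j + Wk] * K)
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))
K = rng.standard_normal((3, 3))
dY = rng.standard_normal((3, 3))  # upstream gradient dL/dY

# dL/dK: cross-correlate the input with the upstream gradient
dK = cross_corr2d(X, dY)

# dL/dX: full convolution of dY with the flipped kernel
# (zero-pad dY by Kh-1 = Kw-1 = 2 on each side, then cross-correlate)
dY_pad = np.pad(dY, 2)
dX = cross_corr2d(dY_pad, K[::-1, ::-1])

# finite-difference check of one kernel entry
eps = 1e-6
a, b = 1, 2
Kp = K.copy(); Kp[a, b] += eps
num = (np.sum(cross_corr2d(X, Kp) * dY) - np.sum(cross_corr2d(X, K) * dY)) / eps
print(np.isclose(dK[a, b], num, atol=1e-4))  # -> True
```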
5. Modern CNN Architectures
LeNet-5 (LeCun 1998)
First successful CNN. Two conv layers (5×5, tanh), two avg-pool layers, three FC layers. ~60K params. Digit recognition on MNIST.
AlexNet (Krizhevsky 2012)
Won ImageNet 2012 by a large margin. 5 conv + 3 FC, ReLU activations, Dropout, data augmentation. ~60M params. Launched the deep learning era.
VGGNet (Simonyan 2015)
All 3×3 kernels, deeper (16–19 layers). Key insight: two stacked 3×3 convolutions have the same receptive field as one 5×5 but with fewer parameters and an extra nonlinearity.
ResNet (He 2016)
Residual connections allow training 50–152 layers. Won ImageNet 2015. Introduced BatchNorm as standard. Still widely used as backbone.
EfficientNet (Tan & Le 2019)
Compound scaling: jointly scale depth, width, and resolution by a fixed ratio derived from a neural architecture search baseline. State-of-the-art accuracy/efficiency tradeoff.
6. Python: 2D Convolution from Scratch
We implement 2D cross-correlation in pure NumPy (no deep-learning libraries), apply four kernels (horizontal edge, vertical edge, blur, sharpen) to a synthetic image, and visualise the feature maps and max-pooled outputs.
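The original interactive code is not reproduced here; the following is a minimal sketch of the pipeline described above (the 16×16 synthetic image and the specific kernel values are assumptions, and plotting is replaced by printed shapes):

```python
import numpy as np

def cross_corr2d(X, K):
    """Valid 2D cross-correlation, stride 1."""
    Hk, Wk = K.shape
    Y = np.zeros((X.shape[0] - Hk + 1, X.shape[1] - Wk + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i + Hk, j:j + Wk] * K)
    return Y

def max_pool2d(X, size=2):
    """Non-overlapping max pooling via reshape."""
    Ho, Wo = X.shape[0] // size, X.shape[1] // size
    return X[:Ho * size, :Wo * size].reshape(Ho, size, Wo, size).max(axis=(1, 3))

# synthetic image: a bright square on a dark background
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0

kernels = {
    "horizontal edge": np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=float),
    "vertical edge":   np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float),
    "blur":            np.ones((3, 3)) / 9.0,
    "sharpen":         np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
}

for name, K in kernels.items():
    fmap = cross_corr2d(img, K)
    pooled = max_pool2d(fmap)
    print(f"{name:16s} feature map {fmap.shape}, max-pooled {pooled.shape}")
```

With a 16×16 input and 3×3 kernels, each feature map is 14×14 and each pooled map is 7×7, matching the output-size formula from Section 1.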