Optimization Theory
Training a machine learning model is an optimization problem. This chapter derives the tools (gradients, Hessians, convexity, gradient descent and its variants, and constrained optimization via KKT conditions) that explain how and why learning works.
1. Gradient, Hessian, and Jacobian
For a scalar function \(f : \mathbb{R}^n \to \mathbb{R}\), the gradient is the vector of partial derivatives:
\[
\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^{\top}
\]
The gradient points in the direction of steepest ascent; \(-\nabla f\) is the direction of steepest descent. The Hessian is the symmetric matrix of second derivatives:
\[
\mathbf{H}(\mathbf{x})_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}
\]
For a vector-valued function \(\mathbf{g} : \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian is:
\[
\mathbf{J}_{ij} = \frac{\partial g_i}{\partial x_j}, \qquad \mathbf{J} \in \mathbb{R}^{m \times n}
\]
Taylor Expansion
The second-order Taylor expansion of \(f\) around \(\mathbf{x}_0\) is:
\[
f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^{\top}(\mathbf{x} - \mathbf{x}_0) + \frac{1}{2}(\mathbf{x} - \mathbf{x}_0)^{\top} \mathbf{H}(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0)
\]
This expansion is the foundation of gradient descent (first-order) and Newton's method (second-order).
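A minimal NumPy sketch checking this expansion numerically; the function \(f(x,y) = x^2 + 3y^2 + \sin(xy)\) is an illustrative choice, not from the text. The second-order approximation error should shrink like \(\|\boldsymbol{\delta}\|^3\):

```python
import numpy as np

# Illustrative smooth function with hand-derived gradient and Hessian.
def f(v):
    x, y = v
    return x**2 + 3*y**2 + np.sin(x * y)

def grad(v):
    x, y = v
    return np.array([2*x + y*np.cos(x*y), 6*y + x*np.cos(x*y)])

def hess(v):
    x, y = v
    c, s = np.cos(x*y), np.sin(x*y)
    return np.array([[2 - y**2 * s, c - x*y*s],
                     [c - x*y*s,   6 - x**2 * s]])

x0 = np.array([0.5, -0.3])
d = np.array([1e-2, -2e-2])                      # small displacement
taylor2 = f(x0) + grad(x0) @ d + 0.5 * d @ hess(x0) @ d
err = abs(f(x0 + d) - taylor2)                   # third-order remainder
print(err)                                       # tiny: O(||d||^3)
```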
2. Convexity
A function \(f\) is convex if for all \(\mathbf{x}, \mathbf{y}\) and \(t \in [0,1]\):
\[
f(t\mathbf{x} + (1-t)\mathbf{y}) \leq t f(\mathbf{x}) + (1-t) f(\mathbf{y})
\]
Second-order condition: a twice-differentiable \(f\) is convex if and only if \(\mathbf{H}(\mathbf{x}) \succeq 0\) (positive semi-definite) for all \(\mathbf{x}\).
Convexity guarantees that every local minimum is a global minimum. This is why convex losses (squared error, cross-entropy, hinge) are preferred where possible: with a suitable step size, gradient descent converges to a global optimum.
A strictly convex function has a unique global minimum; a merely convex function may have a flat set of minimizers, all attaining the same value. Non-convex functions can have local minima where gradient descent gets trapped.
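The convexity inequality can be spot-checked numerically. A small sketch using \(f(\mathbf{x}) = \|\mathbf{x}\|^2\) (whose Hessian \(2\mathbf{I}\) is positive definite, so the inequality must hold):

```python
import numpy as np

# Empirically verify f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y)
# for the convex function f(x) = ||x||^2 at random points.
rng = np.random.default_rng(0)
f = lambda v: float(v @ v)

ok = True
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    t = rng.uniform()
    # small tolerance guards against floating-point rounding
    ok &= f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) + 1e-12
print(ok)  # True
```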
3. Gradient Descent: Full Derivation
We want to decrease \(f(\mathbf{x})\). The first-order Taylor expansion says:
\[
f(\mathbf{x} + \boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^{\top} \boldsymbol{\delta}
\]
To minimize this over \(\boldsymbol{\delta}\) under the step budget \(\|\boldsymbol{\delta}\| = \eta\), the optimal direction is \(\boldsymbol{\delta}^* = -\eta \,\nabla f(\mathbf{x}) / \|\nabla f(\mathbf{x})\|\). Dropping the normalization and taking \(\boldsymbol{\delta} = -\eta\nabla f\) (a step proportional to the gradient magnitude) gives the gradient descent update:
\[
\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \, \nabla f(\mathbf{x}_t)
\]
Learning rate analysis: for an \(L\)-smooth convex \(f\) (Hessian eigenvalues \(\leq L\)), choosing \(\eta \leq 1/L\) guarantees monotone descent:
\[
f(\mathbf{x}_{t+1}) \leq f(\mathbf{x}_t) - \frac{\eta}{2}\,\|\nabla f(\mathbf{x}_t)\|^2
\]
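A short sketch of this guarantee on a quadratic, where \(L\) is the largest Hessian eigenvalue; the matrix \(A\) is an illustrative choice. With \(\eta = 1/L\), every step must decrease \(f\):

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x, gradient A x, Hessian A (constant).
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

L = np.linalg.eigvalsh(A).max()     # smoothness constant L = 3
eta = 1.0 / L                       # step size satisfying eta <= 1/L

x = np.array([1.0, 2.0])
vals = [f(x)]
for _ in range(50):
    x = x - eta * grad(x)
    vals.append(f(x))

# Monotone descent: f never increases along the trajectory.
print(all(b <= a + 1e-12 for a, b in zip(vals, vals[1:])))  # True
```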
Momentum (Heavy Ball)
Momentum accumulates a velocity vector to damp oscillations:
\[
\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla f(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \, \mathbf{v}_{t+1}
\]
with momentum coefficient \(\beta \in [0, 1)\), typically \(0.9\).
Adam Optimizer: Derivation
Adam (Adaptive Moment Estimation) maintains per-parameter first and second moment estimates. At step \(t\), with gradient \(\mathbf{g}_t\):
\[
\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\,\mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\,\mathbf{g}_t^2
\]
\[
\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}, \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta\,\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \varepsilon}
\]
The bias correction ensures that \(\hat{\mathbf{m}}_1 = \mathbf{g}_1\) exactly at initialization, when \(\mathbf{m}_0 = \mathbf{0}\). Typical values: \(\beta_1 = 0.9, \beta_2 = 0.999, \varepsilon = 10^{-8}, \eta = 10^{-3}\).
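A one-step sketch verifying the bias correction (the gradient value is an arbitrary illustration). Without correction, \(\mathbf{m}_1 = (1-\beta_1)\mathbf{g}_1\) is shrunk toward zero; dividing by \(1-\beta_1^1\) recovers \(\mathbf{g}_1\) exactly:

```python
import numpy as np

beta1, beta2, eps, eta = 0.9, 0.999, 1e-8, 1e-3
g1 = np.array([0.4, -2.0])            # first gradient (illustrative)

# One Adam step from m_0 = v_0 = 0.
m = (1 - beta1) * g1                  # biased toward 0 by factor (1 - beta1)
v = (1 - beta2) * g1**2
m_hat = m / (1 - beta1**1)            # bias-corrected: exactly g1 at t = 1
v_hat = v / (1 - beta2**1)            # exactly g1**2 at t = 1
step = eta * m_hat / (np.sqrt(v_hat) + eps)

print(np.allclose(m_hat, g1))         # True
print(step)                           # roughly eta * sign(g1): per-parameter scaling
```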
Stochastic Gradient Descent (SGD)
For large datasets, computing \(\nabla f = \frac{1}{n}\sum_i \nabla \ell_i\) is expensive. SGD approximates it with a mini-batch \(\mathcal{B} \subset \{1,\ldots,n\}\):
\[
\nabla f(\mathbf{x}) \approx \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla \ell_i(\mathbf{x})
\]
With a diminishing learning-rate schedule satisfying \(\sum_t \eta_t = \infty\) and \(\sum_t \eta_t^2 < \infty\) (the Robbins-Monro conditions), SGD converges in expectation for convex objectives.
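A mini-batch SGD sketch on least squares with the schedule \(\eta_t = 0.5/t\), which satisfies both conditions; the data, batch size, and schedule constants are illustrative assumptions:

```python
import numpy as np

# Synthetic linear regression: y = X w_true + small noise.
rng = np.random.default_rng(1)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

w = np.zeros(d)
batch = 32
for t in range(1, 2001):
    idx = rng.choice(n, size=batch, replace=False)       # sample a mini-batch
    g = X[idx].T @ (X[idx] @ w - y[idx]) / batch         # mini-batch gradient
    w -= (0.5 / t) * g                                   # eta_t = 0.5 / t
print(np.round(w, 2))   # should land near w_true
```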
4. Lagrange Multipliers & KKT Conditions
Consider the constrained problem:
\[
\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \leq 0,\; i = 1,\ldots,m, \qquad h_j(\mathbf{x}) = 0,\; j = 1,\ldots,p
\]
Form the Lagrangian by introducing multipliers \(\alpha_i \geq 0\) for inequalities and \(\nu_j\) for equalities:
\[
\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\nu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \alpha_i\, g_i(\mathbf{x}) + \sum_{j=1}^{p} \nu_j\, h_j(\mathbf{x})
\]
The KKT conditions are necessary (and for convex problems, sufficient) for optimality:
\[
\nabla_{\mathbf{x}} \mathcal{L} = \mathbf{0} \;\text{(stationarity)}, \qquad g_i(\mathbf{x}) \leq 0,\; h_j(\mathbf{x}) = 0 \;\text{(primal feasibility)},
\]
\[
\alpha_i \geq 0 \;\text{(dual feasibility)}, \qquad \alpha_i\, g_i(\mathbf{x}) = 0 \;\text{(complementary slackness)}
\]
Complementary slackness says that either the constraint is active (\(g_i = 0\)) or its multiplier is zero (\(\alpha_i = 0\)). This will be crucial for the SVM dual derivation in Chapter 6.
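A worked 1-D sketch (the problem is an illustrative choice): minimize \(f(x) = x^2\) subject to \(g(x) = 1 - x \leq 0\). The unconstrained minimum \(x = 0\) is infeasible, so the constraint is active at the optimum \(x^* = 1\), and stationarity fixes \(\alpha^* = 2\):

```python
# KKT check for: minimize x^2  subject to  1 - x <= 0.
f_grad = lambda x: 2 * x          # gradient of the objective
g = lambda x: 1 - x               # inequality constraint g(x) <= 0
g_grad = lambda x: -1.0           # gradient of the constraint

x_star, alpha_star = 1.0, 2.0     # alpha solves 2*x* + alpha*(-1) = 0

stationarity = f_grad(x_star) + alpha_star * g_grad(x_star)
print(stationarity == 0.0)                 # stationarity holds
print(g(x_star) <= 0)                      # primal feasibility
print(alpha_star >= 0)                     # dual feasibility
print(alpha_star * g(x_star) == 0.0)       # complementary slackness (active)
```

Here the constraint is active (\(g = 0\)) with \(\alpha^* = 2 > 0\); if instead the constraint were inactive at the optimum, complementary slackness would force \(\alpha^* = 0\).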
Python: GD vs Momentum vs Adam on Rosenbrock
The Rosenbrock function \(f(x,y) = (1-x)^2 + 100(y-x^2)^2\) is a classic non-convex benchmark with a narrow curved valley. We compare gradient descent, momentum, and Adam, showing how adaptive methods dramatically outperform vanilla GD.
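A self-contained NumPy sketch of this comparison; the step sizes, momentum coefficient, and iteration count are illustrative choices (exact trajectories depend on these hyperparameters):

```python
import numpy as np

# Rosenbrock function and its analytic gradient.
def f(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def grad(p):
    x, y = p
    return np.array([-2*(1 - x) - 400*x*(y - x**2),
                     200 * (y - x**2)])

def gd(x0, eta=5e-4, steps=20000):
    x = np.array(x0, float)
    for _ in range(steps):
        x -= eta * grad(x)
    return x

def momentum(x0, eta=2e-4, beta=0.9, steps=20000):
    x, v = np.array(x0, float), np.zeros(2)
    for _ in range(steps):
        v = beta * v + grad(x)        # accumulate velocity
        x -= eta * v
    return x

def adam(x0, eta=1e-2, b1=0.9, b2=0.999, eps=1e-8, steps=20000):
    x = np.array(x0, float)
    m, v = np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        x -= eta * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    return x

x0 = (-1.5, 1.5)
results = {name: opt(x0) for name, opt in
           [("GD", gd), ("Momentum", momentum), ("Adam", adam)]}
for name, x in results.items():
    print(f"{name:8s} f = {f(x):.3e} at {np.round(x, 3)}")
```

All three should descend into the valley quickly; progress along the valley toward the minimum at \((1, 1)\) is where the methods differ.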
Key Takeaways
- The gradient is the steepest-ascent direction; gradient descent takes steps of \(-\eta\nabla f\), derived from the first-order Taylor expansion.
- Convexity (positive semi-definite Hessian) guarantees every local minimum is global, so gradient descent finds the true solution.
- Momentum reduces oscillations by accumulating a velocity; Adam adds per-parameter adaptive learning rates via bias-corrected moment estimates.
- KKT conditions are the necessary (and for convex problems, sufficient) conditions for constrained optimality, the backbone of the SVM dual.
- Complementary slackness \(\alpha_i g_i = 0\) identifies which constraints are active at the solution.