Part I · Chapter 3

Optimization Theory

Training a machine learning model is an optimisation problem. This chapter derives the tools (gradients, Hessians, convexity, gradient descent and its variants, and constrained optimisation via KKT conditions) that explain how and why learning works.

1. Gradient, Hessian, and Jacobian

For a scalar function \(f : \mathbb{R}^n \to \mathbb{R}\), the gradient is the vector of partial derivatives:

\[ \nabla f(\mathbf{x}) = \begin{pmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{pmatrix} \in \mathbb{R}^n \]

The gradient points in the direction of steepest ascent; \(-\nabla f\) is the direction of steepest descent. The Hessian is the symmetric matrix of second derivatives:

\[ \mathbf{H}(\mathbf{x}) = \nabla^2 f(\mathbf{x}),\quad H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} \]

For a vector-valued function \(\mathbf{g} : \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian is:

\[ \mathbf{J} = \frac{\partial \mathbf{g}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}, \quad J_{ij} = \frac{\partial g_i}{\partial x_j} \]
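These definitions are easy to sanity-check numerically. A short sketch (the function \(f(x) = x_1^2 + 3x_1x_2\) and the test point are arbitrary choices for illustration, not from the text) compares an analytic gradient against central differences:

```python
import numpy as np

def f(x):
    # Illustrative scalar function f(x) = x1^2 + 3*x1*x2
    return x[0]**2 + 3.0 * x[0] * x[1]

def grad_f(x):
    # Analytic gradient: (2*x1 + 3*x2, 3*x1)
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, 2.0])
print(grad_f(x))                 # [8. 3.]
print(numerical_gradient(f, x))  # ~[8. 3.]
```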

Taylor Expansion

The second-order Taylor expansion of \(f\) around \(\mathbf{x}_0\) is:

\[ f(\mathbf{x}_0 + \boldsymbol{\delta}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top \boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^\top \mathbf{H}(\mathbf{x}_0)\boldsymbol{\delta} \]

This expansion is the foundation of gradient descent (first-order) and Newton's method (second-order).
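A quick numerical check of the expansion (using the arbitrary smooth test function \(f(x, y) = e^{x+y}\), whose gradient and Hessian have every entry equal to \(e^{x+y}\)):

```python
import numpy as np

def f(v):
    return np.exp(v[0] + v[1])

x0 = np.array([0.1, 0.2])
s = np.exp(x0.sum())
grad = s * np.ones(2)       # gradient: e^{x+y} * (1, 1)
H = s * np.ones((2, 2))     # Hessian: every entry e^{x+y}

delta = np.array([0.01, -0.02])
exact = f(x0 + delta)
first = f(x0) + grad @ delta                  # first-order approximation
second = first + 0.5 * delta @ H @ delta      # second-order approximation

print(abs(exact - first))   # O(||delta||^2) error
print(abs(exact - second))  # O(||delta||^3) error, much smaller
```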

2. Convexity

A function \(f\) is convex if for all \(\mathbf{x}, \mathbf{y}\) and \(t \in [0,1]\):

\[ f(t\mathbf{x} + (1-t)\mathbf{y}) \leq t f(\mathbf{x}) + (1-t)f(\mathbf{y}) \]

Second-order condition: a twice-differentiable \(f\) is convex if and only if \(\mathbf{H}(\mathbf{x}) \succeq 0\) (positive semi-definite) for all \(\mathbf{x}\).

Convexity guarantees that every local minimum is a global minimum. This is why convex losses (squared error, cross-entropy, hinge) are preferred: with a suitable learning rate, gradient descent converges to the global optimum.

[Figure: two loss surfaces side by side. Left, a convex loss: the chord always lies above the curve and there is a single global minimum. Right, a non-convex loss with two local minima and a global minimum, where GD may get stuck.]

Convex functions have a unique global minimum; non-convex functions have local minima where gradient descent can get trapped.
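The second-order condition can be checked numerically. As an illustrative sketch (the matrix \(\mathbf{A}\) below is random), the least-squares loss \(f(\mathbf{x}) = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2\) has Hessian \(2\mathbf{A}^\top\mathbf{A}\), which is positive semi-definite for any \(\mathbf{A}\), so the loss is always convex:

```python
import numpy as np

# Hessian of the least-squares loss ||Ax - b||^2 is 2 A^T A,
# which is positive semi-definite for any A.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
H = 2 * A.T @ A
eigs = np.linalg.eigvalsh(H)
print(eigs)  # all >= 0 (up to round-off) => the loss is convex
assert np.all(eigs >= -1e-10)
```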

3. Gradient Descent: Full Derivation

We want to decrease \(f(\mathbf{x})\). The first-order Taylor expansion says:

\[ f(\mathbf{x} + \boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \boldsymbol{\delta} \]

To minimise this over \(\boldsymbol{\delta}\) with the constraint \(\|\boldsymbol{\delta}\| = \eta\) (step budget), the optimal direction is \(\boldsymbol{\delta}^* = -\eta \,\nabla f(\mathbf{x}) / \|\nabla f(\mathbf{x})\|\). Dropping the normalisation so that the step scales with the gradient magnitude, \(\boldsymbol{\delta} = -\eta\nabla f\), gives the gradient descent update:

\[ \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \,\nabla f(\mathbf{x}_t) \]

Learning rate analysis: For \(L\)-smooth convex \(f\) (Hessian eigenvalues \(\leq L\)), choosing \(\eta \leq 1/L\) guarantees monotone descent:

\[ f(\mathbf{x}_{t+1}) \leq f(\mathbf{x}_t) - \frac{\eta}{2}\|\nabla f(\mathbf{x}_t)\|^2 \]
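The descent guarantee can be verified on a toy convex quadratic \(f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top\mathbf{Q}\mathbf{x}\) (the matrix \(\mathbf{Q}\) and starting point below are arbitrary illustrative choices; \(L\) is \(\mathbf{Q}\)'s largest eigenvalue):

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
L = np.linalg.eigvalsh(Q).max()          # smoothness constant
eta = 1.0 / L

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

x = np.array([5.0, -3.0])
for _ in range(20):
    g = grad(x)
    x_new = x - eta * g
    # Descent guarantee: f drops by at least (eta/2) * ||grad||^2 per step
    assert f(x_new) <= f(x) - 0.5 * eta * g @ g + 1e-12
    x = x_new
print(f(x))  # close to the minimum f(0) = 0
```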

Momentum (Heavy Ball)

Momentum accumulates a velocity vector to damp oscillations:

\[ \mathbf{v}_{t+1} = \beta\,\mathbf{v}_t + (1-\beta)\nabla f(\mathbf{x}_t), \quad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta\,\mathbf{v}_{t+1} \]

Adam Optimizer: Derivation

Adam (Adaptive Moment Estimation) maintains per-parameter first and second moment estimates. At step \(t\):

\[ \mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t \quad \text{(1st moment: mean)} \]
\[ \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2 \quad \text{(2nd moment: uncentred variance)} \]
\[ \hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t} \quad \text{(bias correction)} \]
\[ \mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \varepsilon}\,\hat{\mathbf{m}}_t \]

The bias correction ensures that \(\hat{\mathbf{m}}_1 \approx \mathbf{g}_1\) even at initialisation when \(\mathbf{m}_0 = \mathbf{0}\). Typical values: \(\beta_1 = 0.9, \beta_2 = 0.999, \varepsilon = 10^{-8}, \eta = 10^{-3}\).
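A one-step sketch of these updates (the gradient values are arbitrary) confirms that bias correction recovers \(\hat{\mathbf{m}}_1 = \mathbf{g}_1\) exactly, and that the very first Adam step has magnitude close to \(\eta\) per coordinate:

```python
import numpy as np

beta1, beta2, eps, eta = 0.9, 0.999, 1e-8, 1e-3
m = v = np.zeros(2)

g = np.array([0.3, -1.2])   # first gradient (arbitrary values)
t = 1
m = beta1 * m + (1 - beta1) * g          # m_1 = (1 - beta1) * g_1
v = beta2 * v + (1 - beta2) * g**2
m_hat = m / (1 - beta1**t)               # bias correction cancels (1 - beta1)
v_hat = v / (1 - beta2**t)
print(m_hat)   # equals g exactly
step = eta * m_hat / (np.sqrt(v_hat) + eps)
print(step)    # ~ eta * sign(g): sqrt(v_hat) = |g|, so the ratio is +/-1
```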

Stochastic Gradient Descent (SGD)

For large datasets, computing \(\nabla f = \frac{1}{n}\sum_i \nabla \ell_i\) is expensive. SGD approximates it with a mini-batch \(\mathcal{B} \subset \{1,\ldots,n\}\):

\[ \mathbf{x}_{t+1} = \mathbf{x}_t - \eta_t \,\frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \nabla \ell_i(\mathbf{x}_t) \]

With a diminishing learning rate schedule satisfying \(\sum_t \eta_t = \infty\) and \(\sum_t \eta_t^2 < \infty\) (the Robbins–Monro conditions), SGD converges for convex objectives.
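A minimal mini-batch SGD sketch on synthetic least-squares data (the schedule \(\eta_t = 0.1\,t^{-0.6}\) is one choice satisfying the two conditions; the data, batch size, and constants are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(n)   # noisy linear targets

w = np.zeros(d)
batch = 32
for t in range(1, 2001):
    idx = rng.choice(n, size=batch, replace=False)
    # gradient of (1/|B|) * sum (x_i^T w - y_i)^2 over the mini-batch
    g = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
    # eta_t = 0.1 * t^{-0.6}: sum eta_t diverges, sum eta_t^2 converges
    w -= 0.1 * t**-0.6 * g
print(w)  # close to w_true
```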

4. Lagrange Multipliers & KKT Conditions

Consider the constrained problem:

\[ \min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad g_i(\mathbf{x}) \leq 0,\ i=1,\ldots,m \quad \text{and} \quad h_j(\mathbf{x}) = 0,\ j=1,\ldots,p \]

Form the Lagrangian by introducing multipliers \(\alpha_i \geq 0\) for inequalities and \(\nu_j\) for equalities:

\[ \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\nu}) = f(\mathbf{x}) + \sum_{i=1}^m \alpha_i g_i(\mathbf{x}) + \sum_{j=1}^p \nu_j h_j(\mathbf{x}) \]

The KKT conditions are necessary (and for convex problems, sufficient) for optimality:

Stationarity
\[ \nabla_{\mathbf{x}}\mathcal{L} = \nabla f(\mathbf{x}^*) + \sum_i \alpha_i \nabla g_i(\mathbf{x}^*) + \sum_j \nu_j \nabla h_j(\mathbf{x}^*) = \mathbf{0} \]
Primal Feasibility
\[ g_i(\mathbf{x}^*) \leq 0, \quad h_j(\mathbf{x}^*) = 0 \]
Dual Feasibility
\[ \alpha_i \geq 0 \]
Complementary Slackness
\[ \alpha_i g_i(\mathbf{x}^*) = 0 \quad \forall i \]

Complementary slackness says that either the constraint is active (\(g_i = 0\)) or its multiplier is zero (\(\alpha_i = 0\)). This will be crucial for the SVM dual derivation in Chapter 6.
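A tiny worked example (a made-up problem, not one from the text): minimise \(x^2 + y^2\) subject to \(x + y \geq 1\), i.e. \(g(x,y) = 1 - x - y \leq 0\). The unconstrained minimum \((0,0)\) is infeasible, so the constraint is active at \(\mathbf{x}^* = (\tfrac12, \tfrac12)\) with multiplier \(\alpha = 1\), and all four KKT conditions can be checked directly:

```python
import numpy as np

x_star = np.array([0.5, 0.5])
alpha = 1.0

grad_f = 2 * x_star              # gradient of x^2 + y^2 is (2x, 2y)
grad_g = np.array([-1.0, -1.0])  # gradient of g = 1 - x - y
g_val = 1 - x_star.sum()

assert np.allclose(grad_f + alpha * grad_g, 0)  # stationarity
assert g_val <= 1e-12                           # primal feasibility
assert alpha >= 0                               # dual feasibility
assert abs(alpha * g_val) < 1e-12               # complementary slackness
print("KKT conditions hold at (0.5, 0.5) with alpha = 1")
```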

Python: GD vs Momentum vs Adam on Rosenbrock

The Rosenbrock function \(f(x,y) = (1-x)^2 + 100(y-x^2)^2\) is a classic non-convex benchmark with a narrow curved valley. We compare gradient descent, momentum, and Adam, showing how adaptive methods dramatically outperform vanilla GD.

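The interactive script is not reproduced here; a self-contained sketch of the comparison (the starting point, step counts, and learning rates are illustrative choices, not tuned values) might look like:

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

def run_gd(steps=5000, eta=1e-3):
    p = np.array([-1.0, 1.0])
    for _ in range(steps):
        p -= eta * rosenbrock_grad(p)
    return p

def run_momentum(steps=5000, eta=1e-3, beta=0.9):
    p, v = np.array([-1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        v = beta * v + (1 - beta) * rosenbrock_grad(p)
        p -= eta * v
    return p

def run_adam(steps=5000, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    p = np.array([-1.0, 1.0])
    m, v = np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        g = rosenbrock_grad(p)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
        p -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return p

for name, run in [("GD", run_gd), ("Momentum", run_momentum), ("Adam", run_adam)]:
    p = run()
    print(f"{name:8s} -> point {p}, loss {rosenbrock(p):.2e}")
```

Vanilla GD needs a small learning rate to remain stable across the valley's steep walls and therefore crawls along the flat valley floor; Adam's per-coordinate scaling lets it make steady progress in both directions.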

Key Takeaways

  • ✓ The gradient is the steepest ascent direction; gradient descent takes steps of \(-\eta\nabla f\) derived from the first-order Taylor expansion.
  • ✓ Convexity (positive semi-definite Hessian) guarantees every local minimum is global, so gradient descent finds the true solution.
  • ✓ Momentum reduces oscillations by accumulating a velocity; Adam adds per-parameter adaptive learning rates via bias-corrected moment estimates.
  • ✓ KKT conditions are the necessary (and for convex problems, sufficient) conditions for constrained optimality, and the backbone of the SVM dual.
  • ✓ Complementary slackness \(\alpha_i g_i = 0\) identifies which constraints are active at the solution.