Part I · Chapter 3

Optimization Theory

Training a machine learning model is an optimisation problem. This chapter derives the tools (gradients, Hessians, convexity, gradient descent and its variants, and constrained optimisation via KKT conditions) that explain how and why learning works.

1. Gradient, Hessian, and Jacobian

For a scalar function \(f : \mathbb{R}^n \to \mathbb{R}\), the gradient is the vector of partial derivatives:

\[ \nabla f(\mathbf{x}) = \begin{pmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{pmatrix} \in \mathbb{R}^n \]

The gradient points in the direction of steepest ascent; \(-\nabla f\) is the direction of steepest descent. The Hessian is the symmetric matrix of second derivatives:

\[ \mathbf{H}(\mathbf{x}) = \nabla^2 f(\mathbf{x}),\quad H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} \]

For a vector-valued function \(\mathbf{g} : \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian is:

\[ \mathbf{J} = \frac{\partial \mathbf{g}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}, \quad J_{ij} = \frac{\partial g_i}{\partial x_j} \]
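These definitions are easy to sanity-check numerically. A short sketch (the function \(f(x) = x_1^2 + 3x_1x_2\) and the test point are arbitrary choices for illustration, not from the text) compares an analytic gradient against central differences:

```python
import numpy as np

def f(x):
    # Illustrative scalar function f(x) = x1^2 + 3*x1*x2
    return x[0]**2 + 3.0 * x[0] * x[1]

def grad_f(x):
    # Analytic gradient: (2*x1 + 3*x2, 3*x1)
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, 2.0])
print(grad_f(x))                 # [8. 3.]
print(numerical_gradient(f, x))  # ~[8. 3.]
```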

Taylor Expansion

The second-order Taylor expansion of \(f\) around \(\mathbf{x}_0\) is:

\[ f(\mathbf{x}_0 + \boldsymbol{\delta}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top \boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^\top \mathbf{H}(\mathbf{x}_0)\boldsymbol{\delta} \]

This expansion is the foundation of gradient descent (first-order) and Newton's method (second-order).
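A quick numerical check of the expansion (using the arbitrary smooth test function \(f(x, y) = e^{x+y}\), whose gradient and Hessian have every entry equal to \(e^{x+y}\)):

```python
import numpy as np

def f(v):
    return np.exp(v[0] + v[1])

x0 = np.array([0.1, 0.2])
s = np.exp(x0.sum())
grad = s * np.ones(2)       # gradient: e^{x+y} * (1, 1)
H = s * np.ones((2, 2))     # Hessian: every entry e^{x+y}

delta = np.array([0.01, -0.02])
exact = f(x0 + delta)
first = f(x0) + grad @ delta                  # first-order approximation
second = first + 0.5 * delta @ H @ delta      # second-order approximation

print(abs(exact - first))   # O(||delta||^2) error
print(abs(exact - second))  # O(||delta||^3) error, much smaller
```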

2. Convexity

A function \(f\) is convex if for all \(\mathbf{x}, \mathbf{y}\) and \(t \in [0,1]\):

\[ f(t\mathbf{x} + (1-t)\mathbf{y}) \leq t f(\mathbf{x}) + (1-t)f(\mathbf{y}) \]

Second-order condition: a twice-differentiable \(f\) is convex if and only if \(\mathbf{H}(\mathbf{x}) \succeq 0\) (positive semi-definite) for all \(\mathbf{x}\).

Convexity guarantees that every local minimum is a global minimum. This is why convex losses (squared error, cross-entropy, hinge) are preferred: with a suitable learning rate, gradient descent converges to the global optimum.

[Figure: two loss surfaces side by side. Left, a convex loss: the chord always lies above the curve and there is a single global minimum. Right, a non-convex loss with two local minima and a global minimum, where GD may get stuck.]

Convex functions have a unique global minimum; non-convex functions have local minima where gradient descent can get trapped.
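The second-order condition can be checked numerically. As an illustrative sketch (the matrix \(\mathbf{A}\) below is random), the least-squares loss \(f(\mathbf{x}) = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2\) has Hessian \(2\mathbf{A}^\top\mathbf{A}\), which is positive semi-definite for any \(\mathbf{A}\), so the loss is always convex:

```python
import numpy as np

# Hessian of the least-squares loss ||Ax - b||^2 is 2 A^T A,
# which is positive semi-definite for any A.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
H = 2 * A.T @ A
eigs = np.linalg.eigvalsh(H)
print(eigs)  # all >= 0 (up to round-off) => the loss is convex
assert np.all(eigs >= -1e-10)
```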

3. Gradient Descent: Full Derivation

We want to decrease \(f(\mathbf{x})\). The first-order Taylor expansion says:

\[ f(\mathbf{x} + \boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \boldsymbol{\delta} \]

To minimise this over \(\boldsymbol{\delta}\) with the constraint \(\|\boldsymbol{\delta}\| = \eta\) (step budget), the optimal direction is \(\boldsymbol{\delta}^* = -\eta \,\nabla f(\mathbf{x}) / \|\nabla f(\mathbf{x})\|\). Dropping the normalisation so that the step scales with the gradient magnitude, \(\boldsymbol{\delta} = -\eta\nabla f\), gives the gradient descent update:

\[ \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \,\nabla f(\mathbf{x}_t) \]

Learning rate analysis: For \(L\)-smooth convex \(f\) (Hessian eigenvalues \(\leq L\)), choosing \(\eta \leq 1/L\) guarantees monotone descent:

\[ f(\mathbf{x}_{t+1}) \leq f(\mathbf{x}_t) - \frac{\eta}{2}\|\nabla f(\mathbf{x}_t)\|^2 \]
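The descent guarantee can be verified on a toy convex quadratic \(f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top\mathbf{Q}\mathbf{x}\) (the matrix \(\mathbf{Q}\) and starting point below are arbitrary illustrative choices; \(L\) is \(\mathbf{Q}\)'s largest eigenvalue):

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
L = np.linalg.eigvalsh(Q).max()          # smoothness constant
eta = 1.0 / L

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

x = np.array([5.0, -3.0])
for _ in range(20):
    g = grad(x)
    x_new = x - eta * g
    # Descent guarantee: f drops by at least (eta/2) * ||grad||^2 per step
    assert f(x_new) <= f(x) - 0.5 * eta * g @ g + 1e-12
    x = x_new
print(f(x))  # close to the minimum f(0) = 0
```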

Momentum (Heavy Ball)

Momentum accumulates a velocity vector to damp oscillations:

\[ \mathbf{v}_{t+1} = \beta\,\mathbf{v}_t + (1-\beta)\nabla f(\mathbf{x}_t), \quad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta\,\mathbf{v}_{t+1} \]

Adam Optimizer: Derivation

Adam (Adaptive Moment Estimation) maintains per-parameter first and second moment estimates. At step \(t\):

\[ \mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t \quad \text{(1st moment: mean)} \]
\[ \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2 \quad \text{(2nd moment: uncentred variance)} \]
\[ \hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t} \quad \text{(bias correction)} \]
\[ \mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \varepsilon}\,\hat{\mathbf{m}}_t \]

The bias correction ensures that \(\hat{\mathbf{m}}_1 \approx \mathbf{g}_1\) even at initialisation when \(\mathbf{m}_0 = \mathbf{0}\). Typical values: \(\beta_1 = 0.9, \beta_2 = 0.999, \varepsilon = 10^{-8}, \eta = 10^{-3}\).
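A one-step sketch of these updates (the gradient values are arbitrary) confirms that bias correction recovers \(\hat{\mathbf{m}}_1 = \mathbf{g}_1\) exactly, and that the very first Adam step has magnitude close to \(\eta\) per coordinate:

```python
import numpy as np

beta1, beta2, eps, eta = 0.9, 0.999, 1e-8, 1e-3
m = v = np.zeros(2)

g = np.array([0.3, -1.2])   # first gradient (arbitrary values)
t = 1
m = beta1 * m + (1 - beta1) * g          # m_1 = (1 - beta1) * g_1
v = beta2 * v + (1 - beta2) * g**2
m_hat = m / (1 - beta1**t)               # bias correction cancels (1 - beta1)
v_hat = v / (1 - beta2**t)
print(m_hat)   # equals g exactly
step = eta * m_hat / (np.sqrt(v_hat) + eps)
print(step)    # ~ eta * sign(g): sqrt(v_hat) = |g|, so the ratio is +/-1
```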

Stochastic Gradient Descent (SGD)

For large datasets, computing \(\nabla f = \frac{1}{n}\sum_i \nabla \ell_i\) is expensive. SGD approximates it with a mini-batch \(\mathcal{B} \subset \{1,\ldots,n\}\):

\[ \mathbf{x}_{t+1} = \mathbf{x}_t - \eta_t \,\frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \nabla \ell_i(\mathbf{x}_t) \]

With a diminishing learning rate schedule satisfying \(\sum_t \eta_t = \infty\) and \(\sum_t \eta_t^2 < \infty\) (the Robbins–Monro conditions), SGD converges for convex objectives.
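A minimal mini-batch SGD sketch on synthetic least-squares data (the schedule \(\eta_t = 0.1\,t^{-0.6}\) is one choice satisfying the two conditions; the data, batch size, and constants are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(n)   # noisy linear targets

w = np.zeros(d)
batch = 32
for t in range(1, 2001):
    idx = rng.choice(n, size=batch, replace=False)
    # gradient of (1/|B|) * sum (x_i^T w - y_i)^2 over the mini-batch
    g = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
    # eta_t = 0.1 * t^{-0.6}: sum eta_t diverges, sum eta_t^2 converges
    w -= 0.1 * t**-0.6 * g
print(w)  # close to w_true
```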

4. Lagrange Multipliers & KKT Conditions

Consider the constrained problem:

\[ \min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad g_i(\mathbf{x}) \leq 0,\ i=1,\ldots,m \quad \text{and} \quad h_j(\mathbf{x}) = 0,\ j=1,\ldots,p \]

Form the Lagrangian by introducing multipliers \(\alpha_i \geq 0\) for inequalities and \(\nu_j\) for equalities:

\[ \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\nu}) = f(\mathbf{x}) + \sum_{i=1}^m \alpha_i g_i(\mathbf{x}) + \sum_{j=1}^p \nu_j h_j(\mathbf{x}) \]

The KKT conditions are necessary (and for convex problems, sufficient) for optimality:

Stationarity
\[ \nabla_{\mathbf{x}}\mathcal{L} = \nabla f(\mathbf{x}^*) + \sum_i \alpha_i \nabla g_i(\mathbf{x}^*) + \sum_j \nu_j \nabla h_j(\mathbf{x}^*) = \mathbf{0} \]
Primal Feasibility
\[ g_i(\mathbf{x}^*) \leq 0, \quad h_j(\mathbf{x}^*) = 0 \]
Dual Feasibility
\[ \alpha_i \geq 0 \]
Complementary Slackness
\[ \alpha_i g_i(\mathbf{x}^*) = 0 \quad \forall i \]

Complementary slackness says that either the constraint is active (\(g_i = 0\)) or its multiplier is zero (\(\alpha_i = 0\)). This will be crucial for the SVM dual derivation in Chapter 6.
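A tiny worked example (a made-up problem, not one from the text): minimise \(x^2 + y^2\) subject to \(x + y \geq 1\), i.e. \(g(x,y) = 1 - x - y \leq 0\). The unconstrained minimum \((0,0)\) is infeasible, so the constraint is active at \(\mathbf{x}^* = (\tfrac12, \tfrac12)\) with multiplier \(\alpha = 1\), and all four KKT conditions can be checked directly:

```python
import numpy as np

x_star = np.array([0.5, 0.5])
alpha = 1.0

grad_f = 2 * x_star              # gradient of x^2 + y^2 is (2x, 2y)
grad_g = np.array([-1.0, -1.0])  # gradient of g = 1 - x - y
g_val = 1 - x_star.sum()

assert np.allclose(grad_f + alpha * grad_g, 0)  # stationarity
assert g_val <= 1e-12                           # primal feasibility
assert alpha >= 0                               # dual feasibility
assert abs(alpha * g_val) < 1e-12               # complementary slackness
print("KKT conditions hold at (0.5, 0.5) with alpha = 1")
```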

Python: GD vs Momentum vs Adam on Rosenbrock

The Rosenbrock function \(f(x,y) = (1-x)^2 + 100(y-x^2)^2\) is a classic non-convex benchmark with a narrow curved valley. We compare gradient descent, momentum, and Adam, showing how adaptive methods dramatically outperform vanilla GD.

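The interactive script is not reproduced here; a self-contained sketch of the comparison (the starting point, step counts, and learning rates are illustrative choices, not tuned values) might look like:

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

def run_gd(steps=5000, eta=1e-3):
    p = np.array([-1.0, 1.0])
    for _ in range(steps):
        p -= eta * rosenbrock_grad(p)
    return p

def run_momentum(steps=5000, eta=1e-3, beta=0.9):
    p, v = np.array([-1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        v = beta * v + (1 - beta) * rosenbrock_grad(p)
        p -= eta * v
    return p

def run_adam(steps=5000, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    p = np.array([-1.0, 1.0])
    m, v = np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        g = rosenbrock_grad(p)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
        p -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return p

for name, run in [("GD", run_gd), ("Momentum", run_momentum), ("Adam", run_adam)]:
    p = run()
    print(f"{name:8s} -> point {p}, loss {rosenbrock(p):.2e}")
```

Vanilla GD needs a small learning rate to remain stable across the valley's steep walls and therefore crawls along the flat valley floor; Adam's per-coordinate scaling lets it make steady progress in both directions.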

Key Takeaways

  • ✓ The gradient is the steepest ascent direction; gradient descent takes steps of \(-\eta\nabla f\) derived from the first-order Taylor expansion.
  • ✓ Convexity (positive semi-definite Hessian) guarantees every local minimum is global, so gradient descent finds the true solution.
  • ✓ Momentum reduces oscillations by accumulating a velocity; Adam adds per-parameter adaptive learning rates via bias-corrected moment estimates.
  • ✓ KKT conditions are the necessary (and for convex problems, sufficient) conditions for constrained optimality, and the backbone of the SVM dual.
  • ✓ Complementary slackness \(\alpha_i g_i = 0\) identifies which constraints are active at the solution.