Part VII — Advanced Topics

Chapter 19: Reinforcement Learning

Reinforcement learning formalises the problem of an agent learning through interaction. We derive every key equation from first principles: the Bellman equation for policy evaluation, the optimal Bellman operator, Q-learning's off-policy TD update, and the policy gradient theorem via the log-derivative trick.

1. Markov Decision Processes

A Markov Decision Process is a tuple \( (\mathcal{S}, \mathcal{A}, P, R, \gamma) \) where:

  • \( \mathcal{S} \) β€” state space
  • \( \mathcal{A} \) β€” action space
  • \( P(s' \mid s, a) \) β€” transition probability kernel
  • \( R(s, a, s') \in \mathbb{R} \) β€” reward function (often abbreviated \( R(s,a) \))
  • \( \gamma \in [0,1) \) β€” discount factor

At each time step \( t \), the agent observes state \( s_t \), selects action \( a_t \sim \pi(\cdot \mid s_t) \), receives reward \( r_t \), and transitions to \( s_{t+1} \sim P(\cdot \mid s_t, a_t) \). The goal is to maximise expected discounted return:

\[ G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \]

The Markov property states that \( P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t) \), meaning the future depends only on the current state and action, not the full history.
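The MDP tuple can be written down directly as data. A minimal sketch, where the two-state example, its transitions, and rewards are invented purely for illustration:

```python
# Sketch of an MDP (S, A, P, R, gamma) as plain Python data structures.
# The two states, two actions, and all numbers here are illustrative assumptions.
GAMMA = 0.9  # discount factor gamma

STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# P[s][a] is a list of (next_state, probability); R[s][a][s'] is the reward.
P = {
    "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"stay": [("s1", 1.0)], "move": [("s0", 1.0)]},
}
R = {
    "s0": {"stay": {"s0": 0.0}, "move": {"s1": 1.0, "s0": 0.0}},
    "s1": {"stay": {"s1": 0.0}, "move": {"s0": 0.0}},
}

def check_stochastic(P):
    """Each P(. | s, a) must sum to 1 -- the defining property of a transition kernel."""
    return all(
        abs(sum(prob for _, prob in P[s][a]) - 1.0) < 1e-9
        for s in P for a in P[s]
    )

print(check_stochastic(P))  # True
```

The Markov property is built into this representation: `P[s][a]` depends only on the current state and action, never on history.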

Agent–Environment Loop

Figure: the agent–environment loop. The agent's policy π(a|s) emits action aₜ; the environment (P(s'|s,a), R) returns the next state sₜ₊₁ and reward rₜ, with initial state s₀ ~ μ₀. Objective: maximise E[Gₜ] = E[Σ γᵏ rₜ₊ₖ].

2. Value Functions

The state-value function under policy \( \pi \) is the expected discounted return starting from state \( s \):

\[ V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\Bigg|\; s_0 = s\right] \]

The action-value function (Q-function) conditions on both state and action:

\[ Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\Bigg|\; s_0 = s,\, a_0 = a\right] \]

The relationship between the two is \( V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a) \).
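This identity is a one-liner in code; the state, actions, policy probabilities, and Q-values below are made up for illustration:

```python
# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a), for a single state s.
def state_value(pi_s, q_s):
    """pi_s: {action: probability}, q_s: {action: Q-value} for one state."""
    return sum(pi_s[a] * q_s[a] for a in pi_s)

pi_s = {"left": 0.25, "right": 0.75}   # assumed policy at state s
q_s = {"left": 2.0, "right": 4.0}      # assumed Q-values at state s
print(state_value(pi_s, q_s))  # 0.25*2 + 0.75*4 = 3.5
```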

3. Bellman Equation — Full Derivation

We derive the Bellman equation by expanding the definition of \( V^\pi(s) \) and exploiting the Markov property. Start from:

\[ V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\Big|\; s_0 = s\right] \]

Peel off the first reward:

\[ = \mathbb{E}_\pi\!\left[r_0 + \gamma \sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_0 = s\right] \]

Condition on the first action \( a_0 \sim \pi(\cdot|s) \) and next state \( s_1 \sim P(\cdot|s,a_0) \):

\[ = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[ R(s,a,s') + \gamma\, \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t r_{t+1} \;\Big|\; s_1 = s'\right]\right] \]

The inner expectation is exactly \( V^\pi(s') \) by definition:

\[ \boxed{V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\!\left[R(s,a,s') + \gamma\, V^\pi(s')\right]} \]

This is the Bellman expectation equation. It expresses V at the current state as a weighted sum over actions and next states of immediate reward plus discounted future value.
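The boxed equation doubles as an algorithm: treat it as an assignment and iterate until \( V \) stops changing (iterative policy evaluation). A sketch on an invented two-state MDP under a uniform random policy:

```python
# Iterative policy evaluation: apply the Bellman expectation equation
# as a fixed-point update. The 2-state MDP below is an illustrative assumption.
GAMMA = 0.9
S, A = [0, 1], [0, 1]
# P[s][a][s'] and R[s][a][s'] as nested lists.
P = [[[1.0, 0.0], [0.0, 1.0]],   # from s=0: a=0 stays; a=1 moves to s=1
     [[1.0, 0.0], [0.0, 1.0]]]   # from s=1: a=0 moves back to s=0; a=1 stays
R = [[[0.0, 0.0], [0.0, 1.0]],   # reaching s=1 from s=0 pays reward 1
     [[0.0, 0.0], [0.0, 0.0]]]

pi = [[0.5, 0.5], [0.5, 0.5]]    # uniform random policy

V = [0.0, 0.0]
for _ in range(1000):            # gamma-contraction => geometric convergence
    V = [sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * V[s2])
                            for s2 in S)
             for a in A)
         for s in S]
print([round(v, 3) for v in V])  # [2.75, 2.25]
```

For this toy MDP the fixed point can be checked by hand: solving the two linear Bellman equations gives \( V^\pi = (2.75, 2.25) \), matching the iteration.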

Optimal Bellman Equation

The optimal value function \( V^*(s) = \max_\pi V^\pi(s) \) satisfies the Bellman optimality equation:

\[ V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\!\left[R(s,a,s') + \gamma\, V^*(s')\right] \]

The corresponding optimal Q-function satisfies:

\[ Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\!\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right] \]

The optimal policy is then \( \pi^*(s) = \operatorname{argmax}_a Q^*(s,a) \). The Bellman optimality operator \( \mathcal{T}^* \) is a contraction with factor \( \gamma \) under the \( \ell^\infty \) norm, guaranteeing a unique fixed point.
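Replacing the expectation over \( \pi \) with a max turns policy evaluation into value iteration. A sketch on an invented two-state MDP (same layout as one might use for policy evaluation; all numbers are assumptions):

```python
# Value iteration: repeatedly apply the Bellman optimality operator T*.
# The 2-state MDP below is an illustrative assumption.
GAMMA = 0.9
S, A = [0, 1], [0, 1]
P = [[[1.0, 0.0], [0.0, 1.0]],   # from s=0: a=0 stays; a=1 moves to s=1
     [[1.0, 0.0], [0.0, 1.0]]]   # from s=1: a=0 moves back to s=0; a=1 stays
R = [[[0.0, 0.0], [0.0, 1.0]],   # reaching s=1 from s=0 pays reward 1
     [[0.0, 0.0], [0.0, 0.0]]]

V = [0.0, 0.0]
for _ in range(500):
    V = [max(sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * V[s2]) for s2 in S)
             for a in A)
         for s in S]

# Greedy policy from V*: argmax over the one-step lookahead.
policy = [max(A, key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * V[s2])
                                   for s2 in S))
          for s in S]
print([round(v, 3) for v in V], policy)
```

The optimal policy oscillates between the two states to collect the reward repeatedly, giving \( V^*(0) = 1/(1-\gamma^2) \approx 5.263 \).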

4. Q-Learning

Q-learning (Watkins, 1989) is an off-policy temporal-difference algorithm that directly approximates \( Q^* \) without knowing the model \( P \). After observing transition \( (s, a, r, s') \), the update is:

\[ Q(s,a) \leftarrow Q(s,a) + \alpha\!\underbrace{\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]}_{\delta_t\,\text{(TD error)}} \]

The TD error \( \delta_t = r + \gamma \max_{a'} Q(s',a') - Q(s,a) \) measures the discrepancy between the current Q-estimate and the bootstrapped target. Q-learning is off-policy because the max over \( a' \) follows the greedy policy regardless of which policy collected the data.
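The update rule can be sketched as a single function; the states, actions, Q-values, and step size below are placeholders:

```python
# One Q-learning update on a single transition (s, a, r, s').
# Q is a dict keyed by (state, action); all values here are illustrative.
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)  # bootstrapped target
    td_error = target - Q[(s, a)]                                # delta_t
    Q[(s, a)] += alpha * td_error
    return td_error

Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 2.0, ("s1", "right"): 4.0}
delta = q_update(Q, "s0", "right", 1.0, "s1", ["left", "right"])
print(delta, Q[("s0", "right")])  # TD error 1 + 0.9*4 - 0 = 4.6; new Q = 0 + 0.5*4.6 = 2.3
```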

Tabular Q-Learning Algorithm

  1. Initialise \( Q(s,a) = 0 \) for all \( s,a \)
  2. For each episode, reset \( s \leftarrow s_0 \)
  3. Choose \( a \) via \( \varepsilon \)-greedy: with probability \( \varepsilon \) a random action, else \( \operatorname{argmax}_a Q(s,a) \)
  4. Take \( a \), observe \( r, s' \)
  5. Update: \( Q(s,a) \mathrel{+}= \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)] \)
  6. Set \( s \leftarrow s' \); repeat until terminal

Convergence Guarantee

Q-learning converges to \( Q^* \) with probability 1 provided: (1) all state-action pairs are visited infinitely often, (2) the learning rate satisfies \( \sum_t \alpha_t = \infty \) and \( \sum_t \alpha_t^2 < \infty \) (e.g. \( \alpha_t = 1/t \)), and (3) rewards are bounded. The proof uses stochastic approximation theory and the contraction property of \( \mathcal{T}^* \).

5. Policy Gradient Theorem & REINFORCE

Instead of learning a value function and deriving a policy, policy gradient methods directly parameterise \( \pi_\theta \) and optimise \( J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G_0] \) by gradient ascent. We derive \( \nabla_\theta J(\theta) \) using the log-derivative trick.

Write the objective as an expectation over trajectories \( \tau = (s_0,a_0,\ldots) \):

\[ J(\theta) = \int p_\theta(\tau)\, G(\tau)\, d\tau \]

Take the gradient:

\[ \nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, G(\tau)\, d\tau \]

Apply the log-derivative trick: \( \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau) \):

\[ = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, G(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log p_\theta(\tau)\, G(\tau)\right] \]

Expand the log-probability of a trajectory; the initial-state and transition terms drop out of the gradient because they do not depend on \( \theta \):

\[ \log p_\theta(\tau) = \log \mu(s_0) + \sum_{t=0}^{T-1}\!\left[\log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1}\mid s_t,a_t)\right] \]
\[ \nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \]

Therefore:

\[ \boxed{\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\cdot G_t\right]} \]

This is the REINFORCE estimator. The causal form uses the return-to-go \( G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k \) (past rewards cannot be influenced by current parameters).
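A minimal sketch of REINFORCE on an invented two-armed bandit (horizon \( T = 1 \), so the return is just the immediate reward); the arm means, noise level, learning rate, and seed are all assumptions:

```python
import random, math

# REINFORCE on a 2-armed bandit with a softmax policy over logits theta.
# All numbers (arm means, noise, alpha, seed) are illustrative assumptions.
random.seed(0)
MEANS = [0.2, 0.8]          # arm 1 pays more on average
theta = [0.0, 0.0]          # softmax logits

def softmax(th):
    m = max(th)
    e = [math.exp(t - m) for t in th]
    z = sum(e)
    return [x / z for x in e]

alpha = 0.1
for _ in range(2000):
    p = softmax(theta)
    a = 0 if random.random() < p[0] else 1
    r = MEANS[a] + random.gauss(0, 0.1)
    # grad_theta log pi(a) for a softmax policy: one_hot(a) - p
    for i in range(2):
        theta[i] += alpha * ((1.0 if i == a else 0.0) - p[i]) * r

print(softmax(theta))   # the policy should strongly prefer arm 1
```

Each update is exactly \( \nabla_\theta \log \pi_\theta(a) \cdot G \) for a single sampled trajectory; averaging over many samples recovers the boxed expectation.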

Variance Reduction: Baselines

Subtracting a baseline \( b(s_t) \) (e.g. \( V^\pi(s_t) \)) from \( G_t \) does not bias the gradient (the expected baseline term is zero by the likelihood ratio argument) but drastically reduces variance:

\[ \nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\cdot\underbrace{(G_t - b(s_t))}_{\text{advantage } A_t}\right] \]
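The unbiasedness-plus-variance-reduction claim can be checked numerically. The single-state policy, rewards, and sample count below are illustrative assumptions; we estimate one component of the score-function gradient with and without the baseline \( b = V \):

```python
import random

# Monte-Carlo check: a baseline leaves the gradient estimate unbiased
# but shrinks its variance. Setup (policy, rewards, N) is illustrative.
random.seed(1)
p = [0.3, 0.7]                 # fixed softmax policy probabilities over 2 actions
rewards = [1.0, 2.0]           # deterministic reward per action
baseline = sum(pi * r for pi, r in zip(p, rewards))  # b = E[r] = V

def grad_sample(b):
    a = 0 if random.random() < p[0] else 1
    score = (1.0 if a == 1 else 0.0) - p[1]   # arm-1 component of grad log pi
    return score * (rewards[a] - b)

N = 200_000
g0 = [grad_sample(0.0) for _ in range(N)]       # no baseline
g1 = [grad_sample(baseline) for _ in range(N)]  # b = V

mean0, mean1 = sum(g0) / N, sum(g1) / N
var0 = sum((x - mean0) ** 2 for x in g0) / N
var1 = sum((x - mean1) ** 2 for x in g1) / N
print(round(mean0, 2), round(mean1, 2), var1 < var0)
```

Both estimators concentrate on the same true gradient component (0.21 for this setup, computable in closed form), but the baselined one has roughly a tenth of the variance.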

6. Actor-Critic & PPO

Actor-Critic

Replace the Monte-Carlo return \( G_t \) with a bootstrapped TD estimate using a learned critic \( V_w(s) \):

\[ A_t^\text{TD} = r_t + \gamma V_w(s_{t+1}) - V_w(s_t) \]

The actor updates \( \theta \) via the policy gradient; the critic updates \( w \) by minimising the squared TD error \( \left(r_t + \gamma V_w(s_{t+1}) - V_w(s_t)\right)^2 \).
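One combined actor-critic step might look like the following sketch (tabular critic, softmax actor; the states, values, and step sizes are placeholders):

```python
import math

def softmax(th):
    m = max(th)
    e = [math.exp(t - m) for t in th]
    z = sum(e)
    return [x / z for x in e]

# Tabular critic V[s] and per-state actor logits theta[s] (assumed shapes).
V = {"s": 0.0, "s_next": 1.0}
theta = {"s": [0.0, 0.0]}

def actor_critic_step(s, a, r, s_next, alpha_w=0.1, alpha_th=0.1, gamma=0.9):
    td = r + gamma * V[s_next] - V[s]          # A_t^TD from the text
    V[s] += alpha_w * td                       # critic: move V toward the target
    p = softmax(theta[s])
    for i in range(len(theta[s])):             # actor: grad log pi(a|s) * td
        theta[s][i] += alpha_th * ((1.0 if i == a else 0.0) - p[i]) * td
    return td

td = actor_critic_step("s", 1, 0.5, "s_next")
print(round(td, 2), round(V["s"], 3))  # td = 0.5 + 0.9*1 - 0 = 1.4; V[s] = 0.14
```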

PPO Clipped Objective

PPO (Schulman et al., 2017) prevents destructively large policy updates using a clipped surrogate:

\[ r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)} \]
\[ L^\text{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t A_t,\;\operatorname{clip}(r_t,1\!-\!\epsilon,1\!+\!\epsilon)A_t\right)\right] \]

The clip prevents \( r_t \) from moving beyond \( [1-\varepsilon, 1+\varepsilon] \), bounding the trust region without solving a constrained optimisation problem.
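The clipped surrogate is a few lines of arithmetic; a sketch with an assumed \( \epsilon = 0.2 \) and made-up ratio/advantage pairs:

```python
# L^CLIP for a batch of (probability ratio, advantage) pairs, epsilon = 0.2.
def ppo_clip_loss(ratios, advantages, eps=0.2):
    terms = []
    for r, A in zip(ratios, advantages):
        clipped = min(max(r, 1 - eps), 1 + eps)   # clip(r, 1-eps, 1+eps)
        terms.append(min(r * A, clipped * A))     # pessimistic of the two
    return sum(terms) / len(terms)

# Ratio far above the band with positive advantage is clipped to 1.2:
print(ppo_clip_loss([1.5], [1.0]))   # 1.2
# Ratio below the band with negative advantage: the min keeps the clipped term:
print(ppo_clip_loss([0.5], [-1.0]))  # -0.8
```

Note the min is pessimistic: once the ratio leaves the trust band in the direction the advantage favours, the surrogate stops improving, so there is no incentive to push the policy further from \( \pi_{\theta_\text{old}} \).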

Python Simulation: Q-Learning GridWorld

We train a tabular Q-learning agent on a 5×5 gridworld with an obstacle, then visualise the learned value function as a heatmap and the greedy policy as directional arrows.
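A minimal self-contained sketch of such a simulation; the grid layout, obstacle cell, reward shaping, seed, and hyperparameters are assumptions, and the "heatmap" is printed as text:

```python
import random

# Tabular Q-learning on a 5x5 gridworld (all layout/hyperparameter choices assumed).
random.seed(0)
N = 5
OBSTACLE = (2, 2)
GOAL = (4, 4)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
ARROWS = "↑↓←→"
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def step(s, a):
    r, c = s[0] + a[0], s[1] + a[1]
    if not (0 <= r < N and 0 <= c < N) or (r, c) == OBSTACLE:
        r, c = s                                # bump into wall/obstacle: stay put
    reward = 1.0 if (r, c) == GOAL else -0.01   # small step cost, goal bonus
    return (r, c), reward, (r, c) == GOAL

Q = {((r, c), a): 0.0 for r in range(N) for c in range(N) for a in range(4)}

for episode in range(2000):
    s, done = (0, 0), False
    while not done:
        if random.random() < EPS:               # epsilon-greedy exploration
            a = random.randrange(4)
        else:
            a = max(range(4), key=lambda i: Q[(s, i)])
        s2, r, done = step(s, ACTIONS[a])
        # Q-learning TD update with max over next actions (off-policy)
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, i)] for i in range(4)) - Q[(s, a)])
        s = s2

# Text "heatmap" of V(s) = max_a Q(s, a), then the greedy policy as arrows.
for r in range(N):
    print(" ".join(f"{max(Q[((r, c), i)] for i in range(4)):5.2f}" for c in range(N)))
for r in range(N):
    row = []
    for c in range(N):
        if (r, c) == OBSTACLE: row.append("#")
        elif (r, c) == GOAL:   row.append("G")
        else: row.append(ARROWS[max(range(4), key=lambda i: Q[((r, c), i)])])
    print(" ".join(row))
```

After training, the greedy arrows route around the obstacle toward the goal, and values rise toward the goal cell as \( \gamma^{\text{distance}} \) would suggest.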
