Part VII — Advanced Topics

Chapter 19: Reinforcement Learning

Reinforcement learning formalises the problem of an agent learning through interaction. We derive every key equation from first principles: the Bellman equation for policy evaluation, the optimal Bellman operator, Q-learning's off-policy TD update, and the policy gradient theorem via the log-derivative trick.

1. Markov Decision Processes

A Markov Decision Process is a tuple \( (\mathcal{S}, \mathcal{A}, P, R, \gamma) \) where:

  • \( \mathcal{S} \) β€” state space
  • \( \mathcal{A} \) β€” action space
  • \( P(s' \mid s, a) \) β€” transition probability kernel
  • \( R(s, a, s') \in \mathbb{R} \) β€” reward function (often abbreviated \( R(s,a) \))
  • \( \gamma \in [0,1) \) β€” discount factor

At each time step \( t \), the agent observes state \( s_t \), selects action \( a_t \sim \pi(\cdot \mid s_t) \), receives reward \( r_t \), and transitions to \( s_{t+1} \sim P(\cdot \mid s_t, a_t) \). The goal is to maximise expected discounted return:

\[ G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \]

The Markov property states that \( P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t) \), meaning the future depends only on the current state and action, not the full history.
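The MDP tuple can be written down directly as data. A minimal sketch, where the two-state example, its transitions, and rewards are invented purely for illustration:

```python
# Sketch of an MDP (S, A, P, R, gamma) as plain Python data structures.
# The two states, two actions, and all numbers here are illustrative assumptions.
GAMMA = 0.9  # discount factor gamma

STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# P[s][a] is a list of (next_state, probability); R[s][a][s'] is the reward.
P = {
    "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"stay": [("s1", 1.0)], "move": [("s0", 1.0)]},
}
R = {
    "s0": {"stay": {"s0": 0.0}, "move": {"s1": 1.0, "s0": 0.0}},
    "s1": {"stay": {"s1": 0.0}, "move": {"s0": 0.0}},
}

def check_stochastic(P):
    """Each P(. | s, a) must sum to 1 -- the defining property of a transition kernel."""
    return all(
        abs(sum(prob for _, prob in P[s][a]) - 1.0) < 1e-9
        for s in P for a in P[s]
    )

print(check_stochastic(P))  # True
```

The Markov property is built into this representation: `P[s][a]` depends only on the current state and action, never on history.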

Agent–Environment Loop

Figure: the agent–environment loop. The agent's policy π(a|s) emits action aₜ; the environment (P(s'|s,a), R) returns the next state sₜ₊₁ and reward rₜ, with initial state s₀ ~ μ₀. Objective: maximise E[Gₜ] = E[Σ γᵏ rₜ₊ₖ].

2. Value Functions

The state-value function under policy \( \pi \) is the expected discounted return starting from state \( s \):

\[ V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\Bigg|\; s_0 = s\right] \]

The action-value function (Q-function) conditions on both state and action:

\[ Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\Bigg|\; s_0 = s,\, a_0 = a\right] \]

The relationship between the two is \( V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a) \).
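This identity is a one-liner in code; the state, actions, policy probabilities, and Q-values below are made up for illustration:

```python
# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a), for a single state s.
def state_value(pi_s, q_s):
    """pi_s: {action: probability}, q_s: {action: Q-value} for one state."""
    return sum(pi_s[a] * q_s[a] for a in pi_s)

pi_s = {"left": 0.25, "right": 0.75}   # assumed policy at state s
q_s = {"left": 2.0, "right": 4.0}      # assumed Q-values at state s
print(state_value(pi_s, q_s))  # 0.25*2 + 0.75*4 = 3.5
```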

3. Bellman Equation — Full Derivation

We derive the Bellman equation by expanding the definition of \( V^\pi(s) \) and exploiting the Markov property. Start from:

\[ V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\Big|\; s_0 = s\right] \]

Peel off the first reward:

\[ = \mathbb{E}_\pi\!\left[r_0 + \gamma \sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_0 = s\right] \]

Condition on the first action \( a_0 \sim \pi(\cdot|s) \) and next state \( s_1 \sim P(\cdot|s,a_0) \):

\[ = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[ R(s,a,s') + \gamma\, \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t r_{t+1} \;\Big|\; s_1 = s'\right]\right] \]

The inner expectation is exactly \( V^\pi(s') \) by definition:

\[ \boxed{V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\!\left[R(s,a,s') + \gamma\, V^\pi(s')\right]} \]

This is the Bellman expectation equation. It expresses V at the current state as a weighted sum over actions and next states of immediate reward plus discounted future value.
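The boxed equation doubles as an algorithm: treat it as an assignment and iterate until \( V \) stops changing (iterative policy evaluation). A sketch on an invented two-state MDP under a uniform random policy:

```python
# Iterative policy evaluation: apply the Bellman expectation equation
# as a fixed-point update. The 2-state MDP below is an illustrative assumption.
GAMMA = 0.9
S, A = [0, 1], [0, 1]
# P[s][a][s'] and R[s][a][s'] as nested lists.
P = [[[1.0, 0.0], [0.0, 1.0]],   # from s=0: a=0 stays; a=1 moves to s=1
     [[1.0, 0.0], [0.0, 1.0]]]   # from s=1: a=0 moves back to s=0; a=1 stays
R = [[[0.0, 0.0], [0.0, 1.0]],   # reaching s=1 from s=0 pays reward 1
     [[0.0, 0.0], [0.0, 0.0]]]

pi = [[0.5, 0.5], [0.5, 0.5]]    # uniform random policy

V = [0.0, 0.0]
for _ in range(1000):            # gamma-contraction => geometric convergence
    V = [sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * V[s2])
                            for s2 in S)
             for a in A)
         for s in S]
print([round(v, 3) for v in V])  # [2.75, 2.25]
```

For this toy MDP the fixed point can be checked by hand: solving the two linear Bellman equations gives \( V^\pi = (2.75, 2.25) \), matching the iteration.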

Optimal Bellman Equation

The optimal value function \( V^*(s) = \max_\pi V^\pi(s) \) satisfies the Bellman optimality equation:

\[ V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\!\left[R(s,a,s') + \gamma\, V^*(s')\right] \]

The corresponding optimal Q-function satisfies:

\[ Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\!\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right] \]

The optimal policy is then \( \pi^*(s) = \operatorname{argmax}_a Q^*(s,a) \). The Bellman optimality operator \( \mathcal{T}^* \) is a contraction with factor \( \gamma \) under the \( \ell^\infty \) norm, guaranteeing a unique fixed point.
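Replacing the expectation over \( \pi \) with a max turns policy evaluation into value iteration. A sketch on an invented two-state MDP (same layout as one might use for policy evaluation; all numbers are assumptions):

```python
# Value iteration: repeatedly apply the Bellman optimality operator T*.
# The 2-state MDP below is an illustrative assumption.
GAMMA = 0.9
S, A = [0, 1], [0, 1]
P = [[[1.0, 0.0], [0.0, 1.0]],   # from s=0: a=0 stays; a=1 moves to s=1
     [[1.0, 0.0], [0.0, 1.0]]]   # from s=1: a=0 moves back to s=0; a=1 stays
R = [[[0.0, 0.0], [0.0, 1.0]],   # reaching s=1 from s=0 pays reward 1
     [[0.0, 0.0], [0.0, 0.0]]]

V = [0.0, 0.0]
for _ in range(500):
    V = [max(sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * V[s2]) for s2 in S)
             for a in A)
         for s in S]

# Greedy policy from V*: argmax over the one-step lookahead.
policy = [max(A, key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * V[s2])
                                   for s2 in S))
          for s in S]
print([round(v, 3) for v in V], policy)
```

The optimal policy oscillates between the two states to collect the reward repeatedly, giving \( V^*(0) = 1/(1-\gamma^2) \approx 5.263 \).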

4. Q-Learning

Q-learning (Watkins, 1989) is an off-policy temporal-difference algorithm that directly approximates \( Q^* \) without knowing the model \( P \). After observing transition \( (s, a, r, s') \), the update is:

\[ Q(s,a) \leftarrow Q(s,a) + \alpha\!\underbrace{\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]}_{\delta_t\,\text{(TD error)}} \]

The TD error \( \delta_t = r + \gamma \max_{a'} Q(s',a') - Q(s,a) \) measures the discrepancy between the current Q-estimate and the bootstrapped target. Q-learning is off-policy because the max over \( a' \) follows the greedy policy regardless of which policy collected the data.
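The update rule can be sketched as a single function; the states, actions, Q-values, and step size below are placeholders:

```python
# One Q-learning update on a single transition (s, a, r, s').
# Q is a dict keyed by (state, action); all values here are illustrative.
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)  # bootstrapped target
    td_error = target - Q[(s, a)]                                # delta_t
    Q[(s, a)] += alpha * td_error
    return td_error

Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 2.0, ("s1", "right"): 4.0}
delta = q_update(Q, "s0", "right", 1.0, "s1", ["left", "right"])
print(delta, Q[("s0", "right")])  # TD error 1 + 0.9*4 - 0 = 4.6; new Q = 0 + 0.5*4.6 = 2.3
```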

Tabular Q-Learning Algorithm

  1. Initialise \( Q(s,a) = 0 \) for all \( s,a \)
  2. For each episode, reset \( s \leftarrow s_0 \)
  3. Choose \( a \) via \( \varepsilon \)-greedy: with probability \( \varepsilon \) a random action, else \( \operatorname{argmax}_a Q(s,a) \)
  4. Take \( a \), observe \( r, s' \)
  5. Update: \( Q(s,a) \mathrel{+}= \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)] \)
  6. Set \( s \leftarrow s' \); repeat until terminal

Convergence Guarantee

Q-learning converges to \( Q^* \) with probability 1 provided: (1) all state-action pairs are visited infinitely often, (2) the learning rate satisfies \( \sum_t \alpha_t = \infty \) and \( \sum_t \alpha_t^2 < \infty \) (e.g. \( \alpha_t = 1/t \)), and (3) rewards are bounded. The proof uses stochastic approximation theory and the contraction property of \( \mathcal{T}^* \).

5. Policy Gradient Theorem & REINFORCE

Instead of learning a value function and deriving a policy, policy gradient methods directly parameterise \( \pi_\theta \) and optimise \( J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G_0] \) by gradient ascent. We derive \( \nabla_\theta J(\theta) \) using the log-derivative trick.

Write the objective as an expectation over trajectories \( \tau = (s_0,a_0,\ldots) \):

\[ J(\theta) = \int p_\theta(\tau)\, G(\tau)\, d\tau \]

Take the gradient:

\[ \nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, G(\tau)\, d\tau \]

Apply the log-derivative trick: \( \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau) \):

\[ = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, G(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log p_\theta(\tau)\, G(\tau)\right] \]

Expand the log-probability of a trajectory; the initial-state and transition terms drop out of the gradient because they do not depend on \( \theta \):

\[ \log p_\theta(\tau) = \log \mu(s_0) + \sum_{t=0}^{T-1}\!\left[\log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1}\mid s_t,a_t)\right] \]
\[ \nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \]

Therefore:

\[ \boxed{\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\cdot G_t\right]} \]

This is the REINFORCE estimator. The causal form uses the return-to-go \( G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k \) (past rewards cannot be influenced by current parameters).
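A minimal sketch of REINFORCE on an invented two-armed bandit (horizon \( T = 1 \), so the return is just the immediate reward); the arm means, noise level, learning rate, and seed are all assumptions:

```python
import random, math

# REINFORCE on a 2-armed bandit with a softmax policy over logits theta.
# All numbers (arm means, noise, alpha, seed) are illustrative assumptions.
random.seed(0)
MEANS = [0.2, 0.8]          # arm 1 pays more on average
theta = [0.0, 0.0]          # softmax logits

def softmax(th):
    m = max(th)
    e = [math.exp(t - m) for t in th]
    z = sum(e)
    return [x / z for x in e]

alpha = 0.1
for _ in range(2000):
    p = softmax(theta)
    a = 0 if random.random() < p[0] else 1
    r = MEANS[a] + random.gauss(0, 0.1)
    # grad_theta log pi(a) for a softmax policy: one_hot(a) - p
    for i in range(2):
        theta[i] += alpha * ((1.0 if i == a else 0.0) - p[i]) * r

print(softmax(theta))   # the policy should strongly prefer arm 1
```

Each update is exactly \( \nabla_\theta \log \pi_\theta(a) \cdot G \) for a single sampled trajectory; averaging over many samples recovers the boxed expectation.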

Variance Reduction: Baselines

Subtracting a baseline \( b(s_t) \) (e.g. \( V^\pi(s_t) \)) from \( G_t \) does not bias the gradient (the expected baseline term is zero by the likelihood ratio argument) but drastically reduces variance:

\[ \nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\cdot\underbrace{(G_t - b(s_t))}_{\text{advantage } A_t}\right] \]
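The unbiasedness-plus-variance-reduction claim can be checked numerically. The single-state policy, rewards, and sample count below are illustrative assumptions; we estimate one component of the score-function gradient with and without the baseline \( b = V \):

```python
import random

# Monte-Carlo check: a baseline leaves the gradient estimate unbiased
# but shrinks its variance. Setup (policy, rewards, N) is illustrative.
random.seed(1)
p = [0.3, 0.7]                 # fixed softmax policy probabilities over 2 actions
rewards = [1.0, 2.0]           # deterministic reward per action
baseline = sum(pi * r for pi, r in zip(p, rewards))  # b = E[r] = V

def grad_sample(b):
    a = 0 if random.random() < p[0] else 1
    score = (1.0 if a == 1 else 0.0) - p[1]   # arm-1 component of grad log pi
    return score * (rewards[a] - b)

N = 200_000
g0 = [grad_sample(0.0) for _ in range(N)]       # no baseline
g1 = [grad_sample(baseline) for _ in range(N)]  # b = V

mean0, mean1 = sum(g0) / N, sum(g1) / N
var0 = sum((x - mean0) ** 2 for x in g0) / N
var1 = sum((x - mean1) ** 2 for x in g1) / N
print(round(mean0, 2), round(mean1, 2), var1 < var0)
```

Both estimators concentrate on the same true gradient component (0.21 for this setup, computable in closed form), but the baselined one has roughly a tenth of the variance.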

6. Actor-Critic & PPO

Actor-Critic

Replace the Monte-Carlo return \( G_t \) with a bootstrapped TD estimate using a learned critic \( V_w(s) \):

\[ A_t^\text{TD} = r_t + \gamma V_w(s_{t+1}) - V_w(s_t) \]

The actor updates \( \theta \) via the policy gradient; the critic updates \( w \) by minimising the squared TD error \( \left(r_t + \gamma V_w(s_{t+1}) - V_w(s_t)\right)^2 \).
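One combined actor-critic step might look like the following sketch (tabular critic, softmax actor; the states, values, and step sizes are placeholders):

```python
import math

def softmax(th):
    m = max(th)
    e = [math.exp(t - m) for t in th]
    z = sum(e)
    return [x / z for x in e]

# Tabular critic V[s] and per-state actor logits theta[s] (assumed shapes).
V = {"s": 0.0, "s_next": 1.0}
theta = {"s": [0.0, 0.0]}

def actor_critic_step(s, a, r, s_next, alpha_w=0.1, alpha_th=0.1, gamma=0.9):
    td = r + gamma * V[s_next] - V[s]          # A_t^TD from the text
    V[s] += alpha_w * td                       # critic: move V toward the target
    p = softmax(theta[s])
    for i in range(len(theta[s])):             # actor: grad log pi(a|s) * td
        theta[s][i] += alpha_th * ((1.0 if i == a else 0.0) - p[i]) * td
    return td

td = actor_critic_step("s", 1, 0.5, "s_next")
print(round(td, 2), round(V["s"], 3))  # td = 0.5 + 0.9*1 - 0 = 1.4; V[s] = 0.14
```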

PPO Clipped Objective

PPO (Schulman et al., 2017) prevents destructively large policy updates using a clipped surrogate:

\[ r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)} \]
\[ L^\text{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t A_t,\;\operatorname{clip}(r_t,1\!-\!\epsilon,1\!+\!\epsilon)A_t\right)\right] \]

The clip prevents \( r_t \) from moving beyond \( [1-\varepsilon, 1+\varepsilon] \), bounding the trust region without solving a constrained optimisation problem.
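The clipped surrogate is a few lines of arithmetic; a sketch with an assumed \( \epsilon = 0.2 \) and made-up ratio/advantage pairs:

```python
# L^CLIP for a batch of (probability ratio, advantage) pairs, epsilon = 0.2.
def ppo_clip_loss(ratios, advantages, eps=0.2):
    terms = []
    for r, A in zip(ratios, advantages):
        clipped = min(max(r, 1 - eps), 1 + eps)   # clip(r, 1-eps, 1+eps)
        terms.append(min(r * A, clipped * A))     # pessimistic of the two
    return sum(terms) / len(terms)

# Ratio far above the band with positive advantage is clipped to 1.2:
print(ppo_clip_loss([1.5], [1.0]))   # 1.2
# Ratio below the band with negative advantage: the min keeps the clipped term:
print(ppo_clip_loss([0.5], [-1.0]))  # -0.8
```

Note the min is pessimistic: once the ratio leaves the trust band in the direction the advantage favours, the surrogate stops improving, so there is no incentive to push the policy further from \( \pi_{\theta_\text{old}} \).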

Python Simulation: Q-Learning GridWorld

We train a tabular Q-learning agent on a 5×5 gridworld with an obstacle, then visualise the learned value function as a heatmap and the greedy policy as directional arrows.
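A minimal self-contained sketch of such a simulation; the grid layout, obstacle cell, reward shaping, seed, and hyperparameters are assumptions, and the "heatmap" is printed as text:

```python
import random

# Tabular Q-learning on a 5x5 gridworld (all layout/hyperparameter choices assumed).
random.seed(0)
N = 5
OBSTACLE = (2, 2)
GOAL = (4, 4)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
ARROWS = "↑↓←→"
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def step(s, a):
    r, c = s[0] + a[0], s[1] + a[1]
    if not (0 <= r < N and 0 <= c < N) or (r, c) == OBSTACLE:
        r, c = s                                # bump into wall/obstacle: stay put
    reward = 1.0 if (r, c) == GOAL else -0.01   # small step cost, goal bonus
    return (r, c), reward, (r, c) == GOAL

Q = {((r, c), a): 0.0 for r in range(N) for c in range(N) for a in range(4)}

for episode in range(2000):
    s, done = (0, 0), False
    while not done:
        if random.random() < EPS:               # epsilon-greedy exploration
            a = random.randrange(4)
        else:
            a = max(range(4), key=lambda i: Q[(s, i)])
        s2, r, done = step(s, ACTIONS[a])
        # Q-learning TD update with max over next actions (off-policy)
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, i)] for i in range(4)) - Q[(s, a)])
        s = s2

# Text "heatmap" of V(s) = max_a Q(s, a), then the greedy policy as arrows.
for r in range(N):
    print(" ".join(f"{max(Q[((r, c), i)] for i in range(4)):5.2f}" for c in range(N)))
for r in range(N):
    row = []
    for c in range(N):
        if (r, c) == OBSTACLE: row.append("#")
        elif (r, c) == GOAL:   row.append("G")
        else: row.append(ARROWS[max(range(4), key=lambda i: Q[((r, c), i)])])
    print(" ".join(row))
```

After training, the greedy arrows route around the obstacle toward the goal, and values rise toward the goal cell as \( \gamma^{\text{distance}} \) would suggest.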
