Chapter 20: Information Bottleneck & Machine Learning

Part VII: Modern Applications

The Information Bottleneck

Naftali Tishby, Fernando Pereira, and William Bialek (1999) introduced the information bottleneck (IB) as a principled framework for finding an optimal compressed representation \(T\) of an input \(X\) that retains maximum information about a relevant variable \(Y\):

\( \min_{p(t|x)} \; I(X;T) - \beta \cdot I(T;Y) \)

The Lagrange multiplier \(\beta\) controls the tradeoff: high \(\beta\) favors preserving relevance \(I(T;Y)\); low \(\beta\) favors compression, i.e. a small \(I(X;T)\).

The IB method generalizes rate-distortion theory: instead of a hand-crafted distortion measure, the "distortion" is the loss of mutual information with the target \(Y\).
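Both terms in the objective are ordinary mutual informations of discrete joint distributions, so they can be evaluated directly from probability tables. A minimal sketch (the helper name and the toy distribution are illustrative, not from the text):

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats, computed from a joint distribution table p(a, b)."""
    p_a = p_joint.sum(axis=1, keepdims=True)   # marginal p(a)
    p_b = p_joint.sum(axis=0, keepdims=True)   # marginal p(b)
    mask = p_joint > 0                         # skip zero cells (0 log 0 = 0)
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a * p_b)[mask])))

# Example: T is a deterministic copy of a fair bit X, so I(X;T) = ln 2.
p_xt = np.array([[0.5, 0.0],
                 [0.0, 0.5]])
print(mutual_information(p_xt))  # ≈ 0.693 nats (= ln 2)
```

With such a helper, the IB objective for a candidate encoder is just `mutual_information(p_xt) - beta * mutual_information(p_ty)`.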

IB Self-Consistent Equations

The optimal encoder \(p(t|x)\) satisfies a set of self-consistent equations solved iteratively (like the Blahut-Arimoto algorithm):

\( p(t|x) \propto p(t) \exp\!\bigl(-\beta \, D_{KL}[p(y|x) \| p(y|t)]\bigr) \)

\( p(t) = \sum_x p(x) p(t|x) \)

\( p(y|t) = \sum_x p(y|x) p(x|t) \)

The KL divergence \(D_{KL}[p(y|x) \| p(y|t)]\) measures how much information about \(Y\) is lost when \(x\) is mapped to cluster \(t\). Soft assignments allow smooth interpolation along the information curve.
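The three equations translate almost line-for-line into code. A sketch of one full update pass on a toy joint distribution (the sizes, seed, and random \(p(x,y)\) are illustrative), iterated to a fixed point for a single \(\beta\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution p(x, y): |X| = 4, |Y| = 2.
p_xy = rng.random((4, 2))
p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1)                 # p(x)
p_y_given_x = p_xy / p_x[:, None]      # p(y|x)

def ib_step(p_t_given_x, beta):
    """One pass through the three self-consistent IB equations."""
    p_t = p_x @ p_t_given_x                                   # p(t) = sum_x p(x) p(t|x)
    p_x_given_t = p_t_given_x * p_x[:, None] / p_t[None, :]   # Bayes' rule
    p_y_given_t = p_x_given_t.T @ p_y_given_x                 # decoder, shape (|T|, |Y|)
    # D_KL[p(y|x) || p(y|t)] for every (x, t) pair
    kl = np.sum(p_y_given_x[:, None, :] *
                np.log(p_y_given_x[:, None, :] / p_y_given_t[None, :, :]),
                axis=2)
    # New encoder: p(t|x) ∝ p(t) exp(-beta * KL)
    new = p_t[None, :] * np.exp(-beta * kl)
    return new / new.sum(axis=1, keepdims=True)

# Random soft encoder with |T| = 3 clusters; iterate to convergence.
enc = rng.random((4, 3))
enc /= enc.sum(axis=1, keepdims=True)
for _ in range(200):
    enc = ib_step(enc, beta=5.0)
```

Each row of `enc` remains a proper conditional distribution \(p(t|x)\) throughout, which is what makes the soft-assignment interpolation possible.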

Deep Learning as Information Compression

Shwartz-Ziv and Tishby (2017) proposed the Information Bottleneck Theory of Deep Learning: each layer of a neural network compresses the input \(X\) into successive representations \(T_1, T_2, \ldots\), forming a Markov chain:

\( Y \rightarrow X \rightarrow T_1 \rightarrow T_2 \rightarrow \cdots \rightarrow T_L \)

During training, layers first increase \(I(T_\ell;Y)\) (learning phase), then decrease \(I(X;T_\ell)\) (compression phase). This traces a path on the information plane toward the IB optimal curve.

Note: This theory is debated; some results depend on the activation function and estimator used for mutual information. Recent work uses neural mutual information estimators (MINE, CLUB) to study this empirically.
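One common way to place a layer on the information plane is a plug-in estimate after discretizing the layer's activations. The sketch below is one such binning estimator (the function name and bin count are illustrative); as the note above says, results are sensitive to `n_bins`, which is part of why the compression-phase findings are debated:

```python
import numpy as np

def binned_mi(x_labels, activations, n_bins=30):
    """Plug-in estimate of I(X; T_l) after binning continuous activations.

    x_labels: integer id of each input sample (discrete X).
    activations: (n_samples, n_units) layer output T_l.
    """
    # Quantize each unit into equal-width bins, then treat each distinct
    # binned pattern as one symbol t.
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    digitized = np.digitize(activations, edges[1:-1])
    _, t = np.unique(digitized, axis=0, return_inverse=True)
    t = t.ravel()

    # Empirical joint distribution over (x, t), then plug-in MI in nats.
    joint = np.zeros((x_labels.max() + 1, t.max() + 1))
    for xi, ti in zip(x_labels, t):
        joint[xi, ti] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    pt = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * pt)[mask])))

# Sanity check: activations that perfectly separate two classes give ln 2.
x = np.array([0, 0, 1, 1])
acts = np.array([[0.0], [0.0], [1.0], [1.0]])
print(binned_mi(x, acts))  # ≈ 0.693 nats
```

Recording such estimates for every layer at every training epoch yields the trajectories on the information plane described above.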

PAC-Bayes & Generalization

PAC-Bayes bounds connect information theory to generalization in machine learning. For a posterior \(Q\) over hypotheses learned from \(n\) training examples, relative to a data-independent prior \(P\), the generalization gap is bounded with probability at least \(1-\delta\) by:

\( \mathbb{E}_Q[\text{error}] \leq \mathbb{E}_Q[\hat{\text{error}}] + \sqrt{\frac{D_{KL}(Q \| P) + \ln(2n/\delta)}{2n}} \)

The \(D_{KL}(Q\|P)\) term measures how much the posterior has shifted from the prior — equivalently, how much information was extracted from the training data. Models that memorize the training set have large KL, hence poor generalization bounds.
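The bound is cheap to evaluate once the KL term is known. A sketch (the function name and all numbers are illustrative) showing how a larger \(D_{KL}(Q\|P)\) loosens the guarantee at the same empirical error:

```python
import numpy as np

def pac_bayes_bound(emp_error, kl, n, delta=0.05):
    """McAllester-style PAC-Bayes bound on expected error, as in the text:
    empirical error + sqrt((KL(Q||P) + ln(2n/delta)) / (2n))."""
    return emp_error + np.sqrt((kl + np.log(2 * n / delta)) / (2 * n))

# Same training error (5%), same n, different information extracted:
print(pac_bayes_bound(0.05, kl=10.0,   n=10_000))  # modest KL -> tight-ish bound
print(pac_bayes_bound(0.05, kl=5000.0, n=10_000))  # memorization -> loose bound
```

The second call illustrates the closing remark: a model that memorizes the training set pays for it through the KL term, and the bound becomes nearly vacuous.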

Python: Information Bottleneck Curve

The experiment below runs the IB algorithm for multiple \(\beta\) values on a synthetic joint distribution, traces the information plane \(I(X;T)\) vs. \(I(T;Y)\), and visualizes the compression-relevance tradeoff.
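The original server-side script is not reproduced here; the following is a condensed, self-contained sketch of the same experiment (the toy distribution, cluster count, seed, and \(\beta\) grid are illustrative). It prints the information-plane coordinates; plotting `ixt` against `ity` with matplotlib gives the tradeoff curve:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint distribution p(x, y): 8 inputs, 2 labels. Pairs of x share the
# same p(y|x), so a 4-cluster T can compress X without losing relevance.
p1 = np.repeat(rng.random(4), 2)
p_y_given_x = np.stack([p1, 1 - p1], axis=1)       # (8, 2)
p_x = np.full(8, 1 / 8)

def mi(p_joint):
    """Mutual information (nats) of a discrete joint distribution table."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    m = p_joint > 0
    return float(np.sum(p_joint[m] * np.log(p_joint[m] / (pa * pb)[m])))

def ib_point(beta, n_t=4, iters=500):
    """Iterate the self-consistent IB equations; return (I(X;T), I(T;Y))."""
    enc = rng.random((8, n_t))                     # random soft encoder p(t|x)
    enc /= enc.sum(axis=1, keepdims=True)
    for _ in range(iters):
        p_t = np.clip(p_x @ enc, 1e-12, None)
        p_x_given_t = enc * p_x[:, None] / p_t[None, :]
        p_y_given_t = np.clip(p_x_given_t.T @ p_y_given_x, 1e-12, None)
        kl = np.sum(p_y_given_x[:, None, :] *
                    np.log(p_y_given_x[:, None, :] / p_y_given_t[None, :, :]),
                    axis=2)                        # D_KL[p(y|x) || p(y|t)]
        enc = p_t[None, :] * np.exp(-beta * kl)
        enc /= enc.sum(axis=1, keepdims=True)
    p_xt = p_x[:, None] * enc
    p_ty = p_xt.T @ p_y_given_x
    return mi(p_xt), mi(p_ty)

for beta in [0.5, 1, 2, 5, 20, 100]:
    ixt, ity = ib_point(beta)
    print(f"beta={beta:>5}: I(X;T)={ixt:.3f}  I(T;Y)={ity:.3f}")
```

As \(\beta\) grows, the converged points move toward higher \(I(X;T)\) and higher \(I(T;Y)\), sweeping out the information curve; \(I(T;Y)\) never exceeds \(I(X;T)\) or \(I(X;Y)\), by the data-processing inequality on the chain \(Y \rightarrow X \rightarrow T\).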
