Chapter 20: Information Bottleneck & Machine Learning

Part VII: Modern Applications

The Information Bottleneck

Naftali Tishby, Fernando Pereira, and William Bialek (1999) introduced the information bottleneck (IB) as a principled framework for finding an optimal compressed representation \(T\) of an input \(X\) that retains maximum information about a relevant variable \(Y\):

\( \min_{p(t|x)} \; I(X;T) - \beta \cdot I(T;Y) \)

The Lagrange multiplier \(\beta\) controls the tradeoff: high \(\beta\) favors preserving relevance \(I(T;Y)\); low \(\beta\) favors compression, i.e. a small \(I(X;T)\).

The IB method generalizes rate-distortion theory: instead of a hand-crafted distortion measure, the "distortion" is the loss of mutual information with the target \(Y\).
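Both terms in the objective are ordinary mutual informations of discrete joint distributions, so they can be evaluated directly from probability tables. A minimal sketch (the helper name and the toy distribution are illustrative, not from the text):

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats, computed from a joint distribution table p(a, b)."""
    p_a = p_joint.sum(axis=1, keepdims=True)   # marginal p(a)
    p_b = p_joint.sum(axis=0, keepdims=True)   # marginal p(b)
    mask = p_joint > 0                         # skip zero cells (0 log 0 = 0)
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a * p_b)[mask])))

# Example: T is a deterministic copy of a fair bit X, so I(X;T) = ln 2.
p_xt = np.array([[0.5, 0.0],
                 [0.0, 0.5]])
print(mutual_information(p_xt))  # ≈ 0.693 nats (= ln 2)
```

With such a helper, the IB objective for a candidate encoder is just `mutual_information(p_xt) - beta * mutual_information(p_ty)`.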

IB Self-Consistent Equations

The optimal encoder \(p(t|x)\) satisfies a set of self-consistent equations solved iteratively (like the Blahut-Arimoto algorithm):

\( p(t|x) \propto p(t) \exp\!\bigl(-\beta \, D_{KL}[p(y|x) \| p(y|t)]\bigr) \)

\( p(t) = \sum_x p(x) p(t|x) \)

\( p(y|t) = \sum_x p(y|x) p(x|t) \)

The KL divergence \(D_{KL}[p(y|x) \| p(y|t)]\) measures how much information about \(Y\) is lost when \(x\) is mapped to cluster \(t\). Soft assignments allow smooth interpolation along the information curve.
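The three equations translate almost line-for-line into code. A sketch of one full update pass on a toy joint distribution (the sizes, seed, and random \(p(x,y)\) are illustrative), iterated to a fixed point for a single \(\beta\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution p(x, y): |X| = 4, |Y| = 2.
p_xy = rng.random((4, 2))
p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1)                 # p(x)
p_y_given_x = p_xy / p_x[:, None]      # p(y|x)

def ib_step(p_t_given_x, beta):
    """One pass through the three self-consistent IB equations."""
    p_t = p_x @ p_t_given_x                                   # p(t) = sum_x p(x) p(t|x)
    p_x_given_t = p_t_given_x * p_x[:, None] / p_t[None, :]   # Bayes' rule
    p_y_given_t = p_x_given_t.T @ p_y_given_x                 # decoder, shape (|T|, |Y|)
    # D_KL[p(y|x) || p(y|t)] for every (x, t) pair
    kl = np.sum(p_y_given_x[:, None, :] *
                np.log(p_y_given_x[:, None, :] / p_y_given_t[None, :, :]),
                axis=2)
    # New encoder: p(t|x) ∝ p(t) exp(-beta * KL)
    new = p_t[None, :] * np.exp(-beta * kl)
    return new / new.sum(axis=1, keepdims=True)

# Random soft encoder with |T| = 3 clusters; iterate to convergence.
enc = rng.random((4, 3))
enc /= enc.sum(axis=1, keepdims=True)
for _ in range(200):
    enc = ib_step(enc, beta=5.0)
```

Each row of `enc` remains a proper conditional distribution \(p(t|x)\) throughout, which is what makes the soft-assignment interpolation possible.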

Deep Learning as Information Compression

Shwartz-Ziv and Tishby (2017) proposed the Information Bottleneck Theory of Deep Learning: each layer of a neural network compresses the input \(X\) into successive representations \(T_1, T_2, \ldots\), forming a Markov chain:

\( Y \rightarrow X \rightarrow T_1 \rightarrow T_2 \rightarrow \cdots \rightarrow T_L \)

During training, layers first increase \(I(T_\ell;Y)\) (learning phase), then decrease \(I(X;T_\ell)\) (compression phase). This traces a path on the information plane toward the IB optimal curve.

Note: This theory is debated; some results depend on the activation function and estimator used for mutual information. Recent work uses neural mutual information estimators (MINE, CLUB) to study this empirically.
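One common way to place a layer on the information plane is a plug-in estimate after discretizing the layer's activations. The sketch below is one such binning estimator (the function name and bin count are illustrative); as the note above says, results are sensitive to `n_bins`, which is part of why the compression-phase findings are debated:

```python
import numpy as np

def binned_mi(x_labels, activations, n_bins=30):
    """Plug-in estimate of I(X; T_l) after binning continuous activations.

    x_labels: integer id of each input sample (discrete X).
    activations: (n_samples, n_units) layer output T_l.
    """
    # Quantize each unit into equal-width bins, then treat each distinct
    # binned pattern as one symbol t.
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    digitized = np.digitize(activations, edges[1:-1])
    _, t = np.unique(digitized, axis=0, return_inverse=True)
    t = t.ravel()

    # Empirical joint distribution over (x, t), then plug-in MI in nats.
    joint = np.zeros((x_labels.max() + 1, t.max() + 1))
    for xi, ti in zip(x_labels, t):
        joint[xi, ti] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    pt = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * pt)[mask])))

# Sanity check: activations that perfectly separate two classes give ln 2.
x = np.array([0, 0, 1, 1])
acts = np.array([[0.0], [0.0], [1.0], [1.0]])
print(binned_mi(x, acts))  # ≈ 0.693 nats
```

Recording such estimates for every layer at every training epoch yields the trajectories on the information plane described above.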

PAC-Bayes & Generalization

PAC-Bayes bounds connect information theory to generalization in machine learning. For a posterior \(Q\) over hypotheses learned from \(n\) training examples, relative to a data-independent prior \(P\), the generalization gap is bounded with probability at least \(1-\delta\) by:

\( \mathbb{E}_Q[\text{error}] \leq \mathbb{E}_Q[\hat{\text{error}}] + \sqrt{\frac{D_{KL}(Q \| P) + \ln(2n/\delta)}{2n}} \)

The \(D_{KL}(Q\|P)\) term measures how much the posterior has shifted from the prior — equivalently, how much information was extracted from the training data. Models that memorize the training set have large KL, hence poor generalization bounds.
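The bound is cheap to evaluate once the KL term is known. A sketch (the function name and all numbers are illustrative) showing how a larger \(D_{KL}(Q\|P)\) loosens the guarantee at the same empirical error:

```python
import numpy as np

def pac_bayes_bound(emp_error, kl, n, delta=0.05):
    """McAllester-style PAC-Bayes bound on expected error, as in the text:
    empirical error + sqrt((KL(Q||P) + ln(2n/delta)) / (2n))."""
    return emp_error + np.sqrt((kl + np.log(2 * n / delta)) / (2 * n))

# Same training error (5%), same n, different information extracted:
print(pac_bayes_bound(0.05, kl=10.0,   n=10_000))  # modest KL -> tight-ish bound
print(pac_bayes_bound(0.05, kl=5000.0, n=10_000))  # memorization -> loose bound
```

The second call illustrates the closing remark: a model that memorizes the training set pays for it through the KL term, and the bound becomes nearly vacuous.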

Python: Information Bottleneck Curve

The experiment below runs the IB algorithm for multiple \(\beta\) values on a synthetic joint distribution, traces the information plane \(I(X;T)\) vs. \(I(T;Y)\), and visualizes the compression-relevance tradeoff.
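The original server-side script is not reproduced here; the following is a condensed, self-contained sketch of the same experiment (the toy distribution, cluster count, seed, and \(\beta\) grid are illustrative). It prints the information-plane coordinates; plotting `ixt` against `ity` with matplotlib gives the tradeoff curve:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint distribution p(x, y): 8 inputs, 2 labels. Pairs of x share the
# same p(y|x), so a 4-cluster T can compress X without losing relevance.
p1 = np.repeat(rng.random(4), 2)
p_y_given_x = np.stack([p1, 1 - p1], axis=1)       # (8, 2)
p_x = np.full(8, 1 / 8)

def mi(p_joint):
    """Mutual information (nats) of a discrete joint distribution table."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    m = p_joint > 0
    return float(np.sum(p_joint[m] * np.log(p_joint[m] / (pa * pb)[m])))

def ib_point(beta, n_t=4, iters=500):
    """Iterate the self-consistent IB equations; return (I(X;T), I(T;Y))."""
    enc = rng.random((8, n_t))                     # random soft encoder p(t|x)
    enc /= enc.sum(axis=1, keepdims=True)
    for _ in range(iters):
        p_t = np.clip(p_x @ enc, 1e-12, None)
        p_x_given_t = enc * p_x[:, None] / p_t[None, :]
        p_y_given_t = np.clip(p_x_given_t.T @ p_y_given_x, 1e-12, None)
        kl = np.sum(p_y_given_x[:, None, :] *
                    np.log(p_y_given_x[:, None, :] / p_y_given_t[None, :, :]),
                    axis=2)                        # D_KL[p(y|x) || p(y|t)]
        enc = p_t[None, :] * np.exp(-beta * kl)
        enc /= enc.sum(axis=1, keepdims=True)
    p_xt = p_x[:, None] * enc
    p_ty = p_xt.T @ p_y_given_x
    return mi(p_xt), mi(p_ty)

for beta in [0.5, 1, 2, 5, 20, 100]:
    ixt, ity = ib_point(beta)
    print(f"beta={beta:>5}: I(X;T)={ixt:.3f}  I(T;Y)={ity:.3f}")
```

As \(\beta\) grows, the converged points move toward higher \(I(X;T)\) and higher \(I(T;Y)\), sweeping out the information curve; \(I(T;Y)\) never exceeds \(I(X;T)\) or \(I(X;Y)\), by the data-processing inequality on the chain \(Y \rightarrow X \rightarrow T\).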
