Part I Β· Chapter 2

Source Coding Theorem

Entropy is not just an abstract measure of uncertainty β€” it is the precise minimum rate at which a source can be compressed. Shannon's first theorem makes this exact.

1. Lossless Compression: The Fundamental Question

Lossless compression asks: given a source producing symbols \(X_1, X_2, \ldots\) drawn i.i.d. from a distribution \(p(x)\), what is the shortest binary description we can produce such that the original can be recovered exactly?

The naΓ―ve approach is a fixed-length code: assign every symbol the same number of bits. An alphabet of size \(n\) needs\(\lceil \log_2 n \rceil\) bits per symbol. This ignores the probability structure entirely and is wasteful whenever the distribution is non-uniform.

A variable-length code assigns shorter codewords to more probable symbols. Morse code is a classic example: β€œE” (the most common letter) is a single dot, while β€œZ” is dash-dash-dot-dot. Shannon asked: how short can we make the average codeword?

2. The Kraft Inequality

For a code to be decodable symbol-by-symbol without delimiters, it must be prefix-free: no codeword is a prefix of another. Prefix-free codes can be decoded instantaneously. Kraft (1949) proved, and McMillan (1956) later extended to all uniquely decodable codes, that the codeword lengths must satisfy:

\[ \sum_{i=1}^{n} 2^{-\ell_i} \leq 1 \]

where \(\ell_i\) is the length of the codeword for symbol \(i\).

The Kraft inequality is both necessary and sufficient: a set of codeword lengths \(\{\ell_i\}\) corresponds to a valid prefix-free code if and only if the Kraft sum is at most 1.

Geometrically, each codeword of length \(\ell\) β€œuses up” a fraction \(2^{-\ell}\) of the unit interval on the binary tree, and the prefix-free constraint prevents these intervals from overlapping.
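Both directions of the inequality are easy to verify numerically. A minimal sketch, using the example code {0, 10, 110, 111} that reappears later in the chapter:

```python
def kraft_sum(lengths):
    """Kraft sum of a set of codeword lengths over a binary alphabet."""
    return sum(2 ** -l for l in lengths)

def is_prefix_free(codewords):
    """True if no codeword is a proper prefix of another."""
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

# A valid prefix-free code: lengths {1, 2, 3, 3} give a Kraft sum of exactly 1.
code = ["0", "10", "110", "111"]
print(kraft_sum(len(c) for c in code))  # 1.0
print(is_prefix_free(code))             # True

# Lengths {1, 1, 2} violate Kraft: no prefix-free code with these lengths exists.
print(kraft_sum([1, 1, 2]))             # 1.25
```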

3. Shannon's Source Coding Theorem

Theorem (Shannon, 1948)

Let \(X\) be a discrete random variable with entropy \(H(X)\). For any uniquely decodable code with average length \(\bar{\ell} = \sum_i p_i \ell_i\):

\[ H(X) \leq \bar{\ell} \]

Furthermore, there exists a prefix-free code achieving:

\[ H(X) \leq \bar{\ell} < H(X) + 1 \]

The lower bound follows from the Kraft inequality via Lagrange optimization: minimizing \(\bar{\ell} = \sum p_i \ell_i\) subject to \(\sum 2^{-\ell_i} \leq 1\) yields the ideal lengths \(\ell_i^* = -\log_2 p_i\), giving \(\bar{\ell}^* = H(X)\).

The problem is that \(-\log_2 p_i\) is generally not an integer. Rounding up to \(\ell_i = \lceil -\log_2 p_i \rceil\) gives integer codeword lengths that satisfy Kraft (since \(\sum 2^{-\lceil -\log_2 p_i \rceil} \leq \sum p_i = 1\)) and achieve average length strictly less than \(H(X) + 1\).
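The rounding argument can be checked directly. A minimal sketch, using a hypothetical four-symbol distribution:

```python
from math import ceil, log2

def shannon_lengths(probs):
    """Integer codeword lengths l_i = ceil(-log2 p_i)."""
    return [ceil(-log2(p)) for p in probs]

def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs)

p = [0.4, 0.3, 0.2, 0.1]        # hypothetical example distribution
lengths = shannon_lengths(p)
H = entropy(p)
avg = sum(pi * li for pi, li in zip(p, lengths))
kraft = sum(2 ** -l for l in lengths)

print(lengths)           # [2, 2, 3, 4]
print(kraft <= 1)        # True: rounded lengths still satisfy Kraft
print(H <= avg < H + 1)  # True: the source coding bound holds
```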

For block coding over \(n\) symbols at a time, the per-symbol bound tightens to \(H(X) \leq \bar{\ell}/n < H(X) + 1/n\), where \(\bar{\ell}\) is the average length per block, converging to the entropy as \(n \to \infty\).

4. Typical Sequences and the AEP

Shannon's deeper proof of the source coding theorem uses the Asymptotic Equipartition Property (AEP), the information-theoretic law of large numbers.

AEP

\[ -\frac{1}{n} \log_2 p(X_1, \ldots, X_n) \xrightarrow{P} H(X) \]

For large \(n\), most sequences have probability close to \(2^{-nH(X)}\). The set of such typical sequences has probability approaching 1 and contains roughly \(2^{nH(X)}\) elements.
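The convergence is easy to observe empirically. A minimal sketch, using a hypothetical three-symbol source (the printed sample values depend on the random seed):

```python
import random
from math import log2

random.seed(0)
p = {"a": 0.7, "b": 0.2, "c": 0.1}           # hypothetical source distribution
H = -sum(q * log2(q) for q in p.values())    # entropy, ~1.157 bits

symbols, weights = zip(*p.items())
for n in (10, 100, 10_000):
    x = random.choices(symbols, weights=weights, k=n)
    # Empirical rate: -(1/n) log2 p(x_1, ..., x_n) for the sampled sequence
    rate = -sum(log2(p[s]) for s in x) / n
    print(n, round(rate, 3))                 # approaches H as n grows

print(round(H, 3))
```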

This gives an elegant compression scheme:

  1. Enumerate the \(\approx 2^{nH(X)}\) typical sequences.
  2. Assign each a binary index of length \(\lceil nH(X) \rceil + 1\) bits.
  3. Encode typical sequences with this index; atypical sequences with a flag + raw bits.
  4. As \(n \to \infty\), atypical sequences occur with vanishing probability, so the average rate approaches \(H(X)\).
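The scheme above can be sketched exhaustively for a small block length; here a Bernoulli(0.3) source with \(n = 12\) and tolerance \(\varepsilon = 0.2\) (hypothetical parameters chosen for illustration):

```python
from itertools import product
from math import log2

p1 = 0.3                                            # P(X = 1) for a Bernoulli source
H = -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))      # entropy, ~0.881 bits
n, eps = 12, 0.2

def prob(seq):
    """Probability of a binary sequence under the i.i.d. Bernoulli source."""
    k = sum(seq)
    return (p1 ** k) * ((1 - p1) ** (n - k))

# Typical set: sequences whose empirical rate is within eps of H
typical = [s for s in product((0, 1), repeat=n)
           if abs(-log2(prob(s)) / n - H) <= eps]

print(len(typical), 2 ** n)           # a small fraction of all 4096 sequences...
print(sum(prob(s) for s in typical))  # ...but it carries most of the probability
```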

No compression scheme can do better: any code with rate below \(H(X)\) must cause reconstruction errors on a positive fraction of sequences.

5. Preview: Rate-Distortion Theory

The source coding theorem applies to lossless compression. What if we are willing to accept some distortion (as in JPEG images or MP3 audio)?

The rate-distortion function \(R(D)\) gives the minimum rate needed to describe the source with average distortion at most \(D\):

\[ R(D) = \min_{p(\hat{x}|x) : \mathbb{E}[d(X,\hat{X})] \leq D} I(X; \hat{X}) \]

At \(D = 0\) (lossless), \(R(0) = H(X)\), recovering the source coding theorem. For a Gaussian source with squared-error distortion, \(R(D) = \max(0,\, \tfrac{1}{2}\log_2(\sigma^2/D))\); for multi-dimensional Gaussian sources this generalizes to the famous β€œreverse water-filling” result. We will study rate-distortion theory in depth in Part III.
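A minimal sketch of the Gaussian rate-distortion function (variance and distortion values are hypothetical):

```python
from math import log2

def gaussian_rd(variance, D):
    """R(D) of a Gaussian source under squared-error distortion, in bits."""
    return max(0.0, 0.5 * log2(variance / D))

sigma2 = 1.0
for D in (0.01, 0.25, 0.5, 1.0, 2.0):
    print(D, round(gaussian_rd(sigma2, D), 3))
# Beyond D = sigma^2, zero bits suffice: just output the mean.
```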

Fixed-Length vs Variable-Length Coding

Fixed-length code: A (p = 0.50) β†’ 00, B (p = 0.25) β†’ 01, C (p = 0.125) β†’ 10, D (p = 0.125) β†’ 11. Average length = 2.00 bits; entropy = 1.75 bits.

Variable-length (Huffman): A β†’ 0, B β†’ 10, C β†’ 110, D β†’ 111. Average length = 1.75 bits, matching the entropy exactly.

For this dyadic distribution, Huffman coding achieves the entropy bound exactly because all \(-\log_2 p_i\) are integers.
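This can be verified with a small Huffman implementation (a sketch that computes only the codeword lengths, not the codewords themselves):

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Codeword lengths from a binary Huffman tree."""
    # Heap entries: (probability, tie-breaker, symbol indices in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:          # each merge adds one bit to every merged symbol
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

p = [0.5, 0.25, 0.125, 0.125]      # the dyadic distribution from the table above
lengths = huffman_lengths(p)
avg = sum(pi * li for pi, li in zip(p, lengths))
H = -sum(pi * log2(pi) for pi in p)
print(lengths)   # [1, 2, 3, 3]
print(avg, H)    # 1.75 1.75
```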

Python Simulation: Source Coding in Action

Four panels: (1) average code length converging to entropy with block size, (2) Kraft inequality for various code designs, (3) Huffman vs ideal lengths, (4) compression efficiency across code types.
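The full script is not reproduced here, but panel (1) can be sketched in a few lines, assuming a hypothetical three-symbol source:

```python
import heapq
from itertools import product
from math import log2, prod

def huffman_avg_length(probs):
    """Average codeword length of an optimal binary (Huffman) code."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b             # each merge adds one bit, weighted by subtree mass
        heapq.heappush(heap, a + b)
    return total

p = [0.6, 0.3, 0.1]                # hypothetical source distribution
H = -sum(q * log2(q) for q in p)
rates = []
for n in (1, 2, 3, 4):
    # Huffman-code blocks of n symbols; report the per-symbol rate
    block_probs = [prod(c) for c in product(p, repeat=n)]
    rates.append(huffman_avg_length(block_probs) / n)
    print(n, round(rates[-1], 4))  # rate stays within [H, H + 1/n)
print(round(H, 4))
```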
