Channel Capacity
Every noisy channel has a maximum rate, its capacity, below which arbitrarily reliable communication is possible. Above it, errors are unavoidable. This is Shannon's most profound result.
1. Mutual Information
The mutual information \(I(X;Y)\) between a channel input \(X\) and output \(Y\) measures how much knowing \(Y\) reduces uncertainty about \(X\), or equivalently how much information is successfully transmitted:
\[ I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) \]
Alternatively, mutual information is a Kullback-Leibler divergence:
\[ I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)} = D_{\mathrm{KL}}(p(x,y) \,\|\, p(x)p(y)) \]
Since KL divergence is always non-negative, \(I(X;Y) \geq 0\) with equality iff \(X\) and \(Y\) are independent (the channel is completely useless). Mutual information is symmetric: \(I(X;Y) = I(Y;X)\).
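The KL form above translates directly into code. The following is a minimal sketch (the function name and dict-based representation are choices made here, not from the source) that computes \(I(X;Y)\) from a joint distribution and checks it against the BSC value derived later in the section:

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits, from a joint distribution given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # KL divergence between p(x,y) and the product p(x)p(y)
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# BSC with crossover 0.1 and uniform input: p(x,y) = p(x) * p(y|x)
pe = 0.1
joint = {(0, 0): 0.5 * (1 - pe), (0, 1): 0.5 * pe,
         (1, 0): 0.5 * pe,       (1, 1): 0.5 * (1 - pe)}
print(mutual_information(joint))  # 1 - H_b(0.1) ≈ 0.531
```

An independent pair (a uniform joint over all four outcomes) gives exactly 0, matching the equality condition.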
Figure: entropy Venn diagram. Mutual information \(I(X;Y)\) is the "overlap" region: information shared between \(X\) and \(Y\).
2. Channel Capacity
A discrete memoryless channel (DMC) is specified by a conditional distribution \(p(y|x)\) over output alphabet \(\mathcal{Y}\) given input \(\mathcal{X}\). The channel is "memoryless" in that each use is independent of all others.
The channel capacity is defined as:
\[ C = \max_{p(x)} I(X;Y) \]
The maximum is over all input distributions \(p(x)\).
Capacity has units of bits per channel use. It is a property of the channel alone, independent of what we transmit. The maximizing input distribution is the one that best exploits the channel structure.
For symmetric channels (BSC, BEC), the maximum is achieved by the uniform input distribution, which makes the derivation particularly clean.
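The maximization in the capacity definition can be verified numerically. A brute-force sketch (grid search rather than a formal optimizer; the function names are choices made here) confirms that for the BSC the uniform input distribution attains the maximum:

```python
import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

def bsc_mutual_info(p, pe):
    """I(X;Y) = H(Y) - H(Y|X) for a BSC with P(X=1)=p and crossover pe."""
    q = p*(1-pe) + (1-p)*pe           # P(Y=1)
    return binary_entropy(q) - binary_entropy(pe)

pe = 0.1
# Grid search over input distributions p(X=1) = 0, 0.001, ..., 1
best_p = max((i/1000 for i in range(1001)),
             key=lambda p: bsc_mutual_info(p, pe))
print(best_p, bsc_mutual_info(best_p, pe))  # 0.5, ≈ 0.531
```

For asymmetric channels the maximizer is generally not uniform, and iterative schemes such as the Blahut-Arimoto algorithm are used instead of grid search.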
3. Binary Symmetric Channel (BSC)
In the BSC, each transmitted bit is flipped independently with probability \(p_e\). By symmetry, the capacity-achieving input is the uniform distribution \(p(X=0) = p(X=1) = 1/2\), giving:
\[ C_{\text{BSC}}(p_e) = 1 - H_b(p_e) = 1 + p_e \log_2 p_e + (1-p_e)\log_2(1-p_e) \]
Intuition: with uniform inputs the output is also uniform, so \(H(Y) = 1\) bit, but the channel contributes \(H(Y|X) = H_b(p_e)\) bits of noise uncertainty (even knowing the input, the output is still random). Their difference is the information that got through.
Example capacities:
- \(p_e = 0\): \(C = 1\) bit (perfect channel)
- \(p_e = 0.1\): \(C \approx 0.531\) (lossy)
- \(p_e = 0.25\): \(C \approx 0.189\) (very noisy)
- \(p_e = 0.5\): \(C = 0\) (useless: the output is independent of the input)
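The values above can be reproduced with a few lines of Python (a minimal sketch; the function name is a choice made here):

```python
import math

def bsc_capacity(pe):
    """C = 1 - H_b(pe), in bits per channel use."""
    if pe in (0.0, 1.0):
        return 1.0  # H_b vanishes at the endpoints
    return 1.0 + pe*math.log2(pe) + (1-pe)*math.log2(1-pe)

for pe in (0.0, 0.1, 0.25, 0.5):
    print(f"pe={pe}: C={bsc_capacity(pe):.3f}")
```

Note that \(C_{\text{BSC}}(1) = 1\) as well: a channel that flips every bit is perfectly invertible.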
4. Binary Erasure Channel (BEC)
In the BEC, each transmitted bit is either received correctly or declared "erased" (with probability \(\varepsilon\)): the receiver knows when a bit was lost but not what its value was. Unlike the BSC, errors are always detectable. The capacity is:
\[ C_{\text{BEC}}(\varepsilon) = 1 - \varepsilon \]
Derivation: with input distribution \(P(X=1) = p\), the ternary output satisfies \(H(Y) = H_b(\varepsilon) + (1-\varepsilon) H_b(p)\) (by the chain rule: first reveal whether an erasure occurred, then the bit value), while \(H(Y|X) = H_b(\varepsilon)\). Hence \(I(X;Y) = (1-\varepsilon) H_b(p) = (1-\varepsilon) H(X)\), which is maximized at \(p = 1/2\), giving \(C = 1-\varepsilon\).
The BEC is the canonical channel for analyzing iterative codes such as LDPC codes: the erasure pattern is known, so decoding reduces to solving a linear system over \(\mathbb{F}_2\). Polar codes, too, are most easily analyzed on the BEC, where their capacity-achieving behavior is simplest to prove.
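The derivation above can be checked numerically. A minimal sketch (function names chosen here) computes \(I(X;Y)\) for the BEC via the chain-rule decomposition of \(H(Y)\):

```python
import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

def bec_mutual_info(p, eps):
    """I(X;Y) for a BEC: ternary output {0, erasure, 1}, P(X=1)=p."""
    # H(Y) via the chain rule: erasure indicator first, then the bit value
    h_y = binary_entropy(eps) + (1 - eps) * binary_entropy(p)
    h_y_given_x = binary_entropy(eps)
    return h_y - h_y_given_x          # = (1 - eps) * H_b(p)

eps = 0.3
print(bec_mutual_info(0.5, eps))  # maximized at p = 1/2, giving 1 - 0.3 = 0.7
```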
5. Shannon's Noisy Channel Coding Theorem
Theorem (Noisy Channel Coding Theorem; Shannon, 1948)
For a discrete memoryless channel with capacity \(C\):
- Achievability: For any rate \(R < C\) and any \(\epsilon > 0\), there exists a sequence of codes with rate \(\geq R\) and block error probability \(\leq \epsilon\).
- Converse: For any rate \(R > C\), every code sequence has block error probability bounded away from zero.
The proof of achievability uses random coding: pick a codebook by drawing \(2^{nR}\) codewords uniformly from the typical input sequences. With high probability over the choice of codebook, the maximum-likelihood decoder achieves vanishing error as \(n \to \infty\).
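The random-coding argument can be made concrete with a toy Monte Carlo experiment. The following sketch (all parameters here are illustrative choices, not from the source) draws a small random codebook for a BSC and decodes by minimum Hamming distance, which is maximum-likelihood for this channel:

```python
import random

random.seed(0)
n, M, pe, trials = 20, 8, 0.05, 400   # blocklength, codebook size, BSC noise
# Random codebook: M codewords drawn uniformly from {0,1}^n
codebook = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

errors = 0
for _ in range(trials):
    m = random.randrange(M)
    # Pass the codeword through the BSC: flip each bit with probability pe
    received = [bit ^ (random.random() < pe) for bit in codebook[m]]
    # Minimum-distance decoding (maximum-likelihood for the BSC)
    decoded = min(range(M), key=lambda k: hamming(codebook[k], received))
    errors += (decoded != m)
print(errors / trials)
```

Here the rate is \(\log_2 8 / 20 = 0.15\), well below \(C \approx 0.714\), so the empirical block error rate is small even for this unoptimized random code; the theorem says it can be driven to zero as \(n \to \infty\).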
The proof of the converse uses Fano's inequality: if the block error probability \(P_e^{(n)}\) is small, then \(H(M|Y^n) \leq 1 + P_e^{(n)} nR =: n\epsilon_n\) (the message \(M\) is nearly determined by the received sequence), and therefore \(nR = H(M) \leq I(X^n; Y^n) + n\epsilon_n \leq nC + n\epsilon_n\). Letting \(n \to \infty\) with \(P_e^{(n)} \to 0\) gives \(R \leq C\).
What makes this theorem astonishing is its purely existential nature: it guarantees that good codes exist at every rate below \(C\), but does not say how to construct them efficiently. Finding practical capacity-achieving codes (Turbo, LDPC, Polar) took another 45 years.
6. The Converse: Above C, Errors are Unavoidable
The converse to the noisy channel coding theorem is as important as achievability. It says there is a hard wall at capacity: no matter how clever the encoder and decoder, if the transmission rate \(R > C\), there is no escaping a positive error probability.
The key tool is Fano's inequality:
\[ H(X|Y) \leq H_b(P_e) + P_e \log_2(|\mathcal{X}| - 1) \]
This bounds the residual uncertainty about \(X\) given \(Y\) by a function of the error probability \(P_e\). If \(P_e \to 0\), then \(H(X|Y) \to 0\): the receiver recovers the message. The converse then follows by applying data-processing and the definition of capacity.
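Fano's inequality can be verified numerically. A minimal sketch (function names chosen here) compares \(H(X|Y)\) against the Fano bound for a uniform-input BSC, where the optimal guess of \(X\) from \(Y\) errs with probability \(P_e = p_e\):

```python
import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

def cond_entropy(joint):
    """H(X|Y) in bits for a joint distribution {(x, y): prob}."""
    py = {}
    for (x, y), p in joint.items():
        py[y] = py.get(y, 0.0) + p
    return -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0)

def fano_bound(pe, alphabet_size):
    return binary_entropy(pe) + pe * math.log2(alphabet_size - 1)

# BSC, uniform input, crossover 0.1: guessing X = Y errs with P_e = 0.1
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
print(cond_entropy(joint), fano_bound(0.1, 2))
```

For a binary alphabet the \(\log_2(|\mathcal{X}|-1)\) term vanishes and the bound reduces to \(H_b(P_e)\); in this example it holds with equality.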
Together, achievability and the converse give a complete operational characterization of capacity: \(C\) is the supremum of all achievable rates. This is one of the deepest single results in all of engineering science.
Python Simulation: BSC and BEC Capacity
Four panels: (1) BSC capacity vs crossover probability, (2) BEC capacity vs erasure probability, (3) error rate of repetition coding vs code rate on BSC, (4) mutual information I(X;Y) as a function of input distribution for various BSC parameters.
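A self-contained sketch of the computation behind the four panels is below (pure Python, no plotting; the function names and parameter grids are choices made here, and matplotlib calls could be layered on top of the computed arrays):

```python
import math

def hb(p):
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

def bsc_capacity(pe):            # Panel 1: C = 1 - H_b(pe)
    return 1.0 - hb(pe)

def bec_capacity(eps):           # Panel 2: C = 1 - eps
    return 1.0 - eps

def repetition_error(n, pe):     # Panel 3: odd-length repetition, majority vote
    # Block error = P(more than n/2 of the n copies are flipped)
    return sum(math.comb(n, k) * pe**k * (1-pe)**(n-k)
               for k in range(n//2 + 1, n + 1))

def bsc_mi(p, pe):               # Panel 4: I(X;Y) vs input distribution
    q = p*(1-pe) + (1-p)*pe      # P(Y=1)
    return hb(q) - hb(pe)

pes = [i/100 for i in range(51)]
panel1 = [bsc_capacity(pe) for pe in pes]
panel2 = [bec_capacity(e) for e in pes]
panel3 = [(1/n, repetition_error(n, 0.1)) for n in (1, 3, 5, 7, 9, 11)]
panel4 = {pe: [bsc_mi(i/100, pe) for i in range(101)] for pe in (0.0, 0.1, 0.25)}

print(f"C_BSC(0.1) = {panel1[10]:.3f}")   # ≈ 0.531
print(f"C_BEC(0.3) = {panel2[30]:.3f}")   # 0.700
print(f"5-repetition block error at pe=0.1: {panel3[2][1]:.4f}")
```

Panel 3 illustrates the gap the theorem closes: repetition coding drives errors down only by sending the rate to zero, while capacity-approaching codes keep the rate near \(C\).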