Chapter 19: Data Compression

Part VII: Modern Applications

Lossless vs Lossy Compression

Shannon's source coding theorem establishes the fundamental limit: a source with entropy \(H\) bits per symbol cannot be compressed below \(H\)bits per symbol without information loss. Practical algorithms approach this limit.

Lossless

Perfect reconstruction. Limited to removing statistical redundancy. Examples: ZIP, PNG, FLAC.

Lossy

Exploits perceptual limits. Rate-distortion tradeoff. Examples: JPEG, MP3, H.265 video.

Lempel-Ziv: LZ77 & LZ78

Abraham Lempel and Jacob Ziv (1977, 1978) introduced universal compression algorithms that adapt to the source without prior statistics. The Lempel-Ziv theorem guarantees asymptotic optimality: for ergodic sources, LZ achieves the entropy rate.

LZ77: Sliding Window

Maintains a sliding window of past data. Encodes each new symbol as a (offset, length, next-char) triple referencing the best match in the window.

"abracadabra" → (0,0,a),(0,0,b),(0,0,r),(3,1,c),(5,1,d),(7,4,∅)

DEFLATE (used in ZIP and PNG) combines LZ77 with Huffman coding — first remove repetition with LZ77, then encode the resulting tokens with static Huffman codes. Used in every ZIP file and PNG image.

JPEG: DCT + Quantization

JPEG exploits the human visual system's greater sensitivity to low-frequency information. The pipeline:

  1. Block division: 8×8 pixel blocks
  2. DCT: Discrete Cosine Transform maps spatial to frequency domain
  3. Quantization: Divide DCT coefficients by a quality-dependent matrix (lossy step)
  4. Entropy coding: Zigzag scan + run-length encoding + Huffman

\( F(u,v) = \frac{2}{N}\sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f(x,y) \cos\!\left[\frac{(2x+1)u\pi}{2N}\right]\cos\!\left[\frac{(2y+1)v\pi}{2N}\right] \)

MP3 & Video Codecs

MP3 uses a psychoacoustic model to identify audio components below the threshold of audibility — masking effects from nearby louder sounds — and allocates fewer bits (or none) to inaudible components. At 128 kbps, roughly 90% of the original audio data is discarded, yet most listeners perceive minimal quality loss.

Modern video codecs (H.265/HEVC, AV1, H.266/VVC) achieve 50–60× compression over raw video through:

  • Intra-frame: DCT/integer transforms within each frame (like JPEG)
  • Inter-frame: Motion compensation — encode only differences between frames
  • Entropy coding: CABAC (Context-Adaptive Binary Arithmetic Coding)

Python: LZ77 Implementation & Analysis

Implement LZ77 from scratch, compress different data types (English text, repetitive, random, binary), and compare compression ratios with Shannon entropy limits.

Python
script.py172 lines

Click Run to execute the Python code

Code will be executed with Python 3 on the server