audio compression

uncompressed size = sample rate * bit depth * number of channels * duration

e.g. 44,100 samples/second * 16 bits * 2 channels * 200 seconds = 282,240,000 bits

1. How much space does a song take?

2 bytes of dynamic range * 2 channels * 44,100 samples/second * 210 seconds per song

(* 2 2 44100 210)
37044000

2. How much space did the earliest MP3 players have?

128 MB.
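
To put the two numbers side by side, here is a rough Python sketch of how many uncompressed songs fit on such a player. The name `pcm_bytes` and the reading of 1 MB as 2**20 bytes are assumptions made for illustration, not something from these notes.

```python
def pcm_bytes(sample_rate, bytes_per_sample, channels, seconds):
    """Size in bytes of uncompressed PCM audio."""
    return sample_rate * bytes_per_sample * channels * seconds

song = pcm_bytes(44_100, 2, 2, 210)  # same figures as the song above
player = 128 * 2**20                 # assumption: 1 MB = 2**20 bytes

songs_that_fit = player // song
print(song, songs_that_fit)
```

Reading 1 MB as 10**6 bytes instead gives the same whole number of songs.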

3. How can we do better?

3.1. Dynamic range

16 bits of dynamic range is probably overkill.

  • The lowest-order bits are typically garbage.
  • You probably can't hear such minute differences in volume.
  • Your playback equipment probably isn't that precise.
  • Hopefully we have some headroom to avoid clipping.
  • This gives us some number of “free” bits.
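
A minimal sketch of cashing in those “free” bits by simply discarding the lowest-order ones. The choice of 4 bits and the helper names are hypothetical; how many bits are really expendable depends on the material.

```python
def drop_low_bits(sample, drop):
    """Discard the `drop` lowest-order bits of a signed integer sample."""
    return sample >> drop

def restore(sample, drop):
    """Scale back up; the discarded bits come back as zeros."""
    return sample << drop

original = [-20000, -3, 0, 1234, 32767]  # made-up 16-bit samples
drop = 4                                 # keep 12 of the 16 bits

stored = [drop_low_bits(s, drop) for s in original]
decoded = [restore(s, drop) for s in stored]
worst_error = max(abs(a - b) for a, b in zip(original, decoded))
print(stored, decoded, worst_error)
```

Keeping 12 of 16 bits saves a quarter of the space, and the error per sample stays below 2**drop.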

3.1.1. μ-law

  • Human hearing is basically logarithmic: you perceive a multiplicative increase in amplitude as an additive increase in volume.
  • Upshot: small differences in amplitude matter less at higher amplitudes.
  • So we could just throw away a lot of the resolution at higher amplitudes.

    https://en.wikipedia.org/wiki/Mu-law_algorithm
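
A sketch of the idea using the continuous μ-law formula with μ = 255 on samples normalized to [-1, 1]. Note the assumptions: real telephone codecs use a piecewise-linear approximation of this curve rather than the formula itself, and the helper names below are made up for illustration.

```python
import math

MU = 255  # the value used for 8-bit telephone mu-law

def mu_compress(x):
    """Map x in [-1, 1] onto [-1, 1], stretching out small amplitudes."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def encode(x, bits=8):
    """Compress, then round to a signed integer code of the given width."""
    levels = 2 ** (bits - 1) - 1
    return round(mu_compress(x) * levels)

def decode(code, bits=8):
    """Turn an integer code back into an (approximate) amplitude."""
    levels = 2 ** (bits - 1) - 1
    return mu_expand(code / levels)

def step_at(x):
    """Gap between the amplitudes of adjacent codes near amplitude x."""
    code = encode(x)
    return decode(code + 1) - decode(code)

print(step_at(0.01), step_at(0.9))
```

The gap between adjacent codes is much wider near full scale than near silence, which is exactly the resolution being thrown away.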

3.2. Stereo

  • In most songs, the two channels are going to be highly correlated.
  • We can instead encode them as a single channel plus a difference side-channel.
  • The difference channel is going to be mostly low-amplitude, so it can be highly compressed.
  • Optimistically, this gets us 2x compression.
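
One way to sketch the single-channel-plus-difference idea on integer samples. The function names are hypothetical. The trick of recovering the lost low bit of the sum from the difference works because left + right and left - right always have the same parity.

```python
def to_mid_side(left, right):
    """Replace (left, right) with an average and a difference."""
    mid = (left + right) >> 1  # floor of the average; drops the low bit of the sum
    side = left - right
    return mid, side

def from_mid_side(mid, side):
    """Exactly undo to_mid_side."""
    total = (mid << 1) | (side & 1)  # restore the dropped low bit
    left = (total + side) >> 1
    right = (total - side) >> 1
    return left, right

pairs = [(1000, 998), (-3, 4), (32767, -32768), (0, 0), (-5, -5)]
for left, right in pairs:
    mid, side = to_mid_side(left, right)
    print((left, right), "->", (mid, side))
```

For the highly correlated pair (1000, 998) the side value is just 2, which is the kind of low-amplitude data that compresses well.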

3.2.1. Downsampling

Maybe some applications don't need 44,100 samples/second.

Plain ol' telephone service (POTS):

  • downsample to 8000 samples/second
  • μ-law encode to 8 bits
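
The arithmetic for these two steps, as a small sketch. A telephone line is treated as a single (mono) channel; the function name is made up.

```python
def bit_rate(sample_rate, bits_per_sample, channels):
    """Bits per second of uncompressed audio."""
    return sample_rate * bits_per_sample * channels

cd = bit_rate(44_100, 16, 2)  # the CD-style figures used earlier
pots = bit_rate(8_000, 8, 1)  # downsampled, mu-law encoded, mono
ratio = cd / pots
print(cd, pots, ratio)
```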

3.3. Information theory

Information theory tells us that

COMPRESSIBLE = NOT RANDOM
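
A quick way to see this in practice is to hand a general-purpose compressor one second of random samples and one second of a pure tone. This is only a sketch: `zlib` knows nothing about audio, and the 440 Hz tone, the amplitude, and the seed are arbitrary choices.

```python
import math
import random
import struct
import zlib

N = 44_100  # one second of mono samples

rng = random.Random(0)
noise = [rng.randint(-32768, 32767) for _ in range(N)]
tone = [round(10_000 * math.sin(2 * math.pi * 440 * t / 44_100)) for t in range(N)]

def packed(samples):
    """Pack samples as little-endian signed 16-bit integers."""
    return struct.pack(f"<{len(samples)}h", *samples)

def ratio(samples):
    """Compressed size divided by raw size (smaller is better)."""
    raw = packed(samples)
    return len(zlib.compress(raw, 9)) / len(raw)

noise_ratio = ratio(noise)
tone_ratio = ratio(tone)
print(noise_ratio, tone_ratio)
```

The noise should come out essentially uncompressed, while the tone shrinks considerably.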

3.3.1. Is real-world audio random?

  • Random samples constitute “noise”; most real-world audio is not noise, so it is not random.
  • Theoretically, this means that we ought to be able to produce a model that predicts sample values using less data than the samples themselves.
  • (Probably not perfectly.)
  • (And probably in small chunks.)
  • If the model is good enough, we can just forget the original signal.
  • Otherwise, we can store the model + the residue (difference between model and signal).

3.3.2. Time domain

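Maybe we just fit a curve (splines, say) to the samples and store the residue?

This is roughly how lossless compression schemes like FLAC work: predict each sample from the previous few, then store the (small) prediction errors compactly.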
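
A sketch of the model-plus-residue idea with about the simplest model available: guess each sample by extending the straight line through the previous two. FLAC's fixed predictors are of this general kind, but this is an illustration under made-up names, not FLAC's actual encoder.

```python
import math

def residuals(samples):
    """Keep the first two samples verbatim, then only the prediction errors."""
    out = list(samples[:2])
    for n in range(2, len(samples)):
        prediction = 2 * samples[n - 1] - samples[n - 2]
        out.append(samples[n] - prediction)
    return out

def reconstruct(res):
    """Rebuild the original samples exactly from the residuals."""
    samples = list(res[:2])
    for n in range(2, len(res)):
        prediction = 2 * samples[n - 1] - samples[n - 2]
        samples.append(res[n] + prediction)
    return samples

# Arbitrary test signal: a 440 Hz tone sampled at 44,100 samples/second.
tone = [round(10_000 * math.sin(2 * math.pi * 440 * t / 44_100)) for t in range(1_000)]
res = residuals(tone)
print(max(abs(s) for s in tone), max(abs(r) for r in res[2:]))
```

The residuals are far smaller than the samples, so they need fewer bits each, and nothing is lost because the decoder can rerun the same predictions.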

3.3.3. Frequency domain

A loud sound will mask quieter sounds at nearby frequencies: you won't hear them (well).

MP3 (and related schemes like Ogg Vorbis) rely heavily on this.

  • Pass a frame of samples through a filter bank.
  • Determine which bands are masking nearby bands.
  • Quantize masked bands more coarsely; the extra quantization noise is hidden by the masking sound.
  • Finally, store the quantized values using Huffman coding.
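
A toy version of the masking step, to make the idea concrete. Everything here is an assumption for illustration: a naive DFT stands in for the filter bank, the rule “a neighbor within 2 bins that is more than 10 times louder masks you” is invented, and masked bins are zeroed outright. A real encoder uses a proper psychoacoustic model.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform (fine for a short frame)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]

def drop_masked(spectrum, reach=2, ratio=10.0):
    """Zero every bin that has a neighbor (within `reach` bins) more than
    `ratio` times louder. Both parameters are made-up illustrative values."""
    n = len(spectrum)
    mags = [abs(c) for c in spectrum]
    kept = []
    for k, c in enumerate(spectrum):
        loudest_neighbor = max(
            mags[(k + d) % n] for d in range(-reach, reach + 1) if d != 0
        )
        kept.append(0j if loudest_neighbor > ratio * mags[k] else c)
    return kept

N = 64
frame = [
    1.00 * math.cos(2 * math.pi * 8 * t / N)     # loud tone in bin 8
    + 0.01 * math.cos(2 * math.pi * 9 * t / N)   # quiet tone right next to it
    + 0.01 * math.cos(2 * math.pi * 20 * t / N)  # equally quiet tone far away
    for t in range(N)
]
kept = drop_masked(dft(frame))
print(abs(kept[8]), abs(kept[9]), abs(kept[20]))
```

The quiet tone next to the loud one is discarded, while the equally quiet tone far away survives.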

https://arstechnica.com/features/2007/10/the-audiofile-understanding-mp3-compression/

Author: Nicholas Coltharp (mail@heraplem.xyz)

Last modified: 2026-06-07 Sun 16:39

