Information Theory and Shannon Entropy: How Information Is Measured
Shannon entropy formula H = -Σp log p explained, with bits of information, channel capacity theorem, Huffman coding for data compression, and error-correcting codes.
The Paper That Created a New Science
Claude Shannon's 1948 paper "A Mathematical Theory of Communication," published in the Bell System Technical Journal, is one of the most consequential scientific works of the 20th century. It answered a question that engineers had struggled with since the invention of the telegraph: what is the fundamental limit of how much information can be transmitted reliably over a noisy channel? Shannon's answer was exact, general, and completely unexpected. He created a new science — information theory — in a single 77-page paper.
What Is a Bit?
Shannon defined information in terms of surprise. A message carries more information if it was less expected. A coin flip has maximum information: the outcome is completely uncertain before the flip. A coin that always lands heads carries zero information — you already know the outcome.
The bit (binary digit) is the unit Shannon chose. One bit is the amount of information that resolves a binary uncertainty — a yes/no question with equal probability of each outcome. A fair coin flip produces exactly 1 bit of information. The choice among four equally likely outcomes produces 2 bits (log2(4) = 2). Among 8 equally likely outcomes: 3 bits.
For a single event with probability p, the information content (surprisal) is −log2(p) bits. A 1-in-8 event carries log2(8) = 3 bits. A near-certain event with probability 0.99 carries −log2(0.99) ≈ 0.0145 bits — almost none.
Shannon Entropy: Measuring Uncertainty
Shannon entropy H measures the average information content of a random source — equivalently, the average uncertainty before observing an outcome. For a discrete random variable with outcomes having probabilities p1, p2, ..., pn:
H = −Σ pi log2(pi)
Properties:
- Maximum entropy: H is maximized when all outcomes are equally probable. A fair 8-sided die has H = log2(8) = 3 bits — each roll is completely unpredictable.
- Zero entropy: H = 0 when one outcome has probability 1. No uncertainty, no information.
- Non-negativity: H ≥ 0 always.
- Additivity: For independent events, entropy adds: rolling two fair dice has entropy H = H(die1) + H(die2).
| Source | Distribution | Shannon Entropy H |
|---|---|---|
| Fair coin | p(H)=0.5, p(T)=0.5 | 1.0 bit |
| Biased coin (p=0.9) | p(H)=0.9, p(T)=0.1 | 0.469 bits |
| Fair 8-sided die | Each p=0.125 | 3.0 bits |
| English text (letter frequency) | Non-uniform 26-letter distribution | ~4.1 bits/letter |
| English text (word context) | Accounting for language structure | ~1.0–1.3 bits/letter |
The difference between the two English text rows illustrates redundancy. English uses only about 1–1.3 bits of information per letter despite the 26-letter alphabet offering up to 4.7 bits/letter in theory. That redundancy — the predictability built into the language — is what makes crossword puzzles solvable and spelling errors detectable.
The Channel Capacity Theorem
Shannon's most surprising result was the noisy-channel coding theorem: for any communication channel with a given noise level, there exists a rate (in bits per second) below which perfectly reliable communication is possible, and above which errors are unavoidable. This rate is the channel capacity C.
For an additive white Gaussian noise (AWGN) channel — the standard model for radio and wire communications — capacity is given by the Shannon-Hartley theorem:
C = B · log2(1 + S/N)
Where B is bandwidth in hertz, S is signal power, and N is noise power. Doubling bandwidth doubles capacity. Quadrupling the signal-to-noise ratio adds only two bits per channel use — demonstrating that throwing more power at a noisy channel faces rapidly diminishing returns.
The theorem is constructive: it guarantees that error-correcting codes can achieve near-zero error rates at any rate below capacity, even on noisy channels. It says nothing about how to construct such codes — finding practical codes that approach the Shannon limit became a 50-year engineering challenge.
Data Compression: Huffman Coding
If English text carries only ~1.3 bits of actual information per letter but is transmitted as 8 bits per ASCII character, it is being transmitted at roughly 6× the necessary rate. Data compression exploits this redundancy.
David Huffman, then an MIT graduate student, invented optimal prefix-free coding in 1952 as a term paper. Huffman coding assigns shorter codewords to more frequent symbols and longer codewords to rarer symbols, minimizing the average codeword length.
- In English, 'e' occurs ~12.7% of the time — Huffman assigns it a short code
- 'z' occurs ~0.07% of the time — Huffman assigns it a long code
- The average code length approaches Shannon entropy H if the symbol probabilities are powers of 1/2; otherwise it is at most H + 1
- Huffman coding underlies JPEG (for the quantized DCT coefficients), MP3 (audio encoding), and ZIP file compression
Arithmetic coding, developed later, achieves even closer approach to the Shannon entropy bound by treating the entire message as a single fractional number rather than encoding symbol-by-symbol.
Error-Correcting Codes: Reliable Communication on Noisy Channels
Shannon's theorem promised reliable communication below capacity but didn't specify how. The engineering solution is error-correcting codes — deliberately adding structured redundancy to data so that errors introduced by the channel can be detected and corrected.
| Code | Year | Method | Application |
|---|---|---|---|
| Hamming code | 1950 | Parity bits at power-of-2 positions | Early computer memory |
| Reed-Solomon code | 1960 | Polynomial evaluation over finite fields | CDs, DVDs, QR codes, deep-space comm |
| Convolutional codes | 1955 | Sequential sliding-window encoding | Mobile communications, satellite |
| Turbo codes | 1993 | Two recursive convolutional codes + iterative decoding | 3G/4G cellular, deep space |
| LDPC codes | 1960/1996 | Sparse parity-check matrix + belief propagation | 5G, Wi-Fi 6, DVB-S2 |
| Polar codes | 2009 | Channel polarization; proven to achieve capacity | 5G NR control channels |
Turbo codes and LDPC codes approached the Shannon limit to within 0.5 dB in the 1990s, solving the engineering challenge 45 years after Shannon defined the theoretical target. Polar codes, introduced by Erdal Arıkan in 2009, were the first codes proven to achieve the Shannon capacity with efficient encoding and decoding — a theoretical milestone. They are now deployed in 5G new radio (NR) control channel coding.
Shannon entropy and channel capacity are not confined to digital communications. The same mathematics describes the efficiency of chemical signaling in cells, the capacity of neural circuits in the brain, and the fundamental limits of storage density in hard drives — wherever information is measured, transmitted, or stored, Shannon's 1948 paper provides the governing equations.
Related Articles
mathematics
Bayesian Inference: Priors, Posteriors, and Updating Beliefs with Data
How Bayes' theorem works, what prior and posterior distributions mean, the base rate neglect problem in medical testing, the Bayesian vs. frequentist debate, and MCMC computation.
9 min read
mathematics
Fermat's Last Theorem: 358 Years From Margin Note to Proof
How Fermat's 1637 claim sat unproven for 358 years until Andrew Wiles secretly spent 7 years connecting elliptic curves, modular forms, and the Shimura-Taniyama-Weil conjecture.
9 min read
mathematics
Gödel's Incompleteness Theorems: The Limits of Mathematical Truth
Gödel's two incompleteness theorems explained, the self-referential proof method, what they mean for formal systems, and their influence on mathematics and computing.
9 min read
mathematics
Prime Number Distribution: From the Theorem to the Riemann Hypothesis
How primes thin out according to the Prime Number Theorem (π(x) ≈ x/ln x), prime gaps, twin prime conjecture, and the Riemann Hypothesis connection, plus the largest known prime in 2024.
9 min read