Information Theory and Shannon Entropy: How Information Is Measured

Shannon entropy formula H = -Σp log p explained, with bits of information, channel capacity theorem, Huffman coding for data compression, and error-correcting codes.

The InfoNexus Editorial TeamMay 23, 20269 min read

The Paper That Created a New Science

Claude Shannon's 1948 paper "A Mathematical Theory of Communication," published in the Bell System Technical Journal, is one of the most consequential scientific works of the 20th century. It answered a question that engineers had struggled with since the invention of the telegraph: what is the fundamental limit of how much information can be transmitted reliably over a noisy channel? Shannon's answer was exact, general, and completely unexpected. He created a new science — information theory — in a single 77-page paper.

What Is a Bit?

Shannon defined information in terms of surprise. A message carries more information if it was less expected. A coin flip has maximum information: the outcome is completely uncertain before the flip. A coin that always lands heads carries zero information — you already know the outcome.

The bit (binary digit) is the unit Shannon chose. One bit is the amount of information that resolves a binary uncertainty — a yes/no question with equal probability of each outcome. A fair coin flip produces exactly 1 bit of information. The choice among four equally likely outcomes produces 2 bits (log2(4) = 2). Among 8 equally likely outcomes: 3 bits.

For a single event with probability p, the information content (surprisal) is −log2(p) bits. A 1-in-8 event carries log2(8) = 3 bits. A near-certain event with probability 0.99 carries −log2(0.99) ≈ 0.0145 bits — almost none.

Shannon Entropy: Measuring Uncertainty

Shannon entropy H measures the average information content of a random source — equivalently, the average uncertainty before observing an outcome. For a discrete random variable with outcomes having probabilities p1, p2, ..., pn:

H = −Σ pi log2(pi)

Properties:

  • Maximum entropy: H is maximized when all outcomes are equally probable. A fair 8-sided die has H = log2(8) = 3 bits — each roll is completely unpredictable.
  • Zero entropy: H = 0 when one outcome has probability 1. No uncertainty, no information.
  • Non-negativity: H ≥ 0 always.
  • Additivity: For independent events, entropy adds: rolling two fair dice has entropy H = H(die1) + H(die2).
SourceDistributionShannon Entropy H
Fair coinp(H)=0.5, p(T)=0.51.0 bit
Biased coin (p=0.9)p(H)=0.9, p(T)=0.10.469 bits
Fair 8-sided dieEach p=0.1253.0 bits
English text (letter frequency)Non-uniform 26-letter distribution~4.1 bits/letter
English text (word context)Accounting for language structure~1.0–1.3 bits/letter

The difference between the two English text rows illustrates redundancy. English uses only about 1–1.3 bits of information per letter despite the 26-letter alphabet offering up to 4.7 bits/letter in theory. That redundancy — the predictability built into the language — is what makes crossword puzzles solvable and spelling errors detectable.

The Channel Capacity Theorem

Shannon's most surprising result was the noisy-channel coding theorem: for any communication channel with a given noise level, there exists a rate (in bits per second) below which perfectly reliable communication is possible, and above which errors are unavoidable. This rate is the channel capacity C.

For an additive white Gaussian noise (AWGN) channel — the standard model for radio and wire communications — capacity is given by the Shannon-Hartley theorem:

C = B · log2(1 + S/N)

Where B is bandwidth in hertz, S is signal power, and N is noise power. Doubling bandwidth doubles capacity. Quadrupling the signal-to-noise ratio adds only two bits per channel use — demonstrating that throwing more power at a noisy channel faces rapidly diminishing returns.

The theorem is constructive: it guarantees that error-correcting codes can achieve near-zero error rates at any rate below capacity, even on noisy channels. It says nothing about how to construct such codes — finding practical codes that approach the Shannon limit became a 50-year engineering challenge.

Data Compression: Huffman Coding

If English text carries only ~1.3 bits of actual information per letter but is transmitted as 8 bits per ASCII character, it is being transmitted at roughly 6× the necessary rate. Data compression exploits this redundancy.

David Huffman, then an MIT graduate student, invented optimal prefix-free coding in 1952 as a term paper. Huffman coding assigns shorter codewords to more frequent symbols and longer codewords to rarer symbols, minimizing the average codeword length.

  • In English, 'e' occurs ~12.7% of the time — Huffman assigns it a short code
  • 'z' occurs ~0.07% of the time — Huffman assigns it a long code
  • The average code length approaches Shannon entropy H if the symbol probabilities are powers of 1/2; otherwise it is at most H + 1
  • Huffman coding underlies JPEG (for the quantized DCT coefficients), MP3 (audio encoding), and ZIP file compression

Arithmetic coding, developed later, achieves even closer approach to the Shannon entropy bound by treating the entire message as a single fractional number rather than encoding symbol-by-symbol.

Error-Correcting Codes: Reliable Communication on Noisy Channels

Shannon's theorem promised reliable communication below capacity but didn't specify how. The engineering solution is error-correcting codes — deliberately adding structured redundancy to data so that errors introduced by the channel can be detected and corrected.

CodeYearMethodApplication
Hamming code1950Parity bits at power-of-2 positionsEarly computer memory
Reed-Solomon code1960Polynomial evaluation over finite fieldsCDs, DVDs, QR codes, deep-space comm
Convolutional codes1955Sequential sliding-window encodingMobile communications, satellite
Turbo codes1993Two recursive convolutional codes + iterative decoding3G/4G cellular, deep space
LDPC codes1960/1996Sparse parity-check matrix + belief propagation5G, Wi-Fi 6, DVB-S2
Polar codes2009Channel polarization; proven to achieve capacity5G NR control channels

Turbo codes and LDPC codes approached the Shannon limit to within 0.5 dB in the 1990s, solving the engineering challenge 45 years after Shannon defined the theoretical target. Polar codes, introduced by Erdal Arıkan in 2009, were the first codes proven to achieve the Shannon capacity with efficient encoding and decoding — a theoretical milestone. They are now deployed in 5G new radio (NR) control channel coding.

Shannon entropy and channel capacity are not confined to digital communications. The same mathematics describes the efficiency of chemical signaling in cells, the capacity of neural circuits in the brain, and the fundamental limits of storage density in hard drives — wherever information is measured, transmitted, or stored, Shannon's 1948 paper provides the governing equations.

mathematicsinformation theorycomputer science

Related Articles