Bayesian Probability: How to Update Beliefs With New Evidence
Learn how Bayesian probability provides a mathematical framework for updating beliefs based on evidence, with applications in medicine, machine learning, and law.
A Reverend's Posthumous Revolution
Thomas Bayes never published his most famous work. The Presbyterian minister died in 1761, and his friend Richard Price found the manuscript among his papers. Price edited and submitted it to the Royal Society in 1763. That paper introduced a method for calculating the probability of causes given observed effects, inverting the standard direction of probabilistic reasoning. More than two and a half centuries later, Bayesian methods are used in spam filters, medical diagnostics, self-driving cars, and search engines. The approach Bayes sketched in the 1700s became one of the mathematical foundations of modern artificial intelligence.
The Formula and Its Components
Bayes' theorem expresses the probability of a hypothesis given observed evidence. The formula is: P(H|E) = P(E|H) × P(H) / P(E). Each component carries specific meaning.
| Term | Name | Meaning |
|---|---|---|
| P(H\|E) | Posterior probability | Probability of the hypothesis after observing evidence |
| P(E\|H) | Likelihood | Probability of the evidence if the hypothesis is true |
| P(H) | Prior probability | Probability of the hypothesis before observing evidence |
| P(E) | Marginal likelihood | Total probability of the evidence under all hypotheses |
The theorem is mathematically uncontroversial. It follows directly from the definition of conditional probability. The controversy lies in interpretation — specifically, in what counts as a valid prior probability.
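The derivation takes only a few lines. The sketch below uses the same symbols as the table above and assumes P(E) > 0 and P(H) > 0; the notation ¬H for "the hypothesis is false" is introduced here for illustration.

```latex
\begin{align*}
P(H \mid E) &= \frac{P(H \cap E)}{P(E)}
  && \text{definition of conditional probability} \\
P(E \mid H) &= \frac{P(H \cap E)}{P(H)}
  \;\Longrightarrow\; P(H \cap E) = P(E \mid H)\,P(H)
  && \text{same definition, roles swapped} \\
P(H \mid E) &= \frac{P(E \mid H)\,P(H)}{P(E)}
  && \text{substitute into the first line} \\
P(E) &= P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)
  && \text{law of total probability}
\end{align*}
```

The last line is how the marginal likelihood is usually computed in practice when there are only two competing hypotheses.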
Frequentist vs. Bayesian: Two Philosophies
Statistics has two dominant schools. They disagree about fundamental questions.
Frequentists define probability as the long-run frequency of events. A coin has a 50 percent probability of heads because, over thousands of flips, roughly half will be heads. Frequentists do not assign probabilities to hypotheses or parameters: a parameter has a single fixed value, even if that value is unknown. On this view there is no meaningful sense in which a physical constant has a 95 percent probability of falling in some range.
Bayesians define probability as a degree of belief. Probability quantifies uncertainty about any proposition, including one-time events and unknown parameters. It is perfectly meaningful to say there is a 70 percent probability that it will rain tomorrow or a 90 percent probability that a defendant is guilty. Evidence updates these beliefs systematically through Bayes' theorem.
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability definition | Long-run frequency | Degree of belief |
| Parameters | Fixed but unknown | Random variables with distributions |
| Prior information | Not formally incorporated | Encoded as prior distribution |
| Interval estimates | Confidence interval: 95% of intervals built this way contain the true value | Credible interval: 95% probability the value lies in the interval |
| Small samples | Many standard methods rely on large-sample approximations | Usable with small samples, though results then depend more heavily on the prior |
The Medical Testing Problem
Bayesian reasoning reveals counterintuitive truths about diagnostic testing. Suppose a disease affects 1 in 1,000 people. A test for this disease has 99 percent sensitivity (correctly identifies 99 percent of sick people) and 95 percent specificity (correctly identifies 95 percent of healthy people). A patient tests positive. Intuition suggests they almost certainly have the disease. Bayes' theorem says otherwise.
- Prior probability of disease — 0.001 (1 in 1,000)
- Probability of positive test given disease — 0.99
- Probability of positive test given no disease — 0.05 (false positive rate)
- Total probability of positive test — (0.001 × 0.99) + (0.999 × 0.05) = 0.05094
- Posterior probability of disease given positive test — 0.00099 / 0.05094 ≈ 0.019 or 1.9%
A positive result means the patient has less than a 2 percent chance of actually having the disease. The low base rate overwhelms the test accuracy. In a population of 100,000, about 100 people have the disease and 99 of them test positive, while roughly 4,995 of the 99,900 healthy people also test positive, so false positives outnumber true positives by about 50 to 1. This result has profound implications for mass screening programs, criminal forensics, and any domain where rare events are tested for in large populations.
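The arithmetic in the list above can be checked with a few lines of Python. This is a minimal sketch; the function name `posterior_given_positive` is made up for illustration, and the inputs are the prevalence, sensitivity, and specificity stated in the example.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """Return P(disease | positive test) using Bayes' theorem."""
    false_positive_rate = 1.0 - specificity
    # P(E): total probability of a positive test, summed over sick and healthy people
    p_positive = prevalence * sensitivity + (1.0 - prevalence) * false_positive_rate
    # P(H|E) = P(E|H) * P(H) / P(E)
    return sensitivity * prevalence / p_positive


posterior = posterior_given_positive(prevalence=0.001, sensitivity=0.99, specificity=0.95)
print(f"{posterior:.4f}")  # 0.0194
```

Raising `prevalence` to 0.1 while keeping the same test pushes the posterior well above one half, which is why the same test is far more informative in a high-risk group than in the general population.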
Real-World Applications
Machine Learning and AI
Naive Bayes classifiers use Bayes' theorem to categorize text, detect spam, and analyze sentiment. They are called naive because they assume features (such as the words in a message) are independent given the class, an assumption that rarely holds exactly. Despite this simplification, they often perform well on text classification. Bayesian neural networks assign probability distributions to network weights rather than fixed values, providing uncertainty estimates alongside predictions. This matters in safety-critical applications like autonomous driving and medical diagnosis.
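To make the idea concrete, here is a toy Naive Bayes text classifier in plain Python. It is a sketch under simplifying assumptions, not a production spam filter: the four training messages are invented, words are split on whitespace, and add-one (Laplace) smoothing is one common choice among several for handling words a class has never seen.

```python
import math
from collections import Counter


def train(docs):
    """docs is a list of (label, text) pairs. Returns counts needed for classification."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, text in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(text, label_counts, word_counts, vocab):
    """Return the label with the highest log posterior (up to a shared constant)."""
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior: log P(label)
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no information here
            # Log likelihood of the word given the label, with add-one smoothing
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training data, for illustration only
docs = [
    ("spam", "win cash prize now"),
    ("spam", "cheap prize win win"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lunch on monday"),
]
model = train(docs)
print(classify("win a prize", *model))     # spam
print(classify("monday meeting", *model))  # ham
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long messages, and the sum over words is exactly where the independence assumption enters.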
Legal Reasoning
Courts often reason in an implicitly Bayesian way when weighing evidence. DNA evidence is commonly expressed as a likelihood ratio: the probability of the observed match if the defendant is the source, divided by the probability of the match if an unrelated random person is the source. Forensic statisticians advocate for explicit Bayesian frameworks to prevent common errors like the prosecutor's fallacy, where the rarity of a DNA profile is confused with the probability of innocence.
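The odds form of Bayes' theorem (posterior odds = prior odds × likelihood ratio) shows why the fallacy matters. The sketch below uses hypothetical numbers chosen only for illustration: a profile shared by 1 in 1,000,000 unrelated people, a true source who always matches, and a pool of 500,000 equally plausible sources before the DNA evidence is considered.

```python
def posterior_probability(prior_probability, likelihood_ratio):
    """Odds form of Bayes' theorem: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical: P(match | source) = 1 and P(match | random person) = 1 / 1,000,000
likelihood_ratio = 1_000_000
# Hypothetical prior: the defendant is one of 500,000 equally plausible sources
prior = 1 / 500_000

p_source = posterior_probability(prior, likelihood_ratio)
print(f"{p_source:.3f}")  # 0.667
```

Under these assumed numbers the probability that the defendant is the source is about two in three, not the 999,999 in 1,000,000 that the prosecutor's fallacy would suggest. The answer depends strongly on the prior, which is why forensic statisticians ask for it to be stated explicitly.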
Search and Rescue
The U.S. Coast Guard uses Bayesian search theory to locate missing vessels. Prior probabilities are assigned to grid squares based on last known position, drift patterns, and weather data. Each unsuccessful search updates the probability map, concentrating subsequent efforts on the most likely remaining areas. A Bayesian analysis of this kind helped guide the search that located the wreckage of Air France Flight 447 in 2011, nearly two years after the crash.
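The update after an unsuccessful search is a direct application of Bayes' theorem. The sketch below assumes a made-up three-cell grid and a single detection probability (the chance of spotting the object when searching the cell that contains it); real search planning uses far larger grids and detection models that vary by cell.

```python
def update_after_failed_search(probabilities, searched_cell, detection_probability):
    """Return the updated probability map after searching one cell and finding nothing."""
    # P(nothing found) = 1 - P(object in cell) * P(detect | object in cell)
    p_miss = 1.0 - probabilities[searched_cell] * detection_probability
    updated = []
    for cell, p in enumerate(probabilities):
        if cell == searched_cell:
            # The object could still be here if the search simply missed it
            updated.append(p * (1.0 - detection_probability) / p_miss)
        else:
            # A failed search elsewhere makes every other cell relatively more likely
            updated.append(p / p_miss)
    return updated


prior_map = [0.5, 0.3, 0.2]  # hypothetical prior over three grid cells
posterior_map = update_after_failed_search(prior_map, searched_cell=0, detection_probability=0.8)
print([round(p, 3) for p in posterior_map])  # [0.167, 0.5, 0.333]
```

After one failed search of the most likely cell, the second cell becomes the best place to look next, even though nothing was observed there directly.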
Choosing Priors: The Contentious Step
The prior probability is both Bayesian analysis's greatest strength and its most criticized element. Critics argue that priors inject subjectivity into scientific analysis. Supporters counter that all statistical methods embed assumptions — Bayesian methods simply make them explicit.
- Informative priors: Based on previous research or expert knowledge. A prior for average human body temperature centers on 37°C because centuries of measurement support that value.
- Weakly informative priors: Broad distributions that constrain parameters to physically plausible ranges without strongly favoring specific values.
- Non-informative (flat) priors: Assign equal prior weight to all parameter values, letting the data dominate. On an unbounded range a flat prior is improper (it does not integrate to 1), and a prior that is flat in one parameterization is generally not flat in another.
- Conjugate priors: Chosen for mathematical convenience. The posterior falls in the same distribution family as the prior, which simplifies computation.
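A standard conjugate pair is the Beta prior with binomial data: a Beta(α, β) prior on a success probability, updated with observed successes and failures, gives a Beta(α + successes, β + failures) posterior. The sketch below uses invented data (7 successes, 3 failures) to compare a flat prior with a strongly informative one; the function names are illustrative.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update: Beta prior + binomial data -> Beta posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Invented data: 7 successes and 3 failures
flat_posterior = update_beta(1, 1, successes=7, failures=3)           # flat prior Beta(1, 1)
informative_posterior = update_beta(50, 50, successes=7, failures=3)  # strong prior centered on 0.5

print(flat_posterior, f"{beta_mean(*flat_posterior):.3f}")                # (8, 4) 0.667
print(informative_posterior, f"{beta_mean(*informative_posterior):.3f}")  # (57, 53) 0.518
```

With only ten observations the strong prior keeps the estimate near 0.5, while the flat prior lets the data pull it to about 0.67. As more data arrive, the two posteriors converge.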
From Controversy to Consensus
For most of the twentieth century, Bayesian methods were marginalized in academic statistics. Computational limitations made Bayesian calculations intractable for complex problems. Markov Chain Monte Carlo (MCMC) algorithms, which originated in physics in the 1950s, were widely adopted by statisticians in the 1990s as computing power grew, and that changed the picture. MCMC methods allow computers to approximate posterior distributions for models with thousands of parameters. Software packages like Stan, PyMC, and BUGS made Bayesian analysis accessible to researchers across disciplines.
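A random-walk Metropolis sampler, one of the simplest MCMC algorithms, fits in a few lines. The sketch below is a toy under stated assumptions: it estimates a coin's heads probability from invented data (7 heads, 3 tails) with a flat prior, and the step size, sample count, burn-in length, and seed are arbitrary choices. Packages such as Stan and PyMC use considerably more sophisticated samplers.

```python
import math
import random


def log_posterior(theta, heads, tails):
    """Unnormalized log posterior for a heads probability under a flat prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside the valid range
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(heads, tails, n_samples=20000, step=0.1, seed=0):
    """Random-walk Metropolis sampler; returns the full chain of theta values."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, tails)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, heads, tails)
        # Accept with probability min(1, posterior(proposal) / posterior(theta))
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


chain = metropolis(heads=7, tails=3)
kept = chain[2000:]  # discard early samples as burn-in
estimate = sum(kept) / len(kept)
print(f"posterior mean estimate: {estimate:.3f}")
```

For this simple model the exact posterior is Beta(8, 4) with mean 8/12 ≈ 0.667, so the estimate can be checked against a known answer. The value of MCMC is that the same procedure still works when no closed-form posterior exists.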
Today the frequentist-Bayesian divide has softened considerably. Many statisticians use both approaches depending on the problem. Bayesian methods are widely used in machine learning, signal processing, and other fields where incorporating prior knowledge improves predictions. The framework Thomas Bayes outlined in the eighteenth century has become a standard language for reasoning under uncertainty in the twenty-first.