Bayesian Inference: Priors, Posteriors, and Updating Beliefs with Data

How Bayes' theorem works, what prior and posterior distributions mean, the base rate neglect problem in medical testing, the Bayesian vs. frequentist debate, and MCMC computation.

The InfoNexus Editorial Team | May 23, 2026 | 9 min read

Reverend Bayes' Counterintuitive Gift

Thomas Bayes, an 18th-century English minister, never published the theorem bearing his name. His manuscript "An Essay Towards Solving a Problem in the Doctrine of Chances" was submitted to the Royal Society posthumously in 1763 by his friend Richard Price. The theorem it contained is now the foundation of an entire school of statistical reasoning — and the source of one of the most commonly made errors in medicine, law, and science: ignoring how rare a condition is when interpreting a positive test result.

The Theorem, Stated Plainly

Bayes' theorem relates the probability of a hypothesis given observed data to the probability of the data given the hypothesis:

P(H|D) = [P(D|H) × P(H)] / P(D)

Where:

  • P(H|D) is the posterior — the probability the hypothesis is true given the data
  • P(D|H) is the likelihood — the probability of observing the data if the hypothesis were true
  • P(H) is the prior — the probability the hypothesis was true before seeing the data
  • P(D) is the marginal likelihood — the total probability of the data under all hypotheses
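
For the common case of a single hypothesis and its complement, the denominator can be written out with the law of total probability. The symbol ¬H ("H is false") is notation introduced here for illustration:

```latex
P(D) = P(D \mid H)\,P(H) + P(D \mid \neg H)\,P(\neg H)
\quad\Longrightarrow\quad
P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D \mid H)\,P(H) + P(D \mid \neg H)\,P(\neg H)}
```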

The prior is the philosophically contested element. Where does it come from? Bayesians answer: from previous knowledge, expert judgment, or principled reasoning about the problem. Frequentists answer: it should not exist — a probability that reflects subjective belief rather than long-run frequency has no place in science.

Base Rate Neglect: The Medical Testing Example

The most accessible demonstration of Bayesian reasoning involves medical screening tests. Consider a disease affecting 1% of a population. A diagnostic test has 99% sensitivity (it correctly flags 99% of people who have the disease) and 99% specificity (it correctly clears 99% of people who do not). Someone tests positive. What is the probability they actually have the disease?

Most people answer "99%." The correct answer, via Bayes' theorem, is 50%.

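| Population of 10,000 | Has Disease        | Does Not Have Disease | Total  |
|----------------------|--------------------|-----------------------|--------|
| Test Positive        | 99 (true positive) | 99 (false positive)   | 198    |
| Test Negative        | 1 (false negative) | 9,801 (true negative) | 9,802  |
| Total                | 100                | 9,900                 | 10,000 |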

Of 198 positive tests, only 99 are true positives, a positive predictive value of 50%. The 1% disease prevalence (the prior) overwhelms even a 99%-accurate test when the condition is rare. This is why screening healthy populations for rare conditions, even with highly accurate tests, generates large numbers of false positives, and why confirmatory testing is standard clinical protocol.
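
The arithmetic in the table can be reproduced in a few lines. The sketch below is illustrative only: the function name is made up for this example, and the second step assumes a confirmatory test whose errors are independent of the first, which real tests may not satisfy.

```python
def positive_predictive_value(prior: float, sensitivity: float, specificity: float) -> float:
    """Return P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior                    # P(positive | disease) * P(disease)
    false_pos = (1.0 - specificity) * (1.0 - prior)   # P(positive | healthy) * P(healthy)
    return true_pos / (true_pos + false_pos)


# Figures from the example: 1% prevalence, 99% sensitivity, 99% specificity.
after_one_test = positive_predictive_value(0.01, 0.99, 0.99)

# Assumption: a second test with errors independent of the first.
# The posterior from the first test becomes the prior for the second.
after_two_tests = positive_predictive_value(after_one_test, 0.99, 0.99)

print(f"After one positive test:  {after_one_test:.2f}")   # 0.50
print(f"After two positive tests: {after_two_tests:.2f}")  # 0.99
```

Feeding the first posterior back in as the new prior is the "updating beliefs with data" step: under the independence assumption, a second positive result moves the probability from 50% to 99%.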

The mathematical principle generalizes beyond medicine. Conviction rates for rare crimes, attribution of authorship of disputed texts, and the interpretation of scientific results all require accounting for prior probabilities to avoid systematically overestimating the likelihood of rare events.

Prior Distributions: Subjective but Not Arbitrary

In full Bayesian analysis, beliefs are represented not as single probability values but as probability distributions over all possible values of a parameter. These prior distributions encode uncertainty in a structured way. Several approaches exist:

  • Informative priors: Incorporate genuine domain knowledge. A prior for human height centered at 170 cm with narrow spread reflects real biological knowledge, not mere opinion.
  • Weakly informative priors: Constrain implausible values (negative heights, impossibly large effects) without strongly favoring specific values. Recommended by statistician Andrew Gelman as a principled default.
  • Non-informative/flat priors: Attempt to express complete prior ignorance. These are philosophically problematic — a flat prior on a parameter often becomes non-flat when the parameter is transformed.
  • Jeffreys prior: A non-informative prior proportional to the square root of the determinant of the Fisher information matrix, which makes it invariant under reparameterization. Widely used as a technically principled default.
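
The transformation problem with flat priors is easy to see numerically. The following sketch uses a hypothetical reparameterization, theta = p squared, chosen only because it is simple; any nonlinear transformation would show the same effect.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Flat prior on a probability p: every value in [0, 1] is equally likely.
p_samples = [random.random() for _ in range(100_000)]

# Re-express the same belief in terms of theta = p**2 (hypothetical reparameterization).
theta_samples = [p * p for p in p_samples]

# If theta were also flat on [0, 1], about 25% of draws would fall below 0.25.
frac_below = sum(t < 0.25 for t in theta_samples) / len(theta_samples)
print(f"Share of theta draws below 0.25: {frac_below:.3f}")
```

Because theta is below 0.25 exactly when p is below 0.5, roughly half the draws land there instead of a quarter. "Knowing nothing" about p therefore implies a definite preference for small values of theta.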

Bayesian vs. Frequentist: A Practical Comparison

| Dimension            | Bayesian                                      | Frequentist                                     |
|----------------------|-----------------------------------------------|-------------------------------------------------|
| What is probability? | Degree of belief; can apply to single events  | Long-run frequency of repeatable events         |
| Parameters           | Random variables with distributions           | Fixed unknown constants                         |
| Data                 | Fixed (what was observed)                     | One sample from infinite hypothetical samples   |
| Inference            | Posterior distribution of parameter           | Point estimates and confidence intervals        |
| Hypothesis testing   | Bayes factors; posterior odds                 | p-values; null hypothesis significance testing  |
| Prior required?      | Yes (explicitly specified)                    | No (considered a weakness of Bayesian approach) |

The debate is not merely academic. A 95% Bayesian credible interval directly means: "there is 95% probability the parameter lies in this range." A 95% frequentist confidence interval means: "if we repeated this experiment many times, 95% of constructed intervals would contain the true parameter." The second statement is frequently misinterpreted as the first — a significant source of error in scientific communication.
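
The repeated-experiment reading of a confidence interval can be checked by simulation. This is a minimal sketch under simplifying assumptions: normally distributed data with a known standard deviation, and a true mean, sample size, and repeat count that are arbitrary values picked for illustration.

```python
import random
import statistics

random.seed(1)

TRUE_MEAN = 10.0   # fixed but "unknown" parameter (hypothetical value)
SIGMA = 2.0        # standard deviation, assumed known
N = 25             # observations per experiment
REPEATS = 5_000    # number of repeated experiments
Z_95 = 1.96        # two-sided 95% normal quantile

half_width = Z_95 * SIGMA / N ** 0.5
covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    mean = statistics.fmean(sample)
    # Each experiment yields a different interval; the parameter never moves.
    if mean - half_width <= TRUE_MEAN <= mean + half_width:
        covered += 1

coverage = covered / REPEATS
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share comes out close to 0.95, but that figure describes the interval-building procedure across many experiments. It says nothing directly about whether any one computed interval contains the parameter, which is the claim a credible interval makes.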

MCMC: Making Bayesian Computation Practical

For simple problems with conjugate priors (mathematically compatible prior-likelihood pairs), Bayesian posteriors have closed-form solutions. For real-world models with many parameters and complex structure, the posterior distribution must be approximated numerically.
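
A standard conjugate pair is the Beta prior with a binomial likelihood, where updating reduces to addition. The data below (7 successes in 10 trials) and the flat Beta(1, 1) prior are made-up values for illustration, and the credible interval is approximated from random draws rather than computed exactly.

```python
import random

random.seed(2)

# Hypothetical data: 7 successes in 10 trials.
successes, trials = 7, 10

# Beta(1, 1) prior (flat on [0, 1]); Beta is conjugate to the binomial likelihood.
prior_a, prior_b = 1.0, 1.0

# Conjugate update: add successes to a, failures to b.
post_a = prior_a + successes
post_b = prior_b + (trials - successes)
post_mean = post_a / (post_a + post_b)

# Approximate 95% credible interval from posterior draws (standard library only).
draws = sorted(random.betavariate(post_a, post_b) for _ in range(50_000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]

print(f"Posterior: Beta({post_a:g}, {post_b:g}), mean {post_mean:.3f}")
print(f"Approx. 95% credible interval: ({lower:.2f}, {upper:.2f})")
```

No numerical integration is needed here: the posterior is Beta(8, 4) by inspection. Models without this kind of structure are the ones that need the sampling methods described next.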

Markov Chain Monte Carlo (MCMC) methods sample from the posterior distribution without computing it analytically. The idea: construct a Markov chain whose stationary distribution is the posterior. Run the chain long enough, and the samples approximate the posterior to arbitrary precision.

  • Metropolis-Hastings (1953/1970): The original general MCMC algorithm. Proposes candidate samples and accepts or rejects each one based on the ratio of posterior densities at the candidate and current points, corrected for any asymmetry in the proposal. Simple to implement; slow to converge in high dimensions.
  • Gibbs sampling: Samples each parameter conditional on all others. Requires conditional distributions in closed form; efficient when they are available.
  • Hamiltonian Monte Carlo (HMC): Uses gradient information to make efficient proposals in high dimensions. The basis of Stan, a widely used probabilistic programming language.
  • No-U-Turn Sampler (NUTS): An adaptive version of HMC that automatically tunes step sizes and trajectory lengths. Default in Stan and PyMC, enabling Bayesian analysis of complex models without hand-tuning.
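
To make the accept/reject idea concrete, here is a minimal random-walk Metropolis sketch (the symmetric-proposal special case of Metropolis-Hastings). It is not how Stan or PyMC work internally. The target is a coin-bias posterior with a flat prior and hypothetical data, chosen because its exact answer is known, so the sampler can be checked. The step size, chain length, and burn-in are untuned guesses.

```python
import math
import random

random.seed(3)

SUCCESSES, TRIALS = 7, 10  # hypothetical data


def log_posterior(p: float) -> float:
    """Unnormalized log posterior: binomial likelihood times a flat prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return -math.inf  # zero prior mass outside (0, 1)
    return SUCCESSES * math.log(p) + (TRIALS - SUCCESSES) * math.log(1.0 - p)


def metropolis(n_steps: int, step_size: float, start: float) -> list:
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    current, current_lp = start, log_posterior(start)
    chain = []
    for _ in range(n_steps):
        proposal = current + random.gauss(0.0, step_size)
        proposal_lp = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio); P(D) cancels in the ratio.
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if random.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        chain.append(current)  # a rejected proposal repeats the current value
    return chain


chain = metropolis(n_steps=20_000, step_size=0.2, start=0.5)
kept = chain[2_000:]  # discard early samples as burn-in
mcmc_mean = sum(kept) / len(kept)
exact_mean = (SUCCESSES + 1) / (TRIALS + 2)  # mean of the exact Beta(8, 4) posterior

print(f"MCMC estimate of posterior mean: {mcmc_mean:.3f}")
print(f"Exact posterior mean:            {exact_mean:.3f}")
```

The sampler only ever evaluates the unnormalized posterior, so the marginal likelihood P(D) never has to be computed. That is what makes the approach usable when no closed form exists.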

Modern probabilistic programming — Stan, PyMC, TensorFlow Probability, Pyro — has made Bayesian analysis accessible to practitioners without deep statistical expertise. Hierarchical models, time series analysis, and causal inference frameworks that once required specialized mathematical derivation can now be specified in tens of lines of code and sampled via NUTS in minutes. This computational democratization is arguably the most practically significant development in applied statistics since the introduction of the general linear model.
