Bayesian Statistics: Updating Beliefs with Evidence
Bayesian statistics treats probability as a degree of belief and uses Bayes' theorem to update prior beliefs with evidence, producing posterior distributions used in science, AI, and medicine.
Probability as a State of Knowledge
A drug company tests a new compound. The trial shows a positive result. How confident should regulators be that the drug actually works? Classical frequentist statistics gives one answer. Bayesian statistics gives a different, arguably more intuitive one — by explicitly incorporating prior knowledge about how often drug candidates actually work, and updating that knowledge with the trial evidence.
Bayesian statistics treats probability not as a long-run frequency of events but as a quantified degree of belief — a measure of how confident a reasoning agent is that something is true, given available information. This philosophical difference has profound practical consequences. It allows statisticians to make direct probability statements about hypotheses ("there is a 94% probability the drug is effective"), include prior information systematically, and update beliefs incrementally as new data arrives — in a framework grounded in mathematics rather than informal judgment.
The Bayesian Framework
The entire framework rests on one equation — Bayes' theorem — applied to the relationship between data and hypotheses:
P(θ | data) = P(data | θ) × P(θ) / P(data)
Where:
- P(θ): Prior probability — beliefs about parameter θ before observing data; encodes existing knowledge or uncertainty
- P(data | θ): Likelihood — the probability of observing the data given parameter θ; comes from the statistical model
- P(data): Marginal likelihood (evidence) — a normalizing constant ensuring the posterior integrates to 1
- P(θ | data): Posterior probability — updated beliefs about θ after incorporating the observed data; the goal of Bayesian analysis
The posterior combines prior and likelihood multiplicatively. A strong likelihood can overwhelm a diffuse prior, and a strong prior can moderate an extreme likelihood. This balance mirrors how scientific knowledge accumulates: as data accumulate, the posterior concentrates around the true parameter value largely regardless of the initial prior, provided the prior assigns that value nonzero probability and standard regularity conditions hold.
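To make the multiplication and normalization concrete, here is a minimal sketch that applies Bayes' theorem on a discrete grid of θ values. The binomial model, the flat prior, and the data (7 successes in 10 trials, then 70 in 100) are illustrative assumptions, not figures from a real study; the function names are hypothetical.

```python
import math

def grid_posterior(successes, trials, prior, grid):
    """Posterior over a discrete grid of theta values for binomial data."""
    # Likelihood P(data | theta) at each grid point.
    likelihood = [
        math.comb(trials, successes) * t**successes * (1 - t) ** (trials - successes)
        for t in grid
    ]
    # Numerator of Bayes' theorem: likelihood times prior.
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    # P(data): a sum over the grid stands in for the integral over theta.
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]

def posterior_sd(posterior, grid):
    """Standard deviation of theta under a grid posterior."""
    mean = sum(p * t for p, t in zip(posterior, grid))
    var = sum(p * (t - mean) ** 2 for p, t in zip(posterior, grid))
    return math.sqrt(var)

grid = [i / 100 for i in range(101)]       # theta = 0.00, 0.01, ..., 1.00
flat_prior = [1 / len(grid)] * len(grid)   # assumed flat prior

small = grid_posterior(7, 10, flat_prior, grid)     # assumed data: 7 of 10
large = grid_posterior(70, 100, flat_prior, grid)   # assumed data: 70 of 100

print(posterior_sd(small, grid), posterior_sd(large, grid))
```

Both posteriors peak at θ = 0.7, but the one built from more data has a visibly smaller standard deviation, which is the concentration effect described above.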
Choosing Priors
The prior distribution is simultaneously the most powerful and most controversial aspect of Bayesian analysis. Critics argue it introduces subjectivity; proponents argue it provides a principled mechanism for incorporating genuine prior knowledge and that all statistical analyses make assumptions, whether explicit or not.
| Prior Type | Description | When to Use |
|---|---|---|
| Informative prior | Reflects genuine prior knowledge (e.g., from previous studies) | When reliable prior information exists |
| Weakly informative prior | Provides some regularization without strongly constraining posterior | Default choice in many practical analyses |
| Non-informative (flat) prior | Assigns roughly equal probability to all parameter values | When prior knowledge is genuinely absent |
| Conjugate prior | From a family where prior × likelihood yields posterior in same family | Analytical convenience; enables closed-form solutions |
| Jeffreys prior | Invariant to reparameterization; based on Fisher information | Objective Bayesian analysis |
Conjugate priors simplify computation dramatically. For a binomial likelihood (counts of successes), the Beta distribution is the conjugate prior, yielding a Beta posterior. For a normal likelihood with known variance, a normal prior on the mean is conjugate. These analytically tractable cases allow exact Bayesian inference without numerical methods, though they apply only to a limited set of models.
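The Beta-binomial case can be sketched in a few lines: with a Beta(α, β) prior and binomial data, the posterior is Beta(α + successes, β + failures). The counts below are made-up illustrative data, and the helper names are hypothetical.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Closed-form conjugate update: Beta prior + binomial data -> Beta posterior."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Start from a flat Beta(1, 1) prior (an assumption for this example).
a, b = beta_binomial_update(1.0, 1.0, successes=7, failures=3)   # Beta(8, 4)

# Updating again with a second batch gives the same result as one combined update.
a2, b2 = beta_binomial_update(a, b, successes=5, failures=5)     # Beta(13, 9)
batch = beta_binomial_update(1.0, 1.0, successes=12, failures=8)

print((a, b), beta_mean(a, b))
print((a2, b2) == batch)
```

The last line illustrates incremental updating: yesterday's posterior serves as today's prior, and the order in which the data arrive does not change the final answer.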
Bayesian vs. Frequentist Inference
The two schools of statistical thought produce different answers to similar questions, and the differences matter in practice:
| Concept | Frequentist | Bayesian |
|---|---|---|
| Probability | Long-run frequency of repeatable events | Degree of belief; applies to unique events |
| Parameter status | Fixed unknown constant | Random variable with a distribution |
| Uncertainty interval | Confidence interval: 95% of such intervals contain the true parameter | Credible interval: 95% posterior probability that parameter lies in interval |
| Hypothesis testing | p-value: probability of data at least this extreme, assuming H₀ is true | Bayes factor: ratio of evidence for H₁ vs H₀ |
| Prior information | Not formally incorporated | Explicitly incorporated via prior distribution |
The frequentist confidence interval is routinely misinterpreted as a Bayesian credible interval. A 95% confidence interval does not mean there is a 95% probability the true parameter lies in that specific interval — it means the procedure, if repeated many times, would produce intervals containing the true parameter in 95% of cases. The Bayesian credible interval is the direct probability statement most users want: given the data and prior, there is a 95% posterior probability the parameter lies in this range.
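As a rough illustration of a credible interval, the sketch below draws samples from a Beta(8, 4) posterior (what a flat prior and an assumed 7 successes in 10 trials would give) and reads off the central 95% of them. The sampling approach, sample size, and seed are choices made for this example only.

```python
import random

def beta_credible_interval(alpha, beta, mass=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval for Beta(alpha, beta), estimated by sampling."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1 - mass) / 2
    # Empirical quantiles at the lower and upper tail cut-offs.
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return lower, upper

lower, upper = beta_credible_interval(8, 4)
print(f"95% credible interval for theta: ({lower:.2f}, {upper:.2f})")
```

The resulting interval should land in the neighborhood of 0.39 to 0.89, and it supports the direct reading: given this prior and these data, θ lies in the interval with about 95% posterior probability.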
Markov Chain Monte Carlo: Making Bayesian Analysis Practical
For all but the simplest models, computing the posterior analytically is intractable. The marginal likelihood P(data), the normalizing constant, requires integrating the likelihood times the prior over all possible parameter values, often a high-dimensional integral with no closed form.
Markov Chain Monte Carlo (MCMC) methods solve this computationally. Rather than computing the posterior distribution analytically, MCMC generates samples from it by constructing a Markov chain that has the posterior as its stationary distribution. As the chain runs, samples accumulate from the posterior, enabling estimation of any posterior quantity (means, credible intervals, predictive distributions) from the empirical distribution of samples.
- Metropolis-Hastings algorithm (1953, 1970): The foundational MCMC method; proposes moves through parameter space and accepts or rejects each move probabilistically to ensure samples come from the correct target distribution
- Gibbs sampling: A special case applicable when full conditional distributions (the distribution of each parameter given all others and the data) are available in closed form; samples each parameter in turn
- Hamiltonian Monte Carlo (HMC): Uses gradients of the log posterior to simulate Hamiltonian dynamics, borrowed from physics, so that proposals can travel far through parameter space and still be accepted with high probability; this dramatically reduces autocorrelation in chains; implemented in Stan and PyMC
- No-U-Turn Sampler (NUTS): An adaptive extension of HMC that automatically tunes the length of each simulated trajectory, removing a setting that otherwise needs hand-tuning; the default sampler in Stan
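The propose-then-accept-or-reject loop at the heart of these methods can be sketched with a random-walk Metropolis sampler, the special case of Metropolis-Hastings with a symmetric proposal. The target here is an assumed binomial posterior with a flat prior (7 successes, 3 failures); the step size, chain length, and burn-in are arbitrary illustrative choices rather than tuned values.

```python
import math
import random

def log_unnormalized_posterior(theta, successes=7, failures=3):
    """Log of likelihood x flat prior; the normalizing constant is never needed."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside (0, 1)
    return successes * math.log(theta) + failures * math.log(1.0 - theta)

def metropolis(log_target, start, steps, step_size, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = start
    current_logp = log_target(current)
    samples = []
    for _ in range(steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_logp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_logp - current_logp))
        if rng.random() < accept_prob:
            current, current_logp = proposal, proposal_logp
        samples.append(current)  # a rejected move repeats the current value
    return samples

chain = metropolis(log_unnormalized_posterior, start=0.5, steps=20000, step_size=0.2)
kept = chain[2000:]  # discard an assumed burn-in period
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```

Only ratios of the target appear in the acceptance step, so P(data) cancels. The estimate can be checked against the exact Beta(8, 4) posterior mean of 8/12 that the conjugate case gives for the same data.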
Applications Across Disciplines
Bayesian methods now permeate scientific practice across fields:
- Medical diagnosis: Bayesian reasoning is essential for interpreting diagnostic tests; a test's sensitivity and specificity combine with disease prevalence (the prior) via Bayes' theorem to give the posterior probability of disease given a positive test result
- Clinical trials: Bayesian adaptive trial designs adjust sample allocation based on accumulating evidence, potentially reducing trial size and exposing fewer patients to inferior treatments
- Gravitational wave detection: LIGO data analysis uses Bayesian inference to estimate parameters of merging black holes or neutron stars from noisy signals
- Machine learning: Bayesian neural networks quantify prediction uncertainty; Gaussian processes provide a fully Bayesian approach to regression; variational inference enables scalable approximate Bayesian methods in large models
- Spam filtering: Naive Bayes classifiers update word-probability estimates as new spam examples are observed, adapting to changing spam patterns
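The medical-diagnosis item above reduces to a short calculation. The sketch below uses made-up numbers (1% prevalence, 95% sensitivity, 95% specificity) chosen only to show how strongly the prior matters; the function name is hypothetical.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_positive = sensitivity * prevalence                 # P(+ | disease) * P(disease)
    false_positive = (1 - specificity) * (1 - prevalence)    # P(+ | healthy) * P(healthy)
    return true_positive / (true_positive + false_positive)

# Assumed illustrative values, not taken from any real test.
ppv = posterior_given_positive(prevalence=0.01, sensitivity=0.95, specificity=0.95)
print(f"P(disease | positive) = {ppv:.3f}")
```

With these assumed inputs the posterior probability of disease is only about 16%, because false positives from the large healthy population outnumber true positives from the small affected one.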
Bayesian Model Comparison
Bayesian statistics provides a natural solution to model selection: the Bayes factor. For two competing models M₁ and M₂, the Bayes factor BF₁₂ = P(data | M₁) / P(data | M₂) quantifies how much more probable the observed data are under M₁ than under M₂. Each term is a marginal likelihood: the likelihood averaged over that model's prior on its parameters. Unlike a comparison of maximized likelihoods, which can never favor the simpler of two nested models, the Bayes factor automatically penalizes model complexity: a more complex model spreads its prior probability across a larger parameter space, so it must fit the data substantially better to come out ahead. This built-in Occam's Razor makes Bayesian model comparison a principled approach to choosing between competing scientific theories.
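A small worked case shows the Occam effect. Assume two hypothetical models for coin flips: M₁ fixes θ = 0.5, while M₂ gives θ a uniform prior on (0, 1). For k heads in n flips, the marginal likelihood under M₂ integrates to 1 / (n + 1), so the Bayes factor has a closed form. The models and counts are chosen for illustration only.

```python
import math

def marginal_fair(k, n):
    """P(data | M1): binomial probability with theta fixed at 0.5."""
    return math.comb(n, k) * 0.5**n

def marginal_uniform(k, n):
    """P(data | M2): binomial likelihood averaged over a Uniform(0, 1) prior."""
    return 1.0 / (n + 1)

def bayes_factor(k, n):
    """BF_12: evidence for the fixed-theta model relative to the flexible one."""
    return marginal_fair(k, n) / marginal_uniform(k, n)

bf_balanced = bayes_factor(5, 10)   # assumed data: 5 heads in 10 flips
bf_lopsided = bayes_factor(9, 10)   # assumed data: 9 heads in 10 flips
print(bf_balanced, bf_lopsided)
```

For 5 heads in 10 flips the Bayes factor is about 2.7 in favor of the simpler model, even though the flexible model can fit those data equally well at its best θ. For 9 heads in 10 it drops to about 0.11, so the data now favor M₂.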