Bayesian Statistics: Updating Beliefs with Evidence

Bayesian statistics treats probability as degrees of belief, updating prior beliefs with evidence via Bayes' theorem to produce posterior distributions used in science, AI, and medicine.

The InfoNexus Editorial Team | May 13, 2026 | 9 min read

Probability as a State of Knowledge

A drug company tests a new compound. The trial shows a positive result. How confident should regulators be that the drug actually works? Classical frequentist statistics gives one answer. Bayesian statistics gives a different, arguably more intuitive one — by explicitly incorporating prior knowledge about how often drug candidates actually work, and updating that knowledge with the trial evidence.

Bayesian statistics treats probability not as a long-run frequency of events but as a quantified degree of belief — a measure of how confident a reasoning agent is that something is true, given available information. This philosophical difference has profound practical consequences. It allows statisticians to make direct probability statements about hypotheses ("there is a 94% probability the drug is effective"), include prior information systematically, and update beliefs incrementally as new data arrives — in a framework grounded in mathematics rather than informal judgment.

The Bayesian Framework

The entire framework rests on one equation — Bayes' theorem — applied to the relationship between data and hypotheses:

P(θ | data) = P(data | θ) × P(θ) / P(data)

Where:

  • P(θ): Prior probability — beliefs about parameter θ before observing data; encodes existing knowledge or uncertainty
  • P(data | θ): Likelihood — the probability of observing the data given parameter θ; comes from the statistical model
  • P(data): Marginal likelihood (evidence) — a normalizing constant ensuring the posterior integrates to 1
  • P(θ | data): Posterior probability — updated beliefs about θ after incorporating the observed data; the goal of Bayesian analysis

The posterior combines prior and likelihood multiplicatively. A strong likelihood can overwhelm a diffuse prior. A strong prior can moderate an extreme likelihood. This balance mirrors how scientific knowledge accumulates: as data accumulate, posteriors become increasingly concentrated around the true parameter value, largely regardless of the initial prior, provided the prior assigns nonzero probability to that value and standard regularity conditions hold.
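The update described above can be sketched numerically with a grid approximation. This is a minimal illustration, not a production method: the binomial data (7 successes in 10 trials, then 70 in 100), the flat prior, and the helper names `grid_posterior` and `posterior_sd` are all assumptions made for the example.

```python
from math import comb

def grid_posterior(successes, trials, prior, grid):
    """Return normalized posterior weights over a discrete grid of theta values."""
    # Likelihood P(data | theta): binomial probability of the data at each theta.
    likelihood = [
        comb(trials, successes) * t**successes * (1 - t) ** (trials - successes)
        for t in grid
    ]
    # Numerator of Bayes' theorem: likelihood times prior.
    unnormalized = [lik * p for lik, p in zip(likelihood, prior)]
    # P(data), approximated by summing the numerator over the grid.
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]

def posterior_sd(weights, grid):
    """Standard deviation of theta under the discrete posterior."""
    mean = sum(w * t for w, t in zip(weights, grid))
    return sum(w * (t - mean) ** 2 for w, t in zip(weights, grid)) ** 0.5

grid = [i / 100 for i in range(1, 100)]      # theta = 0.01, ..., 0.99
flat_prior = [1 / len(grid)] * len(grid)     # assumed flat prior

small = grid_posterior(7, 10, flat_prior, grid)     # little data
large = grid_posterior(70, 100, flat_prior, grid)   # ten times more data

print(posterior_sd(small, grid), posterior_sd(large, grid))
```

Both posteriors peak near 0.7, but the one built from 100 trials is markedly narrower, which is the concentration effect the paragraph describes.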

Choosing Priors

The prior distribution is simultaneously the most powerful and most controversial aspect of Bayesian analysis. Critics argue it introduces subjectivity; proponents argue it provides a principled mechanism for incorporating genuine prior knowledge and that all statistical analyses make assumptions, whether explicit or not.

Prior Type | Description | When to Use
Informative prior | Reflects genuine prior knowledge (e.g., from previous studies) | When reliable prior information exists
Weakly informative prior | Provides some regularization without strongly constraining the posterior | Default choice in many practical analyses
Non-informative (flat) prior | Assigns roughly equal probability to all parameter values | When prior knowledge is genuinely absent
Conjugate prior | From a family where prior × likelihood yields a posterior in the same family | Analytical convenience; enables closed-form solutions
Jeffreys prior | Invariant to reparameterization; based on Fisher information | Objective Bayesian analysis

Conjugate priors simplify computation dramatically. For a binomial likelihood (counts of successes), the Beta distribution is the conjugate prior — yielding a Beta posterior. For a normal likelihood with known variance, the normal prior is conjugate. These analytically tractable cases allow exact Bayesian inference without numerical methods, though they apply only to a limited set of models.
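For the Beta-binomial case, the conjugate update reduces to adding counts to the prior's parameters. The sketch below assumes a Beta(2, 2) prior and an invented sequence of ten outcomes; the function names are hypothetical.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior parameters."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Assumed weakly informative prior centered on 0.5.
alpha0, beta0 = 2.0, 2.0

# Batch update: 7 successes and 3 failures observed at once.
alpha1, beta1 = beta_binomial_update(alpha0, beta0, successes=7, failures=3)

# Sequential update: the same ten outcomes, one observation at a time.
a, b = alpha0, beta0
for outcome in [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]:
    a, b = beta_binomial_update(a, b, outcome, 1 - outcome)

print((alpha1, beta1), (a, b), beta_mean(alpha1, beta1))
```

The batch and sequential routes end at the same Beta(9, 5) posterior, which illustrates why conjugate models suit incremental updating.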

Bayesian vs. Frequentist Inference

The two schools of statistical thought produce different answers to similar questions, and the differences matter in practice:

Concept | Frequentist | Bayesian
Probability | Long-run frequency of repeatable events | Degree of belief; applies to unique events
Parameter status | Fixed unknown constant | Random variable with a distribution
Uncertainty interval | Confidence interval: under repeated sampling, 95% of such intervals contain the true parameter | Credible interval: 95% posterior probability that the parameter lies in the interval
Hypothesis testing | p-value: probability of data at least this extreme under H₀ | Bayes factor: ratio of the marginal likelihoods of the data under H₁ and H₀
Prior information | Not formally incorporated | Explicitly incorporated via prior distribution

The frequentist confidence interval is routinely misinterpreted as a Bayesian credible interval. A 95% confidence interval does not mean there is a 95% probability the true parameter lies in that specific interval — it means the procedure, if repeated many times, would produce intervals containing the true parameter in 95% of cases. The Bayesian credible interval is the direct probability statement most users want: given the data and prior, there is a 95% posterior probability the parameter lies in this range.
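The "procedure" reading of a confidence interval can be checked by simulation. The sketch below assumes a normal population with known standard deviation and uses a z-interval; the true mean, sample size, repeat count, and function names are illustrative choices, not values from any study.

```python
import random
from statistics import NormalDist, mean

def z_interval(sample, sigma, level=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = mean(sample)
    half_width = z * sigma / len(sample) ** 0.5
    return center - half_width, center + half_width

def coverage(true_mean, sigma, n, repeats, seed=0):
    """Fraction of repeated experiments whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma)
        hits += low <= true_mean <= high
    return hits / repeats

rate = coverage(true_mean=5.0, sigma=2.0, n=25, repeats=4000)
print(rate)  # should land close to 0.95
```

The 95% figure describes the long-run hit rate of the procedure across repeated experiments. Any single computed interval either contains the true mean or does not.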

Markov Chain Monte Carlo: Making Bayesian Analysis Practical

For most models beyond the conjugate cases, the posterior cannot be computed analytically. The marginal likelihood P(data), the normalizing constant, requires integrating the likelihood times the prior over all possible parameter values, often a high-dimensional integral with no closed form.

Markov Chain Monte Carlo (MCMC) methods solve this computationally. Rather than computing the posterior distribution analytically, MCMC generates samples from it by constructing a Markov chain that has the posterior as its stationary distribution. As the chain runs, samples accumulate from the posterior, enabling estimation of any posterior quantity (means, credible intervals, predictive distributions) from the empirical distribution of samples.

  • Metropolis-Hastings algorithm (1953, 1970): The foundational MCMC method; proposes moves through parameter space and accepts or rejects each move probabilistically to ensure samples come from the correct target distribution
  • Gibbs sampling: A special case applicable when full conditional distributions (the distribution of each parameter given all others and the data) are available in closed form; samples each parameter in turn
  • Hamiltonian Monte Carlo (HMC): Uses gradients of the log posterior to simulate Hamiltonian dynamics, as in physical mechanics, and so proposes distant moves efficiently, dramatically reducing autocorrelation in chains; implemented in Stan and PyMC
  • No-U-Turn Sampler (NUTS): An adaptive extension of HMC that automatically tunes the length of each simulated trajectory; the default sampler in Stan
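The first method in the list can be sketched in a few lines. This is a random-walk Metropolis sampler, the symmetric-proposal special case of Metropolis-Hastings, and not how Stan or PyMC sample. The target is an assumed binomial posterior (7 successes, 3 failures, flat prior); the step size, chain length, and burn-in are arbitrary illustrative choices.

```python
import math
import random

def log_unnormalized_posterior(theta, successes=7, failures=3):
    """Log of likelihood x flat prior; P(data) is never needed."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior mass outside (0, 1)
    return successes * math.log(theta) + failures * math.log(1.0 - theta)

def metropolis(log_target, start, steps, step_size, seed=0):
    """Random-walk Metropolis: symmetric Gaussian proposals, accept/reject."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    samples = []
    for _ in range(steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)  # a rejected move repeats the current value
    return samples

chain = metropolis(log_unnormalized_posterior, start=0.5, steps=20000, step_size=0.2)
kept = chain[2000:]  # discard an assumed burn-in period
estimate = sum(kept) / len(kept)
print(estimate)  # the exact posterior mean for this target is 8/12
```

Only the ratio of target densities enters the acceptance step, so the normalizing constant cancels. That is why MCMC sidesteps the intractable integral.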

Applications Across Disciplines

Bayesian methods now permeate scientific practice across fields:

  • Medical diagnosis: Bayesian reasoning is essential for interpreting diagnostic tests; a test's sensitivity and specificity combine with disease prevalence (the prior) via Bayes' theorem to give the posterior probability of disease given a positive test result
  • Clinical trials: Bayesian adaptive trial designs adjust sample allocation based on accumulating evidence, potentially reducing trial size and exposing fewer patients to inferior treatments
  • Gravitational wave detection: LIGO data analysis uses Bayesian inference to estimate parameters of merging black holes or neutron stars from noisy signals
  • Machine learning: Bayesian neural networks quantify prediction uncertainty; Gaussian processes provide a fully Bayesian approach to regression; variational inference enables scalable approximate Bayesian methods in large models
  • Spam filtering: Naive Bayes classifiers update word-probability estimates as new spam examples are observed, adapting to changing spam patterns
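The medical diagnosis case in the list above is a direct application of Bayes' theorem. The following sketch uses made-up test characteristics (1% prevalence, 95% sensitivity, 90% specificity) purely for illustration; the function name is hypothetical.

```python
def posterior_disease_probability(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    # P(positive and disease) = P(positive | disease) * P(disease)
    true_positive = sensitivity * prevalence
    # P(positive and no disease) = P(positive | no disease) * P(no disease)
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    # Divide by P(positive), the sum of both routes to a positive result.
    return true_positive / (true_positive + false_positive)

p = posterior_disease_probability(prevalence=0.01, sensitivity=0.95, specificity=0.90)
print(round(p, 4))
```

With these assumed numbers the posterior probability of disease is under 9% despite the positive result, because false positives from the large healthy population outnumber true positives from the small diseased one.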

Bayesian Model Comparison

Bayesian statistics provides a natural solution to model selection: the Bayes factor. For two competing models M₁ and M₂, the Bayes factor BF₁₂ = P(data | M₁) / P(data | M₂) quantifies how much more probable the observed data are under M₁ than under M₂. Each term is a marginal likelihood, averaged over that model's prior rather than evaluated at its best-fitting parameters, so the Bayes factor penalizes model complexity automatically, without a separately added penalty term: a more complex model spreads its prior probability across a larger parameter space and must fit the data substantially better to compensate. This built-in Occam's Razor makes Bayesian model comparison a principled approach to choosing between competing scientific theories.
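A small closed-form case shows the complexity penalty at work. The sketch assumes two hypothetical models for coin-flip data: M₁ fixes the success probability at 0.5, while M₂ places a uniform prior on it, in which case the marginal likelihood of k successes in n trials is 1 / (n + 1). The data values are invented for illustration.

```python
from math import comb

def marginal_fixed(k, n, theta=0.5):
    """P(data | M1): binomial probability with theta fixed, no free parameter."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def marginal_uniform(k, n):
    """P(data | M2): binomial likelihood averaged over a Uniform(0, 1) prior."""
    return 1.0 / (n + 1)

def bayes_factor_12(k, n):
    """BF12 = P(data | M1) / P(data | M2); values above 1 favor M1."""
    return marginal_fixed(k, n) / marginal_uniform(k, n)

print(bayes_factor_12(52, 100))  # near-balanced data
print(bayes_factor_12(70, 100))  # clearly unbalanced data
```

For 52 successes in 100 trials the factor exceeds 1, favoring the simpler model even though M₂ could match the data at least as well at its best-fitting value. For 70 successes it falls well below 1, because the better fit available to M₂ outweighs the penalty for its spread-out prior.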
