What Is Bayesian Statistics: Prior Beliefs, Evidence, and Updating Probability

Bayesian statistics treats probability as a degree of belief that updates as new evidence arrives. Learn how Bayes' theorem, prior distributions, and posterior inference work in science, medicine, and machine learning.

The InfoNexus Editorial Team | May 15, 2026 | 10 min read

Probability as Belief

Bayesian statistics is a framework for inference in which probability represents a degree of belief or confidence rather than a long-run frequency of events. Named after Reverend Thomas Bayes, whose posthumously published 1763 essay introduced the foundational theorem, Bayesian statistics provides a mathematically coherent system for reasoning under uncertainty and updating beliefs as new evidence becomes available. This philosophical starting point distinguishes it fundamentally from frequentist statistics, the dominant paradigm in most introductory courses, in which probability is defined as the limiting frequency of an event in an infinitely repeated experiment.

The Bayesian perspective permits statements that frequentist statistics formally forbids. A Bayesian can say "there is a 73 percent probability that this drug is effective" or "I am 95 percent confident the true parameter lies between these values" — direct probability statements about hypotheses and parameters. A frequentist confidence interval carries a more convoluted interpretation: that if the experiment were repeated infinitely many times and the interval recalculated each time, 95 percent of such intervals would contain the true parameter. The Bayesian credible interval says simply that, given the data, there is a 95 percent probability the parameter is in the interval — the intuitive interpretation that most scientists mistakenly apply to frequentist confidence intervals.

Bayes' Theorem: The Engine of Inference

The mathematical foundation of Bayesian statistics is Bayes' theorem: P(H|E) = P(E|H) × P(H) / P(E). Here, H represents a hypothesis (or an unknown parameter value) and E represents observed evidence (data). P(H) is the prior probability — what we believed about H before seeing the evidence. P(E|H) is the likelihood — how probable the observed evidence is if H were true. P(H|E) is the posterior probability — our updated belief about H after incorporating the evidence. P(E) is the marginal likelihood or evidence — the total probability of the observed data averaged over all hypotheses.
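As a minimal sketch of the formula in action, the Python snippet below applies Bayes' theorem to a binary hypothesis: a patient either has a disease or does not, and the evidence is a positive test result. The function name and all the numbers (1 percent prevalence, 95 percent sensitivity, 5 percent false-positive rate) are hypothetical values chosen for illustration, not figures from any real test.

```python
def posterior(prior: float, likelihood: float, likelihood_if_false: float) -> float:
    """Return P(H|E) for a binary hypothesis H using Bayes' theorem."""
    # Marginal likelihood: P(E) = P(E|H) P(H) + P(E|not H) P(not H)
    evidence = likelihood * prior + likelihood_if_false * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only for illustration.
p_disease = 0.01            # prior P(H): prevalence of the disease
sensitivity = 0.95          # likelihood P(E|H): positive test given disease
false_positive_rate = 0.05  # P(E|not H): positive test given no disease

p_disease_given_positive = posterior(p_disease, sensitivity, false_positive_rate)
print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")  # 0.161
```

Even with a fairly accurate test, the posterior stays modest because the prior is low: most positive results come from the much larger group of healthy people.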

The theorem can be understood as a learning machine: start with a prior belief, observe evidence, compute how probable that evidence is under different hypotheses, and update the prior proportionally to arrive at a posterior. If the evidence strongly supports H, the posterior is much higher than the prior. If the evidence contradicts H, the posterior is much lower. With enough data, the influence of the prior typically fades: two analysts starting with very different priors will converge to nearly the same posterior as evidence accumulates, provided neither prior assigns zero probability to the truth. This behavior is closely related to the property known as posterior consistency.
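The convergence of two analysts can be illustrated with a small grid approximation. The sketch below assumes a coin-flip model with an unknown heads probability, an "optimist" prior proportional to p^8, a "skeptic" prior proportional to (1 - p)^8, and data in which 60 percent of flips are heads. These priors, the grid size, and the data are all hypothetical choices made for illustration.

```python
import math


def posterior_mean(log_prior, grid, heads, tails):
    """Posterior mean of the heads probability on a discrete grid."""
    # Work in log space so long runs of flips do not underflow.
    log_post = [
        lp + heads * math.log(p) + tails * math.log(1.0 - p)
        for lp, p in zip(log_prior, grid)
    ]
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]
    total = sum(weights)
    return sum(p * w for p, w in zip(grid, weights)) / total


grid = [i / 1000 for i in range(1, 1000)]        # candidate values of p
optimist = [8 * math.log(p) for p in grid]       # prior favoring high p
skeptic = [8 * math.log(1.0 - p) for p in grid]  # prior favoring low p

gaps = []
for n in (10, 100, 1000):
    heads = 6 * n // 10  # 60 percent heads in every data set
    tails = n - heads
    gap = posterior_mean(optimist, grid, heads, tails) - posterior_mean(
        skeptic, grid, heads, tails
    )
    gaps.append(gap)
    print(f"n = {n:4d}: gap between posterior means = {gap:.4f}")
```

The gap between the two posterior means shrinks as the number of flips grows, which is the convergence described above.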

Priors: Encoding Background Knowledge

The prior distribution — P(H) in Bayes' theorem — encodes what is known or believed about a parameter or hypothesis before collecting data. Choosing the prior is one of the most discussed and sometimes controversial aspects of Bayesian statistics. Informative priors encode genuine prior knowledge: a clinical researcher analyzing a new drug might use results from earlier trials as the prior, substantially reducing the amount of data needed to reach confident conclusions. This ability to formally incorporate prior knowledge is a major advantage of Bayesian methods in fields like medicine and engineering where substantial background information exists.

When prior information is absent or investigators want results to be as data-driven as possible, weakly informative or non-informative priors are used. Jeffreys' prior is a principled approach to constructing non-informative priors that are invariant to reparameterization — so the conclusions do not change artificially if you restate the problem in different mathematical terms. In practice, weakly informative priors that mildly constrain parameters to plausible ranges — for example, specifying that a regression coefficient is unlikely to be astronomically large — are widely used in modern applied Bayesian statistics because they provide regularization without strong assumptions. The debate over prior choice has moderated as practitioners recognized that with sufficient data, reasonable alternative priors lead to similar posteriors, and that the transparency of stating one's assumptions explicitly is a feature rather than a bug.

Posterior Distributions and Credible Intervals

The output of Bayesian inference is not a single point estimate but a full probability distribution over possible parameter values, called the posterior distribution. This distribution captures both a central estimate (such as the posterior mode, mean, or median) and the uncertainty around that estimate. A 95 percent credible interval is a range containing 95 percent of the posterior probability mass, and it carries exactly the intuitive interpretation: there is a 95 percent probability the true parameter lies in this range, given the data and the prior.
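One common way to compute a credible interval is to draw samples from the posterior and read off its quantiles. The sketch below assumes a hypothetical Beta(19, 33) posterior for a success rate (the result of a uniform prior and 18 successes in 50 trials) and reports an equal-tailed 95 percent interval. The helper name, the seed, and the number of draws are illustrative assumptions.

```python
import random


def equal_tailed_interval(draws, mass=0.95):
    """Return the central interval holding `mass` of the sampled values."""
    ordered = sorted(draws)
    tail = (1.0 - mass) / 2.0
    lower_index = int(tail * len(ordered))
    upper_index = int((1.0 - tail) * len(ordered)) - 1
    return ordered[lower_index], ordered[upper_index]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical posterior: Beta(1 + 18, 1 + 32) after 18 successes in 50 trials.
draws = [rng.betavariate(19, 33) for _ in range(20000)]

lower, upper = equal_tailed_interval(draws)
posterior_mean = sum(draws) / len(draws)
print(f"posterior mean ~ {posterior_mean:.3f}")
print(f"95% credible interval ~ ({lower:.3f}, {upper:.3f})")
```

An equal-tailed interval is only one convention; a highest-density interval, which collects the most probable values first, can differ when the posterior is skewed.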

Posterior distributions can be combined and propagated through subsequent analyses in ways that frequentist estimates cannot. If you have posterior distributions for multiple parameters, you can derive the posterior for any function of those parameters. For example, you can compute the posterior probability that drug A is better than drug B by counting the fraction of posterior draws where A's parameter exceeds B's. The same logic extends to prediction: a Bayesian posterior predictive distribution gives a full probability distribution over future observations, incorporating both parameter uncertainty and sampling variability.
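The drug comparison described above can be sketched in a few lines. The example assumes hypothetical trial results (45 of 100 responders on drug A, 36 of 100 on drug B) and uniform Beta(1, 1) priors, so each response rate has a Beta posterior that can be sampled directly. All names and numbers are illustrative.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_draws = 20000

# Hypothetical data with uniform priors:
# drug A: 45 responders, 55 non-responders -> Beta(46, 56) posterior
# drug B: 36 responders, 64 non-responders -> Beta(37, 65) posterior
draws_a = [rng.betavariate(46, 56) for _ in range(n_draws)]
draws_b = [rng.betavariate(37, 65) for _ in range(n_draws)]

# Fraction of paired posterior draws in which A's response rate exceeds B's.
prob_a_better = sum(a > b for a, b in zip(draws_a, draws_b)) / n_draws
print(f"P(A better than B | data) ~ {prob_a_better:.2f}")
```

The same draws could be reused for other derived quantities, such as the posterior of the difference or ratio of the two response rates.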

Computational Methods: MCMC and Variational Inference

For simple models, the posterior can sometimes be computed analytically — particularly when prior and likelihood are conjugate pairs that produce a posterior in the same family as the prior (the Beta-Binomial and Normal-Normal models are classic examples). But most real-world models produce posteriors that have no closed-form expression. For decades this computational barrier limited Bayesian methods to simple models. The revolution in computational statistics that began in the late 1980s changed this dramatically.
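The Beta-Binomial case shows why conjugacy is convenient: updating reduces to adding counts. The sketch below assumes a Beta(2, 2) prior on a success probability and hypothetical data of 7 successes and 3 failures. The function names are illustrative.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: a Beta prior plus binomial data gives a Beta posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior and data.
prior = (2, 2)
post = beta_binomial_update(*prior, successes=7, failures=3)
print(post, round(beta_mean(*post), 3))  # (9, 5) 0.643

# Updating in two batches gives the same posterior as one combined batch.
step_one = beta_binomial_update(*prior, successes=4, failures=1)
step_two = beta_binomial_update(*step_one, successes=3, failures=2)
print(step_two == post)  # True
```

The second part of the sketch reflects the sequential nature of Bayesian updating: yesterday's posterior serves as today's prior.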

Markov Chain Monte Carlo (MCMC) methods generate samples from the posterior distribution by constructing a Markov chain that converges to the target distribution. The Metropolis-Hastings algorithm, Gibbs sampling, and the No-U-Turn Sampler (NUTS) — implemented in probabilistic programming languages like Stan, PyMC, and NumPyro — have made complex Bayesian models accessible to applied researchers across disciplines. Variational inference offers a faster but approximate alternative: instead of sampling, it fits a parameterized family of distributions to approximate the posterior, enabling scalable Bayesian inference in machine learning applications. Modern Bayesian deep learning uses variational methods to place posterior distributions over neural network weights, enabling uncertainty quantification in predictions — an increasingly important capability in high-stakes applications like medical diagnosis and autonomous driving.
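To make the idea of MCMC concrete, here is a minimal random-walk Metropolis sampler, the symmetric-proposal special case of Metropolis-Hastings. It is a teaching sketch rather than how Stan, PyMC, or NumPyro work internally. The target is a hypothetical coin-bias posterior (a Beta(2, 2) prior with 7 heads and 3 tails), chosen because the exact answer, Beta(9, 5) with mean about 0.643, is known and can be compared with the samples. The step size, seed, and burn-in length are arbitrary assumptions.

```python
import math
import random


def log_unnormalized_posterior(p, heads=7, tails=3):
    """Log of prior times likelihood, up to an additive constant."""
    if not 0.0 < p < 1.0:
        return -math.inf  # zero probability outside the unit interval
    # Beta(2, 2) prior contributes log p + log(1 - p).
    return (heads + 1) * math.log(p) + (tails + 1) * math.log(1.0 - p)


def metropolis(log_target, start, steps, step_size, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    samples = []
    current, current_lp = start, log_target(start)
    for _ in range(steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)  # a rejected proposal repeats the current value
    return samples


rng = random.Random(1)
chain = metropolis(log_unnormalized_posterior, 0.5, 20000, 0.2, rng)
kept = chain[2000:]  # discard early draws as burn-in
mcmc_mean = sum(kept) / len(kept)
print(f"MCMC posterior mean ~ {mcmc_mean:.3f} (exact value: {9 / 14:.3f})")
```

Note that the sampler only needs the posterior up to a constant, so the marginal likelihood P(E) never has to be computed. That is what makes MCMC practical for models without closed-form posteriors.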

Bayesian Methods in Practice

Bayesian statistics has become increasingly central across many fields. In medicine, adaptive clinical trial designs use Bayesian updating to allocate more patients to more promising treatments as the trial progresses, improving efficiency and ethics compared to fixed classical designs. The FDA has issued guidance supporting Bayesian methods for medical device trials. Pharmaceutical companies routinely use Bayesian analysis to synthesize evidence across multiple studies and compute the probability that a drug meets efficacy thresholds.

In machine learning, Bayesian optimization efficiently searches hyperparameter spaces by maintaining a probabilistic model of the objective function and choosing each new experiment to be as informative as possible about the global optimum, the approach underlying state-of-the-art hyperparameter tuning systems. Gaussian processes, a Bayesian nonparametric method, model complex functions with calibrated uncertainty and have applications in spatial statistics, computer vision, and scientific emulation. In everyday applications, Bayesian spam filters estimate the probability that a message is spam from the words it contains and revise those estimates as users label emails. The versatility of Bayesian reasoning (its ability to incorporate prior knowledge, quantify uncertainty, and update beliefs rationally as evidence accumulates) has made it one of the foundational frameworks of modern data science.
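The spam-filter idea can be sketched as a tiny naive Bayes classifier. The class below is a hypothetical illustration, not the design of any particular mail system: it assumes words are conditionally independent given the label, uses add-one smoothing, and ignores words it has never seen. The training messages are invented.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy word-count spam filter with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def label(self, text, is_spam):
        """Record a user-labeled message, updating the stored counts."""
        key = "spam" if is_spam else "ham"
        self.message_counts[key] += 1
        self.word_counts[key].update(text.lower().split())

    def spam_probability(self, text):
        """Posterior P(spam | words) under the naive independence assumption."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for key in ("spam", "ham"):
            # Smoothed prior for the class.
            score = math.log((self.message_counts[key] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[key].values())
            for word in text.lower().split():
                if word in vocab:  # skip words never seen in training
                    count = self.word_counts[key][word]
                    score += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[key] = score
        peak = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - peak)
        ham = math.exp(log_scores["ham"] - peak)
        return spam / (spam + ham)


spam_filter = NaiveBayesSpamFilter()
spam_filter.label("win free prize now", is_spam=True)
spam_filter.label("free money offer", is_spam=True)
spam_filter.label("meeting agenda for monday", is_spam=False)
spam_filter.label("lunch on monday", is_spam=False)

before = spam_filter.spam_probability("offer")
# The user marks two legitimate messages that mention "offer".
spam_filter.label("project offer letter", is_spam=False)
spam_filter.label("project offer letter", is_spam=False)
after = spam_filter.spam_probability("offer")
print(f"P(spam | 'offer') before: {before:.2f}, after: {after:.2f}")
```

Each new label changes the counts, so the probability attached to a word such as "offer" shifts with the user's feedback.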
