The Scientific Method: Falsifiability, Hypotheses, and the Replication Crisis

Popper's falsifiability criterion, the hypothetico-deductive method, the 2015 Open Science Collaboration replication study (36% replicated), p-hacking, and how science self-corrects.

The InfoNexus Editorial Team | May 24, 2026 | 9 min read

Popper's Radical Criterion

In 1934, Karl Popper published Logik der Forschung (translated into English as The Logic of Scientific Discovery in 1959) and proposed a deceptively simple criterion for distinguishing science from non-science: a theory is scientific if and only if it is falsifiable — if there exists, at least in principle, an observation that could prove it wrong. Freudian psychoanalysis, Popper argued, was not scientific because its practitioners could explain any human behavior as confirmation of the theory; no observation could refute it. Newtonian gravity, by contrast, made precise numerical predictions that would be falsified if planets deviated from calculated orbits. The difference was not that gravity was "proven true" — Popper explicitly rejected the possibility of final proof — but that it was permanently at risk of refutation.

This asymmetry between verification and falsification is the logical heart of Popper's position. No finite number of confirming observations can prove a universal theory true (a billion white swans don't prove all swans are white), but a single genuine counter-instance proves it false (one black swan disproves it). Science progresses not by accumulating proof but by surviving repeated attempts at refutation.

The Hypothetico-Deductive Method in Practice

The standard account of scientific practice involves:

  1. Observation of a phenomenon requiring explanation
  2. Formation of a hypothesis that could explain the phenomenon
  3. Deduction of testable predictions from the hypothesis
  4. Experimental design to test those predictions under controlled conditions
  5. Data collection and statistical analysis
  6. Rejection of, or failure to reject, the null hypothesis
  7. Replication by independent researchers

The null hypothesis (H0) states that there is no effect, no relationship, no difference. The researcher tries to reject H0 by demonstrating that the observed data would be very unlikely to occur if H0 were true. The p-value is the probability of observing data at least as extreme as those measured, assuming the null hypothesis is true. The threshold for rejection, called the significance level (α), is conventionally set at 0.05: a result is considered statistically significant if its p-value falls below 0.05. A p-value is not the probability that the null hypothesis itself is true.
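The definition of a p-value can be made concrete with a permutation test, which simulates the null hypothesis directly by shuffling group labels. The sketch below is illustrative only: the function name, the blood-pressure numbers, and the add-one correction are assumptions chosen for the example, not part of any study discussed here.

```python
import random
from statistics import fmean


def permutation_p_value(treatment, control, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Under H0 the group labels are interchangeable, so we shuffle them and
    count how often the shuffled difference is at least as extreme as the
    observed one.
    """
    rng = random.Random(seed)
    observed = abs(fmean(treatment) - fmean(control))
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_treat]) - fmean(pooled[n_treat:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction (a common convention) so the estimate is never 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical changes in systolic blood pressure (mmHg) after treatment.
    drug = [-12, -9, -15, -8, -11, -14, -10, -13]
    placebo = [-2, 1, -3, 0, -1, 2, -4, 1]
    p = permutation_p_value(drug, placebo)
    print(f"p-value: {p:.4f}; significant at alpha = 0.05: {p < 0.05}")
```

A permutation test is used here instead of a t-test only because it needs no distributional formulas; the p-value it returns has the same interpretation.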

Step        | Example (Drug Trial)                                       | Key Tool
Hypothesis  | "Drug X reduces blood pressure"                            | Theory + prior literature
Prediction  | "Treatment group will have 10 mmHg lower BP than placebo"  | Prior effect size estimates
Experiment  | Randomized controlled trial, double-blind                  | Random assignment, blinding
Analysis    | t-test comparing group means                               | Statistical significance (p < 0.05)
Replication | Independent labs repeat the trial                          | Pre-registration, open data

The Replication Crisis

In 2015, the Open Science Collaboration published a landmark study in Science: 270 researchers had attempted to replicate 100 published psychology studies. Only 36% of the replications produced a statistically significant result in the same direction as the original. The average effect size in replications was about half that of the originals. The study sent shockwaves through social psychology, behavioral economics, and medicine.

The causes were multiple and are now well-documented:

  • Publication bias: Academic journals preferentially publish positive results (p < 0.05) and reject null results, creating a literature that overstates the reliability of findings
  • P-hacking (or "fishing"): Researchers who collect data and then try multiple statistical approaches, subgroup analyses, or variable combinations until p < 0.05 is achieved — without pre-registering the analysis plan — inflate the false positive rate far above 5%
  • HARKing: Hypothesizing After Results are Known — presenting post-hoc interpretations as if they were pre-specified predictions
  • Small sample sizes: Studies with 20–40 participants often have low statistical power; genuine effects are frequently missed, and the effects that do reach significance tend to be overestimated
  • Researcher degrees of freedom: Countless small decisions in data collection, processing, and analysis (called the "garden of forking paths" by statistician Andrew Gelman) allow motivated reasoning to influence results without overt fraud
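The interaction between small samples and publication bias listed above can be illustrated with a simulation. This is a minimal sketch under stated assumptions: the true effect of 0.3 standard deviations, the 20 participants per group, the known unit variance, and the function name are all hypothetical, and "published" simply means "reached p < 0.05 on a two-sided z-test".

```python
import random
from statistics import NormalDist, fmean


def simulate_published_effects(true_effect=0.3, n_per_group=20,
                               n_studies=5000, alpha=0.05, seed=1):
    """Simulate many small two-group studies and 'publish' only the
    statistically significant ones.

    Returns (mean effect across all studies,
             mean effect across published studies,
             share of studies that reached significance).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of a difference in means, assuming a known SD of 1.
    std_err = (2 / n_per_group) ** 0.5

    all_effects, published = [], []
    for _ in range(n_studies):
        treat = fmean(rng.gauss(true_effect, 1) for _ in range(n_per_group))
        ctrl = fmean(rng.gauss(0, 1) for _ in range(n_per_group))
        diff = treat - ctrl
        all_effects.append(diff)
        if abs(diff) / std_err > z_crit:  # two-sided z-test
            published.append(diff)

    # With the default settings some studies always reach significance;
    # fmean would raise an error if `published` were empty.
    return fmean(all_effects), fmean(published), len(published) / n_studies


if __name__ == "__main__":
    overall, pub, share = simulate_published_effects()
    print(f"mean effect, all studies:       {overall:.2f}")
    print(f"mean effect, published studies: {pub:.2f}")
    print(f"share reaching significance:    {share:.0%}")
```

Because only the studies with unusually large observed differences clear the significance bar, the "published" average overstates the true effect, which is one way a literature can end up with effect sizes that shrink on replication.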

P-Hacking: A Quantitative Illustration

If a researcher runs 20 independent statistical tests on data containing no real effects, each at the α = 0.05 significance level, the probability of obtaining at least one "significant" result by chance is 1 − (0.95)^20 ≈ 64%. The family-wise error rate (the chance of at least one false positive across a collection of tests) climbs rapidly with the number of comparisons. Corrections for multiple comparisons keep false positives in check: the Bonferroni correction controls the family-wise error rate, while the Benjamini-Hochberg procedure controls the false discovery rate (the expected share of false positives among the results declared significant). Both also reduce statistical power, making genuine effects harder to detect.
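The arithmetic and the two corrections named above can be sketched in a few lines. The function names and the list of p-values are hypothetical, chosen only to show how the procedures differ; this is an illustration, not a substitute for a statistics library.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni: compare each p-value against alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg: find the largest rank k such that the k-th
    smallest p-value is <= (k / m) * q, then reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    print(f"FWER for 20 tests: {family_wise_error_rate(20):.1%}")

    # Hypothetical p-values from ten tests.
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                0.060, 0.074, 0.205, 0.212, 0.216]
    print("uncorrected:", sum(p < 0.05 for p in p_values), "significant")
    print("Bonferroni: ", sum(bonferroni_reject(p_values)), "significant")
    print("BH:         ", sum(benjamini_hochberg_reject(p_values)), "significant")
```

On these illustrative inputs, Benjamini-Hochberg rejects more hypotheses than Bonferroni, reflecting that it controls a less strict error rate.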

The journalist John Bohannon demonstrated p-hacking's dangers in 2015 by running a real but deliberately underpowered clinical trial on chocolate and weight loss, obtaining a p < 0.05 result by measuring many different outcomes, and getting the paper published in a peer-reviewed journal. Media coverage of his "chocolate helps you lose weight" finding reached more than 20 countries. The paper was a deliberate hoax designed to expose the problem, but the methods it used were common in published research.

Corrective Measures: Pre-registration and Open Science

The scientific community's response has been substantial. Pre-registration means depositing the specific hypothesis, sample size calculation, and analysis plan in a public registry (OSF.io, ClinicalTrials.gov, AsPredicted.org) before data collection begins. Because the plan is time-stamped and public, it becomes much harder to present post-hoc findings as predictions. Registered Reports, offered by over 300 journals, accept papers for publication based on the introduction and methods section before results are known, so the decision to publish cannot depend on whether the result turned out to be significant. This greatly reduces publication bias for the studies that use the format.
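The sample size calculation that a pre-registration asks for can be approximated with a standard formula for comparing two group means: n per group = 2 × ((z_(1−α/2) + z_(power)) × σ / δ)², where δ is the smallest difference worth detecting and σ is the outcome's standard deviation. The sketch below uses the normal approximation; the function name and the 15 mmHg standard deviation are assumptions made for illustration, and a real trial would typically rely on dedicated power-analysis software.

```python
import math
from statistics import NormalDist


def sample_size_per_group(min_difference, std_dev, alpha=0.05, power=0.80):
    """Approximate participants needed per group for a two-sided,
    two-sample comparison of means (normal approximation).

    min_difference: smallest true difference worth detecting (delta)
    std_dev:        assumed standard deviation of the outcome (sigma)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * std_dev / min_difference) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Detect a 10 mmHg drop, assuming (hypothetically) an SD of 15 mmHg.
    print(sample_size_per_group(min_difference=10, std_dev=15))
    # Halving the detectable difference roughly quadruples the sample.
    print(sample_size_per_group(min_difference=5, std_dev=15))
```

Fixing this number before data collection removes one researcher degree of freedom: the option to keep recruiting participants until the result crosses the significance threshold.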

Open data mandates (requiring raw data to be publicly accessible), open materials (sharing stimuli and code), and adversarial collaboration (where researchers who disagree on a finding design a joint study to resolve it) have shifted research norms rapidly since 2015. The replication crisis, despite its name, may be better understood as science's self-corrective mechanism at work: the field exposing and addressing its own systematic errors.

scientific method, philosophy of science, epistemology
