The Scientific Method: Falsifiability, Hypotheses, and the Replication Crisis

Popper's Radical Criterion

In 1934, Karl Popper published Logik der Forschung (translated into English as The Logic of Scientific Discovery in 1959) and proposed a deceptively simple criterion for distinguishing science from non-science: a theory is scientific if and only if it is falsifiable — if there exists, at least in principle, an observation that could prove it wrong. Freudian psychoanalysis, Popper argued, was not scientific because its practitioners could explain any human behavior as confirmation of the theory; no observation could refute it. Newtonian gravity, by contrast, made precise numerical predictions that would be falsified if planets deviated from calculated orbits. The difference was not that gravity was "proven true" — Popper explicitly rejected the possibility of final proof — but that it was permanently at risk of refutation.

This asymmetry between verification and falsification is the logical heart of Popper's position. No finite number of confirming observations can prove a universal theory true (a billion white swans don't prove all swans are white), but a single genuine counter-instance proves it false (one black swan disproves it). Science progresses not by accumulating proof but by surviving repeated attempts at refutation.

The Hypothetico-Deductive Method in Practice

The standard account of scientific practice involves:

Observation of a phenomenon requiring explanation
Formation of a hypothesis that could explain the phenomenon
Deduction of testable predictions from the hypothesis
Experimental design to test those predictions under controlled conditions
Data collection and statistical analysis
Rejection of the null hypothesis, or failure to reject it
Replication by independent researchers

The null hypothesis (H0) states that there is no effect, no relationship, no difference. The researcher tries to reject H0 by demonstrating that the observed data would be extremely unlikely to occur if H0 were true. That probability is the p-value: the chance of observing data at least as extreme as those measured, assuming the null hypothesis is true. The threshold for rejection, the significance level (α), is conventionally set at 0.05, so a result is considered statistically significant when p < 0.05. The p-value is not the probability that H0 itself is true.
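One way to make the p-value concrete is a permutation test, sketched below using only the Python standard library. The function name `permutation_p_value`, the blood-pressure numbers, and the choice of a permutation test (rather than the t-test named in the table) are all assumptions made for illustration. If H0 is true the group labels carry no information, so shuffling them shows how often chance alone produces a difference at least as large as the observed one.

```python
import random
from statistics import mean

def permutation_p_value(treatment, control, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Under H0 the group labels are exchangeable, so we reshuffle them and
    count how often the shuffled difference is at least as extreme as
    the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treat]) - mean(pooled[n_treat:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

ALPHA = 0.05
# Hypothetical blood-pressure changes (mmHg), invented for illustration only.
treatment = [-12, -9, -15, -7, -11, -14, -8, -10]
control = [-2, 1, -4, 0, -3, 2, -1, -5]
p = permutation_p_value(treatment, control)
print(f"p = {p:.4f}; reject H0 at alpha = {ALPHA}: {p < ALPHA}")
```

The comparison `p < ALPHA` is the entire decision rule; everything before it only estimates how surprising the data would be if H0 were true.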

Step	Example (Drug Trial)	Key Tool
Hypothesis	"Drug X reduces blood pressure"	Theory + prior literature
Prediction	"Treatment group will have 10 mmHg lower BP than placebo"	Prior effect size estimates
Experiment	Randomized controlled trial, double-blind	Random assignment, blinding
Analysis	t-test comparing group means	Statistical significance (p < 0.05)
Replication	Independent labs repeat the trial	Pre-registration, open data
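
The prediction row above implies a planning question: how many participants would a trial need to detect a 10 mmHg difference? The sketch below uses the common normal-approximation formula for comparing two means. The function name `sample_size_per_group`, the 80% power target, and the 15 mmHg standard deviation are assumptions chosen for illustration, not values from any real trial.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided, two-sample comparison of means.

    Uses the normal-approximation formula
        n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2
    and assumes equal group sizes and a common, known standard deviation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)

# delta = 10 mmHg comes from the prediction row; sigma = 15 mmHg is an
# assumed value chosen only for illustration.
n = sample_size_per_group(delta=10, sigma=15)
print(f"Participants needed per group: {n}")
```

Because the required sample size scales with 1/delta², halving the expected effect roughly quadruples the number of participants, which is why overly optimistic effect size estimates lead to underpowered studies.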

The Replication Crisis

In 2015, the Open Science Collaboration published a landmark study in Science: 270 researchers had attempted to replicate 100 published psychology studies. Only 36% of the replications produced a statistically significant result in the same direction as the original. The average effect size in replications was about half that of the originals. The study sent shockwaves through social psychology, behavioral economics, and medicine.

The causes were multiple and are now well-documented:

Publication bias: Academic journals preferentially publish positive results (p < 0.05) and reject null results, creating a literature that overstates the reliability of findings
P-hacking (or "fishing"): Researchers who collect data and then try multiple statistical approaches, subgroup analyses, or variable combinations until p < 0.05 is achieved — without pre-registering the analysis plan — inflate the false positive rate far above 5%
HARKing: Hypothesizing After Results are Known — presenting post-hoc interpretations as if they were pre-specified predictions
Small sample sizes: Studies with 20–40 participants have low statistical power; genuine effects may be missed, and the results that do reach significance are more likely to be false positives or to overstate the true effect
Researcher degrees of freedom: Countless small decisions in data collection, processing, and analysis (called the "garden of forking paths" by statistician Andrew Gelman) allow motivated reasoning to influence results without overt fraud
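
A small simulation can illustrate how two of the causes above, small samples and publication bias, combine to inflate published effect sizes. This is a sketch under stated assumptions: the true effect of 0.2 standard deviations, the 20 participants per group, the known standard deviation of 1, and the function name `simulate_literature` are all invented for illustration.

```python
import random
from statistics import NormalDist, fmean

def simulate_literature(true_effect=0.2, n_per_group=20, n_studies=2000,
                        alpha=0.05, seed=1):
    """Simulate many small two-group studies of one true effect.

    Outcomes are normal with known SD = 1, so each study uses a
    two-sided z-test. Returns every estimated effect and the subset that
    reached significance (the ones a biased literature would publish).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n_per_group) ** 0.5  # standard error of the mean difference
    all_effects, published = [], []
    for _ in range(n_studies):
        treat = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        ctrl = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = fmean(treat) - fmean(ctrl)
        all_effects.append(diff)
        if abs(diff) / se > z_crit:
            published.append(diff)
    return all_effects, published

all_effects, published = simulate_literature()
print("true effect:                 0.20")
print(f"mean estimate, all studies:  {fmean(all_effects):.2f}")
print(f"share reaching p < 0.05:     {len(published) / len(all_effects):.0%}")
print(f"mean |estimate|, published:  {fmean(abs(d) for d in published):.2f}")
```

With these settings a study can only reach significance if its estimate exceeds roughly 0.62, about three times the true effect, so a literature made up only of significant results overstates the effect by construction.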

P-Hacking: A Quantitative Illustration

If a researcher runs 20 independent statistical tests on unrelated data, each at the α = 0.05 threshold, the probability of obtaining at least one "significant" result by chance is 1 − (0.95)^20 ≈ 64%. The family-wise error rate (the chance of at least one false positive across a collection of tests) climbs rapidly with the number of comparisons. Correcting for multiple comparisons keeps false positives in check: the Bonferroni correction controls the family-wise error rate, while the Benjamini-Hochberg procedure controls the false discovery rate, the expected share of false positives among the results declared significant. Both corrections reduce statistical power, making genuine effects harder to detect.
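
The arithmetic above, together with the two corrections named, can be sketched in a few lines of Python. The function names and the example p-values are hypothetical, chosen for illustration. Note that Bonferroni targets the family-wise error rate, whereas Benjamini-Hochberg targets the false discovery rate, which is why it is the less conservative of the two.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: find the largest rank k with p_(k) <= (k / m) * alpha,
    then reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

for k in (1, 5, 20, 100):
    print(f"{k:>3} tests -> FWER = {family_wise_error_rate(k):.1%}")

# Hypothetical p-values, invented for illustration only.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these example values Bonferroni rejects two hypotheses and Benjamini-Hochberg rejects three, while the uncorrected p = 0.041 survives neither correction.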

The journalist John Bohannon demonstrated p-hacking's dangers in 2015 by running a real but deliberately underpowered clinical trial on chocolate and weight loss, obtaining a result with p < 0.05 by measuring many outcomes, and getting the paper published in a peer-reviewed journal. Media coverage of his "chocolate helps you lose weight" finding reached 20+ countries. The paper was a deliberate hoax to expose the problem, but the methods it used were standard practice in published research.

Corrective Measures: Pre-registration and Open Science

The scientific community's response has been substantial. Pre-registration means depositing the specific hypothesis, sample size calculation, and analysis plan in a public registry (OSF.io, ClinicalTrials.gov, AsPredicted.org) before data collection begins; it removes the researcher's ability to present post-hoc findings as predictions. Registered Reports, offered by over 300 journals, accept papers for publication based on the introduction and methods section before results are known, so the decision to publish cannot depend on whether the result is positive. This removes the main source of publication bias for those papers, though not for the literature as a whole.

Open data mandates (requiring raw data to be publicly accessible), open materials (sharing stimuli and code), and adversarial collaboration (where researchers who disagree on a finding design a joint study to resolve it) have transformed norms rapidly since 2015. The replication crisis, despite its name, may be better understood as science's self-correcting mechanism at work: the field exposing and addressing its own systematic errors.
