The Scientific Method: Falsifiability, Hypotheses, and the Replication Crisis
Popper's falsifiability criterion, the hypothetico-deductive method, the 2015 Open Science Collaboration replication study (36% replicated), p-hacking, and how science self-corrects.
Popper's Radical Criterion
In 1934, Karl Popper published Logik der Forschung (translated into English as The Logic of Scientific Discovery in 1959) and proposed a deceptively simple criterion for distinguishing science from non-science: a theory is scientific if and only if it is falsifiable — if there exists, at least in principle, an observation that could prove it wrong. Freudian psychoanalysis, Popper argued, was not scientific because its practitioners could explain any human behavior as confirmation of the theory; no observation could refute it. Newtonian gravity, by contrast, made precise numerical predictions that would be falsified if planets deviated from calculated orbits. The difference was not that gravity was "proven true" — Popper explicitly rejected the possibility of final proof — but that it was permanently at risk of refutation.
This asymmetry between verification and falsification is the logical heart of Popper's position. No finite number of confirming observations can prove a universal theory true (a billion white swans don't prove all swans are white), but a single genuine counter-instance proves it false (one black swan disproves it). Science progresses not by accumulating proof but by surviving repeated attempts at refutation.
The Hypothetico-Deductive Method in Practice
The standard account of scientific practice involves:
- Observation of a phenomenon requiring explanation
- Formation of a hypothesis that could explain the phenomenon
- Deduction of testable predictions from the hypothesis
- Experimental design to test those predictions under controlled conditions
- Data collection and statistical analysis
- Decision to reject, or fail to reject, the null hypothesis
- Replication by independent researchers
The null hypothesis (H0) states that there is no effect, no relationship, no difference. The researcher tries to reject H0 by showing that data at least as extreme as those observed would be very unlikely if H0 were true. That probability is the p-value. The threshold for rejection, called the significance level (α), is conventionally set at 0.05: a result is considered statistically significant if its p-value is below 0.05. Note that the p-value is not the probability that H0 is true; it describes the data under the assumption that H0 holds.
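As a concrete illustration of the p-value logic, the sketch below computes an exact two-sided p-value for a coin-flip experiment. The scenario (60 heads in 100 flips), the function name `binomial_two_sided_p`, and the choice of a two-sided test are assumptions made for illustration; they do not come from any study discussed here.

```python
from math import comb

ALPHA = 0.05  # conventional significance level


def binomial_two_sided_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value for a binomial outcome.

    Sums the probability, under H0, of every outcome that is no more
    likely than the one actually observed.
    """
    def pmf(k: int) -> float:
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # The small tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical experiment: H0 says the coin is fair (p = 0.5).
p_value = binomial_two_sided_p(successes=60, trials=100)
significant = p_value < ALPHA
print(f"p-value = {p_value:.4f}, reject H0 at alpha={ALPHA}: {significant}")
```

In this hypothetical case, 60 heads out of 100 falls just short of the conventional cutoff, which shows how a result that looks lopsided can still be compatible with H0.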
| Step | Example (Drug Trial) | Key Tool |
|---|---|---|
| Hypothesis | "Drug X reduces blood pressure" | Theory + prior literature |
| Prediction | "Treatment group will have 10 mmHg lower BP than placebo" | Prior effect size estimates |
| Experiment | Randomized controlled trial, double-blind | Random assignment, blinding |
| Analysis | t-test comparing group means | Statistical significance (p < 0.05) |
| Replication | Independent labs repeat the trial | Pre-registration, open data |
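The analysis step in the table can be sketched in a few lines of Python. The example below is a minimal illustration, not a trial protocol: the blood-pressure data are simulated, and it swaps in a permutation test for the t-test named in the table so that it runs with the standard library alone. The function name `permutation_p_value` and all numeric settings (group size, spread, number of permutations) are assumptions.

```python
import random


def permutation_p_value(treatment, placebo, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Under H0 the group labels are interchangeable, so we reshuffle them
    and count how often the shuffled difference is at least as large as
    the observed one.
    """
    rng = random.Random(seed)
    n_treat = len(treatment)
    n_plac = len(placebo)
    observed = abs(sum(treatment) / n_treat - sum(placebo) / n_plac)
    pooled = list(treatment) + list(placebo)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_treat]) / n_treat - sum(pooled[n_treat:]) / n_plac)
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


# Simulated (hypothetical) change in systolic BP, in mmHg, for each participant.
data_rng = random.Random(42)
treatment_group = [data_rng.gauss(-10, 10) for _ in range(60)]
placebo_group = [data_rng.gauss(0, 10) for _ in range(60)]

p_value = permutation_p_value(treatment_group, placebo_group)
print(f"permutation p-value = {p_value:.4f}")
```

A permutation test makes fewer distributional assumptions than a t-test, at the cost of more computation; with well-behaved data of this size the two approaches usually lead to the same decision.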
The Replication Crisis
In 2015, the Open Science Collaboration published a landmark study in Science: 270 researchers had attempted to replicate 100 published psychology studies. Only 36% of the replications produced a statistically significant result in the same direction as the original. The average effect size in replications was about half that of the originals. The study sent shockwaves through social psychology, behavioral economics, and medicine.
The causes were multiple and are now well-documented:
- Publication bias: Academic journals preferentially publish positive results (p < 0.05) and reject null results, creating a literature that overstates the reliability of findings
- P-hacking (or "fishing"): Researchers who collect data and then try multiple statistical approaches, subgroup analyses, or variable combinations until p < 0.05 is achieved — without pre-registering the analysis plan — inflate the false positive rate far above 5%
- HARKing: Hypothesizing After Results are Known — presenting post-hoc interpretations as if they were pre-specified predictions
- Small sample sizes: Studies with 20–40 participants have low statistical power; genuine effects may be missed, and the significant results that do emerge are more likely to be false positives or to overestimate the true effect
- Researcher degrees of freedom: Countless small decisions in data collection, processing, and analysis (called the "garden of forking paths" by statistician Andrew Gelman) allow motivated reasoning to influence results without overt fraud
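Two of the causes listed above, publication bias and small samples, can be combined in a short simulation. The sketch below is an illustration under stated assumptions, not a model of the Open Science Collaboration data: it assumes a small true effect (0.2 standard deviations), 20 participants per group, a z-test with known unit variance, and a publication rule under which only significant results appear in print. The function name `simulate_literature` and every parameter value are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_literature(true_effect=0.2, n_per_group=20, n_studies=2000,
                        alpha=0.05, seed=1):
    """Simulate many small two-group studies.

    Returns (all_estimates, published), where `published` keeps only the
    estimates that reached statistical significance.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    std_error = (2 / n_per_group) ** 0.5          # SE of a mean difference when sd = 1
    all_estimates, published = [], []
    for _ in range(n_studies):
        treated = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        estimate = sum(treated) / n_per_group - sum(control) / n_per_group
        all_estimates.append(estimate)
        # Assumed publication rule: only significant results reach print.
        if abs(estimate) / std_error > z_crit:
            published.append(estimate)
    return all_estimates, published


all_estimates, published = simulate_literature()
mean_all = sum(all_estimates) / len(all_estimates)
mean_published = sum(published) / len(published)
print(f"studies run: {len(all_estimates)}, published: {len(published)}")
print(f"mean effect, all studies:       {mean_all:.2f}")
print(f"mean effect, published studies: {mean_published:.2f}")
```

Under these assumptions the published subset overstates the true effect, because a small study can only cross the significance threshold when its estimate happens to be large. A faithful replication would then be expected to find a smaller effect than the original report.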
P-Hacking: A Quantitative Illustration
If a researcher runs 20 independent statistical tests on data containing no real effects, each at a significance level of 0.05, the probability of obtaining at least one "significant" result by chance is 1 − (0.95)^20 ≈ 0.64, or about 64%. The family-wise error rate (the chance of at least one false positive across a collection of tests) climbs rapidly with the number of comparisons. Corrections for multiple comparisons address this in different ways: the Bonferroni correction controls the family-wise error rate, while the Benjamini-Hochberg procedure controls the false discovery rate, the expected share of false positives among the results declared significant. Both reduce statistical power, making genuine effects harder to detect.
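The arithmetic and the two corrections named above can be checked with a short Python sketch. The function names and the list of ten p-values are made up for illustration; in practice a statistics library would normally be used instead of hand-written versions.

```python
def family_wise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests when H0 is always true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni: compare every p-value against alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


fwer_20 = family_wise_error_rate(20)
print(f"FWER for 20 tests at alpha = 0.05: {fwer_20:.3f}")

# Hypothetical p-values from ten tests in one study.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
uncorrected = sum(p < 0.05 for p in p_values)
bonferroni = sum(bonferroni_reject(p_values))
bh = sum(benjamini_hochberg_reject(p_values))
print(f"significant: uncorrected={uncorrected}, Bonferroni={bonferroni}, BH={bh}")
```

For this made-up list, five results pass the uncorrected 0.05 threshold, two survive Benjamini-Hochberg, and one survives Bonferroni, which shows the trade-off between error control and power.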
The journalist John Bohannon demonstrated p-hacking's dangers in 2015 by running a genuinely underpowered clinical trial on chocolate and weight loss, obtaining a p < 0.05 result through multiple measurements, and getting the paper published in a peer-reviewed journal. Media coverage of his "chocolate helps you lose weight" finding reached 20+ countries. The paper was a deliberate hoax to expose the problem — but the methods it used were standard practice in published research.
Corrective Measures: Pre-registration and Open Science
The scientific community's response has been substantial. Pre-registration means depositing the specific hypothesis, sample size calculation, and analysis plan in a public registry (OSF.io, ClinicalTrials.gov, AsPredicted.org) before data collection begins. Because the plan is time-stamped and public, it becomes much harder to present post-hoc findings as predictions. Registered Reports, offered by over 300 journals, go a step further: papers are accepted for publication on the basis of the introduction and methods section before results are known, so the decision to publish cannot depend on whether the result is positive.
Open data mandates (requiring raw data to be publicly accessible), open materials (sharing stimuli and code), and adversarial collaboration (where researchers who disagree on a finding design a joint study to resolve it) have transformed norms rapidly since 2015. The replication crisis, despite its name, may be better understood as a self-corrective mechanism working — science exposing and addressing its own systematic errors.