How Statistics Can Lie: Simpson's Paradox, Cherry-Picking, and Misleading Charts
Statistics can mislead even without outright falsehoods. Learn how Simpson's paradox, cherry-picked data, and deceptive visualizations distort our understanding of reality.
Why Statistics Can Mislead
The aphorism often attributed to Mark Twain, "There are three kinds of lies: lies, damned lies, and statistics," captures a genuine problem. Numbers feel authoritative. A statistic quoted in a report or headline carries an air of objectivity that prose claims do not. But statistics can mislead profoundly, sometimes through deliberate deception and sometimes through honest mistakes made by people who do not understand the tools they are using.
This article surveys the most common and consequential ways statistical reasoning goes wrong: from mathematical paradoxes to sampling bias to graphical tricks. Recognizing these patterns is one of the most practically useful skills a modern reader can develop.
Simpson's Paradox
Simpson's paradox occurs when a trend that appears in several separate groups of data disappears or reverses when the groups are combined. It is not a trick or a lie. It is a genuine mathematical phenomenon that has fooled researchers and policymakers.
A classic example: suppose a drug appears to improve recovery rates for men and also for women when the groups are analyzed separately, but when the groups are combined, the drug appears harmful. How is this possible? It happens when a confounding variable, here group membership, is related to both treatment and outcome. If most of the treated patients come from the group with the lower baseline recovery rate, the pooled rate for the treated is dragged down by the mix of patients rather than by the drug. The famous study of 1973 graduate admissions at UC Berkeley followed the same pattern. The aggregate figures showed women admitted at a noticeably lower rate than men, but when the data were broken down by department, the apparent bias against women largely disappeared, and several departments admitted women at higher rates than men. The paradox arose because women disproportionately applied to more competitive departments, which rejected most applicants of either sex.
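The reversal in the drug example can be reproduced with a few lines of arithmetic. The sketch below uses made-up counts, chosen only so that the effect appears; the group sizes, recovery numbers, and the names `trial`, `rate`, and `combined_rate` are illustrative assumptions, not data from any real study.

```python
# Hypothetical counts chosen only to produce the reversal: (recovered, total).
# Most treated patients are women, who here have the lower baseline recovery rate.
trial = {
    "men":   {"drug": (90, 100),   "no_drug": (800, 1000)},
    "women": {"drug": (300, 1000), "no_drug": (20, 100)},
}

def rate(recovered, total):
    """Recovery rate as a fraction."""
    return recovered / total

def combined_rate(arm):
    """Recovery rate for one arm after pooling both groups."""
    recovered = sum(trial[group][arm][0] for group in trial)
    total = sum(trial[group][arm][1] for group in trial)
    return recovered / total

for group, arms in trial.items():
    print(f"{group}: drug {rate(*arms['drug']):.0%}, no drug {rate(*arms['no_drug']):.0%}")

print(f"combined: drug {combined_rate('drug'):.1%}, no drug {combined_rate('no_drug'):.1%}")
```

Within each group the drug arm has the higher recovery rate, yet the pooled drug arm has the lower one, because the pooled figure mostly reflects which group the patients came from.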
Cherry-Picking and Confirmation Bias
Cherry-picking means selectively presenting data that supports a predetermined conclusion while ignoring contradictory evidence. In practice, this takes many forms:
- Reporting only the time window in which a trend favors your argument (e.g., "crime has fallen 30% since 2010" when 2010 was the peak of a sharp rise and crime remains above its earlier level).
- Highlighting the few studies that support a claim while ignoring a larger body of contrary evidence.
- Reporting a subgroup finding as if it applied to the whole population.
Cherry-picking is often difficult to detect without access to the full dataset or a systematic review of the literature. The main antidotes are meta-analysis and pre-registration of study hypotheses, practices that make selective reporting harder to conceal.
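The time-window form of cherry-picking is easy to demonstrate numerically. The following is a minimal sketch with invented yearly counts; the `incidents` series and the `percent_change` helper are hypothetical and exist only to show how the choice of starting year changes the headline.

```python
# Hypothetical yearly counts, invented for illustration only.
incidents = {2000: 100, 2005: 140, 2010: 200, 2015: 160, 2020: 140}

def percent_change(series, start, end):
    """Percentage change in the series between two years."""
    return 100 * (series[end] - series[start]) / series[start]

since_2010 = percent_change(incidents, 2010, 2020)  # window starts at the peak
since_2000 = percent_change(incidents, 2000, 2020)  # window includes the earlier rise

print(f"since 2010: {since_2010:+.0f}%")
print(f"since 2000: {since_2000:+.0f}%")
```

The same series supports "down 30%" or "up 40%" depending on where the window starts, which is why the full range matters.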
Misleading Charts and Visualizations
Visual representations of data carry enormous persuasive power, and they are easily manipulated. Several techniques are widely used to distort perception:
- Truncated y-axis: Starting the vertical axis at a value other than zero exaggerates differences between bars or lines. A 2% change can be made to look like a doubling.
- Dual axes: Plotting two different variables on the same graph with separate scales can imply a correlation or relationship that is coincidental or driven by the scaling choice.
- 3D charts: Three-dimensional pie charts or bar charts introduce visual distortion that makes some segments appear larger than they are.
- Area vs. length: When circles represent quantities, scaling the radius rather than the area in proportion to the value systematically overstates differences, because area grows with the square of the radius.
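Two of the distortions in this list can be quantified directly. The sketch below is illustrative only: the function names and the example values are assumptions, and it computes the ratios a viewer would see rather than drawing a chart.

```python
import math

def drawn_height_ratio(a, b, axis_start=0.0):
    """Ratio of bar heights as drawn when the y-axis starts at axis_start."""
    return (b - axis_start) / (a - axis_start)

def circle_area_ratio(a, b, scale_by="area"):
    """Ratio of circle areas as drawn for values a and b.

    scale_by="radius" sets the radius proportional to the value (misleading);
    scale_by="area" sets the radius proportional to the square root of the value.
    """
    if scale_by == "radius":
        r_a, r_b = a, b
    else:
        r_a, r_b = math.sqrt(a), math.sqrt(b)
    return (r_b / r_a) ** 2

# A 2% difference (50 vs. 51) with an honest axis and with a truncated one.
print(drawn_height_ratio(50, 51))                 # bars look nearly equal
print(drawn_height_ratio(50, 51, axis_start=49))  # second bar looks twice as tall

# A threefold difference drawn as circles.
print(circle_area_ratio(1, 3, scale_by="radius"))  # drawn area is nine times larger
print(circle_area_ratio(1, 3, scale_by="area"))    # drawn area is about three times larger
```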
The classic reference for these techniques is Darrell Huff's How to Lie with Statistics (1954), still one of the most readable introductions to statistical deception.
Correlation, Causation, and Confounding
Perhaps the most repeated statistical warning is that correlation does not imply causation, yet conflating them remains extraordinarily common in media coverage of research. Two variables can be strongly correlated for several reasons: A causes B, B causes A, a third variable C causes both, or the correlation is purely coincidental (a spurious correlation).
The website Spurious Correlations by Tyler Vigen catalogs absurd but statistically real correlations, such as the very strong correlation between US per-capita cheese consumption and deaths from becoming tangled in bedsheets. These examples illustrate that if you compare enough variables, you will find strong correlations by chance alone. This is the multiple comparisons problem, and searching through data for such patterns and then reporting only the hits is known in scientific research as data dredging or p-hacking.
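Chance correlations of this kind are easy to generate. The sketch below is a small simulation, not a reproduction of anything on the Spurious Correlations site: it creates 100 series of pure random noise, 10 points each, and reports the strongest correlation found among all pairs. The series count, series length, and seed are arbitrary assumptions.

```python
import itertools
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable

# 100 unrelated series of 10 points each: independent Gaussian noise.
series = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(100)]

# Compare every pair (4,950 comparisons) and keep the strongest correlation.
strongest = max(
    abs(pearson(a, b)) for a, b in itertools.combinations(series, 2)
)
print(f"strongest |r| among {len(series)} unrelated series: {strongest:.2f}")
```

None of the series has anything to do with any other, yet with thousands of pairs to choose from, the best match will typically look impressive. Reporting only that pair, without mentioning the search, is the dredging step.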
Base Rates and Misleading Percentages
Ignoring base rates is one of the most common statistical errors, and one with serious real-world consequences. A medical test that is 99% accurate sounds reassuring. Suppose this means the test correctly flags 99% of people who have the condition and correctly clears 99% of people who do not. If the condition affects only 1 in 10,000 people, then among 10,000 people tested, about 101 will test positive: roughly 100 false positives (1% of the 9,999 healthy people) plus about 1 true positive. The probability that a positive test actually indicates the disease is around 1%, not 99%.
This base rate neglect affects medical diagnosis, security screening, and legal evidence. The prosecutor's fallacy, confusing the probability of a match given innocence with the probability of innocence given a match, has contributed to wrongful convictions. Understanding conditional probability and Bayes' theorem is the corrective.
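The screening example above can be checked with Bayes' theorem. The sketch below assumes that "99% accurate" means both sensitivity and specificity are 0.99, which is an interpretation rather than something the headline figure guarantees; the function name is illustrative.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' theorem."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)

# Assumed inputs: 1 in 10,000 prevalence, 99% sensitivity and specificity.
ppv = positive_predictive_value(
    prevalence=1 / 10_000, sensitivity=0.99, specificity=0.99
)
print(f"chance a positive result is a true positive: {ppv:.2%}")
```

Raising the prevalence argument to 1 in 100 brings the result to 50%, which shows how strongly the answer depends on the base rate rather than on the test alone.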
How to Think More Clearly About Statistics
Developing statistical literacy does not require advanced mathematics. A few habits dramatically improve your ability to evaluate statistical claims:
- Always ask: who collected this data, and how? Self-selected surveys, volunteer samples, and convenience samples are rarely representative.
- Look for the absolute numbers behind relative claims. A "50% increase" in risk from a very rare event may be trivial in absolute terms.
- Check whether comparisons are apples-to-apples. Are the groups comparable in other relevant ways?
- Look at the full axis range when reading charts.
- Ask whether the result has been independently replicated.
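The habit of looking for absolute numbers behind relative claims comes down to one multiplication. The sketch below uses a hypothetical baseline risk of 2 in 100,000; the figure and the `risk_change` helper are assumptions made for illustration.

```python
def risk_change(baseline_risk, relative_increase):
    """Return (new risk, absolute increase) for a given relative increase."""
    new_risk = baseline_risk * (1 + relative_increase)
    return new_risk, new_risk - baseline_risk

# Hypothetical rare event: 2 cases per 100,000 people, "50% increase" in risk.
new_risk, absolute_increase = risk_change(2 / 100_000, 0.50)
extra_cases_per_100k = absolute_increase * 100_000

print(f"risk rises from 2 to {new_risk * 100_000:.0f} per 100,000")
print(f"extra cases per 100,000 people: {extra_cases_per_100k:.0f}")
```

A "50% increase" here amounts to about one additional case per 100,000 people, a figure that rarely appears in the headline.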
Statistical deception, whether intentional or accidental, thrives in the gap between the complexity of data and the audience's ability to interrogate it. The most powerful defense is a population that knows what questions to ask.