Simpson's Paradox: When Statistics Lie by Telling the Truth
How Simpson's Paradox reverses statistical trends when data is aggregated, real-world examples from Berkeley admissions and kidney stones, and how to detect and avoid it.
Treatment A Beat Treatment B in Every Subgroup — and Lost Overall
Simpson's Paradox occurs when a trend appears in multiple groups of data but reverses, or disappears, when those groups are combined. This is not a statistical error or measurement mistake; it is a mathematically valid consequence of how weighted averages behave when groups differ in size. The paradox was formally described by British statistician E.H. Simpson in a 1951 paper in the Journal of the Royal Statistical Society, though similar observations had appeared earlier in work by Karl Pearson (1899) and Udny Yule (1903). It is among the most practically consequential phenomena in statistics because every subgroup can truthfully point one way while the aggregate, equally truthfully, points the other.
The Classic Numerical Example
Suppose a hospital compares two kidney stone treatments:
| Group | Treatment A Success Rate | Treatment B Success Rate |
|---|---|---|
| Small stones | 93% (81/87) | 87% (234/270) |
| Large stones | 73% (192/263) | 69% (55/80) |
| Combined | 78% (273/350) | 83% (289/350) |
Treatment A outperforms Treatment B in both subgroups: for small stones (93% vs 87%) and for large stones (73% vs 69%). Yet combined, Treatment B appears to win (83% vs 78%). This is not an error. Treatment A was preferentially given to the harder cases: 263 of its 350 patients (75%) had large stones, compared with 80 of 350 (23%) for Treatment B. Large stones have lower success rates across the board, dragging down Treatment A's combined average, while Treatment B's heavier use on easy cases (small stones) inflates its combined average. That imbalance in case mix, not any difference in the treatments themselves, creates the reversal. This exact dataset appears in a 1986 paper by C.R. Charig et al. in the British Medical Journal and is the most-cited real example of Simpson's Paradox in medical literature.
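The arithmetic behind the table can be checked directly. The following is a minimal sketch using only the counts shown above; the dictionary layout and the helper names `rate` and `combined_rate` are illustrative choices, not part of the original study.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone size, copied from the table
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    """Exact success rate for one cell of the table."""
    return Fraction(successes, total)

def combined_rate(groups):
    """Pool successes and patients across stone sizes, then divide."""
    successes = sum(s for s, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return Fraction(successes, total)

# Within each stone size, compare A against B
for size in ("small", "large"):
    a = rate(*counts["A"][size])
    b = rate(*counts["B"][size])
    print(f"{size:>8}: A={float(a):.1%}  B={float(b):.1%}  A wins: {a > b}")

# Pooled over stone sizes, the ordering flips
a_all = combined_rate(counts["A"])
b_all = combined_rate(counts["B"])
print(f"combined: A={float(a_all):.1%}  B={float(b_all):.1%}  A wins: {a_all > b_all}")

# The case mix explains the flip: share of each treatment's patients with large stones
for t in ("A", "B"):
    large_n = counts[t]["large"][1]
    total_n = sum(n for _, n in counts[t].values())
    print(f"Treatment {t}: {large_n / total_n:.0%} large-stone cases")
```

Using `Fraction` keeps the comparisons exact, so the reversal cannot be blamed on rounding.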
The Berkeley Admissions Case
The most famous real-world instance involved UC Berkeley's graduate admissions in 1973. Aggregate data appeared to show significant bias against women: 44% of male applicants were admitted versus 35% of female applicants — a gap large enough to prompt a discrimination investigation. Peter Bickel, a statistician at Berkeley, disaggregated the data by department and found the opposite: in most individual departments, women were admitted at equal or slightly higher rates than men. The paradox arose because women disproportionately applied to highly competitive departments (like English and social sciences) with low admission rates for all applicants, while men disproportionately applied to less competitive departments (like engineering) with higher admission rates. Bickel published the analysis in Science in 1975. The study became a landmark case for the importance of disaggregated analysis in discrimination research.
Why It Happens: The Role of Confounding Variables
Simpson's Paradox is a manifestation of confounding — a third variable that is associated with both the predictor and the outcome and creates spurious relationships when ignored. In the kidney stone example, stone size is the confounder. In the Berkeley case, department choice and department selectivity are confounders.
- A confounder creates an imbalance in how data is distributed across groups
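- When the confounder's levels have different base rates and are represented in different proportions in the groups being compared, each aggregate becomes a weighted average with its own weights, and the comparison misrepresents the subgroup patterns
- The reversal cannot occur when the compared groups have the same mix of confounder levels (for example, the same proportion of small and large stones under each treatment); the imbalance in that mix is structurally necessary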
- Adding the confounding variable to the analysis (stratification) restores the correct picture; omitting it generates the misleading aggregate
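One way to see stratification at work is direct standardization: compute each treatment's stratum-specific rates, then average them with the same weights for both treatments. The sketch below applies this to the kidney stone counts. Using the pooled stone-size mix of all 700 patients as the standard is an assumption made here for illustration, and `standardized_rate` is a hypothetical helper name.

```python
# (successes, patients) from the kidney stone table
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def standardized_rate(treatment_counts, weights):
    """Weighted average of stratum success rates using a shared set of weights."""
    return sum(
        weights[stratum] * successes / total
        for stratum, (successes, total) in treatment_counts.items()
    )

# Assumption: the standard population is the pooled stone-size mix of all patients.
stratum_totals = {
    stratum: sum(counts[t][stratum][1] for t in counts)
    for stratum in ("small", "large")
}
grand_total = sum(stratum_totals.values())
weights = {s: n / grand_total for s, n in stratum_totals.items()}

# Both treatments are now evaluated on the same mix of small and large stones
adjusted = {t: standardized_rate(counts[t], weights) for t in counts}
for t, r in adjusted.items():
    print(f"Treatment {t}: standardized success rate {r:.1%}")
```

With a common case mix, Treatment A comes out ahead, in line with the subgroup results. The adjusted figures are only as good as the assumption that stone size is the sole relevant confounder.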
Mathematical Conditions for the Paradox
The paradox requires a specific algebraic condition to occur. For two groups A₁ and A₂ where Treatment X beats Treatment Y in each group:
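If group A₁ makes up a much larger fraction of Treatment X's denominator than of Treatment Y's, and A₁ has a lower base success rate than A₂, the aggregated rates can reverse. Formally, write Treatment Y's results as a/b in A₁ and e/f in A₂, and Treatment X's as c/d in A₁ and g/h in A₂. Even if a/b < c/d and e/f < g/h, it is still possible that (a+e)/(b+f) > (c+g)/(d+h) when the denominators are split very differently between the groups, that is, when b/f differs enough from d/h. In the kidney stone data (X = A, Y = B, A₁ = large stones), 55/80 < 192/263 and 234/270 < 81/87, yet 289/350 > 273/350.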
- The paradox is structurally impossible when both treatments draw the same proportion of their cases from each group (b/f = d/h); unequal weighting is a prerequisite
- Judea Pearl's causal hierarchy framework (from his 2018 book "The Book of Why") distinguishes association (seeing data), intervention (doing), and counterfactuals (imagining) — Simpson's Paradox lives at the association level and requires causal reasoning to resolve
- The resolution depends on which causal direction is meaningful: in the medical case, stone size causally precedes treatment choice, so stratifying by stone size reveals the true treatment effect
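The algebraic condition can be made explicit by writing each treatment's aggregate rate as a weighted average. The notation is introduced here for illustration and is an assumption, not taken from the cited sources: p_{T,i} is treatment T's success rate in group A_i, and w_T is the share of treatment T's cases that fall in A_1.

$$
P_T = w_T\,p_{T,1} + (1 - w_T)\,p_{T,2}, \qquad T \in \{X, Y\}
$$

$$
P_X - P_Y = \underbrace{w_Y\,(p_{X,1} - p_{Y,1}) + (1 - w_Y)\,(p_{X,2} - p_{Y,2})}_{\text{positive when X wins in both groups}} + \underbrace{(w_X - w_Y)\,(p_{X,1} - p_{X,2})}_{\text{composition term}}
$$

When w_X = w_Y the composition term vanishes and no reversal is possible. A reversal needs that term to be negative and larger in magnitude than the first. For the kidney stone table with A_1 = large stones, w_A ≈ 0.75 and w_B ≈ 0.23, so the first term is about +0.06 and the composition term about -0.11, which gives the observed gap of roughly -0.05 (78% vs 83%).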
Detection and Prevention
Detecting Simpson's Paradox in real data requires deliberate effort since the paradox is invisible in aggregate summaries.
| Warning Sign | What to Look For |
|---|---|
| Unequal group sizes | One subgroup makes up a much larger share of one comparison arm than of the other |
| Known confounders | Variables that affect both group membership and outcome simultaneously |
| Counterintuitive aggregate result | Aggregate finding that contradicts domain knowledge or prior research |
| Missing stratification | Reports presenting only combined totals without subgroup breakdown |
Standard defenses include stratified analysis (computing effects separately for each level of the potential confounder), regression analysis with confounders included as covariates, and propensity score matching to create balanced comparison groups. In randomized clinical trials, random assignment balances confounders across treatment groups in expectation, which makes a Simpson's Paradox reversal unlikely. Observational studies have no such protection and remain vulnerable unless the relevant confounders are measured and adjusted for.
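A basic automated check compares the direction of the effect within every stratum against the direction in the pooled data. The sketch below is one possible way to do that; `reversal_report` and its input layout are hypothetical, and it only flags a full sign reversal without assessing statistical significance.

```python
from fractions import Fraction

def sign(x):
    """Return 1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def reversal_report(counts, arm_a, arm_b):
    """Compare arm_a with arm_b within each stratum and in the pooled data.

    counts maps arm -> stratum -> (successes, total).
    """
    strata = counts[arm_a].keys() & counts[arm_b].keys()
    stratum_signs = {
        s: sign(Fraction(*counts[arm_a][s]) - Fraction(*counts[arm_b][s]))
        for s in strata
    }

    def pooled(arm):
        return Fraction(
            sum(counts[arm][s][0] for s in strata),
            sum(counts[arm][s][1] for s in strata),
        )

    pooled_sign = sign(pooled(arm_a) - pooled(arm_b))
    signs = set(stratum_signs.values())
    # A reversal here means: all strata agree on a direction, and the pooled data disagrees
    consistent = len(signs) == 1 and 0 not in signs
    reversal = consistent and pooled_sign == -next(iter(signs))
    return {
        "stratum_signs": stratum_signs,
        "pooled_sign": pooled_sign,
        "reversal": reversal,
    }

kidney = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}
report = reversal_report(kidney, "A", "B")
print(report["reversal"])
```

A flagged reversal is a prompt to look for a confounder, not a verdict on which view is correct; that still depends on the causal structure of the data.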