Simpson's Paradox: When Statistics Lie by Telling the Truth

Treatment A Beat Treatment B in Every Subgroup — and Lost Overall
Simpson's Paradox occurs when a trend that appears in several groups of data reverses, or disappears, when those groups are combined. This is not a statistical error or measurement mistake; it is a mathematically valid consequence of how weighted averages behave when the groups being compared have different subgroup compositions. The paradox was formally described by British statistician E.H. Simpson in a 1951 paper in the Journal of the Royal Statistical Society, though similar observations had appeared earlier in work by Karl Pearson (1899) and Udny Yule (1903). It is among the most practically consequential phenomena in statistics because a conclusion that is true within every subgroup can point in the opposite direction from a conclusion that is equally true of the aggregate.

The Classic Numerical Example

Suppose a hospital compares two kidney stone treatments:

Group	Treatment A Success Rate	Treatment B Success Rate
Small stones	93% (81/87)	87% (234/270)
Large stones	73% (192/263)	69% (55/80)
Combined	78% (273/350)	83% (289/350)

Treatment A outperforms Treatment B in both subgroups — for small stones (93% vs 87%) and for large stones (73% vs 69%). Yet combined, Treatment B appears to win (83% vs 78%). This is not an error. Treatment A was preferentially given to the harder cases (large stones). Large stones have lower success rates across the board, dragging down Treatment A's combined average. Treatment B was used more for easy cases (small stones), inflating its combined average. The group sizes are unbalanced, and that imbalance creates the reversal. This exact dataset appears in a 1986 paper by C.R. Charig et al. in the British Medical Journal and is the most-cited real example of Simpson's Paradox in medical literature.
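The arithmetic behind the table can be checked directly. The following is a minimal Python sketch that recomputes the subgroup and combined rates from the counts in the table; the dictionary layout and function names are illustrative choices, not part of the original study.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone size, copied from the table.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    """Success rate as an exact fraction."""
    return Fraction(successes, total)

def combined_rate(by_group):
    """Pool successes and patients across stone sizes before dividing."""
    successes = sum(s for s, _ in by_group.values())
    total = sum(n for _, n in by_group.values())
    return Fraction(successes, total)

for size in ("small", "large"):
    a = rate(*counts["A"][size])
    b = rate(*counts["B"][size])
    print(f"{size:>8}: A = {float(a):.1%}, B = {float(b):.1%}")

a_all = combined_rate(counts["A"])
b_all = combined_rate(counts["B"])
print(f"combined: A = {float(a_all):.1%}, B = {float(b_all):.1%}")

# Share of each treatment's patients who had large stones (the harder cases).
large_share = {
    t: Fraction(g["large"][1], g["small"][1] + g["large"][1])
    for t, g in counts.items()
}
print({t: f"{float(s):.0%}" for t, s in large_share.items()})
```

The last line makes the imbalance visible: about three quarters of Treatment A's patients had large stones, against fewer than a quarter of Treatment B's.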

The Berkeley Admissions Case

The most famous real-world instance involved UC Berkeley's graduate admissions in 1973. Aggregate data appeared to show significant bias against women: 44% of male applicants were admitted versus 35% of female applicants — a gap large enough to prompt a discrimination investigation. Peter Bickel, a statistician at Berkeley, disaggregated the data by department and found the opposite: in most individual departments, women were admitted at equal or slightly higher rates than men. The paradox arose because women disproportionately applied to highly competitive departments (like English and social sciences) with low admission rates for all applicants, while men disproportionately applied to less competitive departments (like engineering) with higher admission rates. Bickel published the analysis in Science in 1975. The study became a landmark case for the importance of disaggregated analysis in discrimination research.

Why It Happens: The Role of Confounding Variables

Simpson's Paradox is a manifestation of confounding: a third variable, associated with both the predictor and the outcome, creates a spurious relationship when it is ignored. In the kidney stone example, stone size is the confounder. In the Berkeley case, department choice and department selectivity are confounders.

A confounder changes how cases are distributed across the groups being compared
When the comparison groups contain different mixes of subgroups and the subgroups have different base rates, aggregation produces a weighted average that misrepresents the subgroup patterns
The paradox cannot occur when every comparison group has the same subgroup mix; what the reversal requires is an imbalance in that mix, not in the overall totals (both kidney stone treatments had 350 patients)
Adding the confounding variable to the analysis (stratification) recovers the subgroup-level picture; omitting it produces the misleading aggregate
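One common way to act on the stratified picture is direct standardization: apply each treatment's subgroup success rates to a single shared subgroup mix, so both treatments are judged on the same blend of easy and hard cases. The sketch below is an illustration under that assumption, using the kidney stone counts; the choice of the pooled patient population as the reference mix, and the function name, are not from the original study.

```python
# (successes, patients) per treatment and stone size, from the kidney stone table.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def standardized_rate(treatment, counts):
    """Weight each subgroup rate by that subgroup's share of ALL patients."""
    sizes = list(counts[treatment])
    # Reference mix: patients per stone size, summed over both treatments.
    pooled = {g: sum(counts[t][g][1] for t in counts) for g in sizes}
    total = sum(pooled.values())
    return sum(
        (pooled[g] / total) * (counts[treatment][g][0] / counts[treatment][g][1])
        for g in sizes
    )

adj_a = standardized_rate("A", counts)
adj_b = standardized_rate("B", counts)
print(f"standardized: A = {adj_a:.1%}, B = {adj_b:.1%}")
```

With a common mix of 357 small-stone and 343 large-stone patients, Treatment A comes out at roughly 83% and Treatment B at roughly 78%, which agrees in direction with both subgroup comparisons.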

Mathematical Conditions for the Paradox

The paradox requires a specific algebraic condition to occur. For two groups A₁ and A₂ where Treatment X beats Treatment Y in each group:

If group A₁ makes up a much larger fraction of Treatment Y's patients than of Treatment X's, and A₁ has a higher base success rate than A₂, the aggregated rates can reverse. Formally, let a/b and e/f be Treatment Y's success rates in A₁ and A₂, and c/d and g/h be Treatment X's. If a/b < c/d and e/f < g/h, it is still possible that (a+e)/(b+f) > (c+g)/(d+h) when the denominators b, d, f, h are sufficiently unbalanced.
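As a worked check, the kidney stone counts can be substituted into the inequality. The assignment of letters to counts is an illustrative choice: a/b and e/f are taken as Treatment B's small-stone and large-stone rates, and c/d and g/h as Treatment A's.

```latex
\[
\frac{a}{b} = \frac{234}{270} \approx 0.867 \;<\; \frac{c}{d} = \frac{81}{87} \approx 0.931,
\qquad
\frac{e}{f} = \frac{55}{80} \approx 0.688 \;<\; \frac{g}{h} = \frac{192}{263} \approx 0.730
\]
\[
\frac{a+e}{b+f}
  = \frac{b}{b+f}\cdot\frac{a}{b} + \frac{f}{b+f}\cdot\frac{e}{f}
  = \frac{270}{350}(0.867) + \frac{80}{350}(0.688) \approx 0.826
\]
\[
\frac{c+g}{d+h}
  = \frac{d}{d+h}\cdot\frac{c}{d} + \frac{h}{d+h}\cdot\frac{g}{h}
  = \frac{87}{350}(0.931) + \frac{263}{350}(0.730) \approx 0.780
\]
```

Each aggregate is a weighted average of its own two subgroup rates. Treatment B puts about 77% of its weight on the easier small-stone rate, while Treatment A puts about 75% of its weight on the harder large-stone rate, which is enough to push 0.826 above 0.780 even though B loses both subgroup comparisons.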

The paradox is impossible when each subgroup makes up the same share of both treatments, that is, when b/(b+f) = d/(d+h); unequal weighting is a prerequisite
Judea Pearl's Ladder of Causation (from the 2018 book "The Book of Why", written with Dana Mackenzie) distinguishes association (seeing), intervention (doing), and counterfactuals (imagining); Simpson's Paradox appears at the association level and requires causal reasoning to resolve
The resolution depends on which causal direction is meaningful: in the medical case, stone size causally precedes treatment choice, so stratifying by stone size reveals the true treatment effect

Detection and Prevention

Detecting Simpson's Paradox in real data requires deliberate effort since the paradox is invisible in aggregate summaries.

Warning Sign	What to Look For
Unequal group sizes	One subgroup makes up most of one comparison group but only a small part of the other
Known confounders	Variables that affect both group membership and outcome simultaneously
Counterintuitive aggregate result	Aggregate finding that contradicts domain knowledge or prior research
Missing stratification	Reports presenting only combined totals without subgroup breakdown
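The warning signs above can be backed by a mechanical check once subgroup counts are available. The following is a rough sketch, assuming exactly two comparison arms and a single stratifying variable; the function name and input layout are hypothetical, and the "balanced" dataset is made up purely to show a case with no reversal.

```python
def has_simpson_reversal(strata):
    """strata maps stratum name -> {arm: (successes, total)} for exactly two arms.

    Returns True when one arm has the higher rate in every stratum
    but the lower rate once the strata are pooled.
    """
    arms = sorted(next(iter(strata.values())).keys())
    if len(arms) != 2:
        raise ValueError("expected exactly two arms")
    x, y = arms

    def rate(pair):
        successes, total = pair
        return successes / total

    def pooled(arm):
        successes = sum(s[arm][0] for s in strata.values())
        total = sum(s[arm][1] for s in strata.values())
        return successes / total

    x_wins_all = all(rate(s[x]) > rate(s[y]) for s in strata.values())
    y_wins_all = all(rate(s[y]) > rate(s[x]) for s in strata.values())
    if x_wins_all:
        return pooled(x) < pooled(y)
    if y_wins_all:
        return pooled(y) < pooled(x)
    return False

# Kidney stone counts from the table earlier in the article.
kidney = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}
# Made-up counts in which both arms have the same subgroup mix.
balanced = {
    "small": {"A": (90, 100), "B": (80, 100)},
    "large": {"A": (70, 100), "B": (60, 100)},
}
print(has_simpson_reversal(kidney))
print(has_simpson_reversal(balanced))
```

A check like this only covers the confounders someone thought to stratify on; it says nothing about variables that were never recorded.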

Standard defenses include stratified analysis (computing effects separately for each level of the potential confounder), regression analysis with confounders included as covariates, and propensity score matching to create balanced comparison groups. In clinical trials, randomization balances confounders across treatment groups on average, whether or not they were measured, which makes a reversal unlikely in a large trial. Observational studies have no such protection and remain vulnerable to any confounder left out of the analysis.
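The effect of randomization can be illustrated with a small simulation. This is a sketch under loudly stated assumptions: the subgroup success rates from the kidney stone table are treated as true probabilities, stone sizes are split 50/50, and the assignment probabilities in the "observational" scenario are invented for illustration.

```python
import random

# Assumption: subgroup success rates from the table, treated as true probabilities.
P_SUCCESS = {
    ("A", "small"): 81 / 87, ("A", "large"): 192 / 263,
    ("B", "small"): 234 / 270, ("B", "large"): 55 / 80,
}

def simulate(p_a_given_size, n=100_000, seed=0):
    """Return the aggregate success rate per treatment for n simulated patients.

    p_a_given_size maps stone size -> probability of receiving Treatment A.
    """
    rng = random.Random(seed)
    successes = {"A": 0, "B": 0}
    patients = {"A": 0, "B": 0}
    for _ in range(n):
        # Hypothetical 50/50 mix of stone sizes.
        size = "large" if rng.random() < 0.5 else "small"
        arm = "A" if rng.random() < p_a_given_size[size] else "B"
        patients[arm] += 1
        successes[arm] += int(rng.random() < P_SUCCESS[(arm, size)])
    return {arm: successes[arm] / patients[arm] for arm in patients}

# Invented scenario: clinicians favor Treatment A for the harder cases.
observational = simulate({"small": 0.25, "large": 0.75})
# Coin-flip assignment that ignores stone size.
randomized = simulate({"small": 0.5, "large": 0.5})

print("observational:", observational)
print("randomized:   ", randomized)
```

Under the stone-size-dependent assignment the aggregate comparison favors Treatment B, while under coin-flip assignment it favors Treatment A, matching the subgroup results. The success probabilities are identical in both runs; only the assignment rule changes.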
