3.2 Simpson’s Paradox

As in the Two-Way Tables tutorial, let’s say a drug company is interested in evaluating the performance of two new drugs in development, New Drug 1 (D1) and New Drug 2 (D2), in alleviating Disease Y symptoms. They want to test both against the current standard drug (ST). They enroll 1000 people in a large clinical trial and find that:

  1. out of the 400 people put on D1, 200 found their health status improve,
  2. out of the 200 people put on D2, 150 found their health status improve, and
  3. out of the 400 people put on ST, 240 found their health status improve.

You can create the table directly in R:

drug <- matrix(c(200, 200, 150, 50, 240, 160), ncol = 2, byrow = TRUE)
colnames(drug) <- c("Improved", "NotImproved")
rownames(drug) <- c("D1", "D2", "ST")
drug
##    Improved NotImproved
## D1      200         200
## D2      150          50
## ST      240         160

Recall that we decided to drop the D1 drug treatment from consideration because it appeared to be less effective than the standard treatment. Here we will take another look at that result.

Let’s create the 2x2 table for D1 and ST:

smallDrug <- drug[c(1,3),]
smallDrug
##    Improved NotImproved
## D1      200         200
## ST      240         160

What happens if we break this down by sex? Suppose the counts for females are as follows, so that the counts for males are whatever remains:

smallDrug.f <- matrix(c(80, 20, 210, 90), ncol = 2, byrow=TRUE)
colnames(smallDrug.f) <- c("Improved", "NotImproved")
rownames(smallDrug.f) <- c("D1", "ST")
smallDrug.f
##    Improved NotImproved
## D1       80          20
## ST      210          90
smallDrug.m <- smallDrug - smallDrug.f
smallDrug.m
##    Improved NotImproved
## D1      120         180
## ST       30          70

We calculate the row conditional probabilities for each table:

prop.table(smallDrug.f, margin = 1)
##    Improved NotImproved
## D1      0.8         0.2
## ST      0.7         0.3
prop.table(smallDrug.m, margin = 1)
##    Improved NotImproved
## D1      0.4         0.6
## ST      0.3         0.7
prop.table(smallDrug, margin = 1)
##    Improved NotImproved
## D1      0.5         0.5
## ST      0.6         0.4

Thus, 80% (80/100) of females improved on the new drug in comparison with 70% (210/300) for the standard drug. For males, 40% (120/300) improved on the new drug and 30% (30/100) on the standard drug. So both sexes did better on the new drug (D1) than on the standard drug (ST).

But when we combine the data and ignore the sex variable, 50% (200/400) improved on the new drug (D1) in comparison with 60% (240/400) on the standard drug (ST). In contrast to the sex-specific results, the overall result is that the patients did WORSE on the new drug (D1).

These seemingly conflicting results might be summarized for a lay person by saying that females did better on the new drug, that males also did better on the new drug, but that humans as a whole did worse. Thus, this is an example of Simpson’s Paradox.
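To see the reversal in one place, the two sex-specific tables can be stacked into a single three-way array. The sketch below is only an illustration: the object names (`counts`, `by_sex`, `combined`) are ours and do not appear elsewhere in the tutorial, and the numbers are copied from the tables above.

```r
# Drug x Outcome x Sex array; array() fills the first dimension fastest
counts <- array(
  c(80, 210, 20, 90,     # females: Improved (D1, ST), then NotImproved (D1, ST)
    120, 30, 180, 70),   # males:   Improved (D1, ST), then NotImproved (D1, ST)
  dim = c(2, 2, 2),
  dimnames = list(Drug = c("D1", "ST"),
                  Outcome = c("Improved", "NotImproved"),
                  Sex = c("Female", "Male"))
)

# Improvement rate for each drug within each sex
by_sex <- prop.table(counts, margin = c(1, 3))[, "Improved", ]

# Collapse over sex to recover the combined 2x2 table, then take row proportions
combined <- prop.table(apply(counts, c(1, 2), sum), margin = 1)[, "Improved"]

# Difference in improvement rate, D1 minus ST
by_sex["D1", ] - by_sex["ST", ]    # positive within each sex
combined["D1"] - combined["ST"]    # negative once sex is ignored
```

A positive difference within every stratum together with a negative difference in the collapsed table is exactly the pattern described above.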

This is a “composition effect” driven by two key factors.

First, the condition appears to be harder to treat in males regardless of which drug is used: males improved at rates of 30% to 40%, compared with 70% to 80% for females. The success rate for males therefore tends to be lower than for females.

Second, the two drugs were given to very different mixes of patients: the new drug (D1) was given to 300 men and 100 women, while the standard drug (ST) was given to 100 men and 300 women. Thus, if we don’t break down the results by sex, D1 will look worse than it should, since most of its patients are males, for whom the condition is harder to treat on any drug. Similarly, ST will look better than it should, since most of its patients are females, for whom the condition is easier to treat on any drug. This is why ST appears to do better when we don’t separate by sex, even though D1 performs better within each sex.
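One way to make the composition effect concrete is to write each drug’s overall improvement rate as a weighted average of its sex-specific rates, where the weights are the share of each sex among the patients on that drug. The following is a minimal sketch using the counts from the tables above; the object names (`improved`, `n`, `rate`, `w`, `overall`) are chosen here for illustration.

```r
# Patients who improved, by drug (rows) and sex (columns)
improved <- matrix(c(80, 120,
                     210, 30),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("D1", "ST"), c("Female", "Male")))

# Total patients on each drug, by sex
n <- matrix(c(100, 300,
              300, 100),
            ncol = 2, byrow = TRUE,
            dimnames = dimnames(improved))

rate <- improved / n               # sex-specific improvement rates
w <- prop.table(n, margin = 1)     # sex mix within each drug (rows sum to 1)

# Overall rate = sum over sexes of (rate within sex) * (share of that sex)
overall <- rowSums(rate * w)
overall
```

For D1 the male rate of 0.4 gets a weight of 0.75, which pulls the overall rate down to 0.5. For ST the female rate of 0.7 gets a weight of 0.75, which pulls the overall rate up to 0.6.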

In this case, sex plays the role of a “lurking/excluded” variable. Simpson’s Paradox is one reason why it is so important to identify potential lurking variables to make sure that the conclusions that are drawn from the data are valid.
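One common way to account for a known lurking variable is to compare the drugs under the same sex mix, an approach usually called direct standardization. This technique is not part of the analysis above; the sketch below is only an illustration, and it assumes a 50/50 reference mix, which matches the pooled D1 and ST patients (400 females and 400 males). The names `std_weights` and `adjusted` are ours.

```r
# Sex-specific improvement rates taken from the tables above
rate <- matrix(c(0.8, 0.4,
                 0.7, 0.3),
               ncol = 2, byrow = TRUE,
               dimnames = list(c("D1", "ST"), c("Female", "Male")))

# Assumed reference population: equal shares of females and males
std_weights <- c(Female = 0.5, Male = 0.5)

# Rate each drug would show if both had been given to the reference mix
adjusted <- drop(rate %*% std_weights)
adjusted
```

Under this common sex mix the adjusted rates are 0.6 for D1 and 0.5 for ST, so the comparison agrees with the sex-specific tables rather than with the combined one.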

Finally, it’s possible to see Simpson’s paradox at work in continuous data also. Check out this example on the median wage: Median wage Simpson’s paradox example