5.23 Multiple testing

When carrying out a confirmatory analysis, you test the null hypothesis that the regression coefficient for the primary predictor is zero and, typically, you carry out this test at the \(\alpha\) = .05 level of significance. This limits the probability of a Type I error to 5%. That is, if the null hypothesis is true (no association), there is only a 5% chance that your sample will (incorrectly) yield a statistically significant result. But what happens if you want to confirm multiple hypotheses using multiple tests? Suppose you have \(V\) independent outcomes you want to test for association with an exposure. If you carry out \(V\) confirmatory regression analyses, each at the \(\alpha\) = .05 level of significance, then your overall Type I error rate is larger than 5%. If there are no associations, the chance that you make at least one Type I error is given by the following.

\[P(\textrm{At least 1 Type I error among } V \textrm{ tests}) = 1 - P(\textrm{no Type I errors}) = 1 - (1 - .05)^V\]

If you have two tests, this probability is \(1 - (0.95)^2 = 0.0975\), almost twice the target Type I error rate. If you have five tests, it is \(1 - (0.95)^5 = 0.2262\). Carrying out multiple statistical tests with no adjustment for this inflated Type I error rate results in a greater risk of spurious findings.
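
To see how quickly this probability grows with the number of tests, the short sketch below (the values of \(V\) are arbitrary, chosen for illustration) evaluates the formula directly.

# Familywise Type I error rate for V independent tests, each at alpha = .05
alpha <- 0.05
V     <- c(1, 2, 5, 10, 20)
data.frame(V = V,
           P.at.least.one.error = round(1 - (1 - alpha)^V, 4))
##    V P.at.least.one.error
## 1  1               0.0500
## 2  2               0.0975
## 3  5               0.2262
## 4 10               0.4013
## 5 20               0.6415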

The same issue arises in a regression where you have a categorical predictor with more than 2 levels. Although you can use an F-test to get an overall p-value for such a predictor (using car::Anova()), you may also want to know which pairs of levels are significantly different. This is called making post-hoc pairwise comparisons, or multiple comparisons. If you have a categorical predictor with \(L\) levels, there are \(L(L-1)/2\) possible pairwise comparisons.
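
The number of comparisons grows quickly with the number of levels. The sketch below uses choose() to evaluate \(L(L-1)/2\) for a few illustrative numbers of levels.

# Number of pairwise comparisons for a categorical predictor with L levels
L <- 2:6
data.frame(Levels = L,
           Pairwise.comparisons = choose(L, 2))   # choose(L, 2) = L(L-1)/2
##   Levels Pairwise.comparisons
## 1      2                    1
## 2      3                    3
## 3      4                    6
## 4      5                   10
## 5      6                   15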

How do we solve these problems? By using a multiple testing adjustment. The simplest adjustment is the Bonferroni correction, which divides \(\alpha\) by the number of tests. For example, if you have 3 tests, then you test each one at the .05/3 = .0167 level of significance. Equivalently, multiply each p-value by the number of tests to get an “adjusted p-value” which can be compared directly to .05. This results in an overall Type I error rate no greater than .05, and we express this by saying the multiple testing adjustment preserves a familywise Type I error rate of \(\alpha\) = .05. The term “familywise” refers to the family of tests under consideration (in this case, three tests). The Bonferroni correction can be carried out easily in R using p.adjust(p, method = "bonferroni").
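
As a quick illustration (the p-values below are hypothetical, chosen only to show the mechanics), the two forms of the correction lead to the same conclusions.

# Hypothetical p-values from 3 tests (for illustration only)
p <- c(0.010, 0.020, 0.200)

# Option 1: compare each p-value to alpha divided by the number of tests
p < 0.05 / length(p)
## [1]  TRUE FALSE FALSE

# Option 2: multiply each p-value by the number of tests (capped at 1) and
#           compare to alpha. This is what p.adjust() computes.
p.adjust(p, method = "bonferroni")
## [1] 0.03 0.06 0.60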

The Bonferroni correction, however, is overly conservative, particularly when the multiple tests are related to each other, as is often the case: if one predictor-outcome pair is associated, then it is more likely that another is as well. There are more powerful alternatives available in p.adjust(). The Hommel method (Hommel 1988) is more powerful and is valid when the tests are independent or positively associated (Sarkar 1998; Sarkar and Chang 1997) (see ?p.adjust). The Hommel method works well with mildly correlated outcomes; when the tests are carried out on outcomes that are highly related to each other, the more powerful (but more complicated to carry out) resampling-based stepdown MinP method is recommended (Blakesley et al. 2009; Westfall and Young 1993) (see, for example, ?NPC::FWE).

NOTE: Simply having many tests does not always imply the need for a multiple testing adjustment. See, for example, Kenneth J. Rothman (1990), Greenland and Hofman (2019), and Kenneth J. Rothman, Greenland, and Lash (2008) (p. 237). If one has a set of associations, each of interest on its own, then it is appropriate to test each association at the \(\alpha\) = .05 level with no adjustment for multiple testing. For example, suppose that, using data from a large national survey, an investigator decides to study 10 different exposure-outcome relationships. Each of these can be tested on its own with no adjustment since each is of interest on its own. On the other hand, suppose an investigator is interested in the association between a particular exposure and a set of 10 outcomes and wants to claim the exposure is harmful if any of the 10 associations are statistically significant. In this case, it is necessary to adjust for multiple testing. Otherwise, if the exposure is actually benign, the chance of a spurious conclusion that the exposure is harmful is much larger than 5%.

Example 5.9: Carry out a confirmatory analysis to test the association between smoking status (exposure) and any of the following outcomes: body mass index (BMXBMI), systolic blood pressure (sbp), HDL cholesterol (LBDHDDSI), and triglycerides (LBDTRSI), adjusted for confounding due to age, gender, and race/ethnicity. Assume that we are interested in screening these outcomes for further study based on the significance of their associations with the exposure. In this setting, it is appropriate to adjust for multiple testing over these 4 tests since we will conclude smoking is harmful if it is associated with any of these outcomes. An adjustment is necessary to preserve a familywise .05 Type I error rate.

# Fit each model
fit1 <- lm(BMXBMI   ~ smoker + RIDAGEYR + RIAGENDR + RIDRETH3, data = nhanes)
fit2 <- lm(sbp      ~ smoker + RIDAGEYR + RIAGENDR + RIDRETH3, data = nhanes)
fit3 <- lm(LBDHDDSI ~ smoker + RIDAGEYR + RIAGENDR + RIDRETH3, data = nhanes)
fit4 <- lm(LBDTRSI  ~ smoker + RIDAGEYR + RIAGENDR + RIDRETH3, data = nhanes)

# Store each p-value
p1   <- car::Anova(fit1, type = 3)["smoker", "Pr(>F)"]
p2   <- car::Anova(fit2, type = 3)["smoker", "Pr(>F)"]
p3   <- car::Anova(fit3, type = 3)["smoker", "Pr(>F)"]
p4   <- car::Anova(fit4, type = 3)["smoker", "Pr(>F)"]

# Compute adjusted p-values
p.unadj  <- c(p1, p2, p3, p4)
p.bonf   <- p.adjust(p.unadj, method = "bonferroni")
p.hommel <- p.adjust(p.unadj, method = "hommel")

# Compare the results
DF   <- data.frame(Unadjusted = round(p.unadj,  3),
                   Bonferroni = round(p.bonf,   3),
                   Hommel     = round(p.hommel, 3))
rownames(DF) <- c("BMI", "SBP", "HDL", "Triglycerides")
DF
##               Unadjusted Bonferroni Hommel
## BMI                0.014      0.057  0.043
## SBP                0.853      1.000  0.853
## HDL                0.213      0.854  0.427
## Triglycerides      0.002      0.007  0.007

If we based our conclusions on the raw p-values, 2 of the 4 tests would be statistically significant (BMI and triglycerides). After adjusting for multiple testing, only 1 of the 4 is statistically significant when using the Bonferroni correction (which is overly conservative), but both remain significant after using the Hommel adjustment. The Hommel method is more powerful, resulting in smaller adjusted p-values. The only reason to use the Bonferroni adjustment is when you want to keep things very simple and easy to understand, and then only if more powerful methods give the same result. Otherwise, use a more powerful method such as Hommel. In this example, our conclusions about statistical significance (that smoking status is significantly associated with BMI and triglycerides) were the same when using unadjusted p-values and after adjusting for multiple testing using the Hommel method, but this will not always be the case.

Example 5.10: Carry out a confirmatory analysis to test the association between smoking status (the exposure of interest) and triglycerides (LBDTRSI), adjusted for confounding due to age, gender, and race/ethnicity. Compare the mean outcome between the three levels of smoking status. There are three pairwise comparisons (Past vs. Never, Current vs. Never, Current vs. Past) and we will conclude there is a significant difference between levels if any of these three are different. Thus, it is appropriate to adjust for multiple comparisons.

fit.ex5.10 <- lm(LBDTRSI ~ smoker + RIDAGEYR + RIAGENDR + RIDRETH3, data = nhanes)
round(summary(fit.ex5.10)$coef, 4)
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  1.0977     0.1587  6.9173   0.0000
## smokerPast                   0.1196     0.0907  1.3187   0.1880
## smokerCurrent                0.3820     0.1068  3.5760   0.0004
## RIDAGEYR                     0.0029     0.0022  1.3004   0.1942
## RIAGENDRFemale              -0.1359     0.0778 -1.7461   0.0815
## RIDRETH3Other Hispanic       0.0030     0.1676  0.0179   0.9857
## RIDRETH3Non-Hispanic White  -0.0343     0.1209 -0.2838   0.7767
## RIDRETH3Non-Hispanic Black  -0.2585     0.1792 -1.4424   0.1499
## RIDRETH3Non-Hispanic Asian  -0.0034     0.2027 -0.0166   0.9867
## RIDRETH3Other/Multi         -0.0222     0.2013 -0.1103   0.9122
summary(fit.ex5.10)$coef[c("smokerPast",
                           "smokerCurrent"), "Pr(>|t|)"]
##    smokerPast smokerCurrent 
##     0.1879633     0.0003881
# To get Current vs. Past, relevel smoker
tmpdat <- nhanes %>%
  mutate(smoker = relevel(smoker, ref = "Past"))
fit.ex5.10b <- lm(LBDTRSI ~ smoker + RIDAGEYR + RIAGENDR + RIDRETH3, data = tmpdat)
round(summary(fit.ex5.10b)$coef, 4)
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  1.2173     0.1682  7.2365   0.0000
## smokerNever                 -0.1196     0.0907 -1.3187   0.1880
## smokerCurrent                0.2624     0.1188  2.2088   0.0277
## RIDAGEYR                     0.0029     0.0022  1.3004   0.1942
## RIAGENDRFemale              -0.1359     0.0778 -1.7461   0.0815
## RIDRETH3Other Hispanic       0.0030     0.1676  0.0179   0.9857
## RIDRETH3Non-Hispanic White  -0.0343     0.1209 -0.2838   0.7767
## RIDRETH3Non-Hispanic Black  -0.2585     0.1792 -1.4424   0.1499
## RIDRETH3Non-Hispanic Asian  -0.0034     0.2027 -0.0166   0.9867
## RIDRETH3Other/Multi         -0.0222     0.2013 -0.1103   0.9122
summary(fit.ex5.10b)$coef["smokerCurrent", "Pr(>|t|)"]
## [1] 0.02771

Pulling out the 3 p-values of interest (comparing the mean outcome between levels of smoking status), we get the following adjustment for multiple comparisons.

# Multiple comparisons adjustment
p.unadj <- c(0.187963329, 0.000388096, 0.02770697)
p.hommel <- p.adjust(p.unadj, method = "hommel")
DF   <- data.frame(Unadjusted = round(p.unadj,  3),
                   Hommel     = round(p.hommel, 3))
rownames(DF) <- c("Past vs. Never",
                  "Current vs. Never",
                  "Current vs. Past")
DF
##                   Unadjusted Hommel
## Past vs. Never         0.188  0.188
## Current vs. Never      0.000  0.001
## Current vs. Past       0.028  0.055

Before adjusting for multiple comparisons, two comparisons (Current vs. Never, Current vs. Past) appear to be statistically significant. However, after adjusting for multiple comparisons, only Current vs. Never remains significant (adjusted p < .05).
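
As an alternative to releveling and refitting, all three pairwise comparisons can be obtained directly from the original fit. Below is a minimal sketch, assuming the emmeans package is installed and that its adjust argument passes "hommel" through to p.adjust(); the adjusted p-values should agree with those computed above.

# All pairwise comparisons of smoker levels from the single original fit,
# with a Hommel adjustment (assumes the emmeans package is installed)
library(emmeans)
emmeans(fit.ex5.10, pairwise ~ smoker, adjust = "hommel")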

5.23.1 Primary vs. secondary tests

When you have many tests or many comparisons, you lose power when using a multiple testing adjustment. That is, if there really is an association, you have less of a chance of finding it since with more tests you have a stricter criterion to meet to conclude statistical significance. However, if some tests are less important than others you can, when designing a study, split them up into two groups: primary tests (confirmatory) and secondary tests (exploratory, or “hypothesis generating”). Then, adjust for multiple testing over just the primary tests.
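
For example (with hypothetical p-values, purely to illustrate the mechanics), the adjustment is applied only over the pre-specified primary tests, while the secondary tests are reported unadjusted and labeled as exploratory.

# Hypothetical p-values, for illustration only
p.primary   <- c(0.004, 0.030)          # pre-specified primary (confirmatory) tests
p.secondary <- c(0.020, 0.150, 0.600)   # secondary (exploratory) tests

# Adjust for multiple testing over the primary tests only
p.adjust(p.primary, method = "hommel")

# Secondary tests are reported unadjusted, as hypothesis generating only
p.secondary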

This step in research design requires careful thought – there is a tradeoff between maximizing the chance of confirming certain hypotheses of primary interest (by limiting the number of primary tests) and maximizing the number of hypotheses that you have the opportunity to confirm (by increasing the number of primary tests). For tests that are in the “secondary” group, if any have a low p-value, you cannot claim confirmation, only that they are interesting results that warrant further research.

NOTE: Decide which tests are primary before looking at the results. You cannot look at the p-values and then choose which tests are primary in order to confirm as many as possible; your choice of primary tests or comparisons must be made a priori, before you have seen the results.

References

Blakesley, Richard E., Sati Mazumdar, Mary Amanda Dew, Patricia R. Houck, Gong Tang, Charles F. Reynolds, and Meryl A. Butters. 2009. “Comparisons of Methods for Multiple Hypothesis Testing in Neuropsychological Research.” Neuropsychology 23 (2): 255–64. https://doi.org/10.1037/a0012850.
Greenland, Sander, and Albert Hofman. 2019. “Multiple Comparisons Controversies Are about Context and Costs, Not Frequentism Versus Bayesianism.” European Journal of Epidemiology 34 (9): 801–8. https://doi.org/10.1007/s10654-019-00552-z.
Hommel, G. 1988. “A Stagewise Rejective Multiple Test Procedure Based on a Modified Bonferroni Test.” Biometrika 75 (2): 383–86. https://doi.org/10.1093/biomet/75.2.383.
Rothman, Kenneth J. 1990. “No Adjustments Are Needed for Multiple Comparisons.” Epidemiology 1 (1): 43–46. http://www.ncbi.nlm.nih.gov/pubmed/2081237.
Rothman, Kenneth J., Sander Greenland, and Timothy L. Lash. 2008. Modern Epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins.
Sarkar, Sanat K. 1998. “Some Probability Inequalities for Ordered MTP2 Random Variables: A Proof of the Simes Conjecture.” The Annals of Statistics 26 (2): 494–504. https://doi.org/10.1214/aos/1028144846.
Sarkar, Sanat K., and Chung-Kuei Chang. 1997. “The Simes Method for Multiple Hypothesis Testing with Positively Dependent Test Statistics.” Journal of the American Statistical Association 92 (440): 1601–8. https://doi.org/10.1080/01621459.1997.10473682.
Westfall, Peter H., and Stanley S. Young. 1993. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. New York, NY: John Wiley & Sons, Inc.