Chapter 10 Subgroup analysis and multiple outcomes

Up to now we have focused on the comparison of two or more groups of patients, usually intervention vs. control. It is natural to investigate whether a treatment effect exists only in a certain subgroup of patients. Multiplicity issues arise if there is more than one subgroup-defining variable.

Multiplicity issues also arise if more than one outcome is considered in a randomized trial. Here we distinguish methods that adjust the outcome-specific \(p\)-values from multivariate methods, which combine the different outcomes in a single analysis.

10.1 Subgroup analysis

In a subgroup analysis, the question is whether the treatment effect is different for different types of patients. Possible subgroups are Male vs. Female, Children vs. Adults etc. Subgroups may also be disease-related, for example patients with/without diabetes or with/without a certain disease or disease status.

Problems that occur with this type of question are:

  • RCTs are usually powered to detect an overall difference between the two treatment groups. Analyses of subgroups will then have less power to detect a difference in the treatment effect between subgroups.
  • If subgroups are defined by many variables, the risk of a false positive (significant) finding increases.
  • The choice of subgroups is difficult. Subgroup analyses are more credible if they were prespecified before the data were collected rather than chosen post hoc.

10.1.1 Comparing subgroups

Example 10.1 The Neonatal Hypocalcemia trial is a placebo-controlled trial of vitamin D supplementation of expectant mothers for the prevention of neonatal hypocalcemia (J. N. S. Matthews 2006, Section 9.2). The primary endpoint is the serum calcium level of the babies at 1 week of age. Table 10.1 shows summary-level data for the serum calcium levels in the treatment and placebo groups, separately for breast-fed and bottle-fed babies, as the question is whether the treatment has a different effect in these two subgroups.

Table 10.1: Summary-level data for the Neonatal Hypocalcemia Trial (J. N. S. Matthews 2006).
Group       Type        n    Mean   Var     SE
Breast-fed  Supplement  64   2.445  0.0853  0.0365
Breast-fed  Placebo     102  2.408  0.0987  0.0311
Bottle-fed  Supplement  169  2.300  0.0752  0.0211
Bottle-fed  Placebo     285  2.195  0.1018  0.0189
## data
n <- c(64, 102, 169, 285)
mean <- c(2.445, 2.408, 2.300, 2.195)  
var <- c(0.0853, 0.0987, 0.0752, 0.1018)
se <- sqrt(var/n)

To estimate the treatment effects in these two subgroups, we use the function printWaldCI() from biostatUZH.

library(biostatUZH)

# Breast-fed babies:
theta.breast <- mean[1] - mean[2]
se.breast <- sqrt(se[1]^2 + se[2]^2)
printWaldCI(theta.breast, se.breast)
##      Effect 95% Confidence Interval P-value
## [1,] 0.037  from -0.057 to 0.131    0.44
# Bottle-fed babies:
theta.bottle <- mean[3] - mean[4]
se.bottle <- sqrt(se[3]^2 + se[4]^2)
printWaldCI(theta.bottle, se.bottle)
##      Effect 95% Confidence Interval P-value
## [1,] 0.105  from 0.049 to 0.161     0.0002
Figure 10.1: Treatment effects and \(p\)-values for each subgroup in the Neonatal Hypocalcemia Trial

The estimated treatment effects are shown in Figure 10.1 together with confidence intervals and \(p\)-values.

10.1.1.1 Test for interaction

The correct approach to investigate whether there is a subgroup effect is a test for interaction. Suppose we have two subgroups defined by males (M) and females (F). To investigate whether the treatment effect \(\theta\) differs between subgroups, we consider the subgroup difference \[ \Delta \hat {\theta} = \hat \theta^{\tiny{\mbox{M}}} - \hat \theta^{\tiny{\mbox{F}}} \] where \(\hat \theta^{\tiny{\mbox{M}}}\) and \(\hat \theta^{\tiny{\mbox{F}}}\) are the estimated treatment effects for males and females, respectively. Based on the standard error \[ \mbox{se}(\Delta \hat {\theta}) = \sqrt{\mbox{se}^2(\hat \theta^{\tiny{\mbox{M}}})+\mbox{se}^2(\hat \theta^{\tiny{\mbox{F}}})}, \] tests and confidence intervals can be constructed.

The test for interaction is known in meta-analysis as the \(Q\)-test for heterogeneity and can also be used for more than two groups. Alternatively, a regression model with an interaction term between treatment and subgroup could be used, if individual-level data are available.

Example 10.2 The test for interaction can be performed with the function printWaldCI() from biostatUZH.

theta.diff <- theta.breast - theta.bottle
se.diff <- sqrt(se.breast^2 + se.bottle^2)
printWaldCI(theta.diff, se.diff)
##      Effect 95% Confidence Interval P-value
## [1,] -0.068 from -0.177 to 0.041    0.22

In the Neonatal Hypocalcemia trial, there is no evidence for a different treatment effect in the two subgroups (\(p=0.22\)). Perhaps the study sample size and specifically the breast-fed group (\(n=166\)) was too small to provide such evidence.
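As a cross-check, the interaction test can be reproduced in base R without printWaldCI(). The sketch below recomputes the standard errors from Table 10.1 and also evaluates the \(Q\)-statistic with inverse-variance weights, which for two subgroups equals the squared \(z\)-statistic. The variable names are chosen for illustration only.

```r
## summary-level data from Table 10.1
n <- c(64, 102, 169, 285)
m <- c(2.445, 2.408, 2.300, 2.195)
v <- c(0.0853, 0.0987, 0.0752, 0.1018)
se <- sqrt(v / n)

## subgroup-specific treatment effects (breast-fed, bottle-fed)
theta <- c(m[1] - m[2], m[3] - m[4])
theta.se <- c(sqrt(se[1]^2 + se[2]^2), sqrt(se[3]^2 + se[4]^2))

## test for interaction based on the subgroup difference
z <- (theta[1] - theta[2]) / sqrt(sum(theta.se^2))
p.int <- 2 * pnorm(-abs(z))

## Q-test for heterogeneity with inverse-variance weights
w <- 1 / theta.se^2
theta.pooled <- sum(w * theta) / sum(w)
Q <- sum(w * (theta - theta.pooled)^2)
p.Q <- pchisq(Q, df = length(theta) - 1, lower.tail = FALSE)

round(c(z = z, p.int = p.int, Q = Q, p.Q = p.Q), 3)
```

Both approaches give the same \(p\)-value, because with two subgroups the \(Q\)-statistic has one degree of freedom.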

It is tempting to consider the treatment effect separately in each subgroup and to assess the existence of a treatment interaction based on an informal comparison of the \(p\)-values in each subgroup. As shown in Figure 10.1, for breast-fed and bottle-fed babies we obtain \(p=0.44\) and \(p=0.0002\), respectively, so one may argue that the two \(p\)-values are very different and conclude that there is a subgroup difference. However, this reasoning is flawed, as the following example shows. Suppose the effect estimates in the two groups are identical but the sample sizes are different, such that one of the subgroups has a significant treatment effect whereas the treatment effect is non-significant in the other group. The difference in the \(p\)-values is then only a result of different sample sizes, without any difference in effect sizes. Indeed, with equal effect sizes in the two groups, the test for interaction will return a \(p\)-value of 1.0, so no evidence for a subgroup difference whatsoever. This example clearly illustrates that we need to compare effect sizes and not \(p\)-values to establish evidence for a difference between subgroups (J. N. Matthews and Altman 1996).
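The argument can be made concrete with a small numerical sketch. The effect estimate and the two standard errors below are hypothetical values, chosen only to mimic a large and a small subgroup.

```r
## hypothetical example: identical effect estimates, different precision
theta.hat <- c(large = 0.10, small = 0.10)
theta.se  <- c(large = 0.03, small = 0.08)

## subgroup-specific two-sided p-values differ markedly
p.sub <- 2 * pnorm(-abs(theta.hat / theta.se))

## the test for interaction compares the effect sizes
delta <- unname(theta.hat["large"] - theta.hat["small"])
delta.se <- sqrt(sum(theta.se^2))
p.int <- 2 * pnorm(-abs(delta / delta.se))

round(p.sub, 4)
p.int
```

The subgroup-specific \(p\)-values fall on opposite sides of the 5% threshold, yet the interaction test returns a \(p\)-value of 1.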

The test for interaction described above assumes normality of the effect estimates \(\hat \theta^{\tiny{\mbox{M}}}\) and \(\hat \theta^{\tiny{\mbox{F}}}\). If we have binary outcomes and want to compare relative risks or odds ratios, the test for interaction should be performed on the log scale, because the distributions of log ratios tend to be closer to normal (Douglas G. Altman and Bland 2003).
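A minimal sketch of the interaction test on the log scale is given below. The relative risks and the standard errors of their logarithms are hypothetical and serve only to illustrate the calculation. The result is reported as a ratio of relative risks.

```r
## hypothetical relative risks in two subgroups and standard errors of log(RR)
rr <- c(M = 0.60, F = 0.90)
se.log.rr <- c(M = 0.15, F = 0.20)

## subgroup difference on the log scale
delta <- unname(log(rr["M"]) - log(rr["F"]))
delta.se <- sqrt(sum(se.log.rr^2))

## Wald test and 95% confidence interval, back-transformed to a ratio of RRs
z <- delta / delta.se
p.int <- 2 * pnorm(-abs(z))
ratio <- exp(delta)
ci <- exp(delta + c(-1, 1) * qnorm(0.975) * delta.se)

round(c(ratio = ratio, lower = ci[1], upper = ci[2], p = p.int), 3)
```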

10.1.2 Selecting subgroups

The risk of false positive findings increases with the number of subgroup comparisons. For example, if we compare treatment effects in 20 different subgroups, then we would expect on average one of the analyses (\(20 \times 0.05 = 1\)) to yield a significant result (at \(\alpha=5\%\)), even if the treatment effect is the same in all subgroups (\(H_0\)). This result follows from the (approximately) uniform distribution of a (two-sided) \(p\)-value under \(H_0\) and holds even if the \(p\)-values are dependent.

If the subgroups have been selected before the data were collected based on biological or clinical reasoning, then positive (significant) findings are more trustworthy than if subgroups have been defined post hoc.

10.1.3 Qualitative and quantitative interaction

The test for interaction tests the null hypothesis \(H_0\): \(\theta_1 = \theta_2 = \ldots = \theta_I\) of equal treatment effects \(\theta_i\) in \(i=1,\ldots,I\) subgroups. This is known as the \(Q\)-test in meta-analysis. A useful distinction between quantitative and qualitative interactions (subgroup differences) has been made.

Definition 10.1

  1. For a quantitative interaction, the size of the treatment effect \(\theta_i\) varies between subgroups \(i=1,\ldots,I\), but is always in the same direction.
  2. For a qualitative interaction, the direction of the treatment effect also varies between subgroups.

It has been argued that quantitative interactions are not surprising, whereas qualitative interactions are of greater clinical importance.

10.1.3.1 The method of Gail and Simon

The Gail-Simon test is a modified test for interaction with null hypothesis that there is no qualitative interaction (Gail and Simon 1985), \[ H_0: \theta_i \geq 0 \mbox{ for all $i$ or } \theta_i \leq 0 \mbox{ for all $i=1,\ldots,I$.} \]

This test for qualitative interaction is based on the test statistic

\[\begin{equation*} T = \min\{Q^+,Q^{-}\}, \end{equation*}\]

where

\[\begin{eqnarray*} Q^+ = \sum_{i: \hat \theta_i \geq 0} T_i^2 & \mbox{ and } & Q^- = \sum_{i: \hat \theta_i < 0} T_i^2 \end{eqnarray*}\]

with \(T_i=\hat \theta_i/\mbox{se}(\hat \theta_i)\) being the test statistics in each subgroup \(i\). The reference distribution for \(T\) is a mixture of \(I-1\) \(\chi^2\)-distributions, so we use statistical software to calculate a \(p\)-value, as implemented in biostatUZH::gailSimon().

An alternative method is based on pairwise comparisons of simultaneous confidence intervals (Pan and Wolfe 1997).

Example 10.3 In the National Surgical Adjuvant Breast and Bowel Project Trial (J. N. S. Matthews 2006, Section 9.2), the outcome was the proportion with disease-free survival at 3 years, comparing chemotherapy only (PF) versus chemotherapy plus tamoxifen (PFT) for the treatment of breast cancer. The proportion difference \(\theta\) is positive if disease-free survival is more likely under PF. Four subgroups based on the progesterone receptor status (PR) and age are considered, see Table 10.3 and Figure 10.2. There appears to be an advantage to PF in one of the four subgroups, while PFT seems preferable in the other three subgroups.

Table 10.3: Subgroups by age and progesterone receptor status (PR) in the National Surgical Adjuvant Breast and Bowel Project Trial (J. N. S. Matthews 2006).
                                                   PR < 10 fmol            PR >= 10 fmol
Metric                                             Age < 50   Age >= 50    Age < 50   Age >= 50
Difference in proportions \(\widehat{\theta}\)     0.163      -0.114       -0.047     -0.151
\(\mbox{se}(\widehat{\theta})\)                    0.0788     0.0689       0.0614     0.0547
\(\widehat{\theta}/\mbox{se}(\widehat{\theta})\)   2.07       -1.65        -0.77      -2.76
\(p\)-value                                        0.04       0.10         0.44       0.0058

Figure 10.2: Treatment effects with confidence intervals and (two-sided) \(p\)-values for each subgroup in the National Surgical Adjuvant Breast and Bowel Project Trial (J. N. S. Matthews 2006).

The standard test for interaction gives \(p=0.0096\), as application of the function metagen() from the meta package shows. For the Gail-Simon test, we obtain \(T=2.07^2=4.28\) and \(p=0.09\). So, there is moderate to strong evidence for an interaction (\(p=0.0096\)), but only weak evidence for a qualitative interaction (\(p=0.09\)).

## data
rd <- c(0.163, -0.114, -0.047, -0.151)
rd.se <- c(0.0788, 0.0689, 0.0614, 0.0547)

## Q-test for interaction
library(meta)
mymeta <- metagen(TE=rd, seTE=rd.se)
print(formatPval(mymeta$pval.Q))
## [1] "0.0096"
## Gail and Simon test
pGS <- gailSimon(thetahat = rd, se = rd.se)
print(formatPval(pGS))
## [1] "0.09" "4.28"
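For illustration, the Gail-Simon \(p\)-value can also be computed by hand. The sketch below assumes the reference distribution given by Gail and Simon (1985), in which the \(\chi^2\)-distributions with \(h=1,\ldots,I-1\) degrees of freedom are weighted by binomial probabilities with size \(I-1\) and probability 1/2. It is not the code of gailSimon(), only a cross-check under that assumption.

```r
## data from Table 10.3
rd <- c(0.163, -0.114, -0.047, -0.151)
rd.se <- c(0.0788, 0.0689, 0.0614, 0.0547)
I <- length(rd)

## test statistic T = min(Q+, Q-)
Ti <- rd / rd.se
Q.plus <- sum(Ti[rd >= 0]^2)
Q.minus <- sum(Ti[rd < 0]^2)
T.gs <- min(Q.plus, Q.minus)

## p-value from the binomial mixture of chi-squared distributions
h <- seq_len(I - 1)
p.gs <- sum(dbinom(h, size = I - 1, prob = 0.5) *
            pchisq(T.gs, df = h, lower.tail = FALSE))

round(c(T = T.gs, p = p.gs), 2)
```

Rounded to two decimals, this agrees with the values \(T=4.28\) and \(p=0.09\) reported above.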

10.2 Methods for multiple outcomes

Many clinical trials measure numerous outcome variables (primary and secondary) relevant to the condition studied. This raises multiplicity problems similar to those encountered when several subgroups are compared.

In what follows we assume that no distinction between primary and secondary outcomes has been made. In practice the problem of multiplicity is often avoided by selecting just one primary outcome. If there is more than one primary outcome, multiplicity adjustments need to be applied. Multiplicity adjustments can also be applied to secondary outcomes, but this is not always required (ICH Expert Working Group 1999).

Example 10.4 In addition to the primary endpoint (Epworth scale), the Didgeridoo Study (see Example 7.1) also had three secondary endpoints:

  • Pittsburgh quality of sleep index (range 0-21)
  • Partner-rating of sleep disturbance (range 0-10)
  • Apnoea-hypopnoea index (range 0-36)

Figure 10.3 shows the baseline and follow-up (4 months) measurements for the primary and secondary outcomes, compared between the treatment and control group. Change score analyses (using two-sample \(t\)-tests) have been performed for each of these outcomes, see Table 10.5 or also the published results in Figure ??.

Puhan et al. (2005) selected the Epworth scale as primary outcome, but in what follows we assume that no distinction between primary and secondary outcomes has been made.

Figure 10.3: Baseline and follow-up measurements for primary and secondary outcomes in the Didgeridoo Study.

Table 10.5: Treatment effects for primary and secondary endpoints in the Didgeridoo Study using change score analyses.
Endpoint          Estimate  SE   t-value  p-value
Epworth scale     -3.0      1.3  -2.3     0.03
Pittsburgh index  -0.7      0.7  -1.0     0.31
Partner rating    -2.8      0.9  -3.3     0.0035
Apnoea index      -6.2      3.0  -2.1     0.05

The next subsections explain two different approaches to handling multiplicity problems: \(p\)-value adjustments and multivariate methods.

10.2.1 Adjusting for multiplicity

The problem with multiple outcomes is that the chance that at least one of them will provide a significant result is larger than the nominal \(\alpha\) level. Several adjustments for this are described in the following.

Definition 10.2 The family-wise error rate (FWER) is the probability of at least one false significant result. If \(k>1\) outcomes are each tested at level \(\alpha\), the FWER will be larger than the Type I error rate \(\alpha\). Under independence, and if all \(k\) null hypotheses are true, we have

\[\begin{equation*} \mbox{FWER} = 1-(1-\alpha)^k. \end{equation*}\]

For small \(k\), the FWER under independence is close to the Bonferroni bound \(k \cdot \alpha\) as can be seen in Figure 10.4.

Figure 10.4: Family-wise error rate (FWER) under independence of the outcomes for a type I error rate \(\alpha = 0.05\) (black line) compared to the Bonferroni bound (green dotted line) as functions of the number of outcomes \(k\).
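The two curves can be computed directly from the formula above. The following sketch uses \(\alpha = 0.05\) and \(k = 1, \ldots, 20\); the range of \(k\) is an arbitrary choice for illustration.

```r
alpha <- 0.05
k <- 1:20

## FWER under independence and Bonferroni bound (truncated at 1)
fwer <- 1 - (1 - alpha)^k
bonferroni <- pmin(1, k * alpha)

round(cbind(k, fwer, bonferroni)[c(1, 2, 5, 10, 20), ], 3)
```

For \(k=2\) the two quantities are nearly identical (0.0975 vs. 0.10), whereas for \(k=20\) the FWER under independence is about 0.64 and the Bonferroni bound has reached 1.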

10.2.1.1 Bonferroni method

The Bonferroni method compares the \(p\)-values \(p_1, \ldots, p_k\), obtained from the different outcomes, not with \(\alpha\), but with the threshold \(\alpha/k\). If at least one of the \(p\)-values is smaller than \(\alpha/k\), then the trial result is considered significant at the \(\alpha\) level. This procedure controls the FWER at level \(\alpha\). An equivalent procedure is to compare the adjusted \(p\)-values \(k \, p_i\) with the standard \(\alpha\) threshold.

Example 10.5 In the Didgeridoo Study, there is still evidence for a treatment effect after Bonferroni adjustments:

## unadjusted p-values (p holds the unrounded p-values from Table 10.5)
printPval(p)
## [1] 0.03   0.31   0.0035 0.05
## Bonferroni-adjusted p-values
printPval(p.adjust(p, method = "bonferroni"))
## [1] 0.13 1.00 0.01 0.19

However, the Bonferroni method is unsatisfactory as it takes only the smallest \(p\)-value into account. For example, with \(k=4\) and \(\alpha=0.05\):

  • \(p=\) 0.01, 0.7, 0.8, 0.9 is “significant”,
  • \(p=\) 0.03, 0.04, 0.05, 0.05 is “not significant”.

As a result, the Bonferroni method is very conservative and has low power.
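The two sets of \(p\)-values above can be checked with p.adjust() from base R, here with \(\alpha = 0.05\). The object names are illustrative.

```r
alpha <- 0.05
p.a <- c(0.01, 0.7, 0.8, 0.9)
p.b <- c(0.03, 0.04, 0.05, 0.05)

## Bonferroni-adjusted p-values
p.adjust(p.a, method = "bonferroni")
p.adjust(p.b, method = "bonferroni")

## only the smallest p-value decides on overall significance
sig.a <- min(p.adjust(p.a, method = "bonferroni")) <= alpha
sig.b <- min(p.adjust(p.b, method = "bonferroni")) <= alpha
c(sig.a = sig.a, sig.b = sig.b)
```

The first set is declared significant because of a single small \(p\)-value, while the second set is not, although all four \(p\)-values are at most 0.05.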

10.2.1.2 Holm method

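The Holm method is a step-down procedure based on all ordered \(p\)-values \(p_{(1)} \leq \ldots \leq p_{(k)}\). The outcome with the smallest \(p\)-value is significant if \(k \, p_{(1)} \leq \alpha\), as with the Bonferroni method. If so, the outcome with the second smallest \(p\)-value is significant if \((k-1) \, p_{(2)} \leq \alpha\), and so on, until a comparison fails for the first time. The outcome with the \(j\)-th smallest \(p\)-value is therefore significant if \[ (k-i+1) \, p_{(i)} \leq \alpha \mbox{ for all } i=1,\ldots,j. \]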

Example 10.6 In the Didgeridoo Study:

## Holm-adjusted p-values
printPval(p.adjust(p, method = "holm"))
## [1] 0.10 0.31 0.01 0.10
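To make the sequential logic explicit, the Holm adjustment can be written out by hand and compared with p.adjust(). The sketch uses the rounded \(p\)-values from Table 10.5, so the results differ slightly from the output above, which is presumably based on unrounded values.

```r
## rounded p-values from Table 10.5
p <- c(epworth = 0.03, pittsburgh = 0.31, partner = 0.0035, apnoea = 0.05)
k <- length(p)

## step-down factors k, k-1, ..., 1 applied to the ordered p-values
o <- order(p)
stepwise <- pmin(1, (k - seq_len(k) + 1) * p[o])

## enforce monotonicity and return to the original order
p.holm <- numeric(k)
p.holm[o] <- cummax(stepwise)
names(p.holm) <- names(p)

p.holm
p.adjust(p, method = "holm")
```

The cumulative maximum reflects the stopping rule: an outcome cannot be significant unless all outcomes with smaller \(p\)-values are significant as well.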

10.2.1.3 Simes’ method

Simes’ method (Simes 1986) is also based on the ordered \(p\)-values \(p_{(1)} \leq \ldots \leq p_{(k)}\). A significant result is obtained if \[ k \, p_{(1)} \leq \alpha, \mbox{ or } (k/2) \, p_{(2)} \leq \alpha , \ldots \mbox{ or } p_{(k)} \leq \alpha, \] that is, if \((k/j) \, p_{(j)} \leq \alpha\) for at least one \(j=1,\ldots,k\).

Variations of it are the Hochberg (1988) and Hommel (1988) methods.

Example 10.7 In the Didgeridoo Study:

## Simes-adjusted p-values and variants
printPval(p*(4/rank(p)))
## [1] 0.07 0.31 0.01 0.06
printPval(p.adjust(p, method = "hochberg"))
## [1] 0.09 0.31 0.01 0.09
printPval(p.adjust(p, method = "hommel"))
## [1] 0.07 0.31 0.01 0.09
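The gain over the Bonferroni method can be seen with the second set of \(p\)-values from the Bonferroni section (0.03, 0.04, 0.05, 0.05). The sketch below computes the smallest Simes-adjusted \(p\)-value; simes.p and bonf.p are illustrative names.

```r
alpha <- 0.05
p <- c(0.03, 0.04, 0.05, 0.05)
k <- length(p)

## Simes: compare (k / j) * p_(j) with alpha for j = 1, ..., k
simes.terms <- sort(p) * k / seq_len(k)
simes.p <- min(simes.terms)

## Bonferroni uses the smallest p-value only
bonf.p <- min(1, k * min(p))

round(c(simes = simes.p, bonferroni = bonf.p), 3)
```

Simes’ method gives 0.05 here, which is just significant at the 5% level, whereas the Bonferroni-adjusted value is 0.12.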

10.2.1.4 Comparison of methods

Figure 10.5: Multiplicative factors for multiplicity-adjustment of 15 \(p\)-values.

To obtain multiplicity-adjusted \(p\)-values, the ordinary \(p\)-values are multiplied by a factor, with subsequent truncation at 1 if necessary. The multiplicative factors for 15 \(p\)-values are compared in Figure 10.5 between the Bonferroni, Holm and Simes’ methods. Important differences between the three methods are the following:

  • The Bonferroni method controls FWER, but has low power.
  • The Holm method also controls FWER, but with a lower decrease of power than the classical Bonferroni method.
  • If the \(p\)-values are independent or positively correlated, then Simes’ method also controls FWER. For positively correlated \(p\)-values, Simes’ method is even closer to the nominal \(\alpha\) level than the Holm method.
  • Application of Holm and Simes requires knowledge of the \(p\)-value rank, whereas the Bonferroni method requires only knowledge of the number of \(p\)-values.
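The multiplicative factors applied to the ordered \(p\)-values \(p_{(j)}\) follow from the definitions above: \(k\) for Bonferroni, \(k-j+1\) for Holm and \(k/j\) for Simes. A short sketch for \(k = 15\), before any truncation at 1:

```r
k <- 15
j <- seq_len(k)   # rank of the p-value, j = 1 for the smallest

factors <- data.frame(
  rank       = j,
  bonferroni = rep(k, k),
  holm       = k - j + 1,
  simes      = k / j
)
print(factors, digits = 3)
```

All three factors coincide for the smallest \(p\)-value. For larger ranks the Simes factor is never larger than the Holm factor, which in turn is never larger than the Bonferroni factor.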

10.2.2 Multivariate methods

Multiplicity issues can also be avoided by considering the whole \(k\)-dimensional outcome vector simultaneously rather than separately.

Hotelling’s \(T\)-test

If all outcomes are continuous, then Hotelling’s \(T\)-test, a generalization of the \(t\)-test, could be used. However, this test is sensitive to deviations from the null hypothesis in any direction, so it typically lacks power.

Example 10.8 Hotelling’s \(T\)-test for the Didgeridoo Study:

library(Hotelling)
print(hot.test <- hotelling.test(ep + pi + pa + ap ~ group, 
                                 data=alphorn))
## Test stat:  16.226 
## Numerator df:  4 
## Denominator df:  18 
## P-value:  0.02848

10.2.2.1 Combining outcomes in a summary score

Under the assumption that all outcomes are continuous and are orientated in the same direction, the following procedure can be used to combine multiple outcomes into one summary score:

  1. Standardize endpoints separately to a common scale: \[ \mbox{$z$-score} = {(\mbox{Observation} - \mbox{M})}/{\mbox{SD}} \] Here \(\mbox{M}\) and \({\mbox{SD}}\) are mean and standard deviation of the respective endpoint (ignoring group membership).
  2. Compute a summary score as the average of the different \(z\)-scores for each individual.
  3. Compare summary scores between treatment groups with a two-sample \(t\)-test.

Example 10.9 Summary scores in the Didgeridoo Study are shown in Figure 10.6. Below we repeat the analysis. There are slight differences in the results, presumably due to a different handling of missing values (there was one patient with missing partner rating information and another one with missing Pittsburgh index measurement).

## z-scores
ep.z <- scale(alphorn$ep)
pi.z <- scale(alphorn$pi)
pa.z <- scale(alphorn$pa)
ap.z <- scale(alphorn$ap)

## summary scores
z.summary <- (ep.z + pi.z + pa.z + ap.z)/4

## t-test
print(summary.test <- t.test(z.summary ~ group, data = alphorn, var.equal = TRUE))
## 
##  Two Sample t-test
## 
## data:  z.summary by group
## t = -3.228, df = 21, p-value = 0.004033
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1.3205570 -0.2857226
## sample estimates:
## mean in group 0 mean in group 1 
##      -0.3829617       0.4201781
print(DifferenceInMeans <- mean(summary.test$conf.int))  
## [1] -0.8031398

Figure 10.6: Summary score analysis combining the \(z\)-scores of all outcomes. Published figure from the Didgeridoo Study.

10.2.2.2 O’Brien’s rank-based method

O’Brien’s rank-based method (O’Brien 1984) is useful if some or all outcome variables are not normally distributed, for example non-normal continuous outcomes or categorical outcomes with a sufficiently large number of categories. The procedure starts by replacing each observation with its within-outcome rank. The within-patient sums of ranks are then compared between the two treatment groups with a two-sample \(t\)-test.

Example 10.10 O’Brien’s rank-based method in the Didgeridoo Study:

## ranks and rank sums
ep.r <- rank(alphorn$ep, na.last="keep")
pi.r <- rank(alphorn$pi, na.last="keep")
pa.r <- rank(alphorn$pa, na.last="keep")
ap.r <- rank(alphorn$ap, na.last="keep")
rank.sum <- ep.r + pi.r + pa.r + ap.r

## t-test for rank sums
(obrian.test <- t.test(rank.sum ~ group, data = alphorn, var.equal=TRUE))
## 
##  Two Sample t-test
## 
## data:  rank.sum by group
## t = -3.1392, df = 21, p-value = 0.004954
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -36.791720  -7.469818
## sample estimates:
## mean in group 0 mean in group 1 
##        40.26923        62.40000
(DifferenceInMeans <- mean(obrian.test$conf.int))  
## [1] -22.13077

Figure 10.7 shows the within-outcome ranks for the primary outcome (Epworth scale) and the within-patient sums of ranks for all outcomes. We can see a difference in the distributions, with a tendency towards smaller ranks and smaller rank sums in the Didgeridoo group. The \(t\)-test above identifies evidence for a difference in the mean rank sums (\(p=0.005\)), very similar to the \(p\)-value from the summary score analysis (\(p=0.004\)).

Figure 10.7: Ranks of the observations of the primary outcome in the Didgeridoo Study (left plot) and sum of the ranks of all outcomes (primary plus 3 secondary).

10.3 Additional references

The topics of this chapter are discussed in M. Bland (2015) (Ch. 9.10 Multiple Significance Tests) and J. N. S. Matthews (2006) (Ch. 9 Subgroups and Multiple Outcomes). Interactions are discussed in the Statistics Notes J. N. Matthews and Altman (1996) and Douglas G. Altman and Bland (2003).

More details on subgroup analyses in RCTs are given in Rothwell (2005). Studies where the methods from this chapter are used in practice are for example Pinkney et al. (2013), Bennewith et al. (2002) and Fine Olivarius et al. (2001).

References

Altman, Douglas G, and J Martin Bland. 2003. “Statistics Notes: Interaction revisited: the difference between two estimates.” BMJ 326 (7382): 219.
Bennewith, Olive, Nigel Stocks, David Gunnell, Tim J Peters, Mark O Evans, and Deborah J Sharp. 2002. “General practice based intervention to prevent repeat episodes of deliberate self harm: cluster randomised controlled trial.” BMJ 324 (7348).
Bland, Martin. 2015. An Introduction to Medical Statistics. Fourth. Oxford University Press.
Fine Olivarius, Niels de, Henning Beck-Nielsen, Anne Helms Andreasen, Mogens Hørder, and Poul A Pedersen. 2001. “Randomised controlled trial of structured personal care of type 2 diabetes mellitus.” BMJ 323 (7319).
Gail, M, and Richard Simon. 1985. “Testing for Qualitative Interactions Between Treatment Effects and Patient Subsets.” Biometrics 41 (2): 361–72.
Hochberg, Y. 1988. “A Sharper Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 75: 800–802.
Hommel, G. 1988. “A Stagewise Rejective Multiple Test Procedure Based on a Modified Bonferroni Test.” Biometrika 75: 383–86.
ICH Expert Working Group. 1999. “Statistical Principles for Clinical Trials.” Stat Med 18: 1905–42.
Matthews, John N. S. 2006. Introduction to Randomized Controlled Clinical Trials. Second. Chapman & Hall/CRC.
Matthews, John NS, and Douglas G Altman. 1996. “Statistics Notes: Interaction 2: compare effect sizes not P values.” BMJ 313 (7060): 808.
O’Brien, P. C. 1984. “Procedures for Comparing Samples with Multiple Endpoints.” Biometrics 40: 1079–87.
Pan, Guohua, and Douglas A Wolfe. 1997. “Test for Qualitative Interaction of Clinical Significance.” Statistics in Medicine 16 (14): 1645–52.
Pinkney, Thomas D, Melanie Calvert, David C Bartlett, Adrian Gheorghe, Val Redman, George Dowswell, William Hawkins, et al. 2013. “Impact of wound edge protection devices on surgical site infection after laparotomy: multicentre randomised controlled trial (ROSSINI Trial).” BMJ 347.
Puhan, Milo A, Alex Suarez, Christian Lo Cascio, Alfred Zahn, Markus Heitz, and Otto Braendli. 2005. “Didgeridoo playing as alternative treatment for obstructive sleep apnoea syndrome: randomised controlled trial.” BMJ 332: 1–5.
Rothwell, Peter M. 2005. “Subgroup Analysis in Randomised Controlled Trials: Importance, Indications, and Interpretation.” The Lancet 365 (9454): 176–86.
Simes, R. J. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73: 751–54.