Chapter 13 Some special designs

This chapter gives an introduction to some special randomized controlled trial designs. Cluster-randomized trials do not randomize on the individual (patient) level, but allocate whole groups of patients to treatments. Equivalence and non-inferiority trials differ from superiority trials in their objective: rather than showing that the intervention is better than control, they aim to show that two treatments are equivalent, or that one treatment is non-inferior to the other.

13.1 Cluster-randomized trials

Up to now, we have seen randomization of patients. Cluster-randomized trials allocate groups of patients en bloc to the same treatment. For example, Kinmonth et al. (1998) report an intervention study in primary care on whether additional training of nurses and general practitioners (GPs) in a general practice improves the care of patients with newly diagnosed type II diabetes mellitus. In this study, \(41\) practices were randomized to the status quo or to receive additional training for their staff.
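Randomizing clusters rather than patients can be sketched in a few lines of R. The 21/20 split and the seed below are hypothetical choices for illustration, not taken from Kinmonth et al. (1998):

```r
## sketch: allocate 41 practices (clusters) en bloc to two arms
set.seed(2024)                          # for reproducibility (arbitrary)
practices <- 1:41
intervention <- sample(practices, 21)   # assumed 21 vs 20 split
allocation <- ifelse(practices %in% intervention,
                     "Training", "Status quo")
table(allocation)
```

All patients of a practice then receive the treatment allocated to their practice.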

13.1.1 Analysis

Standard methods can no longer be used, as patient responses from the same cluster are dependent. The intraclass correlation coefficient, the correlation between outcomes within a cluster, needs to be taken into account. In the following, we discuss two methods of analysis:

  1. A simple approach is to construct a summary measure for each cluster and then analyze these summary values.
  2. To analyze data on patient level, a mixed model formulation, i.e. a regression model with cluster-specific random effects, is needed. An alternative approach is so-called generalized estimating equations (GEEs).

These methods will be applied to the following example.

Example 13.1 Oakeshott, Kerry, and Williams (1994), see also Kerry and Bland (1998a), report on the effect of guidelines for radiological referral on the referral practice of GPs: \(17\) practices in the intervention group received guidelines, while the \(17\) control practices were not sent anything. The outcome measure was the percentage of requested x-ray examinations that conformed to the guidelines. The data of the first 10 practices in the intervention group are shown below:

head(CRT, 10)
##           Group Practice Total Conforming Percentage
## 1  Intervention        1    20         20  100.00000
## 2  Intervention        2     7          7  100.00000
## 3  Intervention        3    16         15   93.75000
## 4  Intervention        4    31         28   90.32258
## 5  Intervention        5    20         18   90.00000
## 6  Intervention        6    24         21   87.50000
## 7  Intervention        7     7          6   85.71429
## 8  Intervention        8     6          5   83.33333
## 9  Intervention        9    30         25   83.33333
## 10 Intervention       10    66         53   80.30303

13.1.1.1 Summary measure analysis

Summary measure analysis is an analysis on practice level. It avoids the issue of correlated outcomes within a cluster by directly analyzing the percentage of conforming x-ray requests in each practice. For simplicity, we follow Kerry and Bland (1998a) and treat the percentages as a continuous outcome.

Example 13.2 Results in the study about radiological referral guidelines (Example 13.1):

(mytTest <- t.test(Percentage ~ Group, var.equal = TRUE, data = CRT))
## 
##  Two Sample t-test
## 
## data:  Percentage by Group
## t = 1.8445, df = 32, p-value = 0.07438
## alternative hypothesis: true difference in means between group Intervention and group Control is not equal to 0
## 95 percent confidence interval:
##  -0.8310135 16.7625640
## sample estimates:
## mean in group Intervention      mean in group Control 
##                   81.52687                   73.56109
## difference in means equals the midpoint of the confidence interval
(DifferenceInMeans <- mean(mytTest$conf.int))
## [1] 7.965775

The same results can be obtained with regression analysis. We can additionally use the number of referrals as weights; this weighted analysis is the preferred analysis on practice level, as it takes into account the different sample sizes in the different practices.

result <- lm(Percentage ~ Group, data = CRT) 
result.w <- lm(Percentage ~ Group, data = CRT, weights = Total)
knitr::kable(tableRegression(result, latex = FALSE))
Coefficient 95%-confidence interval \(p\)-value
Intercept 73.56 from 67.34 to 79.78 < 0.0001
GroupIntervention 7.97 from -0.83 to 16.76 0.074
knitr::kable(tableRegression(result.w, latex = FALSE))
Coefficient 95%-confidence interval \(p\)-value
Intercept 72.51 from 68.30 to 76.72 < 0.0001
GroupIntervention 6.98 from 0.14 to 13.82 0.046

13.1.1.2 Mixed model

The recommended analysis on patient level for a binary outcome is logistic regression with random effects for practices. The random effects account for possible correlation between patients treated in the same practice. We could in principle analyse the data in long format, with one row per patient and a binary outcome indicating whether the x-ray request from that particular patient conformed to the guidelines. However, as in standard logistic regression, we can aggregate the data to binomial counts in each practice \(\times\) treatment group combination, while keeping the random effects on practice level.

Example 13.3 Results in the study about radiological referral guidelines (Example 13.1):

library(lme4)
CRT$Outcome <- cbind(CRT$Conforming, CRT$Total-CRT$Conforming)
result.glmm <- glmer(Outcome ~ Group + (1|Practice), 
                     family = binomial, data = CRT)
(summary(result.glmm)$varcor)
##  Groups   Name        Std.Dev.
##  Practice (Intercept) 0.30862
(ttable <- coef(summary(result.glmm)))
##                    Estimate Std. Error  z value
## (Intercept)       1.0191021  0.1259360 8.092219
## GroupIntervention 0.4089998  0.1964084 2.082395
##                       Pr(>|z|)
## (Intercept)       5.858742e-16
## GroupIntervention 3.730643e-02
## treatment effect on patient level (odds ratio)
printWaldCI(ttable[2,1], ttable[2,2], FUN = exp)
##      Effect 95% Confidence Interval P-value
## [1,] 1.505  from 1.024 to 2.212     0.037

Example 13.4 Just for illustration, we show an analysis of Example 13.1 on patient level, where we ignore the clustering and aggregate the data to a \(2\times 2\) table.

## aggregate data to 2x2 table ignoring cluster membership
Sums <- lapply(split(CRT, CRT$Group), 
               function(x) c(Sums = colSums(x[,3:4]))
               )
Total <- c(Sums$Intervention["Sums.Total"], Sums$Control["Sums.Total"])
Conf <- c(Sums$Intervention["Sums.Conforming"], 
          Sums$Control["Sums.Conforming"])
notConf <- Total-Conf
Group <- factor(c("Intervention", "Control"), 
                levels = c("Intervention", "Control")
                )
tab <- xtabs(cbind(Conf, notConf) ~ Group)
print(tab)
##               
## Group          Conf notConf
##   Intervention  341      88
##   Control       509     193
library(Epi)  # provides twoby2()
twoby2(tab)
## 2 by 2 table analysis: 
## ------------------------------------------------------ 
## Outcome   : Conf 
## Comparing : Intervention vs. Control 
## 
##              Conf notConf    P(Conf) 95% conf. interval
## Intervention  341      88     0.7949    0.7540   0.8305
## Control       509     193     0.7251    0.6908   0.7568
## 
##                                    95% conf. interval
##              Relative Risk: 1.0963    1.0260   1.1713
##          Sample Odds Ratio: 1.4693    1.1027   1.9577
## Conditional MLE Odds Ratio: 1.4688    1.0935   1.9828
##     Probability difference: 0.0698    0.0182   0.1191
## 
##              Exact P-value: 0.0087 
##         Asymptotic P-value: 0.0086 
## ------------------------------------------------------

This analysis is wrong, as it ignores the cluster structure and acts as if patients had been randomly assigned to treatment groups. This results in confidence intervals which are too narrow and \(P\)-values which are too small.
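A small simulation illustrates this point. Data are generated under the null hypothesis of no treatment effect, but with clustered outcomes; the values below (10 clusters of 20 patients per arm, intraclass correlation \(\rho = 0.3\)) are arbitrary choices for illustration. The naive \(t\)-test on patient level rejects far more often than the nominal 5%:

```r
## simulation: type I error of the naive t-test under clustering
set.seed(44)
n.sim <- 1000; n.clusters <- 10; n.per <- 20; rho <- 0.3
sigma.b <- sqrt(rho); sigma.w <- sqrt(1 - rho)   # total variance 1
reject <- logical(n.sim)
for (i in seq_len(n.sim)) {
  xi <- rnorm(2 * n.clusters, sd = sigma.b)           # cluster effects
  y <- rep(xi, each = n.per) +
    rnorm(2 * n.clusters * n.per, sd = sigma.w)       # patient outcomes
  group <- rep(c("Control", "Intervention"), each = n.clusters * n.per)
  reject[i] <- t.test(y ~ group)$p.value < 0.05       # ignores clusters
}
mean(reject)   # empirical type I error rate, well above 0.05
```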

The mixed model formulation for a continuous outcome \(X_{ij}\) from patient \(j\) in cluster \(i\) is

Table 13.1: Models for treatments A and B in the study.
Treatment Model
A \(X_{ij} = \alpha + \xi_i + \epsilon_{ij}\)
B \(X_{ij} = \alpha + \Delta + \xi_i + \epsilon_{ij}\)

where

  • \(\Delta\) is the treatment effect of B vs. A,
  • \(\xi_i\) is a cluster-specific random effect with variance \(\sigma_b^2\),
  • the errors \(\epsilon_{ij}\) have variance \(\sigma_w^2\).

The total variance is then \(\mathop{\mathrm{Var}}(X_{ij}) = \sigma^2 = \sigma_b^2 + \sigma_w^2\) and the intraclass/intracluster correlation is \(\rho= \sigma_b^2/\sigma^2\). The analysis can be performed with standard mixed model software, for example the lme4 package used above.
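For a continuous outcome, the model can be fitted with lmer() from the lme4 package, the continuous-outcome analogue of the glmer() call in Example 13.3. A minimal sketch on simulated data (all variable names and parameter values below are hypothetical):

```r
library(lme4)
set.seed(1)
n.clusters <- 20; n.per <- 10                   # clusters per arm (assumed)
sigma.b <- 1; sigma.w <- 2; Delta <- 1          # assumed parameters
cluster <- factor(rep(1:(2 * n.clusters), each = n.per))
treat <- rep(c("A", "B"), each = n.clusters * n.per)
xi <- rnorm(2 * n.clusters, sd = sigma.b)       # cluster random effects
x <- (treat == "B") * Delta + rep(xi, each = n.per) +
  rnorm(2 * n.clusters * n.per, sd = sigma.w)   # errors
fit <- lmer(x ~ treat + (1 | cluster))
fixef(fit)["treatB"]   # estimate of the treatment effect Delta
VarCorr(fit)           # estimates of sigma_b and sigma_w
```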

13.1.2 Sample size calculation

13.1.2.1 Number of patients

In addition to a standard sample size calculation, we need the average cluster size \(\bar n_c\) and the intraclass correlation \(\rho\). The total number of patients receiving each treatment should then be

\[\begin{equation*} n = \underbrace{\frac{ 2 \sigma^2 (u+v)^2}{ \Delta^2}}_{\scriptsize \mbox{standard RCT sample size}} \times \underbrace{(1+ \rho(\bar n_c-1))}_{\scriptsize \mbox{design effect $D_{\text{eff}}$}}. \end{equation*}\]

For example, suppose \(\bar n_c=7\) and \(\rho=0.5\); then \(D_{\text{eff}}=4\), so four times as many patients are needed as for a standard RCT randomized on patient level.
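The design effect can be computed in one line (the second call uses an arbitrarily chosen scenario to show that even a small intraclass correlation matters for large clusters):

```r
## design effect: Deff = 1 + rho * (n.bar - 1)
deff <- function(n.bar, rho) 1 + rho * (n.bar - 1)
deff(n.bar = 7, rho = 0.5)     # = 4, the example in the text
deff(n.bar = 20, rho = 0.05)   # = 1.95, almost a doubling
```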

13.1.2.2 Number of clusters

If the number of patients

\[n = \frac{ 2 \sigma^2 (u+v)^2}{ \Delta^2} \cdot D_{\text{eff}}\]

has been calculated, then the number of clusters \(N_c\) receiving each treatment is \(n/\bar n_c\). Alternatively, we could directly calculate

\[\begin{equation*} N_c = \frac{ 2 (\sigma_b^2 +\sigma_w^2/\bar n_c) (u+v)^2}{ \Delta^2}. \end{equation*}\]
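Both routes give the same number of clusters, since \(\sigma^2 (1+\rho(\bar n_c-1))/\bar n_c = \sigma_b^2 + \sigma_w^2/\bar n_c\). A quick numerical check with hypothetical variance components:

```r
sigma.b2 <- 0.5; sigma.w2 <- 1.5; n.bar <- 7; Delta <- 1  # assumed values
sigma2 <- sigma.b2 + sigma.w2
rho <- sigma.b2 / sigma2
uv <- qnorm(0.975) + qnorm(0.8)     # u + v for alpha = 5%, power 80%
n  <- 2 * sigma2 * uv^2 / Delta^2 * (1 + rho * (n.bar - 1))  # patients
Nc <- 2 * (sigma.b2 + sigma.w2 / n.bar) * uv^2 / Delta^2     # clusters
all.equal(n / n.bar, Nc)            # TRUE: both routes agree
```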

13.2 Equivalence and non-inferiority trials

Up to now, we discussed superiority studies. The aim of equivalence trials is not to detect a difference, but to establish equivalence of two treatments. An equivalence trial requires pre-specification of an interval of equivalence \(I=(-\delta, \delta)\) for the treatment difference. The equivalence margin \(\delta\) takes over the role of the minimal clinically relevant difference \(\Delta\) in standard superiority trials. If \(\delta\) is only specified in one direction, we have a non-inferiority trial with non-inferiority interval \(I=(-\infty, \delta)\) or \(I=(-\delta, \infty)\), depending on the context. Non-inferiority trials are based on one-sided hypothesis tests to check whether one group is almost as good as (not much worse than) the other group.

The following examples illustrate why non-inferiority and equivalence studies can be more suitable than superiority studies in some situations.

  1. The intervention to be assessed is similar to standard of care w.r.t. the primary endpoint, but may have advantages in secondary endpoints: fewer side effects, lower production costs, a more convenient formulation (tablet instead of infusion), fewer doses, or better quality of life.
  2. Bio-equivalence of drugs with identical active ingredient (different formulations of the same drug, generics).

For many diseases a standard of care exists, so placebo controls are often no longer possible. Instead, a new treatment is compared to an active control (the standard of care).

Table 13.2 compares superiority to equivalence studies in terms of the underlying test hypotheses for the treatment effect \(\Delta\). The point null hypothesis in a superiority trial is “too narrow” to be proven. If a superiority trial shows a large \(P\)-value, then the only implication is that there is no evidence against \(H_0\). In contrast, equivalence studies specify an interval of equivalence (through an “equivalence margin”) as a composite alternative hypothesis with its complement as the null hypothesis. This construction makes it possible to quantify the evidence for equivalence. Figure 13.3 illustrates the different design types.

Table 13.2: Comparison of superiority, equivalence, and non-inferiority study designs.
Design Hypotheses Sample size
Superiority \(H_0\): \(\Delta=0\) vs. \(H_1\): \(\Delta \neq 0\) \(n = \frac{ 2 \sigma^2 (z_{1-\alpha/2}+z_{1-\beta})^2}{ \Delta^2}\)
Equivalence \(H_0\): \(\left\lvert\Delta \right\rvert\geq \delta\) vs. \(H_1\): \(\left\lvert\Delta \right\rvert< \delta\) \(n = \frac{2 \sigma^2(z_{1-\alpha}+z_{1-\beta/2})^2}{\delta^2}\)
Non-inferiority \(H_0\): \(\Delta \leq -\delta\) vs. \(H_1\): \(\Delta > -\delta\) \(n = \frac{2 \sigma^2(z_{1-\alpha}+z_{1-\beta})^2}{\delta^2}\)

Figure 13.1: Comparison of superiority, equivalence, and non-inferiority study designs.

13.2.1 Equivalence trials

To assess equivalence, we compute a confidence interval at level \(\gamma\) for the difference in the treatment means. The treatments are considered equivalent if both ends of the confidence interval lie within the pre-specified interval of equivalence \(I=(-\delta, \delta)\). If this does not occur, then equivalence has not been established. The Type I error rate of this procedure is

\[\alpha \approx (1-\gamma)/2,\]

as shown in Appendix D.

For \(\gamma=90\%\), we have \(\alpha\approx 0.05\) and for \(\gamma=95\%\), we have \(\alpha\approx 0.025\).

13.2.1.1 The TOST procedure

An alternative approach to assess equivalence is the TOST procedure:

  1. Apply Two separate standard One-Sided significance Tests (TOST) at level \(\alpha\):
    • Test 1 for \(H_0\): \(\Delta \leq - \delta\) vs. \(H_1\): \(\Delta > - \delta\),
    • Test 2 for \(H_0\): \(\Delta \geq \delta\) vs. \(H_1\): \(\Delta < \delta\).
  2. If both one-sided tests reject, we can conclude equivalence at level \(\alpha\).

Example 13.5 Remember Example 5.1 about soporific drugs measured in terms of increase in hours of sleep compared to control. While this trial was not planned as an equivalence trial, let us suppose here that the goal is to show equivalence with margin \(\delta=0.5\) h. To make this scenario more reasonable, we simulate new (fake) data: independent of the group (drug administered), all data are simulated from the same normal distribution with mean and variance estimated from the original data. There is less difference between the two groups in Figure 13.2 (fake data) than in Figure 5.3 (original data).


Figure 13.2: Comparison of two soporific drugs in a simulated (fake) example based on Example 5.1

## solution 1: confidence interval
res <- t.test(extra ~ group, data = sleep, conf.level = 0.9)
print(res$conf.int)
## [1] -0.7323645  0.5160494
## attr(,"conf.level")
## [1] 0.9

The limits of the \(90\)% CI do not lie within the interval of equivalence (-0.5h, 0.5h). Hence, equivalence cannot be established.

## solution 2: TOST procedure
tost1 <- t.test(extra ~ group, data = sleep, mu = -0.5, 
                alternative = "greater")
print(tost1$p.value)
## [1] 0.1453168
tost2 <- t.test(extra ~ group, data = sleep, mu = 0.5, 
                alternative = "less")
print(tost2$p.value)
## [1] 0.0541846

One \(p\)-value is larger than \(\alpha=5\%\), the other one is smaller. Hence, equivalence cannot be established.


Figure 13.3: Concluding superiority, equivalence or non-inferiority based on a confidence interval for the difference in means.

13.2.1.2 Sample size calculation for continuous outcomes

Example 13.6 An example for a sample size calculation for an equivalence trial is Holland et al. (2017), see Figure 13.4 for title and abstract of this publication.

Figure 13.4: Equivalence trial example Holland et al. (2017).

Consider the hypotheses \(H_0: \left\lvert\Delta\right\rvert \geq \delta\) and \(H_1: \left\lvert\Delta\right\rvert < \delta\). The type I error rate is calculated at the boundary \(\Delta = \delta\). The power \(1-\beta\) is computed at the center \(\Delta=0\) of \(H_1\). Appendix D shows that the required sample size in each group is then:

\[\begin{eqnarray*} n &=& \frac{2 \sigma^2(z_{(1+\gamma)/2}+z_{1-\beta/2})^2}{\delta^2} \\ & \approx & \frac{2 \sigma^2(z_{1-\alpha}+z_{1-\beta/2})^2}{\delta^2}, \end{eqnarray*}\]

where \(z_{\gamma}\) is the \(\gamma\)-quantile of the standard normal distribution.

Table 13.2 compares the sample sizes of the different trial designs. The equivalence margin \(\delta\) takes over the role of the clinically relevant difference \(\Delta\). In practice, \(\delta\) is usually considerably smaller than \(\Delta\) in a related but conventional superiority trial. This typically leads to a larger sample size required for equivalence trials. Moreover, \(z_{1-\alpha}\) replaces \(z_{1 - \alpha/2}\) and \(z_{{1-{\beta/2}}}\) replaces \(z_{{1-{\beta}}}\). This is a consequence of the null hypothesis, but not the alternative, being two-sided in an equivalence trial.
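The equivalence sample size formula can be coded directly (a sketch; the function name is our own). With \(\delta = \sigma = 1\), \(\alpha = 5\%\) and \(\beta = 20\%\) it reproduces the value 17.1 per group:

```r
## per-group sample size for an equivalence trial
n.equivalence <- function(delta, sigma, alpha = 0.05, beta = 0.2) {
  2 * sigma^2 * (qnorm(1 - alpha) + qnorm(1 - beta / 2))^2 / delta^2
}
round(n.equivalence(delta = 1, sigma = 1), 1)   # 17.1
```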

13.2.2 Non-inferiority trials

Non-inferiority trials are useful if a proven active treatment exists and placebo-controls are not acceptable for ethical reasons. They are a special case of equivalence trials, but conducted more often.

To assess non-inferiority, just perform one of the two TOST tests, say

\[H_0: \Delta \leq - \delta \mbox{ vs. } H_1: \Delta > - \delta.\]

A one-sided superiority trial corresponds to \(\delta=0\). The alternative procedure based on confidence intervals computes a confidence interval at level \(\gamma\) and rejects the \(H_0\) of inferiority if the lower bound of the interval is larger than \(-\delta\). The type I error rate is \(\alpha = (1-\gamma)/2\).

The sample size (per group) is now

\[\begin{equation*} n = \frac{2 \sigma^2(z_{1-\alpha}+z_{1-\beta})^2}{\delta^2}. \end{equation*}\]

Sample sizes of the three designs are compared in Table 13.2. Table 13.3 compares the sample sizes for the three different designs of RCTs in practical examples, assuming \(\Delta = \delta = \sigma = 1\) and different values for \(\alpha\) and \(\beta\). As before, the numbers can be adjusted for different values of \(\Delta/\sigma\) (respectively \(\delta/\sigma\)). For example, if \(\Delta/\sigma\) or \(\delta/\sigma = 1/2\), then the numbers have to be multiplied by 4.

Table 13.3: Comparison of sample sizes for RCTs in an example with \(\Delta = \delta = \sigma = 1\) and different values for \(\alpha\) and \(\beta\).
alpha beta Non-inferiority Superiority Equivalence
0.05 0.2 12.4 15.7 17.1
0.05 0.1 17.1 21.0 21.6
0.025 0.2 15.7 19.0 21.0
0.025 0.1 21.0 24.8 26.0

Example 13.7 Sample size calculations in Holland et al. (2017).

Figure: sample size excerpts from Holland et al. (2017).

\[\begin{eqnarray*} \delta/\sigma &=& 25/51 = 0.49 \\ \alpha=5\% & \rightarrow & n = 17.1 /0.49^2 = 72 = 144/2\\ \alpha=2.5\% & \rightarrow & n = 21.0 /0.49^2 = 88 > 144/2\\ \end{eqnarray*}\]

Did they use \(\alpha = 5\%\), which corresponds to 90% CI?

Example 13.8 More than \(50\%\) of acute respiratory tract infections (ARTI) are treated with antibiotics in primary care practice, despite a mainly viral etiology. Procalcitonin (PCT) is a biomarker to diagnose bacterial infections.

The PARTI trial (Briel et al. 2008) compares PCT-guided antibiotics use (the treatment decision depends on the value of the measured biomarker) with standard of care. The hypothesis is that PCT-guided antibiotics use is non-inferior to standard of care (non-inferiority trial). Here, non-inferior means that patients in the PCT-guided arm do about equally well as patients receiving standard of care, while receiving fewer antibiotics.

The study cohort consists of 458 patients with ARTI. The primary endpoint is the number of days with restrictions due to ARTI. Results for the primary endpoint are shown in Figure 13.5. Patients are interviewed 14 days after randomization and the non-inferiority margin is \(\delta=1\) day. The main secondary endpoint is the proportion of antibiotics prescriptions, for which the PCT arm is hypothesized to be superior. In the trial, 58/232 (25%) patients with PCT-guided therapy and 219/226 (97%) patients with standard therapy received antibiotics, a substantial reduction in antibiotics use of 72% (95% CI: 66% to 78%) with PCT-guided therapy.
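The reduction in antibiotics prescriptions can be recomputed from the counts given above (a sketch; prop.test() may use a slightly different confidence interval method than the publication, so the limits need not match exactly):

```r
## antibiotics prescriptions: 219/226 (standard) vs 58/232 (PCT-guided)
prescr <- prop.test(x = c(219, 58), n = c(226, 232))
unname(prescr$estimate[1] - prescr$estimate[2])  # reduction, about 0.72
round(prescr$conf.int, 2)                        # close to (0.66, 0.78)
```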


Figure 13.5: Results for primary endpoint in the PARTI trial (Example 13.8): 95% confidence intervals for the difference in days with restricted activities.

13.3 Additional references

Cluster-randomization is discussed in M. Bland (2015) (Sections 2.12, 10.13 and 18.8), equivalence and cluster-randomized trials in J. N. S. Matthews (2006) (Sections 11.5–11.6). The Statistics Notes J. Martin Bland and Kerry (1997), Kerry and Bland (1998a), Kerry and Bland (1998c), Kerry and Bland (1998b) discuss different aspects of cluster-randomized trials. Studies where the special designs from this chapter are used in practice are for example Butler et al. (2013), Burgess, Brown, and Lee (2005), Lovell et al. (2006).

References

Bland, J Martin, and Sally M Kerry. 1997. “Statistics Notes: Trial randomised in clusters.” BMJ 315: 600.
Bland, Martin. 2015. An Introduction to Medical Statistics. Fourth. Oxford University Press.
Briel, Matthias, Philipp Schuetz, Beat Mueller, Jim Young, Ursula Schild, Charly Nusbaumer, Pierre Périat, Heiner C Bucher, and Mirjam Christ-Crain. 2008. “Procalcitonin-Guided Antibiotic Use Vs a Standard Approach for Acute Respiratory Tract Infections in Primary Care.” Archives of Internal Medicine 168 (18): 2000–2007.
Burgess, Ian F, Christine M Brown, and Peter N Lee. 2005. “Treatment of Head Louse Infestation with 4% Dimeticone Lotion: Randomised Controlled Equivalence Trial.” BMJ 330 (7505): 1423.
Butler, Christopher C, Sharon A Simpson, Kerenza Hood, David Cohen, Tim Pickles, Clio Spanou, Jim McCambridge, et al. 2013. “Training Practitioners to Deliver Opportunistic Multiple Behaviour Change Counselling in Primary Care: A Cluster Randomised Trial.” BMJ 346.
Holland, Anne E, Ajay Mahal, Catherine J Hill, Annemarie L Lee, Angela T Burge, Narelle S Cox, Rosemary Moore, et al. 2017. “Home-based rehabilitation for COPD using minimal resources: a randomised, controlled equivalence trial.” Thorax 72 (1): 57–65.
Kerry, Sally M, and J Martin Bland. 1998a. “Statistics Notes: Analysis of a trial randomised in clusters.” BMJ 316: 54.
———. 1998b. “Statistics Notes: Sample size in cluster randomisation.” BMJ 316 (7130): 549.
———. 1998c. “Statistics Notes: The intracluster correlation coefficient in cluster randomisation.” BMJ 316: 1455.
Kinmonth, Ann Louise, Alison Woodcock, Simon Griffin, Nicki Spiegal, and Michael J Campbell. 1998. “Randomised Controlled Trial of Patient Centred Care of Diabetes in General Practice: Impact on Current Wellbeing and Future Disease Risk.” BMJ 317 (7167): 1202–8.
Lovell, Karina, Debbie Cox, Gillian Haddock, Christopher Jones, David Raines, Rachel Garvey, Chris Roberts, and Sarah Hadley. 2006. “Telephone Administered Cognitive Behaviour Therapy for Treatment of Obsessive Compulsive Disorder: Randomised Controlled Non-Inferiority Trial.” BMJ 333 (7574): 883.
Matthews, John N. S. 2006. Introduction to Randomized Controlled Clinical Trials. Second. Chapman & Hall/CRC.
Oakeshott, P., S. M. Kerry, and J. E. Williams. 1994. “Randomized Controlled Trial of the Effect of the Royal College of Radiologists’ Guidelines on General Practitioners’ Referral for Radiographic Examination.” British Journal of General Practice 44: 197–200. https://doi.org/10.1046/j.1365-2125.2001.01382.x.