Chapter 11 Sequential methods and trial protocols

11.1 Monitoring accumulating data

Adequate evidence to settle which treatment is superior may have accumulated long before a clinical trial reaches its planned conclusion. An ethical issue then emerges: patients may be receiving a treatment that could already have been known to be inferior at the time of treatment. A solution is to apply repeated significance tests (RST) at several interim analyses, at each of which a decision is made whether or not to stop the trial. The Neyman-Pearson hypothesis test is suitable for this purpose. However, statistical adjustments are needed to ensure that the type I error rate \(\alpha\) is maintained.

11.1.1 Data monitoring committees

The decision to terminate a trial must take into account many features:

  • Efficacy of the treatment
  • Worryingly high incidence of side effects
  • Evidence that the new treatment is less efficacious than the existing treatment

A data (and safety) monitoring committee (D[S]MC) (with clinicians, statisticians, …) periodically reviews the evidence currently available from the trial. This is done at a relatively small number of times and may require unblinded study information. Extensive use of DMCs has led to the widespread use of group sequential methods.

11.1.2 Group sequential methods

“Group sequential” means that the data are analyzed in interim analyses after every successive group of \(2n\) patients (e.g. \(2n=20\) or \(2n=30\)). Fixing a maximum number \(N\) of groups, the trial is stopped at an interim analysis if the (two-sided) \(P\)-value is smaller than a pre-specified nominal significance level \(\tilde \alpha\), or once \(N\) groups of patients have been recruited.

11.1.2.1 Nominal significance level

Using the same nominal significance level \(\tilde \alpha\) for each test is known as the Pocock stopping rule. The following example illustrates that with the standard \(\alpha = 0.05\), the false positive rate is larger than \(5\)% when there are multiple interim analyses.

Example 11.1 Remember the example on stopping bias (Figure 4.2) where we had accumulating data generated without a treatment effect. Each simulation is analyzed in 10 groups of 30 patients. Figure 11.1 shows 20 selected profiles of the test statistic values based on the accumulating data and the false positive rate based on all 10000 simulations. With the standard significance threshold, the false positive rate is 19%. The false positive rate is 5% with an adjusted significance threshold.
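The false positive rates quoted above can be reproduced with a small simulation. The following is a minimal sketch, assuming two equally sized arms, standard normal outcomes without a treatment effect, and a known variance; the exact simulation setup used in the book may differ:

```r
## Simulation of the false positive rate under repeated significance testing
## (sketch): no treatment effect, 10 groups of 2n = 30 patients each.
set.seed(44)
nsim <- 10000
n.groups <- 10
n <- 15                       # patients per arm and group
reject.std <- logical(nsim)
reject.adj <- logical(nsim)
for (i in seq_len(nsim)) {
  x <- rnorm(n.groups * n)    # treatment arm
  y <- rnorm(n.groups * n)    # control arm
  k <- seq_len(n.groups)
  ## z-statistic based on the accumulating data after each group
  z <- (cumsum(x)[k * n] - cumsum(y)[k * n]) / sqrt(2 * k * n)
  reject.std[i] <- any(abs(z) > qnorm(0.975))  # standard threshold 1.96
  reject.adj[i] <- any(abs(z) > 2.56)          # Pocock-adjusted threshold
}
mean(reject.std)  # around 0.19
mean(reject.adj)  # around 0.05
```

Testing after each group at the unadjusted threshold inflates the false positive rate to roughly 19%, while the adjusted threshold of 2.56 keeps it close to 5%.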

Figure 11.1: 20 selected profiles of the test statistic values based on the accumulating data and the false positive rate based on all 10000 simulations, once with the standard significance threshold (top) and once with an adjusted significance threshold (bottom).

The nominal significance level \(\tilde \alpha\) depends on the type I error rate \(\alpha\) and the number of groups \(N\). For the Pocock stopping rule, Figure 11.2 shows how the nominal significance level decreases with an increasing number of groups, starting from \(\alpha = 5\)% for one group. For example, for 10 interim analyses the nominal significance level is \(\tilde \alpha = 0.0106\), which corresponds to a \(z\)-value of 2.56. This is the value used in Example 11.1 and Figure 11.1. Standard adjustments for multiple testing are too conservative here, since they ignore the specific dependence structure of accumulating data. For example, the Bonferroni correction would use \(\tilde \alpha = 0.05/10 = 0.005\) rather than \(\tilde \alpha = 0.0106\).
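The nominal levels translate into critical values of the test statistic. A small sketch, using the Pocock values quoted above (taken from Figure 11.2), shows how much milder the Pocock adjustment is than Bonferroni:

```r
## Pocock nominal significance levels (values from Figure 11.2) and the
## corresponding z-values, compared with the Bonferroni correction
N <- c(1, 2, 5, 10)
pocock <- c(0.0500, 0.0294, 0.0158, 0.0106)
bonferroni <- 0.05 / N
round(qnorm(1 - pocock / 2), 2)      # 1.96 2.18 2.41 2.56
round(qnorm(1 - bonferroni / 2), 2)  # 1.96 2.24 2.58 2.81
```

For \(N = 10\), the Bonferroni critical value of 2.81 is considerably larger than the Pocock value of 2.56, illustrating the conservativeness of the standard correction.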

Figure 11.2: Pocock stopping rule and Bonferroni correction as a function of the number of groups.

11.1.2.2 Sample size calculation

The maximal number \(N\) of interim analyses after every successive group of \(2n\) patients affects not only the nominal significance level \(\tilde \alpha\) but also the required sample size. Sample size calculation is performed for the maximum total sample size \(T = 2n\cdot N\). Note that the effective sample size depends on when the trial stops and can be smaller.

Example 11.2 Consider a continuous outcome and fix \(\alpha=0.05\). Suppose the goal is to detect a treatment difference of 0.5 standard deviation with power 90%. Then, \(\tilde \alpha\) and \(T\) vary as shown in Table 11.1 depending on different values of \(N\). The sample size per interim analysis is then determined by \(N\) and \(T\) as \(n=T/(2N)\) (also displayed in Table 11.1).

Table 11.1: Nominal significance level \(\tilde \alpha\) to detect a treatment difference of \(0.5\) standard deviation with power 90% and the maximum total sample size \(T = 2n\cdot N\) based on the Pocock criterion with different values for the number of interim analyses \(N\).

  \(N\)   \(\tilde\alpha\)   \(T\)     \(n\)
   1       0.0500            168      84.08
   2       0.0294            184      46.23
   5       0.0158            200      20.28
  10       0.0106            220      10.69
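The first row of Table 11.1 corresponds to the usual fixed-design sample size formula. A quick check, assuming a two-sided test with known variance:

```r
## Fixed-design sample size per group for a difference of delta = 0.5
## standard deviations, alpha = 0.05 (two-sided) and power 90%
## (the N = 1 row of Table 11.1)
delta <- 0.5
alpha <- 0.05
beta <- 0.1
n <- 2 * (qnorm(1 - alpha / 2) + qnorm(1 - beta))^2 / delta^2
n             # about 84.06 patients per group
round(2 * n)  # maximum total sample size T = 168
```

The group sequential designs with \(N > 1\) inflate this maximum total sample size, in exchange for the possibility of stopping early.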

11.1.3 Stopping rules

The Pocock stopping rule, where each test uses the same nominal significance level \(\tilde \alpha\), has two disadvantages:

  1. It is relatively easy for a trial to be stopped early, which some authors consider undesirable and unconvincing.
  2. Suppose the trial terminates at the final analysis with \(\tilde \alpha < p < \alpha\). Many clinicians then find it difficult to accept that the result of the trial is non-significant.

It is better to use a stopping rule \(a_1, a_2, \ldots, a_N\), where \(a_j\) is the significance threshold at the \(j\)-th interim analysis, with \(a_1\) very small and \(a_j\) gradually increasing to \(a_N\) close to \(\alpha\). Three such stopping rules are shown in Table 11.2 and Figure 11.3, next to the Pocock stopping rule, which is constant. They all control the overall type I error rate \(\alpha=5\)% with at most \(N=5\) interim analyses, except for the Haybittle-Peto approach, which controls it only approximately.

Table 11.2: Nominal significance levels of different stopping rules for 5 interim analyses.

                                      Interim analysis
  Method                      1          2        3        4        5
  Haybittle-Peto (1971)       0.001000   0.0010   0.0010   0.0010   0.0500
  Pocock (1977)               0.015800   0.0158   0.0158   0.0158   0.0158
  O’Brien-Fleming (1979)      0.000005   0.0013   0.0085   0.0228   0.0417
  Fleming et al. (1984)       0.003800   0.0048   0.0053   0.0064   0.0432
Figure 11.3: Different methods for stopping rules with \(5\) maximal interim analyses.

These stopping rules are implemented in the R package gsDesign. Note that, here for 3 interim analyses, the output Nominal p is the one-sided nominal significance level, which needs to be multiplied by \(2\):

library(gsDesign)

## Pocock stopping rule
gsDesign(k = 3, sfu = "Pocock", test.type = 2)
## Symmetric two-sided group sequential design with
## 90 % power and 2.5 % Type I Error.
## Spending computations assume trial stops
## if a bound is crossed.
## 
##            Sample
##             Size 
##   Analysis Ratio*  Z   Nominal p  Spend
##          1  0.384 2.29     0.011 0.0110
##          2  0.767 2.29     0.011 0.0079
##          3  1.151 2.29     0.011 0.0060
##      Total                       0.0250 
## 
## ++ alpha spending:
##  Pocock boundary.
## * Sample size ratio compared to fixed design with no interim
## 
## Boundary crossing probabilities and expected sample size
## assume any cross stops the trial
## 
## Upper boundary (power or Type I Error)
##           Analysis
##    Theta     1      2      3 Total   E{N}
##   0.0000 0.011 0.0079 0.0060 0.025 1.1276
##   3.2415 0.389 0.3421 0.1689 0.900 0.7210
## 
## Lower boundary (futility or Type II Error)
##           Analysis
##    Theta     1      2     3 Total
##   0.0000 0.011 0.0079 0.006 0.025
##   3.2415 0.000 0.0000 0.000 0.000
## O'Brien-Fleming stopping rule
gsDesign(k = 3, sfu = "OF", test.type = 2)
## Symmetric two-sided group sequential design with
## 90 % power and 2.5 % Type I Error.
## Spending computations assume trial stops
## if a bound is crossed.
## 
##            Sample
##             Size 
##   Analysis Ratio*  Z   Nominal p  Spend
##          1  0.339 3.47    0.0003 0.0003
##          2  0.677 2.45    0.0071 0.0069
##          3  1.016 2.00    0.0225 0.0178
##      Total                       0.0250 
## 
## ++ alpha spending:
##  O'Brien-Fleming boundary.
## * Sample size ratio compared to fixed design with no interim
## 
## Boundary crossing probabilities and expected sample size
## assume any cross stops the trial
## 
## Upper boundary (power or Type I Error)
##           Analysis
##    Theta      1      2      3 Total   E{N}
##   0.0000 0.0003 0.0069 0.0178 0.025 1.0111
##   3.2415 0.0565 0.5288 0.3147 0.900 0.7987
## 
## Lower boundary (futility or Type II Error)
##           Analysis
##    Theta     1      2      3 Total
##   0.0000 3e-04 0.0069 0.0178 0.025
##   3.2415 0e+00 0.0000 0.0000 0.000

There are also other forms of stopping rules. A more flexible approach is based on the Lan-DeMets alpha spending function and does not require the maximum number of interim analyses to be specified in advance. Another popular approach to analyze accumulating data is Whitehead’s triangular test based on the score statistic \(Z\) and the Fisher information \(V\). Figure 11.4 shows how this approach is used as a stopping rule in an example.
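In gsDesign, spending-function designs are specified by passing a spending function rather than a string. The following sketch uses the Lan-DeMets O'Brien-Fleming-type spending function sfLDOF (assuming the gsDesign package is installed; the exact boundary values depend on the package version):

```r
library(gsDesign)

## Lan-DeMets error spending approximation to the O'Brien-Fleming boundary,
## symmetric two-sided design with 3 analyses
d <- gsDesign(k = 3, test.type = 2, sfu = sfLDOF)
round(d$upper$bound, 2)                    # boundary z-values per analysis
round(2 * (1 - pnorm(d$upper$bound)), 4)   # two-sided nominal levels
```

As with the O'Brien-Fleming rule itself, the boundary z-values decrease across analyses, so early stopping requires very strong evidence.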

Whitehead's triangular test as a stopping rule: Crossing of green/red line implies positive/negative trial conclusion.

Figure 11.4: Whitehead’s triangular test as a stopping rule: Crossing of green/red line implies positive/negative trial conclusion.

11.1.3.1 Problems of stopping at interim

If a trial terminates early, the sequential nature of the trial causes problems in obtaining unbiased treatment effects. If a traditional analysis is performed in a trial that stopped at an interim analysis because treatment was found to be significantly better than control, then

  • the treatment effect estimate will be too large,
  • the CI will be too narrow,
  • and the \(P\)-value will be too small,

as explained in this video: https://vimeo.com/231014768. Advanced methods attempting to correct this bias are available, but rarely used.

Example 11.3 We illustrate the problems of stopping at interim in a simulation example. Suppose that we have two equally sized treatment arms and a continuous outcome. The primary outcome is the difference in means \(\theta\). We fix the following analysis parameters:

  • \(N=10\) interim analyses with Pocock bound \(z_P=2.56\) (nominal significance level \(\tilde \alpha = 0.0106\))
  • \(n=15\) patients per treatment arm in each group, to achieve a power of 75% to detect the true treatment effect \(\theta=0.35\) with standard deviation \(\sigma=1\).

Significance at the \(k\)-th interim analysis implies the following minimum detectable difference (MDD, compare Section 5.2) for the treatment effect estimate:

\[\begin{equation*} \hat \theta \geq z_P \sqrt{2/(k \cdot n)}. \end{equation*}\]
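Plugging in the parameters of this example (\(z_P = 2.56\), \(n = 15\)) gives the MDD at each interim analysis:

```r
## Minimum detectable difference at the k-th interim analysis,
## MDD = z_P * sqrt(2 / (k * n)), with z_P = 2.56 and n = 15 per arm
zP <- 2.56
n <- 15
k <- 1:10
mdd <- zP * sqrt(2 / (k * n))
round(mdd, 2)  # 0.93 0.66 0.54 0.47 0.42 0.38 0.35 0.33 0.31 0.30
```

The MDD shrinks as data accumulate, but at the early interim analyses it clearly exceeds the true effect \(\theta = 0.35\): a trial that stops early can only do so with an inflated effect estimate.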

Figure 11.5: Effect sizes in a simulation with the Pocock stopping rule for a maximal number of 10 interim analyses (Example 11.3).

Figure 11.5 shows that, in this simulation, the treatment effect estimate is too large on average if the trial has been stopped at one of the first eight interim analyses. The MDD is larger than the true effect size if the trial has been stopped at one of the first six interim analyses. The mean effect size of all significant trials is 0.53, the mean effect size of all non-significant trials is 0.23, and the mean effect size of all trials is 0.45 while the true effect size is only 0.35.
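A minimal sketch of such a simulation is given below, assuming normal outcomes with \(\sigma = 1\); the book's exact simulation setup may differ:

```r
## Conditional bias of the effect estimate under the Pocock stopping rule
## (sketch of Example 11.3): theta = 0.35, n = 15 per arm and group, N = 10
set.seed(44)
nsim <- 10000
N <- 10; n <- 15; zP <- 2.56; theta <- 0.35
est <- numeric(nsim)
sig <- logical(nsim)
for (i in seq_len(nsim)) {
  x <- rnorm(N * n, mean = theta)  # treatment arm
  y <- rnorm(N * n)                # control arm
  for (k in 1:N) {
    thetahat <- mean(x[1:(k * n)]) - mean(y[1:(k * n)])
    z <- thetahat / sqrt(2 / (k * n))
    if (abs(z) > zP || k == N) break  # stop at significance or last analysis
  }
  est[i] <- thetahat
  sig[i] <- abs(z) > zP
}
mean(sig)         # power, around 75%
mean(est[sig])    # mean estimate in significant trials, above theta
mean(est[!sig])   # mean estimate in non-significant trials, below theta
```

The estimates in significant trials are systematically larger than \(\theta\), reproducing the conditional bias described above.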

11.2 The trial protocol

The trial protocol serves at least three purposes:

  • Outlines the reasons for running a trial
  • Is an operations manual for the trial, e.g.: assessment for eligibility, treatment allocation, blinding procedure etc.
  • Is the scientific design document for the trial, e.g.: methods for allocation, ways of assessing outcomes etc.

It needs approval by ethics committees, funding bodies, regulatory authorities etc.

11.2.1 Designing and reporting trials

A convincing report can only result from a convincing study design. Problems should be addressed already in the trial protocol to ensure that choices have not been influenced by the results. These can be for example problems of multiplicity such as:

  • Which subgroups should be examined?
  • Which outcomes are primary and which are secondary?
  • Should outcomes be compared using baseline information?
  • Should the outcome variable be transformed?

11.2.2 Protocol deviations

Protocol deviations may happen at different stages:

  1. Not all patients may adhere to the protocol: some may not take the medication in the quantities and at the times specified, some may not turn up at the clinic for outcome measurements, and some may assert their right to withdraw from the study.
  2. Ineligible patients may enter the trial.
  3. A treatment other than the one allocated may be administered.

See J. N. S. Matthews (2006) (Section 10.2) for examples of protocol deviations. They can be dealt with differently in the analysis:

Definition 11.1

  1. The per-protocol analysis compares groups of patients who were actually treated as specified in the protocol, excluding non-compliant patients.
  2. The as-treated analysis compares groups of patients as they were actually treated. No patients are excluded from this analysis.
  3. The intention-to-treat (ITT) analysis compares groups of patients as they were allocated to treatment.

The ITT principle states: “Compare the groups as if they were formed by randomization, regardless of what has happened subsequently to the patients” (J. N. S. Matthews 2006, Section 10.3). Any other way of grouping the patients cannot guarantee balance at the start of treatment. Per-protocol and as-treated analyses are therefore subject to possible confounding and should be interpreted cautiously. On the other hand, an ITT analysis may lead to attenuated treatment effects.

Example 11.4 The Angina Trial is an RCT to compare surgical versus medical treatment for angina. Primary outcome is 2-year mortality. In this trial, not all patients received the treatment they were allocated to.

print(angina)
##   Allocated Received No.Death No.Total No.Survived MortRate
## 1   Surgery  Surgery       15      369         354     4.07
## 2   Surgery Medicine        6       26          20    23.08
## 3  Medicine  Surgery        2       48          46     4.17
## 4  Medicine Medicine       27      323         296     8.36
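For readers who want to reproduce the analyses below, the data frame can be reconstructed from the printed output (a sketch; the original object may have been created differently):

```r
## Angina Trial data, values taken from the printed output above
angina <- data.frame(
  Allocated = c("Surgery", "Surgery", "Medicine", "Medicine"),
  Received  = c("Surgery", "Medicine", "Surgery", "Medicine"),
  No.Death  = c(15, 6, 2, 27),
  No.Total  = c(369, 26, 48, 323)
)
angina$No.Survived <- angina$No.Total - angina$No.Death
angina$MortRate <- round(100 * angina$No.Death / angina$No.Total, 2)
```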

For the per-protocol analysis, we need to restrict the patients that we analyze.

## patients included in a per-protocol analysis
angina.perProtocol <- angina[angina$Allocated == angina$Received, ]
print(angina.perProtocol)
##   Allocated Received No.Death No.Total No.Survived MortRate
## 1   Surgery  Surgery       15      369         354     4.07
## 4  Medicine Medicine       27      323         296     8.36

This trial can be analyzed based on the three different analysis types.

## intention-to-treat analysis (twoby2() from the Epi package)
library(Epi)
tab.intentionToTreat <- xtabs(cbind(No.Death, No.Survived) ~ Allocated, 
                              data=angina)
twoby2(tab.intentionToTreat)
## 2 by 2 table analysis: 
## ------------------------------------------------------ 
## Outcome   : No.Death 
## Comparing : Surgery vs. Medicine 
## 
##          No.Death No.Survived    P(No.Death) 95% conf.
## Surgery        21         374         0.0532    0.0349
## Medicine       29         342         0.0782    0.0549
##          interval
## Surgery    0.0802
## Medicine   0.1102
## 
##                                     95% conf. interval
##              Relative Risk:  0.6801    0.3950   1.1711
##          Sample Odds Ratio:  0.6622    0.3706   1.1832
## Conditional MLE Odds Ratio:  0.6625    0.3519   1.2284
##     Probability difference: -0.0250   -0.0616   0.0104
## 
##              Exact P-value: 0.1882 
##         Asymptotic P-value: 0.1639 
## ------------------------------------------------------
## per-protocol analysis
tab.perProtocol <- xtabs(cbind(No.Death, No.Survived) ~ Allocated, 
                         data=angina.perProtocol)
twoby2(tab.perProtocol)
## 2 by 2 table analysis: 
## ------------------------------------------------------ 
## Outcome   : No.Death 
## Comparing : Surgery vs. Medicine 
## 
##          No.Death No.Survived    P(No.Death) 95% conf.
## Surgery        15         354         0.0407    0.0247
## Medicine       27         296         0.0836    0.0579
##          interval
## Surgery    0.0663
## Medicine   0.1192
## 
##                                     95% conf. interval
##              Relative Risk:  0.4863    0.2634   0.8979
##          Sample Odds Ratio:  0.4645    0.2426   0.8896
## Conditional MLE Odds Ratio:  0.4650    0.2254   0.9256
##     Probability difference: -0.0429   -0.0816  -0.0070
## 
##              Exact P-value: 0.0246 
##         Asymptotic P-value: 0.0207 
## ------------------------------------------------------
## as-treated analysis
tab.asTreated <- xtabs(cbind(No.Death, No.Survived) ~ Received, 
                       data=angina)
twoby2(tab.asTreated)
## 2 by 2 table analysis: 
## ------------------------------------------------------ 
## Outcome   : No.Death 
## Comparing : Surgery vs. Medicine 
## 
##          No.Death No.Survived    P(No.Death) 95% conf.
## Surgery        17         400         0.0408    0.0255
## Medicine       33         316         0.0946    0.0680
##          interval
## Surgery    0.0646
## Medicine   0.1300
## 
##                                     95% conf. interval
##              Relative Risk:  0.4311    0.2444   0.7605
##          Sample Odds Ratio:  0.4070    0.2226   0.7441
## Conditional MLE Odds Ratio:  0.4074    0.2088   0.7689
##     Probability difference: -0.0538   -0.0922  -0.0184
## 
##              Exact P-value: 0.0031 
##         Asymptotic P-value: 0.0035 
## ------------------------------------------------------

11.3 Additional references

You can find more about RSTs and sequential methods in M. Bland (2015) (Ch. 9.11) and in Todd et al. (2001), about ITT in M. Bland (2015) (Ch. 2.6), about monitoring accumulating data in J. N. S. Matthews (2006) (Ch. 8), and about protocols and protocol deviations in J. N. S. Matthews (2006) (Ch. 10). Studies where the methods from this chapter are used in practice include, for example, ACRE Trial Collaborators and others (2009), Durán-Cantolla et al. (2010), and Burgess, Brown, and Lee (2005).

References

ACRE Trial Collaborators and others. 2009. “Effect of ‘Collaborative Requesting’ on Consent Rate for Organ Donation: Randomised Controlled Trial (ACRE Trial).” The BMJ 339.
Bland, Martin. 2015. An Introduction to Medical Statistics. Fourth. Oxford University Press.
Burgess, Ian F, Christine M Brown, and Peter N Lee. 2005. “Treatment of Head Louse Infestation with 4% Dimeticone Lotion: Randomised Controlled Equivalence Trial.” BMJ 330 (7505): 1423.
Durán-Cantolla, Joaquı́n, Felipe Aizpuru, Jose Marı́a Montserrat, Eugeni Ballester, Joaquı́n Terán-Santos, Jose Ignacio Aguirregomoscorta, Mónica Gonzalez, et al. 2010. “Continuous Positive Airway Pressure as Treatment for Systemic Hypertension in People with Obstructive Sleep Apnoea: Randomised Controlled Trial.” BMJ 341.
Matthews, John N. S. 2006. Introduction to Randomized Controlled Clinical Trials. Second. Chapman & Hall/CRC.
Todd, Susan, Anne Whitehead, Nigel Stallard, and John Whitehead. 2001. “Interim Analyses and Sequential Designs in Phase III Studies.” British Journal of Clinical Pharmacology 51 (5): 394–99. https://doi.org/10.1046/j.1365-2125.2001.01382.x.