Chapter 7 Analysis of continuous outcomes

Effect size estimates quantify clinical relevance, confidence intervals indicate the (im)precision of the effect size estimates as population values, and \(P\)-values quantify statistical significance, i.e. the evidence against the null hypothesis. All three should be reported for each outcome of an RCT, and the reported \(P\)-value should be compatible with the selected confidence interval. While this chapter focuses on the analysis of continuous outcomes, the guidelines presented in Section 7.1 apply to all outcome types. Binary outcomes are discussed in a later chapter.

7.1 The CONSORT Statement

The CONSORT (CONSolidated Standards Of Reporting Trials) statement is an evidence-based minimum set of recommendations for reporting RCTs (Schulz, Altman, and Moher 2010; Moher et al. 2010). It encompasses various initiatives developed by the CONSORT Group to alleviate the problems arising from inadequate reporting of RCTs. It offers a standard way for authors to prepare reports of trial findings, facilitating their complete and transparent reporting, and aiding their critical appraisal and interpretation. The CONSORT Statement comprises a 25-item checklist and a flow diagram, see Figure 7.1. The current recommendations were established in 2010; an update will be published soon (see Hopewell et al. (2022) and https://www.consort-spirit.org/).


Figure 7.1: The CONSORT flow diagram.

7.1.1 Reporting results

The results of a statistical analysis should be reported as follows:

  • The recommended format for CIs is “from \(a\) to \(b\)” or “\(a\) to \(b\)”, not “\((a,b)\)”, “\([a,b]\)” or “\(a-b\)”.

  • The \(P\)-values should be rounded to two significant digits, e.g. \(p = 0.43\) or \(p = 0.057\). If \(0.0001 < p < 0.001\), \(P\)-values should be rounded to one significant digit, e.g. \(p=0.0004\). \(P\)-values should not be reported as \(p<0.1\) or \(p<0.05\) etc., as important information about the actual \(P\)-value is lost. Only very small \(P\)-values should be reported with the “<” symbol, e.g. “\(p < 0.0001\)”.

An example of good reporting is:

“The difference in means at follow-up was 2.28 units (95% CI: \(-1.34\) to \(5.90\), \(p=0.21\)).”
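These rounding rules can be wrapped in a small formatter. The function `format_pval()` below is a hypothetical helper written here for illustration only (the biostatUZH package used later in this chapter provides a similar helper):

```r
## format_pval() is a hypothetical helper, written for illustration;
## it implements the P-value rounding rules stated above.
format_pval <- function(p) {
  if (p < 0.0001)
    return("p < 0.0001")
  digits <- if (p < 0.001) 1 else 2  # one significant digit below 0.001
  paste0("p = ", format(signif(p, digits), scientific = FALSE))
}
format_pval(0.4271)    ## "p = 0.43"
format_pval(0.0567)    ## "p = 0.057"
format_pval(0.000412)  ## "p = 0.0004"
format_pval(0.00003)   ## "p < 0.0001"
```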

7.2 Comparison of two groups

Example 7.1 The Didgeridoo study (Puhan et al. 2005) is a randomized controlled trial with simple randomization. Patients with moderate obstructive sleep apnoea syndrome were randomized to 4 months of Didgeridoo practice (\(m = 14\)) or 4 months on the waiting list (\(n = 11\)). The primary endpoint is the Epworth scale (integers from 0 to 24). This scale is ordinal, but for the analysis it is considered continuous due to the large number of possible values. Measurements are taken at the start of the study (Baseline) and after four months (Follow-up). Figure 7.2 compares the follow-up measurements of the treatment and control groups for the primary endpoint.


Figure 7.2: Follow-up measurements of primary endpoint in the Didgeridoo Study.

In order to analyze the difference of the follow-up measurements between the two groups, a \(t\)-test can be performed:

# t-test
print(res <- t.test(f.up ~ treatment, var.equal=TRUE))
## 
##  Two Sample t-test
## 
## data:  f.up by treatment
## t = 1.3026, df = 23, p-value = 0.2056
## alternative hypothesis: true difference in means between group Control and group Didgeridoo is not equal to 0
## 95 percent confidence interval:
##  -1.340366  5.898807
## sample estimates:
##    mean in group Control mean in group Didgeridoo 
##                 9.636364                 7.357143
(DifferenceInMeans <- mean(res$conf.int))  
## [1] 2.279221

The \(t\)-test gives results identical to those of a linear regression analysis:

# regression analysis
library(biostatUZH)
m1 <- lm(f.up ~ treatment)
knitr::kable(tableRegression(m1, intercept=FALSE, latex = FALSE, xtable = FALSE))
Coefficient 95%-confidence interval \(p\)-value
treatmentDidgeridoo -2.28 from -5.90 to 1.34 0.21

7.2.1 \(t\)-test

In a \(t\)-test, data are assumed to be normally distributed, and the measurements in the two groups to be independent: with mean \(\mu_T\), variance \(\sigma^2_T\) and sample size \(m\) in the treatment group, and with mean \(\mu_C\), variance \(\sigma^2_C\) and sample size \(n\) in the control group. The quantity of interest is the mean difference \(\Delta = \mu_T - \mu_C\). The variances are assumed to be equal in the two groups, i.e. \(\sigma^2_T = \sigma^2_C = \sigma^2\).

The null hypothesis of a \(t\)-test is \[ H_0: \Delta = 0 . \] The estimate \(\widehat\Delta\) of \(\Delta\) is the difference in sample means. The \(t\)-test statistic is \[ T = \frac{\widehat\Delta}{\mathop{\mathrm{se}}(\widehat\Delta)}, \] with \[ \mathop{\mathrm{se}}(\widehat\Delta) = s \cdot \sqrt{\frac{1}{m} + \frac{1}{n}}, \] where \(s^2\) is the pooled estimate of the variance \(\sigma^2\): \[ s^2 = \frac{(m - 1)s^2_T + (n-1)s^2_C}{m + n - 2}. \]
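As a check of these formulas, the pooled \(t\)-statistic can be computed by hand and compared with t.test(). The data below are simulated for illustration (they are not the Didgeridoo data):

```r
## Hand computation of the pooled two-sample t-test,
## verified against t.test() on simulated data.
set.seed(1)
x <- rnorm(14, mean = 7, sd = 3)    # "treatment" group, size m
y <- rnorm(11, mean = 10, sd = 3)   # "control" group, size n
m <- length(x); n <- length(y)
s2 <- ((m - 1) * var(x) + (n - 1) * var(y)) / (m + n - 2)  # pooled variance
se <- sqrt(s2 * (1/m + 1/n))                # se of the mean difference
t.stat <- (mean(x) - mean(y)) / se
p <- 2 * pt(-abs(t.stat), df = m + n - 2)   # two-sided P-value
res <- t.test(x, y, var.equal = TRUE)
all.equal(c(unname(res$statistic), res$p.value), c(t.stat, p))
## [1] TRUE
```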

Under the null hypothesis of no effect, the test statistic \(T\) follows a \(t\)-distribution with \(m + n - 2\) degrees of freedom (df). For large degrees of freedom, the \(t\)-distribution is close to a standard normal distribution, as illustrated in Figure 7.3.


Figure 7.3: Comparison of \(t\)-distribution (with large degree of freedom) to a standard normal distribution.

The exact two-sided \(P\)-value \(p = \operatorname{\mathsf{Pr}}(\left\lvert T\right\rvert \geq \left\lvert t\right\rvert)\) using the exact \(t\)-distribution and the approximate normal distribution is:

print(res$statistic)
##        t 
## 1.302615
## exact p-value based on t-distribution
unname(2*(1-pt(abs(res$statistic), df=23)))
## [1] 0.2055989
## normal approximation
unname(2*(1-pnorm(abs(res$statistic))))
## [1] 0.1927063

The factors \(t\) and \(z\) used to compute the limits of the exact and approximate 95% confidence intervals, \({\widehat \Delta} \pm t \cdot \mbox{se}(\widehat \Delta)\) and \({\widehat \Delta} \pm z \cdot \mbox{se}(\widehat \Delta)\) respectively, where \(t = t_{(1+\gamma)/2}(m+n-2)\) is the \((1+\gamma)/2\)-quantile of the \(t\)-distribution with \(m+n-2\) degrees of freedom and \(z = z_{(1+\gamma)/2}\) the corresponding quantile of the standard normal distribution, are compared in the following R code:

gamma <- 0.95
## Exact factor t based on t-distribution
(t <- qt((1+gamma)/2, df=23))
## [1] 2.068658
## Approximate factor z based on normal distribution
(z <- qnorm((1+gamma)/2))
## [1] 1.959964

7.2.2 Unequal variances

In the case of unequal variances, \(\sigma_T^2\) and \(\sigma^2_C\) are assumed to be different and the standard error of \(\widehat\Delta\) then is

\[\begin{equation*} \mbox{se}(\widehat \Delta) = \sqrt{\frac{s_T^2}{m} + \frac{s_C^2}{n}}, \end{equation*}\] where \(s_T^2\) and \(s_C^2\) are estimates of the variances \(\sigma_T^2\) and \(\sigma_C^2\) in the two groups. In this case, the exact null distribution of \(T=\widehat \Delta/{\mbox{se}(\widehat \Delta)}\) is unknown. Approximate solutions include

  • Welch’s Test, which uses a \(t\)-distribution with (non-integer) degrees of freedom.

  • Behrens’ Test, another option, which can be derived with Bayesian arguments and is implemented in biostatUZH::behrens.test().

  • The Mann-Whitney Test, which gives a \(P\)-value that is not compatible with a CI for the mean difference. Instead, it provides a CI for the median of the difference between a sample from the one group and a sample from the other group.
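Welch’s Test plugs this standard error into a \(t\)-distribution whose non-integer degrees of freedom are estimated with the Welch–Satterthwaite formula. A sketch on simulated data (not the Didgeridoo data), checked against t.test(var.equal = FALSE):

```r
## Welch standard error and Welch-Satterthwaite degrees of freedom,
## computed by hand and compared with t.test() on simulated data.
set.seed(2)
x <- rnorm(14, sd = 2)   # group with smaller variance
y <- rnorm(11, sd = 4)   # group with larger variance
m <- length(x); n <- length(y)
vx <- var(x) / m; vy <- var(y) / n
se <- sqrt(vx + vy)                                     # Welch se
df <- (vx + vy)^2 / (vx^2 / (m - 1) + vy^2 / (n - 1))   # non-integer df
res <- t.test(x, y, var.equal = FALSE)
all.equal(c(unname(res$parameter), unname(res$statistic)),
          c(df, (mean(x) - mean(y)) / se))
## [1] TRUE
```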

In R:

## Welch Test
print(res2 <- t.test(f.up ~ treatment, var.equal=FALSE))
## 
##  Welch Two Sample t-test
## 
## data:  f.up by treatment
## t = 1.187, df = 12.381, p-value = 0.2575
## alternative hypothesis: true difference in means between group Control and group Didgeridoo is not equal to 0
## 95 percent confidence interval:
##  -1.890289  6.448731
## sample estimates:
##    mean in group Control mean in group Didgeridoo 
##                 9.636364                 7.357143
print(DifferenceInMeans <- mean(res2$conf.int))
## [1] 2.279221
## Behrens Test
library(biostatUZH)
behrens.test(f.up ~ treatment)
## 
##  Behrens' t-test
## 
## data:  f.up by treatment
## t = 1.173, df = 11.361, p-value = 0.2648
## alternative hypothesis: true difference in means between group Control and group Didgeridoo is not equal to 0
## 95 percent confidence interval:
##  -1.980971  6.539413
## sample estimates:
##    mean in group Control mean in group Didgeridoo 
##                 9.636364                 7.357143
## Mann-Whitney Test
wilcox.test(f.up ~ treatment, conf.int=TRUE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  f.up by treatment
## W = 94, p-value = 0.3623
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -2.000003  5.999959
## sample estimates:
## difference in location 
##               1.000079

wilcox.test() provides a confidence interval for the median of the difference between a sample from the Didgeridoo and the control group (“difference in location”). Here, the difference in location is 1.0 with 95% CI from -2.0 to 6.0.

7.3 Analysis of baseline and follow-up measurements

In the previous section, we focused solely on follow-up measurements. Now, we consider both baseline and follow-up measurements in the analysis.

7.3.1 Change scores

Baseline values may be imbalanced between treatment groups just as any other prognostic factor. To analyse change from baseline, we use change scores:

Definition 7.1 The change score is the change from baseline defined as: \[ \mbox{change score} = \mbox{follow-up} - \mbox{baseline}. \]

Example 7.1 (continued) Figure 7.4 shows the combinations of baseline and follow-up measurements for each individual. It is visible that the change from baseline to follow-up is larger in the treatment group than in the control group. Figure 7.5 now directly compares the change scores.


Figure 7.4: Individual baseline and follow-up measurements in the Didgeridoo Study by treatment group.


Figure 7.5: Change scores for primary endpoint in the Didgeridoo Study.

A change score analysis for the Didgeridoo Study yields:

change.score <- f.up - baseline
print(res3 <- t.test(change.score ~ treatment, var.equal=TRUE))
## 
##  Two Sample t-test
## 
## data:  change.score by treatment
## t = 2.2748, df = 23, p-value = 0.03256
## alternative hypothesis: true difference in means between group Control and group Didgeridoo is not equal to 0
## 95 percent confidence interval:
##  0.2695582 5.6784938
## sample estimates:
##    mean in group Control mean in group Didgeridoo 
##                -1.454545                -4.428571
(DifferenceInMeans <- mean(res3$conf.int))  
## [1] 2.974026

Let us now define some notation. The outcome mean

  • at Baseline in both groups is \(\mu_B\),
  • at Follow-up in the control group is \(\mu\), and
  • at Follow-up in the treatment group is \(\mu + \Delta\).

The mean difference \(\Delta\) is of primary interest. We assume a common variance \(\sigma^2\) of all measurements, and \(n\) observations in each group. The correlation between baseline and follow-up measurements is defined as \(\rho\). The estimated difference of mean follow-up measurements is denoted by \(\widehat\Delta_1\) and the estimated difference of mean change scores by \(\widehat\Delta_2\). Both estimates are unbiased (assuming baseline balance).

The variance of these estimates is \(\mathop{\mathrm{Var}}(\widehat\Delta_1) = 2\sigma^2/n\) and \(\mathop{\mathrm{Var}}(\widehat\Delta_2) = 4\sigma^2(1 - \rho)/n\), respectively. The estimate \(\widehat\Delta_2\) will thus have smaller variance than \(\widehat\Delta_1\) for \(\rho > 1/2\), and it will produce narrower confidence intervals and more powerful tests. In the Didgeridoo study, the estimated correlation \(\hat \rho = 0.72\).

The change score analysis can also be done with a regression:

# Change score analysis
m2 <- lm(f.up ~ treatment + offset(baseline))
knitr::kable(tableRegression(m2, intercept=FALSE,
                             latex = FALSE, xtable = FALSE))
Coefficient 95%-confidence interval \(p\)-value
treatmentDidgeridoo -2.97 from -5.68 to -0.27 0.033

The offset(x) command fixes the coefficient of x at 1.

7.3.2 Analysis of covariance

Analysis of covariance (ANCOVA) is an extension of the change score analysis:

m3 <- lm(f.up ~ treatment + baseline)
knitr::kable(tableRegression(m3, intercept = FALSE, 
                             latex = FALSE, xtable = FALSE))
Coefficient 95%-confidence interval \(p\)-value
treatmentDidgeridoo -2.74 from -5.14 to -0.35 0.027
baseline 0.67 from 0.42 to 0.92 < 0.0001

Now the coefficient of baseline is estimated from the data.

Let us denote the coefficient of the baseline variable as \(\beta\). The ANCOVA model reduces to the analysis of follow-up for \(\beta = 0\), and to the analysis of change scores for \(\beta = 1\).

The ANCOVA model estimates \(\beta\) and the mean difference \(\Delta\) jointly with multiple regression. The estimate \(\hat\beta\) is usually close to the correlation \(\rho\).
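This relationship can be illustrated on simulated data (hypothetical values, not the Didgeridoo data). In a simple regression of follow-up on baseline without the treatment term, the slope equals the correlation times the ratio of the standard deviations, and is therefore close to \(\rho\) when baseline and follow-up have similar variances:

```r
## Simulated illustration: with equal variances, the regression slope for
## baseline is close to the baseline/follow-up correlation rho.
set.seed(44)
n <- 500
rho <- 0.72
baseline <- rnorm(n)
f.up <- rho * baseline + rnorm(n, sd = sqrt(1 - rho^2))  # Cor(baseline, f.up) = rho
c(slope = unname(coef(lm(f.up ~ baseline))["baseline"]),
  cor   = cor(baseline, f.up))
## both entries are close to rho = 0.72
```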

Example 7.1 (continued) Comparison of the three different analysis methods in the Didgeridoo study:

# Follow-up analysis
m1 <- lm(f.up ~ treatment)
knitr::kable(tableRegression(m1, intercept=FALSE, 
                             latex = FALSE, xtable = FALSE))
Coefficient 95%-confidence interval \(p\)-value
treatmentDidgeridoo -2.28 from -5.90 to 1.34 0.21
# Change score analysis
m2 <- lm(f.up ~ treatment + offset(baseline))
knitr::kable(tableRegression(m2, intercept=FALSE, 
                             latex = FALSE, xtable = FALSE))
Coefficient 95%-confidence interval \(p\)-value
treatmentDidgeridoo -2.97 from -5.68 to -0.27 0.033
# ANCOVA
m3 <- lm(f.up ~ treatment + baseline)
knitr::kable(tableRegression(m3, intercept = FALSE, 
                             latex = FALSE, xtable = FALSE))
Coefficient 95%-confidence interval \(p\)-value
treatmentDidgeridoo -2.74 from -5.14 to -0.35 0.027
baseline 0.67 from 0.42 to 0.92 < 0.0001

Comparison of effect estimates

Let \(\bar {b}_T\) and \(\bar {b}_C\) denote the observed mean baseline values in the current trial. The expectation of \(\widehat\Delta_1\) and \(\widehat\Delta_2\) given \(\bar b_T\) and \(\bar b_C\), are

\[\begin{eqnarray*} \mathop{\mathrm{\mathsf{E}}}(\widehat \Delta_1 \,\vert\,\bar {b}_T, \bar {b}_C) & = & \Delta + \underbrace{\rho \cdot (\bar b_T - \bar b_C)}_{\color{red}{\text{bias}}} \\ \mathop{\mathrm{\mathsf{E}}}(\widehat \Delta_2 \,\vert\,\bar {b}_T, \bar {b}_C) & = & \Delta + \underbrace{(\rho - 1) \cdot (\bar b_T - \bar b_C)}_{\color{red}{\text{bias}}} \end{eqnarray*}\]

Hence both \(\widehat\Delta_1\) and \(\widehat\Delta_2\) given \(\bar b_T\) and \(\bar b_C\) are biased if there is correlation \(\rho > 0\) between baseline and follow-up measurements and there is baseline imbalance (\(\bar b_T \neq \bar b_C\)).

In the Didgeridoo study there is some baseline imbalance: \(\bar {b}_T=11.1\), \(\bar {b}_C=11.8\).

In contrast, the ANCOVA estimate \(\widehat \Delta_3\) is an unbiased estimate of the mean difference \(\Delta\) with variance

\[ \mathop{\mathrm{Var}}(\widehat \Delta_3) = 2 \sigma^2(1-\rho^2)/n, \] which is always smaller than the variances of \(\widehat \Delta_1\) and \(\widehat \Delta_2\). This means that the treatment effect estimate has a smaller standard error. As a result, the required sample size for ANCOVA reduces by the factor \(1- \rho^2\) compared to the standard comparison of two groups without baseline adjustments.

The variances of the effect estimates in the three models can be compared by the corresponding variance factors (derived in the exercises):

\[ \mathop{\mathrm{Var}}(\widehat \Delta) = \color{red}{\mbox{variance factor}} \cdot \sigma^2 /n \]


\[ \begin{aligned} \mathop{\mathrm{Var}}(\widehat \Delta_1) &= \color{red}{2} \cdot \sigma^2 /n \\ \mathop{\mathrm{Var}}(\widehat \Delta_2) &= \color{red}{4 (1-\rho)} \cdot \sigma^2 /n \\ \mathop{\mathrm{Var}}(\widehat \Delta_3) &= \color{red}{2 (1-\rho^2)} \cdot \sigma^2/n \end{aligned} \]
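These variance factors can be tabulated for a few values of \(\rho\). At the correlation \(\hat\rho = 0.72\) estimated in the Didgeridoo study, the ANCOVA factor is \(2(1-0.72^2) \approx 0.96\), roughly half the factor 2 of the follow-up analysis:

```r
## Variance factors of the three effect estimates as a function of rho
rho <- c(0, 0.25, 0.5, 0.72, 0.9)
round(cbind(rho,
            follow.up    = 2,              # Var factor of Delta1
            change.score = 4 * (1 - rho),  # Var factor of Delta2
            ancova       = 2 * (1 - rho^2)), 2)
```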

Figure 7.6 compares the variance factors of the three models for varying correlations \(\rho\).


Figure 7.6: Comparison of variance factors

Adjusting for other variables

ANCOVA allows a wide range of variables measured at baseline to be used to adjust the mean difference. The safest approach to selecting these variables is to decide this before the trial starts (in the study protocol). Prognostic variables used to stratify the allocation should always be included as covariates.

Example 7.1 (continued) In the Didgeridoo study, the mean difference has been adjusted for severity of the disease (base.apnoea) and for weight change during the study period (weight.change).

m4 <- lm(f.up ~ treatment + baseline + weight.change + base.apnoea)
knitr::kable(tableRegression(m4, intercept = FALSE, latex = FALSE, xtable = FALSE))
Coefficient 95%-confidence interval \(p\)-value
treatmentDidgeridoo -2.75 from -5.35 to -0.15 0.039
baseline 0.67 from 0.41 to 0.93 < 0.0001
weight.change -0.17 from -0.92 to 0.57 0.63
base.apnoea 0.023 from -0.25 to 0.29 0.86

7.4 Additional references

Relevant references are Chapter 10 “Comparing the Means of Small Samples” and Chapter 15 “Multifactorial Methods” in M. Bland (2015), as well as Chapter 6 “Analysis of Results” in J. N. S. Matthews (2006). Analysing controlled trials with baseline and follow-up measurements is discussed in the Statistics Note by Vickers and Altman (2001). Examples of studies where the methods from this chapter are used in practice are Ravaud et al. (2009), Porto et al. (2011), and James et al. (2004).

References

Bland, Martin. 2015. An Introduction to Medical Statistics. Fourth. Oxford University Press.
Hopewell, Sally, Isabelle Boutron, An-Wen Chan, Gary S. Collins, Jennifer A. de Beyer, Asbjørn Hróbjartsson, Camilla Hansen Nejstgaard, et al. 2022. “An Update to SPIRIT and CONSORT Reporting Guidelines to Enhance Transparency in Randomized Trials.” Nature Medicine 28 (9): 1740–43. https://doi.org/10.1038/s41591-022-01989-8.
James, Janet, Peter Thomas, David Cavan, and David Kerr. 2004. “Preventing childhood obesity by reducing consumption of carbonated drinks: cluster randomised controlled trial.” BMJ 328 (7450).
Matthews, John N. S. 2006. Introduction to Randomized Controlled Clinical Trials. Second. Chapman & Hall/CRC.
Moher, D, S Hopewell, KF Schulz, V Montori, PC Gøtzsche, PJ Devereaux, D Elbourne, M Egger, and DG Altman. 2010. “CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials.” BMJ 340.
Porto, Ana Maria Feitosa, Isabela Cristina Coutinho, Jailson Barros Correia, and Melania Maria Ramos Amorim. 2011. “Effectiveness of antenatal corticosteroids in reducing respiratory disorders in late preterm infants: randomised clinical trial.” BMJ 342.
Puhan, Milo A, Alex Suarez, Christian Lo Cascio, Alfred Zahn, Markus Heitz, and Otto Braendli. 2005. “Didgeridoo playing as alternative treatment for obstructive sleep apnoea syndrome: randomised controlled trial.” BMJ 332: 1–5.
Ravaud, P, RM Flipo, I Boutron, C Roy, A Mahmoudi, B Giraudeau, and T Pham. 2009. “ARTIST (osteoarthritis intervention standardized) study of standardised consultation versus usual care for patients with osteoarthritis of the knee in primary care in France: pragmatic randomised controlled trial.” BMJ 338.
Schulz, KF, DG Altman, and D Moher, for the CONSORT Group. 2010. “CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials.” BMJ 340.
Vickers, Andrew J, and Douglas G Altman. 2001. “Statistics Notes: Analysing controlled trials with baseline and follow up measurements.” BMJ 323 (7321): 1123–24.