Chapter 7 Analysis of continuous outcomes
Effect size estimates quantify clinical relevance, confidence intervals (CIs) indicate the (im)precision of the effect size estimates as population values, and P-values quantify statistical significance, i.e. the evidence against the null hypothesis. All three should be reported for each outcome of an RCT, and the reported P-value should be compatible with the type of confidence interval. Although this chapter is about continuous outcomes, these reporting guidelines apply to all outcomes; they are elaborated on in Section 7.1. The RCT in Example 7.1 is used throughout this chapter for illustration. Binary outcomes are discussed in the next chapter.
7.1 The CONSORT Statement
The CONSORT Statement (CONSolidated Standards Of Reporting Trials, http://www.consort-statement.org/; Schulz, Altman, and Moher 2010) is an evidence-based minimum set of recommendations for reporting RCTs. It encompasses various initiatives developed by the CONSORT Group to alleviate the problems arising from inadequate reporting of RCTs. It offers a standard way for authors to prepare reports of trial findings, facilitating their complete and transparent reporting, and aiding their critical appraisal and interpretation. The CONSORT Statement comprises a 25-item checklist and a flow diagram, see Figures 7.1 and 7.2.

Figure 7.1: The CONSORT checklist.

Figure 7.2: The CONSORT flow diagram.
7.1.1 Reporting results
Results from a statistical analysis should be reported as follows:
- The recommended format for CIs is “from a to b” or “a to b”, not “(a,b)”, “[a,b]” or “a−b”.
- Round P-values to two significant digits, e.g. p=0.43 or p=0.057. Round to one significant digit if 0.0001≤p<0.001, e.g. p=0.0004. Report very small P-values as “p<0.0001”.
- Do not report P-values as p<0.1 or p<0.05 etc., as important information about the actual P-value is lost.
An example from the Didgeridoo Study is (see Example 7.1):
“The difference in means at follow-up was 2.28 units (95% CI: −1.34 to 5.90, p=0.21).”
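The rounding rules above can be turned into a small helper. The following is a minimal sketch, not part of any package: the function names `format_p` and `format_result` are hypothetical, and trailing zeros are dropped (0.50 is printed as 0.5).

```r
# Format a P-value following the rounding rules listed above
# (hypothetical helper, not taken from any package).
format_p <- function(p) {
  stopifnot(is.numeric(p), length(p) == 1, p >= 0, p <= 1)
  if (p < 0.0001) return("p<0.0001")      # very small P-values
  digits <- if (p < 0.001) 1 else 2       # one or two significant digits
  paste0("p=", format(signif(p, digits), scientific = FALSE))
}

# Combine estimate, CI in the "a to b" format, and P-value.
format_result <- function(estimate, lower, upper, p, level = 95) {
  sprintf("%.2f (%d%% CI: %.2f to %.2f, %s)",
          estimate, level, lower, upper, format_p(p))
}

# Numbers taken from the t-test output of the Didgeridoo Study
format_result(2.279221, -1.340366, 5.898807, 0.2055989)
# "2.28 (95% CI: -1.34 to 5.90, p=0.21)"
```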
Example 7.1 The Didgeridoo Study (Puhan et al. 2005) is a randomized controlled trial with simple randomization, see the abstract in Figure 7.3.
Patients with moderate obstructive sleep apnoea syndrome are randomized to two groups as follows:
## treatment
## Control Didgeridoo
## 11 14
- Treatment group: 4 months Didgeridoo practice (m=14)
- Control group: 4 months waiting list (n=11)
The primary endpoint is the Epworth scale (integers from 0 to 24). This scale is ordinal, but it is treated as continuous in the analysis because of the large number of possible values. Measurements are taken at the start of the study (baseline) and after four months (follow-up). Figure 7.4 compares the follow-up measurements of the treatment and control groups for the primary endpoint. To analyse the difference in these follow-up measurements between the two groups, regression analysis gives the same results as a t-test:
Figure 7.4: Follow-up measurements of primary endpoint in the Didgeridoo Study.
##
## Two Sample t-test
##
## data: f.up by treatment
## t = 1.3026, df = 23, p-value = 0.2056
## alternative hypothesis: true difference in means between group Control and group Didgeridoo is not equal to 0
## 95 percent confidence interval:
## -1.340366 5.898807
## sample estimates:
## mean in group Control mean in group Didgeridoo
## 9.636364 7.357143
## [1] 2.279221
# regression analysis
library(biostatUZH)
m1 <- lm(f.up ~ treatment)
knitr::kable(tableRegression(m1, intercept=FALSE, latex = FALSE))
| | Coefficient | 95%-confidence interval | p-value |
|---|---|---|---|
| treatmentDidgeridoo | -2.28 | from -5.90 to 1.34 | 0.21 |
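The equivalence of the two analyses can be checked on any data set. The sketch below uses simulated values (hypothetical, NOT the Didgeridoo measurements) with the same group sizes. Note that `t.test` reports the difference Control minus Didgeridoo, whereas the regression coefficient is Didgeridoo minus Control, which explains the opposite signs of 2.28 and -2.28 above.

```r
set.seed(1)
# Simulated data for illustration only, not the study data
treatment <- factor(rep(c("Control", "Didgeridoo"), times = c(11, 14)))
f.up <- rnorm(25, mean = ifelse(treatment == "Control", 10, 8), sd = 4)

tt  <- t.test(f.up ~ treatment, var.equal = TRUE)  # equal-variance t-test
m   <- lm(f.up ~ treatment)                        # regression analysis
est <- summary(m)$coefficients["treatmentDidgeridoo", ]

# Same t statistic (up to sign), same P-value, same CI (mirrored)
c(t.test = unname(tt$statistic), lm = -unname(est["t value"]))
c(t.test = tt$p.value, lm = unname(est["Pr(>|t|)"]))
rbind(t.test = tt$conf.int[1:2],
      lm     = -rev(unname(confint(m)["treatmentDidgeridoo", ])))
```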
7.2 Comparison of two groups
We can compare the follow-up measurements in two groups
- with a t-test in case of equal variances,
- with Welch's test or Behrens' test in case of unequal variances.
The t-test and the z-test are explained in Appendix B. Alternative methods that adjust for baseline values are discussed in Section 7.3.
7.2.1 Equal variances
The t-test assumes independent measurements in two groups that are normally distributed with equal variances. The null hypothesis is that the mean difference $\Delta$ between the two groups is 0. Assuming $H_0$ is true, the test statistic
$$T = \frac{\hat\Delta}{\mathrm{se}(\hat\Delta)}$$
follows a t-distribution with $m+n-2$ degrees of freedom (df).
For large df, the t-distribution is close to a standard normal distribution, as illustrated in Figure 7.5.

Figure 7.5: Comparison of t-distribution (with large degree of freedom) to a standard normal distribution.
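For reference, the standard error in the equal-variance t-test is usually based on the pooled variance estimate. The display below is the standard textbook form, written with the symbols $s_T^2$, $s_C^2$ (sample variances in the treatment and control group) and the group sizes $m$ and $n$; the symbol $s^2$ for the pooled variance is introduced here only for illustration.

```latex
\mathrm{se}(\hat\Delta) = s \sqrt{\frac{1}{m} + \frac{1}{n}},
\qquad
s^2 = \frac{(m-1)\, s_T^2 + (n-1)\, s_C^2}{m+n-2}.
```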
Here is a comparison of the two-sided P-value p=Pr(|T|≥|t|) using the exact t-distribution and the approximate normal distribution:
## t
## 1.302615
## [1] 0.2055989
## [1] 0.1927063
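Here we compare the factors used to compute the limits of the exact and the approximate 95% confidence interval, $\hat\Delta \pm t \cdot \mathrm{se}(\hat\Delta)$ and $\hat\Delta \pm z \cdot \mathrm{se}(\hat\Delta)$, respectively, where $t$ and $z$ denote the 97.5% quantiles of the t-distribution with 23 df and of the standard normal distribution: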
## [1] 2.068658
## [1] 1.959964
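The code behind these numbers is not shown. A minimal sketch that reproduces them from the observed test statistic and the degrees of freedom (the variable names are chosen here for illustration) could look as follows:

```r
t_obs <- 1.302615   # observed t statistic from the two-sample t-test
df    <- 23         # m + n - 2 = 14 + 11 - 2

# Two-sided P-values Pr(|T| >= |t|)
p_exact  <- 2 * pt(abs(t_obs), df = df, lower.tail = FALSE)  # t-distribution
p_approx <- 2 * pnorm(abs(t_obs), lower.tail = FALSE)        # standard normal
c(exact = p_exact, approx = p_approx)

# Factors for the limits of the 95% confidence interval
q_exact  <- qt(0.975, df = df)
q_approx <- qnorm(0.975)
c(exact = q_exact, approx = q_approx)
```

With only 23 degrees of freedom, the normal approximation gives a somewhat smaller P-value and a narrower interval than the exact t-distribution.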
7.2.2 Unequal variances
In case of unequal variances, $\sigma_T^2$ and $\sigma_C^2$ are assumed to be different, and the standard error is
$$\mathrm{se}(\hat\Delta) = \sqrt{\frac{s_T^2}{m} + \frac{s_C^2}{n}},$$
where $s_T^2$ and $s_C^2$ are estimates of the variances $\sigma_T^2$ and $\sigma_C^2$ in the two groups. In this case, the exact null distribution of $T = \hat\Delta/\mathrm{se}(\hat\Delta)$ is unknown. Approximate solutions are:
- The standard normal distribution, which is fine for large $m$ and $n$.
- Welch's test, which uses a t-distribution with (non-integer) degrees of freedom $\nu$, where $\nu$ is the solution of
$$\frac{(s_T^2/m + s_C^2/n)^2}{\nu} = \frac{(s_T^2/m)^2}{m-1} + \frac{(s_C^2/n)^2}{n-1}.$$
- Behrens' test, which can be derived with Bayesian arguments.

The Mann-Whitney test gives a P-value that is not compatible with the CI for the mean difference. Instead, it comes with a CI for the median of the differences between an observation from one group and an observation from the other group (reported as “difference in location” in the R output).
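Solving the equation for the degrees of freedom of Welch's test is a one-liner. The following sketch uses a hypothetical helper `welch_df` and simulated data (not the study data) to compare the result with the degrees of freedom reported by `t.test`, which performs Welch's test by default.

```r
# Degrees of freedom of Welch's test from the two sample variances
# and group sizes (hypothetical helper for illustration).
welch_df <- function(var_t, m, var_c, n) {
  a <- var_t / m
  b <- var_c / n
  (a + b)^2 / (a^2 / (m - 1) + b^2 / (n - 1))
}

# Equal variances and equal group sizes give 2 * (n - 1) df
welch_df(4, 10, 4, 10)

# Comparison with t.test() on simulated data with unequal variances
set.seed(2)
x <- rnorm(14, mean = 7, sd = 2)
y <- rnorm(11, mean = 9, sd = 6)
c(formula = welch_df(var(x), length(x), var(y), length(y)),
  t.test  = unname(t.test(x, y)$parameter))
```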
In R, Welch's test, Behrens' test and the Mann-Whitney (Wilcoxon rank sum) test give:
##
## Welch Two Sample t-test
##
## data: f.up by treatment
## t = 1.187, df = 12.381, p-value = 0.2575
## alternative hypothesis: true difference in means between group Control and group Didgeridoo is not equal to 0
## 95 percent confidence interval:
## -1.890289 6.448731
## sample estimates:
## mean in group Control mean in group Didgeridoo
## 9.636364 7.357143
## [1] 2.279221
##
## Behrens' t-test
##
## data: f.up by treatment
## t = 1.173, df = 11.361, p-value = 0.2648
## alternative hypothesis: true difference in means between group Control and group Didgeridoo is not equal to 0
## 95 percent confidence interval:
## -1.980971 6.539413
## sample estimates:
## mean in group Control mean in group Didgeridoo
## 9.636364 7.357143
##
## Wilcoxon rank sum test with continuity correction
##
## data: f.up by treatment
## W = 94, p-value = 0.3623
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -2.000003 5.999959
## sample estimates:
## difference in location
## 1.000079
7.3 Adjusting for baseline
7.3.1 Change scores
Baseline values may be imbalanced between treatment groups just as any other prognostic factor. To analyse change from baseline, we use change scores:
Definition 7.1 The change score is the change from baseline defined as: change score=follow-up−baseline.
Example 7.2 Figure 7.6 shows the combinations of baseline and follow-up measurements for each individual. It is visible that the change from baseline to follow-up is larger in the treatment group than in the control group. Figure 7.7 now directly compares the change scores.

Figure 7.6: Individual baseline and follow-up measurements in the Didgeridoo Study by treatment group.

Figure 7.7: Change scores for primary endpoint in the Didgeridoo Study.
A change score analysis for the Didgeridoo Study yields:
##
## Two Sample t-test
##
## data: change.score by treatment
## t = 2.2748, df = 23, p-value = 0.03256
## alternative hypothesis: true difference in means between group Control and group Didgeridoo is not equal to 0
## 95 percent confidence interval:
## 0.2695582 5.6784938
## sample estimates:
## mean in group Control mean in group Didgeridoo
## -1.454545 -4.428571
## [1] 2.974026
This result can also be found in an extract of the reported results in Figure 7.8.
7.3.2 Analysis of covariance (ANCOVA)
Analysis of covariance (ANCOVA) is an extension of the change score analysis. First note that the change score analysis can also be done with regression, see model m2 in Example 7.3. There, the term offset(baseline) fixes the coefficient of baseline at 1. It is therefore natural to extend this regression model to the ANCOVA model, as is done in model m3 of the example, where the coefficient β of baseline is estimated. The ANCOVA model reduces
- to the analysis of follow-up for β=0,
- to the analysis of change scores for β=1.
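That a regression with an offset reproduces the change score analysis can be verified on simulated data. The sketch below uses hypothetical values (not the Didgeridoo measurements); the variable names mirror those used in the chapter.

```r
set.seed(3)
n <- 20
# Simulated trial for illustration only
treatment <- factor(rep(c("Control", "Didgeridoo"), each = n))
baseline  <- rnorm(2 * n, mean = 11, sd = 4)
f.up <- 3 + 0.7 * baseline - 2.5 * (treatment == "Didgeridoo") +
  rnorm(2 * n, sd = 3)

m_offset <- lm(f.up ~ treatment + offset(baseline))  # coefficient of baseline fixed at 1
m_change <- lm(I(f.up - baseline) ~ treatment)       # change score as outcome
m_ancova <- lm(f.up ~ treatment + baseline)          # coefficient of baseline estimated

# The first two treatment effects coincide, the ANCOVA estimate differs
c(offset = coef(m_offset)[["treatmentDidgeridoo"]],
  change = coef(m_change)[["treatmentDidgeridoo"]],
  ancova = coef(m_ancova)[["treatmentDidgeridoo"]])
```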
Denote the different outcome means as μB at baseline (in both groups), μ at follow-up in the control group, μ+Δ at follow-up in the treatment group. The mean difference Δ is of primary interest. Let σ2 be the common variance of all measurements and ρ the correlation between baseline and follow-up measurements. Assume there are n observations in each group.
The ANCOVA model estimates $\beta$ and the mean difference $\Delta$ jointly with multiple regression. With $\sigma_B^2$ and $\sigma_F^2$ being the variances of baseline and follow-up, it holds that $\beta = \rho\,\sigma_F/\sigma_B$, which simplifies to $\beta = \rho$ if the variance does not change from baseline to follow-up, i.e. if $\sigma_B^2 = \sigma_F^2$.
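The expression for the slope follows from the usual formula for a regression coefficient. Writing $F$ and $B$ for the follow-up and baseline measurement of a patient within a treatment group (notation introduced here only for this sketch), we have

```latex
\beta = \frac{\mathrm{Cov}(F, B)}{\mathrm{Var}(B)}
      = \frac{\rho\, \sigma_F\, \sigma_B}{\sigma_B^2}
      = \rho\, \frac{\sigma_F}{\sigma_B}.
```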
Example 7.3 Comparing the three different analysis methods in the Didgeridoo Study:
# Follow-up analysis
m1 <- lm(f.up ~ treatment)
knitr::kable(tableRegression(m1, intercept=FALSE, latex = FALSE))
| | Coefficient | 95%-confidence interval | p-value |
|---|---|---|---|
| treatmentDidgeridoo | -2.28 | from -5.90 to 1.34 | 0.21 |
# Change score analysis
m2 <- lm(f.up ~ treatment + offset(baseline))
knitr::kable(tableRegression(m2, intercept=FALSE, latex = FALSE))
| | Coefficient | 95%-confidence interval | p-value |
|---|---|---|---|
| treatmentDidgeridoo | -2.97 | from -5.68 to -0.27 | 0.033 |
# ANCOVA
m3 <- lm(f.up ~ treatment + baseline)
knitr::kable(tableRegression(m3, intercept = FALSE, latex = FALSE))
| | Coefficient | 95%-confidence interval | p-value |
|---|---|---|---|
| treatmentDidgeridoo | -2.74 | from -5.14 to -0.35 | 0.027 |
| baseline | 0.67 | from 0.42 to 0.92 | < 0.0001 |
7.3.2.1 Comparison of effect estimates
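Let $\bar F_T$ and $\bar F_C$ be the mean follow-up values and $\bar B_T$ and $\bar B_C$ the mean baseline values in the treatment and control group, respectively. Then the effect estimates of the follow-up analysis and of the change score analysis are
$$\hat\Delta_1 = \bar F_T - \bar F_C, \qquad \hat\Delta_2 = (\bar F_T - \bar B_T) - (\bar F_C - \bar B_C).$$
Now, let $\bar b_T$ and $\bar b_C$ denote the observed mean baseline values in the current trial. Conditioning on these baseline values, we have (the proof is part of the exercises):
$$E(\hat\Delta_1 \mid \bar b_T, \bar b_C) = \Delta + \rho \cdot (\bar b_T - \bar b_C), \qquad E(\hat\Delta_2 \mid \bar b_T, \bar b_C) = \Delta + (\rho - 1) \cdot (\bar b_T - \bar b_C).$$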
Hence, given the mean baseline values $\bar b_T$ and $\bar b_C$, $\hat\Delta_1$ and $\hat\Delta_2$ are unbiased if there is baseline balance ($\bar b_T = \bar b_C$). However, both are biased whenever the following two conditions hold:
- There is (positive) correlation $0 < \rho < 1$ between baseline and follow-up measurements.
- There is baseline imbalance ($\bar b_T \neq \bar b_C$).

In the Didgeridoo Study, there is baseline imbalance: $\bar b_T = 11.8$, $\bar b_C = 11.1$. These values follow from the group means reported above as mean follow-up minus mean change score.
The ANCOVA estimate
$$\hat\Delta_3 = \bar F_T - \bar F_C - \rho \cdot (\bar b_T - \bar b_C),$$
on the other hand, is an unbiased estimate of the mean difference $\Delta$ (see proof in the exercises).
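The three estimates can be recomputed from the group means in the outputs of the Didgeridoo Study. This is a sketch under one assumption: the unknown $\rho$ is replaced by the estimated baseline coefficient 0.67 from model m3, which is only available in rounded form, so the ANCOVA estimate is matched only approximately.

```r
# Group means as reported in the t-test outputs above
f_up   <- c(Control = 9.636364,  Didgeridoo = 7.357143)   # follow-up
change <- c(Control = -1.454545, Didgeridoo = -4.428571)  # change score
base   <- f_up - change   # baseline = follow-up - change score

d1 <- f_up[["Didgeridoo"]] - f_up[["Control"]]        # follow-up analysis
d2 <- change[["Didgeridoo"]] - change[["Control"]]    # change score analysis
beta_hat <- 0.67          # rounded baseline coefficient from model m3 (assumption)
d3 <- d1 - beta_hat * (base[["Didgeridoo"]] - base[["Control"]])  # ANCOVA

round(base, 1)
round(c(d1 = d1, d2 = d2, d3 = d3), 2)
# d1 = -2.28, d2 = -2.97, d3 = -2.74
```

The results agree with the treatment coefficients of models m1, m2 and m3 in Example 7.3.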
The variances of the effect estimates in the three models can be compared by the corresponding variance factors (derived in the exercises), defined through
$$\mathrm{Var}(\hat\Delta) = \text{variance factor} \cdot \sigma^2/n.$$
The three models give
$$\mathrm{Var}(\hat\Delta_1) = 2 \cdot \sigma^2/n, \qquad \mathrm{Var}(\hat\Delta_2) = 4(1-\rho) \cdot \sigma^2/n, \qquad \mathrm{Var}(\hat\Delta_3) = 2(1-\rho^2) \cdot \sigma^2/n.$$
Figure 7.9 compares the variance factors of the three models for varying correlations $\rho$. The variance of $\hat\Delta_3$ is never larger than the variances of $\hat\Delta_1$ and $\hat\Delta_2$. Hence, the required sample size for ANCOVA is reduced by the factor $1-\rho^2$ compared to the standard comparison of two groups without baseline adjustment. For $\rho > 1/2$, $\hat\Delta_2$ has smaller variance than $\hat\Delta_1$, and so produces narrower CIs and more powerful tests. In the Didgeridoo Study we obtain $\hat\rho = 0.72$.

Figure 7.9: Comparison of variance factors
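The variance factors are easily evaluated for a given correlation. A minimal sketch (the function name `variance_factors` is hypothetical), evaluated at the estimate 0.72 from the Didgeridoo Study:

```r
# Variance factors of the three effect estimates as functions of rho
variance_factors <- function(rho) {
  c(follow_up    = 2,
    change_score = 4 * (1 - rho),
    ancova       = 2 * (1 - rho^2))
}

round(variance_factors(0.72), 3)
# follow_up 2.000, change_score 1.120, ancova 0.963

# At rho = 1/2 the follow-up and change score analyses are equally precise
variance_factors(0.5)
```

At this correlation, ANCOVA requires about 48% (the factor $1 - 0.72^2 \approx 0.48$) of the sample size of the unadjusted follow-up comparison.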
7.3.2.2 Adjusting for other variables
ANCOVA allows a wide range of variables measured at baseline to be used to adjust the mean difference. The safest approach is to select these variables before the trial starts and to specify them in the study protocol. Prognostic variables used to stratify the allocation should always be included as covariates.
Example 7.4 In the Didgeridoo Study, the mean difference has been adjusted for severity of the disease (base.apnoea) and for weight change during the study period (weight.change). The same results as in Figure 7.8 can be obtained with the following regression model.
m4 <- lm(f.up ~ treatment + baseline + weight.change + base.apnoea)
knitr::kable(tableRegression(m4, intercept = FALSE, latex = FALSE))
| | Coefficient | 95%-confidence interval | p-value |
|---|---|---|---|
| treatmentDidgeridoo | -2.75 | from -5.35 to -0.15 | 0.039 |
| baseline | 0.67 | from 0.41 to 0.93 | < 0.0001 |
| weight.change | -0.17 | from -0.92 to 0.57 | 0.63 |
| base.apnoea | 0.023 | from -0.25 to 0.29 | 0.86 |
7.4 Additional references
Relevant references are Chapter 10 “Comparing the Means of Small Samples” and Chapter 15 “Multifactorial Methods” in M. Bland (2015) as well as Chapter 6 “Analysis of Results” in J. N. S. Matthews (2006). Analysing controlled trials with baseline and follow-up measurements is discussed in the Statistics Note by Vickers and Altman (2001). An example of a study in which the methods from this chapter are used in practice is Ravaud et al. (2009).