B Statistical tests and \(P\)-values
B.1 Concepts
B.1.1 \(P\)-value
The \(P\)-value \(p\) is a value between \(0\) and \(1\) that can be used for hypothesis testing or for significance testing. We test for evidence against a point null hypothesis \(H_0\), given by a reference value, usually \(0\). For example, \(H_0\) could be the hypothesis that there is no treatment effect. The \(P\)-value is correctly interpreted as the probability, under the assumption of the null hypothesis \(H_0\), of obtaining a result equal to or more extreme than the one actually observed.
B.1.2 Hypothesis testing
The type I error rate \(\alpha\) is the probability of rejecting \(H_0\) although \(H_0\) is true. Hypothesis testing aims to control this error probability by keeping it small, usually at \(\alpha = 0.05\). In practice, the \(P\)-value is compared to the threshold \(\alpha\), and \(H_0\) is rejected if and only if \(p\leq \alpha\).
B.1.2.1 Relation to confidence interval
A \(\gamma\cdot 100\)% confidence interval (CI) can be used to carry out a hypothesis test with \(\alpha= 1-\gamma\) by rejecting \(H_0\) if and only if the CI does not contain the reference value. If the reference value lies exactly on the boundary of the CI, then the \(P\)-value is \(1-\gamma\). Below, we show how to compute a \(P\)-value from a CI in the case of a \(z\)-test or a \(t\)-test, see also Altman and Bland (2011a, 2011b).
Suppose we know the estimate \(\widehat{\theta}\) and the lower and upper limits \(l\) and \(u\) of the \(\gamma\cdot 100\)% Wald CI. Using the definition of the Wald CI, we can compute the standard error:
\[\begin{equation*} \mbox{se}(\widehat{\theta})=\dfrac{u-l}{2z_\gamma}, \end{equation*}\]
where \(z_\gamma\) is the \((1+\gamma)/2\) quantile of the standard normal distribution.
From this, we can recover the observed value of the test statistic and hence compute the \(P\)-value (for reference value \(0\)):
\[\begin{equation*} p = 2\bigl(1-\text{pnorm}\bigl(\lvert\widehat{\theta}\rvert/\mbox{se}(\widehat{\theta})\bigr)\bigr) \end{equation*}\]
The same can be done if we know the symmetric CI based on the \(t\)-distribution, by exchanging \(z_\gamma\) and pnorm() with \(t_\gamma\) and pt(), using the correct degrees of freedom for the \(t\)-distribution.
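As an illustration, here is a minimal sketch in R with hypothetical CI limits (the numbers are made up for illustration):
## Hypothetical 95% Wald CI from 0.8 to 3.8
gamma <- 0.95
l <- 0.8
u <- 3.8
theta <- (l + u) / 2               # estimate (midpoint of a symmetric CI)
z_gamma <- qnorm((1 + gamma) / 2)  # standard normal quantile, 1.96 for gamma = 0.95
se <- (u - l) / (2 * z_gamma)      # standard error recovered from the CI width
2 * (1 - pnorm(abs(theta) / se))   # two-sided p-value (reference value 0)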
While the \(P\)-value quantifies the strength of evidence in one number, the CI shows the effect size and the amount of uncertainty. They convey different information, hence both the \(P\)-value and the CI should be reported.
B.1.3 Significance testing
Significance testing uses the \(P\)-value to quantify the strength of evidence against the null hypothesis. Instead of applying a strict threshold, the interval from \(0\) to \(1\) is divided into regions of “weak evidence”, “strong evidence”, etc. (see Figure 5.1).
B.2 Continuous outcome
Commonly used tests for continuous outcomes are the \(z\)-test and the \(t\)-test. The \(z\)-test should only be used if the variance is known or the sample size is large. Otherwise, the \(t\)-test is more appropriate.
B.2.1 \(z\)-test
If we know the reference distribution of the test statistic under the null hypothesis, it can be used to compute the \(P\)-value. We consider the estimated test statistic
\[Z=\widehat{\theta}/\mbox{se}(\widehat{\theta}),\]
where \(\widehat{\theta}\) is the estimate of the unknown parameter \(\theta\) and \(\mbox{se}(\widehat{\theta})\) is the standard error. Using the normal approximation of the estimator for large sample sizes, \(Z\) approximately follows a standard normal distribution under \(H_0\) (\(z\)-test, also called Wald test).
B.2.1.1 Application in one group
Example B.1 In Example A.1, let the null hypothesis be \(H_0\): \(\mu_D = 73\). Two-sided or one-sided \(P\)-values from a \(z\)-test can be computed as follows:
## Data
library(mlbench)
data("PimaIndiansDiabetes2")
## Exclude missing values
ind <- which(!is.na(PimaIndiansDiabetes2$pressure))
pima <- PimaIndiansDiabetes2[ind, ]
## Blood pressure levels for the diabetic group
ind <- which(pima$diabetes == "pos")
diab_bp <- pima$pressure[ind]
n <- length(diab_bp)
mu <- mean(diab_bp - 73)
se <- sd(diab_bp - 73) / sqrt(n)
## two-sided p-value
library(biostatUZH)
printWaldCI(theta = mu, se.theta = se, conf.level = 0.95)
## Effect 95% Confidence Interval P-value
## [1,] 2.321 from 0.803 to 3.840 0.003
## one-sided p-values
test_stat <- mu / se
pnorm(q = test_stat, mean = 0, sd = 1, lower.tail = FALSE) # H_1: mu_D > 73
## [1] 0.001367318
pnorm(q = test_stat, mean = 0, sd = 1, lower.tail = TRUE)  # H_1: mu_D < 73
## [1] 0.9986327
Note that the test is performed by shifting the blood pressure levels by \(73\) and then testing whether these shifted blood pressure levels differ from \(0\).
B.2.2 \(t\)-test
The \(t\)-test assumes independent measurements in two groups that are normally distributed with equal variances:
\[\begin{eqnarray*} \text{Treatment: }&Y_1, \ldots, Y_{m} &\sim \mathop{\mathrm{N}}(\mu_T, \sigma_T^2)\\ \text{Control: }&X_1, \ldots, X_{n} &\sim \mathop{\mathrm{N}}(\mu_C, \sigma_C^2) \end{eqnarray*}\]
The sample sizes are \(m\) and \(n\), respectively, and the equal variances assumption implies that \(\sigma_T^2=\sigma_C^2=\sigma^2\). The quantity of interest is the mean difference \(\Delta = \mu_T-\mu_C\). The null hypothesis is
\[H_0: \Delta = 0.\]
The estimate of \(\Delta\) is the difference in sample means
\[\widehat \Delta = \bar{Y} - \bar{X}\]
with standard error
\[\begin{equation*} \mbox{se}(\widehat \Delta) = s \cdot \sqrt{\frac{1}{m} + \frac{1}{n}}, \end{equation*}\]
where
\[s^2 = \frac{(m-1) s_T^2 +(n-1) s_C^2}{m+n-2}\]
is an estimate of the common variance \(\sigma^2\). Here, \(s_T^2\) and \(s_C^2\) are the estimates of the variances \(\sigma_T^2\) and \(\sigma_C^2\) in the two groups. The \(t\)-test statistic is
\[T = \frac{\widehat \Delta}{\mbox{se}(\widehat \Delta)}.\]
Assuming \(H_0\) is true, the test statistic \(T\) follows a \(t\)-distribution with \(m+n-2\) “degrees of freedom” (df). In the case of only one group of size \(n\), it is a \(t\)-distribution with \(n-1\) degrees of freedom.
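As an illustration, the following sketch computes the two-sample \(t\)-test by hand and compares it with R’s built-in t.test(); it reuses the pima data frame constructed in Example B.1:
## Blood pressure by diabetes status (pima data from Example B.1)
y <- pima$pressure[pima$diabetes == "pos"]
x <- pima$pressure[pima$diabetes == "neg"]
m <- length(y)
n <- length(x)
## Pooled variance estimate and standard error of the mean difference
s2 <- ((m - 1) * var(y) + (n - 1) * var(x)) / (m + n - 2)
se <- sqrt(s2 * (1/m + 1/n))
t_stat <- (mean(y) - mean(x)) / se
## Two-sided p-value from the t-distribution with m + n - 2 df
2 * pt(abs(t_stat), df = m + n - 2, lower.tail = FALSE)
## Same test with the built-in function
t.test(y, x, var.equal = TRUE)$p.value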
B.3 Binary outcome
For binary outcomes, Wald CIs are often not appropriate. However, \(P\)-values cannot easily be computed from CIs other than Wald or \(t\)-test CIs. In these cases, it is common to use the \(\chi^2\)-test, Fisher’s exact test for small samples, or McNemar’s test for paired data. These tests assess the null hypothesis that the outcome is independent of the investigated factor, that is, that the event proportions are the same in the two groups.
B.3.1 \(\chi^2\)-test
The \(\chi^2\)-test computes the expected number of cases \(e_i\) in each cell \(i=1,\ldots,4\) of a \(2 \times 2\) table under the assumption of no difference between the groups.
Table B.1: Survival status by therapy.

| Therapy | Dead | Alive | Total |
|---------|------|-------|-------|
| APSAC   | 9    | 153   | 162   |
| Heparin | 19   | 132   | 151   |
| Total   | 28   | 285   | 313   |
The expected number of cases in a particular cell is defined as the product of the corresponding row and column sums divided by the total number of participants. For example, the expected number of cases in the first cell of Table B.1 is
\[\begin{equation*} \left(\frac{162}{313}\cdot \frac{28}{313}\right) \cdot 313 = \frac{162\cdot 28}{313} = 14.49, \end{equation*}\]
to be compared with the 9 observed cases. The term in brackets is the probability of falling in this cell based on the marginal frequencies (in this case, the first row and first column). Multiplying this probability by the total number of participants yields the expected number of cases in this cell. The expected frequencies for agreement by chance are calculated in the same way (Table 3.3).
The expected numbers of cases \(e_i\) are then compared with the observed numbers of cases \(y_i\) based on the test statistic \[ T = \sum_i \frac{(y_i - e_i)^2}{e_i}. \] Under the null hypothesis of no difference between the two groups, this test statistic follows a \(\chi^2\)-distribution with 1 degree of freedom, so a \(P\)-value can easily be calculated. Note that it is a two-sided test: because the differences are squared, it does not matter whether \(y_i<e_i\) or \(y_i>e_i\).
The \(\chi^2\)-test with continuity correction is based on the modified test statistic \[ T = \sum_i \frac{\left((\left\lvert y_i - e_i\right\rvert - 0.5)_+\right)^2}{e_i}, \] where \(x_+ = \max\{x, 0\}\). What we illustrated here for \(2 \times 2\) tables can also be generalized to more categories.
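For illustration, a minimal sketch applies R’s chisq.test() to Table B.1 (the matrix name tabTherapy is our own choice):
## Table B.1 as a matrix
tabTherapy <- matrix(c(9, 19, 153, 132), nrow = 2,
                     dimnames = list(Therapy = c("APSAC", "Heparin"),
                                     Outcome = c("Dead", "Alive")))
chisq.test(tabTherapy, correct = FALSE)  # without continuity correction
chisq.test(tabTherapy)                   # with continuity correction (default)
chisq.test(tabTherapy)$expected          # expected counts, 14.49 in the first cell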
B.3.1.1 Application in two paired groups
Example B.2 In Example A.6 with paired data, a McNemar test can be performed in R as follows (default is with continuity correction):
## Data
tabIschemia <- matrix(c(14, 0, 5, 22), nrow = 2)
colnames(tabIschemia) <- c("Lab ischemic", "Lab normal")
rownames(tabIschemia) <- c("Clin ischemic", "Clin normal")
print(tabIschemia)
## Lab ischemic Lab normal
## Clin ischemic 14 5
## Clin normal 0 22
mcnemar.test(tabIschemia)
##
## McNemar's Chi-squared test with continuity correction
##
## data: tabIschemia
## McNemar's chi-squared = 3.2, df = 1, p-value =
## 0.07364
mcnemar.test(tabIschemia, correct = FALSE)
##
## McNemar's Chi-squared test
##
## data: tabIschemia
## McNemar's chi-squared = 5, df = 1, p-value = 0.02535
B.3.2 Fisher’s exact test
Fisher’s exact test is based on the probabilities of all possible tables with the observed row and column totals under the null hypothesis of no difference between the groups. It is a two-sided test; there are three different versions, and it can also be generalized to more categories.
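For illustration, a minimal sketch applies R’s fisher.test() to Table B.1:
## Fisher's exact test on Table B.1
tabTherapy <- matrix(c(9, 19, 153, 132), nrow = 2)
fisher.test(tabTherapy)  # two-sided by default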
B.4 Survival outcome
For survival outcomes, the log-rank test can be used to compare two treatment groups. The log-rank test gives a \(P\)-value based on the expected number of events under the null hypothesis of no difference between the two groups. An estimate of the hazard ratio with CI can be derived from the observed and expected number of events in the two groups.
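As an illustration, a minimal sketch of a log-rank test with the survival package, using its lung dataset as a stand-in for a two-group comparison:
library(survival)
## Log-rank test comparing survival between two groups (here: sex in the lung data)
survdiff(Surv(time, status) ~ sex, data = lung)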