B Statistical tests and \(p\)-values

B.1 Concepts

B.1.1 \(P\)-value

The \(p\)-value \(p\) is a value between \(0\) and \(1\) that can be used for hypothesis testing or for significance testing. We test for evidence against a point null hypothesis H\(_0\), which is given by a reference value, usually \(0\). For example, H\(_0\) could be the hypothesis that there is no treatment effect. The \(p\)-value is correctly interpreted as the probability, under the assumption of the null hypothesis H\(_0\), of obtaining a result equal to or more extreme than the one actually observed.

B.1.2 Hypothesis testing

The type I error rate \(\alpha\) is the probability of rejecting \(H_0\) although \(H_0\) is true. Hypothesis testing aims to control this error probability by keeping it small, usually at \(\alpha = 0.05\). In practice, the \(p\)-value is compared to the threshold \(\alpha\), and \(H_0\) is rejected if and only if \(p\leq \alpha\).

B.1.2.1 Relation to confidence interval

A \(\gamma\cdot 100\)% confidence interval (CI) can be used to carry out a hypothesis test with \(\alpha= 1-\gamma\) by rejecting H\(_0\) if and only if the CI does not contain the reference value. If the reference value lies exactly on the boundary of the CI, then the \(p\)-value is \(1-\gamma\). Below, we show how to compute a \(p\)-value from a CI in the case of a \(z\)-test or a \(t\)-test.

Suppose we know the estimate \(\widehat{\theta}\) and the lower and upper limits \(l\) and \(u\) of the \(\gamma\cdot 100\)% Wald CI. Using the definition of the Wald CI, we can compute the standard error:

\[\begin{equation*} \mbox{se}(\widehat{\theta})=\dfrac{u-l}{2z_\gamma} \end{equation*}\]

From this, we can compute the observed value of the test statistic (for the reference value \(0\)) and hence the two-sided \(p\)-value:

\[\begin{equation*} p = 2\bigl(1-\text{pnorm}(|\widehat{\theta}|/\mbox{se}(\widehat{\theta}))\bigr) \end{equation*}\]

The same can be done if we know the symmetric CI based on the \(t\)-distribution, by exchanging \(z_\gamma\) and pnorm() with \(t_\gamma\) and pt(), using the appropriate degrees of freedom of the \(t\)-distribution.
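
As a minimal sketch (not from the source), the following R code illustrates this computation for hypothetical CI limits; \(z_\gamma\) is taken as the \((1+\gamma)/2\) quantile of the standard normal distribution, and the degrees of freedom for the \(t\)-based version are an assumed value.

## Hypothetical estimate and 95% Wald CI limits (illustrative values only)
theta_hat <- 2.3
l <- 0.8
u <- 3.8
gamma <- 0.95

## Standard error from the CI width
z_gamma <- qnorm((1 + gamma) / 2)
se_theta <- (u - l) / (2 * z_gamma)

## Two-sided p-value of the z-test against the reference value 0
2 * (1 - pnorm(abs(theta_hat) / se_theta))

## t-based version: exchange qnorm()/pnorm() for qt()/pt()
## with the appropriate degrees of freedom (assumed here)
df <- 30
t_gamma <- qt((1 + gamma) / 2, df = df)
se_t <- (u - l) / (2 * t_gamma)
2 * (1 - pt(abs(theta_hat) / se_t, df = df))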

While the \(p\)-value quantifies the strength of evidence in one number, the CI shows the effect size and the amount of uncertainty. This is different information, hence both \(p\)-value and CI should be reported.

B.1.3 Significance testing

Significance testing uses the \(p\)-value to quantify the strength of evidence against the null hypothesis. Instead of a strict threshold, the interval from \(0\) to \(1\) is divided into regions of “weak evidence”, “strong evidence” etc. (see Figure 5.1).

B.2 Continuous outcome

Commonly used tests for continuous outcomes are the \(z\)-test and the \(t\)-test. The \(z\)-test assumes that the variance is known or that the sample size is large. Otherwise, the \(t\)-test, which accounts for the estimation of the variance, is more appropriate.

B.2.1 \(z\)-test

The \(p\)-value is the probability that, under the null hypothesis, we would obtain a result for the test statistic at least as extreme as the one observed. If we know the distribution of the test statistic under the null hypothesis, it can be used to compute the \(p\)-value. We consider the test statistic

\[Z=\widehat{\theta}/\mbox{se}(\widehat{\theta}),\]

where \(\widehat{\theta}\) is the estimate of the unknown parameter \(\theta\) and \(\mbox{se}(\widehat{\theta})\) is its standard error. Using a normal approximation of the estimator for large sample sizes, \(Z\) approximately follows a standard normal distribution under \(H_0\) (\(z\)-test, also known as Wald test).

B.2.1.1 Application in one group

Example B.1 In Example A.1, let the null hypothesis be \(H_0\): \(\mu_D = 73\). Two-sided or one-sided \(p\)-values from a \(z\)-test can be computed as follows:

## Data
library(mlbench)
data("PimaIndiansDiabetes2")

## Exclude missing values
ind <- which(!is.na(PimaIndiansDiabetes2$pressure))
pima <- PimaIndiansDiabetes2[ind, ]

## Blood pressure levels for the diabetic group
ind <- which(pima$diabetes == "pos")
diab_bp <- pima$pressure[ind]

n <- length(diab_bp)
mu <- mean(diab_bp-73)
se <- sd(diab_bp-73) / sqrt(n)

## two-sided p-value
library(biostatUZH)
printWaldCI(theta = mu, se.theta = se, conf.level = 0.95)
##      Effect 95% Confidence Interval P-value
## [1,] 2.321  from 0.803 to 3.840     0.003
## one-sided p-value
test_stat <- mu/se
pnorm(q = test_stat, mean = 0, sd = 1, lower.tail = FALSE) # H_0: mu_D <= 73
## [1] 0.001367318
pnorm(q = test_stat, mean = 0, sd = 1, lower.tail = TRUE)  # H_0: mu_D >= 73
## [1] 0.9986327

Note that the test is performed by shifting the blood pressure levels by \(73\) and then testing if these shifted blood pressure levels are different from \(0\).

B.2.2 \(t\)-test

The \(t\)-test assumes independent measurements in two groups that are normally distributed with equal variances:

\[\begin{eqnarray*} \text{Treatment: }&Y_1, \ldots, Y_{m} &\sim \mathop{\mathrm{N}}(\mu_T, \sigma_T^2)\\ \text{Control: }&X_1, \ldots, X_{n} &\sim \mathop{\mathrm{N}}(\mu_C, \sigma_C^2) \end{eqnarray*}\]

The sample sizes are \(m\) and \(n\), respectively, and the equal variance assumption means that \(\sigma_T^2=\sigma_C^2=\sigma^2\). The quantity of interest is the mean difference \(\Delta = \mu_T-\mu_C\). The null hypothesis is

\[H_0: \Delta = 0.\]

The estimate of \(\Delta\) is the difference in sample means

\[\widehat \Delta = \bar{Y} - \bar{X}\]

with standard error

\[\begin{equation*} \mbox{se}(\widehat \Delta) = s \cdot \sqrt{\frac{1}{m} + \frac{1}{n}}, \end{equation*}\]

where

\[s^2 = \frac{(m-1) s_T^2 +(n-1) s_C^2}{m+n-2}\]

is an estimate of the common variance \(\sigma^2\). Here, \(s_T^2\) and \(s_C^2\) are the estimates of the variances \(\sigma_T^2\) and \(\sigma_C^2\) in the two groups. The \(t\)-test statistic is

\[T = \frac{\widehat \Delta}{\mbox{se}(\widehat \Delta)}.\]

Assuming \(H_0\) is true, the test statistic \(T\) follows a \(t\)-distribution with \(m+n-2\) “degrees of freedom” (df). In the case of only one group of size \(n\) (for example, a one-sample test or a paired test based on within-pair differences), \(T\) follows a \(t\)-distribution with \(n-1\) degrees of freedom.

B.2.2.1 Application in one group

Example B.2 The \(p\)-value from a \(t\)-test can be read from the t.test output, here applied to the data of Example B.1. The argument mu = 73 shifts the reference value from \(0\) (the default) to \(73\). The argument alternative can be set to "two.sided" or to "less"/"greater" for one-sided tests.

t.test(diab_bp, mu = 73, alternative = "two.sided", 
       conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  diab_bp
## t = 2.9961, df = 251, p-value = 0.003009
## alternative hypothesis: true mean is not equal to 73
## 95 percent confidence interval:
##  73.79545 76.84740
## sample estimates:
## mean of x 
##  75.32143

B.2.2.2 Application in two unpaired groups

See Example 7.1 and Section ??.
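
Since the source only cross-references Example 7.1 here, the following is a minimal sketch (not from the source) with simulated data, computing the pooled variance, the standard error, and the two-sided \(p\)-value by hand and comparing the result with R's built-in t.test:

## Simulated data for illustration (hypothetical values)
set.seed(1)
y <- rnorm(20, mean = 5, sd = 2)  # treatment group, m = 20
x <- rnorm(25, mean = 4, sd = 2)  # control group, n = 25
m <- length(y)
n <- length(x)

## Pooled variance and standard error of the mean difference
s2 <- ((m - 1) * var(y) + (n - 1) * var(x)) / (m + n - 2)
se_delta <- sqrt(s2 * (1 / m + 1 / n))

## Test statistic and two-sided p-value
t_stat <- (mean(y) - mean(x)) / se_delta
2 * pt(abs(t_stat), df = m + n - 2, lower.tail = FALSE)

## Should agree with the built-in equal-variance t-test
t.test(y, x, var.equal = TRUE)$p.value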

B.2.2.3 Application in two paired groups

See Example 5.1.
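
Since the source only cross-references Example 5.1 here, this is a minimal sketch with simulated paired data (hypothetical values), illustrating that the paired \(t\)-test is equivalent to a one-sample \(t\)-test on the within-pair differences:

## Simulated paired measurements (hypothetical values)
set.seed(1)
before <- rnorm(30, mean = 80, sd = 10)
after <- before - rnorm(30, mean = 3, sd = 5)

## Paired t-test and the equivalent one-sample t-test on the differences
t.test(after, before, paired = TRUE)$p.value
t.test(after - before, mu = 0)$p.value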

B.3 Binary outcome

For binary outcomes, Wald CIs are often not appropriate. However, \(p\)-values cannot easily be computed from CIs other than Wald or \(t\)-test CIs. In these cases, it is common to use the \(\chi^2\)-test, Fisher’s exact test for small samples, or McNemar’s test for paired data. These tests generally assess the null hypothesis that the outcome is independent of the group, that is, that the event proportions are the same in the two groups.

B.3.1 \(\chi^2\)-test

The \(\chi^2\)-test computes the expected number of cases \(e_i\) in each cell \(i=1,\ldots,4\) of a \(2 \times 2\) table under the assumption of no difference between the groups.

Table B.1: Results of the APSAC Study from Example ??

Therapy   Dead   Alive   Total
APSAC        9     153     162
Heparin     19     132     151
Total       28     285     313

The expected number of cases in a particular cell is defined as the product of the corresponding row and column totals divided by the total number of participants. For example, the expected number of cases in the first cell of Table B.1 is

\[\begin{equation*} \left(\frac{162}{313}\cdot \frac{28}{313}\right) \cdot 313 = \frac{162\cdot 28}{313} = 14.49, \end{equation*}\]

to be compared with the 9 observed cases. The term in brackets is the probability of falling in this cell based on the marginal frequencies (in this case, the first row and the first column). Multiplying this probability by the total number of participants yields the expected number of cases in this cell. The expected frequencies for agreement by chance are calculated in the same way (Table 3.3).

The expected numbers of cases \(e_i\) are then compared with the observed numbers of cases \(y_i\) using the test statistic \[ T = \sum_i \frac{(y_i - e_i)^2}{e_i}. \] Under the null hypothesis of no difference between the two groups, this test statistic follows a \(\chi^2\)-distribution with 1 degree of freedom, so a \(p\)-value can easily be calculated. Note that this is a two-sided test: because the differences are squared, it does not matter whether \(y_i<e_i\) or \(y_i>e_i\).

The \(\chi^2\)-test with continuity correction is based on the modified test statistic \[ T = \sum_i \frac{\left((\left\lvert y_i - e_i\right\rvert - 0.5)_+\right)^2}{e_i}, \] where \(x_+ = \max\{x, 0\}\). What we have illustrated here for \(2 \times 2\) tables can also be generalized to more categories.
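
As a minimal sketch (not from the source), the \(\chi^2\)-test for Table B.1 can be carried out in R as follows; the object name tabAPSAC is our own:

## Table B.1 as a 2 x 2 matrix
tabAPSAC <- matrix(c(9, 19, 153, 132), nrow = 2,
                   dimnames = list(Therapy = c("APSAC", "Heparin"),
                                   Outcome = c("Dead", "Alive")))

## Expected number of cases from the products of the margins
outer(rowSums(tabAPSAC), colSums(tabAPSAC)) / sum(tabAPSAC)

## Chi-squared test without and with continuity correction
chisq.test(tabAPSAC, correct = FALSE)
chisq.test(tabAPSAC, correct = TRUE)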

B.3.1.1 Application in two unpaired groups

See Example ??. By default, chisq.test applies the continuity correction.

B.3.1.2 Application in two paired groups

Example B.3 In Example A.5 with paired data, McNemar's test can be performed in R as follows (the default is with continuity correction):

## Data
tabIschemia <- matrix(c(14, 0, 5, 22), nrow = 2)
colnames(tabIschemia) <- c("Lab ischemic", "Lab normal")
rownames(tabIschemia) <- c("Clin ischemic", "Clin normal")

print(tabIschemia)
##               Lab ischemic Lab normal
## Clin ischemic           14          5
## Clin normal              0         22
## With continuity correction
mcnemar.test(x = tabIschemia, correct = TRUE)
## 
##  McNemar's Chi-squared test with continuity correction
## 
## data:  tabIschemia
## McNemar's chi-squared = 3.2, df = 1, p-value =
## 0.07364
## Without continuity correction
mcnemar.test(x = tabIschemia, correct = FALSE)
## 
##  McNemar's Chi-squared test
## 
## data:  tabIschemia
## McNemar's chi-squared = 5, df = 1, p-value = 0.02535

B.3.2 Fisher’s exact test

Fisher’s exact test is based on the probabilities of all possible tables with the observed row and column totals under the null hypothesis of no difference between the groups. It is a two-sided test; there are three different versions, and it can also be generalized to more categories.

B.3.2.1 Application in two unpaired groups

See Example ??. The fisher.test also provides an odds ratio with CI.
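
A minimal sketch (not from the source), reusing the tabAPSAC matrix defined above for Table B.1:

## Fisher's exact test for Table B.1, including an odds ratio with CI
fisher.test(tabAPSAC)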

B.4 Survival outcome

For survival outcomes, the log-rank test can be used to compare two treatment groups. It gives a \(p\)-value based on comparing the observed with the expected number of events under the null hypothesis of no difference between the two groups. An estimate of the hazard ratio with CI can be derived from the observed and expected numbers of events in the two groups.
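
As a minimal sketch (not from the source) with simulated data, a log-rank test can be carried out with survdiff() from the survival package; a rough hazard ratio estimate from the observed and expected numbers of events is computed from the components of the returned object:

## Simulated survival data for illustration (hypothetical values)
library(survival)
set.seed(1)
n <- 100
group <- rep(c("A", "B"), each = n / 2)
time <- rexp(n, rate = ifelse(group == "A", 0.10, 0.15))
status <- rbinom(n, size = 1, prob = 0.8)  # 1 = event, 0 = censored

## Log-rank test
fit <- survdiff(Surv(time, status) ~ group)
fit

## Rough hazard ratio estimate from observed and expected events
(fit$obs[1] / fit$exp[1]) / (fit$obs[2] / fit$exp[2])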