B Statistical tests and P-values
B.1 Concepts
B.1.1 P-value
The P-value p is a value between 0 and 1 which can be used for hypothesis testing or for significance testing. We test for evidence against a point null hypothesis H0 which is given by a reference value, usually 0. For example, H0 could be the hypothesis that there is no treatment effect. The P-value is correctly interpreted as the probability, under the assumption of the null hypothesis H0, of obtaining a result equal to or more extreme than what was actually observed.
B.1.2 Hypothesis testing
The type I error rate α is the probability of rejecting H0 although H0 is true. Hypothesis testing aims to control the probability of this type of error by keeping it small, usually α = 0.05. In practice, the P-value is compared to the threshold α and H0 is rejected if and only if p ≤ α.
B.1.2.1 Relation to confidence interval
A γ⋅100% confidence interval (CI) can be used to carry out a hypothesis test with α=1−γ by rejecting H0 if and only if the CI does not contain the reference value. If the reference value lies just on the boundary of the CI, then the P-value is 1−γ. Below, we show how to compute a P-value from a CI in case of a z-test or a t-test, see also Altman and Bland (2011a), Altman and Bland (2011b).
Suppose we know the estimate $\hat{\theta}$ and the lower and upper limits $l$ and $u$ of the $\gamma \cdot 100\%$ Wald CI. Using the definition of the Wald CI, we can compute the standard error:

$$\operatorname{se}(\hat{\theta}) = \frac{u - l}{2\, z_\gamma},$$

where $z_\gamma$ is the $(1+\gamma)/2$ quantile of the standard normal distribution. From this, we can compute the observed value of the test statistic and hence the P-value (for reference value 0):

$$p = 2\left(1 - \texttt{pnorm}\bigl(|\hat{\theta}| / \operatorname{se}(\hat{\theta})\bigr)\right).$$
The same can be done if we know the symmetric CI based on the t-distribution, by exchanging $z_\gamma$ and pnorm() with $t_\gamma$ and pt(), using the correct degrees of freedom for the t-distribution.
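As a minimal sketch in R (the estimate and CI limits below are purely hypothetical numbers chosen for illustration), the two-sided P-value for reference value 0 can be recovered from a 95% Wald CI as follows:

## hypothetical estimate and 95% Wald CI limits (illustration only)
theta <- 2.3
l <- 0.8
u <- 3.8
gamma <- 0.95
## quantile z_gamma used in the gamma * 100% Wald CI
z <- qnorm((1 + gamma) / 2)
## back-calculate the standard error from the width of the CI
se <- (u - l) / (2 * z)
## two-sided P-value for the reference value 0
p <- 2 * (1 - pnorm(abs(theta) / se))
p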
While the P-value quantifies the strength of evidence in one number, the CI shows the effect size and the amount of uncertainty. This is different information, hence both P-value and CI should be reported.
B.1.3 Significance testing
Significance testing uses the P-value to quantify the strength of evidence against the null hypothesis. Instead of a strict threshold, the interval from 0 to 1 is divided into regions of “weak evidence”, “strong evidence” etc. (see Figure 5.1).
B.2 Continuous outcome
Commonly used tests for continuous outcomes are the z-test and the t-test. The z-test should only be used if the sample size is large (or the variance is known). Otherwise, the t-test is more appropriate.
B.2.1 z-test
If we know the reference distribution of the test statistic under the null hypothesis, it can be used to compute the P-value. We consider the estimated test statistic
$$Z = \hat{\theta} / \operatorname{se}(\hat{\theta}),$$

where $\hat{\theta}$ is the estimate of the unknown parameter $\theta$ and $\operatorname{se}(\hat{\theta})$ is its standard error. Using the normal approximation of the estimator for large sample sizes, Z approximately follows a standard normal distribution under H0 (z-test, also called Wald test).
B.2.1.1 Application in one group
Example B.1 In Example A.1, let the null hypothesis be $H_0$: $\mu_D = 73$. Two-sided or one-sided P-values from a z-test can be computed as follows:
## Data
library(mlbench)
data("PimaIndiansDiabetes2")
## Exclude missing values
ind <- which(!is.na(PimaIndiansDiabetes2$pressure))
pima <- PimaIndiansDiabetes2[ind, ]
## Blood pressure levels for the diabetic group
ind <- which(pima$diabetes == "pos")
diab_bp <- pima$pressure[ind]
n <- length(diab_bp)
mu <- mean(diab_bp-73)
se <- sd(diab_bp-73) / sqrt(n)
## two-sided p-value
library(biostatUZH)
printWaldCI(theta = mu, se.theta = se, conf.level = 0.95)
## Effect 95% Confidence Interval P-value
## [1,] 2.321 from 0.803 to 3.840 0.003
## one-sided p-value
test_stat <- mu/se
pnorm(q = test_stat, mean = 0, sd = 1, lower.tail = FALSE) # H_0: mu_D <= 73
## [1] 0.001367318
pnorm(q = test_stat, mean = 0, sd = 1, lower.tail = TRUE)  # H_0: mu_D >= 73
## [1] 0.9986327
Note that the test is performed by shifting the blood pressure levels by 73 and then testing if these shifted blood pressure levels are different from 0.
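Equivalently, the two-sided P-value can also be obtained directly from the test statistic computed above, which reproduces (up to rounding) the value reported by printWaldCI():

## two-sided P-value from the z-test statistic
2 * pnorm(abs(test_stat), lower.tail = FALSE)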
B.2.2 t-test
The t-test assumes independent measurements in two groups that are normally distributed with equal variances:
Treatment: $Y_1, \dots, Y_m \sim \mathrm{N}(\mu_T, \sigma_T^2)$
Control: $X_1, \dots, X_n \sim \mathrm{N}(\mu_C, \sigma_C^2)$

The sample sizes are m and n, respectively, and the equal variances assumption implies that $\sigma_T^2 = \sigma_C^2 = \sigma^2$. The quantity of interest is the mean difference $\Delta = \mu_T - \mu_C$. The null hypothesis is

$$H_0: \Delta = 0.$$
The estimate of $\Delta$ is the difference in sample means

$$\hat{\Delta} = \bar{Y} - \bar{X}$$

with standard error

$$\operatorname{se}(\hat{\Delta}) = s \cdot \sqrt{\frac{1}{m} + \frac{1}{n}},$$

where

$$s^2 = \frac{(m-1)\, s_T^2 + (n-1)\, s_C^2}{m + n - 2}$$

is an estimate of the common variance $\sigma^2$. Here, $s_T^2$ and $s_C^2$ are the estimates of the variances $\sigma_T^2$ and $\sigma_C^2$ in the two groups. The t-test statistic is

$$T = \frac{\hat{\Delta}}{\operatorname{se}(\hat{\Delta})}.$$
Assuming H0 is true, the test statistic T follows a t-distribution with m+n−2 “degrees of freedom” (df). In case of only one group of size n, it is a t-distribution with n−1 degrees of freedom.
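For illustration, such a two-sample t-test can be carried out in R with t.test() and var.equal = TRUE; the comparison below (blood pressure of diabetic vs. non-diabetic women in the pima data from Example B.1) is only an assumed example:

## equal-variance two-sample t-test, reusing the pima data from Example B.1
bp_pos <- pima$pressure[pima$diabetes == "pos"]
bp_neg <- pima$pressure[pima$diabetes == "neg"]
t.test(bp_pos, bp_neg, var.equal = TRUE)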
B.3 Binary outcome
For binary outcomes, Wald CIs are often not appropriate. However, P-values cannot easily be computed from CIs other than Wald or t-test CIs. In these cases, it is common to use the χ2-test, Fisher’s exact test for small samples, or McNemar’s test for paired data. These tests address the null hypothesis that the outcome is independent of the group, i.e. that the proportions of events are the same in the two groups.
B.3.1 χ2-test
The χ2-test computes the expected number of cases $e_i$ in each cell $i = 1, \dots, 4$ of a $2 \times 2$ table under the assumption of no difference between the groups.
Table B.1:

| Therapy | Dead | Alive | Total |
|---------|------|-------|-------|
| APSAC   | 9    | 153   | 162   |
| Heparin | 19   | 132   | 151   |
| Total   | 28   | 285   | 313   |
The expected number of cases in a particular cell is defined as the product of the corresponding row and column sums divided by the total number of participants. For example, the expected number of cases in the first cell of Table B.1 is

$$\left(\frac{162}{313} \cdot \frac{28}{313}\right) \cdot 313 = \frac{162 \cdot 28}{313} = 14.49,$$
to be compared with the 9 observed cases. The term in brackets is the probability of falling in this cell based on the marginal frequencies (in this case, the first row and the first column). Multiplying this probability by the total number of participants yields the expected number of cases in the cell. The expected frequencies for agreement by chance are calculated in the same way (Table 3.3).
The expected numbers of cases $e_i$ are then compared with the observed numbers of cases $y_i$ using the test statistic

$$T = \sum_i \frac{(y_i - e_i)^2}{e_i}.$$

Under the null hypothesis of no difference between the two groups, this test statistic follows a $\chi^2$-distribution with 1 degree of freedom, so a P-value can easily be calculated. Note that this is a two-sided test: because the differences are squared, it does not matter whether $y_i < e_i$ or $y_i > e_i$.
The χ2-test with continuity correction is based on the modified test statistic

$$T = \sum_i \frac{\bigl((|y_i - e_i| - 0.5)^+\bigr)^2}{e_i},$$

where $x^+ = \max(x, 0)$. What we illustrated here for $2 \times 2$ tables can also be generalized to more categories.
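As a sketch of how this could be done in R, the χ2-test for Table B.1 can be computed with chisq.test(); the object name tabTherapy is chosen here for illustration only:

## 2x2 table from Table B.1
tabTherapy <- matrix(c(9, 19, 153, 132), nrow = 2,
                     dimnames = list(Therapy = c("APSAC", "Heparin"),
                                     Outcome = c("Dead", "Alive")))
## expected number of cases under no difference between the groups
chisq.test(tabTherapy)$expected
## chi-squared test with continuity correction (the default)
chisq.test(tabTherapy)
## chi-squared test without continuity correction
chisq.test(tabTherapy, correct = FALSE)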
B.3.1.1 Application in two paired groups
Example B.2 In Example A.6 with paired data, a McNemar test can be performed in R as follows (default is with continuity correction):
## Data
tabIschemia <- matrix(c(14, 0, 5, 22), nrow = 2)
colnames(tabIschemia) <- c("Lab ischemic", "Lab normal")
rownames(tabIschemia) <- c("Clin ischemic", "Clin normal")
print(tabIschemia)
## Lab ischemic Lab normal
## Clin ischemic 14 5
## Clin normal 0 22
mcnemar.test(tabIschemia)
##
## McNemar's Chi-squared test with continuity correction
##
## data: tabIschemia
## McNemar's chi-squared = 3.2, df = 1, p-value =
## 0.07364
mcnemar.test(tabIschemia, correct = FALSE)
##
## McNemar's Chi-squared test
##
## data: tabIschemia
## McNemar's chi-squared = 5, df = 1, p-value = 0.02535
B.3.2 Fisher’s exact test
Fisher’s exact test is based on the probabilities of all possible tables with the observed row and column totals under the null hypothesis of no difference between the groups. It is a two-sided test, of which there are three different versions, and it can also be generalized to more than two categories.
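A minimal sketch in R, again based on the counts from Table B.1 (with the default two-sided alternative):

## Fisher's exact test for the 2x2 table of Table B.1
tabTherapy <- matrix(c(9, 19, 153, 132), nrow = 2)
fisher.test(tabTherapy)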
B.4 Survival outcome
For survival outcomes, the log-rank test can be used to compare two treatment groups. The log-rank test gives a P-value based on the expected number of events under the null hypothesis of no difference between the two groups. An estimate of the hazard ratio with CI can be derived from the observed and expected number of events in the two groups.
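As an illustration with data that are not part of this appendix, the log-rank test can be computed with survdiff() from the survival package; the lung data set and the Cox regression for the hazard ratio are used here only as an assumed example:

library(survival)
## log-rank test comparing the survival of the two sex groups in the lung data
survdiff(Surv(time, status) ~ sex, data = lung)
## hazard ratio with 95% CI, here obtained from a Cox regression as one common alternative
summary(coxph(Surv(time, status) ~ sex, data = lung))$conf.int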