B Statistical tests and P-values

B.1 Concepts

B.1.1 P-value

The P-value p is a value between 0 and 1 which can be used for hypothesis testing or for significance testing. We test for evidence against a point null hypothesis H0, which is given by a reference value, usually 0. For example, H0 could be the hypothesis that there is no treatment effect. The P-value is correctly interpreted as the probability, under the assumption of the null hypothesis H0, of obtaining a result equal to or more extreme than the one actually observed.

B.1.2 Hypothesis testing

The type I error rate α is the probability of rejecting H0 although H0 is true. Hypothesis testing aims to control this error probability by keeping it small, usually at α = 0.05. In practice, the P-value is compared to the threshold α, and H0 is rejected if and only if p ≤ α.

B.1.2.1 Relation to confidence interval

A γ · 100% confidence interval (CI) can be used to carry out a hypothesis test with α = 1 − γ by rejecting H0 if and only if the CI does not contain the reference value. If the reference value lies exactly on the boundary of the CI, then the P-value is 1 − γ. Below, we show how to compute a P-value from a CI in the case of a z-test or a t-test; see also Altman and Bland (2011a) and Altman and Bland (2011b).

Suppose we know the estimate $\hat\theta$ and the lower and upper limits l and u of the γ · 100% Wald CI. Using the definition of the Wald CI, we can compute the standard error:

$$\operatorname{se}(\hat\theta) = \frac{u - l}{2\, z_\gamma},$$

where $z_\gamma$ denotes the $(1+\gamma)/2$ quantile of the standard normal distribution.

From this, we can compute the observed test statistic and hence the two-sided P-value (here for the reference value 0):

$$p = 2\,\bigl(1 - \text{pnorm}\bigl(|\hat\theta| / \operatorname{se}(\hat\theta)\bigr)\bigr)$$

The same can be done if we know a symmetric CI based on the t-distribution, by replacing $z_\gamma$ and pnorm() with $t_\gamma$ and pt(), using the correct degrees of freedom for the t-distribution.
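
As a minimal R sketch of this back-calculation, one can use the estimate and 95% Wald CI reported in Example B.1 below; the degrees of freedom in the t-based variant are an illustrative assumption, not taken from that example:

## Estimate and 95% Wald CI taken from Example B.1 below
theta_hat <- 2.321
lower <- 0.803
upper <- 3.840
gamma <- 0.95

## z_gamma is the (1 + gamma)/2 quantile of the standard normal distribution
z_gamma <- qnorm((1 + gamma) / 2)
se_hat <- (upper - lower) / (2 * z_gamma)

## Two-sided P-value of the z-test against the reference value 0
2 * (1 - pnorm(abs(theta_hat) / se_hat))     # approximately 0.003

## t-based version: qt()/pt() instead of qnorm()/pnorm(); the df value is an
## illustrative assumption here, not taken from Example B.1
df <- 30
t_gamma <- qt((1 + gamma) / 2, df = df)
se_t <- (upper - lower) / (2 * t_gamma)
2 * (1 - pt(abs(theta_hat) / se_t, df = df))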

While the P-value quantifies the strength of evidence in one number, the CI shows the effect size and the amount of uncertainty. This is different information, hence both P-value and CI should be reported.

B.1.3 Significance testing

Significance testing uses the P-value to quantify the strength of evidence against the null hypothesis. Instead of a strict threshold, the interval from 0 to 1 is divided into regions of “weak evidence”, “strong evidence” etc. (see Figure 5.1).
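
Figure 5.1 is not reproduced in this appendix. As a rough sketch, such a categorization could be implemented as follows; the cut-points and labels are illustrative assumptions of this sketch and need not match Figure 5.1 exactly:

## Illustrative categorization of a P-value into evidence categories;
## the cut-points and labels are assumptions of this sketch and need not
## match Figure 5.1 exactly
evidence <- function(p) {
  cut(p, breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
      labels = c("very strong", "strong", "moderate",
                 "weak", "little or no"),
      include.lowest = TRUE)
}
evidence(c(0.0005, 0.004, 0.03, 0.08, 0.4))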

B.2 Continuous outcome

Commonly used tests for continuous outcomes are the z-test and the t-test. The z-test should only be used if the variance is known or the sample size is large. Otherwise, the t-test is more appropriate.

B.2.1 z-test

If we know the reference distribution of the test statistic under the null hypothesis, it can be used to compute the P-value. We consider the estimated test statistic

$$Z = \hat\theta / \operatorname{se}(\hat\theta),$$

where $\hat\theta$ is the estimate of the unknown parameter $\theta$ and $\operatorname{se}(\hat\theta)$ is its standard error. Using a normal approximation of the estimator for large sample sizes, Z approximately follows a standard normal distribution (z-test, also called Wald test).

B.2.1.1 Application in one group

Example B.1 In Example A.1, let the null hypothesis be H0: $\mu_D = 73$. Two-sided and one-sided P-values from a z-test can be computed as follows:

## Data
library(mlbench)
data("PimaIndiansDiabetes2")

## Exclude missing values
ind <- which(!is.na(PimaIndiansDiabetes2$pressure))
pima <- PimaIndiansDiabetes2[ind, ]

## Blood pressure levels for the diabetic group
ind <- which(pima$diabetes == "pos")
diab_bp <- pima$pressure[ind]

n <- length(diab_bp)
mu <- mean(diab_bp-73)
se <- sd(diab_bp-73) / sqrt(n)

## two-sided p-value
library(biostatUZH)
printWaldCI(theta = mu, se.theta = se, conf.level = 0.95)
##      Effect 95% Confidence Interval P-value
## [1,] 2.321  from 0.803 to 3.840     0.003
## one-sided p-value
test_stat <- mu/se
pnorm(q = test_stat, mean = 0, sd = 1, lower.tail = FALSE) # H_0: mu_D <= 73
## [1] 0.001367318
pnorm(q = test_stat, mean = 0, sd = 1, lower.tail = TRUE)  # H_0: mu_D >= 73
## [1] 0.9986327

Note that the test is performed by shifting the blood pressure levels by 73 and then testing if these shifted blood pressure levels are different from 0.
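
Equivalently, one can keep the original blood pressure values and test against the reference value 73 directly; the following sketch (reusing diab_bp and n from the code above) should reproduce the test statistic and P-value shown above, since the standard deviation is unaffected by the shift:

## Equivalent computation without shifting: test H_0: mu_D = 73 directly,
## reusing diab_bp and n from the code above
mu_hat <- mean(diab_bp)
se_hat <- sd(diab_bp) / sqrt(n)
(mu_hat - 73) / se_hat                         # same test statistic as mu / se
2 * (1 - pnorm(abs(mu_hat - 73) / se_hat))     # two-sided P-value as above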

B.2.2 t-test

The t-test assumes independent measurements in two groups that are normally distributed with equal variances:

$$\text{Treatment: } Y_1, \ldots, Y_m \sim \mathrm{N}(\mu_T, \sigma_T^2) \qquad \text{Control: } X_1, \ldots, X_n \sim \mathrm{N}(\mu_C, \sigma_C^2)$$

The sample sizes are m and n, respectively, and the equal variances assumption implies that $\sigma_T^2 = \sigma_C^2 = \sigma^2$. The quantity of interest is the mean difference $\Delta = \mu_T - \mu_C$. The null hypothesis is

$$H_0 : \Delta = 0.$$

The estimate of Δ is the difference in sample means

$$\hat\Delta = \bar Y - \bar X$$

with standard error

$$\operatorname{se}(\hat\Delta) = s\,\sqrt{\frac{1}{m} + \frac{1}{n}},$$

where

$$s^2 = \frac{(m-1)\,s_T^2 + (n-1)\,s_C^2}{m+n-2}$$

is an estimate of the common variance $\sigma^2$. Here, $s_T^2$ and $s_C^2$ are the estimates of the variances $\sigma_T^2$ and $\sigma_C^2$ in the two groups. The t-test statistic is

$$T = \frac{\hat\Delta}{\operatorname{se}(\hat\Delta)}.$$

Assuming H0 is true, the test statistic T follows a t-distribution with m + n − 2 “degrees of freedom” (df). In case of only one group of size n, it is a t-distribution with n − 1 degrees of freedom.
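
As a minimal sketch with hypothetical data (the simulated values below are not from the examples in this book), the t-test statistic and P-value can be computed by hand and compared with R's t.test():

## Hypothetical data for two groups (not from the examples in this book)
set.seed(1)
y <- rnorm(20, mean = 5, sd = 2)   # treatment group, m = 20
x <- rnorm(25, mean = 4, sd = 2)   # control group, n = 25
m <- length(y)
n <- length(x)

## Pooled variance, standard error, test statistic and two-sided P-value
s2 <- ((m - 1) * var(y) + (n - 1) * var(x)) / (m + n - 2)
se_delta <- sqrt(s2) * sqrt(1 / m + 1 / n)
T_stat <- (mean(y) - mean(x)) / se_delta
2 * (1 - pt(abs(T_stat), df = m + n - 2))

## Should agree with the built-in t-test assuming equal variances
t.test(y, x, var.equal = TRUE)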

B.3 Binary outcome

For binary outcomes, Wald CIs are often not appropriate. However, P-values cannot be easily computed from CIs other than Wald or t-test CIs. In these cases, it is common to use the χ2-test, Fisher’s exact test for small samples, or McNemar’s test for paired data. These tests generally test the null hypothesis that the investigated factor is independent of the group, i.e. that the proportions of events are the same in the two groups.

B.3.1 χ2-test

The χ2-test computes the expected number of cases $e_i$ in each cell $i = 1, \ldots, 4$ of a 2 × 2 table under the assumption of no difference between the groups.

Table B.1: Results of the APSAC Study from Example 8.1

Therapy    Dead   Alive   Total
APSAC         9     153     162
Heparin      19     132     151
Total        28     285     313

The expected number of cases in a particular cell is defined as the product of the corresponding row and column sums divided by the total number of participants. For example, the expected number of cases in the first cell of Table B.1 is

$$\left(\frac{162}{313} \cdot \frac{28}{313}\right) \cdot 313 = \frac{162 \cdot 28}{313} = 14.49,$$

to be compared with 9 observed cases. The part in brackets is the probability of falling into this cell based on the marginal frequencies (in this case, of the first row and first column). Multiplying this probability by the total number of participants yields the expected number of cases in this cell. The expected frequencies for agreement by chance are calculated in the same way (Table 3.3).

The expected number of cases $e_i$ are then compared with the observed number of cases $y_i$ based on the test statistic $T = \sum_i (y_i - e_i)^2 / e_i$. Under the null hypothesis of no difference between the two groups, this test statistic follows a χ2-distribution with 1 degree of freedom, so a P-value can easily be calculated. Note that it is a two-sided test: due to the squared differences, it does not matter whether $y_i < e_i$ or $y_i > e_i$.

The χ2-test with continuity correction is based on the modified test statistic $T = \sum_i \bigl((|y_i - e_i| - 0.5)_+\bigr)^2 / e_i$, where $x_+ = \max(x, 0)$. What we illustrated here for 2 × 2 tables can also be generalized to more categories.
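
Using the counts from Table B.1, the expected number of cases and the χ2-test with and without continuity correction can be reproduced as follows (a sketch; the object name tabAPSAC is not from the original code examples):

## Data of Table B.1 (APSAC study)
tabAPSAC <- matrix(c(9, 19, 153, 132), nrow = 2,
                   dimnames = list(Therapy = c("APSAC", "Heparin"),
                                   Outcome = c("Dead", "Alive")))

## Expected counts: row sum times column sum divided by the total
expected <- outer(rowSums(tabAPSAC), colSums(tabAPSAC)) / sum(tabAPSAC)
expected["APSAC", "Dead"]   # 14.49, as computed above

## Chi-squared test with and without continuity correction
chisq.test(tabAPSAC, correct = TRUE)
chisq.test(tabAPSAC, correct = FALSE)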

B.3.1.1 Application in two paired groups

Example B.2 In Example A.6 with paired data, a McNemar test can be performed in R as follows (default is with continuity correction):

## Data
tabIschemia <- matrix(c(14, 0, 5, 22), nrow = 2)
colnames(tabIschemia) <- c("Lab ischemic", "Lab normal")
rownames(tabIschemia) <- c("Clin ischemic", "Clin normal")

print(tabIschemia)
##               Lab ischemic Lab normal
## Clin ischemic           14          5
## Clin normal              0         22
## With continuity correction
mcnemar.test(x = tabIschemia, correct = TRUE)
## 
##  McNemar's Chi-squared test with continuity correction
## 
## data:  tabIschemia
## McNemar's chi-squared = 3.2, df = 1, p-value =
## 0.07364
## Without continuity correction
mcnemar.test(x = tabIschemia, correct = FALSE)
## 
##  McNemar's Chi-squared test
## 
## data:  tabIschemia
## McNemar's chi-squared = 5, df = 1, p-value = 0.02535

B.3.2 Fisher’s exact test

Fisher’s exact test is based on the probabilities of all possible tables with the observed row and column totals under the null hypothesis of no difference between the groups. It is a two-sided test; there are three different versions of it, and it can also be generalized to more categories.
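
As a sketch, Fisher’s exact test for the data in Table B.1 can be computed with fisher.test() (the table is redefined here so the chunk is self-contained):

## Fisher's exact test for the APSAC data from Table B.1
tabAPSAC <- matrix(c(9, 19, 153, 132), nrow = 2,
                   dimnames = list(Therapy = c("APSAC", "Heparin"),
                                   Outcome = c("Dead", "Alive")))
fisher.test(tabAPSAC)   # two-sided by default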

B.4 Survival outcome

For survival outcomes, the log-rank test can be used to compare two treatment groups. The log-rank test gives a P-value based on the expected number of events under the null hypothesis of no difference between the two groups. An estimate of the hazard ratio with CI can be derived from the observed and expected number of events in the two groups.
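
As a minimal sketch, a log-rank test can be computed with the survival package; the lung dataset and the grouping by sex below are illustrative assumptions, not an example from this book:

## Log-rank test comparing survival between two groups (illustrative data)
library(survival)
survdiff(Surv(time, status) ~ sex, data = lung)

## The output lists the observed and expected number of events per group.
## A hazard ratio with CI is often obtained via Cox regression instead of
## the observed/expected calculation described above:
summary(coxph(Surv(time, status) ~ sex, data = lung))$conf.int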

References

Altman, D. G. and Bland, M. J. (2011a). Statistics Notes: How to obtain the P value from a confidence interval. BMJ, 343.
Altman, D. G. and Bland, M. J. (2011b). Statistics Notes: How to obtain the confidence interval from a P value. BMJ, 343.