3 Statistical Hypothesis Testing

3.1 Hypotheses and Test-Statistics

Let X_{1}, \ldots, X_{n} be a random sample and assume that the distribution of X_{i} depends on some unknown parameter \theta \in \Theta, where \Theta is the parameter space.

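General Testing Problem: H_{0}: \theta \in \Theta_{0} against H_{1}: \theta \in \Theta_{1}. H_{0} is the null hypothesis, while H_{1} is the alternative. \Theta_{0} \subset \Theta and \Theta_{1} \subset \Theta denote the possible values of \theta under H_{0} and H_{1}. We assume that \Theta_{0} \cap \Theta_{1}=\emptyset.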

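In many cases, the null hypothesis states that \theta is equal to a specific value \theta_{0} \in \mathbb{R}, i.e., \Theta_{0}=\{\theta_{0}\} and the null hypothesis is H_{0}: \theta=\theta_{0}. Depending on the alternative, one then often distinguishes between one-sided tests (\Theta_{1}=(\theta_{0}, \infty) or \Theta_{1}=(-\infty, \theta_{0})) and two-sided tests (\Theta_{1}=\{\theta \in \mathbb{R} \mid \theta \neq \theta_{0}\}). The data X_{1}, \ldots, X_{n} are used to decide whether or not to reject H_{0}.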

Test Statistic: Every statistical hypothesis test relies on a corresponding test statistic T=T(X_{1}, \ldots, X_{n}). Any test statistic is a random variable since it is a deterministic function of random variables. Given realizations x_{1}, \ldots, x_{n} we obtain a realization of the test statistic, denoted by T_{obs}=T(x_{1}, \ldots, x_{n}). How can we use T_{obs} to decide between H_{0} and H_{1}? Generally, the distribution of T under H_{0} is analyzed in order to define a rejection region \mathcal{C}: H_{0} is rejected if T_{obs} \in \mathcal{C} and is not rejected otherwise.

For one-sided tests in the scalar case, the rejection region \mathcal{C} is typically of the form (-\infty, c_{0}] or [c_{1}, \infty). For two-sided tests, \mathcal{C} typically takes the form (-\infty, c_{0}] \cup [c_{1}, \infty). The limits c_{0} and c_{1} of the respective intervals are called “critical values”, and are obtained from quantiles of the null distribution, i.e., the distribution of T under H_{0}.
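As a minimal sketch of how critical values are read off from quantiles of the null distribution, the following R lines assume, purely for illustration, that T follows a t-distribution with 4 degrees of freedom under H_{0} and that the rejection probability under H_{0} should be 5% (the significance level discussed in Section 3.2). All variable names are illustrative.

```r
alpha <- 0.05  # assumed rejection probability under H_0 (illustrative)
df    <- 4     # assumed degrees of freedom of the null distribution (illustrative)

# One-sided tests: a single critical value
c1_right <- qt(p = 1 - alpha, df = df)  # reject if T_obs >= c1_right
c0_left  <- qt(p = alpha,     df = df)  # reject if T_obs <= c0_left

# Two-sided test: alpha is split equally between both tails
c0 <- qt(p = alpha / 2,     df = df)    # reject if T_obs <= c0 ...
c1 <- qt(p = 1 - alpha / 2, df = df)    # ... or if T_obs >= c1

c(c0_left = c0_left, c1_right = c1_right, c0 = c0, c1 = c1)
```

Because the t-distribution is symmetric around zero, the two-sided critical values satisfy c0 = -c1.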

3.2 Significance Level, Size and p-Values

We can make two decision errors: (1) reject the null hypothesis although it is true (type I error), or (2) fail to reject the null hypothesis although it is false (type II error).

We cannot rule out both errors at the same time. We therefore impose an upper bound on the probability of a type I error and, given this bound, try to minimize the probability of a type II error.

Significance Level: In a statistical significance test, the probability of a type I error is controlled by the significance level \alpha (e.g., \alpha=5\%): P(\text{Type I error})=\sup _{\theta \in \Theta_{0}} P\left(T \in \mathcal C \mid \theta \in \Theta_{0}\right) \leq \alpha. The size of a statistical test is defined as \sup _{\theta \in \Theta_{0}} P\left(T \in \mathcal C \mid \theta \in \Theta_{0}\right). That is, the preselected significance level \alpha is an upper bound for the size, which may not be attained (i.e., size <\alpha) if, for instance, the distribution of the test statistic is discrete.
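To illustrate why the size can fall strictly below \alpha for a discrete test statistic, the following sketch uses a hypothetical example that is not part of the text above: T \sim \text{Binomial}(10, \theta), H_{0}: \theta=0.5 against H_{1}: \theta>0.5, with rejection for large values of T. The numbers and variable names are assumptions made for illustration only.

```r
n       <- 10    # number of trials (illustrative)
theta_0 <- 0.5   # null value (illustrative)
alpha   <- 0.05  # significance level

# Candidate critical values c and the rejection probabilities P(T >= c) under H_0
c_grid    <- 0:n
rej_probs <- pbinom(q = c_grid - 1, size = n, prob = theta_0, lower.tail = FALSE)

# Smallest critical value whose rejection probability does not exceed alpha
c_crit <- min(c_grid[rej_probs <= alpha])
size   <- rej_probs[c_grid == c_crit]

c(c_crit = c_crit, size = size)
```

Here the rejection region \{9, 10\} has size 11/1024 \approx 0.0107, while the next larger region \{8, 9, 10\} would have probability 56/1024 \approx 0.0547 > \alpha, so the level 0.05 cannot be attained exactly.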

Practically important significance levels:

p-Value: The p-value is the probability of obtaining a test statistic at least as “extreme” as the one that was actually observed, assuming that the null hypothesis is true.

Remark: For given data, having determined the p-value of a test, we also know the test decisions for all possible levels \alpha: \alpha > \text{p-value} \Rightarrow H_{0} is rejected, while \alpha < \text{p-value} \Rightarrow H_{0} cannot be rejected.

Example: Let X_{i} \sim N\left(\mu, \sigma^{2}\right) independently for all i=1, \ldots, n with n=5. Observed realizations from this i.i.d. random sample: x_{1}=19.20, x_{2}=17.40, x_{3}=18.50, x_{4}=16.50, x_{5}=18.90. That is, the empirical mean is given by \bar{x}=18.1.

Testing problem: H_{0}: \mu=\mu_{0} against H_{1}: \mu \neq \mu_{0} (i.e., a two-sided test), where \mu_{0}=17.

Since the variance is unknown, we use the sample standard deviation, s, which leads to the t-test for testing H_{0}. Test statistic of the t-test: T=\frac{\sqrt{n}\left(\bar{X}-\mu_{0}\right)}{s}, where s^{2}=\frac{1}{n-1} \sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2} is the unbiased estimator of \sigma^{2}. Thus, T follows the t-distribution with n-1 degrees of freedom under the null hypothesis H_0:\mu=\mu_0. We write T_{n-1} for a random variable with this distribution.

Given our data, the value of the test statistic is \begin{gathered} T_{obs}=\frac{\sqrt{5}(18.1-17)}{1.125}=2.187 \\ \Rightarrow \text { p-value }=2 \min \left\{P\left(T_{n-1} \leq 2.187\right), P\left(T_{n-1} \geq 2.187\right)\right\}=0.094 \end{gathered}

The above computations in R:

X <- c(19.20, 17.40, 18.50, 16.50, 18.90) 
mu_0 <- 17 # hypothetical mean
n <- length(X) # sample size
X_mean <- mean(X) # empirical mean
X_sd <- sd(X) # empirical sd
# t-test statistic
t_test_stat <- sqrt(n)*(X_mean - mu_0)/X_sd

2*min(pt(q = t_test_stat, df = n-1, lower.tail = TRUE),
  pt(q = t_test_stat, df = n-1, lower.tail = FALSE))
## [1] 0.09401508

Of course, there is also a t.test() function in R:

t.test(X, mu = mu_0, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  X
## t = 2.1869, df = 4, p-value = 0.09402
## alternative hypothesis: true mean is not equal to 17
## 95 percent confidence interval:
##  16.70347 19.49653
## sample estimates:
## mean of x 
##      18.1
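The remark on p-values and test decisions can be checked on the same data. The sketch below compares the p-value with a few levels (the chosen levels are illustrative) and confirms that the critical-value rule of the two-sided t-test leads to the same decisions.

```r
X    <- c(19.20, 17.40, 18.50, 16.50, 18.90)
mu_0 <- 17
n    <- length(X)

t_obs   <- sqrt(n) * (mean(X) - mu_0) / sd(X)
p_value <- 2 * pt(q = -abs(t_obs), df = n - 1)

alphas <- c(0.10, 0.05, 0.01)  # illustrative levels

# Decision via the p-value: reject H_0 if alpha > p-value
reject_by_p <- alphas > p_value

# Decision via the rejection region: reject H_0 if |t_obs| >= t_{n-1; 1-alpha/2}
reject_by_crit <- abs(t_obs) >= qt(p = 1 - alphas / 2, df = n - 1)

data.frame(alpha = alphas, reject_by_p, reject_by_crit)
```

With a p-value of about 0.094, H_{0} is rejected at the 10% level but not at the 5% or 1% level.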

3.3 The Power Function

For every possible value \theta \in \Theta_{0} \cup \Theta_{1}, the power function \beta_{n, \alpha}, depending on sample size n and significance level \alpha, is defined by \beta_{n, \alpha}(\theta)=P\left(\text { reject } H_{0} \mid \theta\right), i.e., the probability of rejecting H_{0} when the true parameter value is \theta. Obviously, \beta_{n, \alpha}(\theta) \leq \alpha for all \theta \in \Theta_{0}. Furthermore, for any \theta \in \Theta_{1}, 1-\beta_{n, \alpha}(\theta) is the probability of committing a type II error.

The power function is an important tool for assessing the ability of a test to reject a false null hypothesis.

A significance test of level \alpha>0 is called consistent if \lim _{n \rightarrow \infty} \beta_{n, \alpha}(\theta)=1 for all \theta \in \Theta_{1}.
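Consistency can be inspected numerically. As a sketch, the following lines use the base-R function power.t.test() to evaluate the power of the one-sample two-sided t-test at one fixed alternative for growing sample sizes. The effect size (delta = 0.5), the standard deviation (sd = 1), and the grid of sample sizes are illustrative assumptions.

```r
n_grid <- c(5, 10, 50, 200)  # illustrative sample sizes

# Power of the one-sample, two-sided t-test at a fixed alternative
powers <- sapply(n_grid, function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05,
               type = "one.sample", alternative = "two.sided")$power
})

data.frame(n = n_grid, power = powers)
```

For a consistent test, the power column should increase towards 1 as n grows, for every fixed alternative.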

When choosing between different testing procedures for the same testing problem, one will usually prefer the most powerful test. Consider a fixed sample size n. For a specified \theta \in \Theta_{1}, a test with power function \beta_{n, \alpha}(\theta) is said to be most powerful for \theta if for any alternative test with power function \beta_{n, \alpha}^{*}(\theta), \beta_{n, \alpha}(\theta) \geq \beta_{n, \alpha}^{*}(\theta) holds for all levels \alpha>0.

A test with power function \beta_{n, \alpha}(\theta) is said to be uniformly most powerful against the set of alternatives \Theta_{1} if for any alternative test with power function \beta_{n, \alpha}^{*}(\theta), \beta_{n, \alpha}(\theta) \geq \beta_{n, \alpha}^{*}(\theta) \quad \text { holds for all } \theta \in \Theta_{1}, \alpha>0. Unfortunately, uniformly most powerful tests only exist for very special testing problems.

Example: Let X_{1}, \ldots, X_{n} be an i.i.d. random sample. Assume that n=9, and that X_{i} \sim N\left(\mu, 0.18^{2}\right). Hence, in this simple example only the mean \mu=E(X) is unknown, while the standard deviation has the known value \sigma=0.18.

Testing problem: H_{0}: \mu=\mu_{0} against H_{1}: \mu \neq \mu_{0} for \mu_{0}=18.3 (i.e., a two-sided test).

Since the variance is known, a test may rely on the Gauss (or Z) test statistic: Z=\frac{\sqrt{n}\left(\bar{X}-\mu_{0}\right)}{\sigma}=\frac{3(\bar{X}-18.3)}{0.18}. Under H_{0} we have Z \sim N(0,1), and for the significance level \alpha=0.05 the null hypothesis is rejected if |Z| \geq z_{1-\alpha / 2}=1.96, where z_{1-\alpha / 2} denotes the (1-\alpha / 2)-quantile of the standard normal distribution. Note that the size of this test equals its level \alpha=0.05.

For determining the rejection region of a test it suffices to determine the distribution of the test statistic under H_{0}. But in order to calculate the power function one needs to quantify the distribution of the test statistic for all possible values \theta \in \Theta. For many important problems this is a formidable task. For the Gauss test, however, it is quite easy. Note that for any (true) mean value \mu \in \mathbb{R} the test statistic Z \equiv Z_{\mu}=\sqrt{n}\left(\bar{X}-\mu_{0}\right) / \sigma can be decomposed as Z_{\mu}=\frac{\sqrt{n}\left(\mu-\mu_{0}\right)}{\sigma}+\frac{\sqrt{n}(\bar{X}-\mu)}{\sigma} \sim N\left(\frac{\sqrt{n}\left(\mu-\mu_{0}\right)}{\sigma}, 1\right), since the second term is N(0,1)-distributed. This implies that \begin{aligned} \beta_{n, \alpha}(\mu) &=P\left(\left|Z_{\mu}\right|>z_{1-\alpha / 2}\right) \\ &=1-\Phi\left(z_{1-\alpha / 2}-\frac{\sqrt{n}\left(\mu-\mu_{0}\right)}{\sigma}\right)+\Phi\left(-z_{1-\alpha / 2}-\frac{\sqrt{n}\left(\mu-\mu_{0}\right)}{\sigma}\right), \end{aligned} where \Phi denotes the distribution function of the standard normal distribution.

Implementing the power function of the two-sided Z-test in R:

# The power function
beta_Ztest_TwoSided <- function(n, alpha, sigma, mu_0, mu){ 
  # (1-alpha/2)-quantile of N(0,1):
  z_upper <- qnorm(p = 1-alpha/2)
  # location shift under H_1:
  location_shift <- sqrt(n) * (mu - mu_0)/sigma
  # compute power
  power <- 1 - pnorm( z_upper - location_shift) + pnorm(-z_upper - location_shift)
  return(power)
}
# Apply the function
n <- 9
sigma <- 0.18
mu_0 <- 18.3
##
beta_Ztest_TwoSided(n = n, alpha = 0.05, sigma = sigma, mu_0 = mu_0, mu=18.35)
## [1] 0.132558
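As a plausibility check, the analytic power can be compared with a Monte Carlo estimate. The sketch below is self-contained and uses the same parameter values as above. The seed and the number of replications B are arbitrary choices, and the simulation draws the sample mean directly from its exact N(\mu, \sigma^{2}/n) distribution instead of simulating individual observations.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 9; sigma <- 0.18; mu_0 <- 18.3; mu <- 18.35; alpha <- 0.05
B <- 100000  # number of Monte Carlo replications (illustrative)

# For normal data, the sample mean is exactly N(mu, sigma^2 / n) distributed
X_bar <- rnorm(n = B, mean = mu, sd = sigma / sqrt(n))
Z     <- sqrt(n) * (X_bar - mu_0) / sigma

# Share of replications in which H_0 is rejected
power_sim <- mean(abs(Z) >= qnorm(p = 1 - alpha / 2))

# Analytic power from the formula above
shift          <- sqrt(n) * (mu - mu_0) / sigma
power_analytic <- 1 - pnorm(qnorm(1 - alpha / 2) - shift) +
  pnorm(-qnorm(1 - alpha / 2) - shift)

c(simulated = power_sim, analytic = power_analytic)
```

The two numbers should agree up to simulation error, which is of order 0.001 for this choice of B.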

Now plot the power function:

library(ggplot2)
beta_Ztest_TwoSided <- Vectorize(FUN = beta_Ztest_TwoSided,
                                 vectorize.args = "mu_0")
mu_0_vec <- seq(from = 17.75, to = 18.25, len = 50)
beta_vec <- beta_Ztest_TwoSided(n     =   10,
                                alpha = 0.05,
                                sigma = 0.18,
                                mu    =  18,
                                mu_0  = mu_0_vec)
beta_df <- data.frame("mu_0"  = mu_0_vec,"Beta"  = beta_vec)
ggplot(data = beta_df, aes(x=mu_0, y=Beta)) +
   geom_line() + geom_hline(yintercept = 0.05, lty=2) +
   # annotate() adds a single label and avoids the one-label-per-row warning of geom_text()
   annotate("text", x=17.77, y=0.07, label="alpha==0.05", parse=TRUE, size=5) +
   labs(title = expression(paste("Power function of the two-sided Z-Test (n=10 and ",
                                alpha==0.05,")")),
       x = expression(mu[0]), 
       y = expression(beta)) +
   theme_bw() + theme(axis.text  = element_text(size=12),
                      axis.title = element_text(size=14))

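This example illustrates the power function of a sensible test, since: the power equals the level at the null value, \beta_{n, \alpha}(\mu_{0})=\alpha; the test is unbiased, i.e., \beta_{n, \alpha}(\mu)>\alpha for all \mu \neq \mu_{0}; the power increases with the distance |\mu-\mu_{0}|; and the test is consistent, i.e., \beta_{n, \alpha}(\mu) \rightarrow 1 as n \rightarrow \infty for every fixed \mu \neq \mu_{0}.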

Assuming that the basic assumptions (i.e., normality and known variance) are true, the Gauss test is the most prominent example of a uniformly most powerful test: its one-sided version is uniformly most powerful, and the two-sided version above is uniformly most powerful among all unbiased tests. Under these (restrictive) assumptions, no other unbiased test of the same level can achieve a larger value of \beta_{n, \alpha}(\mu) for any possible value of \mu.

3.4 Asymptotic Null Distributions

Generally, the underlying distributions are unknown. In this case it is usually not possible to compute the power function of a test for fixed n. (Exceptions are so-called “distribution-free” tests in nonparametric statistics.) The only way out of this difficulty is to rely on large sample asymptotics and corresponding asymptotic distributions, which allow us to approximate the power function and to study the asymptotic efficiency of a test. The finite sample behavior of a test for different sample sizes n is then evaluated by means of simulation studies.

For a real-valued parameter \theta most tests of H_{0}: \theta=\theta_{0} rely on estimators \hat{\theta} of \theta. Under suitable regularity conditions on the underlying distribution, central limit theorems usually imply that \sqrt{n}(\hat{\theta}-\theta) \rightarrow^d N\left(0, v^{2}\right) \quad \text { as } \quad n \rightarrow \infty, where v^{2} is the asymptotic variance of the estimator.

Often a consistent estimator \hat{v}^{2} of v^{2} can be determined from the data. For large n we then approximately have \frac{\sqrt{n}(\hat{\theta}-\theta)}{v} \stackrel{a}{\sim} N(0,1) . For a given \alpha, a one-sided test of H_{0}: \theta=\theta_{0} against H_{1}: \theta>\theta_{0} then rejects H_{0} if Z=\frac{\sqrt{n}\left(\hat{\theta}-\theta_{0}\right)}{v}>z_{1-\alpha} . The corresponding asymptotic approximation (valid for sufficiently large n ) of the true power function is then given by \beta_{n, \alpha}(\theta)=1-\Phi\left(z_{1-\alpha}-\frac{\sqrt{n}\left(\theta-\theta_{0}\right)}{v}\right)

Note that in practice the (unknown) true value v^{2} is generally replaced by an estimator \hat{v}^{2} determined from the data. As long as \hat{v}^{2} is a consistent estimator of v^{2} this leads to the same asymptotic power function. The resulting test is asymptotically unbiased and consistent.
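The asymptotic power approximation of the one-sided test can be written as a small R function. This is a sketch only: the function name asy_power_one_sided and all numerical values in the calls are hypothetical, and v is treated as known (in practice it would be replaced by a consistent estimate).

```r
# Asymptotic power of the one-sided test of
# H_0: theta = theta_0 against H_1: theta > theta_0
asy_power_one_sided <- function(n, alpha, theta, theta_0, v) {
  z_upper <- qnorm(p = 1 - alpha)  # (1-alpha)-quantile of N(0,1)
  1 - pnorm(z_upper - sqrt(n) * (theta - theta_0) / v)
}

# Power at a fixed alternative for growing n (illustrative values)
power_by_n <- asy_power_one_sided(n = c(10, 50, 200), alpha = 0.05,
                                  theta = 0.2, theta_0 = 0, v = 1)
power_by_n

# At theta = theta_0 the approximation returns the level alpha
power_at_null <- asy_power_one_sided(n = 50, alpha = 0.05,
                                     theta = 0, theta_0 = 0, v = 1)
power_at_null
```

The function is vectorized in n, so the first call returns one approximate power value per sample size.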

Example: Let X_{1}, \ldots, X_{n} be an i.i.d. random sample. Consider testing H_{0}: \mu= \mu_{0} against H_{1}: \mu>\mu_{0}, where \mu:=E\left(X_{i}\right). For a given level \alpha the t-test then rejects H_{0} if T=\frac{\sqrt{n}\left(\bar{X}-\mu_{0}\right)}{S}>t_{n-1 ; 1-\alpha}, where t_{n-1 ; 1-\alpha} is the (1-\alpha)-quantile of a t-distribution with n-1 degrees of freedom. This is an exact test if the distribution of X_{i} is normal.

In the general case, the justification of the t-test is based on asymptotic arguments. Under some regularity conditions the central limit theorem implies that \sqrt{n}(\bar{X}-\mu) \rightarrow^d N\left(0, \sigma^{2}\right) \quad \text { as } \quad n \rightarrow \infty with \sigma^{2}=\operatorname{Var}\left(X_{i}\right).

Moreover, S^{2} is a consistent estimator of \sigma^{2} and t_{n-1 ; 1-\alpha} \rightarrow z_{1-\alpha} as n \rightarrow \infty. Thus, even if the distribution of X_{i} is non-normal, for sufficiently large n, T=\frac{\sqrt{n}\left(\bar{X}-\mu_{0}\right)}{S} is approximately N(0,1)-distributed under H_{0}, and the asymptotic power function of the t-test is given by \beta_{n, \alpha}(\mu)=1-\Phi\left(z_{1-\alpha}-\frac{\sqrt{n}\left(\mu-\mu_{0}\right)}{\sigma}\right).
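The finite sample behavior of the t-test under non-normality can be explored with a small simulation study. The sketch below is an illustration under assumed settings that are not part of the text: the data are drawn from an Exp(1) distribution (true mean 1, clearly non-normal), H_{0}: \mu=1 is true, and the seed, the number of replications B, and the sample sizes are arbitrary.

```r
set.seed(123)   # arbitrary seed for reproducibility
alpha <- 0.05
mu_0  <- 1      # true mean of the Exp(1) distribution, so H_0 holds
B     <- 5000   # Monte Carlo replications (illustrative)

# Estimated rejection probability of the one-sided t-test under H_0
rejection_rate <- function(n) {
  rejections <- replicate(B, {
    X      <- rexp(n = n, rate = 1)
    T_stat <- sqrt(n) * (mean(X) - mu_0) / sd(X)
    T_stat > qt(p = 1 - alpha, df = n - 1)
  })
  mean(rejections)
}

n_grid <- c(10, 100, 1000)
rates  <- sapply(n_grid, rejection_rate)
data.frame(n = n_grid, rejection_rate = rates)
```

If the asymptotic approximation works as described, the estimated rejection rate should move towards \alpha as n grows; for small n and skewed data it can deviate noticeably.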

3.5 Confidence Sets

Suppose we test H_0: \theta = \theta_0 and have a test such that P\left(T \in \mathcal C \mid H_0 \text{ is true} \right) = \alpha.

Now collect all values of \theta_0 for which we do not reject the null hypothesis, and call this set CS. The true value \theta lies outside CS exactly when the test of H_0: \theta_0 = \theta rejects although H_0 is true. Then, P\left(\theta \in CS\right) = 1 - P\left(\theta \notin CS\right) = 1 - P\left(T \in \mathcal C \mid H_0 \text{ is true}\right) = 1 - \alpha.

We call CS an exact 1-\alpha confidence set for \theta.

Similarly, if P\left(T \in \mathcal C \mid H_0 \text{ is true} \right) \rightarrow \alpha, then P\left(\theta \in CS\right) \rightarrow 1 - \alpha and CS is an asymptotically valid 1-\alpha confidence set for \theta.

Example: In the previous example, we considered testing H_{0}: \mu= \mu_{0} against H_{1}: \mu>\mu_{0}, where \mu:=E\left(X_{i}\right), using a random sample. For a given level \alpha the t-test then rejects H_{0} if T=\frac{\sqrt{n}\left(\bar{X}-\mu_{0}\right)}{S}>t_{n-1 ; 1-\alpha}. Collecting all values \mu_{0} that are not rejected gives the one-sided confidence set CS = \left\{ \mu \in \mathbb{R}: \mu \geq \bar{X} - t_{n-1 ; 1-\alpha} \frac{S}{\sqrt{n} } \right\}. Inverting instead the two-sided t-test, which rejects H_{0} if |T|>t_{n-1 ; 1-\alpha/2}, yields the familiar interval CS = \left\{ \mu \in \mathbb{R}: \bar{X} - t_{n-1 ; 1-\alpha/2} \frac{S}{\sqrt{n} } \leq \mu \leq \bar{X} + t_{n-1 ; 1-\alpha/2} \frac{S}{\sqrt{n} } \right\}. These are exact confidence sets if we have a sample from a normal distribution, and asymptotically valid confidence sets otherwise.
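The construction of a confidence set by collecting all non-rejected null values can be carried out numerically. The sketch below inverts the two-sided t-test at level \alpha=0.05 on the five observations from Section 3.2 and compares the result with the closed-form interval based on the (1-\alpha/2)-quantile. The grid range and step size are illustrative choices.

```r
X     <- c(19.20, 17.40, 18.50, 16.50, 18.90)
n     <- length(X)
alpha <- 0.05

# Grid of candidate null values mu_0 (range and step size are illustrative)
mu_0_grid <- seq(from = 15, to = 21, by = 0.001)

# Two-sided t-test p-value for every candidate mu_0
t_stats  <- sqrt(n) * (mean(X) - mu_0_grid) / sd(X)
p_values <- 2 * pt(q = -abs(t_stats), df = n - 1)

# Confidence set: all mu_0 that are not rejected at level alpha
CS <- mu_0_grid[p_values >= alpha]
range(CS)

# Closed-form interval for comparison
CI <- mean(X) + c(-1, 1) * qt(p = 1 - alpha / 2, df = n - 1) * sd(X) / sqrt(n)
CI
```

Up to the grid resolution, both results coincide with the 95 percent confidence interval (16.70347, 19.49653) reported by t.test() in Section 3.2.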