11 Classic hypothesis testing and confidence intervals - definitions and set-up

This chapter covers the definitions, set-up, errors, and interpretation of hypothesis testing and confidence intervals. For computation, see the following chapters.

The difference between classical inference and the permutation testing we did earlier is that for permutation tests we did not (have to) make any assumptions about the underlying distributions. Classical hypothesis testing is used in situations where we can make some assumptions about the distribution(s) to use. The basic idea remains the same, though:

11.1 Formulate a hypothesis

The null hypothesis, \(\small H_o\), contains equality, i.e. =, \(\le\), or \(\ge\). Other names you may find are H-oh, H-zero, H-naught, and very rarely dividing hypothesis or one-sided hypothesis. In any case, this is what, for the purpose of your study, you assume to be true. This means that you are performing all your calculations and analysis under the assumption that \(\small H_o\) is valid. Many authors will simply use = in all \(\small H_o\) to emphasize that fact. You are looking for evidence against that assumption. The thought process is similar to our legal system: we work under the assumption that the accused is innocent (the \(\small H_o\)) and look for evidence that might cause us to reject that assumption of innocence beyond reasonable doubt.

The alternative hypothesis, \(\small H_1\) aka \(\small H_A\), contains \(\ne\), \(>\), or \(<\). With this notation, \(\small H_o\) and \(\small H_1\) cover all possibilities.

You never accept \(\small H_o\), you either reject it or fail to reject it.
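
As a concrete illustration: in R, the alternative argument of test functions such as t.test() is where you state \(\small H_1\). The data below are invented for this sketch.

```r
# Hypothetical data for illustration: fuel efficiency (mpg) of 10 cars -- the numbers are invented
mpg <- c(27.1, 29.4, 31.2, 28.8, 30.5, 26.9, 29.9, 30.1, 28.4, 29.6)

# H_o: mu >= 30   vs   H_1: mu < 30    ->  alternative = "less"
t.test(mpg, mu = 30, alternative = "less")

# H_o: mu = 30    vs   H_1: mu != 30   ->  alternative = "two.sided" (the default)
t.test(mpg, mu = 30, alternative = "two.sided")
```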

Make sure your experiment design, sample choice, sample size, data collection, method chosen, etc., are sound and appropriate to the question you study. If you use an inappropriate tool, the results will not be acceptable. Remember the old saying:

Don’t try to shave a cat with a butter knife

11.2 Compute a test statistic from the data

The procedures for finding various test statistics are outlined below. Keep in mind that, in addition to the null hypothesis, there are other assumptions about the data that need to be met.
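
As a sketch of what this step looks like in practice, here is a one-sample t statistic computed by hand in R and checked against t.test(); the data and the hypothesized mean of 30 are invented for illustration.

```r
# Invented sample; testing H_o: mu = 30
x   <- c(27.1, 29.4, 31.2, 28.8, 30.5, 26.9, 29.9, 30.1, 28.4, 29.6)
mu0 <- 30
n   <- length(x)

# one-sample t statistic: (estimate - hypothesized value) / standard error
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
t_stat

# the same statistic as reported by R's built-in test
t.test(x, mu = mu0)$statistic
```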

11.3 Compute/find the p-value

In recent years, there has been much controversy about p-values, to the point where the ASA in 2016 felt compelled to make an official statement (https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108). Below is a summary of its most important points:

  • Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

  • P-values can indicate how incompatible the data are with a specified statistical model. A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.”

  • P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. A p-value is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.

  • Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

  • Proper inference requires full reporting and transparency. P-values and related analyses should not be reported selectively. Cherry-picking promising findings … should be vigorously avoided.

  • A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

  • By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. Data analysis should not end with the calculation of a p-value when other approaches are appropriate and feasible.
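
To make "equal to or more extreme than its observed value" concrete, here is a minimal sketch in R, continuing the invented one-sample example from above: the p-value is the tail area of the reference distribution (here a t distribution) beyond the observed statistic.

```r
# Invented sample; H_o: mu = 30 vs H_1: mu != 30
x      <- c(27.1, 29.4, 31.2, 28.8, 30.5, 26.9, 29.9, 30.1, 28.4, 29.6)
mu0    <- 30
n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# two-sided p-value: probability, under H_o, of a t statistic at least as extreme as the observed one
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
p_value

# matches the p-value reported by t.test()
t.test(x, mu = mu0)$p.value
```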

11.4 Errors

Two potential mistakes are possible. We could reject \(\small H_o\) when in fact \(\small H_o\) was correct. This is called a type I error. On the other hand, we could fail to reject \(\small H_o\) when we should have done so. This is called a type II error. \[\small \text{P(type I error) = P(reject } H_o | H_o \text{ is true)}\] \[\small \beta = \text{P(type II error) = P(did not reject } H_o | H_A \text{ is true) = P(did not reject } H_o | H_o \text{ is false)}\] The power of a test is the probability of correctly rejecting a false null hypothesis (the probability of not making a type II error).

\[\small 1-\beta = 1- \text{P(type II error) = P(reject } H_o | H_A \text{ is true) = P(reject } H_o | H_o \text{ is false)}\]
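
A small simulation can make \(\small \alpha\) and power tangible. The sketch below is not tied to any particular example in this chapter; the settings (normally distributed data, n = 20, a true mean shift of 0.5, a 5% level) are arbitrary choices for illustration.

```r
set.seed(1)
alpha <- 0.05
n     <- 20
reps  <- 10000

# type I error rate: generate data with H_o true (true mean = 0) and count how often we reject
p_null <- replicate(reps, t.test(rnorm(n, mean = 0), mu = 0)$p.value)
mean(p_null < alpha)    # should be close to alpha

# power: generate data with H_o false (true mean = 0.5) and count how often we reject
p_alt <- replicate(reps, t.test(rnorm(n, mean = 0.5), mu = 0)$p.value)
mean(p_alt < alpha)     # estimate of 1 - beta for this particular scenario
```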

Example: Suppose you sit on a jury in a trial. In the US we have the “presumed innocent until proven guilty” rule, so here

\(\small H_o\): Defendant is innocent

\(\small H_A\): Defendant is guilty

In this case, a type I error means you convict an innocent person. A type II error means you let a guilty person go free. Which error is worse depends on the context.

|                             | \(\small H_o\) true | \(\small H_o\) false |
|-----------------------------|---------------------|----------------------|
| reject \(\small H_o\)       | type I error        | correct decision     |
| don’t reject \(\small H_o\) | correct decision    | type II error        |

Assignment: You have two drugs. Drug 1 is very expensive, drug 2 is pretty cheap. You are testing

\(\small H_o\): Both drugs are equally effective

\(\small H_a\): The expensive drug works better

  1. What is the type I error here?

  2. What is the type II error here?

  3. What are the consequences of making each error, and which one is worse?

Assignment: You have two drugs that are equally effective for some disease, and also cost the same. Drug 2 might cause a serious side effect in some patients; drug 1 has been used for decades with no reported side effects. You are testing

\(\small H_o\): the frequency of the side effect in both drugs is the same

\(\small H_a\): the frequency of the side effect in drug 2 is greater than that in drug 1

  1. What is the type I error here?

  2. What is the type II error here?

  3. What are the consequences of making each error, and which one is worse?

The significance level of a hypothesis test is the largest value (commonly called \(\small \alpha\)) that we find acceptable for the probability of a type I error.

Note: Although not correct, in practice the p-value and the probability of a type I error are sometimes used interchangeably. The significance level then becomes the cut-off for p-values that are “small enough” to reject \(\small H_o\).
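
In practice, the decision then comes down to comparing the p-value with the chosen \(\small \alpha\). A minimal sketch, with a made-up p-value:

```r
alpha   <- 0.05     # chosen significance level
p_value <- 0.031    # made-up p-value from some test

decision <- if (p_value < alpha) "reject H_o" else "do not reject H_o"
decision
```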

11.5 Confidence intervals

Confidence intervals are a great way to confirm the results you got from your hypothesis testing. The general idea is this: if you know - or can reasonably assume - the sampling distribution of your test statistic, then you can determine the range that contains the most likely \(x \%\) of its values. With some algebra, you can then find an interval for the parameter you are interested in. Let’s look at an example:

Let’s say you pull a sample of size 340 from a population of normally distributed random variables with unknown mean \(\mu\) and unknown standard deviation \(\sigma\). From the central limit theorem we know that the sample means \(\bar{x}\) are normally distributed with mean \(\bar{\bar{x}} = \mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}=\frac{\sigma}{\sqrt{340}}\), or \(\small \bar{x} \sim \mathcal{N} (\mu, \frac{\sigma}{\sqrt{n}})\) for short.
We also know that, for any random variable \(\small X\), the standard score \(z=\frac{X - \text{mean}}{\text{standard deviation}}\) has the same distribution as \(\small X\), except with mean 0 and standard deviation 1.
Let’s assume we want the 95% confidence interval for the unknown parameter \(\mu\). Because \(\bar{x}\) are normally distributed, we will use the standard normal distribution \(\small \mathcal{N} (0, 1)\). For a standard normal variable, 95% of all values are in the range \(0 \pm 1.96\). We thus have:

\[0.95 = P(-1.96 \le \frac{\bar{x}-\mu}{\sigma / \sqrt{n}} \le 1.96)\] \[= P(-1.96 \frac{\sigma}{\sqrt{n}} \le \bar{x} - \mu \le 1.96 \frac{\sigma}{\sqrt{n}} )\]

\[ = P(\bar{x}-1.96 \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x}+ 1.96 \frac{\sigma}{\sqrt{n}} )\]

Thus the 95% confidence interval for \(\mu\) is given by \(\small \bar{x} \pm E = \bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}\). We call \(\small E = 1.96 \frac{\sigma}{\sqrt{n}}\) the margin of error. We are 95% confident that the interval \(\small \bar{x} \pm E\) contains the actual value of \(\mu\).
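
As a sketch of this hand computation in R: the data below are simulated (so the true \(\mu\) is actually known to be 50), and since \(\sigma\) is unknown in practice, the sample standard deviation is used in its place, which is a reasonable approximation for n = 340.

```r
set.seed(42)
n <- 340
x <- rnorm(n, mean = 50, sd = 8)    # simulated stand-in for the sample

xbar <- mean(x)
E    <- qnorm(0.975) * sd(x) / sqrt(n)    # qnorm(0.975) is approximately 1.96

c(lower = xbar - E, upper = xbar + E)     # 95% confidence interval for mu
```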

We will always compute confidence intervals as part of our hypothesis tests; they are automatically included in R's output.
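
For example, R's t.test() reports a confidence interval alongside the test statistic and p-value. A quick sketch using the simulated sample from above; for n = 340 this t-based interval is nearly identical to the z-based one computed by hand.

```r
set.seed(42)
x <- rnorm(340, mean = 50, sd = 8)       # same simulated sample as above

t.test(x, conf.level = 0.95)$conf.int    # 95% confidence interval for mu
```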

11.6 Interpreting confidence intervals

Say you are wondering if the actual value of your parameter is 11. You computed a 90% confidence interval as [11.3, 11.6] and a 95% confidence interval as [10.9, 12]. (Why is the 95% confidence interval larger?)
The first (90%) interval does not contain our assumed value 11. We can be 90% certain that 11 is not the actual value of the unknown parameter.
The 95% interval does contain the assumed value, 11, as well as larger and smaller possible values. This means that we are 95% sure that the actual value of our unknown parameter could be 11, or less than 11, or more than 11. The 95% confidence interval is inconclusive.
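
The reason the 95% interval is wider than the 90% interval is that a higher confidence level requires a larger critical value, and therefore a larger margin of error. A quick check in R:

```r
# standard normal critical values for two-sided 90%, 95%, and 99% confidence
qnorm(c(0.95, 0.975, 0.995))
# approximately 1.645, 1.960, and 2.576: the critical value, and with it the margin of error,
# grows with the confidence level
```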

Assume you are comparing two samples to each other and got the following confidence intervals for an unknown parameter:
Group A: 95% c-interval [-2,2], 99% c-interval [-4,4]
Group B: 95% c-interval [2.1, 3.1], 99% c-interval [1.1, 4.1]
We can be 95% certain the parameter is not the same for both groups (there are no common values in the 95% confidence intervals). However, we cannot be 99% confident, because at the 99% confidence level the two intervals overlap. It is possible that the two parameters are the same, that A’s parameter is smaller, or that A’s parameter is larger. The 99% confidence intervals are inconclusive.

11.7 Writing your conclusion for hypothesis tests

Once you have performed your tests and are confident in your work, you need to write a conclusion. Phrase your conclusion in the context of the problem and, if possible, without technical jargon. Remember, whoever hires you to do statistics does so because they are not an expert.

For example, assume you ran a test on two groups of people to compare their incomes.
Bad: We fail to reject \(\small H_o: \mu_1 = \mu_2\) with a p-value of 0.06 at the 5% significance level.
Really bad: We accept \(\small H_o\) with a p-value of 0.06 at the 5% significance level. (Why is this really bad?)
Better: We did not find a statistically significant difference in average income between the two groups.
Finally, keep in mind that statistically significant is not the same as significant. A statistically significant difference simply means that a difference could be detected using statistical techniques; it doesn’t mean that the difference is meaningful or significant in the real world.