# Chapter 9 Statistical inference

In Chapter 7, we learned about estimation: the use of data and statistics to construct the best possible guess at the value of some parameter \(\theta\).

In this chapter, we will pursue a different goal. Instead of estimating the value of \(\theta\), we will ask what values of \(\theta\) can be confidently ruled out given the evidence available in our data:

- A hypothesis test determines whether a *particular* value *can* be ruled out.
- A confidence interval determines a *range* of values that *cannot* be ruled out.

The set of procedures for constructing confidence intervals and
hypothesis tests is called **statistical inference**.

**Chapter goals**

In this chapter we will learn how to:

- Construct and perform a simple hypothesis test
- Construct a confidence interval
- Correctly interpret both hypothesis tests and confidence intervals

## 9.1 Principles of inference

### 9.1.1 Evidence

The purpose of statistical inference is to systematically account for the uncertainty associated with limited evidence. That is, there are important aspects of the data generating process we do not know. The data provide some evidence about those unknown aspects, but the evidence they provide may not be strong. Statistical inference asks us what statements about the data generating process can be made with confidence based on the data.

**Roulette**

Suppose you work as a casino regulator for the BCLC (British Columbia Lottery Corporation, the crown corporation that regulates all commercial gambling in B.C.). You have been given data with recent roulette results from a particular casino and are tasked with determining whether the casino is running a fair game.

Before getting caught up in math, let’s think about how we might assess evidence:

- If we have data from many games, and the win rate for gamblers is about the same as the win probability in a fair game, then we would conclude there is strong evidence the game is *fair or close to fair*.
- If we have data from many games, and the win rate for gamblers is much higher or lower than the win probability in a fair game, then we would conclude there is strong evidence that the game is *not fair*.
- If we have no data, or data from only a few games, we probably *can’t tell* whether the game is fair or unfair.

In this chapter we will formalize these basic ideas about evidence.

### 9.1.2 A basic framework

For the remainder of this chapter, suppose we have a data set \(D_n = (x_1,x_2,\ldots,x_n)\) of size \(n\). The data comes from an unknown data generating process that includes an unknown parameter of interest \(\theta\).

**DGP and parameter of interest for roulette**

Let \(D_n = (x_1,\ldots,x_n)\) be a data set of results from \(n = 100\) games of roulette at a local casino. More specifically, let \[x_i = I(\textrm{Red wins})\] Our parameter of interest is the probability that red wins: \[p_{red} = \Pr(x_i = 1) = E(x_i)\] We know that red wins in a fair game with probability \(p_{red} = 18/37 \approx 0.486\).
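To make this data generating process concrete, the following Python sketch simulates one data set of this form. The function name `simulate_roulette`, the fixed seed, and the choice to simulate a fair wheel are illustrative assumptions; the casino's actual data are not reproduced here.

```python
import random

def simulate_roulette(n, p_red, seed=0):
    """Simulate n games; x_i = 1 if red wins and 0 otherwise."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [1 if rng.random() < p_red else 0 for _ in range(n)]

p_fair = 18 / 37                       # Pr(red wins) in a fair game
data = simulate_roulette(100, p_fair)  # hypothetical data set D_n with n = 100
win_frequency = sum(data) / len(data)  # sample average, an estimate of p_red
print(f"red won {sum(data)} of {len(data)} games (frequency {win_frequency:.2f})")
```

In a real application the list `data` would come from the recorded game results rather than from a simulation.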

## 9.2 Hypothesis tests

We will start with hypothesis tests. The idea of a hypothesis test
is to determine whether the data rule out or **reject** a specific
value of the unknown parameter \(\theta\).

Intuitively, if we have no (useful) data we cannot rule anything out, but as we obtain more data, we can rule out more values.

### 9.2.1 The null and alternative hypotheses

The first step in a hypothesis test is to define the
**null hypothesis**. The null hypothesis is a statement
about our parameter \(\theta\) that takes the form:
\[H_0: \theta = \theta_0\]
where \(\theta_0\) is a specific number. This is the value of
\(\theta\) we are interested in ruling out.

The next step is to define the **alternative hypothesis**. The
alternative hypothesis defines every other value of \(\theta\)
we are allowing, and is usually written as:
\[H_1: \theta \neq \theta_0\]
where \(\theta_0\) is the same number as used in the null.

**Null and alternative for \(p_{red}\)**

In our roulette example, our null hypothesis is that the game is fair: \[H_0: p_{red} = 18/37 \approx 0.486\] and the alternative hypothesis is that it is not fair: \[H_1: p_{red} \neq 18/37\]

Notice that there is something of an asymmetry between the null and alternative hypotheses: the null is typically (though not necessarily) a single value and the alternative is every other possible value.

**What null hypothesis to choose?**

Our framework here assumes that you already know what null hypothesis you wish to test, but it is worth briefly considering how to choose one.

In some applications there are null hypotheses that are of clear interest for that specific case:

- In our roulette example, the natural null to test is whether the win probability matches that of a fair game (\(p = p_{fair}\)).
- When measuring the effect \(\beta\) of one variable on another, the natural null to test is “no effect at all” (\(\beta = 0\)).
- When comparing the mean of some characteristic or outcome across two groups (for example, average wages of men and women), the natural null to test is that they are the same (\(\mu_m = \mu_w\)).
- In epidemiology, a contagious disease will tend to spread if its reproduction rate \(R\) is greater than one, and decline if it is less than one, so \(R = 1\) is a natural null to test.

If there is no obvious null hypothesis, it may make sense to test many null hypotheses and report all of the results. There is nothing wrong with doing that.

### 9.2.2 The test statistic

Our next step is to construct a **test statistic** that can be
calculated from our data. A valid test statistic for a given null
hypothesis is a statistic \(t_n\) that has the following two
properties:

- The probability distribution of \(t_n\) *under the null* (i.e., when \(H_0\) is true) is *known*.
- The probability distribution of \(t_n\) *under the alternative* (i.e., when \(H_1\) is true) is *different* from its probability distribution under the null.

The test statistic is usually based on an estimator of the parameter, and is usually constructed so that it is typically close to zero when the null is true, and far from zero when the null is false. But it does not need to be.

**A test statistic for roulette**

A natural test statistic for the win probability of a bet on red would be the corresponding win frequency in our data. We could use either the relative win frequency (which also happens to be the sample average): \[\hat{f}_{red} = \bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i\] but it will be more convenient to use the absolute win frequency: \[t_n = n\hat{f}_{red} = n\bar{x}_n =\sum_{i=1}^n x_i\] Next we need to find the probability distribution of \(t_n\) under the null, and under the alternative.

In general, since \(x_i \sim Bernoulli(p_{red})\) we have: \[t_n \sim Binomial(n,p_{red})\] Remember that the \(Binomial(n,p)\) distribution is the distribution corresponding to the number of times an event with probability \(p\) happens in \(n\) independent trials.

Under the null (when \(H_0\) is true), \(p_{red} = 18/37\) and so: \[t_n \sim Binomial(100,18/37)\] Since this distribution does not involve any unknown parameters, our test statistic satisfies the requirement of having a known distribution under the null.

Under the alternative (when \(H_1\) is true), \(p_{red}\) can take on any value
*other* than \(18/37\). The sample size is still \(n=100\), so the
distribution of the test statistic is:
\[t_n \sim Binomial(100,p_{red}) \textrm{ where $p_{red} \neq 18/37$ }\]
Notice that the distribution of our test statistic under the alternative
is not known, since \(p_{red}\) is not known. But the distribution is
*different* under the alternative, and that is what we require from
our test statistic.
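As a rough illustration of what "known under the null, different under the alternative" means, the Python sketch below evaluates the \(Binomial(100,p)\) probability of one outcome under the null value and under one value allowed by the alternative. The alternative value \(p_{red} = 0.40\) and the outcome of 40 wins are hypothetical choices made only for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 100
p_null = 18 / 37  # value of p_red under H0
p_alt = 0.40      # one hypothetical value allowed by H1 (an assumption)

# Probability that red wins exactly 40 of the 100 games under each value
prob_null = binom_pmf(40, n, p_null)
prob_alt = binom_pmf(40, n, p_alt)
print(f"Pr(t_n = 40) when p_red = 18/37: {prob_null:.4f}")
print(f"Pr(t_n = 40) when p_red = 0.40:  {prob_alt:.4f}")
```

The first probability can be computed before seeing any data because nothing in it is unknown. The second depends on which alternative value we plug in, which is why the distribution under the alternative is not known.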

### 9.2.3 Significance and critical values

After choosing a test statistic \(t_n\), the next step is to choose
**critical values**. The critical values are two numbers
\(c_L\) and \(c_H\) (where \(c_L < c_H\)) such that

- \(t_n\) has a *high* probability of being between \(c_L\) and \(c_H\) when the null is true.
- \(t_n\) has a *lower* probability of being between \(c_L\) and \(c_H\) when the alternative is true.

The range of values from \(c_L\) to \(c_H\) is called the **critical range**
of our test.

Given the test statistic and critical values:

- We *reject the null* if \(t_n\) is outside of the critical range.
  - This means we have strong evidence that \(H_0\) is false.
  - The reason we reject here is that we know we would be unlikely to observe such a value of \(t_n\) if \(H_0\) were true.
- We *fail to reject the null* if \(t_n\) is inside of the critical range.
  - This means we do not have strong evidence that \(H_0\) is false.
  - This does not mean we have strong evidence that \(H_0\) is true. We may just not have enough evidence to reach a conclusion.

Notice that there is an asymmetry here: in the absence of evidence,
we will not reject *any* null hypotheses.

How do we choose critical values? You can think of critical values as setting a standard of evidence, so we need to balance two considerations:

- The probability of rejecting a false null is called the *power* of the test.
  - We want our test to reject the null when it is false, so power is good.
- The probability of rejecting a true null is called the *size* or *significance* of a test.
  - We do not want our test to reject the null when it is true, so size is bad.
- There is always a trade off between power and size.
  - A *narrower* critical range (higher \(c_L\) or lower \(c_H\)) will increase the rejection rate, increasing both power (good) and size (bad).
  - A *wider* critical range (lower \(c_L\) or higher \(c_H\)) will reduce the rejection rate, reducing both power (bad) and size (good).

Given this trade off between power and size, we might construct some criterion that includes both (just like MSE includes both variance and bias) and choose critical values to maximize that criterion. In practice, we do not typically do that.

Instead, we follow a simple convention:

- Set the size to a fixed value \(\alpha\).
  - In general, the conventional size varies by field, and typically varies with how much data is typical in that field.
  - In economics and most other social sciences, the usual convention is to use a size of 5% (\(\alpha = 0.05\)).
  - We sometimes see 1% (\(\alpha = 0.01\)) when working with larger data sets or 10% (\(\alpha = 0.10\)) when working with small data sets.
  - In physics or genetics, where data sets are much larger, the conventional size is much lower.
- Calculate critical values that imply the desired size.
  - With a size of 5% \((\alpha = 0.05)\), we would:
    - set \(c_L\) to the 2.5 percentile (0.025 quantile) of the null distribution
    - set \(c_H\) to the 97.5 percentile (0.975 quantile) of the null distribution
  - With a size of 10% \((\alpha = 0.10)\), we would:
    - set \(c_L\) to the 5 percentile (0.05 quantile) of the null distribution
    - set \(c_H\) to the 95 percentile (0.95 quantile) of the null distribution
  - With a size of \(\alpha\), we would:
    - set \(c_L\) to the \(\alpha/2\) quantile of the null distribution
    - set \(c_H\) to the \(1-\alpha/2\) quantile of the null distribution

In other words, we set size equal to a conventional value, and let the power be whatever is implied by that.

**Critical values for roulette**

We earlier showed that the distribution of \(t_n\) under the null is:
\[t_n \sim Binomial(100,18/37)\]
We can get a size of 5% by choosing:
\[c_L = 2.5 \textrm{ percentile of } Binomial(100,18/37)\]
\[c_H = 97.5 \textrm{ percentile of } Binomial(100,18/37)\]
We can then use Excel or R to calculate these critical values. In Excel,
the function you would use is `BINOM.INV()`

- The formula to calculate \(c_L\) is
`=BINOM.INV(100,18/37,0.025)`

- The formula to calculate \(c_H\) is
`=BINOM.INV(100,18/37,0.975)`

The calculations below were done in R:

`## 2.5 percentile of binomial(100,18/37) = 39`

`## 97.5 percentile of binomial(100,18/37) = 58`

In other words we reject the null (at 5% significance) that the roulette wheel is fair if red wins fewer than 39 games or more than 58 games.
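The same percentiles can be sketched in Python using only the standard library. The helper name `binom_quantile` is an assumption introduced here; it uses the usual quantile convention for discrete distributions (the smallest value whose cumulative probability reaches the target), which is also how Excel documents `BINOM.INV`.

```python
from math import comb

def binom_quantile(q, n, p):
    """Smallest k such that Pr(X <= k) >= q for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p)**(n - k)
        if cumulative >= q:
            return k
    return n  # guards against floating-point shortfall when q is very close to 1

n, p0, alpha = 100, 18 / 37, 0.05
c_L = binom_quantile(alpha / 2, n, p0)      # 2.5 percentile of the null distribution
c_H = binom_quantile(1 - alpha / 2, n, p0)  # 97.5 percentile of the null distribution
print(f"c_L = {c_L}, c_H = {c_H}")
```

Changing `alpha` to 0.10 or 0.01 gives the critical values for a 10% or 1% test of the same null.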

**A general test for a single probability**

We can generalize the test we have constructed so far to the case of the probability of any event:

Test component | Roulette example | General case |
---|---|---|
Parameter | \(p_{red} = \Pr(\textrm{Red wins})\) | \(p = \Pr(\textrm{event})\) |
Null hypothesis | \(H_0:p_{red} = 18/37\) | \(H_0:p = p_0\) |
Alternative hypothesis | \(H_1: p_{red} \neq 18/37\) | \(H_1: p \neq p_0\) |
Test statistic | \(t = n\hat{f}_{red}\) | \(t = n\hat{f}_{\textrm{event}}\) |
Null distribution | \(Binomial(100,18/37)\) | \(Binomial(n,p_0)\) |
Critical value \(c_L\) | 39 | 2.5 percentile of \(Binomial(n,p_0)\) |
Critical value \(c_H\) | 58 | 97.5 percentile of \(Binomial(n,p_0)\) |
Decision | Reject if \(t \notin [39,58]\) | Reject if \(t \notin [c_L,c_H]\) |
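The general case in the table can be sketched as a single Python function. The names `binom_quantile` and `probability_test` are hypothetical, and the observed win counts in the usage lines are made up for illustration.

```python
from math import comb

def binom_quantile(q, n, p):
    """Smallest k such that Pr(X <= k) >= q for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p)**(n - k)
        if cumulative >= q:
            return k
    return n

def probability_test(t, n, p0, alpha=0.05):
    """Two-sided test of H0: p = p0 given t occurrences in n trials.

    Returns (c_L, c_H, reject), rejecting when t is outside [c_L, c_H].
    """
    c_L = binom_quantile(alpha / 2, n, p0)
    c_H = binom_quantile(1 - alpha / 2, n, p0)
    reject = t < c_L or t > c_H
    return c_L, c_H, reject

# Hypothetical outcomes: red wins 40 games, then 35 games, out of 100
print(probability_test(40, 100, 18 / 37))
print(probability_test(35, 100, 18 / 37))
```

Only the number of occurrences `t`, the sample size `n`, and the null value `p0` are needed, so the same function applies to any event probability.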

### 9.2.4 The power of a test

As mentioned above, the power of a test is defined as the probability of rejecting the null when it is false, and the alternative is true.

The size of a test is a number, since the distribution of the test statistic
is known under the null. Since the alternative typically allows
more than one value of the parameter \(\theta\), the power
of a test is not a number but a *function* of the unknown true value
of \(\theta\) (and sometimes other unknown features of the DGP):
\[power(\theta) = \Pr(\textrm{reject $H_0$})\]
In some cases we can actually calculate this function.

**The power curve for roulette**

Power curves can be tricky to calculate, and I will not ask you to calculate them for this course. But they can be calculated, and it is useful to see what they look like.

Figure 9.1 below depicts the power curve for the roulette test we have just constructed; that is, we are testing the null that \(p_{red} = 18/37\) at a 5% size. The blue line depicts the power curve for \(n=100\) as in our example, while the green line depicts the power curve for \(n=20\). There are a few features I would like you to notice, all of which are common to most regularly used tests:

- The power curve reaches its lowest value at the red point \((18/37,0.05)\). Note that \(18/37\) is the parameter value under the null, and \(0.05\) is the size of the test. In other words:
  - The power is always at least as big as the size, and is usually bigger.
  - We are more likely to reject the null when it is false than when it is true. That’s good!
  - When a test has this desirable property, we call it an *unbiased* test.
- The power increases as \(\theta\) gets further from the null.
  - That is, we are more likely to detect unfairness in a game that is *very* unfair than in one that is *a little* unfair.
- Power also increases with the sample size; the blue line (\(n = 100\)) is above the green line (\(n = 20\)).
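For the roulette test, the power function can be computed directly from the binomial distribution. The following Python sketch is one way to do so; the helper names and the particular values of \(p_{red}\) in the loop are assumptions chosen for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_quantile(q, n, p):
    """Smallest k such that Pr(X <= k) >= q for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p)
        if cumulative >= q:
            return k
    return n

def power(p_true, n, p0=18 / 37, alpha=0.05):
    """Pr(reject H0: p = p0) when the true probability is p_true."""
    c_L = binom_quantile(alpha / 2, n, p0)
    c_H = binom_quantile(1 - alpha / 2, n, p0)
    # Probability that t_n lands inside the critical range [c_L, c_H]
    inside = sum(binom_pmf(k, n, p_true) for k in range(c_L, c_H + 1))
    return 1 - inside

for p in (0.30, 0.40, 18 / 37, 0.60, 0.70):
    print(f"p_red = {p:.3f}: power(n=100) = {power(p, 100):.3f}, "
          f"power(n=20) = {power(p, 20):.3f}")
```

Because \(t_n\) only takes whole-number values, the rejection probability computed at the null value is at most the nominal 5% and need not equal it exactly.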

Power analysis is often used by researchers to determine how much data to collect. Each additional observation increases power but costs money, so it is important to spend enough to get clear results but not much more than that.

**P values**

The convention of always using a 5% significance level for hypothesis tests is somewhat arbitrary and has some negative unintended consequences:

- Sometimes a test statistic falls just below or just above the critical value, and small changes in the analysis can change a result from reject to cannot-reject.
- In many fields, unsophisticated researchers and journal editors misinterpret “cannot reject the null” as “the null is true.”

One common response to these issues is to report what is called
the **p-value** of a test. The p-value of a test is defined
as the significance level at which one would switch from rejecting
to not-rejecting the null. For example:

- If the p-value is 0.43 (43%) we would not reject the null at 10%, 5% or 1%.
- If the p-value is 0.06 (6%) we would reject the null at 10% but not at 5% or 1%.
- If the p-value is 0.02 (2%) we would reject the null at 10% and 5% but not at 1%.
- If the p-value is 0.001 (0.1%) we would reject the null at 10%, 5%, and 1%.

The p-value of a test is simple to calculate from the test statistic and its distribution under the null. I won’t go through that calculation here.
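For readers who want to see the idea anyway, here is a minimal Python sketch for the roulette test. It assumes one common convention, doubling the smaller of the two tail probabilities, which lines up with the equal-tailed critical values used in this chapter; other conventions exist. The function names and the win counts are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """Pr(X <= k) for X ~ Binomial(n, p); returns 0 when k < 0."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def p_value(t, n, p0):
    """Two-sided p-value for H0: p = p0 given t occurrences in n trials."""
    lower_tail = binom_cdf(t, n, p0)          # Pr(t_n <= observed value)
    upper_tail = 1 - binom_cdf(t - 1, n, p0)  # Pr(t_n >= observed value)
    return min(1.0, 2 * min(lower_tail, upper_tail))

for wins in (38, 40, 49):  # hypothetical numbers of red wins in 100 games
    print(f"{wins} wins: p-value = {p_value(wins, 100, 18 / 37):.3f}")
```

A p-value below 0.05 corresponds to a win count outside the 5% critical range, and a p-value above 0.05 to a win count inside it.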

## 9.3 Application: the mean

Having described the general framework and a single example, we now move on to the most common application: constructing hypothesis tests and confidence intervals on the mean in a random sample.

Let \(D = (x_1,\ldots,x_n)\) be a random sample of size \(n\) on some random variable \(x_i\) with unknown mean \(E(x_i) = \mu_x\) and variance \(var(x_i) = \sigma_x^2\).

Let the sample average be: \[\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i\] and the sample variance: \[s_x^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x}_n)^2\] Both of these statistics are easily calculated from the data, and we have previously discussed their properties in detail.

### 9.3.1 The null and alternative hypotheses

Suppose that you want to test the null hypothesis: \[H_0: \mu_x = 1\] against the alternative hypothesis \[H_1: \mu_x \neq 1\] Having stated our null and alternative hypotheses, we need to construct a test statistic.

Remember that our test statistic needs to have a *known* distribution
under the null, and a *different* distribution under the alternative.

### 9.3.2 The T statistic

The typical test statistic we use in this setting is called the
**T statistic**, and takes the form:
\[t_n = \frac{\bar{x}_n - 1}{s_x/\sqrt{n}}\]
The idea here is that we take our estimate of the parameter (\(\bar{x}_n\)),
subtract its expected value under the null (\(1\)), and divide
by an estimate of its standard deviation (\(s_x/\sqrt{n}\)). We can
add and subtract the unknown true mean \(\mu_x\) to get:
\[\begin{align}
t_n &= \frac{\bar{x}_n - \mu_x + \mu_x - 1}{s_x/\sqrt{n}} \\
&= \frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}} + \frac{\mu_x - 1}{s_x/\sqrt{n}}
\end{align}\]
The first part of this expression is a random variable with a mean of zero
and a variance of (about) one. The second part of the expression is
exactly zero when \(H_0\) is true, and not exactly zero when it is false.
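The T statistic is straightforward to compute from data. The Python sketch below does so for a small made-up sample; the data values and the function name `t_statistic` are assumptions for illustration only.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(data, mu_0):
    """T statistic for H0: mu_x = mu_0."""
    n = len(data)
    x_bar = mean(data)  # sample average
    s_x = stdev(data)   # sample standard deviation (divides by n - 1)
    return (x_bar - mu_0) / (s_x / sqrt(n))

sample = [0.8, 1.3, 1.1, 0.7, 1.6, 1.2, 0.9, 1.4]  # hypothetical data
t_n = t_statistic(sample, mu_0=1)
print(f"t_n = {t_n:.3f}")
```

Note that `statistics.stdev` uses the \(n-1\) divisor, matching the definition of \(s_x^2\) given earlier in this section.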

Recall that we need the probability distribution of \(t_n\) to be known when \(H_0\) is true, and different when it is false. The second criterion is clearly met, and the first criterion is met if we can find the probability distribution of \(\frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}}\).

Unfortunately, if we don’t know the exact probability distribution of \(x_i\) we don’t know the exact probability distribution of statistics calculated from it. Once we have a potential test statistic, there are two standard solutions to this problem:

- Assume a specific probability distribution (usually a normal distribution) for \(x_i\). We can (or at least a professional statistician can) then mathematically derive the distribution of any test statistic from this distribution.
- Use the central limit theorem to get an approximate probability distribution.

We will explore both of those options.

### 9.3.3 Asymptotic critical values

We will start with the asymptotic solution to the problem. As we learned in Chapter 7, the Central Limit Theorem tells us that: \[\frac{\bar{x}_n - \mu_x}{\sigma_x/\sqrt{n}} \rightarrow N(0,1)\] Under the null our test statistic looks just like this, but with the sample standard deviation \(s_x\) in place of the population standard deviation \(\sigma_x\). It turns out that Slutsky’s theorem allows us to make this substitution, and it can be proved that: \[\frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}} \rightarrow N(0,1)\]

Therefore, under the null: \[t_n \rightarrow N(0,1)\] In other words, while we do not know the exact (finite-sample) distribution of \(t_n\) we know that \(N(0,1)\) provides a useful asymptotic approximation to that distribution.

Therefore, if we want a test that has the **asymptotic size** of 5%, we
can use Excel or R to calculate critical values. In Excel, the function
would be `NORM.INV` or `NORM.S.INV`, and the formulas would be:

- \(c_L\): `=NORM.S.INV(0.025)` or `=NORM.INV(0.025,0,1)`.
- \(c_H\): `=NORM.S.INV(0.975)` or `=NORM.INV(0.975,0,1)`.

The calculations below were done in R:

`## cL = 2.5 percentile of N(0,1) = -1.96`

`## cH = 97.5 percentile of N(0,1) = 1.96`

These particular critical values are so commonly used that I want you to remember them.
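The same percentiles are available in Python's standard library through `statistics.NormalDist`. In the sketch below, the function name `asymptotic_critical_values` and the value of `t_n` are hypothetical.

```python
from statistics import NormalDist

def asymptotic_critical_values(alpha=0.05):
    """Critical values (c_L, c_H) from the N(0,1) approximation."""
    z = NormalDist(mu=0.0, sigma=1.0)
    return z.inv_cdf(alpha / 2), z.inv_cdf(1 - alpha / 2)

c_L, c_H = asymptotic_critical_values(0.05)
print(f"c_L = {c_L:.2f}, c_H = {c_H:.2f}")

t_n = 1.14  # hypothetical value of the T statistic
reject = t_n < c_L or t_n > c_H
print("reject H0" if reject else "fail to reject H0")
```

Passing a different `alpha` gives the critical values for other sizes, such as 10% or 1%.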

### 9.3.4 Exact critical values

Most economic data comes in sufficiently large samples that the asymptotic distribution of \(t_n\) is a reasonable approximation and the asymptotic test works well. But occasionally we have samples that are small enough that it doesn’t.

Another option is to assume that the \(x_i\) variables are normally distributed: \[x_i \sim N(\mu_x,\sigma_x^2)\] where \(\mu_x\) and \(\sigma_x^2\) are unknown parameters.

Now at this point it is important to remind you: many interesting
variables are *not* normally distributed (for example, our roulette outcome
is discrete uniform, and the result of a given bet is Bernoulli)
and so this assumption may very well be incorrect.

If this was a more advanced course I would derive the
distribution of \(t_n\) under the null. But for ECON 233, I will
just ask you to understand that it *can* be derived once we
assume normality of the \(x_i\).

The null distribution of this particular test statistic under these particular assumptions was derived in 1908 by William Sealy Gosset, a statistician working at the Guinness brewery. To avoid getting in trouble at work (Guinness did not want to give away trade secrets) Gosset published under the pseudonym “Student”. As a result, the family of distributions he derived is called “Student’s T distribution”.

When the null is true, the test statistic \(t_n\) as described above has Student’s T distribution with \(n-1\) degrees of freedom: \[t_n \sim T_{n-1}\] As always, it has a different distribution (sometimes called the “noncentral T distribution”) when the null is false.

The \(T_{n-1}\) distribution looks a lot like the \(N(0,1)\) distribution, but has slightly higher probability of extreme positive or negative values. As \(n\) increases the \(T_{n-1}\) distribution converges to the \(N(0,1)\) distribution, just as predicted by the central limit theorem.

Having found our test statistic and its distribution under the null,
we can calculate our critical values:
\[c_L = 2.5 \textrm{ percentile of } T_{n-1}\]
\[c_H = 97.5 \textrm{ percentile of } T_{n-1}\]
We can obtain these percentiles using Excel or R. In Excel, the
relevant function is `T.INV`. For example, if we have 5 observations, then:

- We would calculate \(c_L\) by the formula `=T.INV(0.025,4)`.
- We would calculate \(c_H\) by the formula `=T.INV(0.975,4)`.

The results (calculated below using R) would be:

`## cL = 2.5 percentile of T_4 = -2.776`

`## cH = 97.5 percentile of T_4 = 2.776`

In contrast, if we have 30 observations, then:

- We would calculate \(c_L\) by the formula `=T.INV(0.025,29)`.
- We would calculate \(c_H\) by the formula `=T.INV(0.975,29)`.

The results (calculated below using R) would be:

`## cL = 2.5 percentile of T_29 = -2.045`

`## cH = 97.5 percentile of T_29 = 2.045`

and if we have 1,000 observations:

- We would calculate \(c_L\) by the formula `=T.INV(0.025,999)`.
- We would calculate \(c_H\) by the formula `=T.INV(0.975,999)`.

The results (calculated below using R) would be:

`## cL = 2.5 percentile of T_999 = -1.962`

`## cH = 97.5 percentile of T_999 = 1.962`

Notice that the Student’s T test is more **conservative** (less likely
to reject) than the asymptotic test for smaller sample sizes. But
at some point (around \(n = 30\)) the difference between the two tests
becomes negligible.

In practice, most data sets in economics have well over 30 observations so economists tend to use asymptotic tests unless they have a very small sample.
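Python's standard library has no Student's T distribution, so the sketch below computes the percentiles numerically: it integrates the T density with Simpson's rule and finds the quantile by bisection. This is an illustrative approach and not the method used by Excel or R, and all function names are assumptions. With SciPy installed, `scipy.stats.t.ppf(0.975, df)` is the more usual route.

```python
from math import exp, lgamma, log, pi

def t_pdf(x, df):
    """Density of Student's T distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))

def t_cdf(x, df, steps=1000):
    """Pr(T <= x), using Simpson's rule on [0, |x|] and symmetry around 0."""
    if x < 0:
        return 1 - t_cdf(-x, df, steps)
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(q, df):
    """Quantile of T_df for 0 < q < 1, found by bisection on the CDF."""
    if q < 0.5:
        return -t_quantile(1 - q, df)
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < q:  # widen the bracket until it contains the quantile
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for n in (5, 30, 1000):
    print(f"n = {n}: c_H = {t_quantile(0.975, n - 1):.3f}")
```

The loop uses \(n-1\) degrees of freedom for each sample size, so the three lines correspond to \(T_4\), \(T_{29}\), and \(T_{999}\).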

## 9.4 Confidence intervals

Hypothesis tests have one very important limitation: although they allow us to rule out \(\theta = \theta_0\) for a single value of \(\theta_0\), they say nothing about other values very close to \(\theta_0\).

For example, suppose you are a medical researcher trying to measure the effect of a particular treatment, let \(\theta\) be the treatment effect, and suppose that you have tested the null hypothesis that the treatment has no effect (\(\theta = 0\)).

- If you reject this null, you have concluded that the treatment has *some* effect. However, that does not rule out the possibility that the effect of the treatment is *very small*.
- If you fail to reject this null, you cannot rule out the possibility that the treatment has *no* effect. However, this does not rule out the possibility that the effect is *very large*.

The solution to this would be to do a hypothesis test for every possible
value of \(\theta\), and classify them into values that were rejected
and not rejected. This is the idea of a **confidence interval**.

A confidence interval for the parameter \(\theta\) with coverage rate \(CP\) is an interval with lower bound \(CI_L\) and upper bound \(CI_H\) constructed from the data in such a way that \[\Pr(CI_L < \theta < CI_H) = CP\] In economics and most other social sciences, the convention is to report confidence intervals with a coverage rate of 95%. \[\Pr(CI_L < \theta < CI_H) = 0.95\] Note that \(\theta\) is a fixed (but unknown) parameter, while \(CI_L\) and \(CI_H\) are statistics calculated from the data.

How do we calculate confidence intervals? It turns out to be entirely straightforward, because confidence intervals can be constructed by inverting hypothesis tests:

- The 95% confidence interval is all values that cannot be rejected at a 5% level of significance.
- The 90% confidence interval is all values that cannot be rejected at a 10% level of significance.
  - It is *narrower* than the 95% confidence interval.
- The 99% confidence interval is all values that cannot be rejected at a 1% level of significance.
  - It is *wider* than the 95% confidence interval.

**A confidence interval for the win probability**

Calculating a confidence interval for \(p_{red}\) is somewhat tricky to do by hand, but easy to do on a computer:

- Construct a grid of many values between 0 and 1.
- For each value \(p_0\) in the grid, test the null hypothesis \(H_0: p_{red} = p_0\) against the alternative hypothesis \(H_1: p_{red} \neq p_0\).
- The confidence interval is the range of values for \(p_0\) that are not rejected.

For example, suppose that red wins on 40 of the 100 games. Then a 95% confidence interval for \(p_{red}\) is:

`## 0.32 to 0.49`

Notice that the confidence interval includes the fair value of \(0.486\) but it also includes some very unfair values. In other words, while we are unable to rule out the possibility that we have a fair game, the evidence that we have a fair game is not very strong.
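The grid procedure described above can be sketched in Python as follows. The function names, the grid of 999 evenly spaced values, and the choice to treat a statistic exactly equal to a critical value as not rejected (as in the decision rule of Section 9.2.3) are assumptions. The endpoints depend on the grid spacing and on that boundary convention, so they may differ somewhat from the interval reported above.

```python
from math import comb

def binom_quantile(q, n, p):
    """Smallest k such that Pr(X <= k) >= q for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p)**(n - k)
        if cumulative >= q:
            return k
    return n

def confidence_interval(t, n, alpha=0.05, grid_size=999):
    """Invert the two-sided test: keep every grid value p0 that is not rejected."""
    not_rejected = []
    for i in range(1, grid_size + 1):
        p0 = i / (grid_size + 1)  # candidate values strictly between 0 and 1
        c_L = binom_quantile(alpha / 2, n, p0)
        c_H = binom_quantile(1 - alpha / 2, n, p0)
        if c_L <= t <= c_H:       # fail to reject H0: p = p0
            not_rejected.append(p0)
    return min(not_rejected), max(not_rejected)

ci_low, ci_high = confidence_interval(40, 100)  # red wins 40 of 100 games
print(f"95% confidence interval: {ci_low:.3f} to {ci_high:.3f}")
```

A finer grid gives more precise endpoints at the cost of more tests, and a smaller `alpha` produces a wider interval.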

### 9.4.1 Confidence intervals for the mean

Confidence intervals for the mean are very easy to calculate. Again we construct them by inverting the hypothesis test.

Pick any \(\mu_0\). To test the null \[H_0: \mu_x = \mu_0\] our test statistic is: \[t_n = \sqrt{n}\frac{\bar{x}-\mu_0}{s_x}\] and we fail to reject the null if \[c_L < t_n < c_H\] where \(c_L\) and \(c_H\) are our critical values.

Plugging \(t_n\) into this expression, we fail to reject the null whenever: \[c_L < \sqrt{n}\frac{\bar{x}-\mu_0}{s_x} < c_H\] Solving for \(\mu_0\) we fail to reject whenever: \[\bar{x} - c_H s_x/\sqrt{n} < \mu_0 < \bar{x} - c_L s_x/\sqrt{n}\] All that remains is to choose a confidence/size level, and decide whether to use an asymptotic or finite sample test.

If we are using the asymptotic approximation to construct a 95% confidence interval, then the 5% asymptotic critical values are \(c_L \approx -1.96\) and \(c_H \approx 1.96\) and the confidence interval is: \[CI = \bar{x} \pm 1.96 s_x/\sqrt{n}\] In other words, the 95% confidence interval for \(\mu_x\) is just the point estimate plus or minus roughly 2 standard errors.

If we have a small sample, and choose to assume normality rather than using the asymptotic approximation, then we need to use the slightly larger critical values from the \(T_{n-1}\) distribution. For example, if \(n=5\), then \(c_L \approx -2.78\), \(c_H \approx 2.78\) and the 95% confidence interval is: \[CI = \bar{x} \pm 2.78 s_x/\sqrt{n}\] As with hypothesis tests, finite sample confidence intervals are typically more conservative (wider) than their asymptotic cousins, but the difference becomes negligible as the sample size increases.
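Both versions of the interval can be sketched with one Python function. The data values, the function name `mean_confidence_interval`, and the optional `critical_value` argument are assumptions for illustration; by default the function uses the asymptotic \(N(0,1)\) critical value.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(data, coverage=0.95, critical_value=None):
    """Confidence interval for the mean: x_bar +/- c_H * s_x / sqrt(n)."""
    n = len(data)
    x_bar = mean(data)
    standard_error = stdev(data) / sqrt(n)  # stdev divides by n - 1
    if critical_value is None:
        # Asymptotic critical value, about 1.96 for 95% coverage
        critical_value = NormalDist().inv_cdf(1 - (1 - coverage) / 2)
    margin = critical_value * standard_error
    return x_bar - margin, x_bar + margin

sample = [0.8, 1.3, 1.1, 0.7, 1.6, 1.2, 0.9, 1.4]  # hypothetical data
low, high = mean_confidence_interval(sample)
print(f"asymptotic 95% CI: {low:.3f} to {high:.3f}")

# With a small sample and assumed normality, pass the T critical value instead,
# for example 2.78 when n = 5 as in the text.
small = sample[:5]
low_t, high_t = mean_confidence_interval(small, critical_value=2.78)
print(f"T-based 95% CI (n = 5): {low_t:.3f} to {high_t:.3f}")
```

With only eight observations the asymptotic interval is a rough approximation; the example is kept small so the arithmetic is easy to check by hand.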