Chapter 5 Introduction to Inference

5.1 Module Overview

This module will bridge the gap between our discussion on the normal distribution and our first forays into statistical inference. As it turns out, much of the statistical inference we will use relies on the normal distribution and the t-distribution, which we will introduce in this module. We begin our study of statistical inference by learning about confidence intervals.

Module Learning Objectives/Outcomes

  1. Find the distribution of a sample mean.
  2. Estimate probabilities for a sample mean.
  3. Calculate and interpret confidence intervals for a population mean.
  4. Use the standard normal and t-distributions to find critical values.

This module’s outcomes correspond to course outcomes (6) apply statistical inference techniques of parameter estimation such as point estimation and confidence interval estimation and (7) apply techniques of testing various statistical hypotheses concerning population parameters.

5.2 Sampling Distributions

5.2.1 Sampling Error

We want to use a sample to learn something about a population, but no sample is perfect! Sampling error is the error resulting from using a sample to estimate a population characteristic.

If we use a sample mean \(\bar{x}\) to estimate \(\mu\), chances are that \(\bar{x}\ne\mu\) (they might be close but… they might not be!). We will consider

  • How close is \(\bar{x}\) to \(\mu\)?
  • What if we took many samples and calculated \(\bar{x}\) many times?
    • How would that relate to \(\mu\)?
    • What would be the distribution of these values?

The distribution of a statistic (across all possible samples of size \(n\)) is called the sampling distribution. We will focus primarily on the distribution of the sample mean.

For a variable \(x\) and given a sample size \(n\), the distribution of \(\bar{x}\) is called the sampling distribution of the sample mean or the distribution of \(\boldsymbol{\bar{x}}\).

Example: Suppose our population is the five starting players on a particular basketball team. We are interested in their heights (measured in inches). The full population data is

Player A B C D E
Height 76 78 79 81 86

The population mean is \(\mu=80\). Consider all possible samples of size \(n=2\):

Sample A,B A,C A,D A,E B,C B,D B,E C,D C,E D,E
\(\bar{x}\) 77 77.5 78.5 81.0 78.5 79.5 82.0 80.0 82.5 83.5

There are 10 possible samples of size 2. Of these samples, 10% have means exactly equal to \(\mu\) (for a random sample of size 2, you’d have a 10% chance to find \(\bar{x}=\mu\)… and a 90% chance not to!).
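We can verify this example in R, using combn() to enumerate every possible sample of size 2 (heights as listed above):

```r
# Population: heights (in inches) of the five starting players
heights <- c(A = 76, B = 78, C = 79, D = 81, E = 86)
mu <- mean(heights)          # population mean, 80

# Every possible sample of size 2 and its sample mean
xbars <- combn(heights, 2, mean)
length(xbars)                # 10 possible samples
mean(xbars == mu)            # 0.1: only 10% hit mu exactly
mean(xbars)                  # 80: the sampling distribution is centered at mu
```

That last line previews an important fact from the next subsection: the mean of the sampling distribution of \(\bar{x}\) equals \(\mu\).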

In general, the larger the sample size, the smaller the sampling error tends to be in estimating \(\mu\) using \(\bar{x}\).

In practice, we have one sample and \(\mu\) is unknown. We also have limited resources to collect data, so it may not be feasible to collect a very large sample.

The mean of the distribution of \(\bar{x}\) is \(\mu_{\bar{X}}=\mu\) and the standard deviation is \(\sigma_{\bar{X}}=\sigma/\sqrt{n}\). We refer to the standard deviation of a sampling distribution as standard error. (Note that this standard error formula is built for very large populations, so it will not work well for our basketball players. This is okay! We usually work with populations so large that we treat them as “infinite.”)

Example: The mean living space for a detached single-family home in the United States is 1742 ft\(^2\) with a standard deviation of 568 ft\(^2\). (Does that mean seem huge to anyone else??) For samples of 25 homes, determine the mean and standard error of \(\bar{x}\).

Using our formulae: \[\mu_{\bar{X}} = \mu = 1742\] and \[\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{568}{\sqrt{25}} = 113.6.\]
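The same computation in R, using the values from the example:

```r
mu    <- 1742            # population mean living space (sq ft)
sigma <- 568             # population standard deviation (sq ft)
n     <- 25              # sample size

mu_xbar <- mu            # mean of the sampling distribution: 1742
se_xbar <- sigma / sqrt(n)
se_xbar                  # standard error: 113.6
```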

5.2.2 The Sampling Distribution of \(\bar{X}\)

First, we consider the setting where \(X\) is Normal(\(\mu\), \(\sigma\)). The plots below show (A) a random sample of 1000 from a Normal(100, 25) distribution and (B) the approximate sampling distribution of \(\bar{X}\) when \(X\) is Normal(100, 25).

Notice how the x-axis changes from one plot to the next.

In fact, if \(X\) is Normal(\(\mu\), \(\sigma\)), then \(\bar{X}\) is Normal(\(\mu_{\bar{X}}=\mu\), \(\sigma_{\bar{X}}=\sigma/\sqrt{n}\)).

Central Limit Theorem

For relatively large sample sizes, the random variable \(\bar{X}\) is approximately normally distributed regardless of the distribution of \(X\): \[\bar{X}\text{ is Normal}(\mu_{\bar{X}}=\mu, \sigma_{\bar{X}}=\sigma/\sqrt{n}).\]

Notes

  • This approximation improves with increasing sample size.
  • In general, “relatively large” means sample sizes \(n \ge 30\).
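A small simulation illustrates the Central Limit Theorem in action. Here I chose \(X\) to be exponential (a strongly right-skewed distribution, nothing like a bell curve), and the seed is arbitrary; for Exponential(rate = 1), \(\mu = \sigma = 1\):

```r
set.seed(1)  # arbitrary seed, for reproducibility
n <- 30; reps <- 10000

# X ~ Exponential(rate = 1): strongly right-skewed, mu = sigma = 1
xbars <- replicate(reps, mean(rexp(n, rate = 1)))

mean(xbars)        # close to mu = 1
sd(xbars)          # close to sigma / sqrt(n), about 0.183
# hist(xbars) shows a roughly bell-shaped (normal-looking) distribution
```

Even though the individual observations are skewed, the sample means behave approximately like a Normal(\(\mu\), \(\sigma/\sqrt{n}\)) random variable.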

5.3 Developing Confidence Intervals

Recall: A point estimate is a single-value estimate of a population parameter. We say that a statistic is an unbiased estimator if the mean of its distribution is equal to the population parameter. Otherwise, it is a biased estimator.

Comment Remember how our formula for sample variance, the “mean squared deviation,” divides by \(n-1\) instead of \(n\)? We do this so that \(s^2\) is an unbiased estimate of \(\sigma^2\).

Ideally, we want estimates that are unbiased with small standard error. For example, a sample mean (unbiased) with a large sample size (results in smaller standard error).

Point estimates are useful, but they only give us so much information. The variability of an estimate is also important!

Example Think about estimating what tomorrow’s weather will be like. If it’s May in Sacramento, the average high temperature is 82 degrees Fahrenheit, but it’s not uncommon to have highs anywhere from 75 to 90! Since the highs are so variable, it’s hard to be confident using 82 to predict tomorrow’s weather.

On the flip side, think about July in Phoenix. The average high is 106 degrees Fahrenheit. In Phoenix, it’s uncommon to have a July day with a high below 100. Since the highs are not very variable, you could feel pretty confident using 106 to predict tomorrow’s weather.

Take a look at these two boxplots:

Both samples are size \(n=100\) and have \(\bar{x}=0\), which would be our point estimate for \(\mu\)… but Variable 1 has a sample standard deviation of \(s=0.5\) and Variable 2 has sample standard deviation \(s=5\). As a result, we can be more confident in our estimate of the population mean for Variable 1 than for Variable 2.

We want to formalize this idea of confidence in our estimates. A confidence interval is an interval of numbers based on the point estimate of the parameter. Say we want to be 95% confident about a statement. In Statistics, this means that we have arrived at our statement using a method that will give us a correct statement 95% of the time.

Assume we are taking a sample from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\). We will assume the value of \(\sigma\) is known to us. Then \(\bar{X}\) is Normal(\(\mu, \sigma/\sqrt{n}\)). If we standardize \(\bar{X}\), we get \[Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}.\]

We want some interval \((a,b)\). We will start by considering \(a < Z < b\), so \(a < Z\) and \(Z < b\) (or \(b > Z\)). Then

\[ \begin{aligned} Z &< b\\ \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} &< b\\ \bar{X}-\mu &< b\sigma/\sqrt{n} \\ \bar{X}-b\sigma/\sqrt{n} &< \mu \end{aligned} \]

and

\[ \begin{aligned} a &< Z \\ a &< \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \\ a\sigma/\sqrt{n} &< \bar{X}-\mu \\ \mu &< \bar{X}-a\sigma/\sqrt{n} \end{aligned} \]

putting these together, \[ \bar{X}-b\frac{\sigma}{\sqrt{n}} < \mu < \bar{X}-a\frac{\sigma}{\sqrt{n}}.\] If we want to be 95% confident, then we want \(P(a < Z < b)=0.95\): \[P\left(\bar{X}-b\frac{\sigma}{\sqrt{n}} < \mu < \bar{X}-a\frac{\sigma}{\sqrt{n}}\right) = 0.95.\] To calculate the 95% confidence interval, we need to find \(a\) and \(b\) such that \(P(a < Z < b)=0.95\).

We want this interval to be as narrow (small) as possible. Why? Narrower intervals are more informative. If I say I’m 95% confident that tomorrow’s high will be between -100 and 200 degrees Fahrenheit, that’s a useless interval. If I change it to between 70 and 100, that’s a little better. Changing it to between 85 and 90 is even better. This is what we mean by more informative.

It turns out that, with a symmetric distribution like the normal distribution, the way to make a confidence interval as narrow as possible is to take advantage of this symmetry. Each of the plots below shows a shaded area of 0.95. The narrowest interval (along the horizontal axis) is the first interval, which is shaded on \((-1.96 < z < 1.96)\).

The confidence interval, then, is \[\left(\bar{x} - z_*\frac{\sigma}{\sqrt{n}}, \bar{x} + z_*\frac{\sigma}{\sqrt{n}}\right)\] where \(z_* = 1.96\). The midpoint of this interval is \(\bar{x}\). The value of \[z_*\frac{\sigma}{\sqrt{n}}\] is called the margin of error.

5.3.1 Interpreting a Confidence Interval

To interpret a confidence interval, we need to think back to our definition of probability as “the proportion of times it would occur if the experiment were run infinitely many times.” In the confidence interval case, if an experiment is run infinitely many times, the true value of \(\mu\) will be contained in 95% of the intervals.

The graphic above shows 95% confidence intervals for 100 samples of size \(n=60\) drawn from a population with mean \(\mu=80\) and standard deviation \(\sigma=25\). Each sample’s confidence interval is represented by a horizontal line. The dot in the middle of each is the sample mean. When a confidence interval does not capture the population mean \(\mu\), the line is printed in red. Based on this concept of repeated sampling, we would expect about 95% of these intervals to capture \(\mu\). In fact, 96 of the 100 intervals capture \(\mu\).
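We can mimic this repeated-sampling experiment in R, drawing samples from a Normal(80, 25) population just as in the graphic (the seed is arbitrary, so the exact count will vary with it):

```r
set.seed(42)
mu <- 80; sigma <- 25; n <- 60
z  <- qnorm(0.975)        # 95% critical value, 1.96

# For each of 100 samples: does the 95% CI capture mu?
captured <- replicate(100, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  moe  <- z * sigma / sqrt(n)
  (xbar - moe < mu) && (mu < xbar + moe)
})
sum(captured)             # close to 95 out of 100
```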

Finally, when you interpret a confidence interval, it is important to do so in the context of the problem.

Example The preferred keyboard height for typists is approximately normally distributed with \(\sigma=2.0\) cm. A sample of size \(n=31\) resulted in a mean preferred keyboard height of 80 cm. Find and interpret a 95% confidence interval for keyboard height.

The interval is \[\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 80.0 \pm 1.96\times\frac{2.0}{\sqrt{31}} = 80.0 \pm 0.70 = (79.3, 80.7).\] Interpretation: We can be 95% confident that the mean preferred keyboard height for typists is between 79.3 cm and 80.7 cm.

Notice that I kept the interpretation simple! That’s okay - just be sure you are also able to explain what it means to be 95% confident (using the concept of repeated sampling).
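The arithmetic from this example is easy to reproduce in R:

```r
xbar  <- 80.0            # sample mean (cm)
sigma <- 2.0             # known population standard deviation (cm)
n     <- 31
z     <- qnorm(0.975)    # 95% critical value, 1.96

moe <- z * sigma / sqrt(n)        # margin of error, about 0.70
c(xbar - moe, xbar + moe)         # (79.3, 80.7) after rounding
```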

Common mistakes:

  • It is NOT accurate to say that “the probability that \(\mu\) is in the confidence interval is 0.95.” The parameter \(\mu\) is some fixed quantity and it’s either in the interval or it isn’t.
  • We are NOT “95% confident that \(\bar{x}\) is in the interval.” The value \(\bar{x}\) is some known quantity and it’s always in the interval.

5.3.2 Exercises

  1. Suppose I took a random sample of 50 Sac State students and asked about their SAT scores and found a mean score of 1112. Prior experience with SAT scores in the CSU system suggests that SAT scores are well-approximated by a normal distribution with standard deviation known to be 50.
    1. Find a 95% confidence interval for Sac State SAT scores.
    2. Interpret your interval in the context of the problem.
    3. What is the width of your interval? If you want a narrower interval, what could you do?

5.4 Other Levels of Confidence

While the 95% confidence interval is common in research, there’s nothing inherently special about it. You could calculate a 90%, a 99%, or - if you’re feeling spicy - something like a 43.8% confidence interval. These numbers are called the confidence level and they represent the proportion of times that the parameter will fall in the interval (if we took many samples).

The 100(1-\(\alpha\))% confidence interval for \(\mu\) is given by \[\bar{x}\pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\] where \(z_{\alpha/2}\) is the z-score associated with the \([1-(\alpha/2)]\)th percentile of the standard normal distribution. The value \(z_{\alpha/2}\) is called the critical value (“c.v.” on the plot, below).

We can find critical values in R using the same command we used to find percentiles: qnorm(p). We want a 100(1-\(\alpha\))% confidence interval, so we need to quickly solve for \(\alpha\) and divide by 2. For example, for a 98% interval, \[100(1-\alpha) = 98 \implies \alpha=0.02\] Then \(\alpha/2 = 0.01\) and

qnorm(0.01)
## [1] -2.326348

So the critical value is \(z_{\alpha/2}=2.326\). Notice that I dropped the negative sign here. That’s because our formula uses \(\pm z_{\alpha/2}\), so the sign doesn’t matter. I’ll always ignore that negative for critical values. As long as you write your interval as \((\text{smaller number}, \text{bigger number})\), it’s all good.

Common Critical Values

Confidence Level \(\alpha\) Critical Value, \(z_{\alpha/2}\)
90% 0.10 1.645
95% 0.05 1.96
98% 0.02 2.326
99% 0.01 2.575
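The table above can be reproduced with qnorm(). (Note: the traditional table value 2.575 for 99% is just a slightly different rounding of 2.5758.)

```r
conf  <- c(0.90, 0.95, 0.98, 0.99)   # confidence levels
alpha <- 1 - conf
round(qnorm(1 - alpha / 2), 3)       # 1.645 1.960 2.326 2.576
```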

5.4.1 Breaking Down a Confidence Interval

Consider \[\left(\bar{x}- z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \quad \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)\] The key values are

  • \(\bar{x}\), the sample mean
  • \(\sigma\), the population standard deviation
  • \(n\), the sample size
  • \(z_{\alpha/2}\), the critical value \[P(Z > z_{\alpha/2}) = \frac{\alpha}{2}\]

The value of interest is \(\mu\), the (unknown) population mean; the confidence interval gives us a reasonable range of values for \(\mu\).

In addition, the formula includes

  • The standard error, \(\frac{\sigma}{\sqrt{n}}\)
  • The margin of error, \(z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\)

5.4.2 Confidence Level, Precision, and Sample Size

If we can be 99% confident (or even higher), why do we tend to “settle” for 95%?? Take a look at the common critical values (above) and the confidence interval formula \[\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.\] What will higher levels of confidence do to this interval? Think back to the intuitive interval width explanation with the weather. Mathematically, the same thing will happen: the interval will get wider! And remember, a narrow interval is a more informative interval. There is a trade-off here between interval width and confidence. In general, the scientific community has settled on 95% as a compromise between the two, but different fields may use different levels of confidence.

There is one other thing we can control in the confidence interval: the sample size \(n\). One strategy is to specify the confidence level and the maximum acceptable interval width \(w\), then use these to determine the sample size. The actual width of the interval is \(2z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\), so we require \[w \ge 2z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.\] (Note: I use \(\ge\) because \(w\) is the maximum acceptable width - we would still be happy if the actual width turned out to be smaller!) Solving for \(n\): \[ n \ge \left(2z_{\alpha/2}\frac{\sigma}{w}\right)^2\] Alternately, we may specify a maximum margin of error \(m\) instead: \[ n \ge \left(z_{\alpha/2}\frac{\sigma}{m}\right)^2\] Once we’ve done this calculation, we need a whole number for \(n\). Since \(n \ge\) something, we will always round up.

Example Suppose we want a 95% confidence interval for the mean of a normally distributed population with standard deviation \(\sigma=10\). It is important for our margin of error to be no more than 2. What sample size do we need?

Using the formula for sample size with a desired margin of error, I can plug in \(z_{0.05/2}=1.96\), \(m=2\), and \(\sigma=10\): \[n \ge \left(1.96\times\frac{10}{2}\right)^2 = 96.04.\] So (rounding up!) I need a sample size of at least 97.
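In R, ceiling() handles the round-up step:

```r
sigma <- 10; m <- 2
z <- 1.96                       # 95% critical value

n_raw <- (z * sigma / m)^2
n_raw                           # 96.04
ceiling(n_raw)                  # round UP: need n = 97
```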

A few comments:

  • As desired width/margin of error decreases, \(n\) will increase.
  • As \(\sigma\) increases, \(n\) will also increase. (More population variability will necessitate a larger sample size.)
  • As confidence level increases, \(n\) will also increase.

5.4.3 Exercises

  1. In the previous section, you worked with a random sample of 50 Sac State students with mean SAT score 1112. Prior experience with SAT scores in the CSU system suggests that SAT scores are well-approximated by a normal distribution with standard deviation known to be 50. Calculate a
    1. 98% confidence interval.
    2. 90% confidence interval.
    3. Interpret each interval in the context of the problem. Comment on how the intervals change as you change the confidence level.
    4. Find the sample size required for a 98% confidence interval with maximum margin of error 10.

5.5 Confidence Intervals, \(\sigma\) Unknown

In practice, the value of \(\sigma\) is almost never known… but we know that we can estimate \(\sigma\) using \(s\). Can we plug in \(s\) for \(\sigma\)? Sometimes!

Remember the Central Limit Theorem (Section 5.2.2)? For samples of size \(n \ge 30\), \(\bar{X}\) will be approximately normal even if \(X\) isn’t. In this case, we can plug in \(s\) for \(\sigma\): \[\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}.\]

That setting is pretty straightforward! Now we need to consider the setting where \(n < 30\), which will require a bit of additional work.

5.5.1 The T-Distribution

Enter: the t-distribution. If \[Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\] has a standard normal distribution (for \(X\) normal or \(n\ge30\)), the slightly modified \[T = \frac{\bar{X}-\mu}{s/\sqrt{n}}\] has what we call the t-distribution with \(n-1\) degrees of freedom (even when \(n < 30\)!). The only thing we need to know about degrees of freedom is that \(df=n-1\) is the t-distribution’s only parameter.

The t-distribution is symmetric and always centered at 0. When \(n\ge30\), the t-distribution is approximately equivalent to the standard normal distribution. For smaller sample sizes, the t-distribution has more area in the tails (and therefore less area in the center of the distribution).
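You can see both properties - heavier tails for small samples, and convergence to the standard normal - by comparing 97.5th percentiles with R’s qt() function (introduced formally below) and qnorm():

```r
qnorm(0.975)          # 1.960: standard normal critical value
qt(0.975, df = 4)     # 2.776: n = 5, much heavier tails
qt(0.975, df = 29)    # 2.045: n = 30, already close to normal
qt(0.975, df = 99)    # 1.984: nearly indistinguishable
```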

For a sample of size \(n < 30\), we plug in \(s\) for \(\sigma\) and use a t critical value (instead of a z critical value): \[\bar{x} \pm t_{df, \, \alpha/2}\frac{s}{\sqrt{n}}.\] The t critical value is found through \[P(T_{df} > t_{df, \, \alpha/2}) = \alpha/2\] where \(T_{df}\) is the t-distribution with \(df=n-1\) degrees of freedom.

To find a t critical value, we will again use R, now with the command qt(p, df). (Notice that this is similar to the command for the standard normal distribution, but instead of “norm” for normal it has “t” for the t-distribution.) For example, for a 98% interval with a sample size of 15, \[100(1-\alpha) = 98 \implies \alpha=0.02\] Then \(\alpha/2 = 0.01\) and \(df=15-1=14\).

qt(0.01, df=14)
## [1] -2.624494

which gives the t critical value \(t_{14,\alpha/2} = 2.625\). Notice again that I am able to ignore the sign because our formula uses \(\pm t_{df,\alpha/2}\).

As before, if you prefer you may use the applet, Rossman and Chance t Probability Calculator, instead of R. For this applet, enter the degrees of freedom \(n-1\) next to “df.” Then check the top box under “t-value probability” and make sure the inequality is clicked to “>” . Enter the value of \(\alpha/2\) for the probability. Click anywhere else on the page and the applet will automatically fill in the box under “t-value.” This is your t critical value.
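Putting the pieces together: here is a complete small-sample interval for a hypothetical sample of 15 keyboard heights (the data values are made up purely for illustration):

```r
# Hypothetical sample of n = 15 preferred keyboard heights (cm)
x <- c(78.1, 80.4, 79.5, 81.2, 80.0, 79.8, 78.9, 80.7,
       79.2, 80.9, 78.5, 81.5, 79.9, 80.2, 79.4)

n    <- length(x)
xbar <- mean(x)
s    <- sd(x)                        # sigma unknown, so estimate it with s
tcv  <- qt(1 - 0.02 / 2, df = n - 1) # 98% critical value, about 2.624
xbar + c(-1, 1) * tcv * s / sqrt(n)  # the 98% confidence interval
```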

5.6 Summary of Confidence Interval Settings

Setting 1: \(\mu\) is target parameter, \(X\) is normal, \(\sigma\) known

  • Critical value: \(z_{\alpha/2}\)
  • Confidence interval: \[\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.\]

Setting 2: \(\mu\) is target parameter, \(n \ge 30\), \(\sigma\) unknown

  • Critical value: \(z_{\alpha/2}\)
  • Confidence interval: \[\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}.\]

Setting 3: \(\mu\) is target parameter, \(n < 30\), \(\sigma\) unknown

  • Critical value: \(t_{df, \alpha/2}\)
  • Confidence interval: \[\bar{x} \pm t_{df, \alpha/2}\frac{s}{\sqrt{n}}.\]