Chapter 6 Confidence Intervals

6.1 Motivation


In the last chapter we discussed the CLT, a concept central (pardon the pun) to our study of statistics. Now we will see its first true application: Confidence Intervals. These quantitative tools will use the Central Limit Theorem to build a bridge between sample and population.





6.2 Confidence Intervals and the CLT


In the previous section, we discussed (but did not prove, since that’s super super hard) the Central Limit Theorem. In layman’s terms, the theorem states that, in the end, “everything becomes normal.” That is, when you add enough random variables (usually more than 30 or so; we did it last time with sample means), the aggregate distribution is Normally distributed (and don’t forget, ANY sum of Normals is Normal, no matter how many you add).


Perhaps the most interesting application of the CLT is to the sample mean/proportion, which we saw were Normally distributed with the following parameters if the sample size \(n\) is large enough:

\[\hat{p} \sim N(p, \frac{pq}{n}), \; \bar{X} \sim N(\mu, \frac{\sigma^2}{n})\]

Where \(p\) and \(\mu\) are the population proportion and mean, and \(\hat{p}\) and \(\bar{X}\) are the sample proportion and mean, respectively.



Last time, we just went over how to get to this point; now, we are going to apply it in pretty interesting ways.


Let’s think about what we are doing every time we take a sample; and for now, we are going to stick with sampling means, since these are the most basic and common of confidence intervals. Say we’re trying to find the average number of times Oreo eaters dunk their Oreo in milk whilst eating it. Again, we can’t possibly take a census of all Oreo eaters, so we simply take a sample (say of \(n\) people, where \(n\) is large) and count the times they dip their delicious chocolate/vanilla hybrid cookie in milk while eating it.


This will get us some sample mean \(\bar{X}\). What’s amazing about a confidence interval (which we will learn how to use in a second) is that we can make a statement about the actual population mean \(\mu\) given just the sample mean \(\bar{X}\) (and some other info, but you get the point). This has to do with the fact that we know the underlying distribution of \(\bar{X}\) is Normal.


Let’s start by standardizing \(\bar{X}\), AKA transforming it from a Normal with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\) to a Normal with mean 0 and variance 1 (you’ll see why in a minute). As we discussed in one of the previous units, subtracting its mean from a Normal random variable and dividing by its standard deviation produces a standard normal distribution, or \(N(0,1)\) (remember, this comes from the \(a+bX\) transformation rule), which has a mean of 0 and a variance of 1. So, since the sample size \(n\) is large, we know \(\bar{X} \sim N(\mu,\frac{\sigma^2}{n})\), and thus:

\[\frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1)\]

Don’t forget that standard deviation is the square root of variance.


So, basically what we’ve done here is create a new random variable by standardizing the sample mean, and this random variable has the standard normal distribution (mean of 0, variance of 1).
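If you want to see this standardization in action, here’s a quick simulation (the population values are made up just for illustration): we draw many samples, standardize each sample mean, and check that the results look like a standard normal.

```python
import random
import statistics

# Hypothetical population: mean mu = 3, standard deviation sigma = 0.7
mu, sigma, n = 3.0, 0.7, 50
random.seed(0)

# Draw many samples and standardize each sample mean
z_values = []
for _ in range(10_000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    z = (x_bar - mu) / (sigma / n ** 0.5)  # standardize
    z_values.append(z)

# The standardized values should look like N(0, 1)
print(round(statistics.fmean(z_values), 2))  # close to 0
print(round(statistics.stdev(z_values), 2))  # close to 1
```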


What do we know about the standard normal? Quite a lot, actually, and the first thing that comes to mind is the Empirical Rule (otherwise known as the 68-95-99.7 rule).


In this book, we’re mostly going to work at the level of 95% confidence, so the Empirical Rule will come in handy. Remember, it says that about 95% of data for a Normal variable lies within +/- 2 standard deviations of the mean. Technically, the number 2 isn’t very exact: 95% of the data is more precisely within 1.96 standard deviations of the mean for a Normal; we just round to 2 to make the rule easier to remember. However, for confidence intervals and calculations, we will use 1.96 (try this on a calculator to prove to yourself that 95% of data is within +/- 1.96 standard deviations).
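If you’d rather not pull out a calculator, here’s a quick way to verify the 1.96 claim, using the standard normal CDF written in terms of the error function (no external libraries needed):

```python
from math import erf, sqrt

# Standard normal CDF, expressed via the error function
def phi(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

# Probability mass between -1.96 and +1.96 standard deviations
print(round(phi(1.96) - phi(-1.96), 4))  # ≈ 0.95
```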


We know this standardized \(\bar{X}\) has a variance (and standard deviation) of 1, so being 1.96 standard deviations from the mean is the same as being 1 times 1.96 away, or just 1.96. We can then write, in terms of probability:

\[P(-1.96 < \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}} < 1.96) = .95\]

Remember, we are trying to make a claim about \(\mu\), or the true population mean of number of times an Oreo is dunked. We will use some basic algebra to isolate \(\mu\), then:

\[P\big(\bar{X} - 1.96(\frac{\sigma}{\sqrt{n}}) < \mu < \bar{X} + 1.96(\frac{\sigma}{\sqrt{n}})\big) = .95\]

If you are trying this for yourself, remember that dividing by a negative number flips the equality signs.


What we just did is pretty awesome. Think about what this probability statement says: we can now say, with .95 probability, that the true mean \(\mu\) is bounded somewhere between those two endpoints.


Wait one second, though; we should make sure that everything in the endpoints \(\bar{X} \pm 1.96(\frac{\sigma}{\sqrt{n}})\) is readily available to us. Let’s see: if we take a sample, we’ll always have the sample mean \(\bar{X}\) and the sample size \(n\). However, we might not have \(\sigma\), since that’s the true standard deviation of the population, which isn’t always readily available. Sometimes you will be given the true standard deviation \(\sigma\) in problems in this book, but usually we will use an approximation: when \(n>30\), it’s okay to approximate \(\sigma\), the standard deviation of the population, with \(s\), the standard deviation of the sample.


OK, great. That means that all we have to do is take a sample, use its mean, standard deviation, sample size and the number 1.96, and we can calculate an interval for where the actual mean lies.


If the excitement isn’t hitting you yet, let’s go back to the Oreo example. Say that we take a sample of 50 people (so, since the sample size is greater than 30, we pass the relevant assumptions), and get a mean of 3.2 dunks with a standard deviation of .7 dunks. What can we say about the true population mean of number of dunks?

Well, since \(n\) is large, we know:

\[P\big(\bar{X} - 1.96(\frac{s}{\sqrt{n}}) < \mu < \bar{X} + 1.96(\frac{s}{\sqrt{n}})\big) = .95\]

Plugging in for \(\bar{X}\), \(s\) and \(n\) gets us:

\[P\big(3.2 - 1.96(\frac{.7}{\sqrt{50}}) < \mu < 3.2 + 1.96(\frac{.7}{\sqrt{50}})\big) = P(3 < \mu < 3.4) = .95\]

Wow!!! That last probability statement essentially says “there is a .95 probability that the true mean of Oreo dunks is between 3 and 3.4 dunks.”


Pretty amazing, huh? With just a large enough sample, we can generate very useful statements about an unknown population parameter just because we know the underlying distribution of our sample statistic (shoutout to the CLT). This is called a two-sided confidence interval for a mean. It’s probably the most basic of confidence intervals that we will learn.
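If you want to double-check the Oreo arithmetic, here’s the same computation in a few lines of Python:

```python
from math import sqrt

# Sample results from the Oreo example in the text
x_bar, s, n = 3.2, 0.7, 50
z = 1.96  # 95% confidence

margin = z * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 2), round(upper, 2))  # ≈ 3.01 and 3.39, i.e. roughly (3, 3.4)
```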


It’s also very important to interpret these confidence intervals correctly. The probability statement we made was technically correct, but for this book you should use this conclusion: “we are 95% confident that the true mean of Oreo dunks is between 3 and 3.4”. You will be asked to interpret confidence intervals many times; use this shell to answer these questions.


You may also be asked what these confidence intervals really mean. This is kind of a silly question, since it’s just asking about a technicality. Nonetheless, you will be asked it, so it’s good to know the official meaning. Technically, once we build an interval like we just did (AKA, 3 to 3.4), the true population mean \(\mu\) is a fixed number: it’s either in the interval or it isn’t. The correct (the “technical”) way to say it is that 95% of confidence intervals we generate from samples of this size will contain the true mean \(\mu\). Just be sure to be familiar with this interpretation, just in case.
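This repeated-sampling interpretation is easy to verify with a simulation. Here’s a sketch (the population values are made up; in real life we wouldn’t know \(\mu\)):

```python
import random
from math import sqrt

# Hypothetical "true" population for the simulation
mu, sigma, n = 3.0, 0.7, 50
random.seed(1)

hits = 0
trials = 2_000
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    margin = 1.96 * sigma / sqrt(n)
    # Does this interval capture the true mean?
    if x_bar - margin < mu < x_bar + margin:
        hits += 1

print(hits / trials)  # close to 0.95
```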


6.3 Confidence Interval Minutiae


We’ve learned the general form for these confidence intervals, but there are lots of different variations that could be applied in questions about them. We’ll try to put them to rest here.


1. Certainty vs. Precision

This is a key part of understanding how confidence intervals work; thankfully, it’s not too mentally rigorous. Here, certainty just means ‘level of confidence’: we are more certain in a 95% confidence interval than a 90% confidence interval. Precision means how accurate our statement is, so it essentially means how wide the interval is (a wider interval is less precise).


How should these two vary? Our gut tells us that they are inversely related. Consider if someone developed a confidence interval for the average college student GPA in the United States. Here are two very extreme intervals:


I am 100% confident that the average GPA is somewhere between 0 and 4.0


I am less than 1% confident that the average GPA is between 3.2 and 3.21


Clearly, the first statement has a lot of certainty but not a lot of precision, while the second is very precise but with little certainty. Both are equally worthless: the statement with ‘one hundred percent confidence’ just told us the scale a GPA is on, and the person who gave a very tight interval has extremely little confidence in this answer.


Anyways, it seems that the more certain you get, the less precise you get. This makes sense; just imagine that if you ever wanted to get to the max certainty of 100%, you would have to expand the interval to its widest possible size, which would mean not being precise at all.


What drives this relationship? Well, let’s think about the difference between a 95% confidence interval and a 68% confidence interval. What changes in the formula \(\bar{X} \pm 1.96(\frac{\sigma}{\sqrt{n}})\)? The sample mean, size and standard deviation are all functions of the sample and are not affected by confidence. However, we know that the 1.96 is used because of the Empirical Rule (approximately 95% of the data is within 1.96 standard deviations of the mean). This number changes for different levels of confidence. That makes sense, because it gives the number of standard deviations from the mean that captures a given weight of the data. In other words, is 68% of the data within more or less than 1.96 standard deviations of the mean?


It’s definitely less than 1.96, since we need to capture less of the data (68% instead of 95%) and thus don’t have to go as far from the mean. So, if we go from 95% to 68%, we go from 1.96 to some smaller number (turns out to be .994). This will result in a smaller interval (since we’re multiplying by a smaller number), which means we are more precise. However, we dropped in confidence from 95% to 68%, so we are less certain. The inverse relationship, then, is explained by this z-score.
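You don’t have to memorize these z-scores; Python’s standard library can compute them for any confidence level. Here’s a small sketch of the inverse-CDF idea:

```python
from statistics import NormalDist

def z_for_confidence(conf):
    # Two-sided: split the leftover probability evenly between the tails
    alpha = 1 - conf
    return NormalDist().inv_cdf(1 - alpha / 2)

print(round(z_for_confidence(0.95), 3))  # 1.96
print(round(z_for_confidence(0.68), 3))  # 0.994
```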


2. The Margin of Error

You will often hear this term thrown around in confidence interval sections. It just refers to a certain part of the general formula: whatever you are adding/subtracting to the sample statistic to get the size of the interval (whereas the statistic, say the sample mean, gives the center). Specifically, here it is given as:

\[ z_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})\]

A quick note: that \(z_{\frac{\alpha}{2}}\), as I’m sure you have guessed, makes up the 1.96 part of the 95% confidence interval. What it basically means is ‘the z-score that gives the desired level of confidence, or the \(\alpha\) (alpha) level.’ For example, our alpha level will usually be 5% (or 95% confidence) and we’ve already seen how 1.96 gives the weight for this. Dividing the alpha by two is only because we are doing a two-sided interval, for now.


So, this is the margin of error. It’s aptly named if you think about it, since it accounts for the width of the interval and thus how much we are ‘off’ by. It’s important to envision how each of the three variables in the margin of error affect the error itself.


First, the standard deviation. Clearly, as the standard deviation increases, the number above will also increase. Why? Well, the more spread that a sample has, the less accurate we can be with it. This translates into the larger, less precise interval.


Next is the \(z_{\frac{\alpha}{2}}\) term. Again, as this increases, so does the entire equation. Does this make sense? Remember, the larger the number, here, the larger the confidence interval (we saw this as we went from 95% to 68% confidence). This is because the larger the value, the higher percent of data is within that z-score of the mean. It makes sense, then, that this should increase the interval and thus decrease the precision, because we already know that certainty and precision are inverse.


Finally is the sample size. This is the only one that actually decreases the size of the interval. Why does this make sense? Well, it doesn’t take a pro to tell you that larger samples are better because they are generally more accurate. In our case, though, there is an additional benefit to larger samples: the CLT. That’s right, we learned that the more random variables you add, the closer and closer you get to a Normal random variable. So, getting bigger and bigger samples just makes the overall distribution more and more Normal, which makes our approximation better and better. This creates more precision overall.
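To see all three effects at once, here’s a tiny sketch of the margin of error formula, comparing a few hypothetical changes against our Oreo-style baseline:

```python
from math import sqrt

def margin_of_error(z, sigma, n):
    # Margin of error for a mean: z * sigma / sqrt(n)
    return z * sigma / sqrt(n)

base = margin_of_error(1.96, 0.7, 50)
print(round(margin_of_error(1.96, 1.4, 50) / base, 2))   # doubling sigma doubles it: 2.0
print(round(margin_of_error(0.994, 0.7, 50) / base, 2))  # smaller z (68% confidence) shrinks it
print(round(margin_of_error(1.96, 0.7, 200) / base, 2))  # 4x the sample size halves it: 0.5
```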


***A quick note: the margin of error will have a different formula for different types of intervals (mean, proportion, etc.) but you will learn about them as you learn each interval and can go through the same qualitative thinking process as we did here.


3. The Return of the t Distribution


We discussed the t Distribution briefly in previous units. If you can recall, it’s essentially a doppelganger of the Normal Distribution: both are basically bell shaped curves. However, the t Distribution is slightly different in that its tails are a little fatter, and thus it has slightly more variance around the mean it is centered on. We also learned that as the degrees of freedom of a t distribution increase, the distribution becomes more and more Normal (the tails get skinnier and skinnier), and when \(n>30\) it is for all practical purposes Normal.


We sort of swept over this distribution last time, but it will prove useful here. Remember that when using the formula \(z_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})\), we can use \(s\), the sample standard deviation, to approximate \(\sigma\), the actual standard deviation of the population, if the sample size is large enough (over 30). However, what if we aren’t given info from a large enough sample? What are we to do?

Here’s where the fat tails of the t Distribution come in handy. Since they have more variance, they capture the increased variance from the poor approximation of a small sample. That is, if you only have a sample of 15 people, then the CLT hasn’t quite kicked in yet and made the distribution totally Normal. It’s getting to look like the Normal, but there’s still a little more variance than we would like. The t Distribution is perfect, then, because it’s similar to the normal with a little more variance.


So, in the case of a small sample size, you will have to use a t-score instead of a z-score (instead of the \(z_\frac{\alpha}{2}\), a \(t_\frac{\alpha}{2}\)).
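If you have SciPy available, you can see the fatter tails directly: for the same 95% confidence, the t-score of a small sample is larger than the z-score. The numbers below assume a hypothetical sample of 15 people, which gives the usual \(n-1 = 14\) degrees of freedom:

```python
# Requires SciPy; ppf is the inverse CDF of each distribution
from scipy.stats import norm, t

df = 14  # a small sample of n = 15 has n - 1 = 14 degrees of freedom
print(round(norm.ppf(0.975), 3))   # z-score: 1.96
print(round(t.ppf(0.975, df), 3))  # t-score: ≈ 2.145, wider than 1.96
```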


4. Solving for Sample Size


Besides actually calculating the interval itself, you may be given the interval and other values and asked to calculate another variable in the formula (sample statistic, standard deviation, etc.). These are usually straightforward, but there are two tricks you should know:


  1. The center of a confidence interval is also the sample statistic (mean or proportion), since you build the interval by doing sample mean or proportion \(\pm\) some value. You also know that the interval is symmetric about the center, since you are adding and subtracting the same value. So, if you were asked for the sample mean that was used to calculate the confidence interval (30,40), it would be easy to find 35, since it is the middle of those two numbers.


  2. Finding the sample size can be tricky with proportions, since you don’t know what value of \(p\) to use (you’ll understand more about this when you actually learn how to do a confidence interval for proportions; just thought we would get this concept out of the way now). In this case, you should set \(p=.5\) (and thus \(q\) would also be .5) since this maximizes the variance. We know this because \(pq\) is at a max when \(p=.5\), and decreases as \(p\) goes to 0 or 1.
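Here’s a quick sketch of that worst-case sample size calculation, using a hypothetical margin of error of 3 percentage points:

```python
from math import ceil

def sample_size_for_proportion(margin, z=1.96, p=0.5):
    # p = 0.5 maximizes p*q, giving the worst-case (largest) required n
    q = 1 - p
    return ceil((z / margin) ** 2 * p * q)

# How many people do we need for a margin of error of .03 at 95% confidence?
print(sample_size_for_proportion(0.03))  # 1068
```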


6.4 The Formulas


Great, so you know the process of actually getting a confidence interval and you understand the concepts. It turns out that the most important part of this section is actually using these confidence intervals, which of course requires the formulas. I’ll give the formulas now (you probably have them; this will just be for reference); just try to imagine how we got to them using the same process as we did with a one-sample mean. Again, we’re going to be operating at the 95% level of confidence for all of these, so you’ll see 1.96.


First, we know well the formula for a one sample mean two-sided interval for the true mean \(\mu\):

\[\bar{X} \pm 1.96(\frac{\sigma}{\sqrt{n}})\]

How about for one sample proportion two-sided interval? Well, for a sample proportion \(\hat{p}\) from a sample size \(n\), we get the interval for the true proportion \(p\):

\[\hat{p} \pm 1.96\sqrt{\frac{\hat{p}\hat{q}}{n}}\]

Remember, \(\hat{q} = 1 - \hat{p}\).


This is extremely similar to the version with the mean; remember, here is where we use \(\hat{p} = .5\) if we have to solve for sample size, since it will give us a ‘worst case scenario’ for how big \(n\) should be.


One last thing about these intervals: since we know that we can’t have a proportion less than 0 or greater than 1, we can cut off any interval past these two endpoints. For example, if we got the interval (.4, 1.3) for some proportion, we could just change that to (.4, 1). If you were to interpret this, you would say “We are 95% confident that the true population proportion is between .4 and 1.”
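Here’s a sketch of a proportion interval with the clipping applied (the sample numbers are made up for illustration: 60 successes out of 100):

```python
from math import sqrt

# Hypothetical sample: 60 successes out of 100
p_hat, n = 0.6, 100
q_hat = 1 - p_hat

margin = 1.96 * sqrt(p_hat * q_hat / n)
lower = max(0.0, p_hat - margin)  # clip: a proportion can't go below 0
upper = min(1.0, p_hat + margin)  # ...or above 1
print(round(lower, 3), round(upper, 3))  # ≈ 0.504 and 0.696
```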


That’s about all for now. One more thing: we’ve only discussed two-sided intervals, but of course we can build one-sided intervals. The idea here is that we find an upper bound or lower bound for a population parameter.


Why would this be important? Consider a farmer who is worried about the amount of rain next harvest season. He doesn’t really care if there’s a lot of rain, since this means his crops will flourish (I know that there is such a thing as too much water for plants, but stick with this simplification for now), only if there is a small amount of rain, or drought, that will make his crops dry up. That is, he wants to calculate a lower bound on the amount of rain to see if he has to move farms.


What would you think should change between two-sided and one-sided intervals? Well, for one thing, one of the endpoints is going to be infinity or negative infinity, if we are calculating a lower bound and upper bound, respectively. We are also going to change our z-score, since we are looking for 95% density on a different part of the distribution.


This seems a little hazy, but would you expect 1.96 to increase or decrease when going to a one-sided interval? Well, consider if you are the farmer trying to build some lower bound, one-sided interval for the amount of rain. Since you don’t care about the upper bound, you won’t bother to set it anywhere and will just let it be infinity. So, for now we have \((x, \infty)\) and we have to solve for \(x\).


Well, since we have \(\infty\) as one of our bounds, we are going to have more density on that side than we would if we cut it off in some spot! Think about a two-sided confidence interval: you are doing a plus and minus, and thus restricting the area between two points on either side of the distribution: here, -1.96 and 1.96. However, when you run a one-sided interval, you are letting one of those bounds drift towards positive or negative infinity, which means you are going to get more density out of that side. That means that you will need less density out of the other side to get to 95%!!! In our case, the threshold is 1.64 (more precisely, 1.645). So, after all that, we can say, for a sample mean, we have the one-sided confidence intervals:

\[(-\infty, \bar{X} + 1.64\frac{\sigma}{\sqrt{n}}), \; (\bar{X} - 1.64\frac{\sigma}{\sqrt{n}}, \infty)\]

For an upper and lower interval, respectively.
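Here’s the farmer-style lower bound computed in code, reusing the Oreo numbers just for illustration:

```python
from math import sqrt

# Oreo numbers again, but now we only want a lower bound at 95% confidence
x_bar, s, n = 3.2, 0.7, 50
z_one_sided = 1.64  # one-sided 95% z-score (more precisely, 1.645)

lower = x_bar - z_one_sided * s / sqrt(n)
print(round(lower, 2))  # the one-sided interval is (lower, infinity)
```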




So, what did we learn? We now know how to calculate confidence intervals (through the CLT) and why they are important (can generate claims about the true population parameters). In the next section, we will apply these even further to actually test different values in a distribution. Stay tuned!