Chapter 5 Central Limit Theorem
5.1 Motivation
The Central Limit Theorem, fondly called the ‘CLT’ by its loyal followers, is one of the most useful tools in Statistics. Its applications also catapult the Normal distribution, which we have begun to discuss, into fame. Most importantly, the CLT opens up a new chapter in this book: the chapter of inference.
5.2 Normal Distribution
We’ve discussed the fundamentals of the Normal distribution, and soon we will see its relevance in applications of the Central Limit Theorem. First, though, we must nail down some of its lesser-known concepts.
First is Checking for Normality. The idea here is that you may be presented with a distribution (a data set with a histogram, summary stats, etc.) and would like to see if it is indeed Normal. This is really a basic approximation test to see if the characteristics of the rogue distribution match the properties we’ve learned about the Normal: namely, its center and spread.
So, what do we know about the Normal distribution that could help us identify it? Well, we know it is a symmetric distribution centered at its mean. Therefore, we could check the following two conditions:
\[mean \approx median, \; skewness \approx 0\]
These two conditions should hold for a Normal distribution, so they’re worth checking.
Next, we can apply what we know about the spread of the data. The Empirical Rule (the 68-95-99.7 rule) gives the percentage of data that falls within +/- 1, 2, and 3 standard deviations of the mean. You can check whether these percentages roughly match the data given, and you can also check:
\[range \approx 6\sigma, \; IQR \approx 1.33\sigma\]
These hold because 99.7% of the data falls within +/- 3 \(\sigma\) of the mean, or 6 \(\sigma\) total, and because, using the InvNorm function on the calculator (discussed in Unit IV), we find that the 25th percentile of the data has a z-score of about \(-\frac{1.33}{2}\); by symmetry, the 75th percentile has a z-score of about \(+\frac{1.33}{2}\), so the IQR spans about 1.33 \(\sigma\).
Anyway, these conditions should hold for relatively Normal distributions. Again, this is pretty qualitative, so it is usually up to your interpretation; as long as your numbers and methods back you up, you really just have to explain clearly why you do or do not think the distribution is relatively Normal.
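The checks above can be bundled into a quick script. This is only a sketch: the function name `normality_summary` and the simulated data set (mean 100, standard deviation 15) are made up for illustration, and the script reports numbers without making the judgment call for you. Keep in mind that the range check is the roughest of the bunch, since the range tends to grow as the data set gets bigger.

```python
import random
import statistics


def normality_summary(data):
    """Rough Normality diagnostics for a list of numbers."""
    mean = statistics.fmean(data)
    median = statistics.median(data)
    sd = statistics.stdev(data)

    def share_within(k):
        # Fraction of the data within k standard deviations of the mean.
        return sum(abs(x - mean) <= k * sd for x in data) / len(data)

    # Sample skewness: the average cubed z-score (near 0 for symmetric data).
    skewness = sum(((x - mean) / sd) ** 3 for x in data) / len(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "mean_minus_median_in_sd": (mean - median) / sd,  # should be near 0
        "skewness": skewness,                             # should be near 0
        "range_over_sd": (max(data) - min(data)) / sd,    # roughly 6
        "iqr_over_sd": (q3 - q1) / sd,                    # roughly 1.33
        "within_1_2_3_sd": (share_within(1), share_within(2), share_within(3)),
    }


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [rng.gauss(100, 15) for _ in range(5000)]  # hypothetical data
summary = normality_summary(sample)
for name, value in summary.items():
    print(name, value)
```

For data that really is Normal, the first two numbers should sit near 0 and the last entry should land near (0.68, 0.95, 0.997).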
Next is the t distribution. It will be introduced now, but not really rigorously used in this section (we will apply it more in the more advanced inference sections). For now, we’re really only concerned with the basics.
First of all, it sounds like a silly name, but saying ‘t distribution’ is the same as saying ‘Binomial’ or ‘Uniform’; it’s just the name of the distribution. Perhaps the best way to describe the t distribution is as a doppelganger to the standard Normal. Indeed, the picture looks very similar to the Bell Curve of the Normal distribution: it is symmetric, unimodal, and centered at 0. One difference is that the tails are fatter in the t distribution; that is, there is more variance, and higher probability density farther out from the mean. Another is that, instead of having two parameters (the Normal is guided by its mean and variance, \(\mu\) and \(\sigma^2\)), the t distribution only has one parameter: degrees of freedom (abbreviated ‘df’). For now, all you have to know is that as the degrees of freedom increase, the t distribution becomes more and more Normal (the tails get skinnier and skinnier). When df is larger than 30, the t distribution is basically Normal.
Again, this seems pretty cursory, but those are the basics. The t distribution will make a comeback in the Hypothesis Testing and Confidence Interval sections, where its fat tails (and thus more variance) will prove valuable.
5.3 Central Limit Theorem
We touched upon the relationship between sample and population in Unit I; now this dynamic will come to fruition as the groundwork of statistical inference.
Remember one of the conveniences of Statistics? Large parts of it exist because of the sheer fact that it is difficult to take a census, or get information about an entire population. Instead, we have inferential statistics, where we take a sample from the population, measure the sample statistics, and use these statistics to estimate what the true parameters of the population are.
So, if you are interested in measuring how many U.S. state capitals the average 5th grader has memorized, you would need to use inferential statistics. You can’t ask every single 5th grader who walks the Earth to list off the capitals they know; it’s just not feasible. Instead, you take a sample of 5th graders (maybe you ask 30 of them) and use these results to estimate the true average for all 5th graders.
I mentioned statistics and parameters; the former is gathered from a sample and used to estimate the latter in a population. Specifically, we will use \(\mu\) as the parameter for the true population mean, and \(\bar{X}\) (the sample mean, or the sum of all the values in a sample divided by the number of values) as the sample statistic used to guess \(\mu\). This only applies when we are curious about a mean; what about proportions? An example of this might be investigating the true proportion of 5th graders who know who Ronald Reagan was. In this case, we use the sample statistic \(\hat{p}\) (which is just naive probability, or \(\frac{x}{n} =\frac{number \; of \; successes \; in \; sample}{sample \; size}\), where a success here is knowing who Ronald Reagan was) to estimate the true population proportion \(p\).
We’ll stick with \(\bar{X}\) (in English, ‘X-bar’) and estimating the true mean \(\mu\) for now. Let’s think about \(\bar{X}\), or the average, as a random variable. After all, this is just the results of random sampling coming together. If we were sampling weights and decided to sample 10 people, we might get 100, 120, 140, 130, 125, 167, 189, 143, 126, 231 for our weights the first time and 132, 251, 112, 178, 195, 164, 99, 164, 127, 132 the second time. The point is that the first sample mean does not equal the second sample mean (\(\bar{X_1} \neq \bar{X_2}\)), and thus the sample mean is just a function of these random sample outputs. Officially:
\[\bar{X} = \frac{1}{n}\sum\limits_{i=1}^n X_i\]
Where \(X_i\) is the value of the \(i^{th}\) subject in the sample (so, in the first sample we just discussed, \(X_{10} = 231\), or the tenth person weighed 231 pounds).
This definitely has the look of something that could be random, since all we’re doing is adding up random outputs (like random people’s weight) and dividing by \(n\) (the number of subjects in the sample). Well, if \(\bar{X}\) is a random variable, it must have an Expectation and Variance…Can we calculate those?
\[E(\bar{X}) = E(\frac{1}{n}\sum\limits_{i=1}^n X_i) = \frac{1}{n}\sum\limits_{i=1}^n E(X_i) = \frac{1}{n}\sum\limits_{i=1}^n \mu = \mu\]
By the \(a + bX\) rule and using the fact that \(E(X_i) = \mu\), since (intuitively) the expected value of a random person should be the average of a population.
Nothing crazy here; this is basically saying that you are adding \(\mu\) up \(n\) times (since there are \(n\) subjects in the sample, or \(X_i\)’s, and each one has an expectation of \(\mu\), or the population average) and dividing by \(n\). This just cancels out and gives us \(\mu\). So, as we would expect, the expectation of the average of a sample is the average of the population. Concisely, \(E(\bar{X}) = \mu\).
What about Variance?
\[Var(\bar{X}) = Var(\frac{1}{n}\sum\limits_{i=1}^n X_i) = \frac{1}{n^2}\sum\limits_{i=1}^n Var(X_i) = \frac{1}{n^2}\sum\limits_{i=1}^n \sigma^2 = \frac{\sigma^2}{n}\]
By the \(a + bX\) rule, the fact that the variances of independent random variables add, and the fact that \(Var(X_i) = \sigma^2\), since (intuitively) the variance of a single random subject should be the variance of the population.
A little bit different here, but still not insane; because taking the variance squares factors like \(\frac{1}{n}\), an extra \(n\) sticks around in the denominator, and we are left with the result that the variance of the sample mean is equal to the variance of the population divided by the number of subjects in the sample, \(n\). Concisely, \(Var(\bar{X}) = \frac{\sigma^2}{n}\).
So we have the mean and variance of \(\bar{X}\), but what about another very important property of the random variable: its underlying distribution (we call this the sampling distribution)?
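If the algebra feels abstract, a quick simulation can back it up. The sketch below is an illustration under made-up settings: the population is Uniform(0, 1) (so \(\mu = 1/2\) and \(\sigma^2 = 1/12\)), each sample has \(n = 10\) subjects, and 20,000 samples are drawn. None of these choices come from the text; any population with a known mean and variance would do.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 10                  # subjects per sample (hypothetical choice)
repetitions = 20000     # how many samples we draw (hypothetical choice)

# Population: Uniform(0, 1), which has mu = 1/2 and sigma^2 = 1/12.
mu, sigma_sq = 0.5, 1 / 12

# Draw many samples of size n and record each sample mean.
sample_means = [
    statistics.fmean([rng.random() for _ in range(n)])
    for _ in range(repetitions)
]

observed_mean = statistics.fmean(sample_means)
observed_var = statistics.variance(sample_means)

print("mean of the sample means:", observed_mean, "vs mu =", mu)
print("variance of the sample means:", observed_var, "vs sigma^2/n =", sigma_sq / n)
```

The two observed values should land close to \(\mu\) and \(\frac{\sigma^2}{n}\), though never exactly on them, since the simulation is itself random.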
This sounds pretty funky. How could we nail down anything as specific as a distribution? Remember, distributions include Binomial, Uniform, Normal, etc. It seems unlikely that just adding up all the sample values and taking the average would fit so nicely into one of these distributions.
However, this is exactly what the Central Limit Theorem states. Officially, the CLT says that, regardless of the underlying distribution, the sum of independent random variables drawn from that distribution (with finite variance) is approximately Normal, provided that enough of them are added together.
That’s crazy! That means if we add up enough random variables, no matter what distribution each one follows, the end result will be approximately Normal, as long as we add up enough variables for the CLT to ‘kick in’. What does that mean for \(\bar{X}\)? Well, if the sample size is large enough (we add up enough \(X_i\)’s), and we repeatedly draw samples from the population to calculate \(\bar{X}\), then the sampling distribution of \(\bar{X}\) is approximately Normal. It does not matter what the distribution of the actual population is; a large enough sample size will make the sampling distribution of the sample mean \(\bar{X}\) approximately Normal!
What’s all this jargon with ‘sampling distributions’? It essentially just means that if you continually drew samples from the population and took sample means, all of those sample means would follow an approximately Normal distribution. So, if you have just one sample mean, you know the distribution it is a part of and thus can work with it. It’s the same thing as taking just one person’s weight when you know the approximate distribution of all weights, and using that information to make a claim about the specific person.
If you’re confused as to why this works, you should be. We don’t even approach proving the CLT in here, so don’t worry about understanding the machinery behind it. It’s really just a wacky result that makes life a lot easier, since we have the tools necessary to work with a Normal distribution (z-score, etc.). You can see why the Normal is so important; when you add anything up enough times, it magically becomes normal!!!
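We can’t prove the CLT here, but we can watch it happen. The sketch below draws samples of size 30 from a heavily right-skewed population (an Exponential distribution with mean 1 and standard deviation 1, chosen purely for illustration) and checks how many of the sample means fall within 1 and 2 standard deviations of \(\mu\), where the standard deviation of \(\bar{X}\) is \(\sqrt{\sigma^2/n}\) from the variance result above. The seed and the number of repetitions are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 30                  # subjects per sample
repetitions = 20000     # how many samples we draw (hypothetical choice)

# Population: Exponential with rate 1, so mu = 1 and sigma = 1.
# It is strongly right-skewed, nothing like a bell curve.
mu, sigma = 1.0, 1.0
sd_of_mean = sigma / math.sqrt(n)  # square root of Var(X-bar) = sigma^2 / n

sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(repetitions)
]


def share_within(k):
    # Fraction of sample means within k standard deviations of mu.
    return sum(abs(m - mu) <= k * sd_of_mean for m in sample_means) / repetitions


share_1, share_2 = share_within(1), share_within(2)
print("within 1 sd:", share_1, "(Normal would give about 0.68)")
print("within 2 sd:", share_2, "(Normal would give about 0.95)")
```

Even though no single observation is anywhere near Normal, the shares for the sample means should come out close to the Empirical Rule values.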
Anyway, let’s keep forging on with \(\bar{X}\). We found its mean and variance, and we know it’s approximately Normal if the sample size is large enough (traditionally, ‘enough’ subjects in a sample for the CLT to kick in is 30 or more). We can succinctly say, then:
\[\bar{X} \sim N(\mu, \frac{\sigma^2}{n})\]
When \(n \geq 30\).
It’s crazy talk. Let’s think about the implications. This means that the expected value of the sample mean is the average of the population, which is a good thing. Since the Normal distribution is symmetric and centered about the mean, we can also say that the sample mean is centered around the true population mean. The variance, \(\frac{\sigma^2}{n}\), measures how much the sample mean varies around the true mean; its square root, \(\frac{\sigma}{\sqrt{n}}\), is called the standard error.
Since we are well accustomed to the tools of probability density with Normal distributions (z-scores), we can make probability statements about the sample mean. That basically means that we can say how unlikely or likely a sample mean is, because we know the underlying distribution. For example, instead of just saying a sample mean is ‘really high’, we can actually find the z-score and say probabilistically how high it is (e.g., if it has a z-score of 2, it sits at about the 97.5th percentile, or in the top 2.5% of sample means). This concept will become important in later inferential sections, when we test assumptions based on how unlikely they are given certain sample results (don’t worry if you don’t get it, these topics are coming soon).
For now, the CLT problems will focus on using or finding the distribution of a sample mean. Two classic examples follow:
“Samples of size 32 are drawn from a population with mean 35 and standard deviation 25. Find \(P(\bar{X} > 40)\).”
This is basically asking what the likelihood is that we take a sample (of 32 subjects) and find a mean of over 40. Since the sample size is large (greater than 30), we can apply the Central Limit Theorem, and say that \(\bar{X} \sim N(35, \frac{25^2}{32})\). Therefore, we simply have to calculate the z-score for 40. We do \(\frac{40 - 35}{\sqrt{\frac{25^2}{32}}}\) and get a z-score of 1.13, which corresponds to an upper-tail probability of about .1289.
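For anyone who would rather check this on a computer than with a z-table, here is a minimal sketch using `NormalDist` from Python’s standard `statistics` module. It uses the numbers from the problem as stated (samples of size 32, population mean 35, standard deviation 25); the variable names are just for illustration.

```python
import math
from statistics import NormalDist

mu, sigma, n = 35, 25, 32

# Sampling distribution of the sample mean: N(mu, sigma^2 / n).
# NormalDist takes the standard deviation, so we pass sigma / sqrt(n).
sampling_dist = NormalDist(mu, sigma / math.sqrt(n))

z = (40 - mu) / sampling_dist.stdev
p_above_40 = 1 - sampling_dist.cdf(40)

print("z-score:", round(z, 2))
print("P(X-bar > 40):", round(p_above_40, 3))
```

The z-score comes out near 1.13 and the probability near .129, matching the hand calculation.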
This example asks you to apply the CLT directly. In other cases, you may have to use the CLT to solve for a parameter. Here is an example adapted from the study guide:
“A certain brand of lightbulb has a mean lifetime of 1000 hours with a standard deviation of 50 hours. If the bulbs are sold in boxes of 30, the parameters of the distribution of sample means are…”
We know, since the sample size is large enough, that \(\bar{X} \sim N(\mu, \frac{\sigma^2}{n})\). Since we are given the population parameters, we then can solve for these parameters for the sample mean: just 1000 for the mean and \(\frac{50^2}{30}\) for the variance.
Remember, if a question asks about ‘sample means’ (or sample proportions, as we’ll see in a second) it is very likely that the question is testing knowledge of the CLT.
So, that’s the CLT for sample means. However, when taking samples, there are two basic ways to get information. One is taking the average value of the subjects (e.g., how much does each subject weigh). The other is measuring a proportion of the sample with a certain characteristic (e.g., what proportion has brown hair).
Every tool we learn for inference, then, will have two parts: one for means and one for proportions. So, the point here is that the CLT has something to say about sample proportions as well. We won’t go through the whole process again, but here is the result.
The Central Limit Theorem states that the sample proportion \(\hat{p}\), for large enough sample size \(n\), approximately has the distribution:
\[\hat{p} \sim N(p, \frac{pq}{n})\]
Where \(p\) = population (true) proportion and \(q = 1-p\).
Here, \(n\) is ‘large enough’ for the CLT to kick in if \(np\) and \(npq\) are both greater than or equal to 5.
That’s it! Same type of problems from before; using information about the underlying distribution to make a claim about the actual sample proportion.
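Since we skipped a worked example for proportions, here is a small sketch. Everything specific in it is hypothetical: the helper name `proportion_sampling_dist`, the true proportion \(p = 0.4\), the sample size \(n = 50\), and the question being asked (how likely is a sample proportion above one half?). The size check uses the rule of thumb stated above.

```python
import math
from statistics import NormalDist


def proportion_sampling_dist(p, n):
    """Approximate sampling distribution of p-hat: N(p, pq/n)."""
    q = 1 - p
    # Rule of thumb from the text: np and npq should both be at least 5.
    if n * p < 5 or n * p * q < 5:
        raise ValueError("n is too small for the Normal approximation")
    # NormalDist takes the standard deviation, so pass sqrt(pq/n).
    return NormalDist(p, math.sqrt(p * q / n))


# Hypothetical numbers: true proportion 0.4, sample of 50 subjects.
dist = proportion_sampling_dist(p=0.4, n=50)
p_majority = 1 - dist.cdf(0.5)  # P(p-hat > 0.5)

print("mean:", dist.mean, "variance:", dist.variance)
print("P(p-hat > 0.5):", p_majority)
```

Just like with sample means, once the distribution is pinned down the rest is a z-score problem.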
That about wraps up the discussion of the CLT but for a few caveats. First, it’s clear that to use the CLT you need information about the population: its mean and variance. However, isn’t the point usually to estimate these parameters? If this confuses you, don’t worry, we will soon get to the purpose of these types of estimations. Usually, we are assuming a certain population parameter and testing its feasibility (is the average weight of people really 160 pounds? stuff like that). Again, don’t worry, that will show up in the next two sections, Confidence Intervals and Hypothesis Testing.
Finally, the conclusion of the CLT holds for underlying Normal distributions as well as every other distribution in the book, and for Normals it holds exactly, with no need for a large sample. Specifically, the sum of independent Normals is Normally distributed. It doesn’t matter how many you add (or subtract, since this is technically a negative sum); the result is Normal.
Where could this show up? Suppose the weights of boys are distributed \(B \sim N(150, 50)\) and the weights of girls are independently distributed \(G \sim N(120,30)\) (as before, the second parameter is the variance). What is the probability that a randomly selected boy outweighs a randomly selected girl by at least 30 pounds?
What this question is asking, in probability terminology, is \(P(B - G > 30)\); that is, the difference between a boy’s weight and a girl’s weight is over 30. We can use the fact about sums of Normals here; since the sum (or difference) of independent Normals is Normal, \(B - G\) creates a new Normal random variable. We’ll call it \(D\) for difference.
So, we know \(D\) is Normally distributed. How do we find its parameters? Recall that we can add or subtract means, and that for independent random variables we always add variances. So, we subtract the means to get \(150 - 120 = 30\), and add the variances to get \(50 + 30 = 80\). So, \(D \sim N(30,80)\).
Remember, since \(D = B - G\), we are looking for \(P(D > 30)\), which, since we know the distribution of \(D\), turns into a simple z-score problem. We find the z-score:
\[z = \frac{30 - 30}{\sqrt{80}} = 0\]
So, it turns out that since the cutoff of 30 is exactly the mean of \(D\), the z-score is 0 and the probability is exactly .5 (since the Normal is symmetric about its mean).
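Here is the same problem as a short sketch in Python. It relies on the fact that `NormalDist` objects from the standard `statistics` module can be subtracted, which treats the two variables as independent (just as the problem states). Note that `NormalDist` takes a standard deviation rather than a variance, so we pass square roots.

```python
import math
from statistics import NormalDist

# The text writes N(mean, variance); NormalDist wants (mean, standard deviation).
B = NormalDist(150, math.sqrt(50))  # boys' weights
G = NormalDist(120, math.sqrt(30))  # girls' weights

# Difference of independent Normals: means subtract, variances add.
D = B - G

p_over_30 = 1 - D.cdf(30)

print("mean of D:", D.mean)
print("variance of D:", D.variance)
print("P(D > 30):", p_over_30)
```

The mean and variance of `D` should agree with the hand calculation (30 and 80, up to floating-point rounding), and the probability should come out to one half.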
Be sure to become familiar with this type of problem (working with sums of Normals), since it will certainly show up! There is more than one way to do it so don’t fret if this seems too tricky.
Finally, we can apply the CLT to the Binomial; a Binomial is just a sum of \(n\) independent success/failure trials, so with enough trials it is approximately Normal. This is officially called the Normal approximation to the Binomial. Here is the result:
If \(X \sim Bin(n,p)\), \(np \geq 5\) and \(npq \geq 5\), then X can be approximated by:
\[X \sim N(np, npq)\]
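You can likely see where this comes from; after all, the Binomial count is just \(n\) times the sample proportion (\(X = n\hat{p}\)), so multiplying the mean \(p\) by \(n\) and the variance \(\frac{pq}{n}\) by \(n^2\) gives \(np\) and \(npq\).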
When could you use this? Well, consider the following question:
“You flip a fair coin 100 times. What is the probability that you get at least 60 heads?”
This is clearly a Binomial (set number of trials, constant probability, two outcomes, independent). However, it seems a little tricky to do. We are looking for \(P(H \geq 60)\), where \(H\) gives the number of heads. We could use the probability mass function, which again gives the probability of each outcome (0 heads, 1 heads, 2 heads, etc.). How would we use this? Well, simply add the probabilities of getting 60, 61, 62… all the way to 100. Officially:
\[P(H \geq 60) = \sum\limits_{k=60}^{100} {100 \choose k} .5^{k}.5^{100 - k}\]
Which is just an application of the PMF we learned last unit.
I think you’ll agree, though, that this is pretty gross. No one wants to plug a complicated sum into the computer. It would be much faster to use the Normal Approximation.
First, can we use this approximation (is \(n\) large enough)? Well, \(np = (.5)(100) = 50\) and \(npq = (.5)(.5)(100) = 25\), and both are greater than 5, so full steam ahead.
So, we know \(H\) is approximately \(N(np, npq)\), and after plugging in, \(H \sim N(50, 25)\). This now becomes a simple z-score problem:
\[z = \frac{60 - 50}{5} = 2\]
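So the z-score is 2, which means the probability of being higher (getting at least 60 heads) is approximately .025 (by the Empirical Rule: about 95% of the data lies within 2 standard deviations of the mean, leaving about 2.5% in each tail).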
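As a sanity check, the sketch below computes the ‘gross’ exact Binomial sum and the Normal approximation side by side, using only Python’s standard library. The variable names are made up for illustration. The approximation here uses the Normal tail area beyond 60 directly, which is a slightly more precise version of the Empirical Rule figure.

```python
import math
from statistics import NormalDist

n, p = 100, 0.5
q = 1 - p

# Exact answer: add up the Binomial PMF from 60 heads through 100 heads.
exact = sum(math.comb(n, k) * p**k * q ** (n - k) for k in range(60, n + 1))

# Normal approximation: H is roughly N(np, npq); NormalDist takes the
# standard deviation, so pass sqrt(npq).
approx_dist = NormalDist(n * p, math.sqrt(n * p * q))
approx = 1 - approx_dist.cdf(60)

print("exact Binomial sum:", exact)
print("Normal approximation:", approx)
```

The exact sum comes out to roughly .028 and the Normal tail to roughly .023, so the approximation is in the right neighborhood but not a perfect match.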
5.4 Unbiased Estimators
You’re probably sick of the CLT, and thankfully it’s time to switch gears. The Unbiased Estimator has historically not been a major focus (traditionally it only gets 1 problem on problem sets), but it will likely show up on exams, so it is worthwhile to learn it. Here is a quick review:
Questions about estimators are essentially just exercises in the mechanics of expected value and variance. It’s key to remember these two transformation formulas:
\[E(aX) = aE(X)\] \[Var(aX) = a^2Var(X)\]
Basically, in these problems, you will be given some ‘estimator’ that is supposed to be a tool for ‘estimating’ the actual parameters of the population. It’s usually made up of samples from the population: the logic sort of goes “hey, I want to build a formula to find the parameters of a population, let me take a few samples and form some sort of estimator out of them.” Your job will be to calculate the expected value and variance of this ‘estimator.’ If the expected value of the estimator equals the actual mean of the population (usually \(\mu\), but possibly \(np\) if we are working with a Binomial), we call the estimator “unbiased” (since it did its job correctly). When comparing unbiased estimators, you always want the one with lower variance, since there’s less spread and essentially more accuracy.
So here’s a quick example. Say we take three samples from a population, call them \(X_1, X_2, X_3,\) and use them to make an estimator for the population called \(M\), where \(M\) has the following equation:
\[M = 1/3(X_1) + 1/2(X_2) + 1/6(X_3)\]
You are also given that the mean and variance of each sample is \(\mu\) and \(\sigma^2\).
To solve for the expected value and variance of \(M\), you would do:
\[E(M) = E(1/3(X_1)) + E(1/2(X_2)) + E(1/6(X_3))\]
Which, because of the formula \(E(aX) = aE(X)\), we know is equal to
\[E(M) = 1/3E(X_1) + 1/2E(X_2) + 1/6E(X_3)\]
And since we know that \(E(X_1) = E(X_2) = E(X_3 ) = \mu\)
\[E(M) = \mu(1/3 + 1/2 + 1/6) = \mu\]
So since the expected value of our estimator equals the mean of the population, the estimator \(M\) is unbiased.
Then, to find the variance, we would do the same thing, but remember that when you pull out the coefficients 1/3, 1/2 and 1/6, you have to square them, since \(Var(aX) = a^2Var(X)\).
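To finish the variance calculation, here is a sketch that does the fraction arithmetic exactly. It assumes the three samples are independent (so that the variances add), which the example does not state outright. Under that assumption, \(Var(M) = (\frac{1}{9} + \frac{1}{4} + \frac{1}{36})\sigma^2\). The comparison with the plain average of the three samples is added here for illustration only.

```python
from fractions import Fraction

# Coefficients of X_1, X_2, X_3 in the estimator M.
weights = [Fraction(1, 3), Fraction(1, 2), Fraction(1, 6)]

# E(M) = (sum of the coefficients) * mu, using E(aX) = aE(X).
mean_factor = sum(weights)

# Var(M) = (sum of the squared coefficients) * sigma^2, using
# Var(aX) = a^2 Var(X) and assuming the samples are independent.
variance_factor = sum(w**2 for w in weights)

# For comparison: the plain average weights each sample by 1/3.
equal_weight_variance_factor = 3 * Fraction(1, 3) ** 2

print("E(M) =", mean_factor, "* mu")
print("Var(M) =", variance_factor, "* sigma^2")
print("Var(plain average) =", equal_weight_variance_factor, "* sigma^2")
```

So \(M\) is unbiased with variance \(\frac{7}{18}\sigma^2\), slightly more than the \(\frac{1}{3}\sigma^2\) of the plain average; if you had to pick between the two, the plain average would be the better estimator.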
Some things to watch out for:
Always add variances (for independent random variables)! Even if an estimator subtracts some values, when you calculate the variance you have to add (because the coefficient of \(-1\) gets squared). This is different from means, which you can subtract.
A key to solving these problems is knowing the mean/variance of \(X_n\) or whatever your estimator is made up of. This may be given to you, but you may have to find it. One example would be a Binomial variable, where the mean is \(np\) and the variance is \(npq\).