# 8 The Central Limit Theorem

The central limit theorem is the powerhouse of statistical reasoning. What we’ve done so far is largely describe data that we have. We can calculate the average based on the data we possess, or graph it in various ways, and even look at the relationship between different variables. And that’s all good, but what if we could use data that we have to understand data that we don’t have? That would be cool, huh?

Here is where we push off into the sea of inferential statistics. Inferential statistics is all about making … inferences. Making an inference means drawing a conclusion based on evidence. If you come home and find the garbage can knocked over and trash spread around your floor, you might infer your dog is the culprit. You won’t know for sure, but based on all the evidence that is your best guess. Similarly, we can find evidence in our data that lets us draw conclusions, not about the behavior of your dog but about the population.

We often want to know about the population, whether that is a state or a neighborhood or the country or any other body. But asking all of those people is costly and difficult. It’d be easier if we just asked a few, and were able to infer that everyone else’s answers were similar. That is the goal of inferential statistics, to infer from a sample to a population.

Why can we do that, you ask? Because of the central limit theorem.

## 8.1 Distributions

We need to start with some different distributions though. We’ve already met a few. In Chapter 4 we discussed skewed distributions and the normal distribution. But distributions come in many different shapes and sizes. Let’s start by looking at what we’ve already discussed.

``hist(rnorm(100000, 100, 5), breaks=50, xlab="", main="Normal Distribution")``

That’s a normal distribution, with an equal distribution above and below the mean. Another name for a normal distribution is a bell curve, because of the bell shape it creates.

In addition to normal distributions, in Chapter 4 we observed right skewed data.

``hist(rbeta(100000, 2, 99), breaks=50, xlab="", main="Right Skewed Distribution")``

And data can also be left skewed.

``hist(rbeta(100000, 10, 2), breaks=50, xlab="", main="Left Skewed Distribution")``

Those three are perhaps the most typical distributions that are observed, but many others exist as well.
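As a rough numerical companion to the three histograms above, here is a minimal sketch that compares the mean and median of each distribution. It relies on a common rule of thumb (the mean is pulled toward the long tail), which holds for these examples but is not a universal law. The object names `normal_x`, `right_x`, and `left_x` and the seed are assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Same three distributions as in the histograms above
normal_x <- rnorm(100000, 100, 5)
right_x  <- rbeta(100000, 2, 99)
left_x   <- rbeta(100000, 10, 2)

# Symmetric data: mean and median nearly agree
c(mean = mean(normal_x), median = median(normal_x))

# Right skew: the long right tail pulls the mean above the median
c(mean = mean(right_x), median = median(right_x))

# Left skew: the long left tail pulls the mean below the median
c(mean = mean(left_x), median = median(left_x))
```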

``````par(mfrow=c(1, 3))
hist(rbeta(1000, 10, 10), breaks=25, xlab="", main="Beta(10, 10)")
hist(rbeta(1000, 30, 65), breaks=25, xlab="", main="Beta(30, 65)")
hist(rbeta(1000, 3, 37), breaks=25, xlab="", main="Beta(3, 37)")``````

Data can arrange itself in many different shapes.

But here’s the important thing: the central limit theorem applies to data in any distribution, no matter what its shape is.

So what is the central limit theorem? I’ll start with the intuition of what it is, before giving the formal definition.

## 8.2 A Marathon Example

This example is a condensed version of the explanation of the central limit theorem in Charles Wheelan’s Naked Statistics, which is my favorite book on stats. I tried to think of a different explanation that would at least let me claim to be similar to, rather than a copy of, what Wheelan wrote, but I couldn’t. His explanation was too clear, and any other example would muddy that.

Imagine that it’s the weekend of a big marathon in your city (think NYC marathon, but it doesn’t have to be NYC). Thousands of people from around the world have come to participate. Tour buses are provided to get runners from a sign in point to the starting line, in order to help manage the size of the crowds. However, one tour bus has gotten lost.

You’ve joined the search groups looking for this lost tour bus, full of soon-to-be marathoners. Pretty quickly you come across a broken down tour bus on the side of the street. Success! You’ve saved the day. Or have you?

You climb on the bus and notice something. All of the passengers are… larger. Not abnormally so, but they look to be overall average. It had just come up in bar trivia the night before, so you know that the average American woman weighs 168.5 pounds and the average American man weighs 195.7 pounds. Some of the passengers weigh a bit more, some a bit less, but you quickly (somehow) surmise that the bus’s average weight is roughly 173 pounds.

Do you think that bus has the marathoners? It isn’t impossible, but you quickly recognize that the bus you found is highly unlikely to be the one everyone is searching for. The driver confirms your suspicions: this bus had broken down on its way to a nearby amusement park.

Without knowing it you’ve applied the central limit theorem. There were thousands of runners in the city. The average marathoner’s weight will be closer to 150. Some will be larger, and some will be smaller. But that is the rough average for the population of marathoners in the city.

You can think of each tour bus, which holds about 50 runners, as an individual sample of the runners. If each tour bus was filled at random, some of the runners on each bus will be smaller than average and some will be a bit larger, so the average of each tour bus will differ somewhat from the average of all runners. But what are the odds that one sample of runners, one tour bus, will be on average more than 20 pounds above the average for all runners? It isn’t impossible. But it is highly unlikely.

We’re going to keep circling around a few terms, let’s make sure we’re clear:

• Population of marathoners: all marathoners in the city for the race
• Sample of marathoners: each tour bus of 50 runners

If you take a single sample of the marathoners, the average will differ slightly from the population mean. Let’s create a list of our runners, using the command rnorm(). The rnorm() command generates random numbers after we choose how many we want (n=), the mean (mean=), and the standard deviation (sd=). Once we enter that, it outputs a random weight for each of our runners, and those weights generally conform to that distribution.

``````Runners <- rnorm(n=1500, mean=150, sd=10)
hist( Runners , freq=FALSE, main="Distribution of 1500 Runner Weights")``````

Now we have the weights of all 1500 runners signed up for our marathon. And we want to treat each bus taking them to the start line as a sample of that population. We can take a sample of those 1500 using the command sample(). Let’s see what our first sample looks like; remember, each bus fits 50 runners.

``````x <- sample(Runners, 50)
x``````
``````##  [1] 155.7261 155.5724 149.3700 142.2469 157.1383 148.8565 160.7347
##  [8] 148.1110 147.9599 142.1228 158.4735 152.0184 158.9181 142.7303
## [15] 139.9376 136.3199 137.3517 149.4602 159.4728 150.4297 146.5684
## [22] 143.6961 152.7943 140.6383 152.8276 150.6971 143.5916 146.8363
## [29] 157.7129 146.6637 155.8063 150.1653 171.5741 162.7199 137.8371
## [36] 153.5971 154.3911 167.8031 143.8816 153.6109 142.6882 161.6318
## [43] 151.0576 158.8354 145.8326 150.9870 135.9106 130.9318 150.5245
## [50] 145.5593``````
``mean(x)``
``## [1] 150.0064``

So the mean isn’t exactly 150. But it’s also not far from 150. And most sample means won’t be.

Overall, the mean of our samples, if we take enough, will come very close to the mean of the population. As we keep taking samples from the runners, the distribution of the means of those samples will stack up close to 150, forming an approximately normal distribution. To illustrate, let’s take the mean weights of a bunch more tour buses. The code isn’t really worth explaining here, but I’m going to take multiple samples from the runners data, save them into a list of the samples, and figure out the mean of each sample. Then, I’ll graph those means.

``````runner.sample <- as.data.frame(t(replicate(50, sample(Runners, 50))))
runner.sample$mean <- rowMeans(runner.sample, na.rm=TRUE)
hist(runner.sample$mean)``````

And that’s exciting, very exciting, because we know how far data should fall from the mean of a normal distribution.

Remember the normal distribution. 34.1 percent of the data falls within 1 standard deviation above the mean, and another 34.1 percent falls within 1 standard deviation below it. That means a total of 68.2 percent of the data falls between 1 standard deviation below the mean and 1 standard deviation above the mean. Another 13.6 percent of the data falls between 1 and 2 standard deviations on each side. In total, we expect 95.4 percent of the data to be within 2 standard deviations of the mean, either above or below.
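Those percentages can be checked with R’s built-in pnorm(), which gives the share of a standard normal distribution falling below a given number of standard deviations. This is only a quick sketch; the variable names are made up for illustration, and the results may differ from the figures above in the last digit because of rounding.

```r
# pnorm(z) is the share of a standard normal distribution below z
within_1_sd     <- pnorm(1) - pnorm(-1)  # between -1 and +1 standard deviations
within_2_sd     <- pnorm(2) - pnorm(-2)  # between -2 and +2 standard deviations
between_1_and_2 <- pnorm(2) - pnorm(1)   # between +1 and +2, one side only

# Express as percentages, rounded to one decimal place
round(100 * c(within_1_sd, within_2_sd, between_1_and_2), 1)
```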

When we’re calculating the standard deviation for the means, known as the standard error, we divide the standard deviation for the population (10) by the square root of the sample size (50), which equals 1.41.

``10/sqrt(50)``
``## [1] 1.414214``
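To see that this formula is not just arithmetic, here is a hedged simulation sketch: it fills many simulated buses and compares the spread of their means with what the formula predicts. The names `runners_sim` and `bus_means`, the seed, and the choice of 10,000 buses are all assumptions for illustration, not part of the example above.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for the population of 1500 runners
runners_sim <- rnorm(n = 1500, mean = 150, sd = 10)

# Fill 10,000 simulated buses of 50 runners each and keep each bus's mean
bus_means <- replicate(10000, mean(sample(runners_sim, 50)))

sd(bus_means)               # spread of the simulated bus means
sd(runners_sim) / sqrt(50)  # spread the formula predicts
```

The two numbers should be close but not identical, partly because of random noise and partly because each bus is drawn without replacement from a finite list of runners.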

Taken together, that means that if we took the mean weight of each tour bus, we would expect it to be roughly 150, and we’d expect those means to have a standard deviation of 1.41. In addition, we know that most of the means will fall within 2 standard deviations, or to be more exact between 147.18 and 152.82.

``150-(2*1.41)``
``## [1] 147.18``
``150+(2*1.41)``
``## [1] 152.82``

So back to the bus we located. The mean of that bus was 173. We know now that roughly 95% of the buses/samples will have a mean weight between 147.18 and 152.82. What are the odds that one bus would be at 173? Not very good. That’s 16 standard errors above the mean.

``(173-150)/1.41``
``## [1] 16.31206``

Again, it is not impossible that the located bus is full of marathoners. What are the chances of getting a sample mean 16 standard errors above the population mean? We’ll calculate exactly how unlikely in future chapters, but for now we can leave it at saying the chances are less than one in a million. Again, not impossible, but it is improbable.
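As a small preview, and only as a sketch, the upper tail of the normal curve can be looked up with pnorm(). This assumes the bus means are approximately normal with mean 150 and standard error 10/sqrt(50); the variable names are hypothetical, and the z value differs slightly from the one above because the standard error is not rounded to 1.41 here.

```r
# Assumed values taken from the marathon example
pop_mean  <- 150
std_error <- 10 / sqrt(50)  # unrounded standard error
bus_mean  <- 173

# How many standard errors above the population mean is this bus?
z <- (bus_mean - pop_mean) / std_error

# Probability of a sample mean at least this far above the population mean
p_above <- pnorm(z, lower.tail = FALSE)

z
p_above
p_above < 1e-6  # is it rarer than one in a million?
```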

## 8.3 Second Example With More Math

This is all important, so let’s work through a similar example. This example will have less of a story to it, and more of a focus on the math. But we can put it within a real world focus that might be more realistic than lost buses of runners.

The marathon example above used normally distributed data on runners. However, the central limit theorem applies to any distribution of data. No matter the distribution of the original data, the means taken from random samples will approximate a normal distribution. So, for this example let’s use the data on California schools from earlier. The variable for the number of students per school was skewed to the right.

``````library(AER)
data("CASchools")
hist(CASchools$students, breaks=100)``````

So, the population here is all schools in California. Let’s calculate the population mean and standard deviation.

``mean(CASchools$students)``
``## [1] 2628.793``
``sd(CASchools$students)``
``## [1] 3913.105``

The central limit theorem states that if we take repeated random samples of that population, over time the means of those samples will conform to a normal distribution. Let’s do that.

The size of each sample we take makes a small difference. Normally, you want to take a sample of at least 30 in order to accurately measure the population. But the sample can also be much larger. Let’s use 30 for this example, just to show that it works. And remember, we need to take repeated samples in order for the means to form a normal distribution. Let’s start by taking 10 samples, to see the shape of those means.
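To make the effect of sample size concrete, here is a minimal sketch that applies the same formula used in the marathon example (population standard deviation divided by the square root of the sample size) to a few sample sizes. The value of `pop_sd` is the standard deviation reported above; the sample sizes other than 30 and the name `se_table` are illustrative assumptions.

```r
# Population standard deviation of students, as reported above
pop_sd <- 3913.105

# A few sample sizes to compare (30 is the one used in this example)
n <- c(30, 100, 400)

# Expected standard deviation of the sample means for each sample size
se_table <- data.frame(n = n, std_error = pop_sd / sqrt(n))
se_table
```

Larger samples give sample means that cluster more tightly around the population mean, which is why the size of each sample matters.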

``````student.sample <- as.data.frame(t(replicate(10, sample(CASchools$students, 30))))
student.sample$mean <- rowMeans(student.sample, na.rm=TRUE)
hist(student.sample$mean, ylim=c(0,10))``````

That doesn’t look very normal. Taking 10 samples isn’t enough to form a normal distribution, just like flipping a coin 10 times wasn’t in the chapter on probability. Let’s increase the number of samples to 1000 and look at the shape.

``````student.sample <- as.data.frame(t(replicate(1000, sample(CASchools$students, 30))))
student.sample$mean <- rowMeans(student.sample, na.rm=TRUE)
hist(student.sample$mean, breaks=100)``````

More normal. And 10,000 samples?

``````student.sample <- as.data.frame(t(replicate(10000, sample(CASchools$students, 30))))
student.sample$mean <- rowMeans(student.sample, na.rm=TRUE)
hist(student.sample$mean, breaks=100)``````

Fairly normal. Notice the slight outliers on the right side though. That’s because with only 30 schools in each sample, a few very large schools can pull some sample means higher than we’d expect. That’s fine; this distribution is what we’d expect and looks normal. So even data sampled from an extremely skewed population can form a bell curve with enough samples drawn. What’s the standard deviation of those means? It’s the standard deviation of the population, divided by the square root of the sample size.

``sd(CASchools$students) / sqrt(30)``
``## [1] 714.432``
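The same check can be sketched without the AER package by using a simulated skewed population. Everything here is an assumption for illustration: `skewed_pop` is a made-up right-skewed population drawn with rexp(), not the school data, and the seed and number of samples are arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical right-skewed population standing in for the school data
skewed_pop <- rexp(10000, rate = 1 / 2600)

# Take 10,000 samples of 30 and keep the mean of each
sample_means <- replicate(10000, mean(sample(skewed_pop, 30)))

# Standard error predicted by the formula vs. the spread actually observed
predicted_se <- sd(skewed_pop) / sqrt(30)
observed_se  <- sd(sample_means)
c(predicted = predicted_se, observed = observed_se)

# Share of sample means within 2 standard errors of the population mean;
# if the means are roughly normal this should be near 95 percent
share_within_2 <- mean(abs(sample_means - mean(skewed_pop)) <= 2 * predicted_se)
share_within_2
```

If the share comes out close to 0.95, that is a sign the sample means are behaving roughly like a normal distribution even though the underlying population is heavily skewed.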

I keep telling you how important and exciting the existence of the central limit theorem is. In the next chapter, we’ll start using it to judge more interesting things than the destinations of tour buses. The central limit theorem underlies everything that statistics tells us about the world, for better and worse.