2 Let’s Explore Probability Theorems

In this chapter we will continue to explore some essential properties of probability and begin to relate them to experiments. These concepts will come up frequently in other places, so here we take some time to gain an intuition about them.

2.1 The Central Limit Theorem

The central limit theorem (CLT) states that the sum of many independent random variables tends towards a normal distribution, regardless of the distribution of the random variables themselves.5 Think back to the frog widget from chapter 1. Note that in that simulation, no normally distributed data was generated. Every time you hit the progress button, each frog decided to jump to the left, jump to the right, or stay still. In aggregate, this led to frog position being approximately normally distributed, as these random decisions were more likely to cancel out than add together. This is a perfect example of the central limit theorem in action.
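If you want to recreate this outside the widget, here is a minimal sketch assuming the frog rules from chapter 1 (each frog moves one unit left, one unit right, or stays put with equal probability on every press); the exact rules and the counts used are illustrative assumptions.

```python
# A minimal sketch, assuming each frog steps -1, 0, or +1 with equal
# probability. Each frog's final position is a sum of independent random
# steps, so by the CLT the positions pile up into a roughly normal shape.
import numpy as np

rng = np.random.default_rng(seed=1)

n_frogs, n_steps = 10_000, 100                            # assumed sizes for illustration
steps = rng.choice([-1, 0, 1], size=(n_frogs, n_steps))   # every frog's step history
positions = steps.sum(axis=1)                             # final position of each frog

# No single step is normally distributed, but the sums are approximately
# normal: mean near 0 and standard deviation near sqrt(n_steps * 2/3) ≈ 8.2.
print(positions.mean(), positions.std())
```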

2.1.1 Why is the Central Limit Theorem Important?

For experimentalists, the CLT is an extremely important concept. For many practical questions, we cannot get measurements for an entire population of interest, so we have to select a sample from which to draw conclusions. But have you ever considered what we know about the relationship between the sample and the population? How can we be confident that the conclusions we draw from the sample generalize to the population? Issues related to proper sampling aside, the central limit theorem allows us to make claims about the distribution of our sample means. We will continue to dive into this idea in the next chapter, but for now let’s work on internalizing the concept.

In the widget below, you can generate populations. The button that you press will determine what distribution this population is generated from. On the right plot, you will see the distribution of means of 500 samples from that population. Simulate a few populations from each of the possible distributions now.

Regardless of what distribution you choose, notice that the sample means still end up approximately normally distributed. Why is this the case? Consider what would be necessary to observe a sample mean of 3 when our population is uniform. As 3 is the largest possible value in our population, every one of our sample values would need to be 3 for us to observe a sample mean of 3. Sound familiar? This is the exact same logic we used when we tried to understand the frog simulation from chapter 1. The common idea here is that there are fewer paths to extreme values than there are to middling values, which leads to the middling values being observed more often.
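As a rough stand-in for the widget, the sketch below assumes a uniform population over the values 1, 2, and 3 (to match the example above) and samples of size 30; both choices are illustrative, not the widget’s actual settings.

```python
# A minimal sketch: 500 sample means drawn from a flat (uniform) population.
import numpy as np

rng = np.random.default_rng(seed=2)

population = rng.integers(1, 4, size=100_000)        # uniform over {1, 2, 3}
sample_means = np.array([
    rng.choice(population, size=30).mean()           # mean of one sample of 30
    for _ in range(500)
])

# The population is flat, but the sample means cluster around 2 in a
# roughly bell-shaped pile; a mean of exactly 3 essentially never occurs.
print(sample_means.mean(), sample_means.min(), sample_means.max())
```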

2.2 The Law of Large Numbers

The law of large numbers is not strictly related to the central limit theorem, but they address similar questions and both ideas will be used in the next chapter. As discussed above, the central limit theorem dictates that the sum of many independent random variables will be approximately normally distributed. We are particularly interested in what this means for samples from a population. If we take many independent random samples, we can be confident that the means of these samples will be approximately normally distributed. But what else do we know about this distribution? The law of large numbers will help us understand what the parameters of this normal distribution (μ and σ) will be.

First, let’s explore the law of large numbers in a simple case. Below is a widget that shows the result of flipping a coin 1,000 times. We are interested in counting the number of heads. On the x-axis we will plot the flip number.6 Along the y-axis, we will plot the cumulative mean, or our average number of heads up to flip x. Because the coin has a 50% probability of landing on heads, we might expect that our average number of heads will be close to 0.5. But how close? Use the plot run button to simulate 1,000 coin flips and plot the result. Take some time to study the pattern, then click the button a few more times to generate more runs.

There are a few things you hopefully noticed immediately. Early on there is a lot of variation in mean scores. However, as you move towards 1,000 flips, the runs converge towards the expected probability of 0.5. Let’s step through a few values for x. At x = 1 there are only two possible values for y: 0 and 1. For x = 2, there are now 4 possible paths: heads then heads, tails then tails, heads then tails, and tails then heads. In terms of y, this leads to 3 possible values: 0, 1, and 0.5. Because there are 2 paths to 0.5, it is already more probable than either 0 or 1, but the extremes are still somewhat likely. When we go out to x = 10, it becomes very unlikely for y to be equal to 0 or 1, as each requires 10 coin flips to come up the same way in a row (a 0.5^10, or roughly 0.001, chance). By contrast, there are a large number of paths that end up at 0.5 (again, think back to the frog widget from chapter 1).
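We can make this path counting explicit with binomial coefficients. Out of the 2^10 equally likely sequences of 10 flips, only one gives a cumulative mean of 0 and only one gives a mean of 1, while 252 give a mean of 0.5:

```python
# Count the paths to each possible cumulative mean after 10 flips.
from math import comb

n = 10
for heads in range(n + 1):
    paths = comb(n, heads)   # number of 10-flip sequences with this many heads
    print(f"mean {heads / n:.1f}: {paths:3d} paths, probability {paths / 2**n:.4f}")

# mean 0.0 and mean 1.0 each have 1 path (probability ~0.001);
# mean 0.5 has 252 paths (probability ~0.246).
```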

After plotting a few runs, you will see that by x = 50, y is usually in the range of 0.3 to 0.7. However, by x = 1,000, runs end up extremely close to 0.5. This is the fundamental intuition behind the law of large numbers: the more independent random events you have, the less variability there will be around the theoretically expected result. This holds even if we don’t know what the theoretical expectation is. For example, if we are worried that a coin is a trick coin and want to determine how likely it is to land on heads, we can be confident that after 1,000 flips our average number of heads will be very close to the true probability that the coin lands on heads.
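If you’d like to reproduce a single run, here is a minimal sketch assuming a fair coin; instead of plotting, it prints the cumulative mean at a few checkpoints, which shows the same convergence.

```python
# One run of 1,000 fair coin flips with a running (cumulative) mean of heads.
import numpy as np

rng = np.random.default_rng(seed=3)

flips = rng.integers(0, 2, size=1000)                    # 1 = heads, 0 = tails
cumulative_mean = flips.cumsum() / np.arange(1, 1001)    # average heads after each flip

# Early values bounce around; later values hug 0.5.
for x in (1, 2, 10, 50, 1000):
    print(f"after {x:4d} flips, cumulative mean = {cumulative_mean[x - 1]:.3f}")
```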

2.3 Regression to the Mean

Finally, I’d like to take some time to discuss regression to the mean. Whereas the CLT and the law of large numbers will come into play immediately in hypothesis testing, regression to the mean is more essential for topics related to experimental design and replication. It is the idea that extreme values tend to be followed by less extreme values in the long run. For any process with a random component, an unusually good result is partly based on “luck.” As an example, an athlete who performs extremely well one year is not likely to repeat that strong performance the next.

It is important to separate this idea from the gambler’s fallacy, which occurs when people incorrectly believe that a run of extreme results changes the underlying probability of the process, pushing it away from the extreme results that have already been observed. Let’s again think about flipping a coin. As discussed above, the odds of getting 10 heads in a row are extremely low. However, after getting 9 heads in a row, you are just as likely to get a 10th head as you are to get your 1st tail.

Regression to the mean states that after observing 10 heads in a row, you are more likely to get 5 heads in your next 10 flips than 10 heads (of course, this was true for your first 10 flips as well). This means that your average number of heads is likely to regress towards the mean at some point, as the initial extreme result was due to a random process that is unlikely to continue producing extreme values.
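A quick simulation makes the distinction concrete. The sketch below generates many 20-flip runs, keeps only those that happen to open with 10 straight heads, and looks at what the next 10 flips do; the number of runs and the seed are arbitrary choices.

```python
# Condition on runs that start with 10 heads and check the following 10 flips.
import numpy as np

rng = np.random.default_rng(seed=4)

runs = rng.integers(0, 2, size=(1_000_000, 20), dtype=np.int8)   # 1 = heads, 0 = tails
streaks = runs[runs[:, :10].sum(axis=1) == 10]                    # runs opening with 10 heads

print(len(streaks), "runs began with 10 straight heads")
print("mean heads in the next 10 flips:", streaks[:, 10:].sum(axis=1).mean())   # ≈ 5, not 10
print("proportion of heads over all 20 flips:", streaks.mean(axis=1).mean())    # ≈ 0.75: down from 1.0, moving towards 0.5
```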

So why does this matter? Let’s look at an example more closely related to performing experiments. Imagine you are interested in improving SAT scores. You have a new study plan that you are sure is going to help increase the scores of the lowest performers, so you take the students with the lowest scores, have them use your study plan, and have them take the test again. To make sure that there isn’t an effect due to practice with the test, you make the highest scorers take the test again as well, with no additional studying.

Here we assume that SAT scores are the result of some internal skill or knowledge as well as some randomness due to how well the students slept, whether they were tested on words they had studied, or any other random factor you can think of that might change for test 2. Plotted on the left is the population of all student scores, with black lines showing the cutoff scores. On the right are the scores on test 2 for both low scorers (top) and high scorers (bottom). The “new students” button will generate a new population for you to test.

At first glance, it seems as if your study plan is working! The low-performing students seem to be doing better on average. But what is going on with the high-performing students? Have they gotten worse? In fact, the simulation doesn’t take the study plan into account at all; the improvement of the low-performing students and the decline of the high-performing students are both due to regression to the mean.

Performance on test 1 was somewhat based on random luck. On test 2, these groups of students ended up closer to mean performance on average. You may even notice that some low performers end up better than average and some high performers end up worse than average if you take a few samples. A better design for this experiment would have been to sample across the entire range of students.
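For the curious, here is a minimal sketch of this kind of simulation with made-up numbers: the score scale, the amount of test-day noise, and the 10% cutoffs are illustrative assumptions, not the widget’s actual parameters.

```python
# Each score is a stable "skill" plus independent test-day noise; no study
# plan is modeled, yet the extreme groups still shift towards the mean.
import numpy as np

rng = np.random.default_rng(seed=5)

n_students = 10_000
skill = rng.normal(1050, 150, size=n_students)       # assumed stable ability
test1 = skill + rng.normal(0, 75, size=n_students)   # luck on test 1
test2 = skill + rng.normal(0, 75, size=n_students)   # fresh, independent luck on test 2

low = test1 < np.quantile(test1, 0.10)               # bottom 10% on test 1
high = test1 > np.quantile(test1, 0.90)              # top 10% on test 1

print("low scorers:  test 1", round(test1[low].mean()),  "-> test 2", round(test2[low].mean()))
print("high scorers: test 1", round(test1[high].mean()), "-> test 2", round(test2[high].mean()))
# The low group "improves" and the high group "declines" purely from regression to the mean.
```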

As an aside, this is likely why replications often show smaller effect sizes than the original studies (if they show an effect at all). As researchers, we are often excited by large, seemingly meaningful results, and journals are biased towards publishing them. But because we select for large effects, and because there is a random component in our experiments, these studies are likely to be less impressive when repeated.


  5. Note that there are actually multiple central limit theorems that all describe slightly different situations mathematically. Here we are simply concerned with the general concept.

  6. A logarithmic scale is used to provide more insight into the early flips.