3 Let’s Explore Sampling Distributions

In this chapter, we will explore the 3 important distributions you need to understand in order to do hypothesis testing: the population distribution, the sample distribution, and the sampling distribution. Most importantly, we will explore the relationships between them, so that you internalize not only what they are but why they matter.

3.1 What is a Population Distribution?

A population is of course the entire group of individuals that you are interested in studying. This could be anything from all humans to a specific type of cell. The population distribution however, is a bit more narrow with its definition because it is specific to the measure you are interested in. So, if you were studying the heights of adult humans your population would be all adults but your population distribution would be all of the heights of every human in centimeters. Many resources don’t make this hard distinction, but if you think of population distributions in this way it will help you conceptualize exactly what your population parameters are.

3.1.1 What is a Population Parameter?

A population parameter is a number that describes the population distribution. In most real world experimental settings, you will likely be most interested in the population mean if your data is continuous. If you have categorical data you would instead be interested in the proportion of a population for which a certain trait applies. You may end up with binary or count data for which different parameters apply. Practically speaking though, researchers are often interested in means and if you understand the following principles in regards to population means it will be easy enough to transfer that knowledge to any other population parameter you are interested in. Note that, although we are interested in the population mean, that does not mean that we are assuming any shape of the population distribution (i.e. you can calculate means no matter what shape a distribution is). We will further discuss assumptions of hypothesis tests in future chapters.

In frequentist statistics (i.e. everything we are going to talk about until we get to Bayesian statistics) population parameters are fixed values. This makes intuitive sense, for example there is a true mean height for all humans. Of course, it is unusual to actually have access to these population parameters. One notable exception is when you work with standardized test data (such as IQ scores) where the mean of the test is designed to be a specific value. However, when you are observing something in nature you will not have access to the population values (if you did, you wouldn’t need statistics).

3.2 What is a Sample Distribution?

Because we don’t know population parameters, we have to estimate them using samples. Your sample is the only data you actually get to observe, whereas the other distributions are more like theoretical concepts. Your sample distribution is therefore your observed values from the population distribution you are trying to study. Instead of parameters, which are theoretical constants describing the population, we deal with statistics, which summarize our sample. This can be confusing as the mean of a population is a parameter but the mean of a sample is a statistic. To help clear things up, there are different mathematical notations for describing parameters and statistics. Common parameters are mean μ, standard deviation σ, or p proportion. The statistics equivalents are mean \(\bar{x}\) (read: “x bar”), standard deviation s, and proportion \(\hat{p}\) (read: “p hat”). While these notations are common, it is always good to check the context for whether a piece of text is referring to a parameter or a statistic.

In some fields, how you collect your sample is incredibly important. For example, when polling about political beliefs you want to be sure that your sample is representative of the entire body of people you are making claims about. A famous example of the importance of sampling comes from the Literary Digest who conducted a poll which oversampled rich Americans, leading to a prediction that Landon would win the US presidency in 1936. In fact, FDR won in a landslide. In other fields, it is expected that you conducted a simple random sample and other sampling methods are rare. As this topic is more about experimental design than statistics per se, I will not go into the different types of sampling here, but if you work with populations with large individual differences (e.g. people) I encourage you to read more into the topic. From here on out, whenever we discuss a sample we will assume it is representative of the population.

Finally, I’d like to touch quickly on outliers. In some fields, it is common practice to remove outliers that are a certain distance from your sample mean (often 2 or 3 sd, or 1.5 IQR). While these methods can be used to identify possible outliers, I strongly recommend you not use these methods unilaterally and without thought. If you record 50 data points from a normal distribution you should expect to observe at least one “outlier” (if any point greater than 3 sd from the mean is an outlier) about 12.5% of the time.⁷ In my opinion, the only reason to remove an outlier is if you believe that is was not generated by your population distribution. For example, if you had subjects complete a psychological task but one subject slept through it resulting in chance performance, it would of course make sense to remove them from your analysis. But if your task has an average performance of 90% and 1 subject is at 80%, removing them without some additional cause can bias your results. Regardless, whenever you are considering removing an outlier, it is best to perform your analysis both with outliers present and removed. Then you can compare results to see if the decision actually matters. If it doesn’t, choose whichever method makes more sense to you, but if it does matter be sure to report that your analysis is heavily influenced by the potential outliers in the data.

3.2.1 The Relationship between Population Distributions and Sample Distributions

Your sample distribution will be an approximate representation of your population distribution. This makes logical sense. If your population is coin flips your sample must therefore also be binary. Furthermore, if your population distribution has a probability of flipping heads at 50%, you would expect around 50% of your sample to be heads. Below is a widget that will allow you to generate a single sample from a population. Similar to the widget from last chapter, you can control the type of distribution that generates the population. However, you now also have control of the size of the sample that is generated using the slider. Take a moment to explore the effects of different sample sizes on the shape of the sample distribution.

Hopefully you noticed that as you increase sample size, the sample population starts to look more like the population data. With large sample sizes, the sample mean and sd will be a good approximation of the population mean and sd. However, you might be surprised with how poor a representation the sample distribution is at lower set sizes. At n = 20, it is even hard to tell the normal distribution apart from the uniform distribution and it is impossible to tell apart skew in the skewed distribution from random noise. This is not to say that you should never run experiments with sample sizes of 20, as we will see we can learn a lot even from small samples. But one take away is that you should be wary of making conclusions about the shape of your population distribution if your sample size is on the smaller side. I often see questions posted online from researchers worried about the non-normality of their data when it is basically impossible for a sample of that size to come out looking normal.⁸ In the next chapters we will look at the assumptions of various tests and what happens when you break them so that you know what rules can be bent and how far. For now, go back up to the widget and determine at what sample size you would feel comfortable making claims about the shape of your population distribution based on a sample you collected.

3.3 What is a Sampling Distribution?

Probably, the ideas behind population and sample distributions make some intuitive sense to you. Sampling distributions are where people tend to run into trouble, which is unfortunate as they are the most important to understand moving forward (of course, the naming doesn’t help things). The sampling distribution is the theoretical distribution of possible values for a sample statistic. Let’s return to the coin flipping example. As we have seen previously, it is possible but unlikely to observe a sample with 10/10 heads whereas it is much more likely to observe a sample with 5/10 heads. This is the sort of information that our sampling distribution contains.

Why is the concept of a sampling distribution so important? In frequentist statistics, all of the information about uncertainty in our estimates comes from the information that we know about the sampling distribution. Because our sample is from a theoretical sampling distribution, we can work backwards to make claims about the sampling distribution. And, by working backwards from our sampling distribution, we can make claims about the uncertainty in our population parameter estimates.

3.3.1 The Relationship between Sampling Distributions and Population Distributions

Most of the time in statistics, we are dealing with a single sample.⁹ Without the probability theorems discussed in the last chapter, we would not be able to make any statements about the sampling distribution. Luckily, thanks to the central limit theorem we know that our sampling distribution of the mean will be normal. And thanks to the law of large numbers, we know that as our n increases, the variation around the population mean will decrease.

The below widget will allow you to explore how changes in a population distribution affect the sampling distribution. The sampling distribution shows a distribution of sample means where each sample has an n of 25. This widget is identical to the CLT widget, but you now have the ability to adjust the mean and standard deviation of the population distribution. Take a moment to see how these changes impact the sampling distribution.

As you hopefully noticed, changing the mean of the population distribution results in a change in the center of the sampling distribution. What does this mean? It means that sample means are more likely to be close to the population mean than far away. Thanks to the CLT, we know that the error around the population mean will be normally distributed. However, the sd of the sampling distribution is not an estimate of the population σ. Still, there is a clear relationship there which we will continue to explore in the following sections.

3.3.2 The Relationship between Sampling Distributions and Sample Distributions

Here we will further explore the impact of sample size on your sampling distribution. In the following widget, a sample from a population as well as the sampling distribution are plotted side by side. You are once again able to control the population standard deviation (which directly impacts the sample sd), but now you can also change the size of the sample that has been collected. Compare how the shape of the sampling distribution changes as you adjust both of these values.

First, notice that even with small set sizes, we end up with a normal looking sampling distribution. There is a oft-repeated “rule” that the central limit theorem kicks in at n = 30 (or, more recently, even 50) and that hypothesis tests are not valid without a sample size this large. As you can see, this rule does not apply when our population is normally distributed, even an n of 10 is sufficient for a nicely normal sampling distribution. If you were to repeat this process with a binomial distribution, you would probably find that even an n of 50 is further from normal than you might expect.¹⁰ Do not blindly trust rules of thumb. Instead, take the time to do these simulations for yourself, for your particular situation.

Hopefully you also saw that both smaller values for sd and larger values for n lead to skinnier sampling distributions. What is the significance of this? Remember that this is the distribution of means from which our observed sample mean would come from. A skinnier distribution means that our sample mean is likely closer to our population mean than it otherwise would have been. Put another way, the less variable our estimate and the more samples we have, the better our estimate of our population parameter will be. Return to the widget to answer the following questions. Assuming you have the resources to either decrease the sample noise (leading to smaller sample sd) or increase the number of samples, which should you prefer? What intuition does this give towards the mathematical definition of the standard deviation of the sampling distribution?

3.3.3 Standard Error

The standard deviation of the sampling distribution is known as the standard error (SE). There are multiple formulas for standard error depending of exactly what your sampling distribution is, but here we will discuss the math behind the standard error shown above. The standard error of a simple mean estimate is the population sd divided by the square root of the sample size n (3.1).

\[\begin{equation} \LARGE \tag{3.1} SE = \frac{\sigma}{\sqrt{n}} \end{equation}\]

However, because we usually do not have access to the population parameter sigma, we instead use the sample standard deviation, as it is an estimate of the population standard deviation. With extremely small samples (i.e. less than 5) there is bias in this estimate, but as n increases it becomes largely negligible.

One consequence of this formula is that it is normally better to decrease noise rather than increase sample size, if you have the choice. As sample sizes increase, there are diminishing returns due to the square root in the denominator. In psychology experiments, this means that it is possibly better to have 50 subjects complete 50 trials of your experiment compared to 100 subjects complete 25 trials.¹¹

I encourage you to look into the derivations of both the mean and the standard deviation of the sampling distribution if you are interested. They are very straight forward, and help solidify the importance of our sample containing independent and identically distributed (iid) random variables. However, if all you take away from this section is a full understand of the definition of standard error you will still be in a good place moving forward.

You can calculate this simply for any sample size you are interested in. The probability of a single data point being an outlier (greater than 3 sd from the mean) is about 0.0027 (consider how to calculate this using the normal cdf). The odds of having no outliers in your entire sample is then (1 - 0.0027)^n where n is your sample size. Finally, to get the odds of at least one outlier, you simply take 1 minus this value. For example, the odds of at least one outlier in your sample of 50 is 1 - ((1 - 0.0027)^50). You could also simulate this value with something like mean(replicate(100000, {x = rnorm(50); any(x > 3 | x < -3)}))↩
At least, in a histogram. QQ plots do a much better job of showing true deviations from normality.↩
Even if data is coming from multiple different places, you would normally still treat it as a single sample and include a factor or use clusters to account for the differences.↩
Check for yourself in R with x = rbinom(1000, 50, 0.5); qqnorm(x); qqline(x).↩
The specifics of this will depend on how sd changes with the number of trial you include, so it is worth doing some downsampling analyses if there is a large dataset available.↩