8 Sampling distributions

In this chapter, we focus on understanding sample estimates and their distributions. The concepts behind sampling form the basis for statistical inference, which is the area of statistics concerned with making statements about a population based on a sample.

8.1 Population and sample

We can think of a population as the set of all possible cases of interest, together with their characteristics of interest (variables). For example, a population of interest could be all house sales in Colorado Springs, and variables of interest could be number of bedrooms, living area, sale price, etc. We can also think of a population as an underlying process that governs possible outcomes. For example, if the population of interest is all mango trees and their heights, instead of thinking of the population as all currently living mango trees, we can think of the underlying process that governs the growth of mango trees, and all possible heights that it can generate, as the “population”. Very rarely do we have access to the entire population of interest, so we usually collect a sample that we hope is a representative subset of the target population.

There are several techniques for collecting samples that are representative of a target population. Throughout these notes, the expression random sample refers to a subset of a population in which each member of the subset has an equal probability of being chosen.16

Sampling can be performed with or without replacement. When sampling with replacement, members of a population can be selected more than once to be part of the sample, whereas when sampling without replacement, members can only be selected once.
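
As a quick illustration in R (the vector of five labels below is made up for illustration), the function sample draws with or without replacement depending on the replace argument:

sample(c("A", "B", "C", "D", "E"), size = 3, replace = FALSE)  # without replacement: no label can repeat
sample(c("A", "B", "C", "D", "E"), size = 3, replace = TRUE)   # with replacement: labels can repeat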

A desirable characteristic of a random sample is that its cases are independent. When randomly sampling from a finite population, sampling with replacement leads to an independent sample, whereas sampling without replacement does not. However, if the sample size is much smaller than the population size, sampling without replacement leads to an approximately independent sample. When randomly sampling from an infinite population, with or without replacement, the sample is independent.

We say that a sample \(X_1, X_2, \dots, X_n\) is independent and identically distributed if the \(X_i\)s are independent and they come from the same distribution. In short, we say that such a sample is iid.

We usually use \(\mu\) and \(\sigma^2\) to denote the mean and variance of the target population.17 So for an iid sample, \(E(X_i)=\mu\) and \(Var(X_i)=\sigma^2\). Summaries of a population, such as \(\mu\) and \(\sigma\), are called population parameters. Sample estimates for population parameters are called point estimates (or simply estimates) or statistics.

8.2 Estimators

Definition 8.1 (Estimator) An estimator \(\hat{\theta}\) is a sample statistic that estimates a population parameter \(\theta\).

For example, \(\overline{X}\) is an estimator for \(\mu\). Sometimes estimators have special symbols, like \(\overline{X}\) and \(S\), but in general, estimators are differentiated from population parameters by having a “^” on them (for example, \(\hat{b}\) and \(\hat{Y}_i\).)

Estimators are random variables because each different sample gives a different value for the estimator. Since they are random variables, we must have information about their distributions in order to do statistical inference about a population parameter.

Definition 8.2 (Sampling distribution) The sampling distribution of an estimator is the probability distribution of the estimator. That is, it describes the distribution of all possible values of the estimator.

Note: In real-world applications, we never actually observe the sampling distribution, yet it is useful to always think of a point estimate as coming from such a hypothetical distribution. Understanding the sampling distribution helps us characterize and make sense of the point estimates that we do observe.

In statistics, it is important not only to know the sampling distribution of an estimator, but also to know some of its key summaries. For instance, what is \(E(\overline{X})\)? What is \(Var(\overline{X})\)? These questions are addressed in the next theorem.

Theorem 8.1 (Expected value and variance of the sample mean) Let \(X_1, X_2, \dots, X_n\) be an iid sample such that \(E(X_i) = \mu\) and \(Var(X_i) = \sigma^2.\) Then \(E(\overline{X}) = \mu\) and \(Var(\overline{X}) = \frac{\sigma^2}{n}.\)

Proof. For the first part, \[E(\overline{X}) = E\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n}\sum E(X_i) = \frac{1}{n}n\mu = \mu.\] For the second part, \[Var(\overline{X}) = Var\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2}Var\left(\sum X_i\right) = \frac{1}{n^2}\sum Var(X_i) = \frac{1}{n^2}n\sigma^2 = \frac{\sigma^2}{n}.\] Here it was used that \(Var(\sum X_i) = \sum Var(X_i)\) because the \(X_i\)s are independent.
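
As a sanity check (not a proof), the theorem can be illustrated with a small simulation. The sketch below uses an arbitrary normal population with \(\mu = 10\), \(\sigma = 2\), and samples of size \(n = 25\):

set.seed(1)
xbar <- replicate(10000, mean(rnorm(25, mean = 10, sd = 2)))
mean(xbar)  # should be close to mu = 10
var(xbar)   # should be close to sigma^2 / n = 4 / 25 = 0.16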

Definition 8.3 (Unbiased estimator) We say that \(\hat{\theta}\) is an unbiased estimator of \(\theta\) if \(E(\hat{\theta}) = \theta.\) Otherwise, we say that \(\hat{\theta}\) is biased.

Example 8.1 For an iid sample, the sample mean is an unbiased estimator of the population mean. This follows from the fact that \(E(\overline{X}) = \mu\), as shown in theorem 8.1.

Example 8.2 For an iid sample, the sample variance is an unbiased estimator of the population variance.

Proof. Let \(X_1, X_2, \dots, X_n\) be an iid sample such that \(E(X_i) = \mu\) and \(Var(X_i) = \sigma^2.\) We need to show that \(E(S^2) = \sigma^2\), that is, \(E(\frac{1}{n-1}\sum (X_i-\overline{X})^2) = \sigma^2.\) The first step is to show that \(\sum (X_i-\overline{X})^2 = \sum(X_i-\mu)^2 - n(\overline{X}-\mu)^2.\) This is done by adding and subtracting \(\mu\) inside the parentheses and then simplifying. Then it follows that \[\begin{eqnarray} E\left(\sum(X_i-\overline{X})^2\right) &=& \sum E (X_i-\mu)^2 - nE(\overline{X}-\mu)^2\\ &=& n\sigma^2 - n Var(\overline{X}) = n\sigma^2 - n\frac{\sigma^2}{n} = (n-1)\sigma^2. \end{eqnarray}\] Finally, \[E\left(\frac{1}{n-1}\sum(X_i-\overline{X})^2\right) = \frac{1}{n-1}(n-1)\sigma^2 = \sigma^2.\]


Notice that if the denominator in the formula for \(S^2\) were \(n\) instead of \(n-1\), we would not have an unbiased estimator: the expected value would be \(\frac{n-1}{n}\sigma^2\), which underestimates \(\sigma^2\).
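
A small simulation sketch (standard normal population and \(n = 10\), both chosen arbitrarily) illustrates the difference between the two denominators:

set.seed(2)
s2_n_minus_1 <- replicate(10000, var(rnorm(10)))                   # var() divides by n - 1
s2_n <- replicate(10000, {x <- rnorm(10); mean((x - mean(x))^2)})  # dividing by n instead
mean(s2_n_minus_1)  # close to sigma^2 = 1
mean(s2_n)          # close to (n - 1)/n * sigma^2 = 0.9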

We have seen that when building statistical models, one way to handle categorical explanatory variables is to create indicator variables, that is, variables that take only the values 0 and 1. When a sample \(X_1, X_2, \dots, X_n\) takes only the values 0 and 1, \(\overline{X}\) is the proportion of 1s in the sample. For example, the mean of the sample \(1,0,0,0,1,0\) is \(\frac{1+0+0+0+1+0}{6} = \frac{1}{3} \approx 0.33.\) Therefore, for an indicator variable, we use the notation \(\hat{p}\) in place of \(\overline{X}\) and call it the sample proportion. We use \(p\) instead of \(\mu\) to denote the population proportion. The population variance is \(p(1-p)\), which is the variance of an indicator/Bernoulli random variable (see theorem 7.7).

From theorem 8.1, it follows that \(E(\hat{p}) = p\) and \(Var(\hat{p}) = \frac{p(1-p)}{n}.\)
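
These facts can also be checked with a quick simulation. The sketch below uses arbitrary values \(p = 0.3\) and \(n = 50\):

mean(c(1, 0, 0, 0, 1, 0))  # the sample proportion from the example above
set.seed(3)
phat <- replicate(10000, mean(rbinom(50, size = 1, prob = 0.3)))
mean(phat)  # close to p = 0.3
var(phat)   # close to p(1 - p)/n = 0.21/50 = 0.0042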

8.3 Central Limit Theorem

In the previous section, we calculated the expected value and variance of the estimator \(\overline{X}\) (and consequently, \(\hat{p}\)). We also showed that \(\overline{X}\) and \(S^2\) are unbiased estimators of \(\mu\) and \(\sigma^2\). However, we still haven’t discussed what the sampling distribution of \(\overline{X}\) (and \(\hat{p}\)) is.

For an iid sample \(X_1, X_2, \dots, X_n\), the exact distribution of \(\overline{X}\) depends on the distribution of the \(X_i\)s. However, for a large enough sample, the distribution of \(\overline{X}\) is approximately normal. This is the main result of this section, which is stated below without proof18.

Theorem 8.2 (Central Limit Theorem (CLT)) Let \(X_1, X_2, \dots, X_n\) be an iid sample with \(E(X_i)=\mu\) and \(Var(X_i) = \sigma^2.\) Then the distribution of \(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\) approaches a standard normal distribution as \(n\) approaches \(\infty\). That is, the sampling distribution of \(\overline{X}\) is approximately normal with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\) for sufficiently large \(n\).


Usually, when the \(X_i\)’s are numerical, \(n\geq 30\) yields a good approximation. When the \(X_i\)’s take the values 0 and 1, we use the success-failure condition instead. The success-failure condition states that if the expected number of successes and failures, \(np\) and \(n(1-p)\), are sufficiently large (usually \(\geq 10\) suffices, but \(\geq 5\) is also commonly used), then the binomial random variable \(\sum X_i\) can be approximated by a normal distribution with mean \(np\) and variance \(n p(1-p).\) In that case, the CLT states that the sampling distribution of \(\hat{p}\) is approximately normal with mean \(p\) and standard deviation \(\sqrt{\frac{p(1-p)}{n}}\).

Note: Since \(\sum X_i = n\overline{X}\), the CLT also implies that the sum of iid random variables is approximately normal for large enough \(n\).

In summary, the Central Limit Theorem gives an approximate sampling distribution for \(\overline{X}\) and \(\hat{p}\) when the sample is “large enough”. Many tools for statistical inference are derived from this theorem, making it one of the most important in the theory of statistics.
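
Before looking at examples, here is a small simulation sketch of the CLT in action. It takes sample means of \(n = 40\) draws from a right-skewed exponential distribution with rate 1 (so \(\mu = 1\) and \(\sigma = 1\); the choices are arbitrary) and compares their histogram with the normal approximation:

set.seed(4)
xbar <- replicate(10000, mean(rexp(40, rate = 1)))
hist(xbar, breaks = 40, freq = FALSE)
curve(dnorm(x, mean = 1, sd = 1/sqrt(40)), add = TRUE, lwd = 2)  # CLT approximation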

Example 8.3 In the 2022-2023 academic year, 20.5% of the students enrolled at Colorado College were from Colorado. You want to randomly sample 40 students for a class project and, among other questions, you want to ask them what their home state is. What is the probability that at least 15% of the students in your sample will be from Colorado?

Denote by \(\hat{p}\) the proportion of students in your sample who are from Colorado. We can use the CLT to address this question if \(n\) and \(p\) pass the success-failure condition.

Expected number of students from CO (success): \(np = 40\times 0.205 = 8.2.\)

Expected number of students not from CO (failure): \(n(1-p) = 40\times (1-0.205) = 31.8.\)

Since both expected counts are large enough (\(\geq 5\)), by the CLT we can approximate the sampling distribution of \(\hat{p}\) with a normal distribution with mean 0.205 and standard deviation \(\sqrt{\frac{0.205(1-0.205)}{40}} = 0.06383.\)

This means that \(P(\hat{p} \geq 0.15) = 1-P(\hat{p} < 0.15) \approx 1 - pnorm(0.15, 0.205, 0.06383) = 0.8056.\) That is, the probability that at least 15% of the students in your sample are from Colorado is approximately 0.8056.
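
The same calculation can be done in R:

se <- sqrt(0.205 * (1 - 0.205) / 40)    # standard deviation of p-hat
1 - pnorm(0.15, mean = 0.205, sd = se)  # P(p-hat >= 0.15), approximately 0.8056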

Example 8.4 After 1964, quarters in the United States were manufactured so that their weights have a mean of \(5.67 g\) and a standard deviation of \(0.06 g.\) You have 35 quarters in a bag and some free time, so you decide to weigh your quarters with a small scale. What is the probability that the average weight of your coins is within \(0.01 g\) of \(5.67 g\)? That is, what is \(P(5.66 \leq \overline{X} \leq 5.68)\)?

Since \(n=35\) is “large enough” (\(\geq 30\)), the CLT says that the sampling distribution of \(\overline{X}\) is approximately normal with mean 5.67 and standard deviation \(0.06/\sqrt{35} = 0.01014\). This gives

\[\begin{eqnarray} P(5.66 \leq \overline{X} \leq 5.68) &=& P(\overline{X} \leq 5.68) - P(\overline{X} < 5.66) \\ &\approx& pnorm(5.68, 5.67, 0.01014) - pnorm(5.66, 5.67, 0.01014) \\ &=& 0.676. \end{eqnarray}\]
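
In R:

se <- 0.06 / sqrt(35)  # standard deviation of X-bar
pnorm(5.68, mean = 5.67, sd = se) - pnorm(5.66, mean = 5.67, sd = se)  # approximately 0.676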

8.4 t distribution

In example 8.4, we were able to use a given standard deviation for the population to approximate the sampling distribution of \(\overline{X}\). In practice, however, the population standard deviation \(\sigma\) is usually unknown and must be estimated from the sample, which motivates the t distribution. The t distribution (also known as Student’s t distribution) is a continuous distribution that resembles the standard normal one, but its PDF has thicker tails. It has only one parameter, its degrees of freedom. It is derived from the normal distribution in the following way:

Consider a sample \(X_1, X_2, \dots, X_n\) from a normal population with mean \(\mu\) and standard deviation \(\sigma\). That is \(X_i \sim N(\mu, \sigma)\). If we approximate \(\sigma\) with the sample standard deviation \(S = \sqrt{\frac{\sum (X_i-\overline{X})^2}{n-1}}\), then \(\frac{\overline{X} - \mu}{S/\sqrt{n}}\) has a t distribution with \(n-1\) degrees of freedom. We use the notation: \[\frac{\overline{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}.\]
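
A small simulation sketch (arbitrary values \(\mu = 5\), \(\sigma = 2\), and \(n = 10\)) shows that this statistic, computed from normal samples, follows the \(t_{n-1}\) density:

set.seed(5)
n <- 10
tstat <- replicate(10000, {
  x <- rnorm(n, mean = 5, sd = 2)
  (mean(x) - 5) / (sd(x) / sqrt(n))
})
hist(tstat, breaks = 60, freq = FALSE, xlim = c(-5, 5))
curve(dt(x, df = n - 1), add = TRUE, lwd = 2)  # t density with 9 degrees of freedom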

The PDF of a t distribution with \(df\) degrees of freedom is given by:

\[f(x) = \frac{\Gamma(\frac{df+1}{2})}{\sqrt{df\pi}\Gamma(\frac{df}{2})}\left(1+\frac{x^2}{df}\right)^{-\frac{df+1}{2}}, \quad -\infty < x < \infty,\]

where \(\Gamma\) is the gamma function19.
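
As a check, the formula can be evaluated directly with R’s gamma() function and compared with dt() (the values of x and df below are arbitrary):

df <- 5; x <- 1.3
gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2)) * (1 + x^2 / df)^(-(df + 1) / 2)
dt(x, df = df)  # same value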

Notation: We use the notation \(X \sim t_{df}\) to say that \(X\) has a t distribution with \(df\) degrees of freedom.

In R, the PDF, CDF, and inverse CDF of a t distribution are the functions dt(x, df), pt(x, df), and qt(x, df). The higher the degrees of freedom, the closer the t distribution is to the standard normal, as illustrated in the PDF plots below.

library(tidyverse)
ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(aes(color = "normal"), fun = dnorm, args = list(mean = 0, sd = 1), linewidth = 1.5) + 
  stat_function(aes(color = "df = 2"), fun = dt, args = list(df = 1)) + 
  stat_function(aes(color = "df = 5"), fun = dt, args = list(df = 5)) + 
  stat_function(aes(color = "df = 9"), fun = dt, args = list(df = 10))

When to use a t distribution?

When a numerical sample size is not large enough for the CLT (\(n < 30\)) and the sample doesn’t show severe skew or extreme outliers (that is, it’s reasonable to think that the sample could have come from a normal distribution), then the t distribution can be used to approximate the sampling distribution of \(\frac{\overline{X} - \mu}{S/\sqrt{n}}.\) Even for large samples (\(n \geq 30\)), since the t distribution approaches the standard normal as \(n\) increases, it is common to default to using the t distribution any time we use \(S\) to approximate \(\sigma\).
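
For reference, t probabilities and quantiles are computed just like their normal counterparts (the degrees of freedom below are arbitrary):

pt(1.5, df = 19)    # P(T <= 1.5) for T ~ t with 19 degrees of freedom
qt(0.975, df = 19)  # 97.5th percentile, about 2.09 (compare with qnorm(0.975), about 1.96)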

8.5 Simulation techniques

8.5.1 Bootstrap

The bootstrap method is a technique that can be used to construct approximate sampling distributions.

As an example, suppose that we want to estimate the sampling distribution of the slope \(\hat{b}\) of the least squares line that predicts the interest rate in loans from debt-to-income ratio (see the loans dataset introduced in chapter 4). The slope and intercept for these data are:

library(openintro)
loans <- loans_full_schema
lm(interest_rate ~ debt_to_income, data = loans)
## 
## Call:
## lm(formula = interest_rate ~ debt_to_income, data = loans)
## 
## Coefficients:
##    (Intercept)  debt_to_income  
##       11.51145         0.04718

To construct a sampling distribution for \(\hat{b}\) using a simulation approach, we “collect” many samples of size 10000 and then calculate \(\hat{b}\) for each sample. We think of this collection of values of \(\hat{b}\) as an approximate sampling distribution for \(\hat{b}\). But how do we collect many samples if we don’t have access to the population? It turns out that we can treat the original sample as the population and sample from the original sample with replacement. This resampling technique is called bootstrap and a sample obtained in this way is called a bootstrap sample. It may seem counterintuitive that sampling from a sample can yield useful information, but indeed one can approximate the sampling distribution of an estimate in that way. The mathematical theory that backs up this property has been well developed since its initial dissemination in 1979. Its main contributor, Bradley Efron, received the International Prize in Statistics, which is one of the two highest honors in statistics, in 2019 “for the bootstrap”.

For instance, we can obtain one bootstrap sample from the loans dataset (which has 10000 rows) by using the function sample to select which cases will be part of a bootstrap sample. In the code below, we generate 10000 numbers by sampling from the list of numbers 1 to 10000 with replacement, which means that a number (and therefore a case) can be selected more than once. The line bs[1:10] shows the first 10 numbers selected.

bs <- sample(1:10000, replace = TRUE)
bs[1:10]
##  [1] 4986 6091 5296 7635 4561 9444 4037 3421 2631 7570

The dataset loans[bs,] is then a bootstrap sample. Let’s construct a linear model to predict interest_rate from debt_to_income using this bootstrap sample:

lm(interest_rate ~ debt_to_income, data = loans[bs,])
## 
## Call:
## lm(formula = interest_rate ~ debt_to_income, data = loans[bs, 
##     ])
## 
## Coefficients:
##    (Intercept)  debt_to_income  
##       11.67827         0.03827

Notice that the values in the summary are slightly different from those obtained with the original sample. We can repeat this process several times (say, one thousand times) and save the coefficients generated by the LS model. In the code below, we record 1000 slopes generated by 1000 bootstrap samples.

set.seed(217)
b <- numeric(1000)
for (i in 1:1000) {
  bs <- sample(1:10000, replace = TRUE)
  m <- lm(interest_rate ~ debt_to_income, data = loans[bs,])
  b[i] <- m$coefficients[2]
}

Let’s look at a histogram of the slopes generated:

ggplot(data.frame(b = b), aes(x = b, y = ..density..)) + 
  geom_histogram(alpha = 0.3, color = "black") +
  geom_density()

These values approximate the sampling distribution of \(\hat{b}\), which are the possible values we can get for \(\hat{b}\) when sampling 10000 loans from the population. Could this sampling distribution be approximated with a normal distribution? We’ll address this question in a later chapter.
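
For instance, the bootstrap slopes stored in b can be summarized with a standard deviation (a bootstrap standard error) and percentiles:

sd(b)                         # bootstrap standard error of the slope
quantile(b, c(0.025, 0.975))  # middle 95% of the bootstrap slopes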


  16. In this course, we don’t go over methodological details of sampling techniques. Introductory information on sampling techniques can be found in OpenIntro Statistics, and more detailed coverage can be found in textbooks/chapters on methodology for specific disciplines.

  17. We have also used \(\mu\) and \(\sigma\) as the mean and standard deviation of a normal distribution. In this chapter we use them to represent population parameters.

  18. A proof of the CLT is beyond the scope of this course. A course in probability or mathematical statistics should cover this proof.

  19. The formula for the gamma function is \(\Gamma(x) = \int_0^\infty u^{x-1}e^{-u}du.\)