10 Sampling Distributions

The sampling distribution of a statistic S is its probability distribution:

Definition: Let \(\small X=(x_1, x_2, x_3, ..., x_n)\) be a random sample and \(\small S(x)\) some statistic computed from that sample (for example, the mean). The probability distribution of \(\small S(x)\) is called the sampling distribution of S. There are three basic approaches:

  • simulations
  • exact calculations (e.g. by exhaustive enumeration or formulas)
  • formula approximations \(\mapsto\) traditional hypothesis testing

10.1 Simulated sampling distributions

To simulate the sampling distribution of a statistic, we use R to generate many samples, compute the statistic of interest from each sample, and then plot the empirical distribution. Example: Toss a fair coin 10 times and record the number of times \(\small H\) that heads comes up. Repeat many times. If you plot the results, you have an approximation to the sampling distribution of \(\small H\). Below is the code for N = 10000 simulated repetitions. To make the calculations easy, we set Head = 1.

N <- 10000
count <- numeric(N)
for (i in 1:N)
  count[i] <- sum(sample(c(0,1), 10, replace=TRUE))

hist(count, breaks=seq(-0.5,10.5,1), freq=FALSE,xlab="Number of Heads", main="Empirical probability function",ylab="Probability")
plot(ecdf(count),xlab="Number of Heads", main="Empirical cumulative probability function",ylab="Probability")

Just for practice, here is the empirical probability function graphed using ggplot.

library(ggplot2)

N <- 10000
count <- numeric(N)
for (i in 1:N)
  count[i] <- sum(sample(c(0,1), 10, replace=TRUE))
countdata <- data.frame(count)
ggplot(data=countdata, aes(x=count, y=after_stat(density))) +
  geom_histogram(binwidth=1, color="grey", fill="white") +
  labs(title="probability function, sampling distribution", x="number of heads", y="probability")

If you want the exact or theoretical sampling distribution, you need to make some assumptions. In the case of the coin flip, we might assume that the coin is fair, never lands on its edge, and that the flips are independent and identical. If those assumptions are met, we have a binomial distribution with \(n=10\) and \(p=q=0.5\).
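Under those assumptions, `dbinom` gives the exact probabilities, and we can compare them with the simulated relative frequencies. Here is a quick check (a sketch that simply reuses the simulation code from above):

```r
# Exact binomial probabilities for the number of heads in 10 fair flips
exact <- dbinom(0:10, size = 10, prob = 0.5)

# Re-run the simulation and tabulate relative frequencies
N <- 10000
count <- numeric(N)
for (i in 1:N)
  count[i] <- sum(sample(c(0, 1), 10, replace = TRUE))
empirical <- as.numeric(table(factor(count, levels = 0:10)) / N)

# Exact and empirical probabilities side by side
round(rbind(exact, empirical), 3)
```

With N = 10000 repetitions the empirical frequencies should agree with the exact probabilities to within a couple of hundredths.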

Definition The standard deviation of the sampling distribution is called the standard error.
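For the coin-flip example the standard error is easy to compute exactly: \(\small H\) is binomial with \(n=10\) and \(p=0.5\), so its standard deviation is \(\sqrt{npq} \approx 1.58\). A short sketch comparing this to the simulated value:

```r
# Theoretical standard error of H ~ Binomial(10, 0.5): sqrt(n*p*q)
se.theoretical <- sqrt(10 * 0.5 * 0.5)

# Simulated standard error: the sd of the simulated sampling distribution
N <- 10000
count <- replicate(N, sum(sample(c(0, 1), 10, replace = TRUE)))
se.simulated <- sd(count)

c(theoretical = se.theoretical, simulated = se.simulated)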

Another example An exponential distribution with rate parameter \(\small \lambda > 0\) has density function \[\small f(x) = \lambda e^{-\lambda x}\] for \(\small x \ge 0\), with mean \(\small 1/\lambda\) and variance \(\small 1/\lambda^2\). Let’s say we want to simulate the sampling distribution of the estimate of \(\lambda\) in samples drawn from an exponential distribution with rate \(\lambda = 3\). We will use 1000 samples of size 100. (Since the mean is \(1/\lambda\), we estimate \(\lambda\) by the reciprocal of the sample mean.)

N <- 1000
lambda <- 3
s.means <- numeric(N)
for (i in 1:N)
  s.means[i] <- 1/mean(rexp(100,rate=lambda))
hist(s.means,freq=FALSE,xlab="lambda", xlim=c(1,5),ylab="probability", main="Density function, sampling distribution of lambda")
plot(ecdf(s.means),xlab="lambda", main="Empirical cumulative probability distribution function",ylab="Probability")

Yet another example Simulate the distribution function for the minimum of a sample of size 12 drawn from a uniform distribution on [1,10].

N <- 1000

sample.min <- numeric(N)
for (i in 1:N)
  sample.min[i] <- min(runif(12,1,10))

hist(sample.min,freq=FALSE, xlab="sample min", ylab="probability", main="Density function, sampling distribution of sample minimum")
plot(ecdf(sample.min),xlab="sample min", main="Empirical cumulative probability distribution function",ylab="Probability")

10.2 Exact sampling distributions

Sometimes it is possible to determine sampling distributions exactly.

Example Assume \(\small X_i\) ~ i.i.d. Unif[0,1], \(i=1,...,12\) (This means that we have twelve random variables \(\small X_1\) through \(\small X_{12}\) that are independent and identically distributed according to a uniform distribution on the interval [0,1]). Let’s compute the cumulative probability distribution for the maximum of all twelve \(\small X_i\).

First, convince yourself that for any \(\small X_i\), \(\small P(X_i \le x) = x\) for \(\small 0 \le x \le 1\).

The cdf \(\small F(x)\) is computed as: \(\small F(x)\)

\(\small = P(max (X_1,X_2,X_3,...,X_{12}) \le x)\)

\(\small = P(X_1 \le x, X_2 \le x, ..., X_{12} \le x)\)

\(\small = P(X_1 \le x) \cdot P(X_2 \le x) \cdots P(X_{12} \le x)\) by independence

\(\small = x^{12}\)

The density function \(\small f(x)\) is given/defined as \(\small f(x)= F'(x)\), so \(\small f(x)= 12 \cdot x^{11}\).
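As a sanity check on \(\small f(x)= 12 \cdot x^{11}\): it integrates to 1 over [0,1], and the expected maximum is \(\small \int_0^1 x \cdot 12x^{11}\,dx = 12/13 \approx 0.923\). Both are easy to confirm numerically in R:

```r
# Density of the maximum of 12 i.i.d. Unif[0,1] variables
f <- function(x) 12 * x^11

# The density integrates to 1 over [0, 1]
integrate(f, 0, 1)$value

# E[max] = integral of x * f(x) over [0, 1] = 12/13
integrate(function(x) x * f(x), 0, 1)$value

# Compare with the simulated mean of the sample maximum
mean(replicate(10000, max(runif(12, 0, 1))))
```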

Let’s compare the exact cumulative probability function found above to the cumulative probability function found by simulation:

N <- 100 
s.max <- numeric(N)
for (i in 1:N)
  s.max[i] <- max(runif(12,0,1))
y <- ecdf(s.max)
plot(y,lty=1, main = "empirical and exact F(X), sample maximum",xlab="sample maximum",ylab="F(x)")

legend('topleft',legend=c("exact F(x)","simulated F(x)"),fill=c("red","black"))
x <- seq(0,1,0.01)
y <- x^12
lines(x, y, col="red")

The above example is a special case of this theorem:

Theorem Suppose we have continuous random variables \(\small X_1, X_2, ..., X_n\) that are i.i.d. with pdf \(\small f(x)\) and cdf \(\small F(x).\) Define \[\small X_{min} = min ( X_1,X_2,X_3,...,X_n) \text{ and } \small X_{max} = max ( X_1,X_2,X_3,...,X_n)\] Then the pdf’s for \(\small X_{min}\) and \(\small X_{max}\) are \[ \small f_{min}(x) = n \cdot (1-F(x))^{n-1} \cdot f(x)\] and \[ \small f_{max}(x) = n \cdot ( F(x) ) ^{n-1} \cdot f(x)\]
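Applied to the earlier Uniform[1,10] minimum example, \(\small F(x) = (x-1)/9\) and \(\small f(x) = 1/9\) on [1,10], so the theorem gives \(\small f_{min}(x) = 12 \cdot (1-(x-1)/9)^{11} \cdot (1/9)\). A sketch overlaying this exact density on the simulated histogram, reusing the simulation code from above:

```r
# Simulate the minimum of samples of size 12 from Unif[1,10]
N <- 1000
sample.min <- numeric(N)
for (i in 1:N)
  sample.min[i] <- min(runif(12, 1, 10))

hist(sample.min, freq = FALSE, xlab = "sample min", ylab = "probability",
     main = "Simulated and exact density of the sample minimum")

# Exact density from the theorem: n = 12, F(x) = (x-1)/9, f(x) = 1/9
x <- seq(1, 10, 0.01)
lines(x, 12 * (1 - (x - 1) / 9)^11 / 9, col = "red")
```

The red curve should track the histogram closely, with most of the mass near 1.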

10.3 Assignment

Assume \(x_1, x_2, x_3\) are independent discrete uniform integer random variables with \(0 \le x_i \le 10\). Find the simulated and the exact sampling distribution of \(z=\max\{x_1, x_2, x_3\}\).

10.4 Theoretical approximations

We cover this topic in detail later. Much of traditional hypothesis testing makes use of the fact that, for large enough samples, the distribution of a certain parameter such as the mean can be approximated by a known distribution. One very useful theorem is the Central Limit Theorem:

10.4.1 Central limit theorem (CLT)

The central limit theorem states that, under certain conditions, the distribution of the standardized sample mean approaches the standard normal distribution: Let \(\small X_1, X_2, X_3, ...,X_n\) be independent and identically distributed random variables with finite mean \(\small \mu\) and finite variance \(\small \sigma^2\), and let \(\overline{X}_n = \frac{1}{n}\sum_{i=1}^n \small X_i\) denote the sample mean. Then, for any constant z, \[ \lim_{n \rightarrow \infty } P\left( \frac{ \overline{X}_n - \mu}{\frac{\sigma}{\sqrt{n}}} \le z\right)=\Phi(z)\] where \(\small \Phi(z)\) is the cumulative distribution function of the standard normal distribution. We also write \[ \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \rightarrow \mathcal{N} (0,1).\]

Using properties of the standard deviation and mean, we get

\[ \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \rightarrow \mathcal{N} (0,1)\] \[\Rightarrow (\overline{X}_n - \mu) \rightarrow \mathcal{N} (0,\frac{\sigma}{\sqrt{n}})\] \[\Rightarrow \overline{X}_n \rightarrow \mathcal{N} (\mu,\frac{\sigma}{\sqrt{n}}).\]

In other words, the distribution of \(\overline{X}_n\) approaches a normal distribution with the same mean \(\mu\) as the \(\small X_i\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\).

A simulation of the CLT using R

We demonstrate the central limit theorem, first using 100 samples of size 4, then 1000 samples of size 100, drawn from a uniform [2,5] distribution.

First, we define the sample size and number of samples to use.

group.size <- 4
number.samples <- 100

Next, we generate the sample data.

a <- 2
b <- 5
sample.data <- runif(group.size*number.samples, a,b)

We set up a storage vector, compute the means, and store them.

store.result <- numeric(number.samples)

for (i in 1:number.samples) {
  current_sample  <- seq((i-1)*group.size+1 , i*group.size)
  store.result[i] <- mean(sample.data[current_sample])
}

Comparison of means

theoretical.mean <- (a+b)/2
mean_sample_means <- mean(store.result)
print(paste("The theoretical mean of the sample means is ",theoretical.mean))
#> [1] "The theoretical mean of the sample means is  3.5"
print(paste("The empirical mean of the sample means is ",mean_sample_means))
#> [1] "The empirical mean of the sample means is  3.46081261865736"

Comparison of variances

theoretical.var <- (b-a)^2/12/group.size
variance_sample_means <- var(store.result)

print(paste("The theoretical variance of the sample means is ",theoretical.var))
#> [1] "The theoretical variance of the sample means is  0.1875"
print(paste("The empirical variance  of the sample means is ",variance_sample_means))
#> [1] "The empirical variance  of the sample means is  0.184631999437904"

Finally, we plot the ECDF of the sample means and of a normal distribution with the theoretical mean and standard deviation. You should see that the two ECDFs are similar but not identical; then again, we used only 100 samples, each of size 4.

plot(ecdf(store.result),lty=1,pch=NA, main = "empirical and exact F(X), sample means",xlab="sample mean",ylab="F(x)")
legend('topleft',legend=c("exact F(x)","empirical F(x)"),fill=c("red","black"))
x <- seq(2.5,5,0.1)
lines(x,pnorm(x,theoretical.mean, sqrt(theoretical.var)), col="red")

Another example Let’s generate 1000 samples of size 100 from a uniform [2,5] distribution and find the sampling distribution of the mean.

n <- 100
a <- 2
b <- 5
reps <- 1000
sample.means <- numeric(reps)
for (i in 1:reps){
  sample.means[i] <- mean(runif(n=n, min=a, max=b))
}
sample <- data.frame(sample.means)
ggplot(data=sample, aes(x=sample.means, y=pnorm(sample.means, mean=(a+b)/2, sd=(b-a)/sqrt(12*n)))) +
  geom_line(color="green") +
  stat_ecdf(geom="step", color="red") +
  labs(title = "cumulative distribution functions", subtitle="theoretical (green) and empirical (red)", y="distribution function")