16 Day 16
Review
Every random variable has a probability distribution
Since every statistic is a random variable:
- Every statistic has a probability distribution
The probability distribution of a sample statistic is the sampling distribution
Let \(\bar{x}\) be the mean of a random sample of size \(n\), drawn from a population with mean \(\mu\) and standard deviation \(\sigma\)
Since \(\bar{x}\) is a random variable, it has a mean and a standard deviation
- The mean of \(\bar{x}\) is \(\mu\). That is,
\[\mu_{\bar{x}} = \mu = \text{population mean}\]
- The standard deviation of \(\bar{x}\) is \(\sigma / \sqrt{n}\). That is,
\[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\text{population std. deviation}}{\sqrt{\text{sample size}}}\]
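The standard deviation formula follows from the usual variance rules. A short sketch, assuming the sample values \(x_1, \dots, x_n\) are independent draws from the population (the notation \(x_i\) and \(\text{Var}\) is introduced here for illustration):
\[\text{Var}(\bar{x}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\text{Var}(x_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}\]
\[\sigma_{\bar{x}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\]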
Sampling Distribution
Given a normally distributed population \(Y \sim N(\mu,\sigma^2)\)
Sample \(X \sim N(\mu_X,\sigma^2_X)\)
Sample mean \(\bar{x} \sim N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\)
Where: \(\mu_{\bar{x}} = \mu\)
And: \({\sqrt{\sigma^2_{\bar{x}}}}=\sigma_{\bar{x}} = {\sigma \over \sqrt{n}}\)
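The intuition behind this may not be self-evident, but it is easy to visualize: averaging \(n\) observations smooths out individual extremes, so sample means cluster more tightly around \(\mu\) than single observations do.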
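A small simulation can make this concrete. The sketch below assumes a normal population with mean 20 and standard deviation 4 and samples of size 25 (values chosen for illustration); the seed and the number of repetitions are arbitrary choices, not part of the notes.

```python
import random
import statistics

random.seed(16)  # arbitrary seed so the run is reproducible

MU, SIGMA, N = 20, 4, 25   # population mean, population sd, sample size
REPS = 10_000              # number of repeated samples (arbitrary)

# Draw REPS samples of size N and record the mean of each one.
sample_means = [
    statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(REPS)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"mean of sample means: {mean_of_means:.3f}  (theory: {MU})")
print(f"sd of sample means:   {sd_of_means:.3f}  (theory: {SIGMA / N ** 0.5})")
```

The simulated values should land close to \(\mu = 20\) and \(\sigma/\sqrt{n} = 0.8\), though they will not match exactly because the simulation is itself random.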
Example:
Suppose we take a simple random sample of size 25 from a normal population with a mean of 20 and a standard deviation of 4.
a. What is the distribution of \(\bar{x}\)?
\[\bar{x} \sim N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\]
\[\mu_{\bar{x}} = 20 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{4}{\sqrt{25}} = 0.8\] \[\bar{x} \sim N(20, 0.8^2)\]
b. Find the probability that we will observe a sample mean over 22.
\[\text{"Over"} \quad \Rightarrow \quad P(\bar{x} > 22)\] \[= P \left(Z > \frac{22 - 20}{0.8}\right) = P(Z > 2.50)\]
\[= 1 - P(Z < 2.50) = 1 - 0.9938 = 0.0062 \quad \text{(z-table)}\]
c. Find the 95th percentile of \(\bar{x}\).
Look up 0.95 in the body of z-table:
\[z = 1.64 \quad \text{or} \quad z = 1.65\]
\[\text{take the midpoint} \quad z \approx 1.645\]
Convert \(z\) to \(\bar{x}\) as follows:
\[\bar{x} = \mu_{\bar{x}} + z \sigma_{\bar{x}} = 20 + 1.645(0.8) = 21.316\]
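The table lookups in this example can be checked in software. This is a minimal sketch using `NormalDist` from Python's standard library; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, n = 20, 4, 25

# Sampling distribution of the sample mean: N(mu, (sigma / sqrt(n))^2)
xbar = NormalDist(mu, sigma / n ** 0.5)

p_over_22 = 1 - xbar.cdf(22)    # part (b): P(x-bar > 22)
pct_95 = xbar.inv_cdf(0.95)     # part (c): 95th percentile of x-bar

print(round(xbar.stdev, 2))     # 0.8
print(round(p_over_22, 4))      # 0.0062
print(round(pct_95, 3))         # 21.316
```

`inv_cdf` uses the unrounded \(z \approx 1.6449\) rather than the table midpoint 1.645, which makes no difference at three decimal places here.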
What if the population we are sampling from isn’t normal?
- It would be convenient if we could still treat \(\bar{x}\) as a normal random variable
The Central Limit Theorem lets us do that under certain assumptions
Central Limit Theorem
Let \(\bar{x}\) be the mean of a large random sample (\(n > 30\)) from any population
- With mean \(\mu\) and standard deviation \(\sigma\)
The distribution of \(\bar{x}\) is approximately normal
Mean \(\mu_{\bar{x}} = \mu\)
Standard deviation \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\).
If \(n\) is large enough, we have:
\[\bar{x} \sim N(\mu, \frac{\sigma^2}{n})\]
- Regardless of the original population’s distribution
How large does \(n\) need to be?
This is an ongoing debate in statistics
As the skew of the population distribution increases, the \(n\) required for a good normal approximation increases
As a general rule of thumb, \(n > 30\) should be sufficient
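To see how skew interacts with sample size, here is a rough simulation sketch. It assumes a strongly right-skewed population (exponential with rate 1, chosen only as an example) and measures the skewness of the simulated sample means for a few sample sizes. The helper names `skewness` and `sample_mean_skew`, the seed, and the repetition count are all illustrative.

```python
import random
import statistics

random.seed(30)  # arbitrary seed so the run is reproducible

def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric shape)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def sample_mean_skew(n, reps=5000):
    """Skewness of `reps` sample means, each computed from n exponential draws."""
    means = [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]
    return skewness(means)

skews = {n: sample_mean_skew(n) for n in (2, 30, 300)}
for n, g in skews.items():
    print(f"n = {n:>3}: skewness of sample means = {g:.2f}")
```

The skewness of the sample means should shrink toward 0 as \(n\) grows, but for a population this skewed it is still visibly positive at \(n = 30\), which is why the rule of thumb is only a rule of thumb.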
Example:
Recent data from the U.S. Census indicates that the mean age of college students is \(\mu = 25\) years, with a standard deviation of \(\sigma = 9.5\) years. A simple random sample of 125 students is drawn. If \(\bar{x} =\) the sample mean age of the students, what is the distribution of \(\bar{x}\)? (Justify your answer.)
Since \(n > 30\):
- \(\bar{x} \approx N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\)
\[\mu_{\bar{x}} = 25 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{9.5}{\sqrt{125}} \approx 0.85.\]
So:
\[\bar{x} \sim N(25, 0.85^2)\]
Example:
The Internal Revenue Service reports that the mean federal income tax paid in a recent year was \(\$8000\). Assume that the standard deviation is \(\$5000\). The IRS plans to draw a sample of \(625\) tax returns to study the effect of a new tax law.
Let \(\bar{x} =\) the mean tax for the \(625\) sampled tax returns
- Then \(\bar{x} \approx N(8000, 200^2)\) by the CLT
a. What is the probability that the sample mean tax is between \(7600\) and \(7900\)?
\[P(7600 < \bar{x} < 7900) \approx P\left(\frac{7600 - 8000}{200} < Z < \frac{7900 - 8000}{200}\right)\]
\[= P(-2 < Z < -0.5) = 0.3085 - 0.0228 = 0.2857 \quad \text{(z-table)}\]
b. Would it be unusual if the sample mean were less than \(7500\)?
\[P(\bar{x} < 7500) \approx P\left(Z < \frac{7500 - 8000}{200}\right) = P(Z < -2.5) = 0.0062 \quad \text{(z-table)}\]
Yes, because \(P(\bar{x} < 7500) < 1\%\)
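The same two probabilities can be computed without a table. A minimal sketch using the standard library's `NormalDist`, with the normal shape justified by the CLT as in the example; variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, n = 8000, 5000, 625

# By the CLT, x-bar is approximately N(8000, 200^2), since 5000 / sqrt(625) = 200.
xbar = NormalDist(mu, sigma / n ** 0.5)

p_between = xbar.cdf(7900) - xbar.cdf(7600)   # part (a)
p_below = xbar.cdf(7500)                      # part (b)

print(f"P(7600 < x-bar < 7900) = {p_between:.4f}")
print(f"P(x-bar < 7500)        = {p_below:.4f}")
```

The software answer for part (a) can differ from the table answer in the fourth decimal place because the table values are themselves rounded.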
Population Proportion
Proportions are a useful way to summarize information about a population or a sample while losing very little nuance:
Proportions are just percentages of the population
We’ve dealt with this a lot
Say the percentage of the population who participate in early voting is \(60\%\)
\({60\over 100}=0.60\)
The proportion of the population who early vote is \(p=0.60\)
If we poll a sample of 100 Manhattan residents and find that \(31\%\) early vote:
- The proportion of our sample who early vote, \(\hat{p}=0.31\)
Just like every other statistic, sample proportions are random variables
- So their distribution is the sampling distribution of the proportion
All of our previous rules and ideas apply
As we take repeated samples from our population, we will see that \(\hat{p}\) varies from sample to sample
The larger the sample, the closer \(\hat{p}\) tends to be to the true value \(p\)
- Mean of the sample proportion \(\hat{p}\) is:
\[\mu_{\hat{p}} = p \quad \text{(population proportion)}\]
- Standard deviation of sample proportion \(\hat{p}\) is:
\[\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\]
The Central Limit Theorem will tell us the “shape” of the distribution of \(\hat{p}\)
Proportion Central Limit Theorem
If \(np \geq 10\) and \(n(1 - p) \geq 10\)
Distribution of \(\hat{p}\) is approximately normal
Mean \(\mu_{\hat{p}} = p\)
Standard deviation \(\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\)
So:
\[\hat{p} \sim N \left( p, \frac{p(1 - p)}{n} \right)\]
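A quick simulation can be used to check the mean and standard deviation formulas for \(\hat{p}\). The sketch below assumes \(p = 0.30\) and \(n = 100\), which are illustrative values rather than numbers from the notes; the seed and repetition count are arbitrary.

```python
import random
import statistics

random.seed(92)  # arbitrary seed so the run is reproducible

p, n, reps = 0.30, 100, 5000   # illustrative values, not from the notes

# Rule of thumb for the normal approximation: np >= 10 and n(1 - p) >= 10.
clt_ok = n * p >= 10 and n * (1 - p) >= 10

# Each sample proportion is the fraction of n yes/no draws that come up "yes".
p_hats = [
    sum(random.random() < p for _ in range(n)) / n
    for _ in range(reps)
]

sim_mean = statistics.fmean(p_hats)
sim_sd = statistics.stdev(p_hats)
theory_sd = (p * (1 - p) / n) ** 0.5

print(f"conditions met: {clt_ok}")
print(f"mean of p-hat:  {sim_mean:.4f}  (theory: {p})")
print(f"sd of p-hat:    {sim_sd:.4f}  (theory: {theory_sd:.4f})")
```

The simulated mean and standard deviation should be close to \(p\) and \(\sqrt{p(1-p)/n}\), up to simulation noise.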
Example:
According to a Harris Poll, chocolate is the favorite ice cream flavor for 27% of Americans. If a sample of 100 Americans is taken, what is the probability that the sample proportion of those who prefer chocolate is greater than 0.30?
Since \(np = (100)(0.27) = 27 \geq 10\) and \(n(1 - p) = (100)(0.73) = 73 \geq 10\), we can apply the CLT. By the CLT, the distribution of \(\hat{p}\) is approximately normal with:
\[\mu_{\hat{p}} = 0.27 \quad \text{and} \quad \sigma_{\hat{p}} = \sqrt{\frac{0.27(1 - 0.27)}{100}} \approx 0.0444\]
Then,
\[P(\hat{p} > 0.30) \approx P\left( Z > \frac{0.30 - 0.27}{0.0444} \right)\]
\[= P(Z > 0.68)\]
\[= 1 - P(Z < 0.68)\]
\[= 1 - 0.7517 = 0.2483\]
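The steps of this example can be sketched in code as follows. `phat_distribution` is a hypothetical helper written for these notes, not a library function; it checks the two conditions before returning the approximating normal distribution.

```python
from statistics import NormalDist

def phat_distribution(p, n):
    """Approximate normal sampling distribution of p-hat (hypothetical helper)."""
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("np and n(1 - p) must both be at least 10")
    return NormalDist(p, (p * (1 - p) / n) ** 0.5)

phat = phat_distribution(p=0.27, n=100)
z = (0.30 - phat.mean) / phat.stdev
p_greater = 1 - phat.cdf(0.30)

print(f"standard deviation of p-hat: {phat.stdev:.4f}")
print(f"z = {z:.4f}")
print(f"P(p-hat > 0.30) = {p_greater:.4f}")
```

Because the code keeps the unrounded \(z\) instead of rounding to 0.68 for the table, its probability differs slightly from 0.2483 in the third decimal place. Both are approximations of the same quantity.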
We’ve studied point estimates (single-number estimates such as the sample mean and the sample proportion) as a way to estimate population parameters
A point estimate is reported as one fixed value, with no indication of its uncertainty
- Statistics deals with probabilistic results
It would be more informative to provide a range of values
We generally call these confidence intervals, and we’ll talk about them in more depth next lecture