Day 16

Announcements

  • Roughly 2 weeks until midterm 2

    • Start studying

    • Don’t make your grade an emergency

Review

  • All random variables have a probability distribution

    • As all statistics are random variables:

      • All statistics have a probability distribution


The probability distribution of a sample statistic is called its sampling distribution


Let \(\bar{x}\) be the mean of a random sample of size \(n\), drawn from a population with mean \(\mu\) and standard deviation \(\sigma\)

Since \(\bar{x}\) is a random variable, it has a mean and a standard deviation

  • The mean of \(\bar{x}\) is \(\mu\). That is,

\[\mu_{\bar{x}} = \mu = \text{population mean}\]

  • The standard deviation of \(\bar{x}\) is \(\sigma / \sqrt{n}\). That is,

\[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\text{population std. deviation}}{\sqrt{\text{sample size}}}\]

Sampling Distribution

Given a normal population \(Y \sim N(\mu,\sigma^2)\)

  • Sample \(X \sim N(\mu_X,\sigma^2_X)\)

  • Sample mean \(\bar{x} \sim N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\)

    • Where: \(\mu_{\bar{x}} = \mu\)

    • And: \({\sqrt{\sigma^2_{\bar{x}}}}=\sigma_{\bar{x}} = {\sigma \over \sqrt{n}}\)


The intuition behind this may not be self evident, but it’s easy to visualize:
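A short simulation can stand in for the picture. The following is a minimal sketch, not part of the original notes: the population values (\(\mu = 50\), \(\sigma = 10\), \(n = 25\)), the number of repetitions, and the helper name `simulate_sample_means` are all illustrative assumptions. It draws many samples, records each sample mean, and compares the spread of those means with \(\sigma / \sqrt{n}\).

```python
import random
import statistics


def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size n from N(mu, sigma^2); return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]


# Illustrative values only (assumed, not taken from the lecture).
mu, sigma, n = 50.0, 10.0, 25
means = simulate_sample_means(mu, sigma, n, reps=10_000)

# The mean of the sample means should land near mu,
# and their standard deviation near sigma / sqrt(n).
print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means:  ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):     ", sigma / n ** 0.5)
```

A histogram of `means` would look like a narrow bell curve centered at \(\mu\); increasing `n` makes it narrower.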


Example:

Suppose we take a simple random sample of size 25 from a normal population with a mean of 20 and a standard deviation of 4.

a. What is the distribution of \(\bar{x}\)?

\[\bar{x} \sim N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\]

\[\mu_{\bar{x}} = 20 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{4}{\sqrt{25}} = 0.8\]

\[\bar{x} \sim N(20, 0.8^2)\]


b. Find the probability that we will observe a sample mean over 22.

\[\text{"Over"} \quad \Rightarrow \quad P(\bar{x} > 22)\]

\[= P \left(Z > \frac{22 - 20}{0.8}\right) = P(Z > 2.50)\]

\[= 1 - P(Z < 2.50) = 1 - 0.9938 = 0.0062 \quad \text{(z-table)}\]


c. Find the 95th percentile of \(\bar{x}\).

Look up 0.95 in the body of the z-table:

\[z = 1.64 \quad \text{or} \quad z = 1.65\]

\[\text{take the midpoint} \quad z \approx 1.645\]

Convert \(z\) to \(\bar{x}\) as follows:

\[\bar{x} = \mu_{\bar{x}} + z \sigma_{\bar{x}} = 20 + 1.645(0.8) = 21.316\]
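The three parts of this example can be checked numerically. This is a sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative, and small differences from the hand calculation come from the z-table's rounding.

```python
from statistics import NormalDist

# Values from the example: normal population, mu = 20, sigma = 4, n = 25.
mu, sigma, n = 20, 4, 25

# (a) Sampling distribution of the sample mean: N(mu, (sigma / sqrt(n))^2).
xbar = NormalDist(mu=mu, sigma=sigma / n ** 0.5)

# (b) P(xbar > 22) = 1 - P(xbar < 22).
p_over_22 = 1 - xbar.cdf(22)

# (c) 95th percentile: the value with 95% of the distribution below it.
pct_95 = xbar.inv_cdf(0.95)

print("sd of xbar:     ", xbar.stdev)
print("P(xbar > 22):   ", round(p_over_22, 4))
print("95th percentile:", round(pct_95, 3))
```

`inv_cdf` plays the role of reading the z-table "backwards" and converting \(z\) to \(\bar{x}\) in one step.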


  • What if the population we are sampling from isn’t normal?

    • It would be convenient if we could still treat \(\bar{x}\) as a normal random variable
  • The Central Limit Theorem lets us do that under certain conditions



Central Limit Theorem

  • Let \(\bar{x}\) be the mean of a large random sample (\(n > 30\)) from any population

    • With mean \(\mu\) and standard deviation \(\sigma\)
  • The distribution of \(\bar{x}\) is approximately normal

    • Mean \(\mu_{\bar{x}} = \mu\)

    • Standard deviation \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\).


If \(n\) is large enough, we have:

\[\bar{x} \approx N\left(\mu, \frac{\sigma^2}{n}\right)\]

  • Regardless of the original population’s distribution


How large does \(n\) need to be?

  • This is an ongoing debate in statistics

    • As the skew of the population distribution increases, the sample size \(n\) we need increases

    • As a general rule of thumb, \(n > 30\) should be sufficient
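To see the effect of sample size on a skewed population, here is a rough simulation sketch. Everything in it is an assumption chosen for illustration: the population is Exponential(1) (strongly right-skewed, with \(\mu = 1\) and \(\sigma = 1\)), the sample sizes are 2, 30, and 200, and `skewness` and `sample_means` are hypothetical helpers. A skewness near 0 indicates a roughly symmetric, bell-like shape.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the average cubed z-score (about 0 if symmetric)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


def sample_means(n, reps, rng):
    """Means of `reps` samples of size n from an Exponential(1) population."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]


rng = random.Random(1)
results = {}
for n in (2, 30, 200):
    means = sample_means(n, reps=4000, rng=rng)
    results[n] = (
        statistics.fmean(means),   # should stay near mu = 1
        statistics.stdev(means),   # should track sigma / sqrt(n) = 1 / sqrt(n)
        skewness(means),           # should shrink toward 0 as n grows
    )
    print(n, [round(x, 3) for x in results[n]])
```

The mean and standard deviation follow the formulas at every \(n\); it is only the shape that needs a large sample to look normal.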



Example:

Recent data from the U.S. Census indicates that the mean age of college students is \(\mu = 25\) years, with a standard deviation of \(\sigma = 9.5\) years. A simple random sample of 125 students is drawn. If \(\bar{x} =\) the sample mean age of the students, what is the distribution of \(\bar{x}\)? (Justify your answer.)

  • Since \(n > 30\):

    • \(\bar{x} \approx N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\)

\[\mu_{\bar{x}} = 25 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{9.5}{\sqrt{125}} \approx 0.85.\]

So:

\[\bar{x} \approx N(25, 0.85^2)\]


Example:

The Internal Revenue Service reports that the mean federal income tax paid in a recent year was \(\$8000\). Assume that the standard deviation is \(\$5000\). The IRS plans to draw a sample of \(625\) tax returns to study the effect of a new tax law.

Let \(\bar{x} =\) the mean tax for the \(625\) sampled tax returns

  • Then \(\bar{x} \approx N(8000, 200^2)\) by the CLT


a. What is the probability that the sample mean tax is between \(7600\) and \(7900\)?

\[P(7600 < \bar{x} < 7900) \approx P\left(\frac{7600 - 8000}{200} < Z < \frac{7900 - 8000}{200}\right)\]

\[= P(-2 < Z < -0.5) = P(Z < -0.5) - P(Z < -2) = 0.3085 - 0.0228 = 0.2857 \quad \text{(z-table)}\]

b. Would it be unusual if the sample mean were less than \(7500\)?

\[P(\bar{x} < 7500) \approx P\left(Z < \frac{7500 - 8000}{200}\right) = P(Z < -2.5) = 0.0062 \quad \text{(z-table)}\]

Yes, because \(P(\bar{x} < 7500) < 1\%\)
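Both CLT examples above follow the same recipe, which can be sketched as one small helper. The function name `sampling_dist_of_mean` is hypothetical, and the sketch assumes \(n\) is large enough for the normal approximation to be reasonable.

```python
from statistics import NormalDist


def sampling_dist_of_mean(mu, sigma, n):
    """Approximate distribution of the sample mean: N(mu, (sigma/sqrt(n))^2)."""
    return NormalDist(mu=mu, sigma=sigma / n ** 0.5)


# College-age example: mu = 25, sigma = 9.5, n = 125.
age_mean = sampling_dist_of_mean(25, 9.5, 125)
print("sd of mean age:", round(age_mean.stdev, 2))

# IRS example: mu = 8000, sigma = 5000, n = 625.
tax_mean = sampling_dist_of_mean(8000, 5000, 625)
p_between = tax_mean.cdf(7900) - tax_mean.cdf(7600)   # part (a)
p_below = tax_mean.cdf(7500)                          # part (b)

print("sd of mean tax:       ", tax_mean.stdev)
print("P(7600 < xbar < 7900):", round(p_between, 4))
print("P(xbar < 7500):       ", round(p_below, 4))
```

Subtracting two `cdf` values is the code equivalent of subtracting two z-table lookups.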



Population Proportion

Proportions are a useful way to summarize information about a population or a sample without losing much nuance:

  • Proportions are just percentages of the population

  • We’ve dealt with this a lot


  • Say the percentage of the population who participate in early voting is \(60\%\)

    • \({60\over 100}=0.60\)

    • The proportion of the population who early vote, \(p=0.60\)

  • If we poll a sample of 100 Manhattan residents and find that \(31\%\) early vote:

    • The proportion of our sample who early vote, \(\hat{p}=0.31\)


Just like every other statistic, sample proportions are random variables

  • So their distribution is the sampling distribution of the proportion


All of our previous rules and ideas apply

  • As we take repeated samples from our population, the sample proportions will vary from sample to sample

  • The larger the sample, the closer \(\hat{p}\) tends to be to the true value \(p\)


  • Mean of the sample proportion \(\hat{p}\) is:

\[\mu_{\hat{p}} = p \quad \text{(population proportion)}\]

  • Standard deviation of sample proportion \(\hat{p}\) is:

\[\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\]

The Central Limit Theorem will tell us the “shape” of the distribution of \(\hat{p}\)


Proportion Central Limit Theorem

  • If \(np \geq 10\) and \(n(1 - p) \geq 10\)

  • Distribution of \(\hat{p}\) is approximately normal

    • Mean \(\mu_{\hat{p}} = p\)

    • Standard deviation \(\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\)

So:

\[\hat{p} \approx N \left( p, \frac{p(1 - p)}{n} \right)\]
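The rule above can be sketched as a helper that checks the two conditions before returning the normal approximation, followed by a quick simulation to compare. The function name `phat_distribution` and the values \(p = 0.40\), \(n = 200\) are assumptions chosen for illustration, not values from the lecture.

```python
import random
import statistics
from statistics import NormalDist


def phat_distribution(p, n):
    """Normal approximation for p-hat; ValueError if the rule of thumb fails."""
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("np and n(1 - p) must both be at least 10")
    return NormalDist(mu=p, sigma=(p * (1 - p) / n) ** 0.5)


# Illustrative values only (assumed).
p, n = 0.40, 200
approx = phat_distribution(p, n)

# Simulate 5000 polls of n people; each person is a "yes" with probability p.
rng = random.Random(2)
phats = [sum(rng.random() < p for _ in range(n)) / n for _ in range(5000)]

print("formula mean, sd:  ", approx.mean, round(approx.stdev, 4))
print("simulated mean, sd:", round(statistics.fmean(phats), 4),
      round(statistics.stdev(phats), 4))
```

Calling `phat_distribution(0.05, 100)` raises `ValueError`, since \(np = 5 < 10\).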


Example:

According to a Harris Poll, chocolate is the favorite ice cream flavor for 27% of Americans. If a sample of 100 Americans is taken, what is the probability that the sample proportion of those who prefer chocolate is greater than 0.30?

Since \(np = (100)(0.27) = 27 \geq 10\) and \(n(1 - p) = (100)(0.73) = 73 \geq 10\), we can apply the CLT. By the CLT, the distribution of \(\hat{p}\) is approximately normal with:

\[\mu_{\hat{p}} = 0.27 \quad \text{and} \quad \sigma_{\hat{p}} = \sqrt{\frac{0.27(1 - 0.27)}{100}} \approx 0.0444\]

Then,

\[P(\hat{p} > 0.30) \approx P\left( Z > \frac{0.30 - 0.27}{0.0444} \right)\]

\[= P(Z > 0.68)\]

\[= 1 - P(Z < 0.68)\]

\[= 1 - 0.7517 = 0.2483\]
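As a check on this example, the sketch below repeats the calculation in Python, once with \(z\) rounded to two decimals as in the z-table and once unrounded. For comparison it also computes the exact binomial probability for the count of chocolate fans, which goes beyond the notes and is included only to show that the CLT answer is an approximation. Variable names are illustrative.

```python
from math import comb
from statistics import NormalDist

p, n = 0.27, 100
se = (p * (1 - p) / n) ** 0.5          # standard deviation of p-hat
z = (0.30 - p) / se                    # unrounded z-score

std_normal = NormalDist()              # Z ~ N(0, 1)
p_table = 1 - std_normal.cdf(round(z, 2))   # z rounded to 0.68, as in the table
p_unrounded = 1 - std_normal.cdf(z)         # no rounding of z

# Exact binomial: p-hat > 0.30 means more than 30 of the 100 prefer chocolate.
p_binomial = sum(
    comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(31, n + 1)
)

print("se:", round(se, 4), " z:", round(z, 4))
print("normal, z rounded:  ", round(p_table, 4))
print("normal, z unrounded:", round(p_unrounded, 4))
print("exact binomial:     ", round(p_binomial, 4))
```

The two normal answers differ slightly because of rounding, and the binomial value differs from both because the normal curve only approximates a discrete count.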


  • We’ve studied point estimates, single-number estimates of population parameters (e.g., sample mean, sample proportion)

  • Point estimates are a deterministic result

    • Statistics deals with probabilistic results
  • It would be more informative to provide a range of values

  • These ranges are called confidence intervals, and we’ll cover them in more depth next lecture


  • Go away