17 Day 16

Announcements

  • Roughly 2 weeks until midterm 2

    • Start studying

    • Don’t make your grade an emergency

Review

  • All random variables have a probability distribution

    • As all statistics are random variables:

      • All statistics arise from a probability distribution


The probability distribution of a sample statistic is the sampling distribution


Let $\bar{x}$ be the mean of a random sample of size $n$, drawn from a population with mean $\mu$ and standard deviation $\sigma$

Since $\bar{x}$ is a random variable, it has its own mean and standard deviation

  • The mean of $\bar{x}$ is $\mu$. That is,

$$\mu_{\bar{x}} = \mu = \text{population mean}$$

  • The standard deviation of $\bar{x}$ is $\sigma/\sqrt{n}$. That is,

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\text{population std. deviation}}{\sqrt{\text{sample size}}}$$

Sampling Distribution

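Given any population $Y \sim N(\mu, \sigma^2)$

  • Sample $X \sim N(\mu_X, \sigma_X^2)$

  • Sample mean $\bar{x} \sim N(\mu_{\bar{x}}, \sigma_{\bar{x}}^2)$

    • Where: $\mu_{\bar{x}} = \mu$

    • And: $\sqrt{\sigma_{\bar{x}}^2} = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$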


The intuition behind this may not be self-evident, but it’s easy to visualize:
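One way to see this without a picture is a small simulation. The sketch below is only illustrative: the values $\mu = 20$, $\sigma = 4$, $n = 25$ and the helper name `simulate_sample_means` are assumptions chosen for the demonstration. It draws many samples from a normal population and checks that the sample means center on $\mu$ with a spread close to $\sigma/\sqrt{n}$.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, num_samples, seed=0):
    """Draw num_samples samples of size n from N(mu, sigma^2); return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(num_samples)
    ]

# Illustrative values (assumed): mu = 20, sigma = 4, n = 25
means = simulate_sample_means(mu=20, sigma=4, n=25, num_samples=10_000)

# The center of the sample means should be near mu = 20
print("mean of sample means:", round(statistics.fmean(means), 3))

# Their spread should be near sigma / sqrt(n) = 4 / 5 = 0.8
print("sd of sample means:  ", round(statistics.stdev(means), 3))
```

The individual observations vary with standard deviation 4, but the means of 25 observations vary far less, which is what $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ describes.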


Example:

Suppose we take a simple random sample of size 25 from a normal population with a mean of 20 and a standard deviation of 4.

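a. What is the distribution of $\bar{x}$?

$$\bar{x} \sim N(\mu_{\bar{x}}, \sigma_{\bar{x}}^2)$$

$$\mu_{\bar{x}} = 20 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{4}{\sqrt{25}} = 0.8 \quad \Rightarrow \quad \bar{x} \sim N(20, 0.8^2)$$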


b. Find the probability that we will observe a sample mean over 22.

"Over" means $P(\bar{x} > 22)$:

$$P(\bar{x} > 22) = P\left(Z > \frac{22 - 20}{0.8}\right) = P(Z > 2.50)$$

$$= 1 - P(Z < 2.50) \overset{\text{z-table}}{=} 0.0062$$


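c. Find the 95th percentile of $\bar{x}$.

Look up 0.95 in the body of the z-table:

$$z = 1.64 \quad \text{or} \quad z = 1.65$$

$$\text{take the midpoint} \Rightarrow z \approx 1.645$$

Convert $z$ to $\bar{x}$ as follows:

$$\bar{x} = \mu_{\bar{x}} + z\sigma_{\bar{x}} = 20 + 1.645(0.8) = 21.316$$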
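As a cross-check on the table lookups in parts a to c, the same numbers can be reproduced with Python's standard-library `statistics.NormalDist`. This is a sketch, not part of the original example; the variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 20, 4, 25   # population mean, population sd, sample size

# (a) Sampling distribution of the sample mean: N(mu, (sigma / sqrt(n))^2)
xbar = NormalDist(mu=mu, sigma=sigma / sqrt(n))

# (b) P(xbar > 22) = 1 - P(xbar < 22)
p_over_22 = 1 - xbar.cdf(22)

# (c) 95th percentile of xbar (inverse of the cumulative distribution)
pct_95 = xbar.inv_cdf(0.95)

print(round(xbar.stdev, 4))   # 0.8
print(round(p_over_22, 4))    # 0.0062
print(round(pct_95, 3))       # 21.316
```

`inv_cdf` uses the exact 95th-percentile z value rather than the table midpoint 1.645, and the result agrees with the hand calculation to three decimals.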


  • What if the population we are sampling from isn’t normal?

    • It’s easier to find a way to assume that $\bar{x}$ is a normal random variable
  • Given the Central Limit Theorem, we can do that under certain assumptions



Central Limit Theorem

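  • Let $\bar{x}$ be the mean of a large random sample ($n > 30$) from any population

    • With mean $\mu$ and standard deviation $\sigma$
  • The distribution of $\bar{x}$ is approximately normal

    • Mean $\mu_{\bar{x}} = \mu$

    • Standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$


If $n$ is large enough, we have:

$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$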

  • Regardless of the original population’s distribution


How large does n need to be?

  • This is an ongoing debate in statistics

    • As the skew of the population distribution increases, so does the sample size $n$ we need

    • As a general rule of thumb, $n > 30$ should be sufficient
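The effect of sample size on a skewed population can be explored by simulation. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1, strongly right-skewed). That choice is purely illustrative, and `skewness` and `sample_means` are hypothetical helper names. For each $n$ it prints the observed standard deviation of the sample means, the CLT value $\sigma/\sqrt{n}$, and the skewness of the sample means.

```python
import random
import statistics
from math import sqrt

def skewness(data):
    """Average cubed z-score; close to 0 for a symmetric distribution."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return statistics.fmean(((x - m) / s) ** 3 for x in data)

def sample_means(n, num_samples, rng):
    """Means of num_samples samples of size n from an Exponential(1) population."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]

rng = random.Random(1)  # fixed seed so the run is repeatable
print("n, observed sd, sigma/sqrt(n), skewness of sample means")
for n in (2, 30, 200):
    means = sample_means(n, 4_000, rng)
    print(n,
          round(statistics.stdev(means), 3),
          round(1 / sqrt(n), 3),
          round(skewness(means), 2))
```

The skewness column should shrink toward 0 as $n$ grows, which is the sense in which the distribution of $\bar{x}$ becomes "approximately normal". A more skewed population would need a larger $n$ to reach the same level.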



Example:

Recent data from the U.S. Census indicates that the mean age of college students is $\mu = 25$ years, with a standard deviation of $\sigma = 9.5$ years. A simple random sample of 125 students is drawn. If $\bar{x}$ = the sample mean age of the students, what is the distribution of $\bar{x}$? (Justify your answer.)

  • Since $n > 30$:

    • $\bar{x} \sim N(\mu_{\bar{x}}, \sigma_{\bar{x}}^2)$

$$\mu_{\bar{x}} = 25 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{9.5}{\sqrt{125}} \approx 0.85$$

So:

$$\bar{x} \sim N(25, 0.85^2)$$


Example:

The Internal Revenue Service reports that the mean federal income tax paid in a recent year was $8000. Assume that the standard deviation is $5000. The IRS plans to draw a sample of 625 tax returns to study the effect of a new tax law.

Let $\bar{x}$ = the mean tax for the 625 sampled tax returns

  • Then $\bar{x} \sim N(8000, 200^2)$ by the CLT, since $\sigma_{\bar{x}} = \frac{5000}{\sqrt{625}} = 200$


a. What is the probability that the sample mean tax is between 7600 and 7900?

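$$P(7600 < \bar{x} < 7900) \approx P\left(\frac{7600 - 8000}{200} < Z < \frac{7900 - 8000}{200}\right)$$

$$= P(-2 < Z < -0.5) \overset{\text{z-table}}{=} 0.2857$$

b. Would it be unusual if the sample mean were less than 7500?

$$P(\bar{x} < 7500) \approx P\left(Z < \frac{7500 - 8000}{200}\right) = P(Z < -2.5) \overset{\text{z-table}}{=} 0.0062$$

Yes, because $P(\bar{x} < 7500) < 1\%$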
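The two probabilities in this example can be checked with `statistics.NormalDist` from the standard library. This is a sketch with illustrative variable names; the 1% cutoff for "unusual" is simply the threshold used in the answer above.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 8000, 5000, 625

# By the CLT, xbar is approximately N(8000, 200^2): 5000 / sqrt(625) = 200
xbar = NormalDist(mu, sigma / sqrt(n))

# (a) P(7600 < xbar < 7900) as a difference of two cumulative probabilities
p_between = xbar.cdf(7900) - xbar.cdf(7600)

# (b) P(xbar < 7500), compared with the 1% cutoff used in the notes
p_below = xbar.cdf(7500)
is_unusual = p_below < 0.01

print(round(p_between, 4))   # about 0.2858; the z-table gives 0.2857 due to rounding
print(round(p_below, 4))     # 0.0062
print(is_unusual)            # True
```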



Population Proportion

Proportions are a useful way to interpret information about a population and sample without losing very much nuance at all:

  • Proportions are just percentages of the population

  • We’ve dealt with this a lot


  • Say the percentage of the population who participate in early voting is 60%

    • $\frac{60}{100} = 0.60$

    • The proportion of the population who early vote, $p = 0.60$

  • If we poll a sample of 100 Manhattan residents and find that 31% early vote:

    • The proportion of our sample who early vote, $\hat{p} = 0.31$


Just like every other statistic, sample proportions are random variables

  • So their distribution is the sampling distribution of the proportion


All of our previous rules and ideas apply

  • As we take samples from our population we will see they aren’t consistent

  • The more we sample the closer we get to true values
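Both points can be seen in a short simulation. The sketch below assumes a true population proportion of $p = 0.60$ (the early-voting value above) and uses a hypothetical helper, `sample_proportion`, that simulates one poll of size $n$. It repeats the poll many times for several sample sizes and prints the mean and standard deviation of the resulting sample proportions.

```python
import random
import statistics

def sample_proportion(p, n, rng):
    """Proportion of 'successes' in one simulated sample of size n."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(42)  # fixed seed so the run is repeatable
p = 0.60                 # assumed true population proportion

print("n, mean of p-hat, sd of p-hat")
for n in (25, 100, 400):
    # 2,000 simulated polls at each sample size
    p_hats = [sample_proportion(p, n, rng) for _ in range(2_000)]
    print(n,
          round(statistics.fmean(p_hats), 3),
          round(statistics.stdev(p_hats), 3))
```

Individual sample proportions differ from poll to poll, but they center on $p$, and their spread shrinks as $n$ grows.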


  • Mean of the sample proportion $\hat{p}$ is:

$$\mu_{\hat{p}} = p \quad \text{(population proportion)}$$

  • Standard deviation of the sample proportion $\hat{p}$ is:

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$

The Central Limit Theorem will tell us the “shape” of the distribution of $\hat{p}$


Proportion Central Limit Theorem

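  • If $np \geq 10$ and $n(1-p) \geq 10$

  • Distribution of $\hat{p}$ is approximately normal

    • Mean $\mu_{\hat{p}} = p$

    • Standard deviation $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$

So:

$$\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)$$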


Example:

According to a Harris Poll, chocolate is the favorite ice cream flavor for 27% of Americans. If a sample of 100 Americans is taken, what is the probability that the sample proportion of those who prefer chocolate is greater than 0.30?

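Since $np = (100)(0.27) = 27 \geq 10$ and $n(1-p) = (100)(0.73) = 73 \geq 10$, we can apply the CLT. By the CLT, the distribution of $\hat{p}$ is approximately normal with:

$$\mu_{\hat{p}} = 0.27 \quad \text{and} \quad \sigma_{\hat{p}} = \sqrt{\frac{0.27(1 - 0.27)}{100}} \approx 0.0444$$

Then,

$$P(\hat{p} > 0.30) \approx P\left(Z > \frac{0.30 - 0.27}{0.0444}\right)$$

$$= P(Z > 0.68)$$

$$= 1 - P(Z < 0.68)$$

$$= 1 - 0.7517 = 0.2483$$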
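The same calculation can be sketched in Python. The function name `proportion_sampling_dist` is hypothetical. It checks the $np \geq 10$ and $n(1-p) \geq 10$ conditions before returning the normal approximation, using `statistics.NormalDist` from the standard library.

```python
from math import sqrt
from statistics import NormalDist

def proportion_sampling_dist(p, n):
    """Normal approximation for p-hat; refuses to run if the CLT conditions fail."""
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("np and n(1-p) must both be at least 10")
    return NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))

# Chocolate example: p = 0.27, n = 100
p_hat = proportion_sampling_dist(p=0.27, n=100)

# P(p-hat > 0.30) = 1 - P(p-hat < 0.30)
prob = 1 - p_hat.cdf(0.30)

print(round(p_hat.stdev, 4))   # 0.0444
print(round(prob, 2))          # 0.25
```

The software result is slightly larger than 0.2483 because it does not round $z$ to 0.68 before the lookup. The two answers agree to about two decimal places.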


  • We’ve studied point estimates — single number estimates — to estimate population parameters (e.g., sample mean, sample proportion)

  • Point estimates are a deterministic result

    • Statistics deals with probabilistic results
  • It would be more informative to provide a range of values

  • We generally call these confidence intervals and we’ll be talking about them more in-depth next lecture


  • Go away