Module 8 Inference for a Proportion

In this module, we continue our discussion of statistical inference by turning to hypothesis testing. In hypothesis testing, we take a more active approach to our data by asking questions about population parameters and developing a framework to answer those questions. We will root this discussion in confidence intervals before learning about several other approaches to hypothesis testing.

Module Learning Outcomes/Objectives

Perform and interpret inference for a population proportion.

R Objectives

Generate hypothesis tests for a proportion.
Interpret R output for tests of a proportion.

This module’s outcomes correspond to course outcomes (6) apply statistical inference techniques of parameter estimation such as point estimation and confidence interval estimation and (7) apply techniques of testing various statistical hypotheses concerning population parameters.

8.1 Confidence Intervals for a Proportion

Inference for a proportion is really similar to inference for a mean! It turns out we can apply the Central Limit Theorem to the sampling distribution for a proportion. But wait - isn’t our Central Limit Theorem only for means?

Think back to the binomial distribution (Section 4.3). A binomial experiment is made up of a series of Bernoulli trials, which result in 0s and 1s. If we add up these values, we get the number of successes $x$ . If we take the mean of these successes, we get the proportion of successes. In short, $\bar{x} = \hat{p}$ and we can work with the sampling distribution for a sample mean!

The mean of a Bernoulli random variable is $\mu = p$ and the standard deviation is $\sigma = \sqrt{p(1-p)}$ . So if we apply the Central Limit Theorem, $\hat{p}$ is approximately normally distributed with mean $\mu_{\hat{p}} = p$ and standard error $\sigma_{\hat{p}} = \frac{\sqrt{p(1-p)}}{\sqrt{n}} = \sqrt{\frac{p(1-p)}{n}}$
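A quick simulation can make this concrete. The sketch below is only an illustration: the values of `p`, `n`, `reps`, and the seed are arbitrary choices and do not come from any example in this module. It checks that the mean of a sample of 0s and 1s is the sample proportion, and that simulated values of $\hat{p}$ have a mean near $p$ and a standard deviation near $\sqrt{p(1-p)/n}$.

```r
# Illustrative values only (assumptions, not taken from the text)
set.seed(1)       # arbitrary seed so the run is reproducible
p <- 0.3          # assumed true proportion
n <- 100          # assumed sample size
reps <- 10000     # number of simulated samples

# One sample of Bernoulli trials: the mean of the 0s and 1s is p-hat
one_sample <- rbinom(n, size = 1, prob = p)
all.equal(mean(one_sample), sum(one_sample) / n)

# Many samples: each p-hat is (number of successes) / n
p_hats <- rbinom(reps, size = n, prob = p) / n

mean(p_hats)              # should be close to p
sd(p_hats)                # should be close to the standard error below
sqrt(p * (1 - p) / n)     # theoretical standard error

# hist(p_hats) would show the roughly bell-shaped sampling distribution
```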

Each of the confidence intervals for a mean uses the same logic: $\text{estimate }\pm\text{ critical value }\times\text{ standard error}$. Confidence intervals for a proportion will do the same. We do not know the true value of $p$ for the standard error, so we will plug in $\hat{p}$.

A $100(1-\alpha)\%$ confidence interval for $p$ is

$\hat{p}\pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

To use this formula, we need to check that $n\hat{p} > 10$ and $n(1-\hat{p})>10$ . (Note that $n\hat{p}$ is the number of successes and $n(1-\hat{p})$ is the number of failures, so this is another way to check this condition!)

Why? This relies on a normal approximation that does not work well if either of those quantities is less than or equal to 10. (This is a topic which we have skipped, but the theory behind it is similar to the theory presented here for why we can use the Central Limit Theorem with proportions.)

Example: Suppose we take a random sample of 27 US households and find that 15 of them have dogs. Find a 95% confidence interval for the proportion of US households with dogs.

Solution: From the problem statement, $\alpha = 0.05$ . Also, $\hat{p} = 15/27 = 0.56$ . The number of successes (households with dogs) in the sample is 15 and the number of failures is 12, both greater than 10, so our assumptions are satisfied.

The critical value is $z_{\alpha/2}$ . Using the normal distribution applet with $\alpha = 0.05$ , this yields a value of 1.96. Plugging everything in, $\hat{p}\pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.56 \pm 1.96\sqrt{\frac{0.56\times0.44}{27}} = 0.56 \pm 0.19$ or a 95% confidence interval of (0.37, 0.75).

Based on our sample, we can be 95% confident that the proportion of US households with dogs is between 0.37 and 0.75.
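The same calculation can be carried out step by step in R. This is a minimal sketch of the formula above, not a built-in procedure; the variable names are our own. Because R carries full precision instead of rounding $\hat{p}$ to 0.56, the endpoints come out to roughly (0.37, 0.74), a small rounding difference from the hand calculation.

```r
# Dog ownership example: 15 successes out of 27 households
x <- 15
n <- 27
conf_level <- 0.95

p_hat <- x / n

# Check the condition: more than 10 successes and more than 10 failures
x > 10 && (n - x) > 10

# Critical value z_{alpha/2}
alpha <- 1 - conf_level
z_crit <- qnorm(1 - alpha / 2)

# Standard error uses p-hat because p is unknown
se <- sqrt(p_hat * (1 - p_hat) / n)

# estimate +/- critical value * standard error
ci <- p_hat + c(-1, 1) * z_crit * se
round(ci, 2)
```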

The concepts and interpretation behind these confidence intervals are the same as those for confidence intervals for a mean. Refer back to Module 6 for details.

Section Exercises

A survey of Stat 1 students resulted in the following data:

Year	Count
Freshman	16
Sophomore	11
Junior	3
Senior	5

We wish to find a 95% confidence interval for the proportion of freshmen.
1. Based on the sample data provided above, find $\hat{p}$ , the sample proportion of freshmen.
2. Confirm the condition that $n\hat{p} > 10$ and $n(1-\hat{p}) > 10$ .
3. Determine the critical value.
4. Find the 95% confidence interval for the proportion of freshmen.
5. Interpret your interval in the context of the problem.
Find a 98% confidence interval for the proportion of sophomores.
Can you use the methods from this section to find a confidence interval for the proportion of juniors? Explain your thought process.

8.2 Hypothesis Tests for a Proportion

For a single proportion, the null and alternative hypotheses are

$H_0: p = p_0$
$H_A: p \ne p_0$

We can perform a hypothesis test for $p$ using the confidence interval, critical value, or p-value approach we covered previously. The concepts and interpretation are the same as those described in Module 7. You will also notice that the steps for each approach have not changed! The only modifications we need to make are to our setting, assumption, and a couple of formulas.

Setting and Assumptions: $p$ is target parameter, $np_0 > 10$ , $n(1-p_0)>10$ .

8.2.1 Confidence Interval Approach

The $100(1-\alpha)\%$ confidence interval for $p$ is $\hat{p}\pm z_{\alpha/2}\sqrt{\frac{p_0(1-p_0)}{n}}$. Notice that we use $p_0$ in the standard error and not the sample proportion. This is different from how we dealt with the standard error when calculating confidence intervals outside of a hypothesis testing context. We do this because the standard error is calculated from the sampling distribution under the null hypothesis, which says that $p=p_0$.

Steps:

State null and alternative hypotheses.
Decide on significance level $\alpha$ . Check assumptions.
Find the critical value.
Compute confidence interval.
If the null value is not in the confidence interval, reject the null hypothesis. Otherwise, do not reject.
Interpret results in the context of the problem.

Example: A quick internet search suggests that 38.4% of US households have dogs. Based on the sample described previously, is it reasonable to assume that the internet search is correct? Test at the 0.05 level of significance using a confidence interval approach.

Solution: We know from the previous example that $\hat{p} = 0.56$ and $n=27$ .

We want to see if the internet search is correct, so the null and alternative hypotheses are $H_0: p = 0.384$ and $H_A: p \ne 0.384$.

From the problem statement, $\alpha = 0.05$ . Also, $np_0 = 27(0.384)=10.4$ and $n(1-p_0)=27(0.616)=16.6$ , both greater than 10, so our assumptions are satisfied.

The critical value is $z_{\alpha/2}$ . Using the normal distribution applet with $\alpha = 0.05$ , this yields a value of 1.96.

Plugging everything in, $\hat{p}\pm z_{\alpha/2}\sqrt{\frac{p_0(1-p_0)}{n}} = 0.56 \pm 1.96\sqrt{\frac{0.384\times(1-.384)}{27}} = 0.56 \pm 0.18$ or a 95% confidence interval of (0.38, 0.74).

The null value is in the interval, so we fail to reject $H_0$ .

At the 0.05 level of significance, the data provide insufficient evidence to conclude that the proportion of US households with dogs differs from 0.384.
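The confidence interval approach for this example can be sketched in R as follows. The variable names are our own, and the only change from an ordinary confidence interval is that the standard error uses the null value $p_0$ in place of $\hat{p}$.

```r
# Dog ownership example with null value p0 = 0.384
x <- 15
n <- 27
p0 <- 0.384
alpha <- 0.05

p_hat <- x / n

# Check assumptions using the null value
n * p0 > 10 && n * (1 - p0) > 10

# Critical value and standard error under the null hypothesis
z_crit <- qnorm(1 - alpha / 2)
se0 <- sqrt(p0 * (1 - p0) / n)

# Confidence interval centered at p-hat
ci <- p_hat + c(-1, 1) * z_crit * se0
round(ci, 2)

# Reject H0 only if the null value falls outside the interval
reject <- p0 < ci[1] || p0 > ci[2]
reject
```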

8.2.2 Critical Value Approach

The critical value is $z_{\alpha/2}$ and the test statistic is $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$. Notice that we again plug in $p_0$ for the standard error!

Steps:

State the null and alternative hypotheses.
Determine the significance level $\alpha$ . Check assumptions.
Compute the value of the test statistic.
Determine the critical values.
If the test statistic is in the rejection region, reject the null hypothesis. Otherwise, do not reject.
Interpret results.

Example: In 2007, the proportion of US adults who had ever had chickenpox was 61.4%. Since the chickenpox vaccine was introduced in 1995, it is reasonable to wonder if this value has decreased over time. A 2020 random sample of 100 US adults resulted in 13 who had ever had chickenpox. Use the critical value approach to test (at the 0.01 level of significance) whether the proportion of US adults who have ever had chickenpox is still 61.4%.

Solution: From the problem statement, $n=100$ and $\hat{p} = \frac{13}{100} = 0.13$ .

We want to know if the true proportion of US adults who have ever had chickenpox is 0.614, so the null and alternative hypotheses are $H_0: p = 0.614$ and $H_A: p \ne 0.614$.

From the problem statement, $\alpha = 0.01$ ; for our assumptions, $np_0 = 100\times 0.614 = 61.4$ and $n(1-p_0) = 100\times 0.386 = 38.6$ , both greater than 10.

The test statistic is $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{0.13 - 0.614}{\sqrt{\frac{0.614(1-0.614)}{100}}} = -9.94$

The critical value is $z_{\alpha/2} = z_{0.01/2} = 2.58$

The rejection region is represented by values which are outside of $(-2.58, 2.58)$ . So, the test statistic $z = -9.94$ is in the rejection region and we will reject the null hypothesis.

At the 0.01 level of significance, the data provide sufficient evidence to conclude that the true proportion of US adults who have ever had chickenpox is less than the 0.614 observed in 2007.
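The critical value approach for the chickenpox example can be sketched in R like this. The variable names are our own; `qnorm` returns the critical value without needing the applet.

```r
# Chickenpox example: 13 of 100 sampled adults, null value 0.614
x <- 13
n <- 100
p0 <- 0.614
alpha <- 0.01

p_hat <- x / n

# Check assumptions using the null value
n * p0 > 10 && n * (1 - p0) > 10

# Test statistic, with p0 in the standard error
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
round(z, 2)

# Critical value z_{alpha/2}; the rejection region is |z| > z_crit
z_crit <- qnorm(1 - alpha / 2)
round(z_crit, 2)

reject <- abs(z) > z_crit
reject
```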

8.2.3 P-Value Approach

The p-value is $2P(Z > |z|)$ where $z$ is the test statistic described above.

Steps:

State the null and alternative hypotheses.
Determine the significance level $\alpha$ . Check assumptions.
Compute the value of the test statistic.
Determine the p-value.
If $\text{p-value} < \alpha$ , reject the null hypothesis. Otherwise, do not reject.
Interpret results.

Example: The 2020 US census suggested that 18.9% of the population identifies as Hispanic or Latino. A random sample of 53 Californians resulted in 20 who identified as Hispanic or Latino. Is the proportion of Hispanic or Latino Californians different from that of the US as a whole? Test at the 0.05 level of significance using the p-value approach.

Solution: From the problem statement, $n=53$ and $\hat{p} = 20/53 = 0.377$ .

We want to know if the proportion of Californians who identify as Hispanic or Latino differs from 0.189, so the null and alternative hypotheses are $H_0: p = 0.189$ and $H_A: p \ne 0.189$.

From the problem statement, $\alpha = 0.05$; for our assumptions, $np_0 = 53\times 0.189 = 10.02$ and $n(1-p_0) = 53\times0.811=42.98$, both greater than 10, so our assumptions are satisfied.

The test statistic is $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{0.377 - 0.189}{\sqrt{\frac{0.189(1-0.189)}{53}}} = 3.50$

The p-value is $2P(Z > |z|) = 2P(Z > 3.50) = 2\times 0.0002 = 0.0004$ .

Since the p-value $0.0004 < \alpha = 0.05$ , reject the null hypothesis.

At the 0.05 level of significance, the data provide sufficient evidence to conclude the proportion of Californians who identify as Hispanic or Latino is greater than that of the US as a whole.
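The p-value approach for this example can be sketched in R as shown below. The variable names are our own. Since R does not round $\hat{p}$ or the tail probability, the p-value it reports will be close to, but not exactly, the 0.0004 found by hand.

```r
# California example: 20 of 53 sampled, null value 0.189
x <- 20
n <- 53
p0 <- 0.189
alpha <- 0.05

p_hat <- x / n

# Check assumptions using the null value
n * p0 > 10 && n * (1 - p0) > 10

# Test statistic, with p0 in the standard error
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
round(z, 2)

# Two-sided p-value: 2 * P(Z > |z|)
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
p_value

# Reject H0 if the p-value is below alpha
p_value < alpha
```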

Section Exercises

Another survey of Stat 1 students resulted in the following data:

Year	Count
Freshman	26
Sophomore	21
Junior	13
Senior	15

Suppose we wish to test the hypothesis that 50% of all Stat 1 students are freshmen using the confidence interval approach. We will test at the 0.05 level of significance.
1. State the null and alternative hypotheses.
2. Find $\hat{p}$ , the sample proportion of freshmen.
3. Determine $\alpha$ and check assumptions.
4. Determine the critical value.
5. How will this confidence interval formula differ from the one you found in Section 8.1 Exercise 1?
6. Compute the confidence interval.
7. What is your conclusion?
8. Interpret your results in the context of the problem.
Suppose we want to know if 25% of Stat 1 students are sophomores. Use the confidence interval approach to complete this test at the 0.1 level of significance.
Use the critical value approach at the 0.05 level of significance to test the hypothesis that 25% of all Stat 1 students are juniors.
1. State the null and alternative hypotheses.
2. Find $\hat{p}$ , the sample proportion of juniors.
3. Determine $\alpha$ and check assumptions.
4. Compute the test statistic.
5. Determine the critical values.
6. Sketch a normal curve, label the critical values, and indicate the rejection regions.
7. What is your conclusion?
8. Interpret your results in the context of the problem.
Use the critical value approach at the 0.02 level of significance to test the hypothesis that 20% of all Stat 1 students are seniors.
Use the p-value approach at the 0.01 level of significance to test the hypothesis that 40% of all Stat 1 students are freshmen.
1. State the null and alternative hypotheses.
2. Find $\hat{p}$ , the sample proportion of freshmen.
3. Determine $\alpha$ and check assumptions.
4. Compute the test statistic.
5. Use an applet to determine the p-value.
6. What is your conclusion?
7. Interpret your results in the context of the problem.
Use the p-value approach at the 0.05 level of significance to test the hypothesis that 10% of all Stat 1 students are seniors.

R: Hypothesis Tests for a Proportion

To generate confidence intervals and hypothesis tests for a proportion, we will use the command binom.test. This will give us slightly different results than the z-test we used throughout this module, but it is actually going to be more exact! This approach also does not have any limitations on the values $n\hat{p}$ or $np_0$ . We use the z-test when working by hand because the exact binomial test is difficult to do on paper. The arguments we need are:

x: the number of successes.
n: the number of trials.
p: the null value $p_0$ .
conf.level: the desired confidence level ( $1-\alpha$ ).

Let’s continue to use the example seen throughout this module. We have a random sample of 27 US households and 15 of them have dogs. We also have the claim that, in fact, 38.4% of US households have dogs. We will use a significance level of $\alpha=0.05$ .

Based on the prompt, there are $x = 15$ successes; $n=27$ trials; and $p_0=0.384$ . So the R command will look like

binom.test(x = 15, n = 27, p = 0.384, conf.level = 0.95)

## 
##  Exact binomial test
## 
## data:  15 and 27
## number of successes = 15, number of trials = 27, p-value = 0.07591
## alternative hypothesis: true probability of success is not equal to 0.384
## 95 percent confidence interval:
##  0.3532642 0.7452012
## sample estimates:
## probability of success 
##              0.5555556

The output shows (top to bottom):

a summary of the data we entered, along with the p-value.
the alternative hypothesis.
a 95% confidence interval for $p$ .
the sample proportion $\hat{p}$ .
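If you would rather pull these pieces out individually than read them off the printed output, you can store the result and access its components. This is a small sketch; `result` is just a name we chose, while `p.value`, `conf.int`, and `estimate` are components of the object that `binom.test` returns.

```r
# Store the test result instead of only printing it
result <- binom.test(x = 15, n = 27, p = 0.384, conf.level = 0.95)

result$p.value     # the p-value
result$conf.int    # the confidence interval for p
result$estimate    # the sample proportion p-hat

# Compare the p-value to the significance level
alpha <- 0.05
result$p.value < alpha   # TRUE would mean reject the null hypothesis
```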

Since this is slightly different from the test used when we discussed doing these calculations by hand, when we do hypothesis tests for a proportion using R, we will not use the critical value approach. Based on the confidence interval and p-value, at the 0.05 level of significance, the data provide insufficient evidence to conclude that the proportion of US households with dogs differs from 0.384. (In general, we will come to the same conclusion whether we do these tests by hand or using R.)
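If you are curious how the by-hand z-test compares, one option (not required for this module) is `prop.test` with `correct = FALSE`, which turns off the continuity correction. The sketch below assumes the same dog ownership data. `prop.test` reports a statistic labeled X-squared, which for a single proportion is the square of our $z$, so the p-values agree. Its confidence interval is computed with a different formula (the Wilson score interval), so it will not match the intervals we computed by hand.

```r
# Dog ownership example
x <- 15
n <- 27
p0 <- 0.384

# z-test by hand
p_hat <- x / n
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value_by_hand <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Large-sample test in R without continuity correction
z_test <- prop.test(x = x, n = n, p = p0, correct = FALSE)

z^2                  # squared test statistic from the hand calculation
z_test$statistic     # X-squared reported by prop.test
p_value_by_hand
z_test$p.value
```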