Chapter 4 Statistical Inference (FQA)

4.1 Interval Estimates

4.1.1 Proportions

Suppose that in a survey of one hundred adult cell phone users, 30% switched carriers in the past two years. We can calculate a confidence interval for this proportion using the prop.test() command:

prop.test(30, 100, conf.level = 0.95)

## 
##  1-sample proportions test with continuity correction
## 
## data:  30 out of 100, null probability 0.5
## X-squared = 15.21, df = 1, p-value = 9.619e-05
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2145426 0.4010604
## sample estimates:
##   p 
## 0.3

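In this output, the line 95 percent confidence interval: tells us that our 95% confidence interval is (21.45%; 40.11%). The test results against a null probability of 0.5 appear only because 0.5 is the default null value in prop.test(); they can be ignored when all we want is the interval.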
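The interval can also be pulled out of the object that prop.test() returns instead of being read off the printed output. The following is a minimal sketch; the names result and ci_pct are illustrative and not part of the original example.

```r
# Store the test result instead of printing it
result <- prop.test(30, 100, conf.level = 0.95)

# conf.int holds the lower and upper bounds as proportions;
# as.numeric() drops the attached conf.level attribute
ci_pct <- round(100 * as.numeric(result$conf.int), 2)
ci_pct
```

The two values should match the bounds printed above, expressed as percentages.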

4.1.2 Means

To calculate a confidence interval for the average salary in data, we can use the t.test() function:

t.test(data$Salary, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  data$Salary
## t = 120.22, df = 919, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  153931.5 159040.5
## sample estimates:
## mean of x 
##    156486

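From this output we see that our 95% confidence interval is ($153,931.5; $159,040.5). The t statistic and p-value in the output test the default null hypothesis that the mean equals 0, which is not of interest when we only want the interval.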
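To see where such an interval comes from, it can be rebuilt by hand as the sample mean plus or minus a t critical value times the standard error. The sketch below uses a simulated salary vector (hypothetical values, not the data set from this section) so that it runs on its own.

```r
# Hypothetical salaries, simulated only so the example is self-contained
set.seed(1)
salary <- rnorm(200, mean = 150000, sd = 40000)

n <- length(salary)
se <- sd(salary) / sqrt(n)        # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # critical value for 95% confidence

# Interval built by hand and the one reported by t.test()
manual_ci <- mean(salary) + c(-1, 1) * t_crit * se
builtin_ci <- as.numeric(t.test(salary, conf.level = 0.95)$conf.int)

manual_ci
builtin_ci
```

Both lines should print the same pair of bounds, which suggests that t.test() is applying this formula.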

4.2 Hypothesis Testing

4.2.1 One Sample

4.2.1.1 Proportions

Suppose we have the following null and alternative hypotheses:

$H_o$ : In the past two years, the proportion of adult cell phone users who switched carriers equals 35%.
$H_a$ : In the past two years, the proportion of adult cell phone users who switched carriers does not equal 35%.

We then survey one hundred adult cell phone users, and thirty of them report that they switched carriers in the past two years. We can run this hypothesis test in R using the prop.test() function. In the code below, the argument p specifies the value in the null hypothesis (35%).

prop.test(30, 100, p = 0.35)

## 
##  1-sample proportions test with continuity correction
## 
## data:  30 out of 100, null probability 0.35
## X-squared = 0.89011, df = 1, p-value = 0.3454
## alternative hypothesis: true p is not equal to 0.35
## 95 percent confidence interval:
##  0.2145426 0.4010604
## sample estimates:
##   p 
## 0.3

In this output the p-value (0.3454) is greater than 0.05, so we fail to reject the null hypothesis at a significance level of 0.05 and cannot conclude that the true proportion is different from 35%.
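The X-squared statistic in this output can be reproduced by hand as a chi-squared statistic with a continuity correction. The sketch below is one way to do that arithmetic; the variable names are illustrative.

```r
x <- 30      # respondents who switched carriers
n <- 100     # respondents surveyed
p0 <- 0.35   # proportion stated in the null hypothesis

observed <- c(x, n - x)                 # switched, did not switch
expected <- c(n * p0, n * (1 - p0))     # counts expected if the null is true

# Continuity correction: shrink each absolute deviation by 0.5
chi_sq <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

c(chi_sq = chi_sq, p_value = p_value)
```

These two numbers should agree with the X-squared and p-value lines printed by prop.test() above.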

4.2.1.2 Means

Suppose we have the following null and alternative hypotheses:

$H_o$ : The true average rating of all employees at a company equals five.
$H_a$ : The true average rating of all employees at a company does not equal five.

The data set data contains a sample of employees from the company, and the Rating column contains each employee’s rating. We can run this hypothesis test in R using the t.test() function. In the code below, the argument mu specifies the value in the null hypothesis (5).

t.test(data$Rating, mu = 5)

## 
##  One Sample t-test
## 
## data:  data$Rating
## t = 33.04, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
##  6.87463 7.11137
## sample estimates:
## mean of x 
##     6.993

In this output the p-value (< 2.2e-16) is far below 0.05, so we reject the null hypothesis and conclude that the true average rating is different from five. The sample mean (6.993) and the confidence interval (6.87; 7.11) both lie above five.
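The t statistic reported by t.test() is the distance between the sample mean and the null value, measured in standard errors. The sketch below checks this on a simulated rating vector (hypothetical values, not the data set used above).

```r
# Hypothetical ratings on a 1-10 scale, simulated so the example runs on its own
set.seed(2)
rating <- round(runif(50, min = 1, max = 10))
mu0 <- 5   # value stated in the null hypothesis

n <- length(rating)
t_stat <- (mean(rating) - mu0) / (sd(rating) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

res <- t.test(rating, mu = mu0)
c(manual_t = t_stat, builtin_t = unname(res$statistic))
c(manual_p = p_value, builtin_p = res$p.value)
```

Within each printed pair, the manual and built-in values should be identical.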

4.2.2 Two Sample

4.2.2.1 Proportions

Suppose that Professor Yael and Professor Michael were both given a section of entering students for a statistics boot camp before fall classes started. After the boot camp ended, a survey was given to all the participants. Of the 75 who had Yael as an instructor, 45 said they were satisfied, whereas 48 of the 90 who had Michael were satisfied. Is there a significant difference in the percentage of students who were satisfied between the two instructors? To test this, our null and alternative hypotheses would be:

$H_o$ : There is no difference in the proportion of satisfied students in Michael and Yael’s classes.
$H_a$ : There is a difference in the proportion of satisfied students in Michael and Yael’s classes.

We can use prop.test() in R to calculate the appropriate p-value from this sample data:

prop.test(x = c(45, 48), n = c(75, 90))

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(45, 48) out of c(75, 90)
## X-squared = 0.49304, df = 1, p-value = 0.4826
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.09693574  0.23026908
## sample estimates:
##    prop 1    prop 2 
## 0.6000000 0.5333333

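In this output the p-value (0.4826) is greater than 0.05, so we fail to reject the null hypothesis at a significance level of 0.05 and cannot conclude that there is a difference in the proportion of satisfied students in Michael and Yael’s classes. The confidence interval for the difference in proportions (-0.097; 0.230) contains zero, which points to the same conclusion.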
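For two samples, the null hypothesis implies a single shared proportion, which is estimated by pooling both groups. The sketch below rebuilds the X-squared statistic from that pooled estimate; the variable names are illustrative.

```r
x <- c(45, 48)   # satisfied students: Yael, Michael
n <- c(75, 90)   # students surveyed:  Yael, Michael

p_pooled <- sum(x) / sum(n)   # shared proportion assumed under the null

observed <- cbind(x, n - x)                              # satisfied, not satisfied
expected <- cbind(n * p_pooled, n * (1 - p_pooled))      # counts expected under the null

# Continuity correction of 0.5 (every deviation here exceeds 0.5)
chi_sq <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

c(chi_sq = chi_sq, p_value = p_value)
```

The result should agree with the X-squared and p-value lines in the prop.test() output above.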

4.2.2.2 Means

The data set gss contains data from the General Social Survey, which tracks American attitudes on a wide variety of topics. Within gss, the INCOME column records the income of each respondent (a quantitative variable) and WRKGOVT indicates whether each respondent works for the government or in the private sector (a categorical variable). Suppose we have the following null and alternative hypotheses:

$H_o$ : On average, government workers earn the same as those in the private sector.
$H_a$ : On average, government workers do not earn the same as those in the private sector.

We can run this hypothesis test in R using the t.test() function:

t.test(gss$INCOME ~ gss$WRKGOVT)

## 
##  Welch Two Sample t-test
## 
## data:  gss$INCOME by gss$WRKGOVT
## t = 1.497, df = 343.21, p-value = 0.1353
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1184.580  8732.605
## sample estimates:
## mean in group 1 mean in group 2 
##        44621.83        40847.81

In this output the p-value (0.1353) is greater than 0.05, so we fail to reject the null hypothesis at a significance level of 0.05 (or 0.10). This means we cannot conclude that there is a difference in the average income of government and private sector workers. Note that the confidence interval for the difference in means (-$1,184.58; $8,732.61) contains zero.
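The same test can be written with the data argument of t.test(), which avoids repeating the data set name. The sketch below uses a simulated data frame called survey (hypothetical values and group labels, not the actual gss data). It also shows the var.equal = TRUE option, which switches from the default Welch test to the pooled-variance test.

```r
# Hypothetical data, simulated only so the example runs on its own
set.seed(3)
survey <- data.frame(
  INCOME  = c(rnorm(40, mean = 45000, sd = 15000),
              rnorm(60, mean = 41000, sd = 12000)),
  WRKGOVT = rep(c("Government", "Private"), times = c(40, 60))
)

# Two equivalent ways to request the Welch test
welch_dollar  <- t.test(survey$INCOME ~ survey$WRKGOVT)
welch_formula <- t.test(INCOME ~ WRKGOVT, data = survey)

# Pooled-variance version, which assumes both groups have equal variance
pooled <- t.test(INCOME ~ WRKGOVT, data = survey, var.equal = TRUE)

c(welch = welch_formula$p.value, pooled = pooled$p.value)
```

The Welch version is the default because it does not require the two groups to have the same variance, which is why its degrees of freedom are usually not a whole number.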