7 Inference about two populations

Many experiments involve a comparison of two populations. For instance:

  • A real estate company may want to estimate the difference in mean sales price between city and suburban homes.
  • A consumer group might test whether two major brands of food freezers differ in the average amount of electricity they use.
  • A television market researcher wants to estimate the difference in the proportions of younger and older viewers who regularly watch a popular TV program.

The same procedures that are used to estimate and test hypotheses about a single population can be modified to make inferences about two populations.

Determining the Target Parameter

Parameter | Key words | Type of data
\(\mu_1-\mu_2\) | Mean difference; difference in averages | Quantitative
\(p_1-p_2\) | Difference between proportions, percentages, fractions, or rates; compare proportions | Qualitative
\(\sigma_1^2/\sigma_2^2\) | Ratio of variances; difference in variability or spread; compare variation | Quantitative

7.1 Population Mean Between Two Matched Samples

Two data samples are matched if they come from repeated observations of the same subject. Here, we assume that the differences between the paired observations follow a normal distribution. Using the paired t-test, we can test hypotheses about the difference of the population means and obtain an interval estimate of it.
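As a sketch of the standard paired procedure, the test statistic and confidence interval can be written as follows. The notation is introduced here for illustration and is not from the original text: \(d_i\) is the difference within pair \(i\), \(\bar{d}\) and \(s_d\) are the sample mean and standard deviation of the differences, \(n\) is the number of pairs, and \(\mu_{d,0}\) is the hypothesized mean difference (usually 0).

\[
d_i = x_{1i} - x_{2i}, \qquad
t = \frac{\bar{d} - \mu_{d,0}}{s_d/\sqrt{n}}, \qquad
\bar{d} \pm t_{\alpha/2,\,n-1}\,\frac{s_d}{\sqrt{n}}
\]

The statistic has \(n-1\) degrees of freedom because the analysis reduces the two samples to a single sample of \(n\) differences.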

Paired samples: The sample selected from the first population is related to the corresponding sample from the second population.

It is important to distinguish between independent samples and paired samples. Some examples follow.

Example

Compare the time that males and females spend watching TV.

  • We randomly select 20 males and 20 females and compare the average time they spend watching TV. Is this an independent sample or paired sample?
    • Independent
  • We randomly select 20 couples and compare the time the husbands and wives spend watching TV. Is this an independent sample or paired sample?
    • Paired

Example: Drinking Water

Trace metals in drinking water affect its flavor, and an unusually high concentration can pose a health hazard. The data below are nine paired measurements of zinc concentration in bottom water and surface water.

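library(tibble)

# Zinc concentration in bottom and surface water, one pair per row
df <- tribble(
  ~bottom, ~surface,
  0.430, 0.415,
  0.266, 0.238,
  0.567, 0.410,
  0.531, 0.605,
  0.707, 0.609,
  0.716, 0.632,
  0.651, 0.523,
  0.589, 0.411,
  0.469, 0.612
)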

head(df)
## # A tibble: 6 x 2
##   bottom surface
##    <dbl>   <dbl>
## 1  0.430   0.415
## 2  0.266   0.238
## 3  0.567   0.410
## 4  0.531   0.605
## 5  0.707   0.609
## 6  0.716   0.632

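Do the data suggest that the true average zinc concentration in bottom water exceeds that in surface water? The call below runs the default two-sided paired t-test. Because the question is one-sided (alternative hypothesis: mean difference greater than 0) and the observed t statistic is positive, the one-sided p-value is half the two-sided p-value reported in the output.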

t.test(df$bottom, df$surface, paired=TRUE) 
## 
##  Paired t-test
## 
## data:  df$bottom and df$surface
## t = 1.4667, df = 8, p-value = 0.1806
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02994557  0.13461224
## sample estimates:
## mean of the differences 
##              0.05233333
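The paired t-test is a one-sample t-test on the differences. The following sketch, which copies the nine pairs from the tribble above into plain vectors (the variable names are illustrative), computes the statistic by hand and compares it with the one-sided test obtained from t.test with alternative = "greater".

```r
# Zinc data copied from the tribble above (nine pairs).
bottom  <- c(0.430, 0.266, 0.567, 0.531, 0.707, 0.716, 0.651, 0.589, 0.469)
surface <- c(0.415, 0.238, 0.410, 0.605, 0.609, 0.632, 0.523, 0.411, 0.612)

# Paired differences and their summary statistics.
d     <- bottom - surface
n     <- length(d)
d_bar <- mean(d)
se    <- sd(d) / sqrt(n)

# Test statistic for H0: mean difference = 0.
t_stat <- d_bar / se

# One-sided p-value for Ha: mean difference > 0 (upper tail, n - 1 df).
p_upper <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# The same one-sided test via t.test.
one_sided <- t.test(bottom, surface, paired = TRUE, alternative = "greater")

c(t = t_stat, p_manual = p_upper, p_ttest = one_sided$p.value)
```

With these data the one-sided p-value is about 0.09, so at the 5% significance level there is not enough evidence to conclude that the mean zinc concentration in bottom water exceeds that in surface water.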

7.2 Comparing Two Population Means: Independent Sampling

In this section we develop both large-sample and small-sample methodologies for comparing two population means.

  • In the large-sample case we use the z-statistic.
  • In the small-sample case we use the t-statistic.

Population Mean Between Two Independent Samples

Two data samples are independent if they come from unrelated populations and the samples do not affect each other. Here, we assume that the data populations follow the normal distribution.

Using the unpaired t-test, we can obtain an interval estimate of the difference between two population means.

Example

The column mpg of the data frame mtcars contains gas mileage data for 32 automobiles, taken from the 1974 Motor Trend US magazine.

head(mtcars$mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1

Meanwhile, another data column in mtcars, named am, indicates the transmission type of the automobile model (0 = automatic, 1 = manual).

head(mtcars$am)
## [1] 1 1 1 0 0 0

In particular, the gas mileages of cars with manual and automatic transmissions form two independent samples.

Assuming that the data in mtcars follow the normal distribution, find the 95% confidence interval estimate of the difference between the mean gas mileage of manual and automatic transmissions.

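We first split mtcars by transmission type, then apply the t.test function to estimate the difference in means between the two samples.

# Split by transmission type (1 = manual, 0 = automatic)
am_1 <- subset(mtcars, am == 1)
am_0 <- subset(mtcars, am == 0)

t.test(am_1$mpg, am_0$mpg)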
## 
##  Welch Two Sample t-test
## 
## data:  am_1$mpg and am_0$mpg
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.209684 11.280194
## sample estimates:
## mean of x mean of y 
##  24.39231  17.14737

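In mtcars, the mean mileage is 17.147 mpg for automatic transmissions and 24.392 mpg for manual transmissions. The 95% confidence interval for the difference in mean gas mileage (manual minus automatic) runs from 3.2097 to 11.2802 mpg. Because the interval excludes 0, the data indicate a difference in mean mileage at the 5% significance level.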
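The output above is labelled "Welch Two Sample t-test": by default, t.test does not assume equal variances in the two groups. The following sketch reproduces the statistic, the Welch-Satterthwaite degrees of freedom, and the interval by hand from the built-in mtcars data. The variable names are illustrative.

```r
# Split the built-in mtcars data by transmission type.
manual    <- mtcars$mpg[mtcars$am == 1]
automatic <- mtcars$mpg[mtcars$am == 0]

# Estimated variance of each sample mean (s^2 / n).
v1 <- var(manual) / length(manual)
v2 <- var(automatic) / length(automatic)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
se     <- sqrt(v1 + v2)
t_stat <- (mean(manual) - mean(automatic)) / se
df_w   <- (v1 + v2)^2 /
  (v1^2 / (length(manual) - 1) + v2^2 / (length(automatic) - 1))

# 95% confidence interval for the difference in means.
ci <- (mean(manual) - mean(automatic)) + c(-1, 1) * qt(0.975, df_w) * se

c(t = t_stat, df = df_w, lower = ci[1], upper = ci[2])
```

The fractional degrees of freedom (about 18.3 rather than 13 + 19 - 2 = 30) come from the Welch-Satterthwaite approximation; passing var.equal = TRUE to t.test would request the pooled-variance test instead.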

7.3 Comparison of Two Population Proportions

A survey conducted in two distinct populations will generally produce different results. It is often necessary to compare the survey response proportion between the two populations. Here, we assume that the samples are large enough for the sample proportions to be approximately normally distributed.

Example

In the data set quine from the MASS package, children from an Australian town are classified by ethnic background, gender, age, learning status and the number of days absent from school.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
head(quine) 
##   Eth Sex Age Lrn Days
## 1   A   M  F0  SL    2
## 2   A   M  F0  SL   11
## 3   A   M  F0  SL   14
## 4   A   M  F0  AL    5
## 5   A   M  F0  AL    5
## 6   A   M  F0  AL   13

In effect, the data frame column Eth indicates whether the student is Aboriginal or Not (“A” or “N”), and the column Sex indicates Male or Female (“M” or “F”).

In R, we can tally the student ethnicity against the gender with the table function. As the result shows, 38 students in the Aboriginal student population are female, whereas 42 students in the Non-Aboriginal student population are female.

table(quine$Eth, quine$Sex) 
##    
##      F  M
##   A 38 31
##   N 42 35

Assuming that the sample proportions in quine are approximately normally distributed, find the 95% confidence interval estimate of the difference between the female proportion of Aboriginal students and the female proportion of Non-Aboriginal students, each within their own ethnic group.

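We apply the prop.test function to compute the difference in female proportions. Given a two-column table, prop.test treats each row as a group and the first column (here F) as the count of successes. Yates's continuity correction is disabled for pedagogical reasons.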

prop.test(table(quine$Eth, quine$Sex), correct=FALSE) 
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  table(quine$Eth, quine$Sex)
## X-squared = 0.0040803, df = 1, p-value = 0.9491
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.1564218  0.1669620
## sample estimates:
##    prop 1    prop 2 
## 0.5507246 0.5454545

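The 95% confidence interval estimate of the difference between the female proportion of Aboriginal students and the female proportion of Non-Aboriginal students is between -15.6% and 16.7%. Because the interval contains 0, the data do not indicate a difference between the two proportions.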
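The interval and the X-squared statistic above can be reproduced by hand from the four counts in the table. The following sketch uses the usual large-sample formulas, with an unpooled standard error for the interval and a pooled standard error for the test. The variable names are illustrative.

```r
# Counts taken from the table above.
x <- c(A = 38, N = 42)            # females in each ethnic group
n <- c(A = 38 + 31, N = 42 + 35)  # group sizes

# Sample proportions and their difference (Aboriginal minus Non-Aboriginal).
p_hat  <- x / n
p_diff <- p_hat[["A"]] - p_hat[["N"]]

# Unpooled standard error, used for the confidence interval.
se_ci <- sqrt(sum(p_hat * (1 - p_hat) / n))
ci    <- p_diff + c(-1, 1) * qnorm(0.975) * se_ci

# Pooled standard error, used for the test of H0: p1 = p2.
p_pool  <- sum(x) / sum(n)
se_test <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z       <- p_diff / se_test

c(lower = ci[1], upper = ci[2], z_squared = z^2)
```

The squared z statistic equals the X-squared value reported by prop.test when the continuity correction is turned off, which is why the two approaches give the same p-value.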