Chapter 10 Hypothesis Testing with Two Groups

10.1 Getting Ready

This chapter extends the discussion of hypothesis testing to include the comparison of means and proportions from population subgroups. To follow along in R, you should load the anes20.rda data set. We will be using a lot of techniques from separate packages, so also make sure you attach the following libraries: DescTools, Hmisc, gplots, descr, and effectsize.

10.2 Testing Hypotheses about Two Means

A common use of hypothesis testing involves examining the difference between two sample means. For instance, instead of testing hypotheses about the level of support for a political candidate or some group in the population as a whole, it is often more interesting to speculate about differences in support across population subgroups, such as men and women, whites and people of color, city dwellers and suburbanites, religious and secular voters, etc. When comparing subgroups like these, we are really asking if variables such as sex, race, place of residence, and religiosity influence or are related to differences in the dependent variable.

Let’s begin looking at sex-based differences in political attitudes, commonly referred to as the gender gap. We’ll start with a somewhat obvious dependent variable for testing the presence of a gender gap in political attitudes, the feeling thermometer rating for feminists. As a quick reminder, the feeling thermometers in the American National Election Study surveys ask people to rate how they feel about certain groups and individuals on a 0 (negative, “cool” feelings) to 100 (positive, “warm” feelings) scale. Which group do you think is likely to rate feminists the highest, men or women?28 Although both women and men can be feminists (or anti-feminists), the connection between feminism and the fight for the rights of women leads quite reasonably to the expectation that women support feminists at higher levels than men do.

Before shedding light on this with data, let’s rename the ANES measures for respondent sex (anes20$V201600) and the feminist feeling thermometer (anes20$V202160) so they are a bit easier to use in subsequent commands:

#Create new respondent sex variable
anes20$Rsex<-factor(anes20$V201600)
#Assign category labels
levels(anes20$Rsex)<-c("Male", "Female")
#Create new feminist feeling thermometer variable
anes20$femFT<-anes20$V202160

10.2.1 Generating Subgroup Means

There are a couple of relatively simple ways to examine subgroup means, in this case the mean levels of support for Feminists among men and women. First, you can use the aggregate function to get the subgroup means or many other statistics. Since we are thinking of subgroup analysis as a way of testing hypotheses, it is useful to think of the format of this function as: aggregate(dependent, by=list(independent), FUN=stat_you_want). In this case, we are telling R to generate the mean outcomes of the dependent variable for different categories of the independent variable. For the gender gap in the Feminist Feeling Thermometer:

#Store the mean Feminist FT, by sex, in a new object
agg_femFT <-aggregate(anes20$femFT, by=list(anes20$Rsex), 
                      FUN=(mean), na.rm=TRUE)
#List the results of the aggregate command
agg_femFT
  Group.1     x
1    Male 54.54
2  Female 62.55

What this table shows is that the average feeling thermometer for feminists was 62.55 for women and 54.54 for men, a difference of about eight points on a scale from 0 to 100. So, it looks like there is a difference in attitudes toward feminists, with women viewing them more positively than men. At the same time, it is important to note that both groups, on average, have positive feelings toward feminists.
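As a small, self-contained sketch of the same idea, the example below uses the formula interface to aggregate on a made-up data frame. The object toy and its columns score and group are invented for illustration and are not part of the ANES data. One convenience of the formula interface is that the output columns are named after the variables rather than Group.1 and x.

```r
#Toy data (invented for illustration only)
toy <- data.frame(score=c(50, 60, 70, 40, 80, 90),
                  group=factor(c("A", "A", "A", "B", "B", "B")))
#Formula interface: dependent ~ independent
agg_toy <- aggregate(score ~ group, data=toy, FUN=mean)
#List the subgroup means
agg_toy
```

The same pattern should carry over to the ANES variables, for example aggregate(femFT ~ Rsex, data=anes20, FUN=mean), assuming the variables were created as shown above.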

Another useful R function we can use to compare the means of these two groups is compmeans, which produces subgroup means and standard deviations, as well as a boxplot that shows the distribution of the dependent variable for each value of the independent variable. The format for this command is: compmeans(dependent, independent). You should also include axis labels and other plot commands (see below) since a boxplot is included by default (you can suppress it with plot=F). When you run this command, you will get a warning message telling you that there are missing data. Don’t worry about this for now unless the number of missing cases seems relatively large compared to the sample size.

#List the dependent variable first, then the independent variable.
#Add graph commands
compmeans(anes20$femFT, anes20$Rsex,
          xlab="Sex", 
          ylab="Feminist Feeling Thermometer",
          main="Feminist Feeling Thermometer, by Sex") 
Warning in compmeans(anes20$femFT, anes20$Rsex, xlab = "Sex", ylab = "Feminist
Feeling Thermometer", : 1007 rows with missing values dropped

Mean value of "POST: Feeling thermometer: feminists" according to "anes20$Rsex"
        Mean    N Std. Dev.
Male   54.54 3321     26.11
Female 62.55 3952     26.75
Total  58.89 7273     26.76

First, note that the subgroup means are the same as those produced using the aggregate command. The key difference in the numeric output is that we also get information on the standard deviation in both subgroups, as well as the mean and standard deviation for the full sample. In addition, the boxplot produced by the compmeans command provides a visualization of the two distributions side-by-side, giving us a chance to see how similar or dissimilar they are. Remember, the box plots do not show the differences in means, but they do show differences in medians and interquartile ranges, both of which can be indicative of group-based differences in outcomes on the dependent variable. The side-by-side distributions in the box plot do appear to be different, with the distribution of outcomes concentrated a bit more at the high end of the feeling thermometer among women than among men.

So, it looks like there is a gender gap in attitudes toward feminists, with women rating feminists about eight points (8.01) higher than men rate them. But there is a potential problem with this conclusion. The main problem is that we are using sample data and we know from sampling theory that it is possible to find a difference in the sample data even if there really is no difference in the population. The question we need to answer is whether the difference we observe in this sample is large enough that we can reject the possibility that there is no difference between the two groups in the population. In other words, we need to expand the logic of hypothesis testing developed earlier to incorporate differences between two sample means.

10.3 Hypothesis Testing with Two Means

When comparing means, the language of hypothesis testing changes just a bit.

H0:\(\mu_1=\mu_2\)     There is no relationship. The means are equal in the population.

Alternative hypotheses state that there is a relationship in the population:

H1:\(\mu_1\ne\mu_2\)     The means differ in the population (two-tailed).

H1:\(\mu_1<\mu_2\)     There is a negative difference in the population (one-tailed).

H1:\(\mu_1>\mu_2\)      There is a positive difference in the population (one-tailed).

The logic of hypothesis testing for mean differences is very much the same as that for a single mean: If we observe a difference between two sample subgroup means, we must ask if there is really a difference between these groups in the population, or if the sample difference is due to random variation. In other words, is the difference large enough that we can attribute it to something other than sampling error? If so, then we can reject the null hypothesis.

The difference between the two sample means is a sample statistic, so the sampling distribution for the difference between the two groups (\(\bar{x}_1 -\bar{x}_2\)) has the same properties as the sampling distribution for a single mean: if the sample difference comes from a large, random sample, the sampling distribution will follow a normal curve and the mean will equal the difference between the two subgroups in the population (\(\mu_1-\mu_2\)).
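To make this concrete, here is a small simulation sketch. It assumes two hypothetical subgroups drawn from the same population (mean 50, standard deviation 25), so the true difference is zero. The sample sizes, the population values, and the number of replications are all arbitrary choices made for illustration.

```r
#Simulation sketch: both groups come from the same population
set.seed(123)
n1 <- 200; n2 <- 300   #hypothetical subgroup sample sizes
diffs <- replicate(5000,
                   mean(rnorm(n1, mean=50, sd=25)) -
                   mean(rnorm(n2, mean=50, sd=25)))
#Center of the simulated sampling distribution (should be near 0)
mean(diffs)
#Spread of the simulated sampling distribution
sd(diffs)
#Spread implied by the group variances and sample sizes
sqrt(25^2/n1 + 25^2/n2)
```

Individual sample differences are often a point or two away from zero even though there is no difference in the population, which is exactly why a formal test is needed. Running hist(diffs) should show the roughly normal shape of the distribution.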

10.3.1 A Theoretical Example

The figure below extends the example used in Chapter 9 to illustrate the logic of hypothesis testing for mean differences. If H0 is true, and there is no difference between the two groups in the population, how likely is it that we would get a sample difference of the magnitude of \(\bar{x}_1 -\bar{x}_2\)?

Figure 10.1: The Logic of Hypothesis Testing

To answer this, we need to transform \(\bar{x}_1 -\bar{x}_2\) from a raw score for the difference into a standard score difference (t or z-score). Since we are working with a sample, we focus on t-scores in this application. Recall, though, that the calculation for a t-score is the same as for a z-score:

\[t=\frac{(\bar{x}_{1}-\bar{x}_{2})-(\mu_1-\mu_2)}{S_{\bar{x}_{1}-\bar{x}_{2}}}\]

However, since the null hypothesis states that \(\mu_{1} - \mu_{2}\) = 0, we are asking if the sample finding is different from 0, and the equation becomes:

\[t=\frac{(\bar{x}_{1}-\bar{x}_{2})}{S_{\bar{x}_{1}-\bar{x}_{2}}}\]

So, as we did with a single mean, we divide the raw score difference by the standard error of the sampling distribution to convert the raw difference into a t-score. The standard error of the difference is a function of the variance in both subgroups, along with the sample sizes. Since we do not know the population variances, we rely on sample variances to estimate them:

\[S_{\bar{x}_{1}-\bar{x}_{2}}=\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}\]

Where \(S_1^2\) and \(S_2^2\) are the sample variances of the sub-groups.

The standard error represents the standard deviation of the sampling distribution that would result from repeated (large, random) samples from which we would calculate differences between the group means. When we calculate a t-score based on this we are asking how many standard errors our sample finding is from the population parameter (\(\mu_1 - \mu_2 = 0\)).

In the theoretical figure above, the t-score for the difference is -1.96. What is the probability of getting a t-score of this magnitude if there is really no difference between the two groups? That probability is equal to the area to the left of t=-1.96 (the same for both t- and z-scores with a large sample), which is .025. So, with a one-tailed hypothesis we would reject H0 because a difference this far below zero would occur less than 5% of the time if H0 were true.
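The tail area can be checked directly in R. This is a minimal sketch; the degrees of freedom used for the t-distribution (5000) is an arbitrary large value chosen only to show that the two distributions give nearly the same answer.

```r
#Area to the left of -1.96 under the z-distribution
p_z <- pnorm(-1.96)
p_z
#Same area under a t-distribution with many degrees of freedom
#(df=5000 is an arbitrary large value chosen for illustration)
p_t <- pt(-1.96, df=5000)
p_t
```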

10.3.2 Returning to the Empirical Example

Okay, now let’s apply this to the gender-based differences in feeling thermometer rating for feminists. The null hypothesis, of course, states that there is no difference between the groups.

H0: \(\mu_{W} = \mu_{M}\)

What about the alternative hypothesis? Do we expect the mean for women to be higher or lower than that of men? Since high values signify more positive evaluations, I anticipate that the mean for women is higher than the mean for men:

H1: \(\mu_{W}>\mu_{M}\)

We can use the same process we used to test hypotheses about single means.

  1. Choose a p-value (\(\alpha\) area) for determining level of statistical significance required for rejecting \(H_0\). (Usually .05)

  2. Find the critical value of t associated with \(\alpha\) (depends on degrees of freedom)

When testing the difference between two means, degrees of freedom is equal to \(n-2\), reflecting the fact that we are using information from two means instead of one. In this case, the degrees of freedom is a very large number (\(7273-2=7271\)), and the critical value for a one-tailed test is -1.645, essentially the same as if we were using the z-distribution. Recall that as the sample size increases the t-distribution grows increasingly similar to the z-distribution.

#Get critical values (t) for p=.05, df=7271
qt(.05, 7271)
[1] -1.645

  3. Calculate the t-score from the sample data.

  4. Compare t-score to the critical value. If \(|t| > |c.v.|\), then reject \(H_0\); if \(|t| < |c.v.|\), then fail to reject.
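Step 2 above notes that the critical value depends on degrees of freedom, and that with a large sample it is essentially the same as the z-score. The sketch below illustrates this with a handful of arbitrarily chosen degrees of freedom, ending with the value used in this example.

```r
#One-tailed critical values (lower tail, p=.05) for increasing df
dfs <- c(10, 30, 100, 1000, 7271)
t_crit <- qt(.05, dfs)
round(t_crit, 3)
#Corresponding critical value from the z-distribution
z_crit <- qnorm(.05)
round(z_crit, 3)
```

The critical values shrink toward the z-distribution value as the degrees of freedom grow, so the choice between t and z matters mostly for small samples.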

10.3.3 Calculating the t-score

First, we’ll plug the appropriate numbers into the t-score formula to illustrate a bit more concretely how we arrive at the t-score for the difference. All of the input for these calculations are taken from the compmeans results.

\[t=\frac{54.54-62.55}{\sqrt{\frac{26.11^2}{3321}+\frac{26.75^2}{3952}}}=\frac{-8.01}{.6216}=-12.89\]

You may have noticed that I subtracted the mean for women from the mean for men in the numerator, leading to a negative t-score. The reason for this is that the R function for conducting t-tests subtracts the second value it encounters from the first value by default, so the calculation above is set up to reflect what we should expect to find when using R to do the work for us. The negative value makes sense in the context of our expectations, since it means that the value for women is higher than the value for men.

In this case, the t-score far exceeds the critical value (-1.645), so we reject the null hypothesis and conclude that there is a gender gap in evaluations of feminists on the feeling thermometer scale, with women providing higher ratings of feminists than those provided by men.
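The same arithmetic can be done in R using only the summary statistics reported by compmeans. This is a sketch for checking the hand calculation; the object names are made up for this example.

```r
#Summary statistics copied from the compmeans output
m_male <- 54.54; sd_male <- 26.11; n_male <- 3321
m_female <- 62.55; sd_female <- 26.75; n_female <- 3952
#Standard error of the difference
se_diff <- sqrt(sd_male^2/n_male + sd_female^2/n_female)
se_diff
#t-score (men minus women, to match the order R uses)
t_score <- (m_male - m_female)/se_diff
t_score
#Is the t-score beyond the one-tailed critical value?
t_score < qt(.05, n_male + n_female - 2)
```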

T-test in R. The R command for conducting a t-test (t.test) is straightforward and easy to use. The format is t.test(dependent~independent). The ~ symbol is used in this and other functions to signal that you are using a formula that specifies a dependent and independent variable.

#use t.test to get t-score for Feminist FT by Sex
t.test(anes20$femFT~anes20$Rsex)

    Welch Two Sample t-test

data:  anes20$femFT by anes20$Rsex
t = -13, df = 7111, p-value <2e-16
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -9.231 -6.794
sample estimates:
  mean in group Male mean in group Female 
               54.54                62.55 

There are a few important things to pick up on here. First, the reported t-score (-13) is very close to the one we calculated (-12.89); the difference is due to rounding in the R output.

Second, the reported p-value is less than 2e-16. Recall from earlier that scientific notation like this is used as a shortcut when the actual numbers have several digits. In this case, the notation means that the p-value is less than .0000000000000002. Since R uses a two-tailed test by default, this is the total area under the curve at the two tails combined (to the outside of t=-13 and t=+13). This means that there is virtually no chance of drawing a sample difference this large from a population in which the null hypothesis is true. Of course, for a one-tailed test, we still reject the null hypothesis since the p-value is even lower.

Third, the t.test output also provides a 95% confidence interval around the sample estimate of the difference between the two groups. The way to interpret this is that based on these results, you can be 95% certain that in the population the gender gap in ratings of feminists is between -9.231 and -6.794 points (the values are negative because the mean for women is subtracted from the mean for men). Importantly, since the confidence interval does not include 0 (no difference), we can also use this as a basis for rejecting the null hypothesis.

Finally, you probably noticed that the reported degrees of freedom (7111) is different from what I calculated above (7271). This is because the standard two-sample t-test assumes that the variances in the two groups are the same. If they are not, then some corrections need to be made, including adjustments to the degrees of freedom. The Welch’s two-sample test used by R does not assume that the two group variances are the same and, by default, always makes the correction. In this case, with such a large sample, the findings are not really affected by the correction other than the degrees of freedom. The correction makes the Welch’s two-sample test a slightly more conservative test, which I see as a virtue.
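For readers curious about where the adjusted degrees of freedom come from, the sketch below applies the Welch-Satterthwaite approximation to the summary statistics from the compmeans output. The object names are made up for this example, and because the inputs are rounded, the result should match the t.test output only approximately.

```r
#Summary statistics copied from the compmeans output
sd_male <- 26.11; n_male <- 3321
sd_female <- 26.75; n_female <- 3952
#Each group's contribution to the squared standard error
v_male <- sd_male^2/n_male
v_female <- sd_female^2/n_female
#Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_male + v_female)^2 /
  (v_male^2/(n_male - 1) + v_female^2/(n_female - 1))
df_welch
#Two-tailed p-value for t=-12.89 with these degrees of freedom
2*pt(-12.89, df_welch)
```

When the two group variances and sample sizes are similar, as they are here, the adjusted value stays close to \(n-2\).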

In the output below, I run the same t-test but specify a one-tailed test (alternative="less") and assume that the variances are equal (var.equal=T). As you can see, the results are virtually identical, except that now \(df=7271\) and the confidence interval is now negative infinity to -6.987.

#t.test with one-tailed test and equal variance
t.test(anes20$femFT~anes20$Rsex, var.equal=T, alternative="less")

    Two Sample t-test

data:  anes20$femFT by anes20$Rsex
t = -13, df = 7271, p-value <2e-16
alternative hypothesis: true difference in means between group Male and group Female is less than 0
95 percent confidence interval:
   -Inf -6.987
sample estimates:
  mean in group Male mean in group Female 
               54.54                62.55 

Let’s take a quick look at another application of the gender gap, but this time using a less obvious dependent variable, the feeling thermometer for “Big Business” (anes20$V202163). While this variable is not as directly connected to gender-related issues, it is reasonable to expect that female respondents are less supportive of big business than are male respondents if for no other reason than that women tend to be more liberal than men. Again, though, this connection is not as obvious as it was in the previous example.

#Create new object, "anes20$busFT"
anes20$busFT<-anes20$V202163
#T-test for sex-based differences in 'busFT'
t.test(anes20$busFT~anes20$Rsex)

    Welch Two Sample t-test

data:  anes20$busFT by anes20$Rsex
t = 2.1, df = 7015, p-value = 0.04
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 0.07143 2.16569
sample estimates:
  mean in group Male mean in group Female 
               48.41                47.29 

First, there is a statistically significant difference in support for big business between male and female respondents. The t-score is 2.1 and the p-value is .04 (less than .05), so we reject the null hypothesis. The average rating among male respondents is 48.41 and the average among female respondents is 47.29, a difference of 1.12. The 95% confidence interval for this difference ranges from .07143 to 2.16569. As expected, given the p-value, this confidence interval does not include 0. There is a relationship between sex and support for big business.

10.3.4 Statistical Significance vs. Effect Size

The example of attitudes toward big business illustrates an important issue related to statistical significance: sometimes, statistically significant findings represent relatively small substantive effects. In this case, we have a statistically significant difference between two groups, but that difference is only 1.12 on a dependent variable that is scaled from 0 to 100. Yes, male and female respondents hold different attitudes toward big business, but just barely different!

Let’s put this finding in the context of the sample size and other results. As we discussed earlier in Chapter 8, two important factors that influence the size of the t-score are the magnitude of the effect and the sample size. As a consequence, when the sample size is very large, as in this case (n>7200), the standard error for the difference between two groups is so small that sometimes even relatively trivial subgroup differences are statistically significant; the relationship exists, but it is of little consequence. We can appreciate this by comparing this result to the earlier example using the feminist feeling thermometer, which also used the same 0-to-100 scale. The difference between men and women on the feminist feeling thermometer was 8.01, more than seven times the size of the difference in ratings of big business, yet both findings are statistically significant. This is an important issue in statistical analysis, as it is often the case that the focus on statistical significance leaves substantive importance unattended to.
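A quick sketch can show how much the sample size matters here. It holds the observed difference and standard deviations from the big business example fixed and recomputes the t-score for several hypothetical sample sizes (the same number of respondents in each group, chosen arbitrarily for illustration).

```r
#Hold the difference and standard deviations fixed (values from the
#big business example) and vary a hypothetical per-group sample size
diff_bus <- 48.41 - 47.29
sd_m <- 23.01; sd_f <- 22.38
n_per_group <- c(100, 500, 1000, 3600)
t_by_n <- diff_bus/sqrt(sd_m^2/n_per_group + sd_f^2/n_per_group)
round(data.frame(n_per_group, t_by_n), 2)
```

Under these assumptions, only the largest sample size pushes the t-score past 1.96. The size of the difference never changes; only our ability to detect it does.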

What this discussion points to is the need to complement the findings related to statistical significance with a measure of the size of the effect. One such statistic that is used a lot in conjunction with t-tests is Cohen’s D. The most direct way to calculate D is to express the difference between the two group means relative to the size of the pooled standard deviation.29

\[D=\frac{\bar{x}_1 -\bar{x}_2}{S}\]

All of this information can be obtained from the following compmeans output:

#Get means and standard deviations using 'compmeans'
compmeans(anes20$femFT,anes20$Rsex, plot=F)
Mean value of "POST: Feeling thermometer: feminists" according to "anes20$Rsex"
        Mean    N Std. Dev.
Male   54.54 3321     26.11
Female 62.55 3952     26.75
Total  58.89 7273     26.76
compmeans(anes20$busFT,anes20$Rsex, plot=F)
Mean value of "POST: Feeling thermometer: big business" according to
"anes20$Rsex"
        Mean    N Std. Dev.
Male   48.41 3337     23.01
Female 47.29 3953     22.38
Total  47.80 7290     22.67

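The values for Cohen’s D in the Feminist and Big Business feeling thermometer examples are calculated below, using the full-sample (Total) standard deviation from each table as a close stand-in for the pooled standard deviation: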

#Cohen's D for Feminist FT
(54.54-62.55)/26.76
[1] -0.2993
#Cohen's D for Big Business FT
(48.41-47.29)/22.67
[1] 0.0494
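The calculations above divide by the standard deviation reported in the Total row. As a sketch of the pooled version, the code below combines the two subgroup standard deviations, weighting each by its degrees of freedom. The helper pooled_sd is a hypothetical function written for this example, not part of any package.

```r
#Hypothetical helper: pooled SD from group SDs and sample sizes
pooled_sd <- function(s1, n1, s2, n2) {
  sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2)/(n1 + n2 - 2))
}
#Feminist FT
sp_fem <- pooled_sd(26.11, 3321, 26.75, 3952)
d_fem <- (54.54 - 62.55)/sp_fem
#Big Business FT
sp_bus <- pooled_sd(23.01, 3337, 22.38, 3953)
d_bus <- (48.41 - 47.29)/sp_bus
#Pooled SDs and Cohen's D values
round(c(sp_fem, d_fem, sp_bus, d_bus), 3)
```

The pooled standard deviation is slightly smaller than the full-sample standard deviation whenever the group means differ, because the full-sample figure also includes the variation between groups. With differences as modest as these, the two approaches give nearly the same D.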

Let’s check our work with the R command for getting Cohen’s D:

#Get Cohens D for impact of sex on 'femFT' and 'busFT'
cohens_d(anes20$femFT~anes20$Rsex)
Cohen's d |         95% CI
--------------------------
-0.30     | [-0.35, -0.26]

- Estimated using pooled SD.
cohens_d(anes20$busFT~anes20$Rsex)
Cohen's d |       95% CI
------------------------
0.05      | [0.00, 0.10]

- Estimated using pooled SD.

As expected, the effect size is much greater for the gender gap in the feminist feeling thermometer than it is for the big business feeling thermometer. Notice also that the R output shows a negative effect for the feminist feeling thermometer model and a positive effect for the big business model, reflecting the direction of the difference between group means.

Although we have shown that the impact of respondent sex is much greater when looking at the feminist feeling thermometer than when using the big business feeling thermometer, this still doesn’t tell us if the effect is strong or weak, other than in comparison to the meager impact in the case of the big business feeling thermometer. In absolute terms, how strong is this effect? Does \(d=-.30\) indicate a strong or weak effect on its own? The table below provides some conventional guidelines for evaluating the substantive meaning of Cohen’s D values.

Table 10.1. Cohen’s D and Effect Size Interpretations

Cohen’s D   Effect Size
.1
.2          Small
.3
.4
.5          Medium
.6
.7
.8          Large

Note that these are all positive values, but the finding of \(D=-.3\) should be treated the same as \(D=.3\). Using the guidelines in Table 10.1, it is fair to describe the impact of respondent sex on the feminist feeling thermometer rating as small while the impact on big business ratings is tiny at best.
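If you find yourself applying these guidelines repeatedly, a small helper can do the lookup. The sketch below is a hypothetical function written for this example. It treats .2, .5, and .8 as lower bounds for each label, and the label "below small" for values under .2 is my own shorthand rather than a standard term.

```r
#Hypothetical helper: attach a conventional label to Cohen's D
label_effect <- function(d) {
  cut(abs(d),                       #ignore the sign of D
      breaks=c(0, .2, .5, .8, Inf),
      labels=c("below small", "small", "medium", "large"),
      right=FALSE)                  #intervals are [lower, upper)
}
#Apply to D values of -.30, .05, and -.10
label_effect(c(-.30, .05, -.10))
```

Keep in mind that these cutoffs are conventions, not hard rules, so a label should supplement rather than replace attention to the scale of the dependent variable.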

10.4 Difference in Proportions

Finally, we can also extend difference in means hypothesis testing to differences between group proportions. For instance, suppose I’m interested in the gender gap in abortion attitudes, a topic that is frequently assumed to be an important issue on which men and women disagree. The ANES has a few variables measuring abortion attitudes, including one that it has used for the past few decades. I personally think this variable is a bit hard to use in its original state. Have a look at the categories for yourself. Respondents are asked which of the listed positions best agrees with their position on abortion.

   “1. By law, abortion should never be permitted”

   “2. The law should permit abortion only in case of rape, incest, or when the woman’s life is in danger”

   “3. The law should permit abortion other than for rape/incest/danger to woman but only after need clearly established”

   “4. By law, a woman should always be able to obtain an abortion as a matter of personal choice”

   “Other”

One thing we can do with this variable is focus on whether respondents think abortion should never be permitted (the first category) and create a new variable distinguishing between those who do and do not think abortion should be banned. To do this, I relabeled the categories and then created a numeric dichotomous variable scored 1 for those who think abortion should never be permitted and 0 for all other responses.

#Create abortion attitude variable
anes20$banAb<-factor(anes20$V201336)
#Change levels to create two-category variable
levels(anes20$banAb)<-c("Illegal","Other","Other","Other","Other")
#Create numeric indicator for "Illegal"
anes20$banAb.n<-as.numeric(anes20$banAb=="Illegal")

The mean of this variable is the proportion who think abortion should never be permitted. Based on conventional wisdom, and on the gender gaps reported earlier in this chapter, the expectation is that the proportion of women who think abortion should never be permitted is lower than the proportion of men who support this position.

Since these means are actually proportions:

H0: \(P_{W}=P_{M}\)

H1: \(P_{W}<P_{M}\)

Let’s see what the sample statistics tell us about the sex-based difference in the proportions who support banning all abortions. In this case, we suppress the boxplot because it is not a useful tool for examining variation in dichotomous outcomes (Go ahead and generate a boxplot if you want to see what I mean).

#Generate means (proportions), by sex
compmeans(anes20$banAb.n, anes20$Rsex, plot=F)
Warning in compmeans(anes20$banAb.n, anes20$Rsex, plot = F): 130 rows with
missing values dropped
Mean value of "anes20$banAb.n" according to "anes20$Rsex"
         Mean    N Std. Dev.
Male   0.1041 3728    0.3054
Female 0.1074 4422    0.3097
Total  0.1059 8150    0.3077

Two things stand out from this table. First, banning abortions completely is not a popular position; only 10.59% of respondents support this position. Second, there doesn’t seem to be much real difference between men and women on this issue (just .0033), and women are ever-so-slightly more likely than men to take this position.
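Two properties of 0/1 variables are worth seeing in action: the mean is the proportion of 1s, and the standard deviation is approximately \(\sqrt{p(1-p)}\). The sketch below uses a made-up indicator (x) for the first point and the proportion for men from the compmeans table for the second.

```r
#A made-up 0/1 indicator: the mean is the proportion of 1s
x <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
mean(x)
#For a proportion p, the standard deviation is roughly sqrt(p*(1-p))
p_male <- .1041
sd_male <- sqrt(p_male*(1 - p_male))
sd_male
```

The second result should be very close to the standard deviation reported for men in the table, which is why the standard error for a difference in proportions can be built from the proportions alone.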

Still, even with a difference so small, the question for us is whether this represents a real difference in the population or is due to random error. Given the sample size, even small differences could be statistically significant. To figure this out, we need to estimate the probability of getting a sample difference of this magnitude from a population in which there is no difference between the two groups.30

So, we need to go through the process again of calculating a t-score for the difference between the two groups and comparing it to the critical value (1.65 or 1.96). The formula should look very familiar to you:

\[t=\frac{(p_1-p_2)-(P_1-P_2)}{S_{p_1-p_2}}=\frac{(p_1-p_2)}{S_{p_1-p_2}}= \frac{.1041-.1074}{S_{p_1-p_2}}\]

Fair enough, this all looks good. We just divide the difference between the two sample proportions by the standard error of the difference. It gets a bit more complicated when calculating the standard error of the difference:

\[S_{p_1-p_2}= \sqrt{p_u(1-p_u)} * \sqrt{\frac{N_1+N_2}{N_1N_2}}\]

Here, \(p_u\) is the estimate of the population proportion (\(P\)), which we can get from the compmeans table. The proportion in the full sample supporting a ban on abortions is .1059, so \(S_{p_1-p_2}\) is:

#Calculate the standard error for the difference
sqrt(.1059*(1-.1059))*sqrt((3728+4422)/(3728*4422))
[1] 0.006842

\(S_{p_1-p_2}\)=.006842, so:

\[t=\frac{-.0033}{.006842}=-.4823\]
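The whole calculation can be collected in a few lines of R. This sketch uses only the proportions and sample sizes from the compmeans table; the object names are made up for this example. Because the numerator subtracts the proportion for women from the proportion for men, the resulting t-score is negative.

```r
#Group proportions and sample sizes from the compmeans table
p_m <- .1041; n_m <- 3728
p_f <- .1074; n_f <- 4422
#Pooled proportion (weighted average of the two group proportions)
p_u <- (n_m*p_m + n_f*p_f)/(n_m + n_f)
#Standard error of the difference
se_p <- sqrt(p_u*(1 - p_u))*sqrt((n_m + n_f)/(n_m*n_f))
#t-score (men minus women)
t_p <- (p_m - p_f)/se_p
round(c(p_u, se_p, t_p), 4)
```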

Okay, now let’s see what R tells us. First, we will use a t-test and treat this as just another difference in means test:

#Test for difference in banAb.n, by sex
t.test(anes20$banAb.n~anes20$Rsex)

    Welch Two Sample t-test

data:  anes20$banAb.n by anes20$Rsex
t = -0.49, df = 7952, p-value = 0.6
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -0.01674  0.01006
sample estimates:
  mean in group Male mean in group Female 
              0.1041               0.1074 

It looks like our calculations were just about spot on. There is no significant relationship between respondent sex and supporting a ban on abortions. None whatsoever. The t-score is only -.49, far smaller in absolute value than a critical value of either 1.96 or 1.65, and the reported p-value is .60, meaning there is a pretty good chance of drawing a sample difference of this magnitude from a population in which there is no real difference. This is also reflected in the confidence interval for the difference (-.0167 to .01), which includes the value of 0 (no difference). So, we fail to reject the null hypothesis.
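Base R also has a function designed specifically for comparing proportions, prop.test, which takes counts rather than a 0/1 variable. The sketch below is only an approximation: the counts are rebuilt by multiplying the rounded proportions by the group sizes, which is an assumption on my part since the exact counts are not shown in the output above.

```r
#Approximate counts rebuilt from the rounded proportions and group sizes
#(an assumption: the exact counts are not reported in the text)
ban_counts <- round(c(.1041*3728, .1074*4422))
group_n <- c(3728, 4422)
#Two-sample test of equal proportions, without continuity correction
pt_ban <- prop.test(x=ban_counts, n=group_n, correct=FALSE)
pt_ban$p.value
#The square root of the chi-squared statistic is comparable to |t|
sqrt(unname(pt_ban$statistic))
```

Both approaches should lead to the same conclusion here: the difference between men and women is well within what we would expect from sampling error.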

What about conventional wisdom? Doesn’t everyone know that there is a huge gender gap on abortion? Sometimes, conventional wisdom meets data and conventional wisdom loses. Results similar to the one presented above are not unusual in quantitative studies of public opinion on this issue. Sometimes there is no gender gap, and sometimes there is a gap, but it tends to be a small one. For instance, if we focus on the other end of the original abortion variable and create a dichotomous variable indicating those who think abortion generally should be available as a matter of choice, we find a significant gender gap:

#Create "choice" variable
anes20$choice<-factor(anes20$V201336)
#Change levels to create two-category variable
levels(anes20$choice)<-c("Other","Other","Other","Choice by Law","Other")
#Create numeric indicator for "Choice by law"
anes20$choice.n<-as.numeric(anes20$choice=="Choice by Law")
#Test for differences in "choice.n", by sex
t.test(anes20$choice.n~anes20$Rsex)

    Welch Two Sample t-test

data:  anes20$choice.n by anes20$Rsex
t = -4.5, df = 7923, p-value = 0.000008
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -0.07134 -0.02782
sample estimates:
  mean in group Male mean in group Female 
              0.4622               0.5118 

Here, we see that there is a statistically significant difference between male and female respondents on this position, with about 46% of men and 51% of women favoring abortion availability as a matter of choice.31 But the substantive difference is not very large; a bit less than half the male respondents and just more than half the female respondents support this position. We can confirm the limited effect size with Cohen’s D:

#Get Cohens D for impact of sex on "choice.n"
cohens_d(anes20$choice.n~anes20$Rsex)
Cohen's d |         95% CI
--------------------------
-0.10     | [-0.14, -0.06]

- Estimated using pooled SD.

Yep, \(D=-.10\) confirms that though this is a statistically significant relationship, it is not a very strong one. So, we have one non-significant finding and one that is significant but weak. This combination of findings is in keeping with research in this area. We will return to this issue in Chapter 13, where we will utilize information from all categories of the abortion variable to provide a more thorough evaluation of the relationship between sex and abortion attitudes.

10.5 Plotting Mean Differences

As you saw earlier in the chapter, you can get a boxplot comparison of the two groups with the compmeans command. This mode of comparison is useful for getting a sense of where the middle of the distributions are located, as well as how much the distributions overlap for the two groups. One thing that’s missing from a side-by-side boxplot graph, however, is a graphic comparison of the means themselves. Since this whole discussion is centered on a comparison of means, it is good to have alternatives to the boxplots. It is also good to explore alternatives to the boxplots because not all audiences to which you present your research are going to have experience reading boxplots.

You are already familiar with one popular alternative, the bar chart. In this case, we want the height of the bars to represent the mean levels of the dependent variable for each of the two subgroups. Recall from Chapter 3 that the barplot command for looking at the distribution of a single variable required using the results of the table() command as input. In order to plot subgroup means, we use the results from the aggregate() command used earlier, which stored the group means in an object called agg_femFT. To review, here are the contents of this object:

#Show contents of "agg_femFT"
agg_femFT
  Group.1     x
1    Male 54.54
2  Female 62.55

R stored this object as a data.frame with two variables, agg_femFT$x and agg_femFT$Group.1. We use this information to create a bar chart using the format barplot(dependent~independent) and add some labels:

#Use 'barplot' to show the mean outcomes of "femFT" by sex
barplot(agg_femFT$x~agg_femFT$Group.1,
        xlab="Sex of Respondent",
        ylab="Mean Feminist Feeling Thermometer")

I think this bar plot does a nice job of showing the difference between the two groups while also communicating that difference in the context of the scale of the dependent variable. It shows that there is a difference, but a somewhat modest difference.
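
If you want the bar chart to show the difference against the full range of the thermometer, one optional variation is to set the y-axis limits to 0 and 100. The sketch below is self-contained: it rebuilds the two group means from the printed output above in a stand-in object named agg_demo (a hypothetical name, not an object created earlier in the chapter) and uses the standard ylim argument of barplot.

```r
# Stand-in for agg_femFT, rebuilt from the printed group means above
# (agg_demo is a hypothetical name used only for this illustration)
agg_demo <- data.frame(
  Group.1 = factor(c("Male", "Female"), levels = c("Male", "Female")),
  x = c(54.54, 62.55)
)

# Difference between the two group means, in thermometer points
mean_diff <- agg_demo$x[2] - agg_demo$x[1]
mean_diff

# Bar chart with the y-axis spanning the full 0 to 100 scale
barplot(x ~ Group.1, data = agg_demo,
        xlab = "Sex of Respondent",
        ylab = "Mean Feminist Feeling Thermometer",
        ylim = c(0, 100))
```

With the full scale on display, the gap of roughly eight points is easy to see but clearly takes up a small share of the possible range.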

Another alternative for graphing the mean differences is plotmeans, a function found in the gplots package. The structure of this command is straightforward: plotmeans(dependent~independent). Let’s take a close look at the means plot for sex-based differences in the feminist feeling thermometer.

#Use 'plotmeans' to show the mean outcomes of "femFT" by sex
plotmeans(anes20$femFT~anes20$Rsex,
          n.label=F, #Do not include the number of observations
          ylab="Mean Feminist Feeling Thermometer", 
          xlab="Respondent Sex")

What you see here are two small circles representing the mean outcomes on the dependent variable for each of the two independent variable categories and error bars (vertical lines within end caps) representing the confidence intervals around each of the two subgroup means. As you can see, there appears to be a substantial difference between the two groups. This is represented by the vertical distance between the group means (the circles), but also by the fact that neither of the confidence intervals overlaps with the other group mean. If the confidence interval of one group overlaps with the mean of the other, then the two groups are not statistically different from each other.
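
To make the overlap check concrete, here is a minimal sketch of how group means and 95% confidence intervals like those in a means plot can be computed in base R. Everything in it is an assumption made for illustration: the data are simulated rather than taken from the ANES, the object names are invented, and the interval formula (mean plus or minus the critical t-score times the standard error) is the standard one, which I believe is close to what plotmeans draws by default.

```r
set.seed(1234)

# Simulated 0-100 ratings for two made-up groups (NOT ANES data)
grp <- factor(rep(c("Group A", "Group B"), each = 200))
y <- c(rnorm(200, mean = 55, sd = 25), rnorm(200, mean = 62, sd = 25))
y <- pmin(pmax(y, 0), 100)  # keep values inside the 0-100 range

# Group means, standard deviations, and sample sizes
means <- as.numeric(tapply(y, grp, mean))
sds <- as.numeric(tapply(y, grp, sd))
ns <- as.numeric(tapply(y, grp, length))

# Half-width of each 95% confidence interval: t * (sd / sqrt(n))
half_width <- qt(0.975, df = ns - 1) * sds / sqrt(ns)

ci <- data.frame(group = levels(grp),
                 mean = means,
                 lower = means - half_width,
                 upper = means + half_width)
ci

# Does either interval contain the other group's mean?
a_covers_b <- ci$lower[1] <= ci$mean[2] && ci$mean[2] <= ci$upper[1]
b_covers_a <- ci$lower[2] <= ci$mean[1] && ci$mean[1] <= ci$upper[2]
c(a_covers_b, b_covers_a)
```

If either check returns TRUE, the visual rule described above points to a difference that is not statistically significant; a t-test remains the formal way to settle the question.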

I like the means plot as a graphing tool, but one drawback is that it can create the impression that the difference between the two groups is more meaningful than it is. Note in this graph that the span of the y-axis is only as wide as the lowest and highest confidence limits. There is nothing technically wrong with this. However, since the scale of the dependent variable is from 0 to 100, restricting the view to these narrow limits can make it seem that there is a more important difference between the two groups than there actually is. This is why it is important to measure the effect size with something like Cohen’s D and to pay attention to the scale of the dependent variable when evaluating the size of the subgroup differences.

The means plot can be altered to give a more realistic sense of the magnitude of the effect. In the figure below, I expand the y-axis so the limits are now 45 and 70 (using ylim=c()); not the full range of the variable, but a much wider range than before. As a result, you can still see that there is clearly a difference in outcomes between the two groups, but the magnitude of the difference is, I think, more realistically displayed, given the scale of the variable.32 The actual difference between the two groups is the same in both figures, but the figures give different impressions of the effect size. Scale matters.

#Expand the y-axis in the 'plotmeans' command
plotmeans(anes20$femFT~anes20$Rsex,
          n.label=F, #Do not include the number of observations
          ylab="Mean Feminist Feeling Thermometer", 
          xlab="Respondent Sex",
          ylim=c(45,70)) #Expand y-axis

10.6 What’s Next?

It is quite common to make comparisons between two groups as we have done in this chapter. However, we are frequently interested in comparisons across more than just two groups. For instance, if you think back to some of the other group characteristics mentioned at the beginning of this chapter–race and ethnicity, place of residence, and religiosity–it is easy to see how we could compare several subgroups at the same time. In the case of race and ethnicity, while the dominant comparison tends to be between whites and people of color, it is probably more useful to take full advantage of the data and compare outcomes among several groups–whites, blacks, Hispanics, Asian-Americans and Pacific Islanders, and other identities. While t-tests play a role in these types of comparisons, a more appropriate method is Analysis of Variance (ANOVA), a statistical technique we take up in the next chapter.


10.7 Exercises

10.7.1 Concepts and Calculations

  1. The means plot below illustrates the mean feminist feeling thermometer rating for two groups of respondents, those aged 18 to 49, and those aged 50 and older. Based just on this graph, does there appear to be a relationship between age and support for feminists? Justify and explain your answer.

  2. In response to the student survey that was used for the exercises in Chapters 8 and 9, a potential donor wants to provide campus bookstore gift certificates as a way of defraying the cost of books and supplies. In consultation with the student government leaders, the donor decides to prioritize first and second year students for this program because they think that upper-class students spend less than others on books and supplies. Before finalizing the decision, the student government wants to test whether there really is a difference in the spending patterns of the two groups of students.

    A. What are the null and alternative hypotheses for this problem? Explain.

    B. Using the data listed below, test the null hypothesis and summarize your findings for the student government. Is there a significant relationship between class standing and expenditures? Is the relationship strong?

    C. Based on these findings, should the donor prioritize first and second year students for assistance? Be sure to go beyond just reciting the statistics when you answer this question.

Expenditures on Books and Supplies, by Class Status

            1st & 2nd Year   Upper-class
Mean        $358             $340
Std. Dev.   77               79
n           165              135

  3. On a number of cultural, demographic, and political measures, the states in the American South stand out as somewhat different from the rest of the country. The results shown here summarize the mean levels of several different variables in southern and non-southern states, along with the overall standard deviation for those variables. Use this information to calculate the difference between the means of the two groups of states for each variable, along with Cohen’s D. For which variable is the impact of region the strongest? For which variable is the impact of region the weakest? Do any of these differences surprise you?

Comparison of Southern and Non-Southern States

Variable                     South Mean   Non-South Mean   Standard Deviation
Gallons of Beer Per Capita   32.8         32.1             5.3
Congregations per 10k        16.9         12.1             5.3
Diabetes %                   12.6         9.7              1.9
Gun Deaths per 100k          16.6         11.8             4.9
Tax Burden                   8.8          9.6              1.3

10.7.2 R Problems

For these problems, use the county20large data set to examine how county-level educational attainment is related to COVID-19 cases per 100k population. You need to load the following libraries: dplyr, Hmisc, gplots, descr, effectsize.

  1. The first thing you need to do is take a sample of 500 counties from the county20large data set and store that sample in a new data set, covid500, using the command listed below.
set.seed(1234) 
#create a sample of 500 rows of data from the county20large data set
covid500<-sample_n(county20large, 500)

The sample_n command samples rows of data from the data set, so we now have 500 randomly selected counties with data on all of the variables in the data set. The dependent variable in this assignment is covid500$cases100k_sept821 (cumulative COVID-19 cases per 100,000 people, up to September 8, 2021), and the independent variable is covid500$postgrad, the percent of adults in the county with a post-graduate degree. The expectation is that case rates are lower in counties with relatively high levels of education than in other counties.

  2. Transform covid500$postgrad into a two-category variable with a roughly equal number of counties in each category. Store this variable in a new object named covid500$postgrad2 and label the categories “Low Education” and “High Education”. The generic format is data$newvariable<-cut2(data$oldvariable, g=# of groups). If you are unclear about how to do this, go back and take a quick look at the variable transformation section of Chapter 4 for a refresher. Produce a frequency table for covid500$postgrad2 to check on the transformation.

  3. State a null and alternative hypothesis for this pair of variables.

  4. Use the compmeans command to estimate the level of COVID-19 cases per 100k in low and high education counties. Describe the results. What do the data and boxplot tell you? Make sure to use clear, intuitive labels for the boxplot and make specific references to the group means.

  5. Conduct a t-test for the difference in COVID-19 rates between low and high education counties. Interpret the results.

  6. Add a means plot (plotmeans command) and Cohen’s D (cohens_d command) and discuss what additional insights they provide.


  28. Here, it is important to acknowledge multiple other forms of gender identity and gender expression, including but not limited to transgender, gender-fluid, non-binary, and intersex. The survey question used in the 2020 ANES, as well as in most research on “gender gap” issues, utilizes a narrow sense of biological sex, relying on response categories of “Male” and “Female.”

  29. You also can calculate \(D\) with information provided in the R t.test output: \(D=\frac{2t}{\sqrt{df}}\). This formula uses the t-score as an estimate of impact but then deflates it by taking into account sample size via \(df\).

  30. Technically, since the mean difference is opposite of what we expect, we should use a two-tailed test.

  31. Notice how easily I shifted to talking about percentages instead of proportions. It’s fine to do that as long as you remember that any calculations need to be done using proportions.

  32. The confidence intervals are difficult to see because they are very small relative to the size of the y-axis, making it hard for R to print them. This will create multiple warnings if you run the code. Don’t worry about these warnings.