3.6 An Application to the Gender Gap of Earnings
This section discusses how to reproduce the results presented in the box "The Gender Gap of Earnings of College Graduates in the United States" of the book.
To reproduce Table 3.1 of the book, you need the replication data, which are hosted by Pearson and can be downloaded here. Download the data for Chapter 3 as an Excel spreadsheet (cps_ch3.xlsx). The dataset contains observations from \(1992\) to \(2008\), and earnings are reported in \(2008\) prices.
There are several ways to import .xlsx files into R. We suggest the function read_excel() from the readxl package (Wickham & Bryan, 2018). The package is not part of R’s base version and has to be installed manually.
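A minimal sketch of one way to install the package only if it is missing. The helper missing_packages() is a hypothetical name introduced here for illustration; requireNamespace() and install.packages() are standard R functions.

```r
# return the names of those packages in 'pkgs' that are not installed
# (missing_packages() is a helper defined here for illustration only)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# check whether 'readxl' still needs to be installed
to_install <- missing_packages("readxl")

# install from CRAN only in an interactive session and only if needed
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

The interactive() guard is a cautious design choice: it keeps scripts run in batch mode from triggering downloads. Calling install.packages("readxl") once by hand works just as well.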
# load the 'readxl' package
library(readxl)
You are now ready to import the dataset. Make sure you use the correct path to the downloaded file! In our example, the file is saved in a subfolder (data) of the working directory. If you are not sure what your current working directory is, use getwd(), see also ?getwd. This returns the path of the directory in which R currently looks for files.
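Before importing, a quick check along the following lines can confirm that R finds the file. The sketch assumes the file was saved as cps_ch3.xlsx in the subfolder data, as described above; data_path is a name introduced here for illustration.

```r
# build the relative path to the downloaded file
data_path <- file.path("data", "cps_ch3.xlsx")

# directory in which R currently looks for files
getwd()

# check whether the file can be found relative to the working directory
file.exists(data_path)
```

If file.exists() returns FALSE, either move the file or change the working directory with setwd() before importing.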
# import the data into R
cps <- read_excel(path = 'data/cps_ch3.xlsx')
Next, install and load the package dplyr (Wickham, François, Henry, & Müller, 2018). This package provides handy functions that greatly simplify data wrangling. It makes use of the pipe operator %>%.
# load the 'dplyr' package
library(dplyr)
First, get an overview of the dataset. Next, use %>% and some functions from the dplyr package to group the observations by gender and year and compute descriptive statistics for each group.
# get an overview of the data structure
head(cps)
## # A tibble: 6 x 3
## a_sex year ahe08
## <dbl> <dbl> <dbl>
## 1 1 1992 17.2
## 2 1 1992 15.3
## 3 1 1992 22.9
## 4 2 1992 13.3
## 5 1 1992 22.1
## 6 2 1992 12.2
# group data by gender and year and compute the mean, standard deviation
# and number of observations for each group
avgs <- cps %>%
  group_by(a_sex, year) %>%
  summarise(mean(ahe08),
            sd(ahe08),
            n())
# print the results to the console
print(avgs)
## # A tibble: 10 x 5
## # Groups: a_sex [?]
## a_sex year `mean(ahe08)` `sd(ahe08)` `n()`
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 1992 23.3 10.2 1594
## 2 1 1996 22.5 10.1 1379
## 3 1 2000 24.9 11.6 1303
## 4 1 2004 25.1 12.0 1894
## 5 1 2008 25.0 11.8 1838
## 6 2 1992 20.0 7.87 1368
## 7 2 1996 19.0 7.95 1230
## 8 2 2000 20.7 9.36 1181
## 9 2 2004 21.0 9.36 1735
## 10 2 2008 20.9 9.66 1871
With the pipe operator %>% we simply chain R functions whose outputs and inputs are compatible. In the code above, we take the dataset cps and pass it as input to the function group_by(). The output of group_by() is then passed as input to summarise().
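To make the chaining explicit, the following sketch compares a piped call with the equivalent nested call. It uses a small toy data frame with made-up values (not taken from cps_ch3.xlsx) so that it runs without the downloaded file; the names toy, piped, nested and mean_ahe are introduced here for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# toy data with hypothetical values, mimicking the columns of 'cps'
toy <- data.frame(
  a_sex = c(1, 1, 2, 2),
  year  = c(1992, 1992, 1992, 1992),
  ahe08 = c(20, 24, 15, 19)
)

# piped version: read from top to bottom
piped <- toy %>%
  group_by(a_sex, year) %>%
  summarise(mean_ahe = mean(ahe08), .groups = "drop")

# nested version: read from the inside out
nested <- summarise(group_by(toy, a_sex, year),
                    mean_ahe = mean(ahe08), .groups = "drop")

# both versions produce the same summary table
all.equal(piped, nested)
```

Naming the summary column inside summarise(), as done here with mean_ahe, is one way to avoid column names such as `mean(ahe08)` in the result.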
Now that we have computed the statistics of interest for both genders, we can investigate how the gap in earnings between both groups evolves over time.
# split the dataset by gender
male <- avgs %>% filter(a_sex == 1)
female <- avgs %>% filter(a_sex == 2)
# rename columns of both splits
colnames(male) <- c("Sex", "Year", "Y_bar_m", "s_m", "n_m")
colnames(female) <- c("Sex", "Year", "Y_bar_f", "s_f", "n_f")
# estimate gender gaps, compute standard errors and confidence intervals for all dates
gap <- male$Y_bar_m - female$Y_bar_f
gap_se <- sqrt(male$s_m^2 / male$n_m + female$s_f^2 / female$n_f)
gap_ci_l <- gap - 1.96 * gap_se
gap_ci_u <- gap + 1.96 * gap_se
result <- cbind(male[,-1], female[,-(1:2)], gap, gap_se, gap_ci_l, gap_ci_u)
# print the results to the console
print(result, digits = 3)
## Year Y_bar_m s_m n_m Y_bar_f s_f n_f gap gap_se gap_ci_l gap_ci_u
## 1 1992 23.3 10.2 1594 20.0 7.87 1368 3.23 0.332 2.58 3.88
## 2 1996 22.5 10.1 1379 19.0 7.95 1230 3.49 0.354 2.80 4.19
## 3 2000 24.9 11.6 1303 20.7 9.36 1181 4.14 0.421 3.32 4.97
## 4 2004 25.1 12.0 1894 21.0 9.36 1735 4.10 0.356 3.40 4.80
## 5 2008 25.0 11.8 1838 20.9 9.66 1871 4.10 0.354 3.41 4.80
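In formulas, the code above computes, for each year, the difference in sample means, its standard error and an approximate \(95\%\) confidence interval. The notation mirrors the column names Y_bar_m, s_m, n_m, Y_bar_f, s_f and n_f:

\[ \overline{Y}_m - \overline{Y}_f, \qquad SE(\overline{Y}_m - \overline{Y}_f) = \sqrt{\frac{s_m^2}{n_m} + \frac{s_f^2}{n_f}}, \qquad (\overline{Y}_m - \overline{Y}_f) \pm 1.96 \times SE(\overline{Y}_m - \overline{Y}_f). \]

As a rough check using the rounded \(1992\) values from the table, \(\sqrt{10.2^2/1594 + 7.87^2/1368} \approx 0.332\), which agrees with the reported gap_se.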
Our results are virtually the same as those presented in the book. The computed statistics suggest that there is a gender gap in earnings. Note that we can reject the null hypothesis that the gap is zero at the \(5\%\) level in every year, since none of the \(95\%\) confidence intervals contains zero. Further, the estimates of the gap and the limits of the \(95\%\) confidence intervals indicate that the gap has been quite stable in the recent past.
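The rejection of the null hypothesis can also be checked directly with t-statistics. The following sketch assumes a large-sample normal approximation and uses the rounded estimates printed above rather than the data itself; the names t_stat, p_value and reject_5pct are introduced here for illustration.

```r
# rounded estimates copied from the printed results above
year   <- c(1992, 1996, 2000, 2004, 2008)
gap    <- c(3.23, 3.49, 4.14, 4.10, 4.10)
gap_se <- c(0.332, 0.354, 0.421, 0.356, 0.354)

# t-statistics for the null hypothesis that the gap is zero
t_stat <- gap / gap_se

# two-sided p-values based on the standard normal approximation
p_value <- 2 * pnorm(-abs(t_stat))

# collect the results and flag rejection at the 5% level
data.frame(year, t_stat, p_value, reject_5pct = abs(t_stat) > 1.96)
```

All five t-statistics exceed the critical value of 1.96 by a wide margin, in line with the confidence intervals reported in the table.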
References
Wickham, H., & Bryan, J. (2018). readxl: Read Excel Files (Version 1.1.0). Retrieved from https://CRAN.R-project.org/package=readxl
Wickham, H., François, R., Henry, L., & Müller, K. (2018). dplyr: A Grammar of Data Manipulation (Version 0.7.6). Retrieved from https://CRAN.R-project.org/package=dplyr