Chapter 2 Correlations and Group Means
In this chapter, we’ll introduce the basic simple linear regression framework. You’ve probably seen this in your Business Statistics class, but we’ll begin with the basics. Since this is an applied class, it makes sense to begin with a dataset and ask, “What can we learn from this data?”
2.1 Some data
We’re going to begin with data provided by Stock and Watson (2019) that comes from the March 2013 iteration of the Bureau of Labor Statistics’ Current Population Survey (CPS).
# Load the Data
library(readxl)
cps_data <- read_excel("data/ch8_cps.xlsx")
head(cps_data)
## # A tibble: 6 × 8
## ahe yrseduc female age northeast midwest south west
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 12.5 14 0 40 1 0 0 0
## 2 19.2 12 0 30 1 0 0 0
## 3 17.5 12 0 29 1 0 0 0
## 4 13.9 12 0 40 1 0 0 0
## 5 7.21 14 1 40 1 0 0 0
## 6 7.60 12 1 35 1 0 0 0
We have 59485 observations of workers from randomly selected households. For each worker, we observe their age, sex (recorded as the indicator variable female), average hourly earnings in dollars, years of education, and their region of residence (recorded as the indicator variables northeast, midwest, south, and west).
2.1.1 Quick Questions
- What is the population from which this data was sampled?
- What is the unit of observation in this dataset?
2.2 Sample Correlation
There are probably many interesting questions that we could investigate using this data, but we will focus on the relationship between education and average hourly earnings (a.k.a. average wage). First, let’s compute the sample correlation. For compactness, let \(X\) represent years of education and \(Y\) represent average wage.
\[\begin{gather} r_{X,Y} = \frac{\frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{s_X s_Y} \tag{2.1} \end{gather}\]
where
\[\begin{gather} s_X = \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(X_i - \bar{X}\right)^2} \end{gather}\]
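As a quick check that this formula matches R’s built-in sd() function, here is a minimal sketch (assuming cps_data has been loaded as above):
# Manual standard deviation of years of education vs. the built-in sd()
x <- cps_data$yrseduc
n <- length(x)
sqrt(sum((x - mean(x))^2) / (n - 1))
sd(x)  # should give the same value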
Sample correlation estimates the population correlation, \(\rho_{X,Y}\), and both must lie between -1 and 1. To understand how to interpret the sample correlation, let’s work out the first four terms of the sum in the numerator of (2.1).
We’ll need the sample means of wage and schooling. They are $23.89 and 14.1 years, respectively. Then, we have
\[\begin{multline} \left(-11.39 \times -0.1\right) + \left(-4.66 \times -2.1\right) + \\ \left(-6.34 \times -2.1\right) + \left(-9.95 \times -2.1\right) + \ldots \end{multline}\]
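To make these numbers concrete, here is how you could reproduce the sample means and the first few cross-product terms in R (a sketch, assuming cps_data is loaded as above; the printed values will carry more decimal places than the rounded figures shown here):
# Sample means of wage (ahe) and schooling (yrseduc)
mean(cps_data$ahe)      # roughly 23.89
mean(cps_data$yrseduc)  # roughly 14.1

# Deviations from the means, and the first four cross-product terms
dev_y <- cps_data$ahe - mean(cps_data$ahe)
dev_x <- cps_data$yrseduc - mean(cps_data$yrseduc)
head(dev_x * dev_y, 4)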
In these first four terms, each product is positive. In fact, a term is positive whenever \(X_i\) and \(Y_i\) are both below, or both above, their respective means. A term is negative whenever \(X_i\) is above its mean and \(Y_i\) is below its mean, or vice versa. So, roughly speaking, if the positive terms dominate then the sample correlation is positive, and if the negative terms dominate then it is negative.
A positive correlation means that if one variable is above its mean, then the other variable tends to be (but is not always) above its mean. A negative correlation means that if one variable is above its mean, then the other variable tends to be (but is not always) below its mean. Finally, dividing this sum by \(n-1\) and by the standard deviations of each variable is a normalization that removes the units (dollars and years of schooling, in this case) and makes correlations easier to interpret and compare.
The sample correlation of wages and schooling in this sample equals about 0.446. Thus, workers who had more schooling tended to have higher wages.
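If you want to verify this number, both the built-in cor() function and a direct application of formula (2.1) give the same answer (a small sketch, assuming cps_data is loaded as above):
# Sample correlation via formula (2.1) and via the built-in cor()
x <- cps_data$yrseduc
y <- cps_data$ahe
n <- length(x)
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))
r_manual
cor(x, y)  # should match r_manual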
2.2.1 Quick Questions
- What does it mean if a sample correlation is 0.95 versus 0.05?
- You’ve probably heard that “correlation does not imply causation” ad nauseam. How would this adage apply to what we’ve observed so far about the relationship between schooling and wages?
2.3 Group Averages
Let’s now ask the following question: what is the difference in mean wages between workers with 12 years of schooling and workers with 13 years? To answer this, we take just the observations in our sample with 13 years of schooling and compute the mean wage of this subsample. We do the same for 12 years of schooling, and then we take the difference. In the code below, I compute the sample mean wage for each group of workers with the same years of schooling.
library(dplyr)  # provides %>%, select(), group_by(), and summarise()

group_mean_wage <- cps_data %>%
  select(ahe, yrseduc) %>%
  group_by(yrseduc) %>%
  summarise(n = n(), mean_wage = mean(ahe))
group_mean_wage
## # A tibble: 12 × 3
## yrseduc n mean_wage
## <dbl> <int> <dbl>
## 1 6 696 11.8
## 2 8 549 12.5
## 3 9 637 12.7
## 4 10 570 14.5
## 5 11 759 13.9
## 6 12 16338 17.8
## 7 13 10210 20.3
## 8 14 6870 21.8
## 9 16 14752 28.9
## 10 18 5975 35.0
## 11 19 984 46.0
## 12 20 1145 43.1
The difference in mean wages between workers with 13 and 12 years of schooling is $2.52.
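If you would rather compute this difference in code than read it off the table, one way is the following (using the group_mean_wage tibble built above):
# Difference in mean wages between 13 and 12 years of schooling
mw13 <- group_mean_wage$mean_wage[group_mean_wage$yrseduc == 13]
mw12 <- group_mean_wage$mean_wage[group_mean_wage$yrseduc == 12]
mw13 - mw12  # roughly 2.52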
Let’s keep going and compute, for each level of schooling, the difference in mean wages relative to the next-lowest level of schooling observed in the data.
group_mean_wage <- group_mean_wage %>%
  mutate(lag_mw = lag(mean_wage)) %>%
  mutate(diff_mean_wage = mean_wage - lag_mw)
group_mean_wage
## # A tibble: 12 × 5
## yrseduc n mean_wage lag_mw diff_mean_wage
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 6 696 11.8 NA NA
## 2 8 549 12.5 11.8 0.754
## 3 9 637 12.7 12.5 0.181
## 4 10 570 14.5 12.7 1.78
## 5 11 759 13.9 14.5 -0.554
## 6 12 16338 17.8 13.9 3.87
## 7 13 10210 20.3 17.8 2.52
## 8 14 6870 21.8 20.3 1.48
## 9 16 14752 28.9 21.8 7.13
## 10 18 5975 35.0 28.9 6.03
## 11 19 984 46.0 35.0 11.0
## 12 20 1145 43.1 46.0 -2.87
The last column gives us some interesting insights. For example, in all but two cases, the average wage is higher for workers with more schooling. (The exceptions are the drops between 10 and 11 years and between 19 and 20 years of schooling.) A few of these differences need extra care because several rows span more than one year of schooling. For example, the row for 18 years skips over 17 years of education, so its reported difference is relative to 16 years. One way to deal with this is to divide that difference by 2, which yields the average difference per year of schooling over the gap. Applying the same idea, the $7.13 difference between 14 and 16 years works out to about $3.56 per year. The largest single-year difference occurs between 18 and 19 years of schooling, where mean wages differ by about $11.00.
We could collapse these differences into a single summary of the difference in average wages associated with one more year of schooling by computing a weighted average of the last column (diff_mean_wage), where the weights equal the number of workers in each group. (We would first need to convert the differences that span more than one year of schooling into per-year differences using the procedure discussed in the previous paragraph.)
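Here is one rough sketch of that calculation; it is not the only reasonable choice. Each difference is first divided by the size of the schooling gap, so that multi-year gaps become per-year differences, and the first row (which has no preceding level) is dropped.
# Convert each difference to a per-year difference, then take a
# weighted average using group sizes as weights
per_year <- group_mean_wage %>%
  mutate(gap = yrseduc - lag(yrseduc),
         diff_per_year = diff_mean_wage / gap) %>%
  filter(!is.na(diff_per_year))
weighted.mean(per_year$diff_per_year, w = per_year$n)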
It turns out that this is what simple linear regression does!
2.3.1 Conditional Expectations
Before we move on to linear regression, let’s cover conditional expectations. Mathematically, for a discrete random variable \(Y\), we define the conditional expectation as follows.
\[\begin{gather} E\left[Y|X=x\right] = \sum_{j=1}^m y_j \times Pr(Y = y_j | X = x) \end{gather}\]
The conditional expectation is read as, “the expected value of Y given that X equals x.” Capital letters, like \(Y\), represent random variables; \(Y\) represents a person’s hourly wage. Lower-case letters represent specific possible values that the corresponding variable can take; \(Y\) can take on values \(\{y_1, y_2, \ldots, y_m\}\).
From the work above, we know that the sample mean of average hourly wage among workers with 13 years of schooling is $20.33. This group sample mean is an estimate of the conditional expectation of average hourly wage given that a worker has 13 years of schooling: \(E\left[Y|X=13\right]\).
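Since we already have the group means, this estimate can be pulled straight from group_mean_wage (a sketch using the objects built above):
# Estimated E[Y | X = 13]: the sample mean wage among workers with 13 years
group_mean_wage %>%
  filter(yrseduc == 13) %>%
  select(yrseduc, mean_wage)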
2.3.2 Quick Questions
- What is an estimate of \(E\left[Y|X=16\right]\)?
- In the example data we’ve been working with, the average hourly wage is in dollars. Can you determine whether those dollars are in current-year prices (2012) or something else?