# Chapter 2 Correlations and Group Means

In this chapter, we’ll introduce the basic simple linear regression framework. You’ve probably seen this in your Business Statistics class, but we’ll begin with the basics. Since this is an applied class, it makes sense to begin with a dataset and ask, “What can we learn from this data?”

## 2.1 Some data

We’re going to begin with data provided by Stock and Watson (2019), which comes from the March 2013 iteration of the Bureau of Labor Statistics’ Current Population Survey (CPS).

```r
# Load the Data
head(cps_data)
## # A tibble: 6 × 8
##     ahe yrseduc female   age northeast midwest south  west
##   <dbl>   <dbl>  <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1 12.5       14      0    40         1       0     0     0
## 2 19.2       12      0    30         1       0     0     0
## 3 17.5       12      0    29         1       0     0     0
## 4 13.9       12      0    40         1       0     0     0
## 5  7.21      14      1    40         1       0     0     0
## 6  7.60      12      1    35         1       0     0     0
```

We have 59,485 observations of workers from randomly selected households. For each worker, we observe their age, male-female status, average hourly earnings in 2012, years of education, and region of residence.

### 2.1.1 Quick Questions

1. What is the population from which this data was sampled?
2. What is the unit of observation in this dataset?

## 2.2 Sample Correlation

There are probably many interesting questions that we could investigate using this data, but we will focus on the relationship between education and average hourly earnings (a.k.a. average wage). First, let’s compute the sample correlation. For compactness, let $$X$$ represent years of education and $$Y$$ represent average wage.

$\begin{gather} r_{X,Y} = \frac{\frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{s_X s_Y} \tag{2.1} \end{gather}$

where

$\begin{gather} s_X = \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(X_i - \bar{X}\right)^2} \end{gather}$

and $$s_Y$$ is defined analogously for $$Y$$.
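As a rough check on these definitions, the sketch below computes a sample correlation by hand, dividing the sum of cross-products by $$n-1$$ and by both standard deviations, and compares it with R’s built-in `cor()`. It uses only the six rows printed by `head(cps_data)`, with wages as rounded in that printout, so the result need not resemble the full-sample correlation. The names `x`, `y`, `s_x`, `s_y`, `r_manual`, and `r_builtin` are introduced here purely for illustration.

```r
# Six rows copied from the head(cps_data) printout (wages are rounded there).
x <- c(14, 12, 12, 12, 14, 12)              # yrseduc: years of education
y <- c(12.5, 19.2, 17.5, 13.9, 7.21, 7.60)  # ahe: average hourly earnings
n <- length(x)

# Sample standard deviations, using the n - 1 divisor.
s_x <- sqrt(sum((x - mean(x))^2) / (n - 1))
s_y <- sqrt(sum((y - mean(y))^2) / (n - 1))

# Sum of cross-products over n - 1, scaled by both standard deviations.
r_manual <- (sum((x - mean(x)) * (y - mean(y))) / (n - 1)) / (s_x * s_y)

# The built-in function should agree up to floating-point error.
r_builtin <- cor(x, y)
all.equal(r_manual, r_builtin)
```

Assuming `cps_data` is loaded, replacing `x` and `y` with `cps_data$yrseduc` and `cps_data$ahe` would apply the same steps to the full sample.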

Sample correlation estimates the population correlation, $$\rho_{X,Y}$$, and both must lie between -1 and 1. To understand how to interpret the sample correlation, let’s work out the first four terms of the sum in the numerator of (2.1).

We’ll need the sample means of wage and schooling. They are $23.89 and 14.1 years, respectively. Then, we have

$\begin{multline} \left(-11.39 \times -0.1\right) + \left(-4.6592299 \times -2.1\right) + \\ \left(-6.3419234 \times -2.1\right) +\left(-9.9476925 \times -2.1\right) + \ldots \end{multline}$

In these first four terms, each product is positive. In fact, a term is positive whenever $$X_i$$ and $$Y_i$$ are both below or both above their respective means. A term is negative whenever $$X_i$$ is above its mean and $$Y_i$$ is below its mean, or vice versa. So, on balance, if the positive terms outweigh the negative terms, the sample correlation is positive, and if the negative terms outweigh the positive terms, the sample correlation is negative.

A positive correlation means that if one variable is above its mean, then the other variable tends to be (but is not always) above its mean. A negative correlation means that if one variable is above its mean, then the other variable tends to be (but is not always) below its mean. Finally, dividing this sum by the standard deviations of each variable is a sort of normalization that gets rid of the units (dollars and years of schooling, in this case) and makes correlations easier to interpret.

The sample correlation of wages and schooling in this sample equals 0.4461587. Thus, workers who had more schooling tended to have higher wages.

### 2.2.1 Quick Questions

1. What does it mean if a sample correlation is 0.95 versus 0.05?
2. You’ve probably heard that “correlation does not imply causation” ad nauseam. How would this adage apply to what we’ve observed so far about the relationship between schooling and wages?

## 2.3 Group Averages

Let’s now ask the following: what is the difference in mean wages between workers with 12 and 13 years of schooling? To answer this, we grab just the observations in our sample with 13 years of schooling and compute the mean wage of this subsample. We do the same for 12 years of schooling, and then we simply take the difference. In the code below, I go ahead and compute sample mean wages for each group of workers with the same years of schooling.

```r
# Count observations and average the hourly wage within each schooling level
group_mean_wage <- cps_data %>%
  select(ahe, yrseduc) %>%
  group_by(yrseduc) %>%
  summarise(n = n(), mean_wage = mean(ahe))
group_mean_wage
## # A tibble: 12 × 3
##    yrseduc     n mean_wage
##      <dbl> <int>     <dbl>
##  1       6   696      11.8
##  2       8   549      12.5
##  3       9   637      12.7
##  4      10   570      14.5
##  5      11   759      13.9
##  6      12 16338      17.8
##  7      13 10210      20.3
##  8      14  6870      21.8
##  9      16 14752      28.9
## 10      18  5975      35.0
## 11      19   984      46.0
## 12      20  1145      43.1
```

The difference in mean wages between workers with 13 and 12 years of schooling is $2.52.
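The subsample approach described above (keep one schooling level, average its wages, repeat, subtract) can also be sketched directly in base R. The data frame `toy` below is made up for illustration and is not drawn from the CPS sample; only the column names `ahe` and `yrseduc` are borrowed from `cps_data`.

```r
# Hypothetical toy data (NOT the CPS sample): hourly wage and years of schooling.
toy <- data.frame(
  ahe     = c(15, 17, 19, 18, 22, 23),
  yrseduc = c(12, 12, 12, 13, 13, 13)
)

# Mean wage within each subsample, selected by a logical condition.
mean_12 <- mean(toy$ahe[toy$yrseduc == 12])
mean_13 <- mean(toy$ahe[toy$yrseduc == 13])

# Difference in group means: 13 years of schooling versus 12.
diff_13_12 <- mean_13 - mean_12
diff_13_12
```

In this toy example the group means are 17 and 21, so the difference is 4. Assuming `cps_data` is loaded, swapping `toy` for `cps_data` would apply the same two subsetting steps to the real sample.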

Let’s keep going and compute, for each level of schooling in the table, the difference in mean wages relative to the level just below it.

```r
# lag() returns the previous row's mean_wage (NA for the first row),
# so diff_mean_wage compares each schooling level with the one listed before it
group_mean_wage <- group_mean_wage %>%
  mutate(lag_mw = lag(mean_wage)) %>%
  mutate(diff_mean_wage = mean_wage - lag_mw)
group_mean_wage
## # A tibble: 12 × 5
##    yrseduc     n mean_wage lag_mw diff_mean_wage
##      <dbl> <int>     <dbl>  <dbl>          <dbl>
##  1       6   696      11.8   NA           NA
##  2       8   549      12.5   11.8          0.754
##  3       9   637      12.7   12.5          0.181
##  4      10   570      14.5   12.7          1.78
##  5      11   759      13.9   14.5         -0.554
##  6      12 16338      17.8   13.9          3.87
##  7      13 10210      20.3   17.8          2.52
##  8      14  6870      21.8   20.3          1.48
##  9      16 14752      28.9   21.8          7.13
## 10      18  5975      35.0   28.9          6.03
## 11      19   984      46.0   35.0         11.0
## 12      20  1145      43.1   46.0         -2.87
```
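Because `lag()` looks at the previous row rather than the previous year, three of the differences above span two years of schooling (6 to 8, 14 to 16, and 16 to 18). One way to put them on a common footing, sketched here in base R rather than dplyr, is to divide each difference by the number of years it spans. The vectors are copied from the printed table, so they carry its rounding; `gap_years`, `diff_wage`, and `per_year` are names introduced for this illustration.

```r
# Group means copied from the printed table (rounded to one decimal place).
yrseduc   <- c(6, 8, 9, 10, 11, 12, 13, 14, 16, 18, 19, 20)
mean_wage <- c(11.8, 12.5, 12.7, 14.5, 13.9, 17.8, 20.3, 21.8,
               28.9, 35.0, 46.0, 43.1)

# diff() subtracts each element from the next one, so it matches
# mean_wage - lag(mean_wage) without the leading NA.
gap_years <- diff(yrseduc)     # years of schooling between adjacent rows
diff_wage <- diff(mean_wage)   # change in mean wage between adjacent rows

# Change in mean wage per year of schooling spanned.
per_year <- diff_wage / gap_years

data.frame(yrseduc = yrseduc[-1], gap_years, diff_wage, per_year)
```

For instance, the jump between 14 and 16 years of schooling works out to roughly $3.55 per year once it is split over the two years it covers.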

### 2.3.2 Quick Questions

1. What is an estimate of $$E\left[Y|X=16\right]$$?
2. In the example data we’ve been working with, the average hourly wage is in dollars. Can you determine whether those dollars are in current-year prices (2012) or something else?

### References

Stock, James H, and Mark W Watson. 2019. Introduction to Econometrics. Vol. 4. Pearson New York.