Chapter 6 Multiple Regression

So far, we’ve written down models with only one regressor. However, we can add additional regressors and everything still works. In other words, we still estimate the model by choosing the parameters (the \(\beta\)’s) to minimize the sum of squared residuals. First, I’ll introduce a general multiple regression model and explain its interpretation. Then, we’ll work through an example.

6.1 A General Model

Here is a general multiple regression model with two explanatory variables:

\[\begin{gather} Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i. \tag{6.1} \end{gather}\]

We have two explanatory variables now, \(X_1\) and \(X_2\). For example, we might regress wages on an indicator for gender and years of schooling:

\[\begin{gather} Wages_i = \beta_0 + \beta_1 Female_i + \beta_2 Schooling_i + u_i. \end{gather}\]

Before we carry on with this example, let’s discuss the interpretation of (6.1). I think the first step is to work out the marginal effect of each variable on the right-hand side. The marginal effect of \(X_1\), for example, is the average effect on \(Y\) of a one-unit increase in \(X_1\), holding the other regressors constant.

What is the effect on average \(Y\) of a one-unit increase in \(X_1\)? I’m going to assume (for the moment) that \(\text{E}[u| X_1, X_2] = 0\), so that

\[\begin{gather} \text{E}[Y| X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \end{gather}\]

Here’s the algebraic solution to the marginal effect question:

\[\begin{multline} E[Y|X_1 = x_1+1, X_2 = x_2] - E[Y|X_1 = x_1, X_2 = x_2] = \\ \beta_0 + \beta_1(x_1+1) + \beta_2(x_2) - \left(\beta_0 + \beta_1(x_1) + \beta_2(x_2)\right). \end{multline}\]

After cancelling terms, we’re left with \(\beta_1\). It’s really that simple. And the result is similar for the marginal effect of \(X_2\).

In a multiple regression model like \(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i\), the marginal effect of \(X_1\) is \(\beta_1\), and the marginal effect of \(X_2\) is \(\beta_2\).
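
To make this concrete, here is a minimal R sketch (with made-up coefficient values, not estimates from any data) showing that raising \(X_1\) by one unit while holding \(X_2\) fixed moves the conditional mean by exactly \(\beta_1\):

beta0 <- 2; beta1 <- 3; beta2 <- -1  # hypothetical values, chosen only for illustration
cond_mean <- function(x1, x2) beta0 + beta1 * x1 + beta2 * x2  # E[Y | X1, X2]
cond_mean(x1 = 5, x2 = 7) - cond_mean(x1 = 4, x2 = 7)  # equals beta1 = 3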

6.2 An Example

Consider the following model.

\[ Wages = \beta_0 + \beta_1 Female + u \tag{6.2} \]

Again, using the CPS data from 2013, here’s the estimated model:

library(estimatr)  # provides lm_robust(), which reports robust (HC2) standard errors
wages_fem <- lm_robust(ahe ~ female, data = ch8_cps)
summary(wages_fem)
## 
## Call:
## lm_robust(formula = ahe ~ female, data = ch8_cps)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper    DF
## (Intercept)   26.020     0.0929   280.1        0   25.838   26.202 59483
## female        -4.865     0.1254   -38.8        0   -5.111   -4.619 59483
## 
## Multiple R-squared:  0.02342 ,   Adjusted R-squared:  0.02341 
## F-statistic:  1505 on 1 and 59483 DF,  p-value: < 2.2e-16

The main take-away is about the coefficient on \(female\): On average, women earned hourly wages that were $4.86 less than men. This difference is statistically significant at the 1% level.2
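
As a quick aside (my own check, not from the original text): with an intercept and a single zero-one regressor, the OLS coefficient is just a difference in group means, so the estimate above should equal the raw gap in average wages between women and men.

# Difference in mean hourly wages, women minus men; this matches the
# coefficient on female reported above
mean(ch8_cps$ahe[ch8_cps$female == 1]) - mean(ch8_cps$ahe[ch8_cps$female == 0])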

It turns out that in this sample of American workers in 2013, women tended to have more years of schooling than men.

# Do women and men differ in average years of schooling?
educ_female <- lm_robust(yrseduc ~ female, data = ch8_cps)
summary(educ_female)
## 
## Call:
## lm_robust(formula = yrseduc ~ female, data = ch8_cps)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper    DF
## (Intercept)  13.9264    0.01437  968.95 0.000e+00  13.8982  13.9546 59483
## female        0.3865    0.02098   18.42 1.507e-75   0.3453   0.4276 59483
## 
## Multiple R-squared:  0.005586 ,  Adjusted R-squared:  0.005569 
## F-statistic: 339.3 on 1 and 59483 DF,  p-value: < 2.2e-16

The average woman had 0.39 more years of schooling than the average man. This difference is statistically significant at the 1% level.

If schooling is a determinant of wages that is left in the error term, then the OLS estimate of (6.2) will be biased. Since schooling likely raises wages and is positively correlated with female status, OLS will overestimate \(\beta_1\).3
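
To make the direction of the bias concrete, here is the standard omitted-variable-bias expression, written in my own notation (it does not appear in the original text). Writing \(\beta_2\) for schooling’s effect on wages, in large samples the coefficient from the short regression of wages on \(Female\) alone converges to

\[\begin{gather} \beta_1 + \beta_2 \, \frac{\text{Cov}(Female, Yrseduc)}{\text{Var}(Female)}. \end{gather}\]

Here \(\beta_2\) is positive (schooling raises wages) and the covariance is positive (women average more schooling), so the second term is positive and the short regression overstates \(\beta_1\).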

Since we observe the years of schooling for each worker in this sample of data, we can include it as an additional regressor in the model. Now, we have,

\[ Wages = \beta_0 + \beta_1 Female + \beta_2 Yrseduc+ u \tag{6.3} \]

It is as if we have now pulled schooling out of the error term and explicitly included it in the model. Now, even if female status and schooling are positively correlated, this will not induce correlation between female status and the error term, since schooling is no longer subsumed in the error term. (The effects of other factors that affect wages, like motivation and parents’ education, are still captured by the error term.)

Before we estimate (6.3), let’s review the key things to look for. \(\beta_1\) represents the average difference in wages between females and males, holding years of schooling constant. In other words, suppose we take males and females who all have 13 years of schooling. \(\beta_1\) would represent the average difference in wages between these two groups (who have the same years of schooling). Therefore, if \(\beta_1\) is nonzero, then there is a difference in average earnings between men and women, and this difference cannot be explained by differences in schooling (since we’ve held schooling constant in the comparison). As shorthand, we might say that \(\beta_1\) is the difference in wages between females and males, controlling for years of schooling.

Okay, now let’s estimate (6.3). Let’s pay attention to whether our new estimate of the effect of female status on wages gets smaller (moves to the left on the number line), as the omitted-variable reasoning above predicts.

# Estimate (6.3): wages on female status, controlling for years of schooling
wages_fem_sch <- lm_robust(ahe ~ female + yrseduc, data = ch8_cps)
summary(wages_fem_sch)
## 
## Call:
## lm_robust(formula = ahe ~ female + yrseduc, data = ch8_cps)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper    DF
## (Intercept)  -13.376    0.33832  -39.54        0  -14.039  -12.712 59482
## female        -5.958    0.11278  -52.83        0   -6.179   -5.737 59482
## yrseduc        2.829    0.02607  108.53        0    2.778    2.880 59482
## 
## Multiple R-squared:  0.234 , Adjusted R-squared:  0.234 
## F-statistic:  6343 on 2 and 59482 DF,  p-value: < 2.2e-16

Let’s write out the estimated model,

\[ \widehat{Wages} = -13.4 - \underset{(0.11)}{5.96} Female + \underset{(0.03)}{2.83} Yrseduc. \tag{6.4} \]

Here are a few take-aways, in order.

  1. Controlling for years of schooling, the average difference in wages between women and men is -$5.96.
  2. The estimated average difference is statistically significant(ly different from zero).4
  3. Indeed, the coefficient on \(Female\) was overestimated in (6.2): it was -$4.86 there and, after controlling for schooling, it is -$5.96. (See the quick check after this list.)
  4. A question: might there be other determinants of wages that are correlated with female status, which continue to induce bias in the estimated difference in wages between women and men?
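
Here is the quick check promised in point 3 (my own addition, assuming the fitted objects from above are still in memory). For OLS, the coefficient from the short regression (6.2) equals the coefficient from the longer regression (6.3) plus schooling’s coefficient times the schooling gap between women and men (the coefficient from the regression of yrseduc on female that we ran earlier):

# Long-regression coefficient plus (effect of schooling) x (schooling gap) ...
coef(wages_fem_sch)["female"] + coef(wages_fem_sch)["yrseduc"] * coef(educ_female)["female"]
# ... reproduces the short-regression coefficient on female from (6.2)
coef(wages_fem)["female"]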

6.3 Revisiting Categorical Variables

Now that we can include multiple regressors, we can incorporate categorical variables with more than two categories. Let’s add the worker’s region of residence. There are four possible regions: northeast, midwest, south, and west. To include this region variable in our model, we need to create a one-zero indicator (or dummy) variable for each category. The dataset we are working with already has an indicator for each region.

head(ch8_cps)
## # A tibble: 6 × 8
##     ahe yrseduc female   age northeast midwest south  west
##   <dbl>   <dbl>  <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1 12.5       14      0    40         1       0     0     0
## 2 19.2       12      0    30         1       0     0     0
## 3 17.5       12      0    29         1       0     0     0
## 4 13.9       12      0    40         1       0     0     0
## 5  7.21      14      1    40         1       0     0     0
## 6  7.60      12      1    35         1       0     0     0

The first 6 workers in this data are from the Northeast. An important observation: for each worker, exactly one of these four indicator variables equals one. Therefore, if we know the values of three of the four indicators, we automatically know the value of the fourth.

\[ \text{midwest} = 1 - \text{northeast} - \text{west} - \text{south} \]

Since we can write one of these four indicators as an exact linear function of the other three, we only need to explicitly include three of them in our regression model. For simplicity, let’s omit schooling and female status for the moment. (We’ll add them back later.)
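
Here is a quick check of that exact linear relationship in the data (my own addition, not from the original text):

# Every worker is in exactly one region, so midwest is an exact linear
# function of the other three indicators; this should return TRUE
with(ch8_cps, all(midwest == 1 - northeast - west - south))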

\[\begin{gather} Wages = \beta_0 + \beta_1 Northeast + \beta_2 West + \beta_3 South + u \tag{6.5} \end{gather}\]

Keep in mind that even though \(midwest\) is not in the model, we still get predicted wages for workers from the Midwest by setting the three included region indicators to zero!

\[\begin{align} \text{E}[Wages|Northeast = 0, West = 0, South = 0] &= \beta_0 \\ \text{E}[Wages|Northeast = 1] &= \beta_0 + \beta_1 \\ \text{E}[Wages|West = 1] &= \beta_0 + \beta_2 \\ \text{E}[Wages|South = 1] &= \beta_0 + \beta_3 \end{align}\]

From this, we can establish the interpretation of the coefficient on each region indicator.

  • On average, a worker from the Northeast earns \(\beta_1\) dollars more than one from the Midwest.
  • On average, a worker from the West earns \(\beta_2\) dollars more than one from the Midwest.
  • On average, a worker from the South earns \(\beta_3\) dollars more than one from the Midwest.

In other words, the coefficient on an indicator (or dummy) variable reflects the average difference in the outcome variable between that category and the reference category. The reference category is the category whose indicator variable is left out of the model, which in this case is the Midwest.
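
To see this concretely, here is a sketch (mine, not shown in the text) that estimates (6.5). Because the included indicators are mutually exclusive and, together with the omitted one, exhaustive, the intercept reproduces the average wage in the Midwest and each coefficient reproduces that region’s average gap relative to the Midwest:

# Wages regressed on the three region indicators; Midwest is the reference category
wages_region <- lm_robust(ahe ~ northeast + west + south, data = ch8_cps)
# The intercept should match the Midwest group mean ...
mean(ch8_cps$ahe[ch8_cps$midwest == 1])
# ... and the intercept plus the northeast coefficient should match the Northeast mean
mean(ch8_cps$ahe[ch8_cps$northeast == 1])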

Now, let’s add schooling and female status back to the model, estimate, and interpret.

\[ Wages = \beta_0 + \beta_1 Female + \beta_2 Yrseduc+ \beta_3 Northeast + \beta_4 West + \beta_5 South + u \tag{6.6} \]

# Estimate (6.6): add the region indicators, with the Midwest as the reference category
wages_fem_sch_reg <- lm_robust(ahe ~ female + yrseduc + northeast + west + south, data = ch8_cps)
summary(wages_fem_sch_reg)
## 
## Call:
## lm_robust(formula = ahe ~ female + yrseduc + northeast + west + 
##     south, data = ch8_cps)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper    DF
## (Intercept) -14.4795    0.35846 -40.394 0.000e+00 -15.1821 -13.7770 59479
## female       -5.9334    0.11251 -52.738 0.000e+00  -6.1539  -5.7129 59479
## yrseduc       2.8201    0.02597 108.575 0.000e+00   2.7692   2.8710 59479
## northeast     2.7078    0.17744  15.260 1.768e-52   2.3600   3.0556 59479
## west          1.9490    0.15641  12.461 1.356e-35   1.6424   2.2556 59479
## south         0.6299    0.14529   4.335 1.458e-05   0.3451   0.9146 59479
## 
## Multiple R-squared:  0.238 , Adjusted R-squared:  0.238 
## F-statistic:  2581 on 5 and 59479 DF,  p-value: < 2.2e-16

Here are a few observations:

  1. The average difference in wages between males and females didn’t change much from (6.4). It’s still about $6 less for women.
  2. The marginal effect of schooling also didn’t change much from (6.4). It’s still about $2.80.
  3. Controlling for female status and schooling, workers from the Northeast earn the most on average: about $2.71 more than those from the Midwest.
  4. The average worker from the West earns about $1.95 more than the average worker from the Midwest.
  5. The average worker from the South earns about $0.63 more than the average worker from the Midwest.

6.4 The adjusted \(R^2\)

Recall that the \(R^2\) measures the fraction of variation in the outcome variable that is explained by the model we’ve estimated. Usually we’ll want to adjust the \(R^2\) in models with multiple regressors. The fundamental reason for this adjustment, I think, is the possibility of spurious correlation in a finite sample of data.

To understand spurious correlation, imagine two random variables that are known to be completely independent. For example, Liana flips a fair coin ten times and computes the proportion of times the coin lands on Tails, and Demond does the same. Liana and Demond each do this 50 times, and we compute the sample correlation of the 50 pairs of proportions. Intuitively, we know that these are independent experiments. Liana getting a relatively high proportion of Tails in a given round tells us nothing about whether Demond will get a high or low proportion in that round. But let’s see what happens.

First, I show the proportion of times the coin lands on Tails for each of Liana and Demond in the first 10 rounds (of each flipping a coin 10 times). I did this for 40 more rounds, for a total of 50 rows.

Proportion of flips landing on Tails (first 10 rounds shown)

  Liana   Demond
    0.4      0.5
    0.4      0.6
    0.3      0.3
    0.4      0.8
    0.5      0.4
    0.4      0.6
    0.7      0.6
    0.7      0.5
    0.4      0.5
    0.4      0.3

The sample correlation across all 50 rounds was -0.17. Even though Liana’s and Demond’s coin flips were independent, in a finite sample of 50 rounds the sample correlation was nonzero. This is an example of spurious correlation. Sizable spurious correlation becomes less and less likely as the sample size grows: if we have Liana and Demond do 10,000 rounds, the sample correlation is very likely to be within a few hundredths of zero.
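
Here is a sketch of how one might run this simulation in R (the seed and object names are my own, so your numbers will differ from the -0.17 above):

set.seed(6)
rounds <- 50
liana  <- rbinom(rounds, size = 10, prob = 0.5) / 10  # proportion of Tails in 10 flips
demond <- rbinom(rounds, size = 10, prob = 0.5) / 10
cor(liana, demond)  # typically nonzero in a finite sample, despite independence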

Now, why does this matter for \(R^2\)? As we add additional regressors, it becomes more likely that we accidentally take advantage of spurious correlations and artifacts of our particular finite sample that may not be present in the population. To account for this, we should adjust the \(R^2\) downward as we add more regressors. This gives us the adjusted \(R^2\), or \(\bar{R}^2\).
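
A common formula for this adjustment, with \(n\) observations and \(k\) regressors (not counting the intercept), is

\[\begin{gather} \bar{R}^2 = 1 - \frac{n-1}{n-k-1}\left(1 - R^2\right). \end{gather}\]

The factor \((n-1)/(n-k-1)\) exceeds one and grows with \(k\), which is the “penalty” for adding regressors.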

The adjusted \(R^2\) may increase or decrease as we add more regressors. If we add a regressor that helps explain a lot of variation in the outcome variable, then \(\bar{R}^2\) will increase even after the “penalty.” However, if we add a regressor that explains only a small amount of variation, then \(\bar{R}^2\) will decline.

The upshot is that when we have a multiple regression model, we will use \(\bar{R}^2\) instead of \(R^2\).

6.5 Conditional Mean Independence

Consider the model in (6.6), and suppose we are mainly interested in the causal effect of female status.5 The requirement for an unbiased estimate of the causal parameter, \(\beta_1\), is that female status be mean independent of the error term, conditional on the other regressors. It is okay if the other variables in the model are related to the error term. We can write this out as,

\[ \text{E}[u|female, yrseduc, northeast, west, south] = \text{E}[u|yrseduc, northeast, west, south] \]

This mathematical statement says that it is okay if schooling and region are related to other determinants of wages (and so are informative about the expected error), but, once we condition on schooling and region, female status must be unrelated to the other factors that affect wages (and so provide no additional information about the expected error).

This sort of condition is called conditional mean independence.


  2. Our default significance level is 5%. However, the p-value is much smaller than even 1%, which is why I’ve said “1% level” here. Without reporting the p-value itself, this is a way of characterizing how strong the evidence is that, on average, women have lower wages than men.↩︎

  3. We have a positive correlation between female status and schooling, and a positive correlation between schooling and the error term. These two positives give us an overestimate of the coefficient on \(female\).↩︎

  4. We will cover hypothesis testing in the multiple regression framework in the next module, but for testing a single coefficient, like \(\beta_1\), the technique is identical to the simple t-test that we covered in Statistical Inference.↩︎

  5. The causal effect of female status would capture the difference in average wages between women and men who are otherwise the same. This would correspond to the part of the observed difference in wages that is due to some form of what we usually call discrimination.↩︎