# 11 Multiple Regression

In the last chapter we met our new friend regression and worked through a few brief examples. Let’s start with one regression we already ran: the relationship of computers to math scores.

```
##
## Call:
## lm(formula = math ~ computer, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.332 -14.276 -0.813 12.845 56.742
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 653.767406 1.111619 588.122 <2e-16 ***
## computer -0.001400 0.002077 -0.674 0.501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.77 on 418 degrees of freedom
## Multiple R-squared: 0.001086, Adjusted R-squared: -0.001304
## F-statistic: 0.4543 on 1 and 418 DF, p-value: 0.5007
```

The relationship was small and insignificant, and perhaps most surprisingly, negative. Schools with more computers did worse on the test. Are computers distracting the test takers? Diminishing their math skills? Maybe, but maybe it’s not the computers’ fault. What do you think the relationship is between the number of computers at a school and the number of students? If you’re guessing that schools with more students have more computers, you’d be correct.
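That guess can be checked with cor(); a sketch, assuming the CASchools data is loaded from the AER package:

```r
# Assumes the AER package is installed; CASchools is one of its datasets.
library(AER)
data("CASchools", package = "AER")
# Correlation between the number of computers and the number of students
cor(CASchools$computer, CASchools$students)
```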

`## [1] 0.9288821`

The two are highly correlated: more students means more computers. And with what we’ve done to this point, a regression can’t entirely tell the difference between those things. All the regression knows is the relationship of the number of computers to math scores; it can’t tell why. If larger schools have more computers and do worse on tests, regression can’t separate those effects on its own.
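To see why that matters, here is a small simulation with made-up data (hypothetical numbers, not the CASchools data): the true effect of computers is positive, but because computers track school size, the bivariate slope comes out negative.

```r
# Hypothetical data: the true effect of computers is positive (+0.05),
# the true effect of school size is negative (-0.02), and the two are correlated.
set.seed(1)
n <- 500
students  <- runif(n, 100, 2000)                # school enrollment
computers <- 0.1 * students + rnorm(n, 0, 20)   # bigger schools buy more computers
math <- 650 + 0.05 * computers - 0.02 * students + rnorm(n, 0, 5)

coef(lm(math ~ computers))              # bivariate slope: misleadingly negative
coef(lm(math ~ computers + students))   # both true effects recovered
```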

We can start, though, by manually distinguishing computers from students. Let’s calculate the number of computers per student at each school by creating a new variable (computers divided by the number of students). Then we can regress math scores on that new variable.

```
CASchools$comppercapita <- CASchools$computer/CASchools$students
summary(lm(math~comppercapita, data=CASchools))
```

```
##
## Call:
## lm(formula = math ~ comppercapita, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.284 -12.747 -1.144 13.008 52.836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 643.59 2.06 312.388 < 2e-16 ***
## comppercapita 71.77 13.68 5.247 2.46e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.19 on 418 degrees of freedom
## Multiple R-squared: 0.0618, Adjusted R-squared: 0.05955
## F-statistic: 27.53 on 1 and 418 DF, p-value: 2.461e-07
```

So our new variable is not only positive but highly significant. The number of computers a school has per student makes a significant difference in how well it performs on math tests.

That’s great: we identified the effect of computers separate from the effect of larger schools, and all we had to do was create a new variable. But there’s an easier way to do that too, and it’s called multiple regression.

Multiple regression doesn’t mean running multiple regressions; it just refers to including multiple variables in the same regression. Most of the tools we’ve learned so far only allow for two variables, but with regression we can use many more.

For instance, we could test the effect of computers and students on math scores at the same time. We structure a multiple regression the same way, but we list additional variables on the right side of the ~ and separate them with a plus sign.
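Written out in R (assuming CASchools is still loaded from the AER package), the call looks like this:

```r
library(AER)
data("CASchools", package = "AER")
# Two predictors on the right of the ~, separated by +
summary(lm(math ~ computer + students, data = CASchools))
```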

```
##
## Call:
## lm(formula = math ~ computer + students, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.89 -12.60 -1.03 12.49 52.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.541e+02 1.089e+00 600.431 < 2e-16 ***
## computer 2.170e-02 5.482e-03 3.959 8.86e-05 ***
## students -2.805e-03 6.183e-04 -4.537 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.34 on 417 degrees of freedom
## Multiple R-squared: 0.04807, Adjusted R-squared: 0.0435
## F-statistic: 10.53 on 2 and 417 DF, p-value: 3.461e-05
```

We can interpret the variables much as we did earlier. We can see that a larger number of computers is associated with higher test scores, and that larger schools, in terms of enrollment, generally do worse on the math test.

Specifically, a one unit increase in computers is associated with an increase in math scores of 0.02 points, and that change is highly significant.

But our interpretation needs to add something more. With multiple regression, what we’re doing is looking at the effect of each variable while holding the other variable constant.

Specifically, a one unit increase in computers is associated with an increase in math scores of 0.02 points when holding the number of students constant, and that change is highly significant.

When we look at the effect of computers in this regression, we’re setting aside the impact of student enrollment and just looking at computers, similarly to how we did with the comppercapita variable above. And when we look at the variable for students, we’re setting aside the impact of computers and isolating the effect of larger school enrollments on test scores.

In the previous chapter we looked at scatter plots and added a line to the graph to better understand the direction of relationships. We can do that again, but it’s slightly different.

Here is the relationship of computers to math scores, and the relationship of computers to math scores holding students constant. We can see the lines produced by our regressions using the command termplot().

```
par(mfrow=c(2,2))                              # 2x2 grid of plots
plot(math~computer, data=CASchools, type="n")  # set up axes without points
abline(lm(math~computer, data=CASchools))      # bivariate slope for computers
plot(math~students, data=CASchools, type="n")
abline(lm(math~students, data=CASchools))      # bivariate slope for students
CASchoolslm <- lm(math~computer+students, data=CASchools)
termplot(CASchoolslm)                          # partial slopes, one per variable
```

All we are doing in multiple regression at this point is drawing lines, the same as we did earlier. Now, though, we’re figuring out what part of math scores is determined uniquely by the student enrollment at a school and what part is determined uniquely by the number of computers. Once R figures that out, it gives us the slopes of two lines, one for computers and one for students. The line for computers slopes upwards, because the more computers a school has, the better its students do, when we hold constant the number of students at the school. When we hold constant the number of computers, larger schools do worse on the math test.

I don’t expect that to fully make sense yet. This was meant to introduce multiple regression; the next example will hopefully clarify what it means to set aside different effects.

## 11.1 Predicting Wages

Let’s say we wanted to predict a person’s income or wages. That would let us figure out what type of worker is making the most money, and whether there are any differences that might be caused by discrimination (against minorities, for instance). We can look at that question using the data set PSID1982 in the package AER. The data comes from the Panel Study of Income Dynamics for the year 1982, so it’s a bit dated, but it’s still useful for our question.

Before we begin, let’s look at the variables we have to play with.

- experience - Years of full-time work experience.
- weeks - Weeks worked.
- occupation - factor. Is the individual a white-collar (“white”) or blue-collar (“blue”) worker?
- industry - factor. Does the individual work in a manufacturing industry?
- south - factor. Does the individual reside in the South?
- smsa - factor. Does the individual reside in a SMSA (standard metropolitan statistical area)?
- married - factor. Is the individual married?
- gender - factor indicating gender.
- union - factor. Is the individual’s wage set by a union contract?
- education - Years of education.
- ethnicity - factor indicating ethnicity. Is the individual African-American (“afam”) or not (“other”)?
- wage - Wage.

All of those variables seem useful if we wanted to predict a person’s wage. And we can see whether people earn more or less based on some of those qualities.

First, let’s see whether white collar or blue collar workers earn more in the data we have, and whether we can say anything about the population (all workers in 1982) based on the data. We’ll run a t test with occupation as the grouping variable and wage as the outcome.

We’ll test the null hypothesis that whether someone works a blue collar or white collar job makes no difference in wages, and see if we can reject it in favor of the alternative hypothesis that occupation does make a difference.
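One way to run that test (a sketch, assuming the PSID1982 data is loaded from the AER package):

```r
library(AER)
data("PSID1982", package = "AER")
table(PSID1982$occupation)                  # how many workers in each group
t.test(wage ~ occupation, data = PSID1982)  # Welch two-sample t test
```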

```
##
## white blue
## 290 305
```

Test statistic | df | P value | Alternative hypothesis
---|---|---|---
9.58 | 441.7 | 7.069e-20 *** | two.sided

mean in group white | mean in group blue
---|---
1350 | 956.4

The average wage for white collar workers in the data is 1350 dollars, and the average wage for blue collar workers is 956.4 dollars. That difference, based on the dispersion of the data, is highly significant. The probability of getting a difference that large by random chance is far less than 1 in a million, so we can be very confident in rejecting the null and asserting that occupation did make a difference in wages (at least in 1982).

Let’s test differences in wages by union status too. Unions have historically been able to boost workers’ wages. Here we have a variable, union, that indicates whether the worker was part of one.

```
##
## no yes
## 377 218
```

Test statistic | df | P value | Alternative hypothesis
---|---|---|---
2.242 | 570.8 | 0.02532 * | two.sided

mean in group no | mean in group yes
---|---
1179 | 1094

So workers in a union earned 1094 dollars on average, while non-union workers earned 1179. That’s somewhat different from what I expected, since one of the big reasons people joined unions was to earn higher wages. But there are probably differences across occupations, as blue collar workers were more likely to be in unions than white collar workers. So there might be a different effect, depending on where you work.

```
## union white blue
## 1 no 1390.494 813.1159
## 2 yes 1158.157 1074.7246
```
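A cross-table like the one above can be built with tapply() (one approach among several; aggregate() would also work), assuming PSID1982 is loaded:

```r
library(AER)
data("PSID1982", package = "AER")
# Mean wage for every combination of union status (rows) and occupation (columns)
tapply(PSID1982$wage, list(PSID1982$union, PSID1982$occupation), mean)
```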

What the table above shows is the relationship for four groups. The top row shows our two occupation groups (white and blue collar) and the first column shows our two union statuses (no and yes). Each cell shows the average wage for one of the four groups they form: white collar non-union, white collar union, blue collar non-union, blue collar union.

What we see is that white collar workers earn more if they aren’t in a union, 1390 vs. 1158. That relationship is flipped for blue collar workers, who earn more in a union, 1074 vs. 813. So the effect of being in a union differs depending on one’s occupation.

In that table, essentially what we’re doing is holding one of the variables constant. We can compare the impact of union status on wages by comparing people with the same occupation. We look at the difference between union and non-union workers, but within the same occupation: comparing white collar union and non-union workers, and separately comparing blue collar union and non-union workers. We’re holding occupation constant, but changing people’s union status to see the difference.

That is what multiple regression does. The coefficients show us the impact of union status on wages when looking at people that work the same occupation. That’s what it means to hold occupation constant.

```
##
## Call:
## lm(formula = wage ~ occupation + union, data = PSID1982)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1072.7 -334.7 -34.7 205.2 3765.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1334.68 30.01 44.469 <2e-16 ***
## occupationblue -424.89 43.79 -9.702 <2e-16 ***
## unionyes 85.06 45.43 1.872 0.0617 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 492.7 on 592 degrees of freedom
## Multiple R-squared: 0.1423, Adjusted R-squared: 0.1394
## F-statistic: 49.11 on 2 and 592 DF, p-value: < 2.2e-16
```

A one unit increase in occupation here means going from white collar to blue collar. So working in a blue collar job lowers a worker’s wage by 424 dollars, holding union status constant, and that difference is highly significant.

Joining a union does increase a person’s wage, holding occupation constant, but that effect is only marginally significant in this regression (p = 0.06). Still, it’s interesting to see the effect differ between the t test and the regression.

The important point is to understand why we want to look at multiple variables. It helps us to understand the independent effect of each variable, and to make better predictions of a person’s wages.

## 11.2 Better predictions

Say you’re a career counselor in a high school back in 1982. What would you tell an impressionable kid about what career to follow based only on the regression we did immediately above?

You should probably tell them to find a white collar job, and joining a union might help their wages as well. But getting a white collar job matters far more, as the difference in earnings is 424 dollars versus 85.

But we can get more information from a regression, as we can test more than 2 or 3 variables at a time. Let’s load this regression up and get as much information about wages as possible.

```
##
## Call:
## lm(formula = wage ~ occupation + union + education + smsa + south +
## industry, data = PSID1982)
##
## Residuals:
## Min 1Q Median 3Q Max
## -956.7 -296.7 -85.8 228.3 3590.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 270.248 141.452 1.911 0.056553 .
## occupationblue -193.605 51.923 -3.729 0.000211 ***
## unionyes 52.877 43.342 1.220 0.222950
## education 65.221 9.007 7.241 1.4e-12 ***
## smsayes 130.363 40.638 3.208 0.001410 **
## southyes -94.680 43.141 -2.195 0.028577 *
## industryyes 157.691 40.322 3.911 0.000103 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 459.9 on 588 degrees of freedom
## Multiple R-squared: 0.2578, Adjusted R-squared: 0.2502
## F-statistic: 34.04 on 6 and 588 DF, p-value: < 2.2e-16
```

Again, as the career counselor, what would you tell your student?

Working in a blue collar job would lower your expected wages by 193 dollars, so it would be smart to work in a white collar job.

Union status doesn’t make a significant difference once we add all these additional variables. It is positive in the sample, but other things matter more, like…

Education is positive and significant. Each additional year of education increases a worker’s wages by 65 dollars. So if a student is thinking about whether to go to college (which typically takes 4 years), they could expect an increase in wages of about 260 dollars if they complete it. They can decide whether that is worth the cost of more education.

The other variables are interesting too, but we don’t need to go on and on. Of course, you now have the tools to interpret them yourself. Each variable’s effect is positive if its coefficient is positive, and negative if its coefficient is negative. You can tell whether the effect is statistically significant by the stars at the end, or you can use the p-value.
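If you’d rather read those numbers programmatically than scan the printed table, coef(summary()) returns them as a matrix. A quick sketch using R’s built-in mtcars data (not our wage data):

```r
# Built-in mtcars data, just to illustrate pulling numbers out of a model
fit <- lm(mpg ~ wt + hp, data = mtcars)
tab <- coef(summary(fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)
tab[, "Estimate"]           # sign gives the direction of each effect
tab[, "Pr(>|t|)"] < 0.05    # TRUE where the effect is significant at the 5% level
```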

## 11.3 Identifying Discrimination

We didn’t use the variable for ethnicity in the regression above, although it is a significant predictor of a worker’s wages. Let’s walk through an example focusing on it. The variable ethnicity has two levels, other and African American. Let’s call the “other” ethnicity group Whites for the sake of making it easier to understand (Hispanics weren’t always counted as a separate group in the Census, and immigration from Latin America increased significantly after 1982, so they were overlooked).

```
##
## other afam
## 552 43
```

Test statistic | df | P value | Alternative hypothesis
---|---|---|---
6.113 | 57.15 | 9.312e-08 *** | two.sided

mean in group other | mean in group afam
---|---
1174 | 808.5

So other races, which in 1982 were largely Whites, earned an average of 1174 dollars, while African Americans earned an average of 808.5 dollars. And that difference is significant too.

So we know that African Americans earn less in the data, and that blue collar workers earn less. How should we predict the wage of an African American white collar worker? Should we assume their wage is more similar to other white collar workers or to other African Americans?

Let’s first look at that question using a table. Because those two variables only have two levels (white/blue and afam/other), we can see the differences quickly by taking the mean value for each of the four groups they form: white collar African Americans, blue collar African Americans, white collar Whites, and blue collar Whites.

```
## ethnicity white blue
## 1 other 1373.1527 977.2635
## 2 afam 918.4667 749.5357
```

Again, that table shows the average wage for each group based on the value in the column and row. So if you’re White and work a white collar job, your average wage was 1373. And if you’re White and work a blue collar job, your average wage was 977. For African Americans, the average wage for white collar workers was 918 and for blue collar workers 749.

So if we were predicting the wage of a white collar African American, we’d be better off using the value for all African Americans, as they earn significantly less than the average white collar worker. But when we look at both variables at the same time, we get a better guess.

Let’s look at the regression results, just for ethnicity and occupation. The coefficients will show us the impact of ethnicity on wages when looking at people that work the same occupation. That’s what it means to hold occupation constant.

```
##
## Call:
## lm(formula = wage ~ occupation + ethnicity, data = PSID1982)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1018.6 -325.2 -40.6 215.3 3734.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1365.62 28.91 47.234 < 2e-16 ***
## occupationblue -380.89 40.11 -9.495 < 2e-16 ***
## ethnicityafam -309.14 77.43 -3.992 7.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 487.6 on 592 degrees of freedom
## Multiple R-squared: 0.1599, Adjusted R-squared: 0.157
## F-statistic: 56.32 on 2 and 592 DF, p-value: < 2.2e-16
```

A one unit increase in occupation here means going from white collar to blue collar. So working in a blue collar job lowers a worker’s wage by 380 dollars, holding ethnicity constant, and that difference is highly significant.

A one unit change in ethnicity means going from White to African American. So being African American lowers a worker’s wages by 309 dollars, holding occupation constant, and that difference is highly significant as well.

What we’ve just done is central to identifying discrimination. If we look at the differences in wages between Whites and African Americans, is that necessarily because employers are discriminating against African Americans? One could object that the difference in wages is generated by them working different occupations.

```
PSID1982$bluecollar <- ifelse(PSID1982$occupation=="blue", 1, 0)
summary(lm(bluecollar~ethnicity, data=PSID1982))
```

```
##
## Call:
## lm(formula = bluecollar ~ ethnicity, data = PSID1982)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6512 -0.5018 0.3488 0.4982 0.4982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.50181 0.02125 23.62 <2e-16 ***
## ethnicityafam 0.14935 0.07903 1.89 0.0593 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4992 on 593 degrees of freedom
## Multiple R-squared: 0.005986, Adjusted R-squared: 0.00431
## F-statistic: 3.571 on 1 and 593 DF, p-value: 0.05928
```

That regression shows African Americans were slightly more likely to work in blue collar jobs, and we know that blue collar jobs earn less than white collar ones.

However, we’ve just shown that differences in occupation don’t explain the differences in wages. African Americans that work the same occupation earn less than Whites; an African American worker in a white collar profession even earns less, on average, than a White worker in a blue collar job. When we identify an earnings gap between Whites and African Americans even while comparing individuals in the same occupations, that’s better evidence of discrimination.

We know other things predict wages as well: education can make a difference, and experience, and whether they’re members of a union… everything the data has. So let’s add more to the regression and see if all those factors help us understand how much workers earned in the data.
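One convenient way to grow a model is update(), which adds terms to an existing fit without retyping the whole formula (a sketch, assuming PSID1982 is loaded):

```r
library(AER)
data("PSID1982", package = "AER")
fit <- lm(wage ~ occupation + ethnicity, data = PSID1982)
# . ~ . means "keep everything already in the model", then add two more terms
fit2 <- update(fit, . ~ . + experience + education)
summary(fit2)
```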

```
##
## Call:
## lm(formula = wage ~ occupation + ethnicity + experience + education,
## data = PSID1982)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1014.1 -292.6 -36.9 222.0 3565.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 89.350 148.425 0.602 0.54741
## occupationblue -142.696 49.136 -2.904 0.00382 **
## ethnicityafam -267.461 73.169 -3.655 0.00028 ***
## experience 9.522 1.792 5.312 1.54e-07 ***
## education 72.676 9.026 8.052 4.51e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 458.8 on 590 degrees of freedom
## Multiple R-squared: 0.2588, Adjusted R-squared: 0.2538
## F-statistic: 51.5 on 4 and 590 DF, p-value: < 2.2e-16
```

Holding constant an individual’s occupation, their experience at the job, and their education, there is still a difference in earnings between Whites and African Americans. We’ve set aside the individual effects of all those variables, and we still find an earnings gap.

Based on this regression, even if an African American had worked at a job the same number of years, held the same occupation, and had the same education, we would expect them to earn 267 dollars less. The more explanations we can rule out for the gap in earnings, the stronger the argument that discrimination is present. Let’s close this section of the chapter by throwing the kitchen sink at predicting wages.

```
##
## Call:
## lm(formula = wage ~ occupation + ethnicity + experience + education +
## union + gender + south + smsa + weeks + industry + married,
## data = PSID1982)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1008.8 -279.8 -37.4 196.7 3482.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.735 234.478 -0.169 0.865491
## occupationblue -180.637 48.699 -3.709 0.000228 ***
## ethnicityafam -167.249 71.815 -2.329 0.020207 *
## experience 6.578 1.744 3.772 0.000179 ***
## education 65.257 8.698 7.503 2.34e-13 ***
## unionyes 25.782 41.761 0.617 0.537225
## genderfemale -344.516 80.124 -4.300 2.00e-05 ***
## southyes -59.662 40.791 -1.463 0.144110
## smsayes 169.833 38.978 4.357 1.56e-05 ***
## weeks 2.962 3.529 0.839 0.401726
## industryyes 86.693 38.443 2.255 0.024496 *
## marriedyes 85.620 64.500 1.327 0.184883
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 429.7 on 583 degrees of freedom
## Multiple R-squared: 0.3575, Adjusted R-squared: 0.3454
## F-statistic: 29.49 on 11 and 583 DF, p-value: < 2.2e-16
```

I should reemphasize that this is one small data set that is fairly old, but it’s a finding you’d get in other data sets as well. With that caveat, we’ve just shown that being African American lowers wages by 167 dollars, holding constant occupation, experience, education, union status, gender, whether the worker lives in the South, whether they live in an urban area, the weeks they work, their industry, and their marital status.

So the results of that regression can be used to identify discrimination, but they can also be used for the more general purpose of predicting an individual’s wages.

This is where the regression becomes a matter of interpretation. Are we concerned that blue collar workers are discriminated against because their wages are lower? Not especially, since their labor is generally more replaceable, so they command lower wages. But we don’t expect ethnicity (or gender) to make a difference in wages when holding occupation, education, etc. constant, because ethnicity and gender don’t make someone’s work more or less valuable. So what the coefficient *means* in the real world is a matter of debate, even if the regression tells us the same thing no matter who reads it.

## 11.4 A little more practice

Let’s run that regression one more time with fewer variables to practice.

```
##
## Call:
## lm(formula = wage ~ occupation + experience + education + industry,
## data = PSID1982)
##
## Residuals:
## Min 1Q Median 3Q Max
## -976.0 -295.9 -54.2 234.0 3608.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.489 148.722 -0.218 0.827150
## occupationblue -167.277 49.526 -3.378 0.000780 ***
## experience 8.564 1.806 4.743 2.64e-06 ***
## education 78.582 9.023 8.709 < 2e-16 ***
## industryyes 150.916 40.141 3.760 0.000187 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 458.5 on 590 degrees of freedom
## Multiple R-squared: 0.2598, Adjusted R-squared: 0.2547
## F-statistic: 51.76 on 4 and 590 DF, p-value: < 2.2e-16
```

What is the intercept? In this case, it’s the predicted wage for someone that works a white collar job, with 0 years of experience, 0 years of education, and doesn’t work in manufacturing. Someone with 0 years of experience? We’re drawing a line of best fit, so it extends all the way back to 0 even when that’s ridiculous. So what do we predict the wage is for this new hire with 0 education in a white collar job that isn’t in manufacturing? -32 dollars.

That isn’t a useful prediction, but it isn’t a realistic person to make a prediction for either. What matters is how changing those qualities impacts a person’s wages.

For a worker in a blue collar occupation, we expect their wage to be 167 dollars lower, holding education, experience, and industry constant, and that change is highly significant.

For each additional year of experience, we expect a worker’s wages to increase by 8.5 dollars, holding occupation, education, and industry constant, and that difference is highly significant.

For every additional year of education a person earns, we expect their wages to increase by 79 dollars, holding occupation, experience, and industry constant, and that change is highly significant.

If a person works in manufacturing as opposed to other industries, we expect their salary to be 151 dollars higher, holding their experience, education, and occupation constant, and that change is highly significant.
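Putting those coefficients to work, predict() gives the expected wage for any combination of qualities we choose. A sketch for a hypothetical worker (made-up values), assuming PSID1982 is loaded:

```r
library(AER)
data("PSID1982", package = "AER")
fit <- lm(wage ~ occupation + experience + education + industry, data = PSID1982)
# A hypothetical worker: blue collar, 10 years of experience,
# 12 years of education, working in manufacturing
newworker <- data.frame(occupation = "blue", experience = 10,
                        education = 12, industry = "yes")
predict(fit, newdata = newworker)
```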

## 11.5 Conclusion

Congrats, you made it to the end of the book (for now).