# Chapter 17 Dummy Variables and Interactions in Regression Analysis

**This chapter is not part of the course HE802 in spring 2021.**

Over the last few weeks, we used simple and then multiple regression analysis to analyze the linear relationships between a continuous numeric dependent variable and one or more independent variables. This week we will continue to build on our knowledge of regression analysis by adding two key capabilities to our toolbox: dummy variables to look at differences between groups and interaction terms to see if the relationship between a dependent and independent variable is different for different groups.

Additionally, we have oral exam #2 coming up very soon. Oral exam #2 will focus on the materials from Chapters 6–8. In the exam, you will be asked to run an OLS linear regression model, interpret its results, and run selected diagnostic tests to determine if we can trust the results of the model you made.

**Everybody should please schedule a time with me (by email) when you would like to take your Oral Exam #2. It should be sometime in the March 16–27 range.**

This week, our goals are to…

- Add dummy variables and interaction terms to our OLS linear regressions and interpret them.

- Visualize regression results, including visualizations with dummy and interaction variables.

## 17.1 Tips, Tricks, and Answers From Last Week

As always, reading this section is optional, but it is based on questions I received from members of the class over the last week and it may contain some useful information.

### 17.1.1 Extenuating Circumstances

Some of you have asked if you can modify the speed and schedule with which you complete the course, especially given recent extenuating circumstances caused by the spread of coronavirus.

The short answer is **yes, in most cases.** The Institute has mechanisms in place to accommodate your schedule and other obligations in situations like these. However, this needs to be discussed individually on a case-by-case basis. Please do not hesitate to e-mail or call me to discuss your specific situation. I will do everything I can to accommodate your request. Thank you.

### 17.1.2 Tables in RMarkdown

Separate from the tables that you make using the `table()` command from within code chunks in RMarkdown, you can also make tables (containing text and/or numbers) in RMarkdown to help organize your writing.

If you would like to learn about how to add tables to your own RMarkdown file, please copy and adapt the code from the following resource:

Note that the most basic way to make a table is to simply type it out like this:

```
| Name | Age | Occupation |
|:------|:----|:-----------|
| Gedke | 23 | Chemist |
| Yada | 54 | Slug Tamer |
| Bif | 34 | Puppeteer |
```

You can copy and paste the table above into your own RMarkdown file and then modify it to your needs.

When you “knit” your RMarkdown file into a PDF, HTML, or Word document using the `Knit` menu, this table that we inputted as plain text with hyphens and vertical bars will be converted into a nice-looking table. Here is what it will look like:

| Name | Age | Occupation |
|:------|:----|:-----------|
| Gedke | 23 | Chemist |
| Yada | 54 | Slug Tamer |
| Bif | 34 | Puppeteer |

I also find it handy to use an online tool to help me make tables in RMarkdown, so that I don’t have to manually type out all of the hyphens and vertical bars you saw above. Here is the tool I usually use:

## 17.2 Final Project Details and Requirements

### 17.2.1 Description

With just over a month remaining in our time together in this course, I would like to share specific expectations and requirements for the final project in this class. The final project is *not* meant to be even close to a full quantitative research study. Instead, you can think of the final project as a take-home final exam. Another way to think about the project is that you will be writing an extended methods section and a condensed results section of an empirical research article.

**The final project is due on April 25, 2020.**^{194}

### 17.2.2 Project Goals

**The goals of this final project are to…**

- Present and interpret the results of *one* quantitative test or model (such as ANOVA, t-test, linear or logistic regression) that answers a clear and specific research question.
- Run, interpret, and appropriately respond to *all* required diagnostic tests for the quantitative test or model and present the results of all tests.

### 17.2.3 Project Requirements

**Here are the items you must present and tasks you must complete:**

- Write a clear research question (RQ) that can be solved using regression analysis techniques. This research question should be a *single sentence* with a question mark at the end.^{195}
- To answer your RQ, various concepts will have to be first measured, recorded in a dataset as variables, and then related to each other quantitatively. Identify a dataset that you will use to answer your research question.^{196}
- Clearly describe the dataset, including: a) the population from which the data sample was drawn, b) the unit of observation, c) all variables that you will use in your analysis and the unit of measurement of each variable, d) background information about the data.
- Given the structure of the data and the RQ of interest, explain which type of quantitative test is most appropriate to answer your RQ and why. Also identify at least one other type of quantitative test that could also be used and explain why you instead chose the test that you did.
- Present basic descriptive statistics that are relevant to your RQ. You should include at least one table and at least one figure/chart.
- Show the code and results of *one* quantitative test or model that answers your RQ.^{197}
- Run and present the results of *all* diagnostic tests that pertain to the type of test or model you ran. Ideally, your model will *pass* all of the tests. If your diagnostics show that your model specification violates any of the assumptions of your chosen test, you might be able to fix the problem and run the test again. Please describe all efforts to fix such problems. If you are not able to solve all such problems, it is okay. The key is that you explain what you find and how you went about your methods.
- Interpret the results of your test/model that are relevant to your RQ.
- Briefly explain any limitations in your analysis.
- Include all R code and results in your final submission.
- Present all writing in well-written English.
- Present everything in an aesthetically pleasing manner. It is recommended that you use an RMarkdown document, but this is not required.

### 17.2.4 Grading Rubric

The final project will be graded according to the rubric below. Each criterion is worth a maximum score of two points unless otherwise noted.

Criterion | Score = 0 | Score = 1 | Score = 2 |
---|---|---|---|
Clear RQ | Unclear, more than one sentence, not a question. | Confusingly presented but understandable. | Clear, simply written, single sentence ending with a question mark. |
Population and sample | Relationship between sample and population is unclear; details about the population are omitted. | Minor omissions, but overall description of the population is understandable. | It is very clear what the population is and how many observations from this population were sampled and then included in the dataset used in the project. |
Unit of observation | The meaning of each row in the data is not understandable from what is written. | Reader can figure out based on context, but a clear explanation is missing. | It is very clear what each row of the data means/represents. This is explicitly stated with no ambiguity or confusion. |
Variables used | The variables used in the analysis are not addressed. | Some variables are mentioned but not all. How each variable is measured is not clear. | Dependent variable and all independent variables are described in one sentence each. Unit of measure (and any relevant explanation of how a variable is coded in the data) is given for each variable. |
Background on data | It is not understandable where the data came from and from what context. | Few details are given about the data. | Clear explanation of where the data came from, when it was collected, who collected it, etc. |
Choose test/model | No explanation of why the presented quantitative test/model was chosen. No comparison to another test/model. Incorrect selection of model type. | An explanation may be there but it might be incorrect, or a comparison to another test/model is missing. | Logical explanation of the way the data is structured and how the selected test/model is best suited to that data structure. Clear explanation of why at least one alternative test/model was not used. |
Descriptive statistics | No or very few statistics presented. Statistics for irrelevant variables or information are presented. | Descriptive statistics do not cover all variables and observations relevant to the RQ. Only one of two required charts is included. | Descriptive statistics are presented for all variables relevant to the RQ and used in the selected test/model. One well-made figure is presented. One well-made table is presented. |
Test/model result | Code and/or summary is not shown for test/model. Code does not accomplish the type of test/model that was supposed to be used. | Only partial work or result is shown. Type of test/model is unclear. | Correct test/model result is shown along with appropriate R code to execute it. |
Test/model assumption 1 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
Test/model assumption 2 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
Test/model assumption 3 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
Interpret results | Many irrelevant details are given. Research question is not clearly answered. | Research question is answered but interpretation of results is not exactly correct. | Succinct interpretation of the portion of the test/model output that pertains to the RQ. |
Limitations | Limitations are not addressed or are completely incorrect given the test/model used. | Limitations are partially addressed. | Multiple plausible limitations to the analysis and the conclusions we can draw from it are addressed. |
R code included | No R code is included. | Only partial R code for the results presented is given. | R code is included (displayed in final document) for all results that were generated using R. |
Writing quality (+) | Sentences and paragraphs are not formatted according to convention. Full sentences are not used much or well. | Minor grammar and/or spelling errors occur throughout, but the main points are understandable. | Writing is clear and succinct. It is easy to read quickly and understand the analysis and the results. No grammar or spelling errors. |
Aesthetics | Project is presented in a confusing manner. Order and flow of requested items is not logical. Unnecessary fonts, symbols, and formatting layout appear. | Minor blemishes and errors are visible in the submitted project. | Order of all content is clear and logical. Sections and sub-sections are logically and clearly marked. The write-up is easy to read. |

Items marked with a (+) in the table above will carry more weight than just two points. All other items have a maximum score of two points.

Your grade on the project will be the number of points achieved divided by the total number of points possible.

**If you are not satisfied with your grade on this project, you do have the option of taking an INCOMPLETE grade for the course. Then, you will improve and re-submit your project in the weeks that follow the end of the course. I will re-grade the project and then put your improved final grade for the course into the grading system.**

## 17.3 Dummy Variables

### 17.3.1 Definition

“Dummy” variables are variables that take on just two values. Here are some examples:

Variable | Levels |
---|---|
gender | female, male |
experimental group | treatment, control |
completed training | did complete training, did not complete training |
citizenship | native, foreign |
test result | pass, fail |

How does this look within a dataset? Maybe you would have a variable in your dataset called `gender` and it would be coded as `1` for females and `0` for males.

A dummy variable can also be called a *binary* variable, *dichotomous* variable, *yes/no* variable, *two-level categorical* variable, etc. All of these terms mean the same thing.
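In R, a sketch of this kind of coding might look like the following. The tiny `gender` vector and the choice to code female as 1 are just illustrative assumptions, not part of any real dataset:

```r
# Hypothetical example: turn a two-level factor into a 0/1 dummy by hand.
# Here we (arbitrarily) code female as 1 and male as 0.
gender <- factor(c("female", "male", "female"))
female <- ifelse(gender == "female", 1, 0)
female
```

As we will see, R performs an equivalent recoding automatically whenever you put a factor variable into a regression.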

But what about when you have three or more qualitative (non-numeric) categories that you want to include in your regression as a single variable?

Here are some examples:

Variable | Levels |
---|---|
Race | black, white, other |
Favorite ice cream flavor | chocolate, vanilla, strawberry, other |
Type of car owned | gas, electric, hybrid, do not have a car |

Please read the following resources related to this situation:

### 17.3.2 Example

I’m going to run through a fake example now, to illustrate how dummy variables are used in regression. First I’m adding a `race` variable to the `GSSvocab` data that you have seen before:

```
# Generate race variable based on US demographics
library(car)
```

`## Loading required package: carData`

```
set.seed(24)
GSSvocab$race <- as.factor(sample(c("white", "black", "other"), nrow(GSSvocab), replace = TRUE, prob = c(.72, .13, .15)))
# Check the levels of the new variable
levels(GSSvocab$race)
# Select the reference group
GSSvocab$race <- relevel(GSSvocab$race, ref = "white")
```

Distribution of the new `race` variable:

`with(GSSvocab, table(race))`

```
## race
## white black other
## 20922 3647 4298
```

Here’s an OLS linear regression:

`summary(VocabReg1 <- lm(vocab ~ age + gender + educ + race , data=GSSvocab))`

```
##
## Call:
## lm(formula = vocab ~ age + gender + educ + race, data = GSSvocab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9897 -1.1283 0.0782 1.2361 8.5150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8557585 0.0621644 13.766 < 2e-16 ***
## age 0.0143656 0.0006421 22.374 < 2e-16 ***
## gendermale -0.1465174 0.0223500 -6.556 5.64e-11 ***
## educ 0.3459089 0.0037091 93.261 < 2e-16 ***
## raceblack 0.0148854 0.0338229 0.440 0.660
## raceother -0.0196448 0.0314467 -0.625 0.532
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.831 on 27402 degrees of freedom
## (1459 observations deleted due to missingness)
## Multiple R-squared: 0.2435, Adjusted R-squared: 0.2433
## F-statistic: 1764 on 5 and 27402 DF, p-value: < 2.2e-16
```

There are three dummy variables in this regression model: `gendermale`, `raceblack`, and `raceother`.

Let’s interpret these one at a time.

`gendermale` is a dummy variable that the computer created for us based on the factor (categorical) variable `gender`, which just has two levels (male and female). It added the word “male” to the variable name to tell us that it coded male as 1 and female as 0.

The coefficient of `gendermale` is interpreted like this: *Males are predicted to score 0.1465 lower on the vocabulary test than females, holding all other independent variables constant.* We compare the level of the variable that corresponds to 1 (males) with the level that corresponds to 0 (females).

`raceblack` and `raceother` are two dummy variables that were generated for us by the computer from the categorical factor variable `race`. Let’s walk through exactly how this happened. `race` has three levels: black, white, and other. We want to see if race is associated with our dependent variable (`vocab`), but race is not a number that we can just add into the regression as a numeric variable like we can for age or years of education. All we can do is predict the differences in the dependent variable for each of these three groups.

But the computer only understands numbers. So what do we do? Well, we convert the categorical variable `race` into numeric dummy variables (which only take on the values/levels of 1 and 0). We can make three dummy variables:

- `raceblack` – Coded as 1 for anyone in the survey data who is black and 0 for anyone who is not black (meaning they are white or other).
- `raceother` – Coded as 1 for anyone in the survey data who is non-white and non-black and 0 for anyone who is white or black.
- `racewhite` – Coded as 1 for anyone in the survey data who is white and 0 for anyone who is not white (meaning they are either black or other).

Now we have three separate *numeric* variables that, acting together, replace our need for the *categorical* `race` variable that we started with. Note that the computer does this automatically for us when we run the regression, because it knows^{198} everything I just told you.
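If you want to peek at what the computer does behind the scenes, the built-in `model.matrix()` function shows the dummy columns that R builds from a factor. This is just an illustrative sketch using a tiny made-up `race` vector, not the full `GSSvocab` data:

```r
# Illustrative sketch: R expands a factor into dummy columns automatically.
# "white" is listed first in levels, so it becomes the reference category.
race <- factor(c("white", "black", "other"), levels = c("white", "black", "other"))
m <- model.matrix(~ race)
m
# The matrix has an intercept column plus one dummy per non-reference level:
# raceblack and raceother, but no racewhite.
colnames(m)
```

Notice that R only creates dummies for the non-reference levels, which is exactly the "number of categories minus one" behavior described below.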

Then we include *two* of these three variables in the regression as numeric variables. That’s what the computer did for us automatically. It left one out. Why did it leave one out? Well, for the `gender` variable, which is also a categorical factor variable, we could make one dummy variable called `gendermale`^{199} and another called `genderfemale`,^{200} if we really wanted to. That would be perfectly legitimate.

Consider what would happen if we included *both* `gendermale` and `genderfemale` as independent variables in the regression. It wouldn’t be useful for us to include both of these because they contain the same information. Both the `gendermale` and the `genderfemale` variables tell us whether a person in the survey is male or female, even though they are coded differently. There is no added benefit to having both of these variables. We just need one of them to understand the predicted difference in the dependent variable between males and females.

Going back to the three-category `race` variable, but keeping the above explanation about the two-category `gender` variable in mind, let’s think about why we only need to include two of the three race dummy variables in our regression model. Remember that we now have three dummy variables that we have added to our data, and we know the values of these three variables for *every single person* in the dataset.

Below are three people in our dataset. We’ll pretend their names are Ophelia who is other, Winnie who is white, and Belinda who is black. You can see their race as it is coded in the `race` variable, and then you can see how each of these people is coded for each of the three new dummy variables.

name | race | raceblack | raceother | racewhite |
---|---|---|---|---|
Ophelia | other | 0 | 1 | 0 |
Winnie | white | 0 | 0 | 1 |
Belinda | black | 1 | 0 | 0 |

With these three dummy variables, we no longer need the original `race` variable. Let’s look at the table again, but only with the new dummy variables:

name | raceblack | raceother | racewhite |
---|---|---|---|
Ophelia | 0 | 1 | 0 |
Winnie | 0 | 0 | 1 |
Belinda | 1 | 0 | 0 |

Using only these dummy variables, we can see very easily that Ophelia is other, because she is coded as 1 for `raceother`, 0 for `raceblack`, and 0 for `racewhite`. We don’t need the original `race` variable anymore to identify each person’s race. And these dummy variables are purely numeric, so the computer can now incorporate everyone’s race into the regression. We solved that problem.

But now have a look at the version of the table below, in which I have eliminated the `racewhite` variable:

name | raceblack | raceother |
---|---|---|
Ophelia | 0 | 1 |
Winnie | 0 | 0 |
Belinda | 1 | 0 |

Let’s go through our three surveyed people based on this new table with just two variables:

- Ophelia: Coded as 1 for `raceother` and 0 for `raceblack`, so we know she is other.
- Winnie: Coded as 0 for `raceother` and 0 for `raceblack`, so we know that she is white, because she’s coded as 0 for the other two dummy variables.
- Belinda: Coded as 0 for `raceother` and 1 for `raceblack`, so we know she is black.

Even though we took out the `racewhite` variable, we still were able to figure out each person’s race, even Winnie’s! Including all three dummy variables for a three-category categorical variable is **too much information**! The computer doesn’t need it. The variable that we leave out is called the *reference category* or *reference level*. If our categorical variable has 3 categories, we need 2 dummy variables. If our categorical variable has 6 categories, we need 5 dummy variables. The rule is:

\[\text{number of dummy variables} = \text{number of categories} - 1\]

That was a pretty lengthy explanation. Hopefully this makes sense. Ask questions if not!

Let’s return finally to the regression we ran. The coefficient for `raceblack` is 0.0149 and the coefficient for `raceother` is -0.0196. Here’s how we interpret these results:

- Black race people are predicted to have a vocabulary score that is 0.0149 higher than white race people, controlling for all of the other independent variables.
- Other race people are predicted to have a vocabulary score that is 0.0196 lower than white race people, controlling for all of the other independent variables.

`racewhite` was left out as the reference category. The coefficient for `raceother` compares all people for whom `raceother = 1` (like Ophelia) to those for whom `raceother = 0` and `raceblack = 0` (like Winnie). The coefficient for `raceblack` compares all people for whom `raceblack = 1` (like Belinda) to all people for whom `raceother = 0` and `raceblack = 0` (like Winnie once again).

The regression only allows us to compare each category with the reference category. All non-white people can only be compared with those who are white. We could have left out a different variable as the reference category and included `racewhite`. In that case, all of our results would be compared to the selected reference category. But the overall results would be the same. Look:

```
GSSvocab$race <- relevel(GSSvocab$race, ref = "black")
summary(VocabReg2 <- lm(vocab ~ age + gender + educ + race, data = GSSvocab))
```

```
##
## Call:
## lm(formula = vocab ~ age + gender + educ + race, data = GSSvocab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9897 -1.1283 0.0782 1.2361 8.5150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8706439 0.0683596 12.736 < 2e-16 ***
## age 0.0143656 0.0006421 22.374 < 2e-16 ***
## gendermale -0.1465174 0.0223500 -6.556 5.64e-11 ***
## educ 0.3459089 0.0037091 93.261 < 2e-16 ***
## racewhite -0.0148854 0.0338229 -0.440 0.660
## raceother -0.0345303 0.0423761 -0.815 0.415
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.831 on 27402 degrees of freedom
## (1459 observations deleted due to missingness)
## Multiple R-squared: 0.2435, Adjusted R-squared: 0.2433
## F-statistic: 1764 on 5 and 27402 DF, p-value: < 2.2e-16
```

I changed the reference category to `black` and now you’ll see that it is left out of the regression, and `racewhite` is included. The coefficient for `racewhite` is now `-0.0149`, exactly the same magnitude as the previous `raceblack` coefficient but with the opposite sign! Despite changing the reference category, we got the exact same result: white people are predicted to have a vocabulary score that is 0.0149 lower than that of black people (holding constant all of the other independent variables). So it ultimately didn’t matter too much which reference category we chose. The rest of the model is the same. All the other coefficients are the same. The `Multiple R-squared` is exactly the same.

Hopefully all of this made sense. I couldn’t find an explanation of more-than-two category categorical variables that I liked, so I wrote all of this. Please ask for clarifications as you see fit!

Keep in mind that I generated the above race data randomly just to illustrate how to use dummy variables in regressions, so this is obviously not a true finding. Another reminder of this is that the p-values for the race dummy variables in the result were not statistically significant. So, in any case, we don’t have any evidence that the fake race assignments are associated with vocabulary score in the population at large.

### 17.3.3 Optional Resources

The following resources might help reinforce your understanding of dummy variables. It is not required for you to read/consume these:

- Working With Dummy Variables
- Section “14.1 Dummy Variables” of *Quantitative Research Methods for Political Science…*
- Dummy Variables in Regression
- 2:38 and after in “Eviews 7: How to interpret dummy variables and the dummy variable trap explained part 1” – Note that in the regression output shown here, C means “Constant,” which is the same thing as an intercept. The video should start at 2:38 automatically if you open it through this link.

Some of the resources above explain how to interpret a regression coefficient for a dummy variable when there are only two categories (such as male and female) that you need to capture in your regression.

## 17.4 Interactions in Regression

### 17.4.1 Definition and Overview

Previously, we learned about the concept of an interaction when doing ANOVA tests. Now, we will look at how to incorporate the very same concept into a linear regression model.

Please have a look at the following resources. If you find the concept of an interaction to be intuitive already, you can quickly skim through these resources rather than reading word-for-word.

### 17.4.2 Example

Now we will run through a very short example, which you can easily run in R on your own computer!

Please copy the code below into R and run it.^{201} You will need to modify this code as part of your assignment this week.

```
if (!require(interactions)) install.packages('interactions')
library(interactions)
summary(fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris))
with(iris,table(Species))
interact_plot(fitiris, pred = Petal.Width, modx = Species)
```

If you want to see the same thing but with the individual data points displayed as well (which isn’t always a good idea), run this:

`interact_plot(fitiris, pred = Petal.Width, modx = Species, plot.points = TRUE)`

Basically, when we interact a continuous variable with a categorical variable, we are asking the regression model to tell us if there is a *different slope* for the relationship between that continuous variable and the dependent variable for the different levels of the categorical variable. This is extremely useful.

Here’s how this corresponds to our borrowed example above:

dependent variable | continuous independent variable | categorical independent variable |
---|---|---|

petal length | petal width | species |

In the regression, we interacted petal width and species. We got the following regression equation:

\[\begin{eqnarray} Petal.Length_{predicted} &=& 0.55Petal.Width + 0.45Speciesversicolor \\ && + \text{ } 2.91Speciesvirginica + 1.32Petal.Width*Speciesversicolor \\ && + \text{ } 0.10Petal.Width*Speciesvirginica + 1.33 \end{eqnarray}\]

The categorical variable `Species` has three levels:

`levels(iris$Species)`

`## [1] "setosa" "versicolor" "virginica"`

`setosa` is the reference category. You’ll notice that it’s missing from the regression output, and that’s why. The computer created dummy variables for the other two species: `Speciesversicolor` and `Speciesvirginica`.

Let’s look now at the predicted relationship between the independent variable `Petal.Width` and the dependent variable `Petal.Length`. To do this, we’ll take out any terms^{202} that include `Petal.Width` on the right side of the equation:

\[0.55Petal.Width + 1.32Petal.Width*Speciesversicolor + 0.10Petal.Width*Speciesvirginica\]

Using this, we can figure out the predicted relationship between `Petal.Length` and `Petal.Width` for plants that fall into each of the three levels of `Species`:

- `setosa` – For all setosa plants, `Speciesversicolor` is coded as 0 and `Speciesvirginica` is also coded as 0. We plug in 0 for each of these in the expression above: \(0.55Petal.Width + 1.32Petal.Width*0 + 0.10Petal.Width*0\), which is equal to \(0.55Petal.Width\). 0.55 is the final coefficient. For setosa plants, a one-unit increase in `Petal.Width` is associated with a 0.55 unit increase in `Petal.Length`.
- `versicolor` – For all versicolor plants, `Speciesversicolor` is coded as 1 and `Speciesvirginica` is coded as 0. We plug these into the expression above: \(0.55Petal.Width + 1.32Petal.Width*1 + 0.10Petal.Width*0\), which is equal to \((0.55+1.32)Petal.Width = 1.87Petal.Width\). 1.87 is the final coefficient. For versicolor plants, a one-unit increase in `Petal.Width` is associated with a 1.87 unit increase in `Petal.Length`.
- `virginica` – For all virginica plants, `Speciesversicolor` is coded as 0 and `Speciesvirginica` is coded as 1. We plug these into the expression above: \(0.55Petal.Width + 1.32Petal.Width*0 + 0.10Petal.Width*1\), which is equal to \((0.55+0.10)Petal.Width = 0.65Petal.Width\). 0.65 is the final coefficient. For virginica plants, a one-unit increase in `Petal.Width` is associated with a 0.65 unit increase in `Petal.Length`.

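This arithmetic can also be done directly from the fitted model's coefficients. The sketch below refits the same `iris` model and adds up the relevant coefficients for each species; the results should match the 0.55, 1.87, and 0.65 slopes computed by hand above:

```r
# Recover the per-species slope of Petal.Width from the interaction model
fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
b <- coef(fitiris)
slope_setosa     <- b["Petal.Width"]  # reference category: no interaction term
slope_versicolor <- b["Petal.Width"] + b["Petal.Width:Speciesversicolor"]
slope_virginica  <- b["Petal.Width"] + b["Petal.Width:Speciesvirginica"]
round(c(setosa     = unname(slope_setosa),
        versicolor = unname(slope_versicolor),
        virginica  = unname(slope_virginica)), 2)
```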
There are more examples in the resources linked above and you’ll also be practicing this in this week’s assignment.

## 17.5 Assignment

In this week’s assignment, you will revisit some of your work from last week as we add dummy variables and interaction terms into our linear regression models.

Like last week, load the `GSSvocab` dataset from the `car` package. Once again, run the exact same regression you ran last week, which used the variables `age`, `gender`, `educ`, and `vocab`.^{203}

### 17.5.1 Dummy Variables, Part 1

Right now in the dataset, `gender` is coded as a `factor` variable:

`class(GSSvocab$gender)`

`## [1] "factor"`

`factor` is what R calls a categorical variable.

And how many levels does this categorical variable have?

`length(levels(GSSvocab$gender))`

`## [1] 2`

`levels(GSSvocab$gender)`

`## [1] "female" "male"`

It has `2` levels, and those levels are `female` and `male`. So this particular categorical variable is also a dummy variable.

**Task 1**: Recode the `gender` variable. Make a new variable called `female` for which females are coded as `1` and males as `0`.

**Task 2**: Create a two-way table to show that your recode was successful.

**Task 3**: Use the `class()` command (demonstrated above) to figure out what type of variable your new `female` variable is. It should be numeric.

**Task 4**: Re-run the same linear regression (with `age`, `gender`, `educ`, and `vocab`), but replace `gender` with the new `female` variable that you just made. Is the regression result the same as the one you got last week? It should be the same.

What just happened? Last week, R converted the `gender` variable to a dummy variable for you automatically. So you don’t actually need to do this recoding process every time you use a dummy variable. **But it’s important for you to know that the computer is nevertheless treating females as 1 and males as 0 (or sometimes vice versa, but it’s always using 0’s and 1’s).**
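You can verify this equivalence on any small dataset. The sketch below uses made-up data (the variable names and random values are purely illustrative): the factor version and the hand-made dummy give the same fit, with the gender coefficient flipping sign because the two codings point in opposite directions.

```r
# Sketch with made-up data: a factor and a manual 0/1 dummy give the same model
set.seed(1)
d <- data.frame(y = rnorm(10),
                gender = factor(rep(c("female", "male"), 5)))
d$female <- ifelse(d$gender == "female", 1, 0)
m1 <- lm(y ~ gender, data = d)  # R creates "gendermale" (male = 1) automatically
m2 <- lm(y ~ female, data = d)  # our dummy codes female = 1
# Same magnitude, opposite sign:
coef(m1)["gendermale"]
coef(m2)["female"]
```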

**Task 5**: Now that you know more this week than last week about dummy variables, interpret the coefficient for the `female` variable in your regression output.

### 17.5.2 Dummy Variables, Part 2

Now consider this new research question, still using the `GSSvocab` dataset: Do native-born people have different vocabulary abilities than non-native-born people, controlling for age, gender, and education?

**Task 6**: What is the null hypothesis for this research question?

**Task 7**: What is the alternate hypothesis for this research question?

**Task 8**: Run a new regression to answer this new research question. Show the results of this regression.

**Task 9**: Write out the full regression equation based on this output.

**Task 10**: What is the predicted vocabulary score for someone with the following characteristics? Please show the entire calculation.

- `age` = 35
- `gender` = male
- `educ` = 8
- `nativeBorn` = yes

**Task 11**: What is the predicted vocabulary score for someone with the same characteristics above, except that they are female? *The difference should be equal to the coefficient of the dummy variable for gender!* That’s the whole point! Please show the entire calculation.

**Task 12**: What is the answer to the research question? Make sure your answer includes an interpretation of the coefficient for the `nativeBorn` variable, as well as that coefficient’s standard error, t-value, and p-value.

### 17.5.3 Interactions

Now we’ll turn to another research question: Is the relationship between education and vocabulary *different* for native-born and non-native-born people, when controlling for age and gender? In other words, is there an interaction between `educ` and `nativeBorn`, when controlling for `age` and `gender`?

This page is likely to help you complete the next few tasks. **And you should also refer to the code with the iris data that is earlier in this chapter.**

**Task 13**: Modify your previous code and run a new regression that includes the interaction in this new research question. Show your regression table in your submission.

**Task 14**: Write out the full regression equation based on this output.

**Task 15**: Use the `interact_plot()` function to visualize the results.

**Task 16**: What is the answer to the new research question about the interaction? Make sure you look to see which coefficients are statistically significant and then interpret the results accordingly.

### 17.5.4 Logistical Tasks

**Task 17**: Please submit any feedback or questions you have as part of your assignment.

**Task 18**: Please e-mail me to **schedule a time when you would like to take your Oral Exam #2.** It should be sometime in the **March 16–27 2020 range**.

**Task 19**: Please submit your assignment to the D2L dropbox as always.

This due date was added on April 1, 2020.↩︎

There are no exceptions to this requirement.↩︎

As stated before, those of you who do not have data of your own that you would like to analyze can have a discussion with me and I can provide you with a research question and a dataset in which to study it.↩︎

In reality, you will likely run many tests/models on your own to arrive at the one that fits your RQ and data the best. But you do not need to show all of this work in your final submission. If you do wish to show all of this additional work, you can include it in an appendix to your assignment, but this is not required.↩︎

Meaning that it is programmed to behave as if it knows.↩︎

In this variable `gendermale`, all males in the data would be coded as 1 and all females would be coded as 0.↩︎

In this variable `genderfemale`, all females in the data would be coded as 1 and all males would be coded as 0, which is the exact opposite of how we would code the `gendermale` variable.↩︎

Source: Exploring interactions with continuous predictors in regression models↩︎

A *term* is anything in between the plus signs. In the equation \(a = 2b + rudolph + 43\), \(2b\), \(rudolph\), and \(43\) are all terms on the right side of the equation.↩︎

Just copy and paste your code from last week. Don’t type it again!↩︎