# Chapter 17 Dummy Variables and Interactions in Regression Analysis

This chapter is not part of the course HE802 in spring 2021.

Over the last few weeks, we used simple and then multiple regression analysis to analyze the linear relationships between a continuous numeric dependent variable and one or more independent variables. This week we will continue to build on our knowledge of regression analysis by adding two key capabilities to our toolbox: dummy variables to look at differences between groups and interaction terms to see if the relationship between a dependent and independent variable is different for different groups.

Additionally, we have oral exam #2 coming up very soon. Oral exam #2 will focus on the materials from Chapters 6–8. In the exam, you will be asked to run an OLS linear regression model, interpret its results, and run selected diagnostic tests to determine if we can trust the results of the model you made.

Everybody should please schedule a time with me (by email) when you would like to take your Oral Exam #2. It should be sometime in the March 16–27 range.

This week, our goals are to…

1. Add dummy variables and interaction terms to our OLS linear regressions and interpret them.

2. Visualize regression results, including visualizations with dummy and interaction variables.

## 17.1 Tips, Tricks, and Answers From Last Week

As always, reading this section is optional, but it is based on questions I received from members of the class over the last week and it may contain some useful information.

### 17.1.1 Extenuating Circumstances

Some of you have asked if you can modify the speed and schedule with which you complete the course, especially given recent extenuating circumstances caused by the spread of coronavirus.

The short answer is yes, in most cases. The Institute has mechanisms in place to accommodate your schedule and other obligations in situations like these. However, this needs to be discussed individually on a case-by-case basis. Please do not hesitate to e-mail or call me to discuss your specific situation. I will do everything I can to accommodate your request. Thank you.

### 17.1.2 Tables in RMarkdown

Separate from the tables that you make using the table() command from within code chunks in RMarkdown, you can also make tables (containing text and/or numbers) in RMarkdown to help organize your writing.

Note that the most basic way to make a table is to simply type it out like this:

| Name  | Age | Occupation |
|:------|:----|:-----------|
| Gedke | 23  | Chemist    |
| Yada  | 54  | Slug Tamer |
| Bif   | 34  | Puppeteer  |

You can copy and paste the table above into your own RMarkdown file and then modify it to your needs.

When you “knit” your RMarkdown file into a PDF, HTML, or Word document using the Knit menu, this table that we inputted as plain text with hyphens and vertical bars will be converted into a nice-looking table. Here is what it will look like:

Name Age Occupation
Gedke 23 Chemist
Yada 54 Slug Tamer
Bif 34 Puppeteer

I also find it handy to use an online tool to help me make tables in RMarkdown, so that I don’t have to manually type out all of the hyphens and vertical bars you saw above. Here is the tool I usually use:

## 17.2 Final Project Details and Requirements

### 17.2.1 Description

With just over a month remaining in our time together in this course, I would like to share specific expectations and requirements for the final project in this class. The final project is not meant to be even close to a full quantitative research study. Instead, you can think of the final project as a take-home final exam. Another way to think about the project is that you will be writing an extended methods section and a condensed results section of an empirical research article.

The final project is due on April 25, 2020.[^194]

### 17.2.2 Project Goals

The goals of this final project are to…

1. Present and interpret the results of one quantitative test or model (such as ANOVA, t-test, linear or logistic regression) that answers a clear and specific research question.

2. Run, interpret, and appropriately respond to all required diagnostic tests for the quantitative test or model and present the results of all tests.

### 17.2.3 Project Requirements

Here are the items you must present and tasks you must complete:

1. Write a clear research question (RQ) that can be solved using regression analysis techniques. This research question should be a single sentence with a question mark at the end.[^195]

2. To answer your RQ, various concepts will have to be first measured, recorded in a dataset as variables, and then related to each other quantitatively. Identify a dataset that you will use to answer your research question.[^196] Clearly describe the dataset, including: a) the population from which the data sample was drawn, b) the unit of observation, c) all variables that you will use in your analysis and the unit of measurement of each variable, and d) background information about the data.

3. Given the structure of the data and the RQ of interest, explain which type of quantitative test is most appropriate to answer your RQ and why. Also identify at least one other type of quantitative test that could also be used and explain why you instead chose the test that you did.

4. Present basic descriptive statistics that are relevant to your RQ. You should include at least one table and at least one figure/chart.

5. Show the code and results of one quantitative test or model that answers your RQ.[^197]

6. Run and present the results of all diagnostic tests that pertain to the type of test or model you ran. Ideally, your model will pass all of the tests. If your diagnostics show that your model specification violates any of the assumptions of your chosen test, you might be able to fix the problem and run the test again. Please describe all efforts to fix such problems. If you are not able to solve all such problems, it is okay. The key is that you explain what you find and how you went about your methods.

7. Interpret the results of your test/model that are relevant to your RQ.

8. Briefly explain any limitations in your analysis.

9. Include all R code and results in your final submission.

10. Present all writing in well-written English.

11. Present everything in an aesthetically pleasing manner. It is recommended that you use an RMarkdown document, but this is not required.

The final project will be graded according to the rubric below. Each criterion is worth a maximum score of two points unless otherwise noted.

| Criterion | Score = 0 | Score = 1 | Score = 2 |
|:----------|:----------|:----------|:----------|
| Clear RQ | Unclear, more than one sentence, not a question. | Confusingly presented but understandable. | Clear, simply written, single sentence ending with a question mark. |
| Population and sample | Relationship between sample and population is unclear; details about the population are omitted. | Minor omissions, but overall description of the population is understandable. | It is very clear what the population is and how many observations from this population were sampled and then included in the dataset used in the project. |
| Unit of observation | The meaning of each row in the data is not understandable from what is written. | Reader can figure out based on context, but a clear explanation is missing. | It is very clear what each row of the data means/represents. This is explicitly stated with no ambiguity or confusion. |
| Variables used | The variables used in the analysis are not addressed. | Some variables are mentioned but not all. How each variable is measured is not clear. | Dependent variable and all independent variables are described in one sentence each. Unit of measure (and any relevant explanation of how a variable is coded in the data) is given for each variable. |
| Background on data | It is not understandable where the data came from and from what context. | Few details are given about the data. | Clear explanation of where the data came from, when it was collected, who collected it, etc. |
| Choose test/model | No explanation of why the presented quantitative test/model was chosen. No comparison to another test/model. Incorrect selection of model type. | An explanation may be there but it might be incorrect, or a comparison to another test/model is missing. | Logical explanation of the way the data is structured and how the selected test/model is best suited to that data structure. Clear explanation of why at least one alternative test/model was not used. |
| Descriptive statistics | No or very few statistics presented. Statistics for irrelevant variables or information are presented. | Descriptive statistics do not cover all variables and observations relevant to the RQ. Only one of two required charts is included. | Descriptive statistics are presented for all variables relevant to the RQ and used in the selected test/model. One well-made figure is presented. One well-made table is presented. |
| Test/model result | Code and/or summary is not shown for test/model. Code does not accomplish the type of test/model that was supposed to be used. | Only partial work or result is shown. Type of test/model is unclear. | Correct test/model result is shown along with appropriate R code to execute it. |
| Test/model assumption 1 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
| Test/model assumption 2 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
| Test/model assumption 3 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
| Interpret results | Many irrelevant details are given. Research question is not clearly answered. | Research question is answered but interpretation of results is not exactly correct. | Succinct interpretation of the portion of the test/model output that pertains to the RQ. |
| Limitations | Limitations are not addressed or are completely incorrect given the test/model used. | Limitations are partially addressed. | Multiple plausible limitations to the analysis and the conclusions we can draw from it are addressed. |
| R code included | No R code is included. | Only partial R code for the results presented is given. | R code is included (displayed in final document) for all results that were generated using R. |
| Writing quality (+) | Sentences and paragraphs are not formatted according to convention. Full sentences are not used much or well. | Minor grammar and/or spelling errors occur throughout, but the main points are understandable. | Writing is clear and succinct. It is easy to read quickly and understand the analysis and the results. No grammar or spelling errors. |
| Aesthetics | Project is presented in a confusing manner. Order and flow of requested items is not logical. Unnecessary fonts, symbols, and formatting appear. | Minor blemishes and errors are visible in the submitted project. | Order of all content is clear and logical. Sections and sub-sections are logically and clearly marked. The write-up is easy to read. |

Items marked with a (+) in the table above will carry more weight than just two points. All other items have a maximum score of two points.

Your grade on the project will be the number of points achieved divided by the total number of points possible.

If you are not satisfied with your grade on this project, you do have the option of taking an INCOMPLETE grade for the course. Then, you will improve and re-submit your project in the weeks that follow the end of the course. I will re-grade the project and then put your improved final grade for the course into the grading system.

## 17.3 Dummy Variables

### 17.3.1 Definition

“Dummy” variables are variables that just have two values. Here are some examples:

| Variable | Levels |
|:---------|:-------|
| gender | female, male |
| experimental group | treatment, control |
| completed training | did complete training, did not complete training |
| citizenship | native, foreign |
| test result | pass, fail |

How does this look within a dataset? Maybe you would have a variable in your dataset called gender and it would be coded as 1 for females and 0 for males.

A dummy variable can also be called a binary variable, dichotomous variable, yes/no variable, two-level categorical variable, etc. All of these terms mean the same thing.
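As a quick illustration (using a small made-up factor, not the course data), here is one way a two-level factor can be recoded as a 0/1 dummy variable in R:

```r
# Hypothetical mini-example: recode a two-level factor as a 0/1 dummy variable
gender <- factor(c("female", "male", "male", "female"))
female <- ifelse(gender == "female", 1, 0)
female
## [1] 1 0 0 1
```

A two-way table of gender against female would confirm that every female is coded 1 and every male is coded 0.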

But what about when you have three or more qualitative (non-numeric) categories that you want to include in your regression as a single variable?

Here are some examples:

| Variable | Levels |
|:---------|:-------|
| race | black, white, other |
| favorite ice cream flavor | chocolate, vanilla, strawberry, other |
| type of car owned | gas, electric, hybrid, do not have a car |

### 17.3.2 Example

I’m going to run through a fake example now, to illustrate how dummy variables are used in regression. First I’m adding a race variable to the GSSvocab data that you have seen before:

```r
# Generate race variable based on US demographics
library(car)
## Loading required package: carData
set.seed(24)
GSSvocab$race <- as.factor(sample(c("white", "black", "other"), nrow(GSSvocab),
                                  replace = TRUE, prob = c(.72, .13, .15)))

# Check the levels of the new variable
levels(GSSvocab$race)

# Select the reference group
GSSvocab$race <- relevel(GSSvocab$race, ref = "white")
```

Distribution of the new race variable:

```r
with(GSSvocab, table(race))
## race
## white black other
## 20922  3647  4298
```

Here’s an OLS linear regression:

```r
summary(VocabReg1 <- lm(vocab ~ age + gender + educ + race, data = GSSvocab))
##
## Call:
## lm(formula = vocab ~ age + gender + educ + race, data = GSSvocab)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9897 -1.1283  0.0782  1.2361  8.5150
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.8557585  0.0621644  13.766  < 2e-16 ***
## age          0.0143656  0.0006421  22.374  < 2e-16 ***
## gendermale  -0.1465174  0.0223500  -6.556 5.64e-11 ***
## educ         0.3459089  0.0037091  93.261  < 2e-16 ***
## raceblack    0.0148854  0.0338229   0.440    0.660
## raceother   -0.0196448  0.0314467  -0.625    0.532
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.831 on 27402 degrees of freedom
##   (1459 observations deleted due to missingness)
## Multiple R-squared:  0.2435, Adjusted R-squared:  0.2433
## F-statistic:  1764 on 5 and 27402 DF,  p-value: < 2.2e-16
```

There are three dummy variables in this regression model: gendermale, raceblack, and raceother.

Let’s interpret these one at a time.

gendermale is a dummy variable that the computer created for us based on the factor (categorical) variable gender, which just has two levels (male and female). It added the word “male” to the variable name to tell us that it coded male as 1 and female as 0.

The coefficient of gendermale is interpreted like this: Males are predicted to score 0.1465 lower on the vocabulary test than females, holding all other independent variables constant. We compare the level of the variable that corresponds to 1 (males) with the level that corresponds to 0 (females).

raceblack and raceother are two dummy variables that were generated for us by the computer from the categorical factor variable race. Let’s walk through exactly how this happened. race has three levels: black, white, and other. We want to see if race is associated with our dependent variable (vocab), but race is not a number that we can just add into the regression as a numeric variable like we can for age or years of education. All we can do is predict the differences in the dependent variable for each of these three groups.

But the computer only understands numbers. So what do we do? Well, we convert the categorical variable race into numeric dummy variables (which only take on the values/levels of 1 and 0). We can make three dummy variables:

1. raceblack – Coded as 1 for anyone in the survey data who is black and 0 for anyone who is not black (meaning they are white or other).
2. raceother – Coded as 1 for anyone in the survey data who is non-white and non-black and 0 for anyone who is white or black.
3. racewhite – Coded as 1 for anyone in the survey data who is white and 0 for anyone who is not white (meaning they are either black or other).

Now we have three separate numeric variables that, acting together, replace our need for the categorical race variable that we started with. Note that the computer does this automatically for us when we run the regression, because it knows[^198] everything I just told you.

Then we include two of these three variables in the regression as numeric variables. That’s what the computer did for us automatically. It left one out. Why did it leave one out? Well, for the gender variable, which is also a categorical factor variable, we could make one dummy variable called gendermale[^199] and another called genderfemale,[^200] if we really wanted to. That would be perfectly legitimate.

Consider what would happen if we included both gendermale and genderfemale as independent variables in the regression. It wouldn’t be useful for us to include both of these because they contain the same information. Both the gendermale and the genderfemale variables tell us whether a person in the survey is male or female, even though they are coded differently. There is no added benefit to having both of these variables. We just need one of them to understand the predicted difference in the dependent variable between males and females.

Going back to the three-category race variable, but keeping the above explanation about the two-category gender variable in mind, let’s think about why we only need to include two of the three race dummy variables in our regression model. Remember that we now have three dummy variables that we have added to our data, and we know the values of these three variables for every single person in the dataset.

Below are three people in our dataset. We’ll pretend their names are Ophelia who is other, Winnie who is white, and Belinda who is black. You can see their race as it is coded in the race variable, and then you can see how each of these people is coded for each of the three new dummy variables.

| name | race | raceblack | raceother | racewhite |
|:-----|:-----|:----------|:----------|:----------|
| Ophelia | other | 0 | 1 | 0 |
| Winnie | white | 0 | 0 | 1 |
| Belinda | black | 1 | 0 | 0 |

With these three dummy variables, we no longer need the original race variable. Let’s look at the table again, but only with the new dummy variables:

| name | raceblack | raceother | racewhite |
|:-----|:----------|:----------|:----------|
| Ophelia | 0 | 1 | 0 |
| Winnie | 0 | 0 | 1 |
| Belinda | 1 | 0 | 0 |

Using only these dummy variables, we can see very easily that Ophelia is other, because she is coded as 1 for raceother, 0 for raceblack, and 0 for racewhite. We don’t need the original race variable anymore to identify each person’s race. And these dummy variables are purely numeric, so the computer can now incorporate everyone’s race into the regression. We solved that problem.

But now have a look at the version of the table below, in which I have eliminated the racewhite variable:

| name | raceblack | raceother |
|:-----|:----------|:----------|
| Ophelia | 0 | 1 |
| Winnie | 0 | 0 |
| Belinda | 1 | 0 |

Let’s go through our three surveyed people based on this new table with just two variables:

• Ophelia: Coded as 1 for raceother and 0 for raceblack, so we know she is other.
• Winnie: Coded as 0 for raceother and 0 for raceblack, so we know that she is white, because she’s coded as 0 for the other two dummy variables.
• Belinda: Coded as 0 for raceother and 1 for raceblack, so we know she is black.

Even though we took out the racewhite variable, we were still able to figure out each person’s race, even Winnie’s! Including all three dummy variables for a three-category categorical variable is too much information! The computer doesn’t need it. The variable that we leave out is called the reference category or reference level. If our categorical variable has 3 categories, we need 2 dummy variables. If our categorical variable has 6 categories, we need 5 dummy variables. The rule is:

$$\text{number of dummy variables} = \text{number of categories} - 1$$
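If you’d like to see this rule in action, R’s `model.matrix()` function shows exactly which dummy variables get built from a factor. This is a standalone sketch using a small made-up factor, not the GSSvocab data:

```r
# A made-up three-category factor; "white", listed first, becomes the reference level
race <- factor(c("white", "black", "other", "white"),
               levels = c("white", "black", "other"))

# model.matrix() shows the dummy variables R creates for a regression:
# an intercept column plus (number of categories - 1) dummy columns
model.matrix(~ race)
```

You should see columns raceblack and raceother but no racewhite: the reference category is absorbed into the intercept.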

That was a pretty lengthy explanation. Hopefully this makes sense. Ask questions if not!

Let’s return finally to the regression we ran. The coefficient for raceblack is 0.0149 and the coefficient for raceother is -0.0196. Here’s how we interpret these results:

• Black respondents are predicted to have a vocabulary score 0.0149 higher than white respondents, controlling for all of the other independent variables.
• Other-race respondents are predicted to have a vocabulary score 0.0196 lower than white respondents, controlling for all of the other independent variables.

racewhite was left out as the reference category. The coefficient for raceother compares all people for whom raceother = 1 (like Ophelia) to those for whom raceother = 0 and raceblack = 0 (like Winnie). The coefficient for raceblack compares all people for whom raceblack = 1 (like Belinda) to all people for whom raceother = 0 and raceblack = 0 (like Winnie once again).

The regression only allows us to compare each category with the reference category. All non-white people can only be compared with those who are white. We could have left out a different variable as the reference category and included racewhite. In that case, all of our results would be compared to the selected reference category. But the overall results would be the same. Look:

```r
GSSvocab$race <- relevel(GSSvocab$race, ref = "black")
summary(VocabReg2 <- lm(vocab ~ age + gender + educ + race, data = GSSvocab))
##
## Call:
## lm(formula = vocab ~ age + gender + educ + race, data = GSSvocab)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9897 -1.1283  0.0782  1.2361  8.5150
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.8706439  0.0683596  12.736  < 2e-16 ***
## age          0.0143656  0.0006421  22.374  < 2e-16 ***
## gendermale  -0.1465174  0.0223500  -6.556 5.64e-11 ***
## educ         0.3459089  0.0037091  93.261  < 2e-16 ***
## racewhite   -0.0148854  0.0338229  -0.440    0.660
## raceother   -0.0345303  0.0423761  -0.815    0.415
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.831 on 27402 degrees of freedom
##   (1459 observations deleted due to missingness)
## Multiple R-squared:  0.2435, Adjusted R-squared:  0.2433
## F-statistic:  1764 on 5 and 27402 DF,  p-value: < 2.2e-16
```

I changed the reference category to black, and now you’ll see that it is left out of the regression and racewhite is included. The coefficient for racewhite is now -0.0149, exactly the same magnitude as the previous raceblack coefficient but with the opposite sign! Despite changing the reference category, we got the exact same result: white people are predicted to have a vocabulary score that is 0.0149 lower than that of black people (holding constant all of the other independent variables). So it ultimately didn’t matter too much which reference category we chose. The rest of the model is the same. All the other coefficients are the same. The Multiple R-squared is exactly the same.
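You can convince yourself of this sign-flip with a small simulated example (made-up data, not the GSSvocab variables):

```r
# Sketch: releveling a two-group factor flips the sign of its coefficient
set.seed(1)
d <- data.frame(g = factor(rep(c("a", "b"), each = 50)))
d$y <- rnorm(100) + ifelse(d$g == "b", 0.3, 0)

coef_b <- coef(lm(y ~ g, data = d))["gb"]   # "a" is the reference level
d$g <- relevel(d$g, ref = "b")
coef_a <- coef(lm(y ~ g, data = d))["ga"]   # now "b" is the reference level

coef_b + coef_a  # the two coefficients cancel: same magnitude, opposite sign
```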

Hopefully all of this made sense. I couldn’t find an explanation of more-than-two category categorical variables that I liked, so I wrote all of this. Please ask for clarifications as you see fit!

Keep in mind that I generated the above race data randomly just to illustrate how to use dummy variables in regressions, so this is obviously not a true finding. Another reminder of this is that the p-values for the race dummy variables in the results were not statistically significant, so we don’t have any evidence that the fake race assignments are associated with vocabulary score in the population at large.

### 17.3.3 Optional Resources

The following resources might help reinforce your understanding of dummy variables. It is not required for you to read/consume these:

Some of the resources above explain how to interpret a regression coefficient for a dummy variable when there are only two categories (such as male and female) that you need to capture in your regression.

## 17.4 Interactions in Regression

### 17.4.1 Definition and Overview

Previously, we learned about the concept of an interaction when doing ANOVA tests. Now, we will look at how to incorporate the very same concept into a linear regression model.

Please have a look at the following resources. If you find the concept of an interaction to be intuitive already, you can quickly skim through these resources rather than reading word-for-word.

### 17.4.2 Example

Now we will run through a very short example, which you can easily run in R on your own computer!

Please copy the code below into R and run it. You will need to modify this code as part of your assignment this week.

```r
if (!require(interactions)) install.packages('interactions')
library(interactions)

summary(fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris))
with(iris, table(Species))
interact_plot(fitiris, pred = Petal.Width, modx = Species)
```

If you want to see the same thing but with the individual data points displayed as well (which isn’t always a good idea), run this:

```r
interact_plot(fitiris, pred = Petal.Width, modx = Species, plot.points = TRUE)
```

Basically, when we interact a continuous variable with a categorical variable, we are asking the regression model to tell us if there is a different slope for the relationship between that continuous variable and the dependent variable for the different levels of the categorical variable. This is extremely useful.

Here’s how this corresponds to our borrowed example above:

| dependent variable | continuous independent variable | categorical independent variable |
|:-------------------|:--------------------------------|:---------------------------------|
| petal length | petal width | species |

In the regression, we interacted petal width and species. We got the following regression equation:

$$
\begin{aligned}
Petal.Length_{predicted} &= 0.55\,Petal.Width + 0.45\,Speciesversicolor \\
&\quad + 2.91\,Speciesvirginica + 1.32\,Petal.Width \times Speciesversicolor \\
&\quad + 0.10\,Petal.Width \times Speciesvirginica + 1.33
\end{aligned}
$$

The categorical variable Species has three levels:

```r
levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
```

setosa is the reference category. You’ll notice that it’s missing from the regression output, and that’s why. The computer created dummy variables for the other two species: Speciesversicolor and Speciesvirginica.

Let’s look now at the predicted relationship between the independent variable Petal.Width and the dependent variable Petal.Length. To do this, we’ll take out any terms[^202] that include Petal.Width on the right side of the equation:

$$0.55\,Petal.Width + 1.32\,Petal.Width \times Speciesversicolor + 0.10\,Petal.Width \times Speciesvirginica$$

Using this, we can figure out the predicted relationship between Petal.Length and Petal.Width for plants that fall into each of the three levels of Species:

• setosa – For all setosa plants, Speciesversicolor is coded as 0 and Speciesvirginica is also coded as 0. We plug in 0 for each of these in the expression above: $0.55Petal.Width + 1.32Petal.Width*0 + 0.10Petal.Width*0$, which is equal to $0.55Petal.Width$. 0.55 is the final coefficient. For setosa plants, a one-unit increase in Petal.Width is associated with a 0.55 unit increase in Petal.Length.

• versicolor – For all versicolor plants, Speciesversicolor is coded as 1 and Speciesvirginica is coded as 0. We plug these into the expression above: $0.55Petal.Width + 1.32Petal.Width*1 + 0.10Petal.Width*0$, which is equal to $(0.55+1.32)Petal.Width = 1.87Petal.Width$. 1.87 is the final coefficient. For versicolor plants, a one-unit increase in Petal.Width is associated with a 1.87 unit increase in Petal.Length.

• virginica – For all virginica plants, Speciesversicolor is coded as 0 and Speciesvirginica is coded as 1. We plug these into the expression above: $0.55Petal.Width + 1.32Petal.Width*0 + 0.10Petal.Width*1$, which is equal to $(0.55+0.10)Petal.Width = 0.65Petal.Width$. 0.65 is the final coefficient. For virginica plants, a one-unit increase in Petal.Width is associated with a 0.65 unit increase in Petal.Length.
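Since iris is built into R, you can verify these three slopes directly from the fitted model’s coefficients. Here is a short sketch (the coefficient names below are the ones R generates automatically for the interaction terms):

```r
# Recover the group-specific slopes from the interaction model's coefficients
fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
b <- coef(fitiris)

slope_setosa     <- b["Petal.Width"]
slope_versicolor <- b["Petal.Width"] + b["Petal.Width:Speciesversicolor"]
slope_virginica  <- b["Petal.Width"] + b["Petal.Width:Speciesvirginica"]

round(c(slope_setosa, slope_versicolor, slope_virginica), 2)
# roughly 0.55, 1.87, and 0.65, matching the walkthrough above
```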
There are more examples in the resources linked above, and you’ll also be practicing this in this week’s assignment.

## 17.5 Assignment

In this week’s assignment, you will revisit some of your work from last week as we add dummy variables and interaction terms into our linear regression models.

Like last week, load the GSSvocab dataset from the car package. Once again, run the exact same regression you ran last week, which used the variables age, gender, educ, and vocab.[^203]

### 17.5.1 Dummy Variables, Part 1

Right now in the dataset, gender is coded as a factor variable:

```r
class(GSSvocab$gender)
## [1] "factor"
```

factor is what R calls a categorical variable.

And how many levels does this categorical variable have?

```r
length(levels(GSSvocab$gender))
## [1] 2

levels(GSSvocab$gender)
## [1] "female" "male"
```

It has 2 levels, and those levels are female and male. So this particular categorical variable is also a dummy variable.

Task 1: Recode the gender variable. Make a new variable called female for which females are coded as 1 and males as 0.

Task 2: Create a two-way table to show that your recode was successful.

Task 3: Use the class() command (demonstrated above) to figure out what type of variable your new female variable is. It should be numeric.

Task 4: Re-run the same linear regression (with age, gender, educ, and vocab), but replace gender with the new female variable that you just made. Is the regression result the same as the one you got last week? It should be the same.

What just happened? Last week, R converted the gender variable to a dummy variable for you automatically. So you don’t actually need to do this recoding process every time you use a dummy variable. But it’s important for you to know that, behind the scenes, the computer is treating females as 1 and males as 0 (or sometimes vice versa, but it’s always using 0s and 1s).

Task 5: Now that you know more about dummy variables than you did last week, interpret the coefficient for the female variable in your regression output.

### 17.5.2 Dummy Variables, Part 2

Now consider this new research question, still using the GSSvocab dataset: Do native-born people have different vocabulary abilities than non-native-born people, controlling for age, gender, and education?

Task 6: What is the null hypothesis for this research question?

Task 7: What is the alternate hypothesis for this research question?

Task 8: Run a new regression to answer this new research question. Show the results of this regression.

Task 9: Write out the full regression equation based on this output.

Task 10: What is the predicted vocabulary score for someone with the following characteristics? Please show the entire calculation.

• age = 35
• gender = male
• education = 8
• nativeBorn = yes

Task 11: What is the predicted vocabulary score for someone with the same characteristics above, except that they are female? The difference should be equal to the coefficient of the dummy variable for gender! That’s the whole point! Please show the entire calculation.

Task 12: What is the answer to the research question? Make sure your answer includes an interpretation of the coefficient for the nativeBorn variable, as well as that coefficient’s standard error, t-value, and p-value.
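The claim in Task 11 above — that the predicted difference between the two groups equals the dummy variable’s coefficient — can be checked with a quick simulation. This sketch uses made-up data, not the GSSvocab variables, so it does not give away the assignment’s answer:

```r
# Sketch: the predicted group difference equals the dummy variable's coefficient
set.seed(2)
d <- data.frame(x = rnorm(100), female = rbinom(100, 1, 0.5))
d$y <- 2 + 0.5 * d$x + 1.5 * d$female + rnorm(100)
m <- lm(y ~ x + female, data = d)

# Two identical hypothetical people, except for the dummy variable
p <- predict(m, newdata = data.frame(x = c(1, 1), female = c(1, 0)))
unname(p[1] - p[2])  # equals coef(m)["female"]
```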

### 17.5.3 Interactions

Now we’ll turn to another research question: Is the relationship between education and vocabulary different for native-born and non-native-born people, when controlling for age and gender? In other words, is there an interaction between educ and nativeBorn, when controlling for age and gender?

This page is likely to help you complete the next few tasks. You should also refer to the code with the iris data that is earlier in this chapter.

Task 13: Modify your previous code and run a new regression that includes the interaction in this new research question. Show your regression table in your submission.

Task 14: Write out the full regression equation based on this output.

Task 15: Use the interact_plot() function to visualize the results.

Task 16: What is the answer to the new research question about the interaction? Make sure you look to see which coefficients are statistically significant and then interpret the results accordingly.

Task 17: Please e-mail me to schedule a time when you would like to take your Oral Exam #2. It should be sometime in the March 16–27, 2020 range.

[^194]: This due date was added on April 1, 2020.

[^195]: There are no exceptions to this requirement.

[^196]: As stated before, those of you who do not have data of your own that you would like to analyze can have a discussion with me and I can provide you with a research question and a dataset in which to study it.

[^197]: In reality, you will likely run many tests/models on your own to arrive at the one that fits your RQ and data the best. But you do not need to show all of this work in your final submission. If you do wish to show all of this additional work, you can include it in an appendix to your assignment, but this is not required.

[^198]: Meaning that it is programmed to behave as if it knows.

[^199]: In this variable gendermale, all males in the data would be coded as 1 and all females would be coded as 0.

[^200]: In this variable genderfemale, all females in the data would be coded as 1 and all males would be coded as 0, which is the exact opposite of how we would code the gendermale variable.

[^202]: A term is anything in between the plus signs. In the equation $a = 2b + rudolph + 43$, $2b$, $rudolph$, and $43$ are all terms on the right side of the equation.

[^203]: Just copy and paste your code from last week. Don’t type it again!