Chapter 7 Dummy Variables: Smarter than You Think
In this chapter we will learn how R handles dummy variables.
We will need the following libraries.
library(tidyverse)
library(magrittr)
library(broom)
library(estimatr)
library(forcats)
7.1 Dummy Variables in R
R uses factor vectors to represent dummy or categorical data. Factors can be ordered or unordered. Factor vectors are built on top of integer vectors and include a unique label for each integer.
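A quick sketch illustrates this structure (the sex variable here is hypothetical):
x <- factor(c("male", "female", "female"))
typeof(x)      # "integer" -- a factor is stored as an integer vector
as.integer(x)  # 2 1 1 -- the underlying integer codes
levels(x)      # "female" "male" -- the label attached to each integer (sorted alphabetically by default)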
7.1.1 Factors
R uses factors to handle categorical variables. Categorical variables have a fixed and known set of possible values. The forcats package, part of the tidyverse, offers a suite of tools that solve common problems with factors. See the forcats vignette to learn more about using factors in R.
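For example, here is a minimal sketch (with hypothetical data) of why the fixed, known set of levels matters: values outside the declared set become NA rather than silently creating a new category.
# Hypothetical data: quarters of the year
x <- c("Q1", "Q3", "Q2", "Q5")                 # "Q5" is not a valid quarter
factor(x, levels = c("Q1", "Q2", "Q3", "Q4"))  # "Q5" becomes NA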
7.1.2 Character Vectors as Dummies
Character vectors are one of the six atomic vector types in R. Atomic means that the vector contains only data of a single type; in this case all of the observations are characters. Categorical data or dummy variables, though they are typically coded as numeric, are at heart character data. For example, a dummy variable for sex may contain male and female, but be coded as 0 and 1 and named female. If you use a character vector as an argument in lm, R will treat the vector as a set of dummy variables. The number of dummy variables will be the number of categories (unique values) minus 1.
The student admissions at UC Berkeley data set has aggregate data on graduate school applicants for the six largest departments; see ?UCBAdmissions for more information. There are four variables in the data set: Admit (whether the candidate was admitted or rejected), Gender (the gender of the candidate: Male or Female), Dept (the department to which the candidate applied, coded A, B, C, D, E, F), and n (the number of applicants). n is a numeric vector; Admit, Gender, and Dept are character vectors. Since the data are stored as a table, to read them into R as a data frame call as_tibble from the dplyr package with the argument UCBAdmissions.
ucb_admissions <- UCBAdmissions %>%
  as_tibble() %>%
  janitor::clean_names() %>%
  glimpse()
Rows: 24
Columns: 4
$ admit <chr> "Admitted", "Rejected", "Admitted", "Rejected", "Admitted", ...
$ gender <chr> "Male", "Male", "Female", "Female", "Male", "Male", "Female"...
$ dept <chr> "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", ...
$ n <dbl> 512, 313, 89, 19, 353, 207, 17, 8, 120, 205, 202, 391, 138, ...
Suppose we wish to estimate the difference in means model \(n_i = \beta_0+\beta_1Admit_i+\epsilon_i\). If we use Admit as an argument in lm, R will correctly treat Admit as a single dummy variable with two categories.
ucb_admissions %>%
  lm(n ~ admit, .)
Call:
lm(formula = n ~ admit, data = .)
Coefficients:
(Intercept) admitRejected
146.2 84.7
R has coded Rejected as 1 and Admitted as 0. The regression indicates that the mean number admitted is 146.25 (the intercept), while the mean number rejected is 146.25 + 84.67 = 230.92 (the intercept plus the coefficient on admitRejected). We can confirm that directly as well.
# Using dplyr verbs
ucb_admissions %>%
  group_by(admit) %>%
  summarize(Average = mean(n))
# A tibble: 2 x 2
admit Average
<chr> <dbl>
1 Admitted 146.
2 Rejected 231.
Similarly, if we want to calculate the mean number of applicants by department, R will treat dept as 5 dummy variables (six departments minus the base case).
ucb_admissions %>%
  lm(n ~ dept, .)
Call:
lm(formula = n ~ dept, data = .)
Coefficients:
(Intercept) deptB deptC deptD deptE deptF
233.25 -87.00 -3.75 -35.25 -87.25 -54.75
The mean number of applicants in Department A is 233.25. To find the mean number of applicants for each department, add the appropriate coefficient to 233.25. For example, Department B averages 233.25 - 87.00 = 146.25 applicants.
We can confirm these results as we did above.
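For instance, using the same dplyr verbs as above (a sketch):
# Mean number of applicants by department
ucb_admissions %>%
  group_by(dept) %>%
  summarize(Average = mean(n))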
7.2 Difference in Means Test
Using the UCB Admissions data, let's conduct a difference of means test for the number of applications by Gender. We will test the following hypothesis: \[H_0: \mu_{Male}=\mu_{Female}\\ H_1: \mu_{Male}\ne\mu_{Female}\] at the \(\alpha=.05\) level of significance. We can use t.test in two different ways, lm, or lm_robust. First, we will test the hypothesis with t.test, assuming, in turn, equal and unequal variances.
7.2.1 Using t.test
# Assume equal variances
# Use t.test for class 'formula'
ucb_admissions %>%
  t.test(n ~ gender, ., var.equal = TRUE)
Two Sample t-test
data: n by gender
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-188.4 45.7
sample estimates:
mean in group Female mean in group Male
153 224
# Assume unequal variances
# Use t.test for class 'formula'
ucb_admissions %>%
  t.test(n ~ gender, .)
Welch Two Sample t-test
data: n by gender
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-188.4 45.8
sample estimates:
mean in group Female mean in group Male
153 224
7.2.2 Using lm and lm_robust
# Assume equal variances
ucb_admissions %>%
  lm(n ~ gender, .) %>%
  tidy()
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 153. 39.9 3.83 0.000911
2 genderMale 71.3 56.5 1.26 0.220
# Assume unequal variances
ucb_admissions %>%
  lm_robust(n ~ gender, .) %>%
  tidy()
term estimate std.error statistic p.value conf.low conf.high df
1 (Intercept) 152.9 38.7 3.95 0.000679 72.7 233 22
2 genderMale 71.3 56.5 1.26 0.219606 -45.7 188 22
outcome
1 n
2 n
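All three approaches agree: the difference in means is 71.3, and the equal-variance t.test produces the same p-value as lm (the t statistics match up to sign, since t.test reports Female minus Male). If you would like the t.test results in the same tidy format, broom's tidy() also works on the objects t.test returns (a sketch):
ucb_admissions %>%
  t.test(n ~ gender, ., var.equal = TRUE) %>%
  tidy()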
7.3 Integer and Numerical Vectors as Dummy Variables
lm treated the character vectors as factors. For most of what we will do, that is enough. If the categorical (dummy) variable is coded as a numeric or integer vector, we may have to coerce the variable to a factor for lm to interpret it correctly. If the variable is coded as 0 and 1, we can use it as it is. For example, consider the mtcars data.
mtcars %>%
  glimpse()
The type of transmission, am, takes on two values: 1 if the transmission is manual and 0 if it is automatic. Suppose we'd like to know if the mpg is different for the two types of transmissions. We can test the hypothesis \[H_0:\mu_a=\mu_m\] \[H_1:\mu_a\ne\mu_m\] at the \(\alpha=.05\) level of significance.
mtcars %>%
  lm_robust(mpg ~ am, .) %>%
  tidy()
term estimate std.error statistic p.value conf.low
1 (Intercept) 17.15 0.88 19.50 0.00000000000000000138 15.35
2 am 7.24 1.92 3.77 0.00072109506857981581 3.32
conf.high df outcome
1 18.9 30 mpg
2 11.2 30 mpg
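The p-value on am is well below \(\alpha=.05\), so we reject the null hypothesis and conclude that manual cars average about 7.24 mpg more than automatics. As before, we can confirm the group means directly (a sketch):
# Mean mpg by transmission type (0 = automatic, 1 = manual)
mtcars %>%
  group_by(am) %>%
  summarize(avg_mpg = mean(mpg))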
If, however, the categorical variable is not coded as 0 and 1, we will have to coerce it to a factor. The forcats package simplifies this process. Suppose we'd like to know if the average mpg is different for 4, 6, and 8 cylinder cars: \[H_0:\mu_4=\mu_6=\mu_8\] \[H_1:\text{at least one }\mu\text{ is not equal}\] If we estimate a model of mpg on cyl, the coefficient on cyl will give us the marginal effect on mpg of adding a cylinder. A significant coefficient in this model will not answer our question. To do that, we must coerce cyl into a categorical variable with as.factor.
mtcars %>%
  lm(mpg ~ as.factor(cyl), .) %>%
  summary()
Call:
lm(formula = mpg ~ as.factor(cyl), data = .)
Residuals:
Min 1Q Median 3Q Max
-5.264 -1.836 0.029 1.389 7.236
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.664 0.972 27.44 < 0.0000000000000002 ***
as.factor(cyl)6 -6.921 1.558 -4.44 0.00012 ***
as.factor(cyl)8 -11.564 1.299 -8.90 0.00000000086 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.22 on 29 degrees of freedom
Multiple R-squared: 0.732, Adjusted R-squared: 0.714
F-statistic: 39.7 on 2 and 29 DF, p-value: 0.00000000498
The F-stat for overall significance of the model is significant at the \(\alpha = .05\) level of significance, so we reject the null hypothesis in favor of the alternative and conclude that at least one average mpg is different.
The base case is cars with 4 cylinders, with an average of 26.7 mpg. 6 cylinder cars average a statistically significant 6.9 mpg less than 4 cylinder cars; 8 cylinder cars average a statistically significant 11.6 mpg less than 4 cylinder cars.
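As with the UCB data, the coefficients are just differences in group means, which we can verify directly (a sketch):
# Mean mpg by number of cylinders
mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))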
Had we estimated the model without coercing cylinders into a factor, our results would have been
mtcars %>%
  lm(mpg ~ cyl, .) %>%
  tidy()
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 37.9 2.07 18.3 8.37e-18
2 cyl -2.88 0.322 -8.92 6.11e-10
\(\hat\beta_1=-2.88\) tells us that for each additional cylinder, fuel mileage falls by 2.88 mpg on average. For example, the predicted mpg for a 4 cylinder car is \(37.9 - 2.88\times4\approx26.4\), close to but not equal to the group mean of 26.7, because the linear specification forces the effect of each additional cylinder to be the same.
7.4 Manipulating Factors
The forcats package provides a set of tools for the simple manipulation of factors, like renaming factor levels, re-ordering levels, combining levels, etc. Using the mtcars data, let's coerce the number of cylinders to a factor and look at ways to manipulate it to aid understanding. The compound pipe operator %<>% from magrittr is used to update a value in place by first piping it into one or more expressions and then assigning the result.
### Coerce cyl to a factor
mtcars$cyl %<>%
  as.character() %>%  # forcats will not coerce integer or numeric vectors to factors
  as_factor()
mtcars$cyl %>% str()
Factor w/ 3 levels "6","4","8": 1 1 2 1 3 1 3 2 2 1 ...
cyl is now a factor with 3 levels: 6, 4, 8 (in order of first appearance in the data). Suppose we estimate the model \(mpg = \beta_0 + \beta_1cyl+\epsilon\).
mtcars %>%
  lm(mpg ~ cyl, .) %>%
  tidy()
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 19.7 1.22 16.2 4.49e-16
2 cyl4 6.92 1.56 4.44 1.19e- 4
3 cyl8 -4.64 1.49 -3.11 4.15e- 3
This model indicates that cars with 6 cylinder engines average 19.74 mpg, cars with 4 cylinders average 6.92 mpg more than cars with 6 cylinders, and cars with 8 cylinders average 4.64 mpg less than cars with 6 cylinders. Suppose, instead, you'd prefer 4 cylinder cars to be the base case. We can reorder the factor with fct_relevel from the forcats package. fct_relevel changes the order of a factor's levels by hand.
For some factors the order doesn't or won't matter; for others there is a "natural" ordering suggested by the data; for still others you may have an ordering that you prefer. fct_relevel() from the forcats package handles that task. If we call fct_relevel within lm, the re-leveling is ad hoc: it applies only to that call and leaves the underlying data unchanged.
mtcars %>%
  lm(mpg ~ fct_relevel(cyl, levels = c("4", "6", "8")), .) %>%
  tidy()
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 "(Intercept)" 26.7 0.972 27.4 2.69e-22
2 "fct_relevel(cyl, levels = c(\"4\", \"6~ -6.92 1.56 -4.44 1.19e- 4
3 "fct_relevel(cyl, levels = c(\"4\", \"6~ -11.6 1.30 -8.90 8.57e-10
We can also re-level cyl in the data itself with mutate; assigning the result back to mtcars would make the change permanent.
# re-level the factor
mtcars %>%
  mutate(cyl = fct_relevel(cyl, "6", after = 1)) %>%
  lm(mpg ~ cyl, .) %>%
  tidy()
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 26.7 0.972 27.4 2.69e-22
2 cyl6 -6.92 1.56 -4.44 1.19e- 4
3 cyl8 -11.6 1.30 -8.90 8.57e-10
See Reorder factor levels by hand for more ways to re-level factors.
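For instance, a sketch of one more approach: giving fct_relevel a single level moves that level to the front, a more compact way to make 4 cylinder cars the base case.
mtcars %>%
  mutate(cyl = fct_relevel(cyl, "4")) %>%  # "4" becomes the first level
  lm(mpg ~ cyl, .) %>%
  tidy()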
The transmission variable (am) is a numeric vector coded as 0 and 1. Suppose we'd like to coerce it to a factor with the levels named "automatic" and "manual" rather than 0 and 1.
mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))) %>%
  DT::datatable()
If we re-estimate the model \(mpg = \beta_0+\beta_1am+\epsilon\), we see the results are the same, but the variable is labeled more clearly.
mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))) %>%
  lm_robust(mpg ~ am, .) %>%
  tidy()
term estimate std.error statistic p.value conf.low
1 (Intercept) 17.15 0.88 19.50 0.00000000000000000138 15.35
2 ammanual 7.24 1.92 3.77 0.00072109506857981581 3.32
conf.high df outcome
1 18.9 30 mpg
2 11.2 30 mpg
7.5 Dummy Interaction Variables
Dummy interactions \(x_iD_i\) can be created in lm as an argument. Let's estimate the model \(mpg= \beta_0+\beta_1am+\beta_2hp+\beta_3hp\times am+\epsilon\).
mtcars %>%
  mutate(am = factor(am, levels = c(0, 1),
                     labels = c("automatic", "manual"))) %>%
  lm_robust(mpg ~ hp * am, .) %>%
  tidy()
term estimate std.error statistic p.value conf.low
1 (Intercept) 26.624848 1.36368 19.5243 0.00000000000000000762 23.8315
2 hp -0.059137 0.00905 -6.5330 0.00000044155365639502 -0.0777
3 ammanual 5.217653 2.44094 2.1376 0.04142393989920568204 0.2176
4 hp:ammanual 0.000403 0.01556 0.0259 0.97953020888772734942 -0.0315
conf.high df outcome
1 29.4182 28 mpg
2 -0.0406 28 mpg
3 10.2177 28 mpg
4 0.0323 28 mpg
Notice that R assumed that you wanted to estimate \(\hat\beta_1\), \(\hat\beta_2\), and \(\hat\beta_3\). By including hp*am as an argument in lm, R estimated the coefficients for the continuous variable, the dummy variable, and their interaction. If, on the other hand, you want just the interaction term, i.e., \(mpg=\alpha_0+\alpha_1hp\times am+\eta\), use the "AsIs" function I() as follows:
mtcars %>%
  lm_robust(mpg ~ I(hp * am), .) %>%
  tidy()
term estimate std.error statistic p.value conf.low
1 (Intercept) 19.5696 1.1844 16.523 0.000000000000000131 17.1507
2 I(hp * am) 0.0101 0.0171 0.591 0.559145377816964606 -0.0248
conf.high df outcome
1 21.9885 30 mpg
2 0.0451 30 mpg
I() is used to inhibit the interpretation of operators like *, +, and ^ in formulas, so that they are used as ordinary arithmetic operators. Note that this works here because am is still the original 0/1 numeric vector, so hp*am is simple element-wise multiplication.
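If you ever want to see exactly which dummy and interaction columns R builds from a formula, one way (a sketch) is to inspect the design matrix with model.matrix:
# Inspect the design matrix R builds for the full interaction model
mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))) %>%
  model.matrix(mpg ~ hp * am, data = .) %>%
  head()  # columns: (Intercept), hp, ammanual, hp:ammanual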