Chapter 7 Dummy Variables: Smarter than You Think

In this chapter we will learn how R handles dummy variables.

We will need the following libraries.

library(tidyverse)
library(magrittr)
library(broom)
library(estimatr)
library(forcats)

7.1 Dummy Variables in R

R uses factor vectors to represent dummy or categorical data. Factors can be ordered or unordered. Factor vectors are built on top of integer vectors and include a unique label for each integer.

7.1.1 Factors

R uses factors to handle categorical variables. Categorical variables have a fixed and known set of possible values. The forcats package, part of the tidyverse, offers a suite of tools that solve common problems with factors. See the forcats vignette to learn more about using factors in R.
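
A minimal sketch of that integer-plus-labels structure, using base R’s factor():

```r
# Factors store integer codes plus a character label (level) for each code
sex <- factor(c("male", "female", "female", "male"))

levels(sex)      # "female" "male" -- levels sort alphabetically by default
typeof(sex)      # "integer" -- the underlying storage type
as.integer(sex)  # 2 1 1 2 -- the integer codes behind the labels
```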

7.1.2 Character Vectors as Dummies

Character vectors are one of the six atomic vector types in R. Atomic means that the vector contains data of only a single type; in this case, every observation is a character string. Categorical or dummy variables, though they are typically coded as numeric, often arrive as character vectors. For example, a dummy variable for sex may contain the values male and female but be coded as 0 and 1 and named female. If you use a character vector as an argument in lm, R will treat the vector as a set of dummy variables. The number of dummy variables will be the number of characteristics (unique observations) minus 1.

The student admissions at UC Berkeley data set has aggregate data on graduate school applicants for the six largest departments; see ?UCBAdmissions for more information. There are four variables in the data set: Admit (whether the candidate was admitted or rejected), Gender (the gender of the candidate: Male or Female), Dept (the department to which the candidate applied, coded A, B, C, D, E, F), and n (the number of applicants). n is a numeric vector; Admit, Gender, and Dept are character vectors. Since the data are stored as a table, read them into R as a data frame by calling as_tibble from the tibble package with the argument UCBAdmissions.

ucb_admissions <- UCBAdmissions %>% 
  as_tibble() %>%  
  janitor::clean_names() %>% 
  glimpse() 
Rows: 24
Columns: 4
$ admit  <chr> "Admitted", "Rejected", "Admitted", "Rejected", "Admitted", ...
$ gender <chr> "Male", "Male", "Female", "Female", "Male", "Male", "Female"...
$ dept   <chr> "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", ...
$ n      <dbl> 512, 313, 89, 19, 353, 207, 17, 8, 120, 205, 202, 391, 138, ...

Suppose we wish to estimate the difference in means model \(n_i = \beta_0+\beta_1Admit_i+\epsilon_i\). If we use Admit as an argument in lm, R will correctly treat Admit as a single dummy variable with two categories.

ucb_admissions %>%
  lm(n ~ admit, .)

Call:
lm(formula = n ~ admit, data = .)

Coefficients:
  (Intercept)  admitRejected  
        146.2           84.7  

R has coded Rejected as 1 and Admitted as 0. The regression indicates that the mean number admitted is 146.25 (the intercept), while the mean number rejected is 146.25 + 84.67 = 230.92. We can confirm that directly as well.

# Using dplyr verbs
ucb_admissions %>% 
  group_by(admit) %>% 
  summarize(Average = mean(n))
# A tibble: 2 x 2
  admit    Average
  <chr>      <dbl>
1 Admitted    146.
2 Rejected    231.

Similarly, if we want to calculate the mean number of applicants by department, R will treat Dept as 5 dummy variables, with Department A as the base case.

ucb_admissions %>% 
  lm(n ~ dept, .) 

Call:
lm(formula = n ~ dept, data = .)

Coefficients:
(Intercept)        deptB        deptC        deptD        deptE        deptF  
     233.25       -87.00        -3.75       -35.25       -87.25       -54.75  

The mean number of applicants in Department A is 233.25. To find the mean number of applicants for each department add the appropriate coefficient to 233.25.

We can confirm these results as we did above.
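
Grouping by department (using the ucb_admissions tibble created above) reproduces the intercept, 233.25, as Department A’s mean, and each coefficient as a departure from it:

```r
# Mean number of applicants by department
ucb_admissions %>% 
  group_by(dept) %>% 
  summarize(Average = mean(n))
```

Department B’s mean, for instance, is 233.25 − 87.00 = 146.25.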

7.2 Difference in Means Test

Using the UCB Admissions data, let’s conduct a difference of means test for number of applications by Gender. We will test the following hypothesis: \[H_0: \mu_{Male}=\mu_{Female}\\ H_1: \mu_{Male}\ne\mu_{Female}\] at the \(\alpha=.05\) level of significance. We can use t.test in two different ways, lm, or lm_robust. First, we will test the hypothesis with t.test assuming, in turn, equal and unequal variances.

7.2.1 Using t.test

# Assume equal variances
# Use t.test for class 'formula'
ucb_admissions %>%
  t.test(n ~ gender, ., var.equal = TRUE)

    Two Sample t-test

data:  n by gender
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -188.4   45.7
sample estimates:
mean in group Female   mean in group Male 
                 153                  224 
# Assume unequal variances
# Use t.test for class 'formula'
ucb_admissions %>%
  t.test(n ~ gender, .)

    Welch Two Sample t-test

data:  n by gender
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -188.4   45.8
sample estimates:
mean in group Female   mean in group Male 
                 153                  224 

7.2.2 Using lm and lm_robust

# Assume equal variances
ucb_admissions %>%
  lm(n ~ gender, .) %>% 
  tidy()
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    153.       39.9      3.83 0.000911
2 genderMale      71.3      56.5      1.26 0.220   
# Assume unequal variances
ucb_admissions %>%
  lm_robust(n ~ gender, .) %>% 
  tidy()
         term estimate std.error statistic  p.value conf.low conf.high df
1 (Intercept)    152.9      38.7      3.95 0.000679     72.7       233 22
2  genderMale     71.3      56.5      1.26 0.219606    -45.7       188 22
  outcome
1       n
2       n

7.3 Integer and Numerical Vectors as Dummy Variables

lm treated the character vectors as factors. For most of what we will do, that is enough. If the categorical (dummy) variable is coded as a numeric vector or integer vector, we may have to coerce the variable to a factor for lm to interpret it correctly. If the variable is coded as 0 and 1, we can use it as is. For example, consider the mtcars data.

mtcars %>% 
  glimpse()

The type of transmission, am, takes on two values: 1 if the transmission is manual and 0 if it is automatic. Suppose we’d like to know if the mpg is different for the two types of transmission. We can test the hypothesis \[H_0:\mu_a=\mu_m\] \[H_1:\mu_a\ne\mu_m\] at the \(\alpha=.05\) level of significance.

mtcars %>%
  lm_robust(mpg ~ am, .) %>% 
  tidy()
         term estimate std.error statistic                p.value conf.low
1 (Intercept)    17.15      0.88     19.50 0.00000000000000000138    15.35
2          am     7.24      1.92      3.77 0.00072109506857981581     3.32
  conf.high df outcome
1      18.9 30     mpg
2      11.2 30     mpg

If, however, the categorical variable is not coded as 0 and 1, we will have to coerce it to a factor. The forcats package simplifies this process. Suppose we’d like to know if the average mpg is different for 4, 6, and 8 cylinder cars: \[H_0:\mu_4=\mu_6=\mu_8\] \[H_1:\text{at least one }\mu\text{ is not equal}\] If we estimate a model of mpg on cyl, the coefficient on cyl will give us the marginal effect on mpg of adding a cylinder. A significant coefficient in this model will not answer our question. To do that, we must coerce cyl into a categorical variable with as.factor.

mtcars %>%
  lm(mpg ~ as.factor(cyl), .) %>% 
  summary()

Call:
lm(formula = mpg ~ as.factor(cyl), data = .)

Residuals:
   Min     1Q Median     3Q    Max 
-5.264 -1.836  0.029  1.389  7.236 

Coefficients:
                Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       26.664      0.972   27.44 < 0.0000000000000002 ***
as.factor(cyl)6   -6.921      1.558   -4.44              0.00012 ***
as.factor(cyl)8  -11.564      1.299   -8.90        0.00000000086 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.22 on 29 degrees of freedom
Multiple R-squared:  0.732, Adjusted R-squared:  0.714 
F-statistic: 39.7 on 2 and 29 DF,  p-value: 0.00000000498

The F-stat for overall significance of the model is significant at the \(\alpha = .05\) level of significance so we reject the null hypothesis in favor of the alternative and conclude that at least one average mpg is different.

The base case is cars with 4 cylinders with an average mpg of 26.7 mpg. 6 cylinder cars average a statistically significant 6.9 mpg less than 4 cylinder cars. 8 cylinder cars average a statistically significant 11.6 mpg less than 4 cylinder cars. These averages are statistically significantly different.
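
We can confirm these group means directly, as before:

```r
# Mean mpg by number of cylinders
mtcars %>% 
  group_by(cyl) %>% 
  summarize(Average = mean(mpg))
```

The 4 cylinder mean is 26.66, matching the intercept; 26.66 − 6.92 ≈ 19.74 and 26.66 − 11.56 ≈ 15.10 match the 6 and 8 cylinder means.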

Had we estimated the model without coercing cylinders into a factor our results would have been

mtcars %>% 
  lm(mpg ~ cyl, .) %>% 
  tidy()
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    37.9      2.07      18.3  8.37e-18
2 cyl            -2.88     0.322     -8.92 6.11e-10

\(\hat\beta_1=-2.88\) tells us that, for each additional cylinder, fuel mileage falls by 2.88 mpg on average.
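
To see why the two specifications answer different questions, compare their fitted values at 4, 6, and 8 cylinders with predict(). The linear model forces equal drops between 4 and 6 and between 6 and 8 cylinders, while the factor model lets each group have its own mean:

```r
# One slope: the predicted drop from 4 to 6 cylinders must equal
# the drop from 6 to 8 (2 * 2.88 = 5.75 mpg each)
linear <- lm(mpg ~ cyl, mtcars)
predict(linear, newdata = data.frame(cyl = c(4, 6, 8)))

# Separate dummies: each cylinder count gets its own group mean
# (drops of 6.92 and then 4.64 mpg, as estimated above)
by_group <- lm(mpg ~ as.factor(cyl), mtcars)
predict(by_group, newdata = data.frame(cyl = c(4, 6, 8)))
```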

7.4 Manipulating Factors

The forcats package provides a set of tools for simple manipulation of factors: renaming levels, re-ordering levels, combining levels, etc. Using the mtcars data, let’s coerce the number of cylinders to a factor and look at ways to manipulate it that aid understanding. The compound pipe operator %<>% from magrittr updates a value by first piping it into one or more expressions and then assigning the result back.

### Coerce cyl to a factor
mtcars$cyl %<>% 
  as.character() %>% # forcats will not coerce integer or numeric vectors to factors
  as_factor()
mtcars$cyl %>% str()
 Factor w/ 3 levels "6","4","8": 1 1 2 1 3 1 3 2 2 1 ...

cyl is now a factor with 3 levels: 6, 4, 8 (as_factor orders levels by first appearance). Suppose we estimate the model \(mpg = \beta_0 + \beta_1cyl+\epsilon\).

mtcars %>% 
  lm(mpg ~ cyl, .) %>% 
  tidy()
# A tibble: 3 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    19.7       1.22     16.2  4.49e-16
2 cyl4            6.92      1.56      4.44 1.19e- 4
3 cyl8           -4.64      1.49     -3.11 4.15e- 3

This model indicates that cars with 6 cylinder engines average 19.74 mpg, cars with 4 cylinders average 6.92 mpg more than cars with 6 cylinders, and cars with 8 cylinders average 4.64 mpg less than cars with 6 cylinders. Suppose, instead, you’d prefer 4 cylinder cars to be the base case. We can reorder the factor with fct_relevel from the forcats package, which changes the order of a factor’s levels by hand.

For some factors the order doesn’t or won’t matter; for others there is a “natural” ordering suggested by the data; for still others you may have an ordering that you prefer. fct_relevel() from the forcats package handles that task. If we call fct_relevel within lm, the re-leveling is ad hoc and applies only to that model.

mtcars %>%
  lm(mpg ~ fct_relevel(cyl, levels = c("4", "6", "8")), .) %>% 
  tidy()
# A tibble: 3 x 5
  term                                     estimate std.error statistic  p.value
  <chr>                                       <dbl>     <dbl>     <dbl>    <dbl>
1 "(Intercept)"                               26.7      0.972     27.4  2.69e-22
2 "fct_relevel(cyl, levels = c(\"4\", \"6~    -6.92     1.56      -4.44 1.19e- 4
3 "fct_relevel(cyl, levels = c(\"4\", \"6~   -11.6      1.30      -8.90 8.57e-10

We can re-level cylinders in the data itself (to make the change permanent, assign the result back to mtcars):

# re-level the factor
mtcars %>% 
  mutate(cyl = fct_relevel(cyl, "6", after = 1)) %>% 
  lm(mpg ~ cyl, .) %>% 
  tidy()
# A tibble: 3 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    26.7      0.972     27.4  2.69e-22
2 cyl6           -6.92     1.56      -4.44 1.19e- 4
3 cyl8          -11.6      1.30      -8.90 8.57e-10

See Reorder factor levels by hand for more ways to re-level factors.
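
A few of those variations, sketched on a toy factor (behavior per the forcats documentation of fct_relevel and its after argument):

```r
library(forcats)

f <- factor(c("a", "b", "c"))

levels(fct_relevel(f, "c"))               # "c" "a" "b" -- move "c" to the front
levels(fct_relevel(f, "c", after = 1))    # "a" "c" "b" -- move "c" after position 1
levels(fct_relevel(f, "a", after = Inf))  # "b" "c" "a" -- move "a" to the end
```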

The transmission variable (am) is a numeric vector coded as 0 and 1. Suppose we’d like to coerce it to a factor coded with the levels named “automatic” and “manual” rather than 0 and 1.

mtcars %>% 
  mutate(am = factor(am, levels = c(0,1), labels = c("automatic", "manual"))) %>% 
  DT::datatable()

If we re-estimate the model \(mpg = \beta_0+\beta_1am\) we see the results are the same, but the variable is labeled more clearly.

mtcars %>% 
  mutate(am = factor(am, levels = c(0,1), labels = c("automatic", "manual"))) %>% 
  lm_robust(mpg ~ am, .) %>% 
  tidy()
         term estimate std.error statistic                p.value conf.low
1 (Intercept)    17.15      0.88     19.50 0.00000000000000000138    15.35
2    ammanual     7.24      1.92      3.77 0.00072109506857981581     3.32
  conf.high df outcome
1      18.9 30     mpg
2      11.2 30     mpg

7.5 Dummy Interaction Variables

Dummy interactions \(x_iD_i\) can be created as an argument in lm. Let’s estimate the model \(mpg= \beta_0+\beta_1am+\beta_2hp+\beta_3hp*am+\epsilon\).

mtcars %>% 
  mutate(am = factor(am, levels = c(0,1), 
                     labels = c("automatic", "manual"))) %>% 
  lm_robust(mpg ~ hp*am, .) %>% 
  tidy()
         term  estimate std.error statistic                p.value conf.low
1 (Intercept) 26.624848   1.36368   19.5243 0.00000000000000000762  23.8315
2          hp -0.059137   0.00905   -6.5330 0.00000044155365639502  -0.0777
3    ammanual  5.217653   2.44094    2.1376 0.04142393989920568204   0.2176
4 hp:ammanual  0.000403   0.01556    0.0259 0.97953020888772734942  -0.0315
  conf.high df outcome
1   29.4182 28     mpg
2   -0.0406 28     mpg
3   10.2177 28     mpg
4    0.0323 28     mpg

Notice that R assumed that you wanted to estimate \(\hat\beta_1\), \(\hat\beta_2\), and \(\hat\beta_3\). By including hp*am as an argument in lm, R estimated coefficients for the continuous variable, the dummy variable, and the interaction. If, on the other hand, you want just the interaction term, i.e., \(mpg=\alpha_0+\alpha_1hp*am+\eta\), use the “AsIs” function I() as follows:

mtcars %>% 
  lm_robust(mpg ~ I(hp*am), .) %>% 
  tidy()
         term estimate std.error statistic              p.value conf.low
1 (Intercept)  19.5696    1.1844    16.523 0.000000000000000131  17.1507
2  I(hp * am)   0.0101    0.0171     0.591 0.559145377816964606  -0.0248
  conf.high df outcome
1   21.9885 30     mpg
2    0.0451 30     mpg

I() is used to inhibit the interpretation of operators such as *, +, and ^ in formulas, so they are treated as arithmetic operators rather than formula operators.
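
To see the difference between the two formulas, compare the design matrices R builds for them with base R’s model.matrix():

```r
# hp * am expands to main effects plus the interaction
colnames(model.matrix(mpg ~ hp * am, data = mtcars))
# "(Intercept)" "hp" "am" "hp:am"

# I(hp * am) is a single regressor: the arithmetic product of hp and am
colnames(model.matrix(mpg ~ I(hp * am), data = mtcars))
# "(Intercept)" "I(hp * am)"
```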