Chapter 8 Dummy Variables: Smarter than You Think

In this chapter we will learn how R handles dummy variables.

We will need the following libraries.

8.1 Dummy Variables in R

R uses factor vectors to represent dummy or categorical data. Factors can be ordered or unordered. Factor vectors are built on top of integer vectors and include a unique label for each integer.
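A minimal base-R sketch of that structure follows. The vectors `dept` and `rating` are made-up illustrations, not data used elsewhere in this chapter.

```r
# An unordered factor; by default the levels are the sorted unique values.
dept <- factor(c("B", "A", "C", "A"))
levels(dept)       # the labels: "A" "B" "C"
as.integer(dept)   # the underlying integer codes: 2 1 3 1
typeof(dept)       # "integer"

# An ordered factor with an explicitly chosen level order.
rating <- factor(c("low", "high", "low"),
                 levels = c("low", "high"), ordered = TRUE)
rating < "high"    # TRUE FALSE TRUE
```

Only ordered factors support comparisons such as `<`; for unordered factors the integer codes are just bookkeeping.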

8.1.1 Factors

R uses factors to handle categorical variables. Categorical variables have a fixed and known set of possible values. The forcats package, part of the tidyverse, offers a suite of tools that solve common problems with factors. See the forcats vignette to learn more about using factors in R.

8.1.2 Character Vectors as Dummies

Character vectors are one of the six atomic vector types in R. Atomic means that the vector contains only data of a single type; in this case, all of the observations are characters. Categorical data, or dummy variables, are conceptually character data even though they are typically coded as numbers. For example, a dummy variable for sex may contain male and female, but be coded as 0 and 1 and named female. If you use a character vector as an argument in lm, R will treat the vector as a set of dummy variables. The number of dummy variables will be the number of categories (unique values) minus 1.

The student admissions at UC Berkeley data set has aggregate data on graduate school applicants for the six largest departments; see ?UCBAdmissions for more information. There are four variables in the data set: Admit (whether the candidate was admitted or rejected), Gender (the gender of the candidate: Male or Female), Dept (the department to which the candidate applied, coded as A, B, C, D, E, F), and n (the number of applicants). n is a numeric vector. Admit, Gender, and Dept are character vectors. Since the data are stored as a table, to read them into R as a data frame call as_tibble (provided by the tibble package and available when dplyr is loaded) with the argument UCBAdmissions.

Observations: 24
Variables: 4
$ Admit  <chr> "Admitted", "Rejected", "Admitted", "Rejected", "Admitt...
$ Gender <chr> "Male", "Male", "Female", "Female", "Male", "Male", "Fe...
$ Dept   <chr> "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", ...
$ n      <dbl> 512, 313, 89, 19, 353, 207, 17, 8, 120, 205, 202, 391, ...
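A data frame with the same four columns can also be built without the tidyverse. The sketch below uses only base R; the object name `ucb` is an assumption made for illustration, and `str()` stands in for the glimpse-style display above.

```r
# UCBAdmissions ships with base R (datasets package) as a 3-way table.
# as.data.frame() flattens it to one row per cell; responseName names the
# count column and stringsAsFactors = FALSE keeps the labels as characters.
ucb <- as.data.frame(UCBAdmissions,
                     responseName = "n",
                     stringsAsFactors = FALSE)
str(ucb)
```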

Suppose we wish to estimate the difference in means model \(n_i = \beta_0+\beta_1Admit_i+\epsilon_i\). If we use Admit as an argument in lm, R will correctly treat Admit as a single dummy variable with two categories.


Call:
lm(formula = n ~ Admit)

Coefficients:
  (Intercept)  AdmitRejected  
        146.2           84.7  
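A call along the following lines produces a fit of this form. It is a sketch that assumes a base-R data frame named `ucb` (a hypothetical name) rather than the tibble used in the chapter.

```r
# Rebuild the data with character columns (base R only).
ucb <- as.data.frame(UCBAdmissions, responseName = "n",
                     stringsAsFactors = FALSE)

# lm() converts the character column Admit to a factor internally and
# creates one dummy, AdmitRejected (1 = Rejected, 0 = Admitted).
fit_admit <- lm(n ~ Admit, data = ucb)
fit_admit
```

The first level in alphabetical order, Admitted, becomes the base category absorbed by the intercept.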

R has coded Rejected as 1 and Admitted as 0. The regression indicates that the mean number admitted is 146.25, while the mean number rejected is 230.92 (the intercept plus the coefficient on AdmitRejected). We can confirm that directly as well.

    Admit              Gender              Dept                 n      
 Length:12          Length:12          Length:12          Min.   : 17  
 Class :character   Class :character   Class :character   1st Qu.: 46  
 Mode  :character   Mode  :character   Mode  :character   Median :107  
                                                          Mean   :146  
                                                          3rd Qu.:154  
                                                          Max.   :512  
    Admit              Gender              Dept                 n      
 Length:12          Length:12          Length:12          Min.   :  8  
 Class :character   Class :character   Class :character   1st Qu.:188  
 Mode  :character   Mode  :character   Mode  :character   Median :262  
                                                          Mean   :231  
                                                          3rd Qu.:314  
                                                          Max.   :391  
[1] 146
[1] 231
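One compact way to make the same check is to compute the group means directly and compare them with the values implied by the regression. This is a base-R sketch; `ucb` and the other object names are assumptions for illustration.

```r
ucb <- as.data.frame(UCBAdmissions, responseName = "n",
                     stringsAsFactors = FALSE)

# Group means computed directly from the data.
means_by_admit <- tapply(ucb$n, ucb$Admit, mean)
means_by_admit

# Group means implied by the regression coefficients.
b <- coef(lm(n ~ Admit, data = ucb))
implied <- c(Admitted = b[["(Intercept)"]],
             Rejected = b[["(Intercept)"]] + b[["AdmitRejected"]])
implied
```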

Similarly, if we want to calculate the mean number of applicants by department, R will treat Dept as 5 dummy variables.


Call:
lm(formula = n ~ Dept)

Coefficients:
(Intercept)        DeptB        DeptC        DeptD        DeptE  
     233.25       -87.00        -3.75       -35.25       -87.25  
      DeptF  
     -54.75  

The mean number of applicants in Department A is 233.25. To find the mean number of applicants for each department add the appropriate coefficient to 233.25.

We can confirm these results as we did above.
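As a sketch of that confirmation in base R (object names are illustrative assumptions), the department means implied by the coefficients can be compared with means computed directly:

```r
ucb <- as.data.frame(UCBAdmissions, responseName = "n",
                     stringsAsFactors = FALSE)

fit_dept <- lm(n ~ Dept, data = ucb)
b <- coef(fit_dept)

# Department A is the intercept; every other department is the
# intercept plus its own coefficient.
implied <- c(b[1], b[1] + b[-1])
names(implied) <- LETTERS[1:6]
implied

# Means computed directly from the data, in the same A-F order.
direct <- tapply(ucb$n, ucb$Dept, mean)
all.equal(unname(implied), as.vector(direct))
```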

8.2 Difference in Means Test

Using the UCB Admissions data, let’s conduct a difference of means test for number of applications by Gender. We will test the following hypothesis: \[H_0: \mu_{Male}=\mu_{Female}\\ H_1: \mu_{Male}\ne\mu_{Female}\] at the \(\alpha=.05\) level of significance. We can use t.test in two different ways, lm, or lm_robust. First, we will test the hypothesis with t.test assuming, in turn, equal and unequal variances.

8.2.1 Using t.test


    Two Sample t-test

data:  n[Gender == "Female"] and n[Gender == "Male"]
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -188.4   45.7
sample estimates:
mean of x mean of y 
      153       224 

    Two Sample t-test

data:  n by Gender
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -188.4   45.7
sample estimates:
mean in group Female   mean in group Male 
                 153                  224 

    Welch Two Sample t-test

data:  n[Gender == "Female"] and n[Gender == "Male"]
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -188.4   45.8
sample estimates:
mean of x mean of y 
      153       224 

    Welch Two Sample t-test

data:  n by Gender
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -188.4   45.8
sample estimates:
mean in group Female   mean in group Male 
                 153                  224 
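The four results above correspond to two calling styles (two vectors, or a formula) crossed with two variance assumptions. A base-R sketch, assuming a data frame named `ucb` (hypothetical name):

```r
ucb <- as.data.frame(UCBAdmissions, responseName = "n",
                     stringsAsFactors = FALSE)
female <- ucb$n[ucb$Gender == "Female"]
male   <- ucb$n[ucb$Gender == "Male"]

# Equal variances assumed (pooled two-sample t-test).
tt_vec_pooled     <- t.test(female, male, var.equal = TRUE)
tt_formula_pooled <- t.test(n ~ Gender, data = ucb, var.equal = TRUE)

# Unequal variances (Welch test); this is t.test()'s default.
tt_vec_welch     <- t.test(female, male)
tt_formula_welch <- t.test(n ~ Gender, data = ucb)

tt_formula_pooled
tt_formula_welch
```

Both calling styles give the same statistic; the formula interface simply splits n by the two values of Gender.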

8.2.2 Using lm and lm_robust

# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    153.       39.9      3.83 0.000911
2 GenderMale      71.3      56.5      1.26 0.220   
         term estimate std.error statistic  p.value conf.low conf.high df
1 (Intercept)    152.9      38.7      3.95 0.000679     72.7       233 22
2  GenderMale     71.3      56.5      1.26 0.219606    -45.7       188 22
  outcome
1       n
2       n
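The regression version of the test can be sketched as follows. The lm part uses only base R; the lm_robust part runs only if the estimatr package is installed, and its heteroskedasticity-robust standard errors (HC2 is, as far as I know, the package default) are why its standard errors can differ slightly from those of lm. Object names are assumptions for illustration.

```r
ucb <- as.data.frame(UCBAdmissions, responseName = "n",
                     stringsAsFactors = FALSE)

# The coefficient on GenderMale is the difference in means, and its
# t statistic matches the pooled t-test up to sign.
fit_gender <- lm(n ~ Gender, data = ucb)
coef_table <- summary(fit_gender)$coefficients
coef_table

t_lm     <- coef_table["GenderMale", "t value"]
t_pooled <- unname(t.test(n ~ Gender, data = ucb, var.equal = TRUE)$statistic)

# Optional: robust standard errors, only if estimatr is available.
if (requireNamespace("estimatr", quietly = TRUE)) {
  fit_robust <- estimatr::lm_robust(n ~ Gender, data = ucb)
  print(summary(fit_robust))
}
```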

8.3 Integer and Numerical Vectors as Dummy Variables

lm treated the character vectors as factors. For most of what we will do, that is enough. If the categorical (dummy) variable is coded as a numeric vector or integer vector, we may have to coerce the variable to a factor for lm to interpret it correctly. If the variable is coded as 0 and 1, we can use it as it is. For example, consider the mtcars data.

Observations: 32
Variables: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
$ wt   <dbl> 2.62, 2.88, 2.32, 3.21, 3.44, 3.46, 3.57, 3.19, 3.15, 3.4...
$ qsec <dbl> 16.5, 17.0, 18.6, 19.4, 17.0, 20.2, 15.8, 20.0, 22.9, 18....
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...

The type of transmission, am, takes on two values: 0 if the transmission is automatic and 1 if it is manual. Suppose we’d like to know if the mpg is different for the two types of transmissions. We can test the hypothesis \[H_0:\mu_a=\mu_m\] \[H_1:\mu_a\ne\mu_m\] at the \(\alpha=.05\) level of significance.

         term estimate std.error statistic                p.value conf.low
1 (Intercept)    17.15      0.88     19.50 0.00000000000000000138    15.35
2          am     7.24      1.92      3.77 0.00072109506857981581     3.32
  conf.high df outcome
1      18.9 30     mpg
2      11.2 30     mpg
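Because am is already coded 0/1, it can enter the formula as a plain numeric variable. The sketch below uses base-R lm rather than the robust estimator shown above, so the point estimates agree but the standard errors may differ; object names are illustrative.

```r
# am is numeric 0/1, so its coefficient is the difference in mean mpg
# between the am = 1 group and the am = 0 group.
fit_am <- lm(mpg ~ am, data = mtcars)
summary(fit_am)$coefficients

# Direct check against the two group means.
mpg_means <- tapply(mtcars$mpg, mtcars$am, mean)
mpg_means[["1"]] - mpg_means[["0"]]
```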

If, however, the categorical variable is not coded as 0 and 1, we will have to coerce it to a factor. The forcats package simplifies this process. Suppose we’d like to know if the average mpg is different for 4, 6, and 8 cylinder cars. \[H_0:\mu_4=\mu_6=\mu_8\] \[H_1:\text{at least one }\mu\text{ is not equal}\] If we estimate a model of mpg on cyl, the coefficient on cyl will give us the marginal effect on mpg of adding a cylinder. A significant coefficient in this model will not answer our question. To do that, we must coerce cyl into a categorical variable with as.factor.


Call:
lm(formula = mpg ~ as.factor(cyl))

Residuals:
   Min     1Q Median     3Q    Max 
-5.264 -1.836  0.029  1.389  7.236 

Coefficients:
                Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       26.664      0.972   27.44 < 0.0000000000000002 ***
as.factor(cyl)6   -6.921      1.558   -4.44              0.00012 ***
as.factor(cyl)8  -11.564      1.299   -8.90        0.00000000086 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.22 on 29 degrees of freedom
Multiple R-squared:  0.732, Adjusted R-squared:  0.714 
F-statistic: 39.7 on 2 and 29 DF,  p-value: 0.00000000498
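A sketch of the call behind this output, together with one way to pull the overall F test out of the summary object (the object names are assumptions for illustration):

```r
# as.factor() turns cyl into a 3-level factor, so lm() estimates two
# dummies (6 and 8 cylinders) relative to the 4-cylinder base case.
fit_cyl <- lm(mpg ~ as.factor(cyl), data = mtcars)
s <- summary(fit_cyl)
s

# The overall F statistic tests H0: all group means are equal.
f <- s$fstatistic   # named vector: value, numdf, dendf
p_overall <- pf(f[["value"]], f[["numdf"]], f[["dendf"]],
                lower.tail = FALSE)
p_overall
```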

The F-statistic for overall significance of the model is significant at the \(\alpha = .05\) level, so we reject the null hypothesis in favor of the alternative and conclude that at least one average mpg is different.

The base case is cars with 4 cylinders, with an average of 26.7 mpg. 6 cylinder cars average a statistically significant 6.9 mpg less than 4 cylinder cars. 8 cylinder cars average a statistically significant 11.6 mpg less than 4 cylinder cars. These averages are statistically significantly different.

Had we estimated the model without coercing cylinders into a factor, our results would have been:

# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    37.9      2.07      18.3  8.37e-18
2 cyl            -2.88     0.322     -8.92 6.11e-10

\(\hat\beta_1=-2.88\) tells us that for each additional cylinder fuel mileage will fall by 2.88 mpg.

8.4 Manipulating Factors

The forcats package provides a set of tools for the simple manipulation of factors, such as renaming, re-ordering, and combining levels. Using the mtcars data, let’s coerce the number of cylinders to a factor and look at ways to manipulate it that aid understanding. The compound assignment pipe operator %<>% (from the magrittr package) updates a value by first piping it into one or more expressions and then assigning the result.

 Factor w/ 3 levels "6","4","8": 1 1 2 1 3 1 3 2 2 1 ...

cyl is now a factor with 3 levels: 6, 4, 8. Suppose we estimate the model \(mpg = \beta_0 + \beta_1cyl+\epsilon\).

# A tibble: 3 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    19.7       1.22     16.2  4.49e-16
2 cyl4            6.92      1.56      4.44 1.19e- 4
3 cyl8           -4.64      1.49     -3.11 4.15e- 3
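A factor with levels in order of first appearance (6, 4, 8) can be reproduced in base R as sketched below. This is not necessarily the call used to produce the output above; the working copy `cars` and the use of `factor()` with `unique()` are assumptions made for illustration.

```r
# Work on a copy so the built-in mtcars data stay untouched.
cars <- mtcars

# Levels in order of first appearance: mtcars$cyl starts 6, 6, 4, 6, 8.
cars$cyl <- factor(cars$cyl, levels = unique(cars$cyl))
str(cars$cyl)

# The first level (6) is the base case absorbed by the intercept.
fit_base6 <- lm(mpg ~ cyl, data = cars)
coef(fit_base6)
```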

This model indicates that cars with 6 cylinder engines average 19.74 mpg, cars with 4 cylinders average 6.92 mpg more than cars with 6 cylinders, and cars with 8 cylinders average 4.64 mpg less than cars with 6 cylinders. Suppose, instead, you’d prefer 4 cylinder cars to be the base case. We can reorder the factor with fct_relevel from the forcats package. fct_relevel changes the order of a factor by hand.

For some factors the order doesn’t matter; for others there is a “natural” ordering suggested by the data; and for others you may have an ordering that you prefer. fct_relevel() from the forcats package handles that task. If we call fct_relevel within lm, the releveling will be ad hoc: it applies only to that model and leaves the data unchanged.

# A tibble: 3 x 5
  term                                estimate std.error statistic  p.value
  <chr>                                  <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                            26.7      0.972     27.4  2.69e-22
2 "fct_relevel(cyl, levels = c(\"4\"~    -6.92     1.56      -4.44 1.19e- 4
3 "fct_relevel(cyl, levels = c(\"4\"~   -11.6      1.30      -8.90 8.57e-10

We can also permanently relevel cylinders:

# A tibble: 3 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    26.7      0.972     27.4  2.69e-22
2 cyl6           -6.92     1.56      -4.44 1.19e- 4
3 cyl8          -11.6      1.30      -8.90 8.57e-10
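Both approaches can be sketched with base R's `relevel()`, which is used here as a stand-in for `fct_relevel()`; the forcats call is shown only in a guarded check that runs when the package is installed. The working copy `cars` is a hypothetical name.

```r
cars <- mtcars
cars$cyl <- factor(cars$cyl, levels = unique(cars$cyl))  # levels 6, 4, 8

# Ad hoc: relevel inside the formula; cars$cyl itself is unchanged.
fit_adhoc <- lm(mpg ~ relevel(cyl, ref = "4"), data = cars)
levels(cars$cyl)   # still "6" "4" "8"

# Permanent: overwrite the column, then fit with the plain name.
cars$cyl <- relevel(cars$cyl, ref = "4")
levels(cars$cyl)   # now "4" "6" "8"
fit_perm <- lm(mpg ~ cyl, data = cars)
coef(fit_perm)

# forcats counterpart, checked only if forcats is available.
if (requireNamespace("forcats", quietly = TRUE)) {
  f <- factor(mtcars$cyl, levels = unique(mtcars$cyl))
  print(levels(forcats::fct_relevel(f, "4")))
}
```

Releveling inside the formula gives the same estimates but long coefficient labels, which is one practical reason to relevel the column itself.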

See Reorder factor levels by hand for more ways to relevel factors.

The transmission variable (am) is a numeric vector coded as 0 and 1. Suppose we’d like to coerce it to a factor coded with the levels named “automatic” and “manual” rather than 0 and 1.

If we re-estimate the model \(mpg = \beta_0+\beta_1am\) we see the results are the same, but the variable is labeled more clearly.

         term estimate std.error statistic                p.value conf.low
1 (Intercept)    17.15      0.88     19.50 0.00000000000000000138    15.35
2    ammanual     7.24      1.92      3.77 0.00072109506857981581     3.32
  conf.high df outcome
1      18.9 30     mpg
2      11.2 30     mpg
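One way to attach the labels is `factor()` with explicit `levels` and `labels`, sketched below in base R. The working copy `cars` is a hypothetical name, and plain lm is used instead of the robust estimator shown above, so only the point estimates are expected to match.

```r
cars <- mtcars

# Map the numeric codes to readable labels: 0 -> automatic, 1 -> manual.
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))
table(cars$am)

# Same estimates as with the 0/1 coding, but the term is now "ammanual".
fit_labeled <- lm(mpg ~ am, data = cars)
fit_numeric <- lm(mpg ~ am, data = mtcars)
coef(fit_labeled)
```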

8.5 Dummy Interaction Variables

Dummy interactions \(x_iD_i\) can be created in lm as an argument. Let’s estimate the model \(mpg= \beta_0+\beta_1am+\beta_2hp+\beta_3hp*am+\epsilon\).


Call:
lm(formula = mpg ~ hp * am)

Coefficients:
(Intercept)           hp     ammanual  hp:ammanual  
  26.624848    -0.059137     5.217653     0.000403  
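In a formula, `hp * am` is shorthand for both main effects plus their interaction, `hp + am + hp:am`. The sketch below checks that the two spellings give the same fit; the labeled copy `cars` is an assumption for illustration.

```r
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# The * operator expands to main effects plus the interaction term.
fit_star <- lm(mpg ~ hp * am, data = cars)
fit_long <- lm(mpg ~ hp + am + hp:am, data = cars)

coef(fit_star)
all.equal(coef(fit_star), coef(fit_long))
```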

Notice that R assumed that you wanted to calculate \(\hat\beta_1\), \(\hat\beta_2\), and \(\hat\beta_3\). By including hp*am as an argument in lm, R estimated coefficients for the continuous variable, the dummy variable, and the interaction. If, on the other hand, you wanted just the interaction term, i.e., \(mpg=\alpha_0+\alpha_1hp*am+\eta\), use the “AsIs” function I() as follows:


Call:
lm(formula = mpg ~ I(hp * am))

Coefficients:
(Intercept)   I(hp * am)  
    19.5696       0.0101  

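I() inhibits the formula interpretation of operators, so that they are treated as ordinary arithmetic operators. Here, hp * am inside I() is the numeric product of the two columns rather than a request for main effects plus an interaction.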
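As a sketch of what I() does here, the fit can be compared with one that uses a product column built by hand. Note that this relies on am being the original numeric 0/1 column; arithmetic on a factor would not work. The column name `hp_am` is a hypothetical name chosen for illustration.

```r
# Inside I(), * is ordinary multiplication of the two numeric columns.
fit_asis <- lm(mpg ~ I(hp * am), data = mtcars)
coef(fit_asis)

# Equivalent: build the product column first, then regress on it.
cars <- mtcars
cars$hp_am <- cars$hp * cars$am
fit_manual <- lm(mpg ~ hp_am, data = cars)
all.equal(unname(coef(fit_asis)), unname(coef(fit_manual)))
```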