# Chapter 8 Dummy Variables: Smarter than You Think

In this chapter we will learn how R handles dummy variables.

We will need the following libraries.
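The library list itself did not survive in this copy. Based on the functions used in the chapter (`%$%` and `%<>%` from magrittr, `tidy` from broom, `lm_robust` from estimatr, and the dplyr/forcats verbs from the tidyverse), a plausible preamble is:

```r
# Presumed preamble -- an assumption inferred from the code in this chapter
library(tidyverse) # dplyr, forcats, ggplot2, etc.
library(magrittr)  # %$% and %<>% pipes
library(broom)     # tidy() for model output
library(estimatr)  # lm_robust()
```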

## 8.1 Dummy Variables in R

R uses factor vectors to represent dummy or categorical data. Factors can be ordered or unordered. Factor vectors are built on top of integer vectors and include a unique label for each integer.

### 8.1.1 Factors

R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. The forcats package, part of the tidyverse, offers a suite of tools that solve common problems with factors. See the forcats vignette to learn more about using factors in R.

### 8.1.2 Character Vectors as Dummies

Character vectors are one of the six atomic vector types in R. *Atomic* means that the vector contains only data of a single type; in this case, all of the observations are characters. Categorical data, or dummy variables, are often stored as character vectors even though they are typically coded as numeric. For example, a dummy variable for sex may contain male and female, but be coded as 0 and 1 and named female. If you use a character vector as an argument in `lm`, R will treat the vector as a set of dummy variables. The number of dummy variables will be the number of unique values minus 1.

The student admissions at UC Berkeley data set has aggregate data on graduate school applicants for the six largest departments; see `?UCBAdmissions` for more information. There are four variables in the data set: Admit (whether the candidate was admitted or rejected), Gender (the gender of the candidate: Male or Female), Dept (the department to which the candidate applied, coded as A, B, C, D, E, F), and n (the number of applicants). n is a numeric vector; Admit, Gender, and Dept are character vectors. Since the data are stored as a table, read them into R as a data frame by calling `as_tibble` from the dplyr package with the argument UCBAdmissions.
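The listing below presumably comes from `glimpse` on the resulting tibble. A self-contained base-R sketch that produces the same data frame (note the differences flagged in the comments):

```r
# Base-R sketch: read the UCBAdmissions table into a data frame.
# as.data.frame() names the count column Freq (the chapter's as_tibble() names it n)
# and stores Admit, Gender, and Dept as factors rather than character vectors.
UCB <- as.data.frame(UCBAdmissions)
str(UCB)
```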

```
Observations: 24
Variables: 4
$ Admit <chr> "Admitted", "Rejected", "Admitted", "Rejected", "Admitt...
$ Gender <chr> "Male", "Male", "Female", "Female", "Male", "Male", "Fe...
$ Dept <chr> "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", ...
$ n <dbl> 512, 313, 89, 19, 353, 207, 17, 8, 120, 205, 202, 391, ...
```

Suppose we wish to estimate the difference in means model \(n_i = \beta_0+\beta_1Admit_i+\epsilon_i\). If we use Admit as an argument in `lm`, R will correctly treat Admit as a single dummy variable with two categories.
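A sketch of the call behind the output below, using the base data frame (count column Freq); `fit_admit` is a name of my choosing:

```r
# Regress the number of applicants on the Admit dummy
UCB <- as.data.frame(UCBAdmissions)
fit_admit <- lm(Freq ~ Admit, data = UCB)
coef(fit_admit)
```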

```
Call:
lm(formula = n ~ Admit)
Coefficients:
(Intercept) AdmitRejected
146.2 84.7
```

R has coded Rejected as 1 and Admitted as 0. The regression indicates that the mean number admitted is 146.25 while the mean number rejected is 146.25 + 84.67 = 230.92. We can confirm that directly as well.
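The two summaries below can be reproduced by splitting the data on Admit; a base-R sketch (`admitted` and `rejected` are names of my choosing):

```r
# Confirm the group means by splitting on Admit
UCB <- as.data.frame(UCBAdmissions)
admitted <- subset(UCB, Admit == "Admitted")
rejected <- subset(UCB, Admit == "Rejected")
summary(admitted)
summary(rejected)
```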

```
Admit Gender Dept n
Length:12 Length:12 Length:12 Min. : 17
Class :character Class :character Class :character 1st Qu.: 46
Mode :character Mode :character Mode :character Median :107
Mean :146
3rd Qu.:154
Max. :512
```

```
Admit Gender Dept n
Length:12 Length:12 Length:12 Min. : 8
Class :character Class :character Class :character 1st Qu.:188
Mode :character Mode :character Mode :character Median :262
Mean :231
3rd Qu.:314
Max. :391
```

```
# Directly selecting observations based on other values
UCB_Admissions %$%
  mean(n[Admit == "Admitted"])
UCB_Admissions %$%
  mean(n[Admit == "Rejected"])
```

`[1] 146`

`[1] 231`

Similarly, if we want to calculate the mean number of applicants by department, R will treat Dept as 5 dummy variables.
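A sketch of the regression behind the output below (Department A is the base case; `fit_dept` is my own name):

```r
# Regress the number of applicants on the Dept dummies
UCB <- as.data.frame(UCBAdmissions)
fit_dept <- lm(Freq ~ Dept, data = UCB)
coef(fit_dept)
```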

```
Call:
lm(formula = n ~ Dept)
Coefficients:
(Intercept) DeptB DeptC DeptD DeptE
233.25 -87.00 -3.75 -35.25 -87.25
DeptF
-54.75
```

The mean number of applicants in Department A is 233.25. To find the mean number of applicants for each department add the appropriate coefficient to 233.25.

We can confirm these results as we did above.
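For instance, a base-R sketch of the confirmation: each department mean equals 233.25 plus its coefficient from the regression above.

```r
# Mean number of applicants by department
UCB <- as.data.frame(UCBAdmissions)
dept_means <- tapply(UCB$Freq, UCB$Dept, mean)
dept_means
```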

## 8.2 Difference in Means Test

Using the UCB Admissions data, let's conduct a difference of means test for the number of applications by Gender. We will test the hypothesis \[H_0: \mu_{Male}=\mu_{Female}\\ H_1: \mu_{Male}\ne\mu_{Female}\] at the \(\alpha=.05\) level of significance. We can use `t.test` in two different ways, `lm`, or `lm_robust`. First, we will test the hypothesis with `t.test`, assuming, in turn, equal and unequal variances.

### 8.2.1 Using `t.test`

```
# Assume equal variances
# Use t.test default method
UCB_Admissions %$%
  t.test(n[Gender == "Female"], n[Gender == "Male"], var.equal = TRUE)
```

```
Two Sample t-test
data: n[Gender == "Female"] and n[Gender == "Male"]
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-188.4 45.7
sample estimates:
mean of x mean of y
153 224
```
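The second printout below uses `t.test`'s formula interface. A base-R sketch (note: with base factors, Gender is ordered Male then Female, so the sign of t flips relative to the chapter's character-vector data; the p-value is unchanged):

```r
# Formula interface, equal variances assumed
UCB <- as.data.frame(UCBAdmissions)
tt_eq <- t.test(Freq ~ Gender, data = UCB, var.equal = TRUE)
tt_eq
```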

```
Two Sample t-test
data: n by Gender
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-188.4 45.7
sample estimates:
mean in group Female mean in group Male
153 224
```

```
# Assume unequal variances
# Unequal variances is the default
UCB_Admissions %$%
  t.test(n[Gender == "Female"], n[Gender == "Male"])
```

```
Welch Two Sample t-test
data: n[Gender == "Female"] and n[Gender == "Male"]
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-188.4 45.8
sample estimates:
mean of x mean of y
153 224
```
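The formula-interface version of the Welch test below can be sketched the same way (group order caveat as before):

```r
# Formula interface; unequal variances (Welch) is the default
UCB <- as.data.frame(UCBAdmissions)
tt_welch <- t.test(Freq ~ Gender, data = UCB)
tt_welch
```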

```
Welch Two Sample t-test
data: n by Gender
t = -1, df = 22, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-188.4 45.8
sample estimates:
mean in group Female mean in group Male
153 224
```

### 8.2.2 Using `lm` and `lm_robust`
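The tables below presumably come from `lm` and `lm_robust` tidied with `broom::tidy`. A sketch with base `lm` (releveling Gender so Female is the base case, matching the chapter's alphabetical character ordering), plus a guarded `lm_robust` call in case estimatr is not installed:

```r
UCB <- as.data.frame(UCBAdmissions)
UCB$Gender <- relevel(UCB$Gender, ref = "Female") # Female as base case
fit_gender <- lm(Freq ~ Gender, data = UCB)
summary(fit_gender)$coefficients
# lm_robust() reports heteroskedasticity-robust standard errors
if (requireNamespace("estimatr", quietly = TRUE)) {
  estimatr::lm_robust(Freq ~ Gender, data = UCB)
}
```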

```
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 153. 39.9 3.83 0.000911
2 GenderMale 71.3 56.5 1.26 0.220
```

```
term estimate std.error statistic p.value conf.low conf.high df
1 (Intercept) 152.9 38.7 3.95 0.000679 72.7 233 22
2 GenderMale 71.3 56.5 1.26 0.219606 -45.7 188 22
outcome
1 n
2 n
```

## 8.3 Integer and Numerical Vectors as Dummy Variables

`lm` treated the character vectors as factors. For most of what we will do, that is enough. If the categorical (dummy) variable is coded as a numeric or integer vector, we may have to coerce the variable to a factor for `lm` to interpret it correctly. If the variable is coded as 0 and 1, we can use it as it is. For example, consider the mtcars data.
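The listing below presumably comes from `dplyr::glimpse(mtcars)`; the base-R equivalent:

```r
data(mtcars)
str(mtcars) # base-R analogue of glimpse()
```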

```
Observations: 32
Variables: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
$ wt <dbl> 2.62, 2.88, 2.32, 3.21, 3.44, 3.46, 3.57, 3.19, 3.15, 3.4...
$ qsec <dbl> 16.5, 17.0, 18.6, 19.4, 17.0, 20.2, 15.8, 20.0, 22.9, 18....
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
```

The type of transmission, am, takes on two values: 1 if the transmission is manual and 0 if it is automatic. Suppose we'd like to know if the mpg is different for the two types of transmissions. We can test the hypothesis \[H_0:\mu_a=\mu_m\] \[H_1:\mu_a\ne\mu_m\] at the \(\alpha=.05\) level of significance.
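The output below appears to come from a tidied `lm_robust`; base `lm` gives the same point estimates (a sketch, with `fit_am` a name of my choosing):

```r
# Since am is already coded 0/1, it can enter the regression as-is
data(mtcars)
fit_am <- lm(mpg ~ am, data = mtcars)
coef(fit_am)
```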

```
term estimate std.error statistic p.value conf.low
1 (Intercept) 17.15 0.88 19.50 0.00000000000000000138 15.35
2 am 7.24 1.92 3.77 0.00072109506857981581 3.32
conf.high df outcome
1 18.9 30 mpg
2 11.2 30 mpg
```

If, however, the categorical variable is not coded as 0 and 1, we will have to coerce it to a factor. The forcats package simplifies this process. Suppose we'd like to know if the average mpg is different for 4, 6, and 8 cylinder cars: \[H_0:\mu_4=\mu_6=\mu_8\] \[H_1:\text{at least one }\mu\text{ is not equal}\] If we estimate a model of mpg on cyl, the coefficient on cyl will give us the marginal effect on mpg of adding a cylinder. A significant coefficient in this model will not answer our question. To do that, we must coerce cyl into a categorical variable with `as.factor`.
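A sketch of the call behind the summary below:

```r
# Coerce cyl to a factor inside the lm call
data(mtcars)
fit_cyl <- lm(mpg ~ as.factor(cyl), data = mtcars)
summary(fit_cyl)
```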

```
Call:
lm(formula = mpg ~ as.factor(cyl))
Residuals:
Min 1Q Median 3Q Max
-5.264 -1.836 0.029 1.389 7.236
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.664 0.972 27.44 < 0.0000000000000002 ***
as.factor(cyl)6 -6.921 1.558 -4.44 0.00012 ***
as.factor(cyl)8 -11.564 1.299 -8.90 0.00000000086 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.22 on 29 degrees of freedom
Multiple R-squared: 0.732, Adjusted R-squared: 0.714
F-statistic: 39.7 on 2 and 29 DF, p-value: 0.00000000498
```

The *F-stat* for overall significance of the model is significant at the \(\alpha = .05\) level of significance so we reject the null hypothesis in favor of the alternative and conclude that at least one average mpg is different.

The base case is cars with 4 cylinders, with an average mpg of 26.7. 6 cylinder cars average a statistically significant 6.9 mpg less than 4 cylinder cars; 8 cylinder cars average a statistically significant 11.6 mpg less than 4 cylinder cars. These averages are statistically significantly different.

Had we estimated the model without coercing cylinders into a factor our results would have been

```
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 37.9 2.07 18.3 8.37e-18
2 cyl -2.88 0.322 -8.92 6.11e-10
```

\(\hat\beta_1=-2.88\) tells us that for each additional cylinder fuel mileage will fall by 2.88 mpg.

## 8.4 Manipulating Factors

The forcats package provides a set of tools for the simple manipulation of factors: renaming levels, re-ordering levels, combining factors, etc. Using the mtcars data, let's coerce the number of cylinders to a factor and look at ways to manipulate it to aid understanding. The compound pipe operator `%<>%` from magrittr updates a value by first piping it into one or more expressions and then assigning the result.

```
data(mtcars)
### Coerce cyl to a factor
mtcars$cyl %<>%
  as.character() %>% # forcats will not coerce integer or numeric vectors to factors
  as_factor()
mtcars$cyl %>% str()
```

` Factor w/ 3 levels "6","4","8": 1 1 2 1 3 1 3 2 2 1 ...`

cyl is now a factor with 3 levels: 6, 4, 8 (ordered by first appearance in the data). Suppose we estimate the model \(mpg = \beta_0 + \beta_1cyl+\epsilon\).
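A base-R sketch of this regression, recreating the factor with the same level order (6, 4, 8) that `as_factor` produced:

```r
data(mtcars)
# factor() with explicit levels reproduces as_factor()'s order of appearance
mtcars$cyl <- factor(mtcars$cyl, levels = c(6, 4, 8))
fit_fac <- lm(mpg ~ cyl, data = mtcars)
coef(fit_fac)
```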

```
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 19.7 1.22 16.2 4.49e-16
2 cyl4 6.92 1.56 4.44 1.19e- 4
3 cyl8 -4.64 1.49 -3.11 4.15e- 3
```

This model indicates that cars with 6 cylinder engines average 19.74 mpg, cars with 4 cylinders average 6.92 mpg more than cars with 6 cylinders, and cars with 8 cylinders average 4.64 mpg less than cars with 6 cylinders. Suppose, instead, you'd prefer 4 cylinder cars to be the base case. We can reorder the factor with `fct_relevel` from the forcats package. `fct_relevel` changes the order of a factor's levels by hand.

For some factors the order doesn't or won't matter, for others there is a "natural" ordering suggested by the data, and for others you may have an ordering that you prefer. `fct_relevel()` from the forcats package handles that task. If we call `fct_relevel` within `lm`, the releveling will be *ad hoc*.
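A base-R sketch of the same ad hoc releveling, using `stats::relevel` inside the call (the chapter's version uses `fct_relevel` in its place, as the coefficient names in the output show):

```r
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl, levels = c(6, 4, 8)) # level order from as_factor()
# Relevel inside the call: "4" is the reference level for this fit only
fit_adhoc <- lm(mpg ~ relevel(cyl, ref = "4"), data = mtcars)
coef(fit_adhoc)
```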

```
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 26.7 0.972 27.4 2.69e-22
2 "fct_relevel(cyl, levels = c(\"4\"~ -6.92 1.56 -4.44 1.19e- 4
3 "fct_relevel(cyl, levels = c(\"4\"~ -11.6 1.30 -8.90 8.57e-10
```

We can permanently relevel cylinders.

```
# relevel the factor (levels are passed as character arguments)
mtcars$cyl <- fct_relevel(mtcars$cyl, "4", "6", "8")
# alternatively
mtcars$cyl <- fct_relevel(mtcars$cyl, "6", after = 1) # move 6 to the second position
# re-estimate the model
mtcars %$%
  lm(mpg ~ cyl) %>%
  tidy()
```

```
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 26.7 0.972 27.4 2.69e-22
2 cyl6 -6.92 1.56 -4.44 1.19e- 4
3 cyl8 -11.6 1.30 -8.90 8.57e-10
```

See "Reorder factor levels by hand" in the forcats documentation for more ways to relevel factors.

The transmission variable (am) is a numeric vector coded as 0 and 1. Suppose we’d like to coerce it to a factor coded with the levels named “automatic” and “manual” rather than 0 and 1.
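The coercion step is not shown in this copy. A base-R sketch using `factor()` with labels (the chapter presumably uses a forcats recode, but the result is the same, as the `ammanual` coefficient in the output below confirms):

```r
data(mtcars)
# 0 = automatic, 1 = manual (see ?mtcars); label the levels for readable output
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))
fit_lab <- lm(mpg ~ am, data = mtcars)
coef(fit_lab) # the slope is now reported as ammanual
```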

If we re-estimate the model \(mpg = \beta_0+\beta_1am\) we see the results are the same, but the variable is labeled more clearly.

```
term estimate std.error statistic p.value conf.low
1 (Intercept) 17.15 0.88 19.50 0.00000000000000000138 15.35
2 ammanual 7.24 1.92 3.77 0.00072109506857981581 3.32
conf.high df outcome
1 18.9 30 mpg
2 11.2 30 mpg
```

## 8.5 Dummy Interaction Variables

Dummy interactions \(x_iD_i\) can be created in `lm` as an argument. Let's estimate the model \(mpg= \beta_0+\beta_1am+\beta_2hp+\beta_3hp*am+\epsilon\).

```
Call:
lm(formula = mpg ~ hp * am)
Coefficients:
(Intercept) hp ammanual hp:ammanual
26.624848 -0.059137 5.217653 0.000403
```

Notice that R assumed that you wanted to calculate \(\hat\beta_1\), \(\hat\beta_2\), and \(\hat\beta_3\). By including `hp*am` as an argument in `lm`, R estimated coefficients for the continuous variable, the dummy variable, and the interaction. If, on the other hand, you want just the interaction term, i.e., \(mpg=\alpha_0+\alpha_1hp*am+\eta\), use the "AsIs" function `I()` as follows:

```
Call:
lm(formula = mpg ~ I(hp * am))
Coefficients:
(Intercept) I(hp * am)
19.5696 0.0101
```

`I()` inhibits the interpretation of operators in formulas, so that they are used as arithmetic operators.