3.3 Use of qualitative predictors
An important situation not covered so far is how to deal with qualitative, rather than quantitative, predictors. Qualitative predictors are also known as categorical variables or, in R’s terminology, factors, and are very common in many fields, such as the social sciences. Dealing with them requires some care and a proper understanding of how these variables are represented.
The simplest case is the situation with two levels. A binary variable \(C\) with two levels (for example, a and b) can be encoded as
\[\begin{align*} D=\left\{\begin{array}{ll} 1,&\text{if }C=b,\\ 0,&\text{if }C=a. \end{array}\right. \end{align*}\]
\(D\) is a dummy variable: it codifies the two possible levels of the categorical variable with zeros and ones. An example of \(C\) is smoker, which has levels yes and no. The associated dummy variable is \(D=1\) if a person smokes and \(D=0\) if a person does not smoke.
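This encoding can be built by hand in R; a minimal sketch, with illustrative data:
# Manual dummification of the smoker variable (illustrative data)
smoker <- c("yes", "no", "no", "yes")
D <- ifelse(smoker == "yes", 1, 0)  # 1 if smoker = "yes", 0 otherwise
D
## [1] 1 0 0 1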
The advantage of this dummification is its interpretability in regression models. Since level a corresponds to \(0,\) it can be seen as the reference level to which level b is compared. This is the key point in dummification: set one level as the reference and codify the rest as departures from it.
The previous interpretation translates easily to the linear model. Assume that the dummy variable \(D\) is available together with other predictors \(X_1,\ldots,X_p.\) Then:
\[\begin{align*} \mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p,D=d]=\beta_0+\beta_1x_1+\cdots+\beta_px_p+\beta_{p+1}d. \end{align*}\]
The coefficient associated with \(D\) is easily interpretable: \(\beta_{p+1}\) is the increment in the mean of \(Y\) associated with changing \(D=0\) (reference) to \(D=1,\) while the rest of the predictors are fixed. In other words, \(\beta_{p+1}\) is the increment in the mean of \(Y\) associated with changing the level of the categorical variable from a to b.
R does the dummification automatically, translating a categorical variable \(C\) into its dummy version \(D,\) if it detects that a factor variable is present in the regression model.
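The encoding that R employs can be inspected with `model.matrix()`, which returns the design matrix of a regression model. A minimal sketch, reusing the illustrative smoker data from above:
# R's automatic dummification of a factor; "no" is the reference level
# (the first level, since levels are sorted alphabetically by default)
smoker <- factor(c("yes", "no", "no", "yes"))
model.matrix(~ smoker)  # the column smokeryes is the dummy variable D
##   (Intercept) smokeryes
## 1           1         1
## 2           1         0
## 3           1         0
## 4           1         1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$smoker
## [1] "contr.treatment"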
Let’s see now the case with more than two levels, for example, a categorical variable \(C\) with levels a, b, and c. If we take a as the reference level, this variable can be represented by two dummy variables:
\[\begin{align*} D_1=\left\{\begin{array}{ll}1,&\text{if }C=b,\\0,& \text{if }C\neq b\end{array}\right. \end{align*}\]
and
\[\begin{align*} D_2=\left\{\begin{array}{ll}1,&\text{if }C=c,\\0,& \text{if }C\neq c.\end{array}\right. \end{align*}\]
Therefore, we can represent the levels of \(C\) as in the following table.
| \(C\) | \(D_1\) | \(D_2\) |
|---|---|---|
| \(a\) | \(0\) | \(0\) |
| \(b\) | \(1\) | \(0\) |
| \(c\) | \(0\) | \(1\) |
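This encoding can be checked in R with the `contrasts()` function; a minimal sketch with an illustrative three-level factor:
# Dummy variables of a three-level factor; "a" is the reference level
C <- factor(c("a", "b", "c"))
contrasts(C)  # rows are the levels of C; columns are the dummies D1 and D2
##   b c
## a 0 0
## b 1 0
## c 0 1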
The interpretation of regression models in the presence of \(D_1\) and \(D_2\) is very similar to the one before. For example, in the linear model, the coefficient associated with \(D_1\) gives the increment in the mean of \(Y\) when the level of \(C\) changes from a to b, and the coefficient of \(D_2\) gives the increment in the mean of \(Y\) when \(C\) changes from a to c.
In general, if we have a categorical variable \(C\) with \(J\) levels, then the previous process is iterated and the number of dummy variables required to encode \(C\) is \(J-1,\) to account for the \(J-1\) possible departures from the reference level. Again, R does the dummification automatically if it detects that a factor variable is present in the regression model.
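For instance, under the default encoding, a factor with \(J=4\) levels is represented by \(4-1=3\) dummy variables; a quick check:
# A four-level factor requires J - 1 = 3 dummy variables
C4 <- factor(c("a", "b", "c", "d"))
ncol(contrasts(C4))
## [1] 3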
Let’s see an example with the famous `iris` dataset.
# iris dataset -- factors in the last column
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
## Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
# Summary of a linear model
mod1 <- lm(Sepal.Length ~ ., data = iris)
summary(mod1)
##
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79424 -0.21874 0.00899 0.20255 0.73103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.17127 0.27979 7.760 1.43e-12 ***
## Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
## Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
## Petal.Width -0.31516 0.15120 -2.084 0.03889 *
## Speciesversicolor -0.72356 0.24017 -3.013 0.00306 **
## Speciesvirginica -1.02350 0.33373 -3.067 0.00258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
## F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
# Speciesversicolor (D1) coefficient: -0.72356. The average increment of
# Sepal.Length when the species is versicolor instead of setosa (reference)
# Speciesvirginica (D2) coefficient: -1.02350. The average increment of
# Sepal.Length when the species is virginica instead of setosa (reference)
# Both dummy variables are significant
# How to set a different level as reference (versicolor)
iris$Species <- relevel(iris$Species, ref = "versicolor")
# Same estimates, except for the dummy coefficients
mod2 <- lm(Sepal.Length ~ ., data = iris)
summary(mod2)
##
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79424 -0.21874 0.00899 0.20255 0.73103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.44770 0.28149 5.143 8.68e-07 ***
## Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
## Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
## Petal.Width -0.31516 0.15120 -2.084 0.03889 *
## Speciessetosa 0.72356 0.24017 3.013 0.00306 **
## Speciesvirginica -0.29994 0.11898 -2.521 0.01280 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
## F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
# Speciessetosa (D1) coefficient: 0.72356. The average increment of
# Sepal.Length when the species is setosa instead of versicolor (reference)
# Speciesvirginica (D2) coefficient: -0.29994. The average increment of
# Sepal.Length when the species is virginica instead of versicolor (reference)
# Both dummy variables are significant
# Confidence intervals for the coefficients of the model
confint(mod2)
## 2.5 % 97.5 %
## (Intercept) 0.8913266 2.00408209
## Sepal.Width 0.3257653 0.66601260
## Petal.Length 0.6937939 0.96469395
## Petal.Width -0.6140049 -0.01630542
## Speciessetosa 0.2488500 1.19827390
## Speciesvirginica -0.5351144 -0.06475727
# The coefficients of Speciessetosa and Speciesvirginica are
# significantly positive and negative, respectively
# Show the dummy variables employed for encoding a factor
contrasts(iris$Species)
## setosa virginica
## versicolor 0 0
## setosa 1 0
## virginica 0 1
iris$Species <- relevel(iris$Species, ref = "setosa")
contrasts(iris$Species)
## versicolor virginica
## setosa 0 0
## versicolor 1 0
## virginica 0 1
It may happen that one dummy variable, say \(D_1,\) is not significant, while other dummy variables of the same categorical variable, say \(D_2,\) are significant. For example, this happens in the model above at level \(\alpha=0.01\): `Speciesvirginica` is not significant at that level, whereas `Speciessetosa` is. In that case, in the considered model, the level associated with the non-significant dummy variable does not add relevant information for explaining \(Y\) with respect to the reference level.
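This can be quickly checked by extracting the p-values of the dummy coefficients from the summary of the fitted model, as in the following sketch for `mod2`:
# p-values of the Species dummies in mod2: Speciessetosa is significant at
# level alpha = 0.01, but Speciesvirginica is not (p-value 0.01280 > 0.01)
summary(mod2)$coefficients[c("Speciessetosa", "Speciesvirginica"), 4]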
Do not codify a categorical variable as a discrete variable. This is a major methodological error that will invalidate the subsequent statistical analysis.
For example, if you have a categorical variable `party` with levels `partyA`, `partyB`, and `partyC`, do not encode it as a discrete variable taking the values `1`, `2`, and `3`, respectively. If you do so:

- You implicitly assume an order in the levels of `party`, since `partyA` is closer to `partyB` than to `partyC`.
- You implicitly assume that `partyC` is three times larger than `partyA`.
- The codification is completely arbitrary: why not consider `1`, `1.5`, and `1.75` instead?
The right way of dealing with categorical variables in regression is to set the variable as a factor and let R do the dummification internally.
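The following sketch, with hypothetical simulated data, illustrates the difference between the two codifications: the numeric encoding forces a single coefficient for `party`, whereas the factor encoding estimates one effect per non-reference level.
# Hypothetical data: a response y and a party affiliation for 50 observations
set.seed(123)
party_num <- sample(1:3, size = 50, replace = TRUE)              # wrong encoding
party_fac <- factor(c("partyA", "partyB", "partyC")[party_num])  # right encoding
y <- rnorm(50)

# Wrong: a single coefficient that imposes order and equispacing on the levels
lm(y ~ party_num)

# Right: J - 1 = 2 dummy coefficients, one per non-reference level
lm(y ~ party_fac)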
3.3.1 Case study application
Let’s see what the dummy variables are in the `Boston` dataset and what effect they have on `medv`.
# Load the Boston dataset
data(Boston, package = "MASS")
# Structure of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# chas is a dummy variable indicating whether the suburb is close to the
# river (1) or not (0). In this case it is not codified as a factor but
# directly as 0 or 1 (so it is already dummified)
# Summary of a linear model
mod <- lm(medv ~ chas + crim, data = Boston)
summary(mod)
##
## Call:
## lm(formula = medv ~ chas + crim, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.540 -5.421 -1.878 2.575 30.134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.61403 0.41862 56.409 < 2e-16 ***
## chas 5.57772 1.46926 3.796 0.000165 ***
## crim -0.40598 0.04339 -9.358 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.373 on 503 degrees of freedom
## Multiple R-squared: 0.1744, Adjusted R-squared: 0.1712
## F-statistic: 53.14 on 2 and 503 DF, p-value: < 2.2e-16
# The coefficient associated with chas is 5.57772. That means that if the
# suburb is close to the river, the mean of medv increases by 5.57772 units,
# for the same house and neighborhood conditions
# chas is significant (the presence of the river adds valuable information
# for explaining medv)
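Since `chas` already takes only the values 0 and 1, declaring it as a factor is not necessary, but doing so yields exactly the same fit; a quick check (the `modFactor` name is illustrative):
# Equivalent model with chas explicitly declared as a factor; the estimates
# are unchanged and the dummy coefficient is renamed to factor(chas)1
modFactor <- lm(medv ~ factor(chas) + crim, data = Boston)
coef(modFactor)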
# Summary of the best model in terms of BIC (modBIC was obtained by
# stepwise model selection in a previous section)
summary(modBIC)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
# The coefficient associated with chas is 2.71872. If the suburb is close to
# the river, the mean of medv increases by 2.71872 units
# chas is significant as well in the presence of more predictors
We will see how to mix dummy and quantitative predictors in Section 3.4.3.