B Use of qualitative predictors in regression
An important situation not covered in Chapters 2, 3 and 4 is how to deal with qualitative, rather than quantitative, predictors. Qualitative predictors, also known as categorical variables or, in R's terminology, factors, are ubiquitous in the social sciences. Dealing with them requires some care and a proper understanding of how these variables are represented in statistical software such as R.
Two levels
The simplest case is the situation with two levels, that is, the binary case covered in logistic regression. There we saw that a binary variable \(C\) with two levels (for example, a and b) could be represented as \[ D=\left\{\begin{array}{ll}1,&\text{if }C=b,\\0,& \text{if }C=a.\end{array}\right. \] \(D\) is now a dummy variable: it codifies with zeros and ones the two possible levels of the categorical variable. An example of \(C\) could be gender, which has levels male and female. The associated dummy variable is \(D=0\) if the gender is male and \(D=1\) if the gender is female.
The advantage of this dummification is its interpretability in regression models. Since level a corresponds to \(0\), it can be seen as the reference level to which level b is compared. This is the key point in dummification: set one level as the reference and codify the rest as departures from it with ones.
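This coding can be inspected directly in R. A minimal sketch, assuming the hypothetical gender example from the text (the data vector below is illustrative, not from the original):

```r
# Illustrative example: how R encodes a two-level factor as a dummy variable.
# R orders levels alphabetically by default; we set "male" first so that it
# is the reference level (D = 0), matching the text.
gender <- factor(c("male", "female", "male", "female"),
                 levels = c("male", "female"))
# model.matrix() builds the design matrix a regression would use:
# an intercept column plus the dummy genderfemale (1 if female, 0 if male)
model.matrix(~ gender)
```

The column name genderfemale follows R's convention of pasting the factor name and the non-reference level.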
The previous interpretation translates easily to regression models. Assume that the dummy variable \(D\) is available together with other predictors \(X_1,\ldots,X_k\). Then:
Linear model \[ \mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k,D=d]=\beta_0+\beta_1x_1+\ldots+\beta_kx_k+\beta_{k+1}d. \] The coefficient associated with \(D\) is easily interpretable: \(\beta_{k+1}\) is the increment in the mean of \(Y\) associated with changing \(D=0\) (reference) to \(D=1\), while the rest of the predictors are fixed. In other words, \(\beta_{k+1}\) is the increment in the mean of \(Y\) associated with changing the level of the categorical variable from a to b.
Logistic model \[ \mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k,D=d]=\text{logistic}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k+\beta_{k+1}d). \] We have two interpretations of \(\beta_{k+1}\), either in terms of log-odds or odds:
 \(\beta_{k+1}\) is the additive increment in the log-odds of \(Y\) associated with changing the level of the categorical variable from a (reference, \(D=0\)) to b (\(D=1\)).
 \(e^{\beta_{k+1}}\) is the multiplicative increment in the odds of \(Y\) associated with changing the level of the categorical variable from a (reference, \(D=0\)) to b (\(D=1\)).
R does the dummification automatically (translating a categorical variable \(C\) into its dummy version \(D\)) if it detects that a factor variable is present in the regression model. Let’s see an example of this in linear and logistic regression.
# Load the Boston dataset
library(MASS)
data(Boston)
# Structure of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# chas is a dummy variable measuring if the suburb is close to the river (1)
# or not (0). In this case it is not codified as a factor but as a 0 or 1.
# Summary of a linear model
mod <- lm(medv ~ chas + crim, data = Boston)
summary(mod)
##
## Call:
## lm(formula = medv ~ chas + crim, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.540 -5.421 -1.878 2.575 30.134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.61403 0.41862 56.409 < 2e-16 ***
## chas 5.57772 1.46926 3.796 0.000165 ***
## crim -0.40598 0.04339 -9.358 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.373 on 503 degrees of freedom
## Multiple R-squared: 0.1744, Adjusted R-squared: 0.1712
## F-statistic: 53.14 on 2 and 503 DF, p-value: < 2.2e-16
# The coefficient associated with chas is 5.57772. That means that if the
# suburb is close to the river, the mean of medv increases by 5.57772 units.
# chas is significant (the presence of the river adds valuable information
# for explaining medv)
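To double-check this interpretation, the difference between two predictions that only differ in chas recovers the coefficient. A minimal sketch (refitting the model above so the snippet is self-contained; the crim value is arbitrary):

```r
library(MASS)  # provides the Boston dataset
data(Boston)
mod <- lm(medv ~ chas + crim, data = Boston)
# Two hypothetical suburbs with the same crim, differing only in chas
new <- data.frame(chas = c(0, 1), crim = c(1, 1))
# The difference in predicted means equals the chas coefficient, ~5.578
diff(predict(mod, newdata = new))
```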
# Create a binary response (1 expensive suburb, 0 inexpensive)
Boston$expensive <- Boston$medv > 25
# Summary of a logistic model
mod <- glm(expensive ~ chas + crim, data = Boston, family = "binomial")
summary(mod)
##
## Call:
## glm(formula = expensive ~ chas + crim, family = "binomial", data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.26764 -0.84292 -0.67854 -0.00099 2.87470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.82159 0.12217 -6.725 1.76e-11 ***
## chas 1.04165 0.36962 2.818 0.00483 **
## crim -0.22816 0.05265 -4.333 1.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 513.44 on 503 degrees of freedom
## AIC: 519.44
##
## Number of Fisher Scoring iterations: 6
# The coefficient associated with chas is 1.04165. That means that if the
# suburb is close to the river, the log-odds of expensive increases by 1.04165.
# Alternatively, the odds of expensive increase by a factor of exp(1.04165).
# chas is significant (the presence of the river adds valuable information
# for explaining expensive)
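The odds interpretation can be verified numerically. A minimal sketch (refitting the model above so the snippet runs on its own):

```r
library(MASS)  # provides the Boston dataset
data(Boston)
Boston$expensive <- Boston$medv > 25
mod <- glm(expensive ~ chas + crim, data = Boston, family = "binomial")
# Multiplicative effect on the odds of expensive of being close to the river:
# exp(1.04165), about 2.83
exp(coef(mod)["chas"])
```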
More than two levels
Let’s now see the case with more than two levels, for example, a categorical variable \(C\) with levels a, b and c. If we take a as the reference level, this variable can be represented by two dummy variables: \[ D_1=\left\{\begin{array}{ll}1,&\text{if }C=b,\\0,& \text{if }C\neq b\end{array}\right. \] and \[ D_2=\left\{\begin{array}{ll}1,&\text{if }C=c,\\0,& \text{if }C\neq c.\end{array}\right. \] Then \(C=a\) is represented by \(D_1=D_2=0\), \(C=b\) by \(D_1=1,D_2=0\) and \(C=c\) by \(D_1=0,D_2=1\). The interpretation of regression models in the presence of \(D_1\) and \(D_2\) is very similar to the previous one. For example, in the linear model, the coefficient associated with \(D_1\) gives the increment in the mean of \(Y\) when the category of \(C\) changes from a to b, and the coefficient of \(D_2\) gives the increment in the mean of \(Y\) when it changes from a to c.
In general, if we have a categorical variable with \(J\) levels, then the number of dummy variables required is \(J-1\). Again, R does the dummification automatically for you if it detects that a factor variable is present in the regression model.
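The dummies R builds for a factor can be inspected with contrasts(). A minimal sketch with the three-level Species factor used below:

```r
data(iris)
# One row per level, one column per dummy variable: setosa (the reference
# level) is coded with zeros, while versicolor and virginica each get
# their own dummy, so J = 3 levels yield J - 1 = 2 dummies
contrasts(iris$Species)
```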
# Load the dataset; the factor is in the last column
data(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# Summary of a linear model
mod1 <- lm(Sepal.Length ~ ., data = iris)
summary(mod1)
##
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79424 -0.21874 0.00899 0.20255 0.73103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.17127 0.27979 7.760 1.43e-12 ***
## Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
## Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
## Petal.Width -0.31516 0.15120 -2.084 0.03889 *
## Speciesversicolor -0.72356 0.24017 -3.013 0.00306 **
## Speciesvirginica -1.02350 0.33373 -3.067 0.00258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
## F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
# Speciesversicolor (D1) coefficient: -0.72356. The average change of
# Sepal.Length when the species is versicolor instead of setosa (reference).
# Speciesvirginica (D2) coefficient: -1.02350. The average change of
# Sepal.Length when the species is virginica instead of setosa (reference).
# Both dummy variables are significant
# How to set a different level as reference (versicolor)
iris$Species <- relevel(iris$Species, ref = "versicolor")
# Same fit; only the intercept and the dummy coefficients change
mod2 <- lm(Sepal.Length ~ ., data = iris)
summary(mod2)
##
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79424 -0.21874 0.00899 0.20255 0.73103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.44770 0.28149 5.143 8.68e-07 ***
## Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
## Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
## Petal.Width -0.31516 0.15120 -2.084 0.03889 *
## Speciessetosa 0.72356 0.24017 3.013 0.00306 **
## Speciesvirginica -0.29994 0.11898 -2.521 0.01280 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
## F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
# Speciessetosa (D1) coefficient: 0.72356. The average change of
# Sepal.Length when the species is setosa instead of versicolor (reference).
# Speciesvirginica (D2) coefficient: -0.29994. The average change of
# Sepal.Length when the species is virginica instead of versicolor (reference).
# Both dummy variables are significant
# Confidence intervals for the coefficients
confint(mod2)
## 2.5 % 97.5 %
## (Intercept) 0.8913266 2.00408209
## Sepal.Width 0.3257653 0.66601260
## Petal.Length 0.6937939 0.96469395
## Petal.Width -0.6140049 -0.01630542
## Speciessetosa 0.2488500 1.19827390
## Speciesvirginica -0.5351144 -0.06475727
# The confidence interval for Speciessetosa is entirely positive and the one
# for Speciesvirginica is entirely negative. Therefore, both effects are
# significantly different from zero with respect to the reference (versicolor)
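Changing the reference level does not change the model fit, only how the coefficients are expressed. A minimal sketch verifying the relation between the two fits above:

```r
data(iris)
mod1 <- lm(Sepal.Length ~ ., data = iris)      # reference level: setosa
iris$Species <- relevel(iris$Species, ref = "versicolor")
mod2 <- lm(Sepal.Length ~ ., data = iris)      # reference level: versicolor
# The setosa dummy in mod2 is minus the versicolor dummy in mod1
c(coef(mod2)["Speciessetosa"], -coef(mod1)["Speciesversicolor"])
# Both models produce identical fitted values
all.equal(fitted(mod1), fitted(mod2))
```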
Do not codify a categorical variable as a discrete variable. This constitutes a major methodological error that will flaw the subsequent statistical analysis.
For example, if you have a categorical variable party with levels partyA, partyB and partyC, do not encode it as a discrete variable taking the values 1, 2 and 3, respectively. If you do so:
 You implicitly assume an order in the levels of party, since partyA is closer to partyB than to partyC.
 You implicitly assume that partyC is three times larger than partyA.
 The codification is completely arbitrary: why not consider 1, 1.5 and 1.75 instead?
The right way of dealing with categorical variables in regression is to set the variable as a factor and let R do the dummification internally.
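The difference between the two encodings can be seen in a small simulation; the party variable below is hypothetical, generated just for illustration:

```r
set.seed(123456)
# Hypothetical categorical predictor and a response that only depends on partyC
party <- sample(c("partyA", "partyB", "partyC"), size = 200, replace = TRUE)
y <- rnorm(200) + 2 * (party == "partyC")
# Wrong: a single slope forces ordered, equally spaced effects on the levels
wrong <- lm(y ~ as.numeric(factor(party)))
# Right: one free coefficient per non-reference level
right <- lm(y ~ factor(party))
length(coef(wrong))  # 2 coefficients (intercept + slope)
length(coef(right))  # 3 coefficients (intercept + two dummies)
```

The wrongly coded model cannot represent an effect confined to partyC alone, whatever numbers are assigned to the levels; the factor coding can.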