CHAPTER 8 Dummy Variables
So far, we have considered only quantitative variables in our regression models. Although many variables of interest are quantitative, some are qualitative or categorical in nature, which poses a slight problem.
Definition 8.1 A Dummy Variable is a dichotomous variable assuming values of 0 or 1. This is used to indicate whether the observation belongs to a category or not.
A dummy variable may also be called a zero-one variable or simply an indicator variable.
Examples
\(\text{sex}=\begin{cases}0, & \text{if the person is male}\\ 1, & \text{if the person is female}\end{cases}\)
\(\text{survival status}=\begin{cases}0, & \text{if the person did not survive}\\ 1, & \text{if the person survived}\end{cases}\)
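A dummy variable can be created by hand with a simple comparison. A minimal sketch in R (the character vector sex_chr is hypothetical):

sex_chr <- c("male", "female", "female", "male")
sex <- ifelse(sex_chr == "female", 1, 0)  # 1 = female, 0 = male
sex
## [1] 0 1 1 0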
Dummy variable regressors can be used to incorporate qualitative explanatory variables into a linear model, substantially expanding the range of application of regression analysis.
8.1 Dichotomous Independent Variables
In this section, we show an example with a categorical independent variable that has only 2 levels (dichotomous).
An economist wishes to relate the speed with which a particular insurance innovation is adopted (\(Y\)) to the size of the insurance firm (\(X_1\)) and the type of firm (\(X_2\)): stock companies and mutual companies.
firm | months_elapsed | size | firm_type |
---|---|---|---|
1 | 17 | 151 | Mutual |
2 | 26 | 92 | Mutual |
3 | 21 | 175 | Mutual |
4 | 30 | 31 | Mutual |
5 | 22 | 104 | Mutual |
6 | 0 | 277 | Mutual |
7 | 12 | 210 | Mutual |
8 | 19 | 120 | Mutual |
9 | 4 | 290 | Mutual |
10 | 16 | 238 | Mutual |
11 | 28 | 164 | Stock |
12 | 15 | 272 | Stock |
13 | 11 | 295 | Stock |
14 | 38 | 68 | Stock |
15 | 31 | 85 | Stock |
16 | 21 | 224 | Stock |
17 | 20 | 166 | Stock |
18 | 13 | 305 | Stock |
19 | 30 | 124 | Stock |
20 | 14 | 246 | Stock |
Visualizing the dataset, including the type of firm, we have the following:
We could try fitting two different equations: one for stock firms and another for mutual firms. However, it is also possible, and better for interpretation, to have only one regression equation.
Suppose the model employed is \[ Y_i=\beta_0+\beta_1X_1+\beta_2X_2+\varepsilon_i \] where \(X_1\) = size of the firm, and \(X_2 = 1\) if stock and \(X_2 = 0\) if mutual.
If the company is mutual, \(X_2=0\), then \[ E(Y)=\beta_0+\beta_1X_1 + \beta_2(0) = \beta_0+\beta_1X_1 \]
If the company is stock, \(X_2=1\), then \[ E(Y)=\beta_0+\beta_1X_1 + \beta_2(1) = (\beta_0+\beta_2) + \beta_1X_1 \]
Interpretation
- \(\beta_0\) is the intercept of the model: the value of \(E(Y)\) when the size is 0 and the firm is a mutual company.
- \(\beta_1\) is the slope of the model. For every unit increase in size, the mean number of months elapsed changes by \(\beta_1\).
- \(\beta_2\) indicates how much higher or lower the response function for stock firms is than the one for mutual firms. Thus, \(\beta_2\) measures the differential effect of type of firm.
- \(\beta_0+\beta_2\) is the intercept of the line if the type of the firm is stock.
For this type of data, we can still use the lm() function. There is no need to manually create the dummy variables, since R encodes factor (or character) regressors automatically.
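The output below is produced by the following call (the formula and data frame name appear in the Call line of the output):

fit <- lm(months_elapsed ~ size + firm_type, data = insurance)
summary(fit)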
##
## Call:
## lm(formula = months_elapsed ~ size + firm_type, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6915 -1.7036 -0.4385 1.9210 6.3406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.874069 1.813858 18.675 9.15e-13 ***
## size -0.101742 0.008891 -11.443 2.07e-09 ***
## firm_typeStock 8.055469 1.459106 5.521 3.74e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.221 on 17 degrees of freedom
## Multiple R-squared: 0.8951, Adjusted R-squared: 0.8827
## F-statistic: 72.5 on 2 and 17 DF, p-value: 4.765e-09
Visualizing again…
ggplot(insurance, aes(x = size, y = months_elapsed, color = firm_type)) +
  geom_point() +
  theme_bw() +
  geom_abline(intercept = 33.87, slope = -0.1017, color = "red") +
  geom_abline(intercept = 33.87 + 8.0554, slope = -0.1017, color = "blue")
Notice that the slope is the same for both firm types (i.e. lines are parallel).
Why not fit separate regression models for the different categories?
- The model assumes a constant error variance \(\sigma^2\) for each category. Separate models for different categories may produce different estimates of the variance (i.e., different \(MSE\)s).
- The model also assumes equal slopes. The common slope \(\beta_1\) is best estimated by pooling the categories.
- Pooling yields a larger sample size, which means better estimates and inferences.
- It easily allows comparison of the groups. You can even test whether there is a significant difference between groups, which is quantified by the coefficient of the dummy variable.
Interaction Effect
So far, we have only discussed what is called the “parallel slopes model”.
A parallel slopes model is not flexible: it allows for different intercepts but forces a common slope.
Suppose we want to create models in which the slope parameters differ across the values of a specific dummy variable.
To incorporate this into our regression model, we multiply the dummy variable by a regressor and include the product in the model.
Model A: \(Y=\beta_0+\beta_1X+\delta D+\varepsilon\)
Intercepts: \(\beta_0, (\beta_0+\delta)\)
Slope: \(\beta_1\)
Model B: \(Y=\beta_0+\beta_1X+\delta D + \gamma X D + \varepsilon\)
Intercepts: \(\beta_0, (\beta_0+\delta)\)
Slopes: \(\beta_1, (\beta_1+\gamma)\)
Example in R
For illustration, we use the salary dataset.
age | sex | salary |
---|---|---|
21 | male | 20 |
57 | female | 77 |
32 | male | 31 |
25 | female | 15 |
54 | female | 75 |
44 | female | 42 |
63 | male | 174 |
31 | female | 17 |
42 | female | 35 |
26 | male | 17 |
38 | female | 27 |
64 | female | 111 |
48 | female | 62 |
34 | female | 29 |
30 | female | 15 |
55 | female | 77 |
50 | female | 59 |
27 | female | 17 |
31 | female | 21 |
52 | male | 108 |
60 | female | 96 |
40 | male | 37 |
53 | female | 64 |
39 | male | 41 |
52 | male | 101 |
55 | male | 103 |
33 | male | 32 |
22 | male | 16 |
38 | female | 40 |
39 | male | 50 |
51 | female | 61 |
27 | female | 25 |
23 | female | 17 |
42 | female | 29 |
44 | female | 42 |
50 | female | 59 |
39 | female | 40 |
57 | female | 84 |
56 | male | 125 |
46 | female | 39 |
53 | female | 58 |
22 | male | 17 |
49 | female | 61 |
29 | male | 15 |
21 | male | 23 |
57 | female | 91 |
54 | male | 116 |
37 | male | 34 |
65 | female | 110 |
50 | female | 54 |
37 | female | 33 |
24 | male | 27 |
48 | female | 55 |
41 | female | 33 |
49 | female | 64 |
44 | male | 66 |
42 | female | 41 |
53 | female | 66 |
55 | male | 113 |
36 | male | 40 |
61 | male | 157 |
23 | female | 17 |
53 | female | 65 |
31 | male | 30 |
60 | male | 142 |
41 | female | 37 |
34 | male | 29 |
51 | male | 93 |
20 | male | 20 |
63 | female | 104 |
33 | female | 16 |
48 | male | 76 |
50 | male | 93 |
47 | female | 48 |
33 | female | 28 |
50 | male | 96 |
45 | male | 67 |
54 | male | 113 |
61 | female | 96 |
36 | male | 42 |
44 | male | 59 |
42 | female | 47 |
25 | male | 20 |
63 | male | 167 |
45 | male | 75 |
25 | female | 24 |
29 | male | 28 |
65 | male | 186 |
54 | male | 114 |
24 | female | 17 |
44 | female | 49 |
56 | female | 71 |
40 | female | 33 |
28 | male | 17 |
41 | female | 36 |
57 | male | 123 |
57 | male | 135 |
63 | female | 102 |
41 | male | 45 |
41 | male | 48 |
33 | male | 60 |
37 | female | 37 |
29 | female | 27 |
38 | female | 32 |
35 | female | 25 |
37 | female | 32 |
20 | male | 15 |
34 | female | 28 |
38 | male | 61 |
21 | female | 18 |
25 | female | 15 |
29 | female | 18 |
29 | male | 41 |
33 | female | 22 |
25 | male | 36 |
24 | female | 20 |
24 | female | 20 |
35 | male | 59 |
39 | male | 71 |
26 | female | 17 |
29 | female | 22 |
29 | male | 50 |
37 | female | 35 |
25 | female | 21 |
24 | female | 25 |
35 | female | 32 |
27 | female | 22 |
25 | female | 16 |
22 | male | 24 |
24 | male | 21 |
24 | male | 36 |
30 | male | 48 |
34 | male | 58 |
24 | male | 17 |
31 | male | 34 |
29 | male | 45 |
40 | female | 32 |
33 | male | 64 |
33 | female | 30 |
39 | male | 60 |
33 | female | 27 |
38 | male | 60 |
20 | female | 22 |
23 | male | 18 |
20 | female | 16 |
22 | female | 19 |
22 | male | 21 |
22 | female | 17 |
28 | female | 27 |
24 | male | 37 |
Suppose that age and sex interact in predicting salary; that is, the slopes are different.
From the plot, the slope of salary with respect to age is steeper for male employees than for female employees. In other words, the incremental effect of age on salary interacts with sex.
A first-order model with interaction term for our example is:
\[ Y_i=\beta_0+\beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1}X_{i2} + \varepsilon_i \]
The meanings of the parameters are as follows:
 | Intercept | Slope |
---|---|---|
Female | \(\beta_0\) | \(\beta_1\) |
Male | \(\beta_0+\beta_2\) | \(\beta_1+\beta_3\) |
Thus, \(\beta_2\) indicates how much greater (or smaller) is the \(Y\)-intercept for the class coded 1 than that for the class coded 0. \(\beta_3\) indicates how much greater (or smaller) is the slope for the class coded 1 than that coded 0.
In R, to include an interaction term, we simply put a colon : between the variables that interact.
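The output below is produced by the following call (the formula and data frame name appear in the Call line):

fit_int <- lm(salary ~ age + sex + age:sex, data = salary)
summary(fit_int)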
##
## Call:
## lm(formula = salary ~ age + sex + age:sex, data = salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.280 -8.511 -0.446 7.730 38.237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.2726 4.3635 -7.854 7.98e-13 ***
## age 1.9393 0.1064 18.219 < 2e-16 ***
## sexmale -24.6201 6.2706 -3.926 0.000133 ***
## age:sexmale 1.2400 0.1546 8.020 3.14e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.92 on 146 degrees of freedom
## Multiple R-squared: 0.8951, Adjusted R-squared: 0.8929
## F-statistic: 415.1 on 3 and 146 DF, p-value: < 2.2e-16
Visualizing…
ggplot(salary, aes(x = age, y = salary, color = sex)) +
  geom_point() +
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE)
In this dataset, not only do male employees have higher salaries than female employees on average, but the salary gap also widens as age increases.
What does it imply if the interaction effects are significant, but the main dummy variable is not?
The slopes of the two categories are different, but their intercepts are the same.
Many statisticians hold that we should not remove the main dummy variable from the model even if it is insignificant while the interaction effect is significant.
This is mainly because the coefficient of the dummy variable becomes “part” of the intercept coefficient in its interpretation.
In general, you should not remove “lower-order” terms when “higher-order” terms are included in the model.
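In R, a convenient way to respect this hierarchy is the * operator, which expands to the main effects plus their interaction:

# salary ~ age * sex is equivalent to salary ~ age + sex + age:sex
fit <- lm(salary ~ age * sex, data = salary)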
8.2 Polytomous Independent Variables
What if there are more than 2 categories? How do we set up the dummy variables?
Suppose a variable has 4 categories, and we define 4 dummy variables \(X_1, X_2, X_3, X_4\), one per category. Then we have the identity \(X_4=1-X_1-X_2-X_3\).
What does this imply?
Of the 4 variables, only 3 are linearly independent, because \(X_4\) is determined by the values of the other 3 dummy variables. Regression requires that no regressor be an exact linear combination of the others; otherwise, the design matrix is not of full rank (perfect multicollinearity). Therefore, we should only use 3 dummy variables if there are 4 categories.
Remarks
In general, if there are \(m\) categories, define only \(m-1\) dummy variables to indicate membership in a category.
The category that is left out becomes the baseline category (all dummy variables = 0).
The selection of the baseline category is arbitrary! Take note, though, that the baseline is the category against which all other categories are compared.
Thus, it is better to choose the most dominant, the least dominant, the default, or the most familiar category as the baseline.
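In R, the baseline level of a factor can be changed with relevel(). A minimal sketch, reusing the insurance data from earlier (assuming firm_type is stored as a character or factor):

# make Stock the baseline level instead of Mutual
insurance$firm_type <- relevel(factor(insurance$firm_type), ref = "Stock")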
Examples of setting up dummy variables for more than 2 categories
3 categories: Disability Status = not disabled, partly disabled, fully disabled \[ \begin{align} D1&=\begin{cases} 1, & \text{if the person is fully disabled}\\ 0, &\text{otherwise}\end{cases} \\ D2&=\begin{cases} 1, & \text{if the person is partly disabled}\\ 0, &\text{otherwise}\end{cases} \end{align} \]
This means that if \(D1=0\) and \(D2=0\), the person is not disabled.
4 categories:
Year Level of a student = I, II, III, IV \[ D1=\begin{cases} 1, & \text{1st year}\\ 0, &\text{otherwise}\end{cases} \quad D2=\begin{cases} 1, & \text{2nd year}\\ 0, &\text{otherwise}\end{cases} \quad D3=\begin{cases} 1, & \text{3rd year}\\ 0, &\text{otherwise}\end{cases} \] This means that if \(D1=D2=D3=0\), then the year level of the student is 4th year.
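R performs this coding automatically when a factor enters a model. A quick sketch with a hypothetical factor year shows the dummy columns it generates:

year <- factor(c("I", "II", "III", "IV"))
model.matrix(~ year)  # baseline "I"; dummy columns yearII, yearIII, yearIV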
Example
Suppose we want to regress the annual salary of an employee with respect to age and different departments.
There are 3 departments in a company: A, B, and C. The three-category classification can be represented in the regression equation by introducing two dummy variables \(D_1\) and \(D_2\). The regression model is then
\[ Y_i=\beta_0+\beta_1X_{i1} + \gamma_1D_{1i} +\gamma_2D_{2i} + \varepsilon_i \]
where \(D_{1i}=\begin{cases} 1, & \text{if employee } i \text{ is from department B}\\ 0, & \text{otherwise}\end{cases}\), \(D_{2i}=\begin{cases} 1, & \text{if employee } i \text{ is from department C}\\ 0, & \text{otherwise}\end{cases}\)
Age | Department | Salary |
---|---|---|
38 | C | 1204460.92 |
46 | B | 1180508.92 |
29 | C | 777106.73 |
47 | B | 1143911.29 |
49 | C | 1902898.10 |
55 | A | 764687.13 |
30 | C | 921713.04 |
25 | A | 101277.43 |
59 | B | 1591115.51 |
56 | A | 460089.62 |
29 | C | 902450.49 |
32 | B | 631703.95 |
54 | A | 486890.42 |
55 | C | 2334375.51 |
45 | A | 612248.23 |
23 | A | 336320.00 |
49 | B | 1465534.82 |
55 | B | 1474524.73 |
30 | A | 382170.64 |
56 | A | 636563.93 |
39 | B | 882012.01 |
21 | B | 335102.08 |
54 | C | 2296409.62 |
51 | C | 1881404.03 |
33 | B | 592676.04 |
47 | A | 573165.94 |
24 | A | 118265.13 |
37 | A | 524928.31 |
26 | C | 709815.92 |
36 | C | 1400076.61 |
45 | C | 1602196.73 |
20 | B | 309938.20 |
40 | C | 1419015.03 |
31 | B | 456906.17 |
50 | B | 1294530.47 |
42 | B | 1033898.65 |
34 | C | 1121838.34 |
46 | C | 1877019.92 |
34 | A | 466004.30 |
38 | A | 285656.79 |
28 | B | 733866.80 |
26 | A | 112581.20 |
51 | C | 1985142.17 |
44 | C | 1615519.42 |
56 | C | 2296767.73 |
41 | A | 501270.60 |
41 | C | 1392760.27 |
25 | C | 739148.20 |
23 | B | 261178.79 |
32 | C | 725438.76 |
29 | A | 388000.20 |
30 | C | 1042548.17 |
44 | A | 481914.88 |
31 | C | 920542.66 |
38 | B | 979934.75 |
53 | B | 1464490.57 |
42 | B | 1044196.90 |
23 | A | 95223.12 |
21 | C | 398922.95 |
34 | B | 818534.55 |
48 | B | 1030033.39 |
29 | C | 741456.83 |
32 | A | 273066.43 |
45 | B | 1124010.23 |
25 | C | 643024.36 |
41 | B | 810309.68 |
60 | A | 775000.80 |
34 | B | 798142.07 |
51 | C | 1966628.29 |
49 | A | 474773.38 |
51 | C | 1962871.05 |
27 | C | 719235.40 |
48 | C | 1847994.37 |
27 | B | 605715.20 |
47 | B | 1423232.62 |
44 | A | 573616.24 |
26 | A | 221250.84 |
45 | C | 1706019.45 |
30 | A | 229963.24 |
29 | A | 402156.80 |
46 | C | 1760612.05 |
57 | A | 580932.18 |
28 | B | 451140.30 |
34 | A | 345555.70 |
32 | B | 677117.35 |
58 | C | 2415942.88 |
51 | B | 1283166.39 |
47 | B | 1197548.05 |
52 | A | 578386.78 |
55 | C | 2225538.29 |
34 | A | 453169.98 |
32 | C | 1072559.40 |
55 | C | 2240956.18 |
53 | C | 2116580.01 |
42 | A | 667160.81 |
44 | A | 490047.32 |
36 | A | 253704.48 |
42 | A | 484811.94 |
22 | A | 206055.23 |
28 | A | 389616.43 |
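The output below is produced by the following call (the formula and data frame name appear in the Call line):

fit_dept <- lm(Salary ~ Age + Department, data = employee)
summary(fit_dept)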
##
## Call:
## lm(formula = Salary ~ Age + Department, data = employee)
##
## Residuals:
## Min 1Q Median 3Q Max
## -547138 -124745 9666 154759 435911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -871010 84608 -10.295 < 2e-16 ***
## Age 33540 1980 16.941 < 2e-16 ***
## DepartmentB 484539 54491 8.892 3.56e-14 ***
## DepartmentC 972408 51682 18.815 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216900 on 96 degrees of freedom
## Multiple R-squared: 0.8797, Adjusted R-squared: 0.8759
## F-statistic: 233.9 on 3 and 96 DF, p-value: < 2.2e-16
In this example, Department A is the baseline category.
In this model, all parameters are significant, and the \(R^2\) is also good. However, we can still improve the interpretability of the model. Some problems with the model:
- The intercept \(\beta_0\) cannot be meaningfully interpreted, since \(Age=0\) is far outside the range of the data.
- We cannot yet conclude whether salary progression is fastest in Department C and slowest in Department A, because the model forces a common slope for all departments.
Exercise 8.1 Explore how to improve the model for predicting salary in the company based on age and department. Make sure to do the following:

1. Transform age to years_working. Assume that all employees started to work at age 20 (years_working = age - 20). Use this as a predictor instead of age. Now, how do we interpret \(\beta_0\)?
2. Include the interaction of years_working and department to allow different salary slopes per department.
3. Assume that the intercepts of all departments are the same. That is, the average starting salary is the same regardless of department.
4. Show and interpret the results of the model. What is the new \(R^2\)? (A sketch of one possible specification is given after this list.)
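A minimal sketch of one possible specification, not a full solution (the variable years_working is created here; the common intercept comes from omitting the Department main effect):

# years since starting work, assuming everyone started at age 20
employee$years_working <- employee$Age - 20
# common intercept, department-specific slopes
fit_ex <- lm(Salary ~ years_working + years_working:Department, data = employee)
summary(fit_ex)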
The final regression lines should look like this:
Some Notes on Models with Categorical Independent Variables
- For models with more than one qualitative independent variable, we define the appropriate number of dummy variables for each qualitative variable and include them in the model.
- Models in which all independent variables are qualitative are called analysis of variance (ANOVA) models.
- Models containing some quantitative and some qualitative independent variables, where the chief independent variables of interest are qualitative and the quantitative independent variables are introduced primarily to reduce the variance of the error terms, are called analysis of covariance (ANCOVA) models.
8.3 Regime-Switching Models
Rationale: Suppose we want to create a regression model in which the slope and intercept parameters change over different sets of values, or “regimes,” of \(X\).
Illustration
The blue dashed line is the fitted line using the whole dataset. However, we can see that the relationship between \(X\) and \(Y\) changes at some value \(X=b\). That is, \(Y\) can be seen as a piecewise function of \(X\).
To facilitate this, we will simply use the concept of dummy variables.
Two Regimes
Regime 1: \(X\leq b\) , Regime 2: \(X>b\)
Define 1 dummy variable
\[ Y=\beta_0+\beta_1X+\delta R +\gamma XR + \varepsilon \]
where \(R=\begin{cases}1, & X \leq b\\ 0, & X>b\end{cases}\)
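A minimal sketch of fitting this two-regime model in R (the data frame df, its columns x and y, and the breakpoint b are all hypothetical and assumed known):

b <- 50                        # assumed known breakpoint
df$R <- as.numeric(df$x <= b)  # regime indicator: 1 if x <= b, 0 otherwise
fit_regime <- lm(y ~ x + R + x:R, data = df)
summary(fit_regime)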
Three Regimes
Regime 1: \(X\leq b\) , Regime 2: \(b<X\leq c\), Regime 3: \(X>c\)
Define 2 dummy variables
\[ Y=\beta_0+\beta_1X+\delta_1 R_1 + \delta_2R_2 +\gamma_1 XR_1 +\gamma_2XR_2 + \varepsilon \]
where \(R_1=\begin{cases}1, & X \leq b\\ 0, & \text{otherwise}\end{cases}\), \(R_2=\begin{cases}1, & b<X \leq c\\ 0, & \text{otherwise}\end{cases}\)