Categorical Variables

A categorical variable is a variable with two or more categories. For example:

yes, no or undecided;
infected, not infected or susceptible;
high, medium, or low;
January, February, March, $\ldots$ , December;
Ethnic group;
Species.

For now, let’s assume that the levels (also called groups or categories) of a categorical variable have no natural ordering. Notice that binary variables are just a special case of categorical variables with exactly two levels. Typically we are interested in comparing categories. For example, are there ethnic differences or seasonal differences? Consider again the government election scenario, where we are asked to vote for party A or party B, but suppose now that we are undecided. We could expand this variable to have three levels, namely A, B and undecided. We may then wish to investigate differences between voters who are undecided, those who vote A and those who vote B.

Categorical variables in a regression setting

In order to incorporate a categorical variable into a regression model, we need to assign a numerical code to each level of the categorical variable. For example:

yes, no or undecided may be coded as yes=1, no=0 and undecided=2;
infected, not infected or susceptible may be coded as infected=1, not infected=0 and susceptible=-1;
high, medium, or low may be coded as high=1, medium=2 and low=3;
January, February, March, $\ldots$ , December may be coded as January=1, February=2, March=3, $\ldots$ , December=12.
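As a minimal sketch of such a coding, the mapping from labels to numbers can be stored in a lookup table. The names `yes_codes`, `encode` and `decode`, and the list `answers`, are illustrative assumptions and not part of the original notes.

```python
# Coding scheme copied from the first example above.
yes_codes = {"yes": 1, "no": 0, "undecided": 2}

def encode(labels, codes):
    """Replace each category label with its numerical code."""
    return [codes[label] for label in labels]

def decode(values, codes):
    """Recover the category labels from their numerical codes."""
    labels = {code: label for label, code in codes.items()}
    return [labels[value] for value in values]

# Illustrative answers (hypothetical, for demonstration only).
answers = ["no", "undecided", "yes", "yes"]
coded = encode(answers, yes_codes)

print(coded)                     # [0, 2, 1, 1]
print(decode(coded, yes_codes))  # ['no', 'undecided', 'yes', 'yes']
```

Keeping the lookup table next to the data makes it possible to translate the codes back into labels, since the numbers carry no meaning by themselves.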

Consider the dataset

Observation	Response	Yes	High	Month
1	0.690	0	1	1
2	1.028	2	1	3
3	0.507	1	3	2
4	1.689	1	2	3
5	-1.800	1	1	1
6	2.966	0	2	2
7	0.653	2	2	1
8	3.423	2	3	3
9	1.476	0	1	2
10	0.540	1	3	3

For each observation, we have a data entry for the continuous response variable Response and for the variables Yes, High and Month. As in the binary case, the numerical entries in each cell have little meaning on their own. For instance, the meaning of Month=3 may not be entirely clear. We need to refer back to our initial coding of this variable to see that Month=3 tells us this entry was observed in March. With this in mind, we could equivalently use different numerical codes for each of these categorical variables.

We can assess the relationship between Response and Month again using boxplots.

We can see there may be a difference between January and February, and between January and March. There does not seem to be any difference between February and March.
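A numerical companion to the boxplots is to compute the mean response within each month. The sketch below uses the ten observations from the dataset above; the variable names are assumptions made for illustration, and with only three or four observations per month the summaries should be read with caution.

```python
from collections import defaultdict
from statistics import mean

# Response and Month columns from the dataset above.
response = [0.690, 1.028, 0.507, 1.689, -1.800, 2.966, 0.653, 3.423, 1.476, 0.540]
month = [1, 3, 2, 3, 1, 2, 1, 3, 2, 3]
month_names = {1: "January", 2: "February", 3: "March"}

# Collect the responses belonging to each month.
groups = defaultdict(list)
for y, m in zip(response, month):
    groups[month_names[m]].append(y)

# Mean response per month.
group_means = {name: mean(values) for name, values in groups.items()}

for name in ("January", "February", "March"):
    print(f"{name}: n={len(groups[name])}, mean={group_means[name]:.3f}")
```

The sample means come out at roughly -0.152 for January, 1.650 for February and 1.670 for March, which agrees with the visual impression that January differs from the other two months while February and March are similar.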

Specifying a regression model with categorical variables

Categorical variables with more than two levels require more attention in a regression analysis. Consider the previous example using Month as a categorical variable. We are performing more than one comparison. In fact, we are performing three comparisons:

January and February;
January and March;
March and February.
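The count of three follows because every unordered pair of levels gives one comparison, so a variable with k levels has k(k-1)/2 pairwise comparisons. A small sketch that enumerates them, with illustrative variable names:

```python
from itertools import combinations

months = ["January", "February", "March"]

# Every unordered pair of levels is one possible comparison.
pairs = list(combinations(months, 2))

for first, second in pairs:
    print(f"{first} and {second}")

# For k levels there are k * (k - 1) / 2 pairs.
k = len(months)
print(len(pairs), k * (k - 1) // 2)  # 3 3
```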

We want to be able to reflect all comparisons in a regression model. In order to do this, we need to create a series of variables. For now, let’s suppose we are really interested in the difference between January and the remaining two months, February and March. Call January our baseline category. We can express this model as $Y_i = \alpha + \beta_1 I_{\text{Feb}_i} + \beta_2 I_{\text{Mar}_i} + \epsilon_i, \quad \quad i=1,\dots,n,$ where

$I_{\text{Feb}_i}= \begin{cases} 1 & \text{if observation } i \text{ was recorded in February} \\ 0 & \text{if observation } i \text{ was not recorded in February} \end{cases}$

$I_{\text{Mar}_i}= \begin{cases} 1 & \text{if observation } i \text{ was recorded in March} \\ 0 & \text{if observation } i \text{ was not recorded in March} \end{cases}$

and we assume $\mathrm{E}(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$. In other words, we need to re-structure the data

Observation	Response	Month
1	0.690	1
2	1.028	3
3	0.507	2
4	1.689	3
5	-1.800	1
6	2.966	2
7	0.653	1
8	3.423	3
9	1.476	2
10	0.540	3

such that

Observation	Response	Feb	Mar
1	0.690	0	0
2	1.028	0	1
3	0.507	1	0
4	1.689	0	1
5	-1.800	0	0
6	2.966	1	0
7	0.653	0	0
8	3.423	0	1
9	1.476	1	0
10	0.540	0	1
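The restructuring shown in the table above can be sketched in a few lines. The function name `dummy_code` is a hypothetical helper introduced for illustration; it builds one 0/1 indicator column for every level other than the chosen baseline.

```python
# Month column from the dataset (1 = January, 2 = February, 3 = March).
month = [1, 3, 2, 3, 1, 2, 1, 3, 2, 3]

def dummy_code(values, baseline):
    """Return one 0/1 indicator column per non-baseline level."""
    levels = sorted(set(values) - {baseline})
    return {
        level: [1 if value == level else 0 for value in values]
        for level in levels
    }

# January (code 1) is the baseline, so it gets no column of its own.
dummies = dummy_code(month, baseline=1)
feb, mar = dummies[2], dummies[3]

print(feb)  # [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
print(mar)  # [0, 1, 0, 1, 0, 0, 0, 1, 0, 1]
```

An observation recorded in January has a zero in both columns, which is how the baseline category is represented without a column of its own.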

Variables Feb and Mar are dummy variables. In order to understand differences between months, we need to interpret regression parameters $\beta_1$ and $\beta_2$ .
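To make the roles of $\alpha$, $\beta_1$ and $\beta_2$ concrete, the model can be fitted to the ten observations by ordinary least squares. The following is a minimal sketch in plain Python that solves the normal equations directly; the helper `solve` and all variable names are assumptions made for illustration, and in practice a statistical package would normally be used instead.

```python
# Response, Feb and Mar columns from the restructured dataset.
response = [0.690, 1.028, 0.507, 1.689, -1.800, 2.966, 0.653, 3.423, 1.476, 0.540]
feb = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
mar = [0, 1, 0, 1, 0, 0, 0, 1, 0, 1]

# Design matrix with columns: intercept, Feb indicator, Mar indicator.
X = [[1.0, f, m] for f, m in zip(feb, mar)]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Normal equations: (X'X) coefficients = X'y.
p = len(X[0])
XtX = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
Xty = [sum(row[j] * y for row, y in zip(X, response)) for j in range(p)]

alpha, beta1, beta2 = solve(XtX, Xty)
print(f"alpha={alpha:.3f}, beta1={beta1:.3f}, beta2={beta2:.3f}")
```

For these data the estimates are roughly $\hat\alpha = -0.152$, $\hat\beta_1 = 1.802$ and $\hat\beta_2 = 1.822$. With a single categorical predictor coded this way, the estimated intercept equals the sample mean of the baseline month (January), and each slope equals the difference between that month's sample mean and the January mean.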