Binary variables

A binary variable is a variable with exactly two categories, for example:

yes or no;
infected or not infected;
success or failure;
high or low;
summer or winter.

Typically with binary variables we are interested in comparing the two levels. For instance, do males differ from females, or do infected individuals differ from those who are not infected? In many cases the groups within a binary variable are well defined. In some cases, however, there is more of a grey area. For instance, suppose that in a government election we were asked to vote for party A or party B. Although we would have to choose only one party, we would more than likely agree with some policies of party A and other policies of party B, and so we might vote for a different party from one election to the next.

Binary variables in a regression setting

In order to incorporate a binary variable into a regression model, we need to assign a numerical code to each level of the variable. By level, we simply mean one group within a binary variable. For example:

yes or no may be coded as yes=1 and no=0;
infected or not infected may be coded as infected=1 and not infected=0;
success or failure may be coded as success=5 and failure=67;
male or female may be coded as male=-1 and female=1;
high or low may be coded as high=0 and low=-1;
summer or winter may be coded as summer=-100 and winter=49.
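
The coding step can be sketched in Python as a simple lookup from label to number. The names `codings` and `encode`, and the sample labels passed to them, are illustrative assumptions; only the codes themselves come from the list above.

```python
# Codes copied from the list above; all names here are illustrative.
codings = {
    "Yes": {"yes": 1, "no": 0},
    "Infected": {"infected": 1, "not infected": 0},
    "Success": {"success": 5, "failure": 67},
    "Male": {"male": -1, "female": 1},
    "High": {"high": 0, "low": -1},
    "Summer": {"summer": -100, "winter": 49},
}


def encode(variable, labels):
    """Replace each label of a binary variable by its numeric code."""
    mapping = codings[variable]
    return [mapping[label] for label in labels]


# Hypothetical raw labels, made up purely for illustration.
print(encode("Summer", ["winter", "summer", "winter"]))
print(encode("Success", ["failure", "success"]))
```

Any pair of distinct numbers would serve equally well as codes, which is why the mapping has to be recorded alongside the data.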

Consider the dataset

Observation	Response	Yes	Infected	Success	Male	High	Summer
1	3.697	0	1	5	1	0	49
2	1.557	0	0	5	-1	0	49
3	1.311	1	0	67	-1	-1	-100
4	3.293	1	0	67	1	0	49
5	4.365	1	1	67	1	-1	-100
6	3.769	0	1	67	-1	0	-100
7	0.888	0	0	5	-1	0	-100
8	2.881	1	0	5	1	-1	49
9	1.398	0	0	67	-1	-1	49
10	2.388	1	1	5	-1	0	-100

For each observation, we have an entry for the continuous response variable Response and for each of the variables Yes, Infected, Success, Male, High and Summer. Notice that the numerical entries in each cell are almost meaningless on their own. For instance, a value of Summer=49 is not intuitive: we need to refer back to our initial coding of this variable to see that Summer=49 tells us this entry was observed in winter. With this in mind, we could equivalently use different numerical codes for each of these binary variables. For instance, the following coding is equally valid:

Observation	Response	Yes	Infected	Success	Male	High	Summer
1	3.697	0	1	1	1	0	1
2	1.557	0	0	1	0	0	1
3	1.311	1	0	0	0	1	0
4	3.293	1	0	0	1	0	1
5	4.365	1	1	0	1	1	0
6	3.769	0	1	0	0	0	0
7	0.888	0	0	1	0	0	0
8	2.881	1	0	1	1	1	1
9	1.398	0	0	0	0	1	1
10	2.388	1	1	1	0	0	0
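
As a rough illustration of why the two codings are interchangeable, the sketch below recodes the Success and Summer columns of the first table into the 0/1 codes of the second, then checks that the observations are split into the same two groups either way. The helper names `recode` and `groups` are assumptions made for this example.

```python
# Response, Success and Summer columns as they appear in the first table.
response = [3.697, 1.557, 1.311, 3.293, 4.365, 3.769, 0.888, 2.881, 1.398, 2.388]
success_old = [5, 5, 67, 67, 67, 67, 5, 5, 67, 5]
summer_old = [49, 49, -100, 49, -100, -100, -100, 49, 49, -100]


def recode(values, mapping):
    """Swap each old code for its new code."""
    return [mapping[v] for v in values]


def groups(codes, values):
    """Partition the values by code, discarding the codes themselves."""
    by_code = {}
    for code, value in zip(codes, values):
        by_code.setdefault(code, []).append(value)
    return sorted(by_code.values())


# One-to-one recoding into the 0/1 codes used in the second table.
success_new = recode(success_old, {5: 1, 67: 0})
summer_new = recode(summer_old, {49: 1, -100: 0})

# The split of the responses into two groups does not depend on the codes.
same_split = groups(summer_old, response) == groups(summer_new, response)
print(success_new)
print(summer_new)
print(same_split)
```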

In order to assess the relationship between the continuous response variable and a binary explanatory variable, we may use boxplots. For instance, let’s assess the relationship between Response and Infected.

We can see that observations with Infected=1 have a higher median value of Response than observations with Infected=0 (3.733 against 1.4775 in the dataset above).

Likewise, we can assess the relationship between Response and Summer.

We can see that observations with Summer=1 have a higher median value of Response than observations with Summer=0 (2.881 against 2.388 under the second coding), although in this case the difference is much less extreme.
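
The medians drawn in such boxplots can also be computed directly from the dataset. The sketch below uses Python's standard `statistics` module and the 0/1 codes from the second table; the function name `median_by_level` is an assumption made for this example.

```python
from statistics import median

# Response, Infected and Summer columns from the second table.
response = [3.697, 1.557, 1.311, 3.293, 4.365, 3.769, 0.888, 2.881, 1.398, 2.388]
infected = [1, 0, 0, 0, 1, 1, 0, 0, 0, 1]
summer = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]


def median_by_level(codes, values):
    """Median of the values within each level of a binary variable."""
    return {
        level: median(v for c, v in zip(codes, values) if c == level)
        for level in sorted(set(codes))
    }


infected_medians = median_by_level(infected, response)
summer_medians = median_by_level(summer, response)
print("Infected:", infected_medians)
print("Summer:  ", summer_medians)
```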

Specifying a regression model with categorical variables

Recall from week 1 that regression models can be expressed in several ways. A simple linear regression model with response variable $Y$ and one continuous explanatory variable $x$ can be expressed as $Y_i = \alpha + \beta x_i + \epsilon_i, \quad i=1,\dots,n,$ with the assumptions $\mathrm{E}(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$. Please see later lectures for more on these assumptions.

We can express a linear regression model with response variable $Y$ and one binary explanatory variable $x$ as $Y_i = \alpha + \beta I_{x_i} + \epsilon_i, \quad i=1,\dots,n,$ where $I_{x_i} = \begin{cases} 1 & \mbox{if } x_i=1, \\ 0 & \mbox{if } x_i=0, \end{cases}$ with the same assumptions $\mathrm{E}(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$.
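
Under this model the expected response is $\alpha$ when $x_i=0$ and $\alpha+\beta$ when $x_i=1$, so $\beta$ is the difference in expected response between the two levels. The sketch below illustrates this with Response and Infected from the dataset above. It assumes the model is fitted by ordinary least squares, which this section does not specify, and the function name `fit_simple_regression` is made up for the example.

```python
# Response and Infected columns from the dataset above.
response = [3.697, 1.557, 1.311, 3.293, 4.365, 3.769, 0.888, 2.881, 1.398, 2.388]
infected = [1, 0, 0, 0, 1, 1, 0, 0, 0, 1]


def fit_simple_regression(x, y):
    """Least-squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta


def group_mean(x, y, level):
    """Mean of y over the observations with x equal to the given level."""
    selected = [yi for xi, yi in zip(x, y) if xi == level]
    return sum(selected) / len(selected)


alpha_hat, beta_hat = fit_simple_regression(infected, response)
mean_0 = group_mean(infected, response, 0)
mean_1 = group_mean(infected, response, 1)

# With a 0/1 indicator, the intercept estimate is the mean of the x=0 group
# and the slope estimate is the difference between the two group means.
print(alpha_hat, mean_0)
print(beta_hat, mean_1 - mean_0)
```

With a different pair of codes, such as -1 and 1, the fitted values would be the same but $\alpha$ and $\beta$ would no longer have this direct reading, which is one reason the 0/1 coding is convenient.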