Interpreting with categorical variables

Height, weight and gender

Data are available detailing the weight and height for 10000 men and women. We want to compare the average weight of men and women. The data are plotted below.

We can see that the median weight of females lies below the median weight of males in this sample of data. We can fit a linear regression model of the form

\[y_i = \alpha + \beta I_{x_i}+ \epsilon_i, \quad \quad i=1,\dots,n,\] where \(y_i\) is the weight of person \(i\) and \[ I_{x_i}= \begin{cases} 1 \hspace{0.5cm} \mbox{if } x_i=1 \\ 0 \hspace{0.5cm} \mbox{if } x_i=0 \end{cases} \] for some coding of gender. Assume here that \[ I_{x_i}= \begin{cases} 1 \hspace{0.5cm} \mbox{if person } i \mbox{ is male} \\ 0 \hspace{0.5cm} \mbox{if person } i \mbox{ is female}. \end{cases} \] Again we know how to estimate \(\alpha\) and \(\beta\) using the usual formulae. Suppose we found

\[\begin{eqnarray*} \hat{\alpha}&=&135.86\\ \hat{\beta}&=&51.16. \end{eqnarray*}\] Consequently, if observation \(i\) is male, then \[E(\mbox{Weight}_i|\mbox{Male})=135.86 + 51.16 = 187.02.\] If observation \(i\) is female, then \[E(\mbox{Weight}_i|\mbox{Female})=135.86.\] That is, the estimated expected weight of a male is 187.02 lbs and the estimated expected weight of a female is 135.86 lbs. In other words, we expect an average male to be about 51 lbs heavier than an average female.
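
A short least squares argument shows why the coefficients line up with group averages. This is only a sketch, assuming \(\hat{\alpha}\) and \(\hat{\beta}\) are the ordinary least squares estimates; the symbols \(S\), \(\bar{y}_F\) and \(\bar{y}_M\) (the sample mean weights of females and males) are notation introduced here for illustration. With a single indicator the sum of squares splits into one term per group, \[S(\alpha,\beta) = \sum_{i:\, \text{female}} (y_i - \alpha)^2 + \sum_{i:\, \text{male}} (y_i - \alpha - \beta)^2,\] and each term is minimised by setting the fitted value equal to that group's sample mean, so that \[\hat{\alpha} = \bar{y}_F, \qquad \hat{\alpha} + \hat{\beta} = \bar{y}_M, \qquad \hat{\beta} = \bar{y}_M - \bar{y}_F.\] Under that assumption, 135.86 would be the sample mean weight of the females and 51.16 the difference between the male and female sample means.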

Month

Consider the following Month dataset, in which Month is coded 1 for January, 2 for February and 3 for March:

Observation Response Month
1 0.690 1
2 1.028 3
3 0.507 2
4 1.689 3
5 -1.800 1
6 2.966 2
7 0.653 1
8 3.423 3
9 1.476 2
10 0.540 3

We wish to fit the regression model

\[Y_i = \alpha + \beta_1 I_{\text{Feb}_i} + \beta_2 I_{\text{Mar}_i} + \epsilon_i, \quad \quad i=1,\dots,n,\] for response variable \(Y\), where \[ I_{\text{Mar}_i}= \begin{cases} 1 \hspace{0.5cm} \mbox{if observation } i \mbox{ was recorded in March} \\ 0 \hspace{0.5cm} \mbox{if observation } i \mbox{ was not recorded in March} \end{cases} \] \[ I_{\text{Feb}_i}= \begin{cases} 1 \hspace{0.5cm} \mbox{if observation } i \mbox{ was recorded in February} \\ 0 \hspace{0.5cm} \mbox{if observation } i \mbox{ was not recorded in February} \end{cases} \] and the assumptions (to be discussed later) \[\mathrm{E}(\epsilon_i) = 0 \quad \mathrm{and}\quad \mathrm{Var}(\epsilon_i) = \sigma^2.\] Suppose we estimate \[\begin{eqnarray*} \hat{\alpha}&=&-0.152\\ \hat{\beta}_1&=&1.802\\ \hat{\beta}_2&=&1.822. \end{eqnarray*}\]

If observation \(i\) was recorded in January then \[E(Y_i|\mbox{January})=-0.152.\] If observation \(i\) was recorded in February then \[E(Y_i|\mbox{February})=-0.152 + 1.802=1.65.\] If observation \(i\) was recorded in March then \[E(Y_i|\mbox{March})=-0.152 + 1.822=1.67.\]

Consequently, the estimated difference in expected response \(Y\) between February and January is 1.802, and the estimated difference between March and January is 1.822. We can then infer the estimated difference between March and February as 1.822 - 1.802 = 0.020.
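
The estimates quoted for the January baseline can be reproduced from the ten observations in the table. The following is a minimal sketch, assuming the coefficients are ordinary least squares estimates and that Month is coded 1, 2, 3 for January, February, March; it uses NumPy's `np.linalg.lstsq`, and all variable names are illustrative rather than part of the original notes.

```python
import numpy as np

# Month dataset from the table (Response, Month), in observation order.
response = np.array([0.690, 1.028, 0.507, 1.689, -1.800,
                     2.966, 0.653, 3.423, 1.476, 0.540])
month = np.array([1, 3, 2, 3, 1, 2, 1, 3, 2, 3])  # assumed 1=Jan, 2=Feb, 3=Mar

# Design matrix: intercept, February indicator, March indicator.
# January has no indicator, so it acts as the baseline category.
X = np.column_stack([
    np.ones(len(response)),
    (month == 2).astype(float),
    (month == 3).astype(float),
])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, response, rcond=None)
alpha_hat, beta1_hat, beta2_hat = coef

# Sample means per month, for comparison with the fitted coefficients.
jan_mean = response[month == 1].mean()
feb_mean = response[month == 2].mean()
mar_mean = response[month == 3].mean()

print("alpha, beta1, beta2:", np.round(coef, 3))  # approx -0.152, 1.802, 1.822
print("Feb - Jan mean:", round(feb_mean - jan_mean, 3))
print("Mar - Jan mean:", round(mar_mean - jan_mean, 3))
```

The intercept matches the January sample mean, and each slope matches the difference between that month's sample mean and the January sample mean, which is the sense in which the coefficients are read relative to the baseline.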

Notice that January was not chosen as the baseline category for any particular reason, and we can change this. Let February now be the baseline category, so that we wish to fit the regression model

\[Y_i = \alpha' + \beta'_1 I_{\text{Jan}_i} + \beta'_2 I_{\text{Mar}_i} + \epsilon_i, \quad \quad i=1,\dots,n,\] for response variable \(Y\), where \[ I_{\text{Jan}_i}= \begin{cases} 1 \hspace{0.5cm} \mbox{if observation } i \mbox{ was recorded in January} \\ 0 \hspace{0.5cm} \mbox{if observation } i \mbox{ was not recorded in January} \end{cases} \] \[ I_{\text{Mar}_i}= \begin{cases} 1 \hspace{0.5cm} \mbox{if observation } i \mbox{ was recorded in March} \\ 0 \hspace{0.5cm} \mbox{if observation } i \mbox{ was not recorded in March.} \end{cases} \]

The coefficient estimates will be different to those given previously. Suppose we estimate \[\begin{eqnarray*} \hat{\alpha}'&=&1.650\\ \hat{\beta}'_1&=&-1.802\\ \hat{\beta}'_2&=&0.020.\\ \end{eqnarray*}\]

Regression coefficients for categorical variables should be interpreted with respect to the baseline category. Based on these results, the estimated difference in expected response \(Y\) between January and February is -1.802 (notice the change in sign, since February is now the baseline) and the estimated difference between March and February is 0.020. We can then infer the difference between January and March as -1.802 - 0.020 = -1.822, which agrees with the January-baseline fit, where March exceeded January by 1.822.
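
Changing the baseline alters the coefficients but not the fitted mean for each month. This can be checked numerically with the sketch below, which again assumes ordinary least squares and the coding 1 = January, 2 = February, 3 = March. The helpers `fit_with_baseline` and `fitted_means` are hypothetical names written for this illustration, not part of any library.

```python
import numpy as np

# Month dataset from the table (Response, Month), in observation order.
response = np.array([0.690, 1.028, 0.507, 1.689, -1.800,
                     2.966, 0.653, 3.423, 1.476, 0.540])
month = np.array([1, 3, 2, 3, 1, 2, 1, 3, 2, 3])  # assumed 1=Jan, 2=Feb, 3=Mar


def fit_with_baseline(y, group, baseline):
    """OLS with an intercept and one indicator per non-baseline level."""
    levels = [int(g) for g in np.unique(group) if g != baseline]
    columns = [np.ones(len(y))] + [(group == g).astype(float) for g in levels]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return levels, coef


def fitted_means(levels, coef, baseline):
    """Recover the fitted mean of every level from the coefficients."""
    means = {baseline: coef[0]}  # intercept is the baseline mean
    for g, b in zip(levels, coef[1:]):
        means[g] = coef[0] + b   # baseline mean plus that level's offset
    return means


# February as baseline: coefficients are (intercept, January, March).
levels_feb, coef_feb = fit_with_baseline(response, month, baseline=2)
alpha_p, beta1_p, beta2_p = coef_feb
jan_minus_mar = beta1_p - beta2_p

# January as baseline, for comparison of the fitted means.
levels_jan, coef_jan = fit_with_baseline(response, month, baseline=1)
means_feb = fitted_means(levels_feb, coef_feb, baseline=2)
means_jan = fitted_means(levels_jan, coef_jan, baseline=1)

print("Feb baseline coefficients:", np.round(coef_feb, 3))
print("January minus March:", round(jan_minus_mar, 3))
for g in (1, 2, 3):
    print(g, round(means_jan[g], 3), round(means_feb[g], 3))
```

Both parameterisations give the same three fitted means, so the choice of baseline affects only how the differences are reported, not the model itself.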