Interpreting multiple regression

Now let’s consider an example of a multiple linear regression with two explanatory variables, one categorical and the other continuous.

Height, weight and gender

Data are available detailing the weight and height for 10000 men and women. We want to compare the average height of men and women in addition to investigating the relationship between weight and height. In particular, does the relationship between weight and height depend on gender? The data are plotted below.

We can see that as weight increases, height also increases. In general, the green points (males) lie above the red points (females), suggesting that males have, on average, higher weight and height. In summary:

  • Weight increases with height;

  • Males have higher heights than females;

  • Males have higher weights than females;

  • The relationship between height and weight appears to be the same for males and females. In other words, looking at each gender separately, weight increases with height at a similar rate. The two solid fitted lines also suggest this, since they seem to have very similar gradients.

We can fit a linear regression model of the form

\[Y_i = \alpha + \beta I_{x_{1i}}+ \gamma x_{2i} + \epsilon_i, \quad \quad i=1,\dots,n.\] where \(Y_i\) is the weight of person \(i\), \(x_{2i}\) is the height of person \(i\) and \[ I_{x_{1i}}= \begin{cases} 1 \hspace{0.5cm} \mbox{if person } i \mbox{ is male } \\ 0 \hspace{0.5cm} \mbox{if person } i \mbox{ is female } \end{cases} \] Therefore we have two explanatory variables \(x_1\) and \(x_2\) corresponding to gender and height. The coefficients \(\beta\) and \(\gamma\) are interpreted differently because \(x_1\) is a binary variable and \(x_2\) is a continuous variable: \(\beta\) is the expected difference in weight between a male and a female of the same height, while \(\gamma\) is the expected change in weight for a one-unit increase in height within either gender. Suppose we estimate

\[\begin{eqnarray*} \hat{\alpha}&=&-244.924\\ \hat{\beta}&=&19.378\\ \hat{\gamma}&=&5.977\\ \end{eqnarray*}\]
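The text does not say how these estimates were obtained. As a minimal sketch, assuming ordinary least squares is the fitting method, the code below builds a design matrix with columns for the intercept, the male indicator and height, and solves the normal equations in plain Python. The six rows in `people` are hypothetical toy values, not the real data set, and the weights are generated without noise from the quoted estimates, so the fit should simply recover those three numbers. The helper names `solve_linear_system` and `fit_ols` are illustrative.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(design, y):
    """Least squares estimates from the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy rows (is_male, height in inches); NOT the real data set.
people = [(1, 70.0), (1, 68.0), (1, 72.0), (0, 63.0), (0, 65.0), (0, 61.0)]

# Noise-free weights generated from the estimates quoted in the text.
quoted = (-244.924, 19.378, 5.977)
weights = [quoted[0] + quoted[1] * male + quoted[2] * height
           for male, height in people]

# Design matrix columns: intercept, male indicator, height.
design = [[1.0, float(male), height] for male, height in people]

alpha_hat, beta_hat, gamma_hat = fit_ols(design, weights)
print(alpha_hat, beta_hat, gamma_hat)
```

With real, noisy data the recovered values would not match any preset numbers exactly, and a statistics package would normally be used in place of hand-written normal equations.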

In this dataset, the average height was 66.37 inches. Therefore, for a male of height 66.37 inches, we would predict a weight of

\[E(\mbox{Weight}_i|\mbox{Male}, \mbox{Height}=66.37)=-244.924 +19.378 + (5.977 \times66.37)=171.15\]

Similarly, for a female of height 66.37 inches, we would predict a weight of

\[E(\mbox{Weight}_i|\mbox{Female}, \mbox{Height}=66.37)=-244.924 + (5.977 \times66.37)=151.77\]
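The two predictions above can be checked with a few lines of Python. This is a small sketch using only the coefficient estimates and the average height quoted in the text; the function name `predict_weight` is illustrative.

```python
# Coefficient estimates quoted in the text.
ALPHA_HAT = -244.924  # intercept (female baseline)
BETA_HAT = 19.378     # shift in the intercept for males
GAMMA_HAT = 5.977     # change in weight per inch of height


def predict_weight(is_male: bool, height: float) -> float:
    """Fitted weight from alpha + beta * I(male) + gamma * height."""
    indicator = 1 if is_male else 0
    return ALPHA_HAT + BETA_HAT * indicator + GAMMA_HAT * height


mean_height = 66.37  # average height in inches, as quoted in the text
male_pred = predict_weight(True, mean_height)
female_pred = predict_weight(False, mean_height)

print(round(male_pred, 2))    # 171.15
print(round(female_pred, 2))  # 151.77
```

The difference between the two predictions is exactly \(\hat{\beta}=19.378\), since the height term is the same in both.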

Notice that these estimates differ from those obtained in our previous model, where we modelled weight using gender only. There, the expected weight of a male was 187.02 and the expected weight of a female was 135.86. Since we know height has an effect on weight, because taller people tend to weigh more, including height in the model in addition to gender gives different estimates of weight. We will return to the question of whether this model is ‘better’ than the previous model (with just gender) in future lectures.

Let’s now see how these parameter estimates relate to the plot of the data. We have height plotted against weight, and each point in the scatterplot corresponds to one person in the data set. Each point is coloured according to gender: there is a red cloud of data points corresponding to females and a green cloud corresponding to males. There are also two fitted lines, one for males in green and one for females in red. Notice that each line needs its own intercept term and slope term. We can see from this plot that the lines appear to have the same slope but different intercepts, since they run parallel with the red line lying entirely below the green line. We actually already know the slopes and intercepts from our fitted model:

\[\begin{eqnarray*} E(\mbox{Weight}|\mbox{Male}, \mbox{Height}=x)&=&-244.924 +19.378 + 5.977x\\ &=&-225.546 + 5.977x\\ \\ E(\mbox{Weight}|\mbox{Female}, \mbox{Height}=x) &=& -244.924 + 5.977x \end{eqnarray*}\]

Both lines have a slope of 5.977. The line corresponding to males has an intercept term of -225.546 and the line corresponding to females has an intercept term of -244.924, so the two intercepts differ by \(\hat{\beta}=19.378\).
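To see what parallel lines mean numerically, the sketch below evaluates both fitted lines at a few heights and checks that the vertical gap between them stays the same. The heights in `heights` are arbitrary illustrative values rather than observations from the data set, and the function names are hypothetical.

```python
# Coefficient estimates quoted in the text.
ALPHA_HAT, BETA_HAT, GAMMA_HAT = -244.924, 19.378, 5.977

male_intercept = ALPHA_HAT + BETA_HAT  # approximately -225.546
female_intercept = ALPHA_HAT           # -244.924


def male_line(height: float) -> float:
    """Fitted weight for a male of the given height."""
    return male_intercept + GAMMA_HAT * height


def female_line(height: float) -> float:
    """Fitted weight for a female of the given height."""
    return female_intercept + GAMMA_HAT * height


# Arbitrary illustrative heights in inches (not taken from the data).
heights = [60.0, 66.37, 72.0]
gaps = [male_line(h) - female_line(h) for h in heights]

for h, gap in zip(heights, gaps):
    print(h, round(gap, 3))
```

Each gap equals \(\hat{\beta}\) up to floating-point rounding, because the shared slope term cancels in the difference.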

Alternative notation

This model was defined as

\[Y_i = \alpha + \beta I_{x_{1i}}+ \gamma x_{2i} + \epsilon_i, \quad \quad i=1,\dots,n.\] where \(Y_i\) is the weight of person \(i\), \(x_{2i}\) is the height of person \(i\) and \[ I_{x_{1i}}= \begin{cases} 1 \hspace{0.5cm} \mbox{if person } i \mbox{ is male } \\ 0 \hspace{0.5cm} \mbox{if person } i \mbox{ is female } \end{cases} \]

corresponding to two regression lines, one for males and one for females. The intercept for males was estimated as \(\hat{\alpha} + \hat{\beta}\) and the intercept for females was estimated as \(\hat{\alpha}\). Both lines have an estimated slope of \(\hat{\gamma}\). In other words, we can write this more generally as

\[Y_{ij} = \alpha'_{i} + \gamma x_{2ij} + \epsilon_{ij}\] where \(\alpha'_{\mbox{Male}}=\alpha + \beta\) and \(\alpha'_{\mbox{Female}}=\alpha\).

In this case, we now have a double subscript \(ij\). Typically the subscript \(i\) denotes the grouping factor, which here is male or female, and the subscript \(j\) denotes a person within that group. For example, \(Y_{15}\) would indicate the observed value of variable \(Y\) from the 5th person in group 1.
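The two notations describe the same fitted model. As a small sketch, the code below stores the group-specific intercepts \(\alpha'_i\) in a dictionary keyed by group and confirms that predictions agree with the indicator form. The names `alpha_prime`, `predict_indicator_form` and `predict_group_form` are illustrative, and the coefficient values are the estimates quoted earlier.

```python
# Coefficient estimates quoted in the text.
ALPHA_HAT, BETA_HAT, GAMMA_HAT = -244.924, 19.378, 5.977

# Group-specific intercepts alpha'_i from the alternative notation.
alpha_prime = {
    "Male": ALPHA_HAT + BETA_HAT,
    "Female": ALPHA_HAT,
}


def predict_indicator_form(gender: str, height: float) -> float:
    """alpha + beta * I(male) + gamma * height."""
    indicator = 1 if gender == "Male" else 0
    return ALPHA_HAT + BETA_HAT * indicator + GAMMA_HAT * height


def predict_group_form(gender: str, height: float) -> float:
    """alpha'_i + gamma * height, with i the gender group."""
    return alpha_prime[gender] + GAMMA_HAT * height


results = {}
for gender in ("Male", "Female"):
    a = predict_indicator_form(gender, 66.37)
    b = predict_group_form(gender, 66.37)
    results[gender] = (a, b)
    print(gender, round(a, 2), round(b, 2))
```

Indexing the intercept by group extends naturally to a grouping factor with more than two levels, where a single 0/1 indicator would no longer be enough.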