Fitting a linear regression in R

The built-in function used to fit a linear regression in R is \(\texttt{lm}\). This function takes two main arguments. The first is the regression model to be fitted, or formula, and the second is the data frame. There is a great deal of flexibility in the types of formula we may specify, and we will see many examples throughout this course. For now, suppose we have a data frame \(\texttt{data}\) with columns labelled \(\texttt{y}\) and \(\texttt{x}\).

We may fit a simple linear regression with

\[\texttt{lm(y} \sim \texttt{x, data=data)}.\] This model is equivalent to \(y_i=\alpha + \beta x_i + \epsilon_i\) for \(i=1,\ldots,n\). R will produce least squares estimates of \(\alpha\) and \(\beta\), in addition to a host of other estimates and statistics that we will discuss in the following weeks.
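As a minimal sketch of the call above, the following simulates a small data frame and fits the simple linear regression. The names \(\texttt{data}\), \(\texttt{y}\) and \(\texttt{x}\) follow the text; the simulated values, the seed and the true parameter values (2 and 3) are assumptions made purely for illustration.

```r
# Simulate illustrative data (assumed values, not course data)
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(50, sd = 1)   # true alpha = 2, true beta = 3
data <- data.frame(x = x, y = y)

# Fit y_i = alpha + beta * x_i + epsilon_i by least squares
fit <- lm(y ~ x, data = data)

# Least squares estimates: "(Intercept)" estimates alpha, "x" estimates beta
coef(fit)

# Fuller output: standard errors, t values, R-squared and so on
summary(fit)
```

The estimates should be close to, but not exactly equal to, the values used in the simulation.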

We may fit a simple linear regression through the origin with

\[\texttt{lm(y} \sim \texttt{x-1, data=data)}.\] This model is equivalent to \(y_i=\beta x_i + \epsilon_i\) for \(i=1,\ldots,n\). R will produce the least squares estimate of \(\beta\).
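The following sketch fits a regression through the origin on simulated data and checks the estimate against the closed-form least squares solution \(\hat{\beta}=\sum_i x_i y_i / \sum_i x_i^2\). The simulated values and the object names \(\texttt{fit\_origin}\), \(\texttt{fit\_origin2}\) and \(\texttt{beta\_hat}\) are assumptions for illustration only.

```r
# Simulate illustrative data with no intercept (assumed values)
set.seed(2)
x <- runif(50, min = 0, max = 10)
y <- 3 * x + rnorm(50, sd = 1)       # true beta = 3
data <- data.frame(x = x, y = y)

# Two equivalent ways of removing the intercept from the formula
fit_origin  <- lm(y ~ x - 1, data = data)
fit_origin2 <- lm(y ~ 0 + x, data = data)

# Only one coefficient is estimated: the slope
coef(fit_origin)

# Closed-form least squares estimate for a line through the origin
beta_hat <- sum(data$x * data$y) / sum(data$x^2)
all.equal(unname(coef(fit_origin)), beta_hat)
```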

Please note that this function makes a range of assumptions. R will return output whether or not these assumptions hold, so we are required to assess model fit before interpreting this output.
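The notes do not say at this point which checks are intended, so the following is only a sketch under the assumption that two common residual plots are a reasonable starting point: residuals against fitted values, and a normal Q-Q plot of the residuals. The data are simulated for illustration.

```r
# Simulate illustrative data and fit a simple linear regression
set.seed(3)
x <- runif(50, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(50, sd = 1)
data <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = data)

# Extract residuals and fitted values from the fitted model
res <- residuals(fit)
fitted_vals <- fitted(fit)

# Residuals against fitted values: look for no systematic pattern
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the line suggest roughly normal residuals
qqnorm(res)
qqline(res)
```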

Height, weight and gender example in R

Continuing with the height, weight and gender example, we will now show how to load in data and fit the three models:

  1. different lines: \(E(\mbox{Weight}_{ij}) = \alpha_i+\beta_i \mbox{Height}_{ij}\) for \(i=1,2\).
  2. parallel lines: \(E(\mbox{Weight}_{ij}) = \alpha_i+\beta \mbox{Height}_{ij}\) for \(i=1,2\).
  3. single line: \(E(\mbox{Weight}_{ij}) = \alpha+\beta \mbox{Height}_{ij}\) for \(i=1,2\).
#Load in data
htwtgen <- read.csv(file = "week4/Lecture8/01_heights_weights_genders.csv")

Taking each of these models in turn, we can now interpret the estimated regression coefficients.

Different lines

We are fitting the model \(E(\mbox{Weight}_{ij}) = \alpha_i+\beta_i \mbox{Height}_{ij}\) for \(i=1,2\) and we want estimates of the parameters \(\boldsymbol{\beta} = \left( \begin{array}{c} \alpha_1 \\ \beta_1\\ \alpha_2\\ \beta_2 \end{array} \right)\). In order to fit this model with an interaction between height and gender, we use the syntax \(\texttt{*}\) with

\[\texttt{lm(Weight} \sim \texttt{Height*Gender, data=htwtgen)}.\]

#Model 1: different lines
model1<-lm(Weight~Height*Gender, data=htwtgen)
summary(model1)
## 
## Call:
## lm(formula = Weight ~ Height * Gender, data = htwtgen)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.194  -6.796  -0.118   6.814  35.813 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -246.01327    3.34973 -73.443  < 2e-16 ***
## Height               5.99405    0.05253 114.103  < 2e-16 ***
## GenderMale          21.51443    4.78534   4.496 7.01e-06 ***
## Height:GenderMale   -0.03227    0.07216  -0.447    0.655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.01 on 9996 degrees of freedom
## Multiple R-squared:  0.9028, Adjusted R-squared:  0.9027 
## F-statistic: 3.093e+04 on 3 and 9996 DF,  p-value: < 2.2e-16

The variable Gender is binary and so we need to identify the baseline category used. From the R output, we can see that the Gender terms are labelled with Male (GenderMale and Height:GenderMale). This indicates that Male is compared to Female, with Female as the baseline category. From this fitted model, we may infer

\[\begin{eqnarray*} E(\mbox{Weight}|\mbox{Male}, \mbox{Height}=x)&=&-246.01327 + 5.99405x + 21.51443 - 0.03227x\\ &=&-224.4988 + 5.96178x\\ \\ E(\mbox{Weight}|\mbox{Female}, \mbox{Height}=x) &=& -246.01327 + 5.99405x\\ \end{eqnarray*}\]

For now, notice that the two regression lines have different slopes and intercept terms.
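The baseline category and the two fitted lines can also be recovered in code. The sketch below uses a simulated stand-in for the data set, since the course file is not reproduced here; the data frame \(\texttt{sim}\), its values and the object names are assumptions for illustration. It also checks that the Male line from the interaction model matches a separate regression fitted to the Male rows only.

```r
# Simulated stand-in for htwtgen (assumed values, not the course data)
set.seed(4)
n <- 200
Gender <- rep(c("Female", "Male"), each = n / 2)
Height <- rnorm(n, mean = ifelse(Gender == "Male", 69, 64), sd = 3)
Weight <- -240 + 6 * Height + ifelse(Gender == "Male", 20, 0) +
  rnorm(n, sd = 10)
sim <- data.frame(Gender = Gender, Height = Height, Weight = Weight)

# Factor levels are sorted alphabetically by default,
# so the first level ("Female") is the baseline category
levels(factor(sim$Gender))

# Different lines model: interaction between Height and Gender
fit1 <- lm(Weight ~ Height * Gender, data = sim)
b <- coef(fit1)

# Female (baseline) line uses the intercept and Height terms only
female_line <- c(intercept = b[["(Intercept)"]],
                 slope     = b[["Height"]])

# Male line adds the GenderMale and Height:GenderMale adjustments
male_line <- c(intercept = b[["(Intercept)"]] + b[["GenderMale"]],
               slope     = b[["Height"]] + b[["Height:GenderMale"]])

female_line
male_line

# Same Male line from a regression on the Male rows alone
coef(lm(Weight ~ Height, data = subset(sim, Gender == "Male")))
```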

Parallel lines

We are fitting the model \(E(\mbox{Weight}_{ij}) = \alpha_i+\beta \mbox{Height}_{ij}\) for \(i=1,2\) and we want estimates of the parameters \(\boldsymbol{\beta} = \left( \begin{array}{c} \alpha_1 \\ \alpha_2\\ \beta \end{array} \right)\). In order to fit this model with height and gender, we use the syntax \(\texttt{+}\) with

\[\texttt{lm(Weight} \sim \texttt{Height+Gender, data=htwtgen)}.\]

#Model 2: parallel lines
model2<-lm(Weight~Height+Gender, data=htwtgen)
summary(model2)
## 
## Call:
## lm(formula = Weight ~ Height + Gender, data = htwtgen)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.167  -6.786  -0.118   6.800  35.850 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -244.92350    2.29862 -106.55   <2e-16 ***
## Height         5.97694    0.03601  165.97   <2e-16 ***
## GenderMale    19.37771    0.27710   69.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.01 on 9997 degrees of freedom
## Multiple R-squared:  0.9027, Adjusted R-squared:  0.9027 
## F-statistic: 4.64e+04 on 2 and 9997 DF,  p-value: < 2.2e-16

With Female again the baseline category, we may infer

\[\begin{eqnarray*} E(\mbox{Weight}|\mbox{Male}, \mbox{Height}=x)&=&-244.92350 + 5.97694x + 19.37771\\ &=&-225.5458 + 5.97694x\\ \\ E(\mbox{Weight}|\mbox{Female}, \mbox{Height}=x) &=& -244.92350 + 5.97694x \end{eqnarray*}\]

For now, notice that the two regression lines have the same slope and different intercept terms.
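One way to see the common slope is to compare predictions for a Male and a Female of the same height: under the parallel lines model the gap equals the GenderMale coefficient at every height. The sketch below uses a simulated stand-in for the data set; the data frame \(\texttt{sim}\), the heights 60 and 70, and the object names are assumptions for illustration.

```r
# Simulated stand-in for htwtgen (assumed values, not the course data)
set.seed(5)
n <- 200
Gender <- rep(c("Female", "Male"), each = n / 2)
Height <- rnorm(n, mean = ifelse(Gender == "Male", 69, 64), sd = 3)
Weight <- -240 + 6 * Height + ifelse(Gender == "Male", 20, 0) +
  rnorm(n, sd = 10)
sim <- data.frame(Gender = Gender, Height = Height, Weight = Weight)

# Parallel lines model: common slope, gender-specific intercepts
fit2 <- lm(Weight ~ Height + Gender, data = sim)

# Predict for both genders at two hypothetical heights
new_people <- data.frame(Height = c(60, 60, 70, 70),
                         Gender = c("Female", "Male", "Female", "Male"))
pred <- unname(predict(fit2, newdata = new_people))
cbind(new_people, pred)

# Male minus Female gap at each height, and the GenderMale coefficient
gap_60 <- pred[2] - pred[1]
gap_70 <- pred[4] - pred[3]
c(gap_60 = gap_60, gap_70 = gap_70,
  GenderMale = coef(fit2)[["GenderMale"]])
```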

Single line

We are fitting the model \(E(\mbox{Weight}_{ij}) = \alpha+\beta \mbox{Height}_{ij}\) for \(i=1,2\) and we want estimates of parameters \(\begin{aligned} \boldsymbol{\beta} &=&\left( \begin{array}{c} \alpha \\ \beta \end{array} \right)\end{aligned}\). In order to fit this simple linear regression model with height, we use

\[\texttt{lm(Weight} \sim \texttt{Height, data=htwtgen)}.\]

#Model 3: single line
model3<-lm(Weight~Height, data=htwtgen)
summary(model3)
## 
## Call:
## lm(formula = Weight ~ Height, data = htwtgen)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.934  -8.236  -0.119   8.260  46.844 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -350.73719    2.11149  -166.1   <2e-16 ***
## Height         7.71729    0.03176   243.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.22 on 9998 degrees of freedom
## Multiple R-squared:  0.8552, Adjusted R-squared:  0.8552 
## F-statistic: 5.904e+04 on 1 and 9998 DF,  p-value: < 2.2e-16

From this fitted model, we may infer

\[\begin{eqnarray*} E(\mbox{Weight}|\mbox{Male}, \mbox{Height}=x)&=&-350.73719 + 7.71729x \\ \\ E(\mbox{Weight}|\mbox{Female}, \mbox{Height}=x) &=& -350.73719 + 7.71729x \end{eqnarray*}\]

Here the relationship between Weight and Height does not depend on Gender.
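To compare the three fitted models side by side, the sketch below plugs a single height into each fitted line, using the coefficient estimates printed in the three summaries above. The height value 70 (in whatever units the data set uses) and the object names are assumptions chosen only for illustration.

```r
# Hypothetical height at which to evaluate each fitted line
x0 <- 70

# Model 1 (different lines): Male adds both interaction-model adjustments
m1_female <- -246.01327 + 5.99405 * x0
m1_male   <- (-246.01327 + 21.51443) + (5.99405 - 0.03227) * x0

# Model 2 (parallel lines): Male adds the GenderMale shift only
m2_female <- -244.92350 + 5.97694 * x0
m2_male   <- (-244.92350 + 19.37771) + 5.97694 * x0

# Model 3 (single line): one expected weight regardless of Gender
m3_both <- -350.73719 + 7.71729 * x0

# Collect the expected weights in a small table
expected <- data.frame(
  model  = c("different lines", "parallel lines", "single line"),
  Female = c(m1_female, m2_female, m3_both),
  Male   = c(m1_male, m2_male, m3_both)
)
expected
```

Only the single line model returns the same expected weight for both groups at a given height.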