Fitting a linear regression in R
The built-in function used to fit a linear regression in R is \(\texttt{lm}\). This function takes two main arguments: the first is the regression model to be fitted, given as a formula, and the second is the data frame. There is a great deal of flexibility in the types of formula we may specify, and we will see many examples throughout this course. For now, suppose we have a data frame \(\texttt{data}\) with columns labelled \(\texttt{x}\) and \(\texttt{y}\).
We may fit a simple linear regression with
\[\texttt{lm(y} \sim \texttt{x, data=data)}.\] This model is equivalent to \(y_i=\alpha + \beta x_i + \epsilon_i\) for \(i=1,\ldots,n\). \(\texttt{lm}\) will produce least squares estimates of \(\alpha\) and \(\beta\), in addition to a host of other estimates and statistics that we will discuss in the following weeks.
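As a minimal sketch of this call, the following simulates a small data set and fits the model. The seed, the sample size and the true values \(\alpha=2\) and \(\beta=3\) are assumptions made purely for illustration; they are not course data.

```r
# Simulated data for illustration only (not course data)
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n)        # true alpha = 2, true beta = 3
data <- data.frame(x = x, y = y)

# Fit y_i = alpha + beta * x_i + epsilon_i by least squares
fit <- lm(y ~ x, data = data)

# Least squares estimates: "(Intercept)" estimates alpha, "x" estimates beta
coef(fit)

# Fuller output, including standard errors and R-squared
summary(fit)
```

The estimates returned by \(\texttt{coef(fit)}\) should lie close to the values used in the simulation, but will not equal them exactly because of the random error term.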
We may fit a simple linear regression through the origin with
\[\texttt{lm(y} \sim \texttt{x-1, data=data)}.\] This model is equivalent to \(y_i=\beta x_i + \epsilon_i\) for \(i=1,\ldots,n\). \(\texttt{lm}\) will produce the least squares estimate of \(\beta\).
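A short sketch of the regression through the origin, again on simulated data (the seed and the true slope of 1.5 are illustrative assumptions). It also checks the fitted slope against the closed-form least squares estimate \(\hat\beta=\sum_i x_i y_i / \sum_i x_i^2\) for this model.

```r
# Simulated data for illustration only (not course data)
set.seed(2)
x <- runif(50, 1, 10)
y <- 1.5 * x + rnorm(50)         # no intercept in the true model
data <- data.frame(x = x, y = y)

# "- 1" removes the intercept; "y ~ 0 + x" is an equivalent formula
fit0 <- lm(y ~ x - 1, data = data)
coef(fit0)                       # a single coefficient, labelled "x"

# Closed-form least squares estimate for the no-intercept model
beta_hat <- sum(data$x * data$y) / sum(data$x^2)
all.equal(unname(coef(fit0)), beta_hat)
```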
Please note that the model fitted by this function rests on a range of assumptions. R will return output whether or not these assumptions hold, so we are required to assess model fit before interpreting this output.
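Checks of model fit are usually based on the residuals and fitted values. As a hedged sketch on simulated data (the seed and true parameter values are illustrative assumptions), these can be extracted from a fitted model as follows.

```r
# Simulated data for illustration only (not course data)
set.seed(3)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)
data <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = data)

# Quantities that checks of model fit are typically based on
res <- residuals(fit)            # observed minus fitted values
fv  <- fitted(fit)               # fitted values

# Five-number summary of the residuals, as reported by summary(fit)
summary(res)

# When the model includes an intercept, least squares residuals sum to zero
sum(res)
```

In an interactive session, \(\texttt{plot(fit, which = 1)}\) draws the residuals against the fitted values, which is one common starting point for assessing fit.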
Height, weight and gender example in R
Continuing with the height, weight and gender example, we will now show how to load in data and fit the three models:
- different lines: \(E(\mbox{Weight}_{ij}) = \alpha_i+\beta_i \mbox{Height}_{ij}\) for \(i=1,2\).
- parallel lines: \(E(\mbox{Weight}_{ij}) = \alpha_i+\beta \mbox{Height}_{ij}\) for \(i=1,2\).
- single line: \(E(\mbox{Weight}_{ij}) = \alpha+\beta \mbox{Height}_{ij}\) for \(i=1,2\).
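The three models listed above correspond to three formulas. The sketch below fits them to a simulated stand-in for the course data; the data frame name \(\texttt{htwtgen\_sim}\) and every number used to generate it are made up for illustration and are not the course data.

```r
# Simulated stand-in for the course data frame (all values are made up)
set.seed(4)
n <- 200
Gender <- factor(rep(c("Female", "Male"), each = n / 2))
Height <- rnorm(n, mean = ifelse(Gender == "Male", 69, 64), sd = 3)
Weight <- -245 + 6 * Height + 19 * (Gender == "Male") + rnorm(n, sd = 10)
htwtgen_sim <- data.frame(Height, Weight, Gender)

# The first factor level is the baseline category used by lm
levels(htwtgen_sim$Gender)

# Different lines: separate intercept and slope for each gender
fit_different <- lm(Weight ~ Height * Gender, data = htwtgen_sim)
# Parallel lines: separate intercepts, common slope
fit_parallel  <- lm(Weight ~ Height + Gender, data = htwtgen_sim)
# Single line: common intercept and slope
fit_single    <- lm(Weight ~ Height, data = htwtgen_sim)

# Number of estimated regression coefficients in each model
sapply(list(different = fit_different,
            parallel  = fit_parallel,
            single    = fit_single),
       function(f) length(coef(f)))
```

The coefficient counts match the number of parameters in each model: four for different lines, three for parallel lines and two for a single line.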
Taking each of these models in turn, we can now interpret the estimated regression coefficients.
Different lines
We are fitting the model \(E(\mbox{Weight}_{ij}) = \alpha_i+\beta_i \mbox{Height}_{ij}\) for \(i=1,2\) and we want estimates of the parameters \(\boldsymbol{\beta} = \left( \begin{array}{c} \alpha_1 \\ \beta_1\\ \alpha_2\\ \beta_2 \end{array} \right)\). In order to fit this model with an interaction between height and gender, we use the syntax \(\texttt{*}\) with
\[\texttt{lm(Weight} \sim \texttt{Height*Gender, data=htwtgen)}.\]
##
## Call:
## lm(formula = Weight ~ Height * Gender, data = htwtgen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.194 -6.796 -0.118 6.814 35.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -246.01327 3.34973 -73.443 < 2e-16 ***
## Height 5.99405 0.05253 114.103 < 2e-16 ***
## GenderMale 21.51443 4.78534 4.496 7.01e-06 ***
## Height:GenderMale -0.03227 0.07216 -0.447 0.655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.01 on 9996 degrees of freedom
## Multiple R-squared: 0.9028, Adjusted R-squared: 0.9027
## F-statistic: 3.093e+04 on 3 and 9996 DF, p-value: < 2.2e-16
The variable Gender is binary, so we need to identify the baseline category used. In the R output, the coefficients involving Gender are labelled GenderMale. This indicates that Male is compared to Female, with Female as the baseline category. From this fitted model, we may infer
\[\begin{eqnarray*} E(\mbox{Weight}|\mbox{Male}, \mbox{Height}=x)&=&-246.01327 + 5.99405x + 21.51443 - 0.03227x\\ &=&-224.4988 + 5.96178x\\ \\ E(\mbox{Weight}|\mbox{Female}, \mbox{Height}=x) &=& -246.01327 + 5.99405x\\ \end{eqnarray*}\]
For now, notice that the two regression lines have different slopes and intercept terms.
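The arithmetic behind the Male line can be checked directly. In this sketch the estimates are copied by hand from the output above; the variable names are chosen here for illustration.

```r
# Estimates copied from the R output for Weight ~ Height * Gender
b <- c("(Intercept)"       = -246.01327,
       "Height"            =    5.99405,
       "GenderMale"        =   21.51443,
       "Height:GenderMale" =   -0.03227)

# Female is the baseline: intercept and slope are read off directly
female_intercept <- b[["(Intercept)"]]
female_slope     <- b[["Height"]]

# Male: add the GenderMale shift to the intercept and the
# Height:GenderMale interaction to the slope
male_intercept <- b[["(Intercept)"]] + b[["GenderMale"]]
male_slope     <- b[["Height"]] + b[["Height:GenderMale"]]

c(male_intercept = male_intercept, male_slope = male_slope)
```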
Parallel lines
We are fitting the model \(E(\mbox{Weight}_{ij}) = \alpha_i+\beta \mbox{Height}_{ij}\) for \(i=1,2\) and we want estimates of the parameters \(\boldsymbol{\beta} = \left( \begin{array}{c} \alpha_1 \\ \alpha_2\\ \beta \end{array} \right)\). In order to fit this model with height and gender but no interaction, we use the syntax \(\texttt{+}\) with
\[\texttt{lm(Weight} \sim \texttt{Height+Gender, data=htwtgen)}.\]
##
## Call:
## lm(formula = Weight ~ Height + Gender, data = htwtgen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.167 -6.786 -0.118 6.800 35.850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -244.92350 2.29862 -106.55 <2e-16 ***
## Height 5.97694 0.03601 165.97 <2e-16 ***
## GenderMale 19.37771 0.27710 69.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.01 on 9997 degrees of freedom
## Multiple R-squared: 0.9027, Adjusted R-squared: 0.9027
## F-statistic: 4.64e+04 on 2 and 9997 DF, p-value: < 2.2e-16
\[\begin{eqnarray*} E(\mbox{Weight}|\mbox{Male}, \mbox{Height}=x)&=&-244.92350 + 5.97694x + 19.37771\\ &=&-225.5458 + 5.97694x\\ \\ E(\mbox{Weight}|\mbox{Female}, \mbox{Height}=x) &=& -244.92350 + 5.97694x \end{eqnarray*}\]
For now, notice that the two regression lines have the same slope and different intercept terms.
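Because the slope is shared, the vertical gap between the two lines is the same at every height and equals the GenderMale estimate. The sketch below illustrates this with the estimates copied from the output above; the function name and the heights chosen are arbitrary and purely illustrative.

```r
# Estimates copied from the R output for Weight ~ Height + Gender
b <- c("(Intercept)" = -244.92350,
       "Height"      =    5.97694,
       "GenderMale"  =   19.37771)

# Expected weight at a given height; male = 1 for Male, 0 for Female
expected_weight <- function(height, male) {
  b[["(Intercept)"]] + b[["Height"]] * height + b[["GenderMale"]] * male
}

# Arbitrary illustrative heights
heights <- c(60, 65, 70)

# Male minus Female expected weight at each height
gap <- expected_weight(heights, male = 1) - expected_weight(heights, male = 0)
gap
```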
Single line
We are fitting the model \(E(\mbox{Weight}_{ij}) = \alpha+\beta \mbox{Height}_{ij}\) for \(i=1,2\) and we want estimates of the parameters \(\boldsymbol{\beta} = \left( \begin{array}{c} \alpha \\ \beta \end{array} \right)\). In order to fit this simple linear regression model with height, we use
\[\texttt{lm(Weight} \sim \texttt{Height, data=htwtgen)}.\]
##
## Call:
## lm(formula = Weight ~ Height, data = htwtgen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.934 -8.236 -0.119 8.260 46.844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -350.73719 2.11149 -166.1 <2e-16 ***
## Height 7.71729 0.03176 243.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.22 on 9998 degrees of freedom
## Multiple R-squared: 0.8552, Adjusted R-squared: 0.8552
## F-statistic: 5.904e+04 on 1 and 9998 DF, p-value: < 2.2e-16
\[\begin{eqnarray*} E(\mbox{Weight}|\mbox{Male}, \mbox{Height}=x)&=&-350.73719 + 7.71729x \\ \\ E(\mbox{Weight}|\mbox{Female}, \mbox{Height}=x) &=& -350.73719 + 7.71729x \end{eqnarray*}\]
Here the relationship between Weight and Height does not depend on Gender.