Chapter 2 Introduction to Linear Models
2.1 Preprocessing Data
This chapter provides examples of how to fit models in R. We'll use the bears and cars2015 datasets discussed in the note slides. First, we'll load these data.
library(Bolstad)
data(bears)
library(Lock5Data)
data(Cars2015)
Now, we'll preprocess the bears data, as was done in the notes. In particular, we'll:
- filter to the first observation on each bear
- Convert month and sex to factor variables
- Create a season variable by grouping together months.
# filter
Bears_Subset <- bears %>% filter(Obs.No == 1)
#convert to factors
Bears_Subset$Month <- as.factor(Bears_Subset$Month)
Bears_Subset$Sex <- as.factor(Bears_Subset$Sex)
# create season variable
Bears_Subset <- Bears_Subset %>% mutate(Season = ifelse(Month %in% 4:5, "Spring",
ifelse(Month %in% 6:8, "Summer", "Fall")))
Bears_Subset$Season <- as.factor(Bears_Subset$Season)
2.2 Fitting Models
2.2.1 Modeling with one Explanatory Variable
Template:
M <- lm(data=Dataset_Name, Response ~ Explanatory)
We see a summary of the model using summary()
.
summary(M)
2.2.1.1 Example: Model Weight Using Age
We model a bear's weight, using age as the explanatory variable.
Bears_M_Age <- lm(data=Bears_Subset, Weight ~ Age)
summary(Bears_M_Age)
##
## Call:
## lm(formula = Weight ~ Age, data = Bears_Subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -181.09 -45.88 -12.31 31.64 224.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.7881 17.5322 3.524 0.000874 ***
## Age 2.7432 0.3247 8.448 1.88e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.48 on 54 degrees of freedom
## (41 observations deleted due to missingness)
## Multiple R-squared: 0.5693, Adjusted R-squared: 0.5613
## F-statistic: 71.37 on 1 and 54 DF, p-value: 1.877e-11
2.2.2 Modeling with two Explanatory Variables
Template:
M <- lm(data=Dataset_Name, Response ~ Explanatory1 + Explanatory2)
summary(M)
2.2.2.1 Example: Model Weight Using Age and Sex
We model a bear's weight, using age and sex as explanatory variables.
Bears_M_Age_Sex <- lm(data=Bears_Subset, Weight ~ Age + Sex)
summary(Bears_M_Age_Sex)
##
## Call:
## lm(formula = Weight ~ Age + Sex, data = Bears_Subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -164.194 -48.483 -3.723 27.766 188.684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.6049 16.4019 5.036 5.84e-06 ***
## Age 2.9242 0.2914 10.035 7.44e-14 ***
## Sex2 -79.8967 20.1416 -3.967 0.00022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.33 on 53 degrees of freedom
## (41 observations deleted due to missingness)
## Multiple R-squared: 0.6679, Adjusted R-squared: 0.6554
## F-statistic: 53.29 on 2 and 53 DF, p-value: 2.061e-13
2.2.3 Models with 3 or more Variables
For a model with 3 or more variables, and no interactions, simply place +
between each explanatory variable.
2.2.3.1 Model Weight Using Age, Sex, and Season
We'll fit a model between age, sex, and season, without any interactions.
Bears_M_Age_Sex_Season <- lm(data=Bears_Subset, Weight ~ Age + Sex + Season)
summary(Bears_M_Age_Sex_Season)
##
## Call:
## lm(formula = Weight ~ Age + Sex + Season, data = Bears_Subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168.170 -53.403 -2.382 26.326 181.932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.8073 18.7050 4.908 9.83e-06 ***
## Age 2.8947 0.2959 9.783 2.71e-13 ***
## Sex2 -78.3078 20.5067 -3.819 0.000365 ***
## SeasonSpring -20.7510 28.7351 -0.722 0.473501
## SeasonSummer -20.7070 22.4298 -0.923 0.360256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.94 on 51 degrees of freedom
## (41 observations deleted due to missingness)
## Multiple R-squared: 0.675, Adjusted R-squared: 0.6495
## F-statistic: 26.48 on 4 and 51 DF, p-value: 6.525e-12
2.3 Models with Interaction
To model with an interaction, use *
instead of +
Template:
M_Int <- lm(data=Dataset_Name, Response ~ Explanatory1 * Explanatory2)
summary(M_Int)
2.3.1 Example Model with Interaction
We model a bear's weight, using age and sex as explanatory variables, with an interaction.
Bears_M_Age_Sex_Int <- lm(data=Bears_Subset, Weight ~ Age * Sex)
summary(Bears_M_Age_Sex_Int)
##
## Call:
## lm(formula = Weight ~ Age * Sex, data = Bears_Subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -207.583 -38.854 -9.574 23.905 174.802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.4322 17.7260 3.973 0.000219 ***
## Age 3.2381 0.3435 9.428 7.65e-13 ***
## Sex2 -31.9574 35.0314 -0.912 0.365848
## Age:Sex2 -1.0350 0.6237 -1.659 0.103037
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.18 on 52 degrees of freedom
## (41 observations deleted due to missingness)
## Multiple R-squared: 0.6846, Adjusted R-squared: 0.6664
## F-statistic: 37.62 on 3 and 52 DF, p-value: 4.552e-13
For a model with 3 or more variables, and all possible interactions (including interactions between 3 or more variables), use *
in place of +
, between each variable.
2.3.2 Example Interaction Using *
We'll fit a model between age, sex, and season, with all possible interactions interactions.
Bears_M_Age_Sex_Season_All_Int <- lm(data=Bears_Subset, Weight ~ Age * Sex * Season)
summary(Bears_M_Age_Sex_Season_All_Int)
##
## Call:
## lm(formula = Weight ~ Age * Sex * Season, data = Bears_Subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -193.656 -35.976 -4.717 22.399 175.228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.9270 21.7599 3.765 0.00049 ***
## Age 3.0945 0.3994 7.748 9.29e-10 ***
## Sex2 -66.1730 56.5290 -1.171 0.24806
## SeasonSpring -135.6420 71.2302 -1.904 0.06343 .
## SeasonSummer -24.2299 43.4702 -0.557 0.58008
## Age:Sex2 -0.1704 0.9489 -0.180 0.85830
## Age:SeasonSpring 5.0341 2.3597 2.133 0.03851 *
## Age:SeasonSummer 0.1615 0.8251 0.196 0.84574
## Sex2:SeasonSpring 152.0467 106.3539 1.430 0.15989
## Sex2:SeasonSummer 63.4826 82.6439 0.768 0.44650
## Age:Sex2:SeasonSpring -6.0365 2.7129 -2.225 0.03125 *
## Age:Sex2:SeasonSummer -1.8806 1.4962 -1.257 0.21541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.85 on 44 degrees of freedom
## (41 observations deleted due to missingness)
## Multiple R-squared: 0.7356, Adjusted R-squared: 0.6695
## F-statistic: 11.13 on 11 and 44 DF, p-value: 1.86e-09
2.3.3 Example: Interaction Using +
We can also add in interaction effects one-by-one using Var1:Var2
.
We'll fit a model between age, sex, and season, with only the age-season interaction.
Bears_M_Age_Sex_Season_All_Int <- lm(data=Bears_Subset,
Weight ~ Age + Sex + Season + Age:Season)
summary(Bears_M_Age_Sex_Season_All_Int)
##
## Call:
## lm(formula = Weight ~ Age + Sex + Season + Age:Season, data = Bears_Subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -193.189 -51.063 -2.525 26.993 174.271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.1432 21.2898 3.952 0.000248 ***
## Age 3.0794 0.3739 8.235 8.36e-11 ***
## Sex2 -80.2259 20.9939 -3.821 0.000375 ***
## SeasonSpring -10.7831 46.3724 -0.233 0.817093
## SeasonSummer 5.4549 37.2038 0.147 0.884033
## Age:SeasonSpring -0.2221 0.9790 -0.227 0.821455
## Age:SeasonSummer -0.6188 0.6977 -0.887 0.379464
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 72.81 on 49 degrees of freedom
## (41 observations deleted due to missingness)
## Multiple R-squared: 0.6801, Adjusted R-squared: 0.641
## F-statistic: 17.36 on 6 and 49 DF, p-value: 1.208e-10
2.4 ANOVA F-Statistics
2.4.1 anova()
Command
The anova
command calculates an F-statistic for a full model vs a reduced model.
2.4.1.1 anova()
Template:
anova(Reduced_Model, Full_Model)
2.4.1.2 anova()
Example
We'll compare the full model involving age and sex, to a reduced model involving only age.
anova(Bears_M_Age, Bears_M_Age_Sex)
## Analysis of Variance Table
##
## Model 1: Weight ~ Age
## Model 2: Weight ~ Age + Sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 54 349728
## 2 53 269667 1 80061 15.735 0.00022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.4.2 aov()
Command
The aov
command is useful for obtaining an F-statistic for models with a categorical explantory variable. This F-statistic is equivalent to the one we would get using a reduced model of containing only the intercept, and measures the amount of variability between different groups, relative to the amount of variability within groups.
2.4.2.1 aov()
Template
A <- aov(data=Dataset_Name, Response~Categorical_Explanatory_Var)
summary(A)
2.4.2.2 aov()
Commad
Bears_A_Season <- aov(data=Bears_Subset, Weight~Season)
summary(Bears_A_Season)
## Df Sum Sq Mean Sq F value Pr(>F)
## Season 2 24699 12350 0.976 0.381
## Residuals 94 1189818 12658
2.5 Making Predictions
We can make predictions using the predict()
function. This requires specifying the model, as well as the new data being predicted. The new data should be specified in the form of a dataframe, which can be defined inside the predict()
command, or before hand.
2.5.1 predict()
Command
We'll predict the weight of a 24 month old female bear, using the model using age, sex, and interaction. Recall that male bears are coded as Sex=1
, and female bears as Sex=2
.
predict(Bears_M_Age_Sex_Int, newdata=data.frame(Age=24, Sex="2"))
## 1
## 91.34993
We'll predict the weight of a 24 month old male bear, using the model using age, sex, and interaction.
predict(Bears_M_Age_Sex_Int, newdata=data.frame(Age=24, Sex="1"))
## 1
## 148.1475
2.5.2 Defining New Data Before predict()
We can also define the new data outside the predict()
function. It's easiest to do this one variable at a time, then combine them into a dataframe.
We'll predict the weights of bears of different ages and sexes.
Age <- c(15, 25, 28, 37, 44, 62)
Sex <- c("1", "2","1", "1", "2", "1" )
NewBears <- data.frame(Age, Sex)
Let's see the new dataset.
NewBears
## Age Sex
## 1 15 1
## 2 25 2
## 3 28 1
## 4 37 1
## 5 44 2
## 6 62 1
Now we'll make predictions on the weights of the bears in the new dataset.
predict(Bears_M_Age_Sex_Int, newdata=NewBears)
## 1 2 3 4 5 6
## 119.00429 93.55306 161.10008 190.24331 135.41248 271.19674