Chapter 2 Introduction to Linear Models

2.1 Preprocessing Data

This chapter provides examples of how to fit models in R. We'll use the bears and cars2015 datasets discussed in the note slides. First, we'll load these data.

library(Bolstad)
data(bears)
library(Lock5Data)
data(Cars2015)

Now, we'll preprocess the bears data, as was done in the notes. In particular, we'll:

  1. filter to the first observation on each bear
  2. Convert month and sex to factor variables
  3. Create a season variable by grouping together months.
# filter
Bears_Subset <- bears %>% filter(Obs.No == 1)
#convert to factors
Bears_Subset$Month <- as.factor(Bears_Subset$Month)
Bears_Subset$Sex <- as.factor(Bears_Subset$Sex)
# create season variable
Bears_Subset <- Bears_Subset %>% mutate(Season = ifelse(Month %in% 4:5, "Spring",
ifelse(Month %in% 6:8, "Summer", "Fall")))
Bears_Subset$Season <- as.factor(Bears_Subset$Season)

2.2 Fitting Models

2.2.1 Modeling with one Explanatory Variable

Template:

M <- lm(data=Dataset_Name, Response ~ Explanatory)

We see a summary of the model using summary().

summary(M)

2.2.1.1 Example: Model Weight Using Age

We model a bear's weight, using age as the explanatory variable.

Bears_M_Age <- lm(data=Bears_Subset, Weight ~ Age)
summary(Bears_M_Age)
## 
## Call:
## lm(formula = Weight ~ Age, data = Bears_Subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -181.09  -45.88  -12.31   31.64  224.52 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.7881    17.5322   3.524 0.000874 ***
## Age           2.7432     0.3247   8.448 1.88e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.48 on 54 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.5693, Adjusted R-squared:  0.5613 
## F-statistic: 71.37 on 1 and 54 DF,  p-value: 1.877e-11

2.2.2 Modeling with two Explanatory Variables

Template:

M <- lm(data=Dataset_Name, Response ~ Explanatory1 + Explanatory2)
summary(M)

2.2.2.1 Example: Model Weight Using Age and Sex

We model a bear's weight, using age and sex as explanatory variables.

Bears_M_Age_Sex <- lm(data=Bears_Subset, Weight ~ Age + Sex)
summary(Bears_M_Age_Sex)
## 
## Call:
## lm(formula = Weight ~ Age + Sex, data = Bears_Subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -164.194  -48.483   -3.723   27.766  188.684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  82.6049    16.4019   5.036 5.84e-06 ***
## Age           2.9242     0.2914  10.035 7.44e-14 ***
## Sex2        -79.8967    20.1416  -3.967  0.00022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71.33 on 53 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.6679, Adjusted R-squared:  0.6554 
## F-statistic: 53.29 on 2 and 53 DF,  p-value: 2.061e-13

2.2.3 Models with 3 or more Variables

For a model with 3 or more variables, and no interactions, simply place + between each explanatory variable.

2.2.3.1 Model Weight Using Age, Sex, and Season

We'll fit a model between age, sex, and season, without any interactions.

Bears_M_Age_Sex_Season <- lm(data=Bears_Subset, Weight ~ Age + Sex + Season)
summary(Bears_M_Age_Sex_Season)
## 
## Call:
## lm(formula = Weight ~ Age + Sex + Season, data = Bears_Subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -168.170  -53.403   -2.382   26.326  181.932 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   91.8073    18.7050   4.908 9.83e-06 ***
## Age            2.8947     0.2959   9.783 2.71e-13 ***
## Sex2         -78.3078    20.5067  -3.819 0.000365 ***
## SeasonSpring -20.7510    28.7351  -0.722 0.473501    
## SeasonSummer -20.7070    22.4298  -0.923 0.360256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71.94 on 51 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.675,  Adjusted R-squared:  0.6495 
## F-statistic: 26.48 on 4 and 51 DF,  p-value: 6.525e-12

2.3 Models with Interaction

To model with an interaction, use * instead of +

Template:

M_Int <- lm(data=Dataset_Name, Response ~ Explanatory1 * Explanatory2)
summary(M_Int)

2.3.1 Example Model with Interaction

We model a bear's weight, using age and sex as explanatory variables, with an interaction.

Bears_M_Age_Sex_Int <- lm(data=Bears_Subset, Weight ~ Age * Sex)
summary(Bears_M_Age_Sex_Int)
## 
## Call:
## lm(formula = Weight ~ Age * Sex, data = Bears_Subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -207.583  -38.854   -9.574   23.905  174.802 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  70.4322    17.7260   3.973 0.000219 ***
## Age           3.2381     0.3435   9.428 7.65e-13 ***
## Sex2        -31.9574    35.0314  -0.912 0.365848    
## Age:Sex2     -1.0350     0.6237  -1.659 0.103037    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.18 on 52 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.6846, Adjusted R-squared:  0.6664 
## F-statistic: 37.62 on 3 and 52 DF,  p-value: 4.552e-13

For a model with 3 or more variables, and all possible interactions (including interactions between 3 or more variables), use * in place of +, between each variable.

2.3.2 Example Interaction Using *

We'll fit a model between age, sex, and season, with all possible interactions interactions.

Bears_M_Age_Sex_Season_All_Int <- lm(data=Bears_Subset, Weight ~ Age * Sex * Season)
summary(Bears_M_Age_Sex_Season_All_Int)
## 
## Call:
## lm(formula = Weight ~ Age * Sex * Season, data = Bears_Subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -193.656  -35.976   -4.717   22.399  175.228 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             81.9270    21.7599   3.765  0.00049 ***
## Age                      3.0945     0.3994   7.748 9.29e-10 ***
## Sex2                   -66.1730    56.5290  -1.171  0.24806    
## SeasonSpring          -135.6420    71.2302  -1.904  0.06343 .  
## SeasonSummer           -24.2299    43.4702  -0.557  0.58008    
## Age:Sex2                -0.1704     0.9489  -0.180  0.85830    
## Age:SeasonSpring         5.0341     2.3597   2.133  0.03851 *  
## Age:SeasonSummer         0.1615     0.8251   0.196  0.84574    
## Sex2:SeasonSpring      152.0467   106.3539   1.430  0.15989    
## Sex2:SeasonSummer       63.4826    82.6439   0.768  0.44650    
## Age:Sex2:SeasonSpring   -6.0365     2.7129  -2.225  0.03125 *  
## Age:Sex2:SeasonSummer   -1.8806     1.4962  -1.257  0.21541    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.85 on 44 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.7356, Adjusted R-squared:  0.6695 
## F-statistic: 11.13 on 11 and 44 DF,  p-value: 1.86e-09

2.3.3 Example: Interaction Using +

We can also add in interaction effects one-by-one using Var1:Var2.

We'll fit a model between age, sex, and season, with only the age-season interaction.

Bears_M_Age_Sex_Season_All_Int <- lm(data=Bears_Subset, 
                                     Weight ~ Age + Sex + Season + Age:Season)
summary(Bears_M_Age_Sex_Season_All_Int)
## 
## Call:
## lm(formula = Weight ~ Age + Sex + Season + Age:Season, data = Bears_Subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -193.189  -51.063   -2.525   26.993  174.271 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       84.1432    21.2898   3.952 0.000248 ***
## Age                3.0794     0.3739   8.235 8.36e-11 ***
## Sex2             -80.2259    20.9939  -3.821 0.000375 ***
## SeasonSpring     -10.7831    46.3724  -0.233 0.817093    
## SeasonSummer       5.4549    37.2038   0.147 0.884033    
## Age:SeasonSpring  -0.2221     0.9790  -0.227 0.821455    
## Age:SeasonSummer  -0.6188     0.6977  -0.887 0.379464    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72.81 on 49 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.6801, Adjusted R-squared:  0.641 
## F-statistic: 17.36 on 6 and 49 DF,  p-value: 1.208e-10

2.4 ANOVA F-Statistics

2.4.1 anova() Command

The anova command calculates an F-statistic for a full model vs a reduced model.

2.4.1.1 anova() Template:

anova(Reduced_Model, Full_Model)

2.4.1.2 anova() Example

We'll compare the full model involving age and sex, to a reduced model involving only age.

anova(Bears_M_Age, Bears_M_Age_Sex)
## Analysis of Variance Table
## 
## Model 1: Weight ~ Age
## Model 2: Weight ~ Age + Sex
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)    
## 1     54 349728                                
## 2     53 269667  1     80061 15.735 0.00022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.4.2 aov() Command

The aov command is useful for obtaining an F-statistic for models with a categorical explantory variable. This F-statistic is equivalent to the one we would get using a reduced model of containing only the intercept, and measures the amount of variability between different groups, relative to the amount of variability within groups.

2.4.2.1 aov() Template

A <- aov(data=Dataset_Name, Response~Categorical_Explanatory_Var)
summary(A)

2.4.2.2 aov() Commad

Bears_A_Season <- aov(data=Bears_Subset, Weight~Season)
summary(Bears_A_Season)
##             Df  Sum Sq Mean Sq F value Pr(>F)
## Season       2   24699   12350   0.976  0.381
## Residuals   94 1189818   12658

2.5 Making Predictions

We can make predictions using the predict() function. This requires specifying the model, as well as the new data being predicted. The new data should be specified in the form of a dataframe, which can be defined inside the predict() command, or before hand.

2.5.1 predict() Command

We'll predict the weight of a 24 month old female bear, using the model using age, sex, and interaction. Recall that male bears are coded as Sex=1, and female bears as Sex=2.

predict(Bears_M_Age_Sex_Int, newdata=data.frame(Age=24, Sex="2"))
##        1 
## 91.34993

We'll predict the weight of a 24 month old male bear, using the model using age, sex, and interaction.

predict(Bears_M_Age_Sex_Int, newdata=data.frame(Age=24, Sex="1"))
##        1 
## 148.1475

2.5.2 Defining New Data Before predict()

We can also define the new data outside the predict() function. It's easiest to do this one variable at a time, then combine them into a dataframe.

We'll predict the weights of bears of different ages and sexes.

Age <- c(15, 25, 28, 37, 44, 62)
Sex <- c("1", "2","1", "1", "2", "1" )
NewBears <- data.frame(Age, Sex)

Let's see the new dataset.

NewBears
##   Age Sex
## 1  15   1
## 2  25   2
## 3  28   1
## 4  37   1
## 5  44   2
## 6  62   1

Now we'll make predictions on the weights of the bears in the new dataset.

predict(Bears_M_Age_Sex_Int, newdata=NewBears)
##         1         2         3         4         5         6 
## 119.00429  93.55306 161.10008 190.24331 135.41248 271.19674