Chapter 14 Model selection

In practice you will often have quite a few variables that you could include in your model. To decide on a final model, you can use several metrics for model comparison: an analysis of the ratio of explained to total variance (an F test), measures such as adjusted R squared, and the AIC/BIC information criteria. You may also consider a likelihood ratio test if you are dealing with mixed-effects models or logistic regression models.

Let's consider the models we fitted earlier. We want to compare how well they explain the variance in our data, given the various modifications we have made.

Let's go back to the shark attacks example. We noticed, of course, that the model including temperature was better, but we can also run an F test to check whether the additional variance it explains is a significant improvement.

Just to remind you, here are the models we had:

# Simple model
sharks_model_1 <- lm(SharkAttacks ~ IceCreamSales, data = sharks)

# Extended model
sharks_model_2 <- lm(SharkAttacks ~ IceCreamSales + Temperature, data = sharks)

Let's run the F test:

# Use analysis of variance to compare the nested models
anova(sharks_model_1, sharks_model_2)
## Analysis of Variance Table
## 
## Model 1: SharkAttacks ~ IceCreamSales
## Model 2: SharkAttacks ~ IceCreamSales + Temperature
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1     82 4219.6                                 
## 2     81 2764.8  1    1454.8 42.62 5.398e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
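To see where that F statistic comes from, we can reproduce it by hand from the residual sums of squares (RSS) in the table above: the drop in RSS divided by the change in degrees of freedom, over the residual variance of the larger model. This is just a sketch using the rounded values printed by anova():

# Reproduce the F statistic by hand from the anova() table
rss_1 <- 4219.6                          # residual sum of squares, model 1
rss_2 <- 2764.8                          # residual sum of squares, model 2
((rss_1 - rss_2) / 1) / (rss_2 / 81)     # gives roughly 42.62, matching the table (up to rounding)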

Let's do the same for the interaction model from the salary example:

# Simple model with no interaction
model_salary_1 <- lm(salary ~ service_m + dept, data = salary)
# Extended model with an interaction term
model_salary_2 <- lm(salary ~ service_m * dept, data = salary)
# Use analysis of variance to compare the nested models
anova(model_salary_1, model_salary_2)
## Analysis of Variance Table
## 
## Model 1: salary ~ service_m + dept
## Model 2: salary ~ service_m * dept
##   Res.Df     RSS Df Sum of Sq     F  Pr(>F)  
## 1     47 1103.13                             
## 2     46  976.19  1    126.95 5.982 0.01834 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The increase in R squared may not look huge when we inspect the model summaries, but the F test is a good sanity check, as it can show that even a modest improvement is statistically significant.
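To put a number on that increase, you can pull the R squared and adjusted R squared values directly from the model summaries; these slots are part of the standard output of summary() for lm objects:

# Compare (adjusted) R squared across the two salary models
summary(model_salary_1)$r.squared
summary(model_salary_2)$r.squared
summary(model_salary_1)$adj.r.squared
summary(model_salary_2)$adj.r.squared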

We can further look at AIC and BIC. These are criteria that balance explained variation against model complexity, rewarding parsimony; BIC tends to penalize model complexity more heavily.

AIC(model_salary_1,model_salary_2)
##                df      AIC
## model_salary_1  4 304.5882
## model_salary_2  5 300.4755
BIC(model_salary_1,model_salary_2)
##                df      BIC
## model_salary_1  4 312.2363
## model_salary_2  5 310.0356
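As a sketch of what these numbers mean: AIC = 2k - 2 * log-likelihood and BIC = k * log(n) - 2 * log-likelihood, where k is the number of estimated parameters (the df column above) and n is the number of observations. We can recompute them by hand, which should reproduce the values above up to rounding:

# Recompute AIC and BIC by hand for the simpler salary model
ll <- logLik(model_salary_1)        # log-likelihood of the fitted model
k  <- attr(ll, "df")                # number of estimated parameters (here 4)
n  <- nobs(model_salary_1)          # number of observations
-2 * as.numeric(ll) + 2 * k         # AIC
-2 * as.numeric(ll) + k * log(n)    # BIC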

From the output above, both the AIC and the BIC comparison favor the second model: for both criteria, a lower score is preferred.
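Finally, a word on the likelihood ratio tests mentioned at the start of the chapter: for nested models fitted with glm() (for example, logistic regressions), the same anova() call works, you just request a chi-squared test. The models below are purely illustrative; attacked is a hypothetical binary outcome, not a variable in our sharks data.

# Likelihood ratio test for nested logistic regressions
# (attacked is a made-up binary outcome, for illustration only)
glm_1 <- glm(attacked ~ IceCreamSales, data = sharks, family = binomial)
glm_2 <- glm(attacked ~ IceCreamSales + Temperature, data = sharks, family = binomial)
anova(glm_1, glm_2, test = "Chisq")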