Chapter 18 Model selection
In practice, you will often have quite a few candidate variables you may want to include in your model. To decide on a final model, you may want to use some metrics for model comparison. These include an analysis of the ratio of explained to total variance, measures such as Adjusted R-squared, and the AIC/BIC metrics. You may also consider the likelihood ratio test if you are dealing with mixed-effects models or logistic regression models.
Pause for a moment to consider the difference between R^2 and Adjusted R^2 here. R^2 on its own isn’t a very good selection criterion, as it can only increase as you add predictors to a model. Adjusted R^2 is better, as it ‘penalises’ bigger models.
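To see this in action, here is a minimal sketch using made-up simulated data (the variable names and the noise predictor are purely for illustration): adding a predictor that is pure noise still nudges R^2 upwards, while Adjusted R^2 can go down.
#Minimal sketch with simulated data: R^2 vs Adjusted R^2
set.seed(123)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)       # y genuinely depends on x
noise <- rnorm(n)           # a predictor unrelated to y
fit_small <- lm(y ~ x)
fit_big   <- lm(y ~ x + noise)
#R^2 never decreases when a predictor is added...
summary(fit_small)$r.squared
summary(fit_big)$r.squared
#...but Adjusted R^2 penalises the extra parameter
summary(fit_small)$adj.r.squared
summary(fit_big)$adj.r.squared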
Let’s consider the shark models we used earlier. We want to compare how well they explain the variance in our data, given the various modifications we have made.
We noticed, of course, that the model including temperature fitted better, but we can also do an F test to check whether the extra variance explained represents a significant improvement.
Just to remind you of the models we had:
#Simple model
sharks_model1 <- lm(SharkAttacks ~ IceCreamSales, data = sharks)
#Extended Model
sharks_model2 <- lm(SharkAttacks ~ IceCreamSales + Temperature, data = sharks)
Let’s do an F test:
#Use Analysis of Variance
anova(sharks_model1, sharks_model2)
## Analysis of Variance Table
##
## Model 1: SharkAttacks ~ IceCreamSales
## Model 2: SharkAttacks ~ IceCreamSales + Temperature
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 4219.6
## 2 81 2764.8 1 1454.8 42.62 5.398e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What does this output tell us? Model 2 did indeed provide a significantly better fit to the data than model 1, F(1, 81) = 42.62, p < .001.
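If you want to see where anova() gets that F value, here is a minimal sketch reproducing it by hand from the residual sums of squares reported in the table above:
#Reproduce the F statistic by hand from the anova() table above
rss1 <- 4219.6   # residual sum of squares, model 1 (Res.Df = 82)
rss2 <- 2764.8   # residual sum of squares, model 2 (Res.Df = 81)
f_stat <- ((rss1 - rss2) / (82 - 81)) / (rss2 / 81)
f_stat                                             # ~42.6, matching the table
pf(f_stat, df1 = 1, df2 = 81, lower.tail = FALSE)  # the p-value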
Let’s also do the same for our salary data. There wasn’t a huge increase in R-squared when we looked at that model, but an F test is a good sanity check, as it may show that even a modest difference is significant.
#Simple model with no interaction
model_salary1 <- lm(salary ~ service_m + dept, data = salary)
#Extended Model
model_salary2 <- lm(salary ~ service_m*dept, data = salary)
#Use Analysis of Variance
anova(model_salary1,model_salary2)
## Analysis of Variance Table
##
## Model 1: salary ~ service_m + dept
## Model 2: salary ~ service_m * dept
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 47 1103.13
## 2 46 976.19 1 126.95 5.982 0.01834 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So we can see again that model 2, the model with the interaction term, provides a significantly better fit (p = .018).
18.1 AIC & BIC
We can further look at AIC and BIC. These are criteria that consider both explained variation and model complexity, rewarding parsimony. Given a collection of models for the data, they estimate the quality of each model relative to the others. BIC tends to penalise model complexity more heavily than AIC. Lower AIC and BIC values mean that a model is considered to be closer to the ‘truth’.
AIC(model_salary1,model_salary2)
## df AIC
## model_salary1 4 304.5882
## model_salary2 5 300.4755
BIC(model_salary1,model_salary2)
## df BIC
## model_salary1 4 312.2363
## model_salary2 5 310.0356
From the AIC and BIC comparisons above, the second model is again preferred. This adds to our evidence from the anova() output.
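If you want to see where these numbers come from, here is a minimal sketch computing AIC and BIC by hand from the model’s log-likelihood (k is the number of estimated parameters, including the residual variance, and n the number of observations); the results should match the AIC() and BIC() output above.
#Compute AIC and BIC for model_salary2 by hand from its log-likelihood
ll <- logLik(model_salary2)
k  <- attr(ll, "df")               # number of estimated parameters
n  <- nobs(model_salary2)          # number of observations
-2 * as.numeric(ll) + 2 * k        # AIC = -2*logLik + 2k
-2 * as.numeric(ll) + log(n) * k   # BIC = -2*logLik + log(n)*k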
Below is a table summarising common model evaluation statistics and, in general, what to look out for when evaluating them. Bear in mind that these rules of thumb depend heavily on context: the number of observations, the data, the number of predictors in your model, and a few other things (think of the shark example and how some of these wouldn’t necessarily apply!).
| Statistic | Criterion |
|---|---|
| Adj. R-squared | Higher the better |
| F-statistic | Higher the better |
| Std. Error | Closer to zero the better |
| t-statistic | Absolute value should be greater than 1.96 for the p-value to be less than 0.05 |
| AIC | Lower the better |
| BIC | Lower the better |