9 Review: Comparing Models (R)
9.1 Lab Overview
This lab reviews a fundamental skill from linear regression that we'll be building on: comparing the fit of models!
Research Question: What characteristics are associated with separatist movements in ethnocultural groups?
Lab Files
Data: Download M.A.R_Cleaned.csv
Codebook: Download MAR_Codebook.pdf
Script File: Download lab4_F2_comparing_models.R
Activity File: Download lab4_F2_activity.R
Packages for this lab:
You may need to install packages that you have never used before with install.packages("packagename")
# Basic data cleaning and visualization packages
library(tidyverse)
# Package for Wald Test
library(car)
# Package for Likelihood Ratio Test
library(lmtest)
Setting factor variables
Some quick data cleaning to set these variables as factors:
# Set categorical variables as factors
df_MAR <- df_MAR %>%
  mutate(
trsntlkin = factor(trsntlkin, labels = c("No Close Kindred",
"Close Kindred/Not Adjoining Base",
"Close Kindred, Adjoining Base",
"Close Kindred, More than 1 Country")),
kinmatsup = factor(kinmatsup, labels = c("No material kinship support",
"Material kinship support")),
kinmilsup = factor(kinmilsup, labels = c("No military kinship support",
"Military kinship support")),
grplang = factor(grplang, labels = c("<10% Speak Native Lang",
"Some Speak Native Lang",
"Most Speak Native Lang"))
)
9.2 Wald Test
Test overview
The Wald test helps you see whether adding a set of variables collectively improves the fit of the model. Said another way, we can see whether removing two or more variables from the model would substantially harm its fit, which would mean those variables contribute something statistically significant to the model's fit.
Test hypotheses:
* Null hypothesis: the coefficients are all equal to 0 (\(H_0: \beta_1 = \beta_2 = 0\))
* Alt hypothesis: the coefficients are not all equal to zero (\(H_A: \beta_1 \neq 0 \text{ or } \beta_2 \neq 0\))
Interpretation: If the p-value for the test is < .05, then we reject the null hypothesis and conclude that the variables jointly explain some piece of the model. It is like the p-value for our coefficients, except for multiple coefficients at the same time.
Even though we are looking at the Wald test first, it is actually just an approximation of the likelihood ratio test coming next. If that is the case (you might be asking), why would we even learn the Wald test? Well, the advantage of the Wald test is that it only requires us to run a single model, rather than several. If you have a huge, computationally intensive model, the Wald test might be the better choice. Otherwise, many recommend the likelihood ratio test over the Wald test.
R code
STEP 1: Run your regression
fit1 <- glm(sepkin ~ trsntlkin + kinmatsup + kinmilsup + grplang + pct_ctry,
            data = df_MAR, family = binomial(link = "logit"))
summary(fit1)
Call:
glm(formula = sepkin ~ trsntlkin + kinmatsup + kinmilsup + grplang +
pct_ctry, family = binomial(link = "logit"), data = df_MAR)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7868 -0.8022 -0.7235 1.2534 2.4011
Coefficients:
Estimate Std. Error z value
(Intercept) -3.0587629 0.4824304 -6.340
trsntlkinClose Kindred/Not Adjoining Base 1.8558380 0.4394281 4.223
trsntlkinClose Kindred, Adjoining Base 1.6448396 0.4544475 3.619
trsntlkinClose Kindred, More than 1 Country 2.4308115 0.4495357 5.407
kinmatsupMaterial kinship support -0.2695913 0.3839679 -0.702
kinmilsupMilitary kinship support 1.5463083 0.4640465 3.332
grplangSome Speak Native Lang 0.2350165 0.2380001 0.987
grplangMost Speak Native Lang 1.0265800 0.2753856 3.728
pct_ctry -0.0002825 0.0061351 -0.046
Pr(>|z|)
(Intercept) 2.29e-10 ***
trsntlkinClose Kindred/Not Adjoining Base 2.41e-05 ***
trsntlkinClose Kindred, Adjoining Base 0.000295 ***
trsntlkinClose Kindred, More than 1 Country 6.40e-08 ***
kinmatsupMaterial kinship support 0.482605
kinmilsupMilitary kinship support 0.000862 ***
grplangSome Speak Native Lang 0.323415
grplangMost Speak Native Lang 0.000193 ***
pct_ctry 0.963279
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1013.80 on 836 degrees of freedom
Residual deviance: 926.81 on 828 degrees of freedom
(15 observations deleted due to missingness)
AIC: 944.81
Number of Fisher Scoring iterations: 5
STEP 2: Use the linearHypothesis command
The R command you will be using is the linearHypothesis command from the car package. You simply list the saved model results and the two or more variables you want to jointly test.
The only tricky part of this command is working with categorical variables. When you are testing categorical variables, you have to write out the full coefficient name, which is the variable name with the category label attached. I'm copying these directly from the regression results above. We'll run two examples so you can see what I mean.
Example 1: Material and Military Support from Kin
In this example, because both kinmatsup and kinmilsup are binary variables, we want to test whether the coefficients for "Yes" are equal to 0. We can refer to the coefficients using what appears in the regression results: kinmatsupMaterial kinship support and kinmilsupMilitary kinship support.
linearHypothesis(fit1, c("kinmatsupMaterial kinship support = 0",
"kinmilsupMilitary kinship support = 0"))
Linear hypothesis test
Hypothesis:
kinmatsupMaterial kinship support = 0
kinmilsupMilitary kinship support = 0
Model 1: restricted model
Model 2: sepkin ~ trsntlkin + kinmatsup + kinmilsup + grplang + pct_ctry
Res.Df Df Chisq Pr(>Chisq)
1 830
2 828 2 11.108 0.003871 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Example 2: All the categories of Language Distinctiveness (grplang)
In this example we're only looking at one variable, but because it is a categorical variable with multiple levels, we can test whether its categories jointly add to the model. We will list all the categories of grplang except for the base category. Again, we pull these from the regression results.
linearHypothesis(fit1, c("grplangSome Speak Native Lang",
"grplangMost Speak Native Lang"))
Linear hypothesis test
Hypothesis:
grplangSome Speak Native Lang = 0
grplangMost Speak Native Lang = 0
Model 1: restricted model
Model 2: sepkin ~ trsntlkin + kinmatsup + kinmilsup + grplang + pct_ctry
Res.Df Df Chisq Pr(>Chisq)
1 830
2 828 2 18.999 7.489e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
STEP 3: Interpret the test
If the p-value for a test is LESS THAN .05 we can reject the null hypothesis that the coefficients are both equal to zero, meaning adding these two variables improves the fit of the model.
For example 1, we can reject the null (\(p = 0.0039\)): adding the two variables improves fit.
For example 2, we can also reject the null (\(p < 0.0001\)): adding the categories of grplang improves fit.
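Side note: if you want to pull the p-value out of the test programmatically (to report it in a document, for example), you can save the result, which is returned as an anova-style data frame. A minimal sketch using fit1 from above; the object name wald_res is just my choice:
# Save the Wald test result and pull the p-value
wald_res <- linearHypothesis(fit1, c("kinmatsupMaterial kinship support = 0",
                                     "kinmilsupMilitary kinship support = 0"))
# The second row holds the joint test, so take its p-value
wald_res[["Pr(>Chisq)"]][2]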
9.3 Likelihood Ratio Test
This is one of the most recommended tests for comparing models in the generalized linear model family, which includes linear regression, logistic regression, probit regression, ordinal models, Poisson, and negative binomial models.
Test Overview
What is likelihood:
The likelihood of a model refers to the probability that the relationship between variables described in our model equation is the true, but unknown, real-life relationship between those variables. Said another way: what are the chances that the coefficients from our model could produce the actual observations in our dataset? This exact likelihood doesn't tell us much on its own. However, it helps regressions choose the set of coefficients that has the highest probability of describing our data. Maximum likelihood estimation, which is what the regressions we are using in this class rely on, calculates the coefficients that have the greatest likelihood of producing the observed data. Maximum likelihood estimation actually uses the log of the likelihood, because it has some mathematical properties that make it easier to work with, which is why you will see the term "log-likelihood." The higher the log-likelihood, the better the model fits the data.
If you really want to dive into the math of this test, the first 14 minutes of this video provide a good, relatively straightforward, explanation of likelihood and log-likelihood. Though beware, there is lots of math notation: https://www.youtube.com/watch?v=ScduwntrMzc
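In R, you can view the log-likelihood of any fitted model with the base logLik() function (no extra package needed). A quick sketch using fit1 from the Wald test section:
# Extract the log-likelihood of a fitted model
logLik(fit1)
# fit1 is the same model as fit_c below, so this should print -463.40 (df = 9)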
How does the likelihood ratio test work?
Each model that you run has a log-likelihood. The likelihood ratio test checks whether the difference between the log-likelihoods of two models is statistically significant. If the difference is statistically significant, then one model has a better fit.
Test hypotheses:
* Null hypothesis: the log-likelihood from one model is equal to the other (\(H_0\): log-likelihood model 1 − log-likelihood model 2 = 0)
* Alt hypothesis: the log-likelihoods are not equal to one another (\(H_A\): log-likelihood model 1 − log-likelihood model 2 \(\neq\) 0)
Interpretation: If the p-value for the test is < .05, then we reject the null hypothesis and conclude that one model fits better than the other.
This is a really math-heavy concept. I wanted to provide an overview for anyone who gets frustrated using these tests without seeing what is inside the black box. If that's not you…
Here is the takeaway: The likelihood ratio test compares the fit statistic (some number somebody theorized describes how well the model fits the data) from two nested models and tells you if the fit from one is better.
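For the curious, here is what lrtest() is doing under the hood: the test statistic is twice the difference in log-likelihoods, compared against a chi-squared distribution with degrees of freedom equal to the number of added parameters. A minimal sketch using the nested models fit_a and fit_b that we run below:
# Likelihood ratio test "by hand" -- lrtest() does all of this for you
ll_small <- logLik(fit_a)                          # restricted model
ll_big   <- logLik(fit_b)                          # larger model
lr_stat  <- 2 * (as.numeric(ll_big) - as.numeric(ll_small))
df_diff  <- attr(ll_big, "df") - attr(ll_small, "df")
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)  # p-value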
What is a “nested model”:
Two models are "nested" when they have the same dependent variable and the independent variables from one model are a subset of the independent variables in the larger model. Said another way, each model has the same variables, but the second one adds at least one additional independent variable. To be nested, the models must also be run on the same sample. This can trip you up when running a likelihood ratio test because R automatically drops missing values during regressions. When you add a new variable that has missing values in your dataset, the larger model might have fewer cases than the smaller model. You have to make sure you run the smaller model without the cases that the larger model drops.
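A quick way to guard against this: after running your models, check that they used the same number of observations. Base R's nobs() reports how many cases a fitted model actually used. A small sketch with hypothetical model names fit_small and fit_big; lrtest() should also complain if the samples differ, but checking yourself is cheap.
# Confirm both models were estimated on the same sample
stopifnot(nobs(fit_small) == nobs(fit_big))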
R code
STEP 1: Drop observations with missing values
There are many ways to do this in R. My preference is to use drop_na() from the tidyverse.
df_MAR <- df_MAR %>%
  drop_na(sepkin, trsntlkin, kinmatsup, kinmilsup, grplang, pct_ctry)
STEP 2: Run two or more “nested” models and save the results
Now we need to run two or more nested models and save the model results.
Basic model with only key explanatory variable: trsntlkin
fit_a <- glm(sepkin ~ trsntlkin,
             data = df_MAR, family = binomial(link = "logit"))
summary(fit_a)
Call:
glm(formula = sepkin ~ trsntlkin, family = binomial(link = "logit"),
data = df_MAR)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0780 -0.8360 -0.8216 1.2802 2.3926
Coefficients:
Estimate Std. Error z value
(Intercept) -2.8034 0.4204 -6.668
trsntlkinClose Kindred/Not Adjoining Base 1.9319 0.4362 4.429
trsntlkinClose Kindred, Adjoining Base 1.8906 0.4487 4.213
trsntlkinClose Kindred, More than 1 Country 2.5649 0.4469 5.740
Pr(>|z|)
(Intercept) 2.60e-11 ***
trsntlkinClose Kindred/Not Adjoining Base 9.47e-06 ***
trsntlkinClose Kindred, Adjoining Base 2.52e-05 ***
trsntlkinClose Kindred, More than 1 Country 9.47e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1013.80 on 836 degrees of freedom
Residual deviance: 959.09 on 833 degrees of freedom
AIC: 967.09
Number of Fisher Scoring iterations: 5
Second model adding grplang and pct_ctry
fit_b <- glm(sepkin ~ trsntlkin + grplang + pct_ctry,
             data = df_MAR, family = binomial(link = "logit"))
summary(fit_b)
Call:
glm(formula = sepkin ~ trsntlkin + grplang + pct_ctry, family = binomial(link = "logit"),
data = df_MAR)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3814 -0.8019 -0.7257 1.2542 2.4015
Coefficients:
Estimate Std. Error z value
(Intercept) -3.053297 0.480804 -6.350
trsntlkinClose Kindred/Not Adjoining Base 1.854847 0.438831 4.227
trsntlkinClose Kindred, Adjoining Base 1.763777 0.451477 3.907
trsntlkinClose Kindred, More than 1 Country 2.491353 0.448909 5.550
grplangSome Speak Native Lang 0.231829 0.235364 0.985
grplangMost Speak Native Lang 1.030939 0.272657 3.781
pct_ctry -0.001023 0.006064 -0.169
Pr(>|z|)
(Intercept) 2.15e-10 ***
trsntlkinClose Kindred/Not Adjoining Base 2.37e-05 ***
trsntlkinClose Kindred, Adjoining Base 9.36e-05 ***
trsntlkinClose Kindred, More than 1 Country 2.86e-08 ***
grplangSome Speak Native Lang 0.324634
grplangMost Speak Native Lang 0.000156 ***
pct_ctry 0.865976
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1013.80 on 836 degrees of freedom
Residual deviance: 938.83 on 830 degrees of freedom
AIC: 952.83
Number of Fisher Scoring iterations: 5
Third model adding kinmatsup and kinmilsup
fit_c <- glm(sepkin ~ trsntlkin + kinmatsup + kinmilsup + grplang + pct_ctry,
             data = df_MAR, family = binomial(link = "logit"))
summary(fit_c)
Call:
glm(formula = sepkin ~ trsntlkin + kinmatsup + kinmilsup + grplang +
pct_ctry, family = binomial(link = "logit"), data = df_MAR)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7868 -0.8022 -0.7235 1.2534 2.4011
Coefficients:
Estimate Std. Error z value
(Intercept) -3.0587629 0.4824304 -6.340
trsntlkinClose Kindred/Not Adjoining Base 1.8558380 0.4394281 4.223
trsntlkinClose Kindred, Adjoining Base 1.6448396 0.4544475 3.619
trsntlkinClose Kindred, More than 1 Country 2.4308115 0.4495357 5.407
kinmatsupMaterial kinship support -0.2695913 0.3839679 -0.702
kinmilsupMilitary kinship support 1.5463083 0.4640465 3.332
grplangSome Speak Native Lang 0.2350165 0.2380001 0.987
grplangMost Speak Native Lang 1.0265800 0.2753856 3.728
pct_ctry -0.0002825 0.0061351 -0.046
Pr(>|z|)
(Intercept) 2.29e-10 ***
trsntlkinClose Kindred/Not Adjoining Base 2.41e-05 ***
trsntlkinClose Kindred, Adjoining Base 0.000295 ***
trsntlkinClose Kindred, More than 1 Country 6.40e-08 ***
kinmatsupMaterial kinship support 0.482605
kinmilsupMilitary kinship support 0.000862 ***
grplangSome Speak Native Lang 0.323415
grplangMost Speak Native Lang 0.000193 ***
pct_ctry 0.963279
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1013.80 on 836 degrees of freedom
Residual deviance: 926.81 on 828 degrees of freedom
AIC: 944.81
Number of Fisher Scoring iterations: 5
NOTE: I always give the earlier names (fit_a, fit_b, fit_c) to the models with fewer variables. You might include the number of variables or degrees of freedom in the name as well to help you tell them apart.
STEP 3: Run the likelihood ratio test
Here we will use the lrtest function from the lmtest package.
The hypothesis we are testing is: Is the difference in fit between one model and another a statistically significant improvement?
- Null hypothesis: The two models have equal fit.
- Alternate hypothesis: One model has a better fit.
If p < .05 then we reject the null!
Is model 2 an improvement upon model 1?
Log-likelihood model 1: -479.54
Log-likelihood model 2: -469.41
lrtest(fit_b, fit_a)
Likelihood ratio test
Model 1: sepkin ~ trsntlkin + grplang + pct_ctry
Model 2: sepkin ~ trsntlkin
#Df LogLik Df Chisq Pr(>Chisq)
1 7 -469.41
2 4 -479.54 -3 20.259 0.00015 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(fit_a, fit_b) # so you can see that order of models doesn't matter
Likelihood ratio test
Model 1: sepkin ~ trsntlkin
Model 2: sepkin ~ trsntlkin + grplang + pct_ctry
#Df LogLik Df Chisq Pr(>Chisq)
1 4 -479.54
2 7 -469.41 3 20.259 0.00015 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reject the null - model 2 is a better fit
Is model 3 an improvement upon model 2?
Log-likelihood model 2: -469.41
Log-likelihood model 3: -463.40
lrtest(fit_c, fit_b)
Likelihood ratio test
Model 1: sepkin ~ trsntlkin + kinmatsup + kinmilsup + grplang + pct_ctry
Model 2: sepkin ~ trsntlkin + grplang + pct_ctry
#Df LogLik Df Chisq Pr(>Chisq)
1 9 -463.40
2 7 -469.41 -2 12.021 0.002452 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reject the null - model 3 is a better fit
Almost always, the larger model will have the larger log-likelihood, so you are testing whether that larger log-likelihood is a statistically significant improvement. But remember: the test only tells you whether the difference is significant, so check which model's log-likelihood is larger; if the p-value is significant, that model is the better fit.
9.4 AIC and BIC
Test Overview
AIC stands for Akaike Information Criterion. BIC stands for Bayesian Information Criterion. Both are forms of probabilistic model selection. Probabilistic model selection, or "information criteria," is a type of analysis that creates a score for your model to help you compare model fit and parsimony. You may have seen the AIC at the bottom of your glm() summary.
What is each test doing?
Both of these tests also rely on log-likelihood statistics and Maximum Likelihood Estimation (MLE). Without going into a full mathematical explanation of MLE and the log-likelihood functions it uses, I will simply state that both AIC and BIC apply a formula to these underlying statistics to create a score for your model. That score says how well your model fits the data.
Here’s a resource to learn more about information criteria: https://machinelearningmastery.com/probabilistic-model-selection-measures/
What is the difference between AIC and BIC?
There is a really technical answer to this about how AIC looks for a hypothetical ideal model that doesn't actually exist, while BIC looks for the best model among the actual options. To be honest, I do not know the full theory behind it. The most meaningful difference is that BIC penalizes complicated models with more covariates more heavily than AIC does. So if you are looking for the simplest model, BIC is usually recommended. Most people look at both, but if you had to choose, people probably rely on BIC more. The two will almost always lead to the same conclusion.
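The penalty difference is easy to see in the formulas: AIC = 2k − 2·log-likelihood, while BIC = ln(n)·k − 2·log-likelihood, where k is the number of estimated parameters and n is the sample size. A minimal sketch computing both by hand for fit_b from the likelihood ratio section (AIC() and BIC() do this for you):
# Computing AIC and BIC "by hand" to see the penalty difference
ll <- logLik(fit_b)                # log-likelihood of the model
k  <- attr(ll, "df")               # number of estimated parameters
n  <- nobs(fit_b)                  # number of observations
2 * k - 2 * as.numeric(ll)         # AIC: penalty of 2 per parameter
log(n) * k - 2 * as.numeric(ll)    # BIC: penalty of log(n) per parameter
# With n = 837 here, log(n) is about 6.7, so BIC penalizes each added
# variable more than three times as heavily as AIC does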
AIC and BIC are extremely easy tests to run after a regression, which is why I find them so helpful to compare models. The lower the AIC or BIC, the better the model!
R code
STEP 1: Run a couple regressions
These models do not need to be nested, but they MUST be run on the same set of observations (follow the same procedure as in the lrtest code to drop any observations with missing values before you run your regressions).
First model:
fit_b <- glm(sepkin ~ trsntlkin + grplang + pct_ctry,
             data = df_MAR, family = binomial(link = "logit"))
summary(fit_b)
Call:
glm(formula = sepkin ~ trsntlkin + grplang + pct_ctry, family = binomial(link = "logit"),
data = df_MAR)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3814 -0.8019 -0.7257 1.2542 2.4015
Coefficients:
Estimate Std. Error z value
(Intercept) -3.053297 0.480804 -6.350
trsntlkinClose Kindred/Not Adjoining Base 1.854847 0.438831 4.227
trsntlkinClose Kindred, Adjoining Base 1.763777 0.451477 3.907
trsntlkinClose Kindred, More than 1 Country 2.491353 0.448909 5.550
grplangSome Speak Native Lang 0.231829 0.235364 0.985
grplangMost Speak Native Lang 1.030939 0.272657 3.781
pct_ctry -0.001023 0.006064 -0.169
Pr(>|z|)
(Intercept) 2.15e-10 ***
trsntlkinClose Kindred/Not Adjoining Base 2.37e-05 ***
trsntlkinClose Kindred, Adjoining Base 9.36e-05 ***
trsntlkinClose Kindred, More than 1 Country 2.86e-08 ***
grplangSome Speak Native Lang 0.324634
grplangMost Speak Native Lang 0.000156 ***
pct_ctry 0.865976
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1013.80 on 836 degrees of freedom
Residual deviance: 938.83 on 830 degrees of freedom
AIC: 952.83
Number of Fisher Scoring iterations: 5
Second model:
fit_d <- glm(sepkin ~ kinmatsup + kinmilsup,
             data = df_MAR, family = binomial(link = "logit"))
summary(fit_d)
Call:
glm(formula = sepkin ~ kinmatsup + kinmilsup, family = binomial(link = "logit"),
data = df_MAR)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5197 -0.8202 -0.8202 1.5830 1.7271
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.91655 0.07924 -11.566 < 2e-16 ***
kinmatsupMaterial kinship support -0.31986 0.37158 -0.861 0.389330
kinmilsupMilitary kinship support 1.69269 0.44906 3.769 0.000164 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1013.80 on 836 degrees of freedom
Residual deviance: 998.35 on 834 degrees of freedom
AIC: 1004.3
Number of Fisher Scoring iterations: 4
STEP 2: Note the AIC at the bottom of the regression results
Remember: smaller = better fit.
Model 1
AIC: 952.83
Model 2
AIC: 1004.3
Interpretation: Model 1 is the better fit
Calculate the BIC using the BIC() function
This function is built into R's base statistics (the stats package), so it does not require loading a package. Remember: smaller = better fit.
Model 1
BIC(fit_b)
[1] 985.9374
Model 2
BIC(fit_d)
[1] 1018.535
Model 1 has a better overall fit!
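One convenience worth knowing: AIC() and BIC() both accept multiple fitted models at once and return a small comparison table, which saves calling them one model at a time.
# Compare both models side by side
AIC(fit_b, fit_d)
BIC(fit_b, fit_d)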
9.5 Class Activity
For lab today you will work in two rounds of groups. In the first group, you will be assigned one of the three comparison procedures described in this lab (Wald, likelihood ratio, or AIC/BIC). You will download the activity script file and compare two models using the method you were assigned. You will see a base model, and it is your group's job to create a second model and compare the goodness of fit between the base model and yours.
After your group finishes its analysis, you will move to your second group. In this group, each person will have worked with a different comparison test. Each member of the group should describe the fit test they ran and their results, and explain any part of the coding they found tricky to the other group members.