8 Review: Comparing Models (Stata)
8.1 Lab Overview
This lab reviews a fundamental from linear regression we’ll be adding onto: comparing the fit of models!
Research Question: What characteristics are associated with separtist movements in ethnocultural groups?
Lab Files
Data:
Download M.A.R_Cleaned.dta
Codebook: Download MAR_Codebook.pdf
Script File: Download lab4_F2_comparing_models.do
Activity File: Download lab4_F2_activity.do
8.2 Wald Test
Test overview
The Wald test helps you see whether adding a set of variables collectively improves the fit of the model. Said another way, we can see if removing the two or more variables from the model would substantially harm the fit of the model. This would mean it removes something statistically significant to the fit of the model.
Test hypotheses:
* Null hypothesis: the coefficients are all equal to 0 (\(H_0: X1 = X2 = 0\))
* Alt hypothesis: the coefficients are not all equal to zero (\(H_A: X1 \ or \ X2 \neq 0\))
Interpretation: If the p value for the test is < .05, then we reject the null hypothesis and the variables significantly explain some piece of the model. It is like the pvalue for our coefficients, except for multiple coefficients at the same time.
Even though we are looking at the Wald Test first, it is actually just an approximation of the likelihood ratio test coming next. If that is the case (you might be asking), why would we even learn the Wald test? Well, the advantage of the Wald test is that it only requires that we run a single model, rather than several. If you have a huge, computationally intensive model, the Wald test might be the better choice. Otherwise, many recommend the likelihood ratio test over the Wald Test.
Stata code
STEP 1: Run your regression
logit sepkin i.trsntlkin i.kinmatsup i.kinmilsup i.grplang c.pct_ctry
Iteration 0: log likelihood = 506.89792
Iteration 1: log likelihood = 466.02792
Iteration 2: log likelihood = 463.4466
Iteration 3: log likelihood = 463.4036
Iteration 4: log likelihood = 463.4036
Logistic regression Number of obs = 837
LR chi2(8) = 86.99
Prob > chi2 = 0.0000
Log likelihood = 463.4036 Pseudo R2 = 0.0858

sepkin  Coefficient Std. err. z P>z [95% conf. interval]
+
trsntlkin 
Close Kindred/Not Adjoining Base  1.855838 .4394358 4.22 0.000 .9945597 2.717116
Close Kindred, Adjoining Base  1.64484 .4544549 3.62 0.000 .7541243 2.535555
Close Kindred, More than 1 Country  2.430812 .4495432 5.41 0.000 1.549723 3.3119

kinmatsup 
Yes  .2695913 .383968 0.70 0.483 1.022155 .4829722

kinmilsup 
Yes  1.546308 .4640465 3.33 0.001 .6367938 2.455822

grplang 
Some Speak Native Lang  .2350164 .2380003 0.99 0.323 .2314556 .7014885
Most Speak Native Lang  1.02658 .2753858 3.73 0.000 .4868337 1.566326

pct_ctry  .0002825 .0061351 0.05 0.963 .012307 .0117421
_cons  3.058763 .4824376 6.34 0.000 4.004323 2.113203

STEP 2: STEP 2: Use the test
command
The Stata command you will be using is the test
command. You simply have to list the
two or more variables you want to jointly test.
The only tricky part of this command is working with categorical variables. When you are comparing categorical variables you have to specify the category value before the name of the coefficient. We’ll run two examples so you can see what I mean.
Example 1: Material and Military Support from Kin
In this example because both kinmatsup
and kinmilsup
are binary variables,
we want to test whether the coefficients for “Yes” are equal to 0.Yes is coded as 1, so we can refer to that coefficient as 1.kinmatsup
and 1.kinmilsup
.
test 1.kinmatsup 1.kinmilsup
( 1) [sepkin]1.kinmatsup = 0
( 2) [sepkin]1.kinmilsup = 0
chi2( 2) = 11.11
Prob > chi2 = 0.0039
Example 2: All the categories of Language Distinctiveness (grplang
)
In this example, we’re only looking at one variable, but because it is a categorical variable with multiple levels we can test to see whether each category, jointly, adds to the model. We will list all the categories of grplang
except for the base category. 0 is the base category and the other two categories are listed as 1 and 2.
test 1.grplang 2.grplang
( 1) [sepkin]1.grplang = 0
( 2) [sepkin]2.grplang = 0
chi2( 2) = 19.00
Prob > chi2 = 0.0001
STEP 3: Interpret the test
If the pvalue for a test is LESS THAN .05 we can reject the null hypothesis that the coefficients are both equal to zero, meaning adding these two variables improves the fit of the model.
For example 1, we can reject the null (\(p = 0.0039\)), adding the two variables improves fit.
For example 2, we can also reject the null (\(p = 0.0001\)), adding the categories of grplang
improves fit.
8.3 Likelihood Ratio Test
This is one of the most recommended tests to use when comparing models in the generalized linear model family, which contains linear regression, logistic regression, probit regression, ordinal models, poisson, and negative binomial models.
Test Overview
What is likelihood:
The likelihood of a model refers to the probability that relationship between variables described in our model equation is the true, but unknown reallife relationship between those variables. Said another way, what are the chances that the coefficients from our model could predict the actual observations in our dataset. This is exact likelihood doesn’t tell us much on it’s own. However, it helps regressions choose the set of coefficients that has the highest probability of describing our data. Maximum likelihood estimation, which is what the regressions we are using in this class rely on, calculates coefficients have the greatest likelihood of producing the observed data. Maximum likelihood estimation actually uses the log of the likelihood, because it has some mathematical properties that make it easier to use, which is why you will see the term “log likelihood.” The higher the loglikelihood, the better the model fits the data.
If you really want to dive into the math of this test, the first 14 minutes of this video provide a good, relatively straightforward, explanation of likelihood and loglikelihood. Though beware, there is lots of math notation: https://www.youtube.com/watch?v=ScduwntrMzc
How does the likelihood ratio test work?
Each model that you run will have a loglikelihood. You might have actually noticed this number provided in the output of your regression results. The likelihood ratio test looks to see if the difference between the loglikelihood from two models is statistically significant. If the difference is statistically significant, then one model has a better fit.
Test hypotheses:
* Null hypothesis: the loglikelihoods from one model is equal to the other (\(H_0\): loglikelihood model 1  loglikelihood model 2 = 0)
* Alt hypothesis: the loglikelihoods are not equal to one another (\(H_A\): loglikelihood model 1  loglikelihood model 2 \(\neq\) 0)
Interpretation: If the p value for the test is < .05, then we reject the null hypothesis and one model improves the fit of the model.
This is a really mathheavy concept. I wanted to provide an overview for anyone who gets frustrated using these tests without seeing what is in side the black box. If that’s not you…
Here is the takeaway: The likelihood ratio test compares the fit statistic (some number somebody theorized describes how well the model fits the data) from two nested models and tells you if the fit from one is better.
What is a “nested model”:
Two models are “nested” when they have the same dependent variable and the independent variables from one model are a subset of the independent variables in the larger model. Said another way, each model has the same variables, but the second one adds at least one additional independent variable. To be nested the models must also be run on the same sample. This can trip you up when running a likelihood ratio test because Stata automatically drops missing values during regressions. When you add a new variable that has missing values in your dataset, the larger model might have less cases than the smaller model. You have to make sure you run the smaller model without the cases that the larger model drops.
Stata code
STEP 1: Drop observations with missing values
The easiest way to do this in Stata is to generate a new variable that counts the number of missing values across the variables in our larger model. Then use that variable to drop any cases with missing values.
egen missing = rowmiss (sepkin trsntlkin kinmatsup kinmilsup ///
grplang pct_ctry)drop if missing >= 1
> grplang pct_ctry)
(15 observations deleted)
STEP 2: Run two or more “nested” models and store the estimates
Now we need to run two or more nested models and store those estimates using the estout
package. If you have never used the estout
package you will need to
install it by running ssc install estout
first.
Basic model with only key explanatory variable: trsntlkin
logit sepkin i.trsntlkin
// store results and name it "m1" eststo m1
Iteration 0: log likelihood = 506.89792
Iteration 1: log likelihood = 481.51069
Iteration 2: log likelihood = 479.57107
Iteration 3: log likelihood = 479.5439
Iteration 4: log likelihood = 479.54388
Logistic regression Number of obs = 837
LR chi2(3) = 54.71
Prob > chi2 = 0.0000
Log likelihood = 479.54388 Pseudo R2 = 0.0540

sepkin  Coefficient Std. err. z P>z [95% conf. interval]
+
trsntlkin 
Close Kindred/Not Adjoining Base  1.931864 .4362052 4.43 0.000 1.076918 2.786811
Close Kindred, Adjoining Base  1.890581 .448724 4.21 0.000 1.011098 2.770064
Close Kindred, More than 1 Country  2.564946 .4468653 5.74 0.000 1.689106 3.440786

_cons  2.803357 .4204368 6.67 0.000 3.627398 1.979316

Second model adding grplang and pct_countrykinmatsup
logit sepkin i.trsntlkin i.grplang c.pct_ctry
// store results and name it "m2" eststo m2
Iteration 0: log likelihood = 506.89792
Iteration 1: log likelihood = 471.94329
Iteration 2: log likelihood = 469.45513
Iteration 3: log likelihood = 469.41434
Iteration 4: log likelihood = 469.4143
Logistic regression Number of obs = 837
LR chi2(6) = 74.97
Prob > chi2 = 0.0000
Log likelihood = 469.4143 Pseudo R2 = 0.0739

sepkin  Coefficient Std. err. z P>z [95% conf. interval]
+
trsntlkin 
Close Kindred/Not Adjoining Base  1.854842 .4388378 4.23 0.000 .9947358 2.714948
Close Kindred, Adjoining Base  1.763772 .4514839 3.91 0.000 .8788802 2.648665
Close Kindred, More than 1 Country  2.491347 .4489156 5.55 0.000 1.611489 3.371206

grplang 
Some Speak Native Lang  .2318286 .2353641 0.98 0.325 .2294765 .6931337
Most Speak Native Lang  1.030939 .2726573 3.78 0.000 .4965405 1.565337

pct_ctry  .0010235 .0060641 0.17 0.866 .0129089 .010862
_cons  3.053292 .4808103 6.35 0.000 3.995663 2.110921

Third model adding kinmatsup and kinmilsup
logit sepkin i.trsntlkin i.kinmatsup i.kinmilsup i.grplang c.pct_ctry
// store results and name it "m3" eststo m3
Iteration 0: log likelihood = 506.89792
Iteration 1: log likelihood = 466.02792
Iteration 2: log likelihood = 463.4466
Iteration 3: log likelihood = 463.4036
Iteration 4: log likelihood = 463.4036
Logistic regression Number of obs = 837
LR chi2(8) = 86.99
Prob > chi2 = 0.0000
Log likelihood = 463.4036 Pseudo R2 = 0.0858

sepkin  Coefficient Std. err. z P>z [95% conf. interval]
+
trsntlkin 
Close Kindred/Not Adjoining Base  1.855838 .4394358 4.22 0.000 .9945597 2.717116
Close Kindred, Adjoining Base  1.64484 .4544549 3.62 0.000 .7541243 2.535555
Close Kindred, More than 1 Country  2.430812 .4495432 5.41 0.000 1.549723 3.3119

kinmatsup 
Yes  .2695913 .383968 0.70 0.483 1.022155 .4829722

kinmilsup 
Yes  1.546308 .4640465 3.33 0.001 .6367938 2.455822

grplang 
Some Speak Native Lang  .2350164 .2380003 0.99 0.323 .2314556 .7014885
Most Speak Native Lang  1.02658 .2753858 3.73 0.000 .4868337 1.566326

pct_ctry  .0002825 .0061351 0.05 0.963 .012307 .0117421
_cons  3.058763 .4824376 6.34 0.000 4.004323 2.113203

NOTE: I always give the smaller model names to models with less variables. You might include the number of variables or degrees of freedom as well to help you tell them apart.
STEP 3: Run the liklihood ratio test
Here we will use Stata’s lrtest command on our stored estimates.
The hypothesis we are testing is: Is difference in fit between one model and another model a statistically significant improvement?
 Null hypothesis: The two models have equal fit.
 Alternate hypothesis: One model has a better fit.
If p < .05 then we reject the null!
Is model 2 an improvement upon model 1?
Loglikelihood model 1: 479.54388
Loglikelihood model 2: 469.4143
lrtest m2 m1
lrtest m1 m2 // so you can see that order of models doesn't matter
Likelihoodratio test
Assumption: m1 nested within m2
LR chi2(3) = 20.26
Prob > chi2 = 0.0001
Likelihoodratio test
Assumption: m1 nested within m2
LR chi2(3) = 20.26
Prob > chi2 = 0.0001
Reject the null  model 2 is a better fit
Is model 3 an improvement upon model 2?
Loglikelihood model 2: 469.4143
Loglikelihood model 3: 463.4036
lrtest m3 m2
Likelihoodratio test
Assumption: m2 nested within m3
LR chi2(2) = 12.02
Prob > chi2 = 0.0025
Reject the null  model 3 is a better fit
Almost always the larger model will have the larger loglikelihood, so you are testing if that larger loglikelihood is a statistically significant improvement. But remember the test is telling you whether the difference is significant, so check to see which loglikelihood is larger, and that is the one with the better fit if you have a significant pvalue.
8.4 AIC and BIC
Test Overview
AIC stands for Akaike Information criteria. BIC stands for Bayesian Information Criteria. Both are forms of probabilistic model selection. Probabilistic model selection or “information criteria” is a type of analysis that creates a score for your model to help you compare model fit and parsimony.
What is each test doing?
Both of these tests also rely on loglikelihood statistics and Maximum Likelihood Estimation (MLE). Without going into a full mathematical explanation of MLE and the log liklihood functions it uses, I will simply state that both AIC and BIC come up with a formula using these underlying statistics to create a score for your model. That score says how well your model fits the data.
Here’s a resource to learn more about information criteria: https://machinelearningmastery.com/probabilisticmodelselectionmeasures/
What is the difference between AIC and BIC
There is a really technical answer to this about how AIC is looking for a hypothetical ideal model that doesn’t actually exist and BIC is looking for the best model among actual options. To be honest, I do not know the full theory behind it. The most meaningful difference is that BIC penalizes complicated models with more covariates more than AIC does. So if you are looking for the simplest model, BIC is usually recommended. Most people look at both, but if you really want to choose, people probably rely on BIC more. They will almost always lead to the same conclusion.
AIC and BIC are extremely easy tests to run after a regression, which is why I find them so helpful to compare models. The lower the AIC or BIC, the better the model!
Stata code
STEP 1: Run a couple regressions followed by estat ic
These models do not need to be nested, but they MUST be run on
the same set of observations (follow the same procedure in the lrtest
code to drop any observations with missing values before you run your
regression).
For this output I am running the regressions quietly so that you only see the AIC/BIC output.
First model:
quietly logit sepkin i.trsntlkin i.grplang c.pct_ctry
estat ic
Akaike's information criterion and Bayesian information criterion

Model  N ll(null) ll(model) df AIC BIC
+
.  837 506.8979 469.4143 7 952.8286 985.9374

Note: BIC uses N = number of observations. See [R] BIC note.
Second model:
quietly logit sepkin i.kinmatsup i.kinmilsup
estat ic
Akaike's information criterion and Bayesian information criterion

Model  N ll(null) ll(model) df AIC BIC
+
.  837 506.8979 499.1726 3 1004.345 1018.535

Note: BIC uses N = number of observations. See [R] BIC note.
STEP 2: Interpret the AIC/BIC
Remember: smaller = better fit
Model 1
 AIC: 952.8286
 BIC: 985.9374
Model 2
 AIC: 1004.345
 BIC: 1018.535
Model 1 has a better overall fit!
8.5 Class Activity
For lab today you will be put into two groups. In the first group, you will be assigned one of the three comparison procedures described in this lab (Wald, Likelihood Ratio, or AIC/BIC). You will download the activity script file and compare two models using the method you were assigned. You will see a base model and it is your group’s job to create a second model and compare the goodness of fit between the base model and your model.
After your group finishes your analysis you will move to your second group. In this group, each person will have worked with a different comparison test. Each member of the group should describe the fit test that you ran and the results. You should also explain any part of the coding that you found tricky to your group members.