8 Review: Comparing Models (Stata)

8.1 Lab Overview

This lab reviews a fundamental skill from linear regression that we'll be building on: comparing the fit of models!

Research Question: What characteristics are associated with separatist movements in ethnocultural groups?

Lab Files
Data: Download M.A.R_Cleaned.dta

Codebook: Download MAR_Codebook.pdf

Script File: Download lab4_F2_comparing_models.do

Activity File: Download lab4_F2_activity.do

8.2 Wald Test

Test overview

The Wald test helps you see whether adding a set of variables collectively improves the fit of the model. Said another way, it tells you whether removing two or more variables from the model would substantially harm its fit, meaning those variables contribute something statistically significant to the model.

Test hypotheses:
* Null hypothesis: the coefficients are all equal to 0 (\(H_0: \beta_1 = \beta_2 = 0\))
* Alt hypothesis: the coefficients are not all equal to zero (\(H_A: \beta_1 \neq 0 \text{ or } \beta_2 \neq 0\))

Interpretation: If the p-value for the test is < .05, then we reject the null hypothesis and conclude that the variables jointly explain a significant piece of the model. It is like the p-value for our coefficients, except for multiple coefficients at the same time.


Even though we are looking at the Wald test first, it is actually just an approximation of the likelihood ratio test coming next. If that is the case (you might be asking), why would we even learn the Wald test? Well, the advantage of the Wald test is that it only requires us to run a single model, rather than several. If you have a huge, computationally intensive model, the Wald test might be the better choice. Otherwise, many recommend the likelihood ratio test over the Wald test.

Stata code

STEP 1: Run your regression

logit sepkin i.trsntlkin i.kinmatsup i.kinmilsup i.grplang c.pct_ctry
Iteration 0:   log likelihood = -506.89792  
Iteration 1:   log likelihood = -466.02792  
Iteration 2:   log likelihood =  -463.4466  
Iteration 3:   log likelihood =  -463.4036  
Iteration 4:   log likelihood =  -463.4036  

Logistic regression                                     Number of obs =    837
                                                        LR chi2(8)    =  86.99
                                                        Prob > chi2   = 0.0000
Log likelihood = -463.4036                              Pseudo R2     = 0.0858

-----------------------------------------------------------------------------------------------------
                             sepkin | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
------------------------------------+----------------------------------------------------------------
                          trsntlkin |
  Close Kindred/Not Adjoining Base  |   1.855838   .4394358     4.22   0.000     .9945597    2.717116
     Close Kindred, Adjoining Base  |    1.64484   .4544549     3.62   0.000     .7541243    2.535555
Close Kindred, More than 1 Country  |   2.430812   .4495432     5.41   0.000     1.549723      3.3119
                                    |
                          kinmatsup |
                               Yes  |  -.2695913    .383968    -0.70   0.483    -1.022155    .4829722
                                    |
                          kinmilsup |
                               Yes  |   1.546308   .4640465     3.33   0.001     .6367938    2.455822
                                    |
                            grplang |
            Some Speak Native Lang  |   .2350164   .2380003     0.99   0.323    -.2314556    .7014885
            Most Speak Native Lang  |    1.02658   .2753858     3.73   0.000     .4868337    1.566326
                                    |
                           pct_ctry |  -.0002825   .0061351    -0.05   0.963     -.012307    .0117421
                              _cons |  -3.058763   .4824376    -6.34   0.000    -4.004323   -2.113203
-----------------------------------------------------------------------------------------------------

STEP 2: Use the test command

The Stata command you will be using is the test command. You simply list the two or more coefficients you want to jointly test.

The only tricky part of this command is working with categorical variables. When you are testing categorical variables, you have to specify the category value before the name of the coefficient. We'll run two examples so you can see what I mean.

Example 1: Material and Military Support from Kin
In this example, because both kinmatsup and kinmilsup are binary variables, we want to test whether the coefficients for "Yes" are equal to 0. Yes is coded as 1, so we can refer to those coefficients as 1.kinmatsup and 1.kinmilsup.

test 1.kinmatsup 1.kinmilsup
 ( 1)  [sepkin]1.kinmatsup = 0
 ( 2)  [sepkin]1.kinmilsup = 0

           chi2(  2) =   11.11
         Prob > chi2 =    0.0039

Example 2: All the categories of Language Distinctiveness (grplang)
In this example, we’re only looking at one variable, but because it is a categorical variable with multiple levels, we can test whether its categories jointly add to the model. We will list all the categories of grplang except for the base category: 0 is the base category, and the other two categories are listed as 1 and 2.

test 1.grplang 2.grplang
 ( 1)  [sepkin]1.grplang = 0
 ( 2)  [sepkin]2.grplang = 0

           chi2(  2) =   19.00
         Prob > chi2 =    0.0001
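
Side note: when you want to jointly test every level of a single factor variable, Stata's testparm command saves you from listing each category by hand. A quick sketch using the same model (it should reproduce the chi-squared test from Example 2):

testparm i.grplang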

STEP 3: Interpret the test

If the p-value for a test is LESS THAN .05, we can reject the null hypothesis that the coefficients are all equal to zero, meaning adding these variables improves the fit of the model.

For example 1, we reject the null (\(p = 0.0039\)): adding the two kin-support variables improves fit.

For example 2, we also reject the null (\(p = 0.0001\)): adding the categories of grplang improves fit.

8.3 Likelihood Ratio Test

This is one of the most recommended tests for comparing models in the generalized linear model family, which includes linear regression, logistic regression, probit regression, ordinal models, and Poisson and negative binomial models.

Test Overview

What is likelihood:
The likelihood of a model refers to the probability that the relationship described by our model equation is the true, but unknown, real-life relationship between those variables. Said another way: what are the chances that the coefficients from our model could have produced the actual observations in our dataset? This exact likelihood doesn’t tell us much on its own. However, it lets a regression choose the set of coefficients that has the highest probability of describing our data. Maximum likelihood estimation, which is what the regressions we are using in this class rely on, calculates the coefficients that have the greatest likelihood of producing the observed data. It actually uses the log of the likelihood, which has mathematical properties that make it easier to work with; this is why you will see the term “log likelihood.” The higher the log-likelihood, the better the model fits the data.
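
For a logistic regression like ours, the log-likelihood has a concrete form. It adds up, across observations, the log of the probability the model assigned to each observed outcome:

\[
\ln L = \sum_{i=1}^{n} \left[ y_i \ln \hat{p}_i + (1 - y_i) \ln(1 - \hat{p}_i) \right]
\]

where \(y_i\) is the observed 0/1 outcome and \(\hat{p}_i\) is the model's predicted probability for case \(i\). Each term is the log of a probability, so it is negative; that is why the log-likelihoods in regression output are negative numbers, and why values closer to zero mean a better fit.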

If you really want to dive into the math of this test, the first 14 minutes of this video provide a good, relatively straightforward, explanation of likelihood and log-likelihood. Though beware, there is lots of math notation: https://www.youtube.com/watch?v=ScduwntrMzc

How does the likelihood ratio test work?
Each model that you run has a log-likelihood. You may have noticed this number in the output of your regression results. The likelihood ratio test checks whether the difference between the log-likelihoods of two models is statistically significant. If the difference is statistically significant, then one model has a better fit.

Test hypotheses:
* Null hypothesis: the log-likelihood from one model is equal to that of the other (\(H_0\): log-likelihood model 1 - log-likelihood model 2 = 0)
* Alt hypothesis: the log-likelihoods are not equal to one another (\(H_A\): log-likelihood model 1 - log-likelihood model 2 \(\neq\) 0)

Interpretation: If the p-value for the test is < .05, then we reject the null hypothesis and conclude that one model fits significantly better than the other.
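
Concretely, the test statistic is built from the two log-likelihoods:

\[
LR = -2\left(\ln L_{\text{smaller}} - \ln L_{\text{larger}}\right) \sim \chi^2_{df}
\]

where \(df\) is the number of extra coefficients in the larger model. Stata computes this for you, but knowing the formula makes the output easy to check by hand.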

This is a really math-heavy concept. I wanted to provide an overview for anyone who gets frustrated using these tests without seeing what is inside the black box. If that’s not you…

Here is the takeaway: The likelihood ratio test compares the fit statistic (some number somebody theorized describes how well the model fits the data) from two nested models and tells you if the fit from one is better.

What is a “nested model”:
Two models are “nested” when they have the same dependent variable and the independent variables in one model are a subset of the independent variables in the larger model. Said another way, the models share the same variables, but the larger one adds at least one additional independent variable. To be nested, the models must also be run on the same sample. This can trip you up when running a likelihood ratio test because Stata automatically drops observations with missing values during regressions. When you add a new variable that has missing values in your dataset, the larger model might have fewer cases than the smaller model. You have to make sure you run the smaller model without the cases that the larger model drops.

Stata code

STEP 1: Drop observations with missing values

The easiest way to do this in Stata is to generate a new variable that counts the number of missing values across the variables in our larger model. Then use that variable to drop any cases with missing values.

egen missing = rowmiss(sepkin trsntlkin kinmatsup kinmilsup ///
  grplang pct_ctry)
drop if missing >= 1

(15 observations deleted)
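
An alternative, shown below as a minimal sketch, is to let the largest model define the estimation sample and then restrict the smaller models to it with Stata's e(sample) function (the variable name insample is just illustrative):

* Run the largest model first; e(sample) flags the observations it used
quietly logit sepkin i.trsntlkin i.kinmatsup i.kinmilsup ///
  i.grplang c.pct_ctry
generate byte insample = e(sample)

* Fit the smaller model on exactly the same observations
logit sepkin i.trsntlkin if insample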

STEP 2: Run two or more “nested” models and store the estimates

Now we need to run two or more nested models and store those estimates using the estout package. If you have never used the estout package you will need to install it by running ssc install estout first.

Basic model with only key explanatory variable: trsntlkin

logit sepkin i.trsntlkin
eststo m1 // store results and name it "m1"
Iteration 0:   log likelihood = -506.89792  
Iteration 1:   log likelihood = -481.51069  
Iteration 2:   log likelihood = -479.57107  
Iteration 3:   log likelihood =  -479.5439  
Iteration 4:   log likelihood = -479.54388  

Logistic regression                                     Number of obs =    837
                                                        LR chi2(3)    =  54.71
                                                        Prob > chi2   = 0.0000
Log likelihood = -479.54388                             Pseudo R2     = 0.0540

-----------------------------------------------------------------------------------------------------
                             sepkin | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
------------------------------------+----------------------------------------------------------------
                          trsntlkin |
  Close Kindred/Not Adjoining Base  |   1.931864   .4362052     4.43   0.000     1.076918    2.786811
     Close Kindred, Adjoining Base  |   1.890581    .448724     4.21   0.000     1.011098    2.770064
Close Kindred, More than 1 Country  |   2.564946   .4468653     5.74   0.000     1.689106    3.440786
                                    |
                              _cons |  -2.803357   .4204368    -6.67   0.000    -3.627398   -1.979316
-----------------------------------------------------------------------------------------------------

Second model adding grplang and pct_ctry

logit sepkin i.trsntlkin i.grplang c.pct_ctry
eststo m2 // store results and name it "m2"
Iteration 0:   log likelihood = -506.89792  
Iteration 1:   log likelihood = -471.94329  
Iteration 2:   log likelihood = -469.45513  
Iteration 3:   log likelihood = -469.41434  
Iteration 4:   log likelihood =  -469.4143  

Logistic regression                                     Number of obs =    837
                                                        LR chi2(6)    =  74.97
                                                        Prob > chi2   = 0.0000
Log likelihood = -469.4143                              Pseudo R2     = 0.0739

-----------------------------------------------------------------------------------------------------
                             sepkin | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
------------------------------------+----------------------------------------------------------------
                          trsntlkin |
  Close Kindred/Not Adjoining Base  |   1.854842   .4388378     4.23   0.000     .9947358    2.714948
     Close Kindred, Adjoining Base  |   1.763772   .4514839     3.91   0.000     .8788802    2.648665
Close Kindred, More than 1 Country  |   2.491347   .4489156     5.55   0.000     1.611489    3.371206
                                    |
                            grplang |
            Some Speak Native Lang  |   .2318286   .2353641     0.98   0.325    -.2294765    .6931337
            Most Speak Native Lang  |   1.030939   .2726573     3.78   0.000     .4965405    1.565337
                                    |
                           pct_ctry |  -.0010235   .0060641    -0.17   0.866    -.0129089     .010862
                              _cons |  -3.053292   .4808103    -6.35   0.000    -3.995663   -2.110921
-----------------------------------------------------------------------------------------------------

Third model adding kinmatsup and kinmilsup

logit sepkin i.trsntlkin i.kinmatsup i.kinmilsup i.grplang c.pct_ctry 
eststo m3 // store results and name it "m3"
Iteration 0:   log likelihood = -506.89792  
Iteration 1:   log likelihood = -466.02792  
Iteration 2:   log likelihood =  -463.4466  
Iteration 3:   log likelihood =  -463.4036  
Iteration 4:   log likelihood =  -463.4036  

Logistic regression                                     Number of obs =    837
                                                        LR chi2(8)    =  86.99
                                                        Prob > chi2   = 0.0000
Log likelihood = -463.4036                              Pseudo R2     = 0.0858

-----------------------------------------------------------------------------------------------------
                             sepkin | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
------------------------------------+----------------------------------------------------------------
                          trsntlkin |
  Close Kindred/Not Adjoining Base  |   1.855838   .4394358     4.22   0.000     .9945597    2.717116
     Close Kindred, Adjoining Base  |    1.64484   .4544549     3.62   0.000     .7541243    2.535555
Close Kindred, More than 1 Country  |   2.430812   .4495432     5.41   0.000     1.549723      3.3119
                                    |
                          kinmatsup |
                               Yes  |  -.2695913    .383968    -0.70   0.483    -1.022155    .4829722
                                    |
                          kinmilsup |
                               Yes  |   1.546308   .4640465     3.33   0.001     .6367938    2.455822
                                    |
                            grplang |
            Some Speak Native Lang  |   .2350164   .2380003     0.99   0.323    -.2314556    .7014885
            Most Speak Native Lang  |    1.02658   .2753858     3.73   0.000     .4868337    1.566326
                                    |
                           pct_ctry |  -.0002825   .0061351    -0.05   0.963     -.012307    .0117421
                              _cons |  -3.058763   .4824376    -6.34   0.000    -4.004323   -2.113203
-----------------------------------------------------------------------------------------------------

NOTE: I always give the smaller model names (m1, m2, …) to the models with fewer variables. You might include the number of variables or degrees of freedom in the name as well to help you tell the models apart.

STEP 3: Run the likelihood ratio test

Here we will use Stata’s lrtest command on our stored estimates.

The hypothesis we are testing is: Is the difference in fit between one model and the other a statistically significant improvement?

  • Null hypothesis: The two models have equal fit.
  • Alternate hypothesis: One model has a better fit.

If p < .05 then we reject the null!

Is model 2 an improvement upon model 1?
Log-likelihood model 1: -479.54388
Log-likelihood model 2: -469.4143

lrtest m2 m1
lrtest m1 m2 // so you can see that order of models doesn't matter
Likelihood-ratio test
Assumption: m1 nested within m2

 LR chi2(3) =  20.26
Prob > chi2 = 0.0001


Likelihood-ratio test
Assumption: m1 nested within m2

 LR chi2(3) =  20.26
Prob > chi2 = 0.0001

Reject the null - model 2 is a better fit

Is model 3 an improvement upon model 2?
Log-likelihood model 2: -469.4143
Log-likelihood model 3: -463.4036

lrtest m3 m2
Likelihood-ratio test
Assumption: m2 nested within m3

 LR chi2(2) =  12.02
Prob > chi2 = 0.0025

Reject the null - model 3 is a better fit
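
As a check on the output above, you can reproduce the test statistic by hand from the two log-likelihoods:

\[
LR = -2\left[(-469.4143) - (-463.4036)\right] = 12.02
\]

with 2 degrees of freedom for the two added coefficients (1.kinmatsup and 1.kinmilsup), matching Stata's LR chi2(2) = 12.02.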

Almost always, the larger model will have the larger log-likelihood, so you are testing whether that improvement is statistically significant. But remember, the test only tells you whether the difference is significant, so check which log-likelihood is larger (closer to zero); if the p-value is significant, that model is the better fit.

8.4 AIC and BIC

Test Overview

AIC stands for Akaike information criterion. BIC stands for Bayesian information criterion. Both are forms of probabilistic model selection. Probabilistic model selection, or “information criteria,” is a type of analysis that produces a score for your model to help you compare model fit and parsimony.

What is each test doing?
Both of these tests also rely on log-likelihood statistics and maximum likelihood estimation (MLE). Without going into a full mathematical explanation of MLE and the log-likelihood functions it uses, I will simply state that both AIC and BIC plug these underlying statistics into a formula that produces a score for your model (see the formulas below). That score says how well your model fits the data.
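
For reference, both scores are simple functions of the model's log-likelihood \(\ln L\), the number of estimated parameters \(k\), and the sample size \(n\):

\[
AIC = 2k - 2\ln L \qquad\qquad BIC = k\ln(n) - 2\ln L
\]

The \(k \ln(n)\) term is what makes BIC penalize extra parameters more heavily than AIC's \(2k\) whenever \(n\) is larger than about 7 (since then \(\ln n > 2\)).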

Here’s a resource to learn more about information criteria: https://machinelearningmastery.com/probabilistic-model-selection-measures/

What is the difference between AIC and BIC?
There is a really technical answer to this: AIC is looking for a hypothetical ideal model that doesn’t actually exist, while BIC is looking for the best model among the actual options. To be honest, I do not know the full theory behind it. The most meaningful difference is that BIC penalizes complicated models with more covariates more heavily than AIC does. So if you are looking for the simplest model, BIC is usually recommended. Most people look at both, but if you really have to choose, people probably rely on BIC more. The two will almost always lead to the same conclusion.

AIC and BIC are extremely easy tests to run after a regression, which is why I find them so helpful to compare models. The lower the AIC or BIC, the better the model!

Stata code

STEP 1: Run a couple regressions followed by estat ic

These models do not need to be nested, but they MUST be run on the same set of observations (follow the same procedure in the lrtest code to drop any observations with missing values before you run your regression).

For this output I am running the regressions quietly so that you only see the AIC/BIC output.

First model:

quietly logit sepkin i.trsntlkin i.grplang c.pct_ctry 
estat ic 
Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
           . |        837  -506.8979  -469.4143       7   952.8286   985.9374
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] BIC note.

Second model:

quietly logit sepkin i.kinmatsup i.kinmilsup 
estat ic 
Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
           . |        837  -506.8979  -499.1726       3   1004.345   1018.535
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] BIC note.

STEP 2: Interpret the AIC/BIC

Remember: smaller = better fit

Model 1

  • AIC: 952.8286
  • BIC: 985.9374

Model 2

  • AIC: 1004.345
  • BIC: 1018.535

Model 1 has a better overall fit!
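
As a sanity check, you can reproduce model 1's scores by hand from the estat ic output, where \(k = 7\) (the df column) and \(\ln L = -469.4143\):

\[
AIC = 2(7) - 2(-469.4143) = 952.8286 \qquad BIC = 7\ln(837) - 2(-469.4143) \approx 985.937
\]

both matching the values Stata reports.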

8.5 Class Activity

For lab today you will be put into two groups. In the first group, you will be assigned one of the three comparison procedures described in this lab (Wald, Likelihood Ratio, or AIC/BIC). You will download the activity script file and compare two models using the method you were assigned. You will see a base model, and it is your group’s job to create a second model and compare the goodness of fit between the base model and your model.

After your group finishes its analysis, you will move to your second group. In this group, each person will have worked with a different comparison test. Each member of the group should describe the fit test they ran and the results. You should also explain to your group members any part of the coding that you found tricky.