Chapter 13 Dummy Dependent Variables
We will learn techniques in R to estimate and interpret models in which the dependent variable is categorical. In particular we will learn to estimate linear probability models, probit models, and logit models. We will use the libraries below.
data(mtcars)
library(tidyverse)
library(magrittr)
library(broom)
13.1 Probit Estimation
The probit model is given by Pr(Yi=1)=Φ(β0+β1X1i) where Φ() is the standard normal CDF. Let’s make use of the mtcars
36 data set to estimate a probit model to determine engine type as a function of mpg. Engine type, vs, is coded as 0 for V-shaped and 1 for straight.
13.1.1 EDA
Let’s look at a scatter plot and a box plot of mpg vs vs.
%>%
mtcars ggplot(aes(x = mpg, y = vs)) +
geom_point()
%>%
mtcars mutate(vs = as.factor(vs)) %>%
ggplot(aes(x = vs, y = mpg)) +
geom_boxplot()
The boxplot indicates that there is difference in mpg between straight vs v-shaped engines. Note the difference in the code between the two similar calls. We need to treat vs as a factor in the boxplot but not in the scatter diagram.
13.1.2 Estimate the model
Let’s estimate the probit model Pr(vsi=1)=Φ(β0+β1mpgi). glm
is used to fit dummy dependent variable models.37 To estimate the probit model glm
requires three arguments: formula
, family
, and data
. You are familiar with the data and formula arguments. The family argument is description of the error distribution. In this case our family argument will be binomial(link = "probit")
.
%$%
mtcars glm(vs ~ mpg, family = binomial(link = "probit")) %>%
tidy()
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -5.09 1.64 -3.10 0.00193
2 mpg 0.246 0.0825 2.98 0.00286
13.1.3 Estimated Effects
13.1.3.1 Discrete Difference
13.1.3.1.1 X1 is Continuous
To estimate the effect of a change in the independent variable on the probability of observing the dependent variable we need to calculate the average difference between the fitted values of the model, P1, and the predicted values of the model when the independent variable we are interested in is changed by one standard deviation, P2.
Fitted values, P1, are easily obtained from the glm object as follows:
vs_glm <- mtcars %$%
glm(vs ~ mpg, family = binomial(link = "probit"))
vs_glm$fitted P1 <-
The fitted variables have had pnorm()
applied to the linear estimates, so they are P1.
To obtain marginal effects, we need to let mpg vary by one standard deviation and obtain the predicted values from the estimated equation. To find P2, the predicted values resulting from a one standard deviation change in the independent variable, we will make use of predict.glm
. predict.glm
38 will require two arguments to estimate P2, the equation object and the newdata predict.glm(object, newdata = df
. Unfortunately the expose pipe %$%
does not function with predict.glm
, so we will have to create a data frame of the changed independent variable. We will use the dplyr
verbs select
and mutate
to create the new data frame. We calculate P2 below:
# Create the new data
newdata <-%>%
mtcars dplyr::select(mpg) %>% #I used this form to avoid the conflict with select in the MASS package
mutate(mpg = mpg + sd(mpg))
# Create P2
P2 <- predict.glm(vs_glm, newdata) %>%
pnorm()
# Marginal Effect
mean(P2-P1)
[1] 0.339
So, a one standard deviation increase in mpg will yield a 33.89% increase in the probability that the car has straight engine.
13.1.3.1.2 Independent variable is a dummy.
Let’s add am, transmission type, to the model. am is coded as 0 if the car has an automatic transmission and 1 if it has a manual transmission. First, estimate the model Pr(vsi=1)=Φ(β0+β1am+β2mpgi).
%$%
mtcars glm(vs ~ am + mpg, family = binomial(link = "probit"))
Call: glm(formula = vs ~ am + mpg, family = binomial(link = "probit"))
Coefficients:
(Intercept) am mpg
-7.47 -1.84 0.40
Degrees of Freedom: 31 Total (i.e. Null); 29 Residual
Null Deviance: 43.9
Residual Deviance: 20.3 AIC: 26.3
We will follow similar steps as those above to interpret a change from automatic to manual transmission on the probability that the engine is straight. We will estimate P0, the fitted values, when am = 0, and P1, the fitted values when am = 1.
# Estimate the model
vs_am_glm <- mtcars %$%
glm(vs ~ am + mpg, family = binomial(link = "probit"))
# P0
newdata <- mtcars %>%
dplyr::select(am, mpg) %>%
mutate(am = 0)
P0 <- predict.glm(vs_am_glm, newdata) %>%
pnorm()
# P1
newdata <- mtcars %>%
dplyr::select(am, mpg) %>%
mutate(am = 1)
P1 <- predict.glm(vs_am_glm, newdata) %>%
pnorm()
mean(P1-P0)
[1] -0.269
A car with an manual transmission is 26.9% less likely, on average, to have a straight engine, ceteris paribus.
13.1.3.2 Marginal Effects
If X1 is continuous we can estimate the marginal effects of a change in X1 as ϕ(ˆβ0+ˆβ1X1i+ˆβ2X2i)ˆβ1. Where ϕ() is the normal PDF. Let’s estimate the marginal effect of mpg on vs using the model above.
marg_effect <-dnorm(vs_am_glm$coef[1] + vs_am_glm$coef[2]*mtcars$am + vs_am_glm$coef[3]*mtcars$mpg) * vs_am_glm$coef[3]
mean(marg_effect)
[1] 0.0692
The marginal effect of mpg on type of engine is 0.069.
13.1.3.2.1 mfx
and margins
Packages
We can use the mfx
and margins
packages to estimate the marginal effect of a continuous variable directly from the model we estimate. mfx::probitmfx(formula, data, atmean = F)
and margins::margins(model)
are the respective function calls to estimate marginal effects from the two packages.
# mfx
::probitmfx(vs_am_glm, mtcars, atmean = F) mfx
Call:
mfx::probitmfx(formula = vs_am_glm, data = mtcars, atmean = F)
Marginal Effects:
dF/dx Std. Err. z P>|z|
am -0.26942 0.09154 -2.94 0.0032 **
mpg 0.06925 0.00765 9.05 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dF/dx is for discrete change for the following variables:
[1] "am"
Note that these values are identically to the ones calculated by hand above.
# margins
::margins(vs_am_glm, data = mtcars) margins
am mpg
-0.3185 0.06925
The marginal effect of mpg is the same, while the effect of am is similar. ?margins
or An Introduction to ‘margins’ for more on the margins
package.
13.2 Logit Estimation
The logit model takes the form Pr(Yi=1)=eβ0+β1X1i1+eβ0+β1X1i An alternative form of the logit model might be easier to interpret. With appropriate algebraic gymnastics we can write the logistic model as ln(pi1−pi)=β0+β1X1i, where ln(p11−pi) is the log of the odds ratio.
Let’s estimate the model from above as a logit rather than a probit. All we need to do is change the link argument to logit to estimate the model.
%$%
mtcars glm(vs ~ mpg + am, family = binomial(link = "logit"))
Call: glm(formula = vs ~ mpg + am, family = binomial(link = "logit"))
Coefficients:
(Intercept) mpg am
-12.705 0.681 -3.007
Degrees of Freedom: 31 Total (i.e. Null); 29 Residual
Null Deviance: 43.9
Residual Deviance: 20.6 AIC: 26.6
Suppose we’d like to know the probability that a vehicle with automatic transmission that gets 25 mpg has a straight engine. Calculate the odds ratio as ln(p11−pi)=−12.7051+0.6809∗25−3.0073∗0=4.9474. Exponentiate both sides and solve for p. eln(pi1−pi)=e4.9474. We know that an exponentiated natural log is just itself so we have pi1−pi=140.808. Solving for p yields pi=140.808141.808=.9925. The probability we are looking for is 99.25%. So, ˆp=eˆβ0+ˆβ1X11+eˆβ0+ˆβ1X1.
13.2.1 Discrete Differences
The discrete-difference can be calculated as the difference in two probabilities. We can estimate the mean change in probability from an increase in mpg of 1.
vs_logit <- mtcars %$%
glm(vs ~ mpg + am, family = binomial(link = "logit"))
# p1 are the fitted values of the regression
vs_logit$fitted
p1 <-# to calculate p2 add one to mpg and find the predicted values
newdata <- mtcars %>%
dplyr::select(mpg, am) %>%
mutate(mpg = mpg + 1)
exp(predict(vs_logit, newdata))/(1+exp(predict(vs_logit, newdata)))
p2 <-# calcualte the mean difference between the p2 and p1
mean(p2-p1)
[1] 0.0727
On average an increase of 1 mpg will increase the probability the car has straight engine by 7.3%.
13.2.2 Marginal Effects
Use the mfx
or margins
package to estimate the marginal effects of a change in an independent variable.
# mfx
::logitmfx(vs_logit, mtcars, atmean = F) mfx
Call:
mfx::logitmfx(formula = vs_logit, data = mtcars, atmean = F)
Marginal Effects:
dF/dx Std. Err. z P>|z|
mpg 0.0692 0.0453 1.53 0.1267
am -0.2618 0.0941 -2.78 0.0054 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dF/dx is for discrete change for the following variables:
[1] "am"
# margins
::margins(vs_logit, mtcars) margins
mpg am
0.06923 -0.3057
13.3 Testing Hypotheses
Let’s estimate a new probit model Pr(vsi=1)=Φ(β0+β1am+β2mpgi+β3hpi) using the mtcars
data set and test the hypothesis that our model has overall explanatory power. H0:β1=β2=β3=0
vs. @ least one β≠0
We an estimate a restricted model and compare the likelihood ratios to the likelihood ratio of the unrestricted model and perform the LR test where LR=2(logLUR−logLR)~χ2df. Where the df is equal to the number of restrictions or number of equal signs in H0.
ur_model <- mtcars %$%
glm(vs ~ am + mpg + hp, family = binomial(link = "probit"))
r_model <- mtcars %$%
glm(vs ~ 1, family = binomial(link = "probit"))
2*(logLik(ur_model)[1]-logLik(r_model)[1])
lr <-1 - pchisq(lr, 3)
[1] 0.000000302
We can reject H0.
Instead, let’s use lrtest
from the lmtest
package to test hypotheses about our limited dependent variable models. We can specify the restrictions as an argument in the call.
::lrtest(ur_model, c("am", "mpg", "hp")) lmtest
Likelihood ratio test
Model 1: vs ~ am + mpg + hp
Model 2: vs ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 4 -5.36
2 1 -21.93 -3 33.1 0.0000003 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s test the null hypothesis H0:β2=β3 H1:β2≠β3 The restricted model becomes Pr(vsi=1)=Φ(β0+β1am+β2(mpgi+hpi))
r_model <- mtcars %$%
glm(vs ~ am + I(mpg + hp), family = binomial(link = "probit"))
::lrtest(ur_model, r_model) lmtest
Likelihood ratio test
Model 1: vs ~ am + mpg + hp
Model 2: vs ~ am + I(mpg + hp)
#Df LogLik Df Chisq Pr(>Chisq)
1 4 -5.36
2 3 -6.63 -1 2.53 0.11
We fail to reject H0 and conclude that we have no evidence to believe that β2≠β3.
We would test hypotheses concerning logit models in same way.
13.4 Graphing Probit and Logit Models
%>%
mtcars ggplot(aes(x = mpg, y = vs)) +
geom_point() +
geom_smooth(method = "glm", method.args=list(family=binomial(link = "probit")), se = F) +
ggtitle("Probit")
%>%
mtcars ggplot(aes(x = mpg, y = vs)) +
geom_point() +
geom_smooth(method = "glm", method.args=list(family=binomial(link = "logit")), se = F) +
ggtitle("Logit")