11.3 Estimation and Inference in the Logit and Probit Models
So far nothing has been said about how Logit and Probit models are estimated by statistical software. The reason this matters is that both models are nonlinear in the parameters and thus cannot be estimated using OLS. Instead, one relies on maximum likelihood estimation (MLE). Another approach is estimation by nonlinear least squares (NLS).
Nonlinear Least Squares
Consider the multiple regression Probit model

$$E(Y_i \vert X_{1i}, \dots, X_{ki}) = P(Y_i = 1 \vert X_{1i}, \dots, X_{ki}) = \Phi(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}). \tag{11.8}$$

Similarly to OLS, NLS estimates the parameters $\beta_0, \beta_1, \dots, \beta_k$ by minimizing the sum of squared mistakes

$$\sum_{i=1}^n \left[ Y_i - \Phi(b_0 + b_1 X_{1i} + \dots + b_k X_{ki}) \right]^2.$$

NLS estimation is a consistent approach that produces estimates which are normally distributed in large samples. In R there are functions like nls() from the package stats which provide algorithms for solving nonlinear least squares problems. However, NLS is inefficient, meaning that there are estimation techniques with a smaller variance, which is why we will not dwell any further on this topic.
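As a minimal sketch of what NLS estimation of a Probit model looks like in practice, the following code simulates data and minimizes the sum of squared mistakes using nls(). The data and variable names here are made up for illustration; nls() may need reasonable starting values to converge.

```r
# illustrative sketch: NLS estimation of a Probit model on simulated data
set.seed(1)
n <- 1000
x <- rnorm(n)
# generate binary outcomes with P(y = 1 | x) = pnorm(-0.5 + 1 * x)
y <- rbinom(n, size = 1, prob = pnorm(-0.5 + 1 * x))

# minimize sum_i [y_i - Phi(b0 + b1 * x_i)]^2 over (b0, b1)
nls_fit <- nls(y ~ pnorm(b0 + b1 * x),
               start = list(b0 = 0, b1 = 0))

coef(nls_fit)  # estimates should be close to (-0.5, 1)
```

The estimates are consistent, but as noted above they are less efficient than the MLE counterparts produced by glm().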
Maximum Likelihood Estimation
In MLE we seek to estimate the unknown parameters by choosing them such that the likelihood of drawing the observed sample is maximized. This probability is measured by means of the likelihood function, the joint probability distribution of the data treated as a function of the unknown parameters. Put differently, the maximum likelihood estimates of the unknown parameters are the values that result in a model which is most likely to produce the observed data. It turns out that MLE is more efficient than NLS.
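To make this concrete, for the Probit model (11.8) with i.i.d. observations, each $Y_i$ is a Bernoulli draw with success probability $p_i = \Phi(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki})$, so the log-likelihood that is maximized over the coefficients is

$$\ln L(\beta_0, \dots, \beta_k) = \sum_{i=1}^n \left\{ Y_i \ln(p_i) + (1 - Y_i) \ln(1 - p_i) \right\}.$$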
As maximum likelihood estimates are normally distributed in large samples, statistical inference for coefficients in nonlinear models like Logit and Probit regression can be made using the same tools that are used for linear regression models: we can compute $t$-statistics and confidence intervals.
Many software packages use an MLE algorithm for estimation of nonlinear models. The function glm() uses an algorithm named iteratively reweighted least squares.
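As a sketch of what such an estimation call looks like (assuming the HMDA data from the AER package used elsewhere in this chapter, with the payment-to-income ratio pirat as a regressor):

```r
# illustrative sketch: MLE of a Probit model via glm(), which uses
# iteratively reweighted least squares internally
library(AER)
data(HMDA)

# convert the factor 'deny' to a 0/1 variable
HMDA$deny <- as.numeric(HMDA$deny) - 1

probit_fit <- glm(deny ~ pirat,
                  family = binomial(link = "probit"),
                  data = HMDA)

# t-statistics and robust standard errors, as for linear models
coeftest(probit_fit, vcov. = vcovHC, type = "HC1")
```

Replacing link = "probit" by link = "logit" estimates the corresponding Logit model by MLE instead.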
Measures of Fit
It is important to be aware that the usual $R^2$ and $\bar{R}^2$ are invalid for nonlinear regression models. The reason for this is simple: both measures assume that the relation between the dependent and the explanatory variable(s) is linear. This obviously does not hold for Probit and Logit models. Thus $R^2$ need not lie between $0$ and $1$ and has no meaningful interpretation. However, statistical software sometimes reports these measures anyway.
There are many measures of fit for nonlinear regression models and there is no consensus as to which one should be reported. The situation is even more complicated because there is no measure of fit that is generally meaningful. For models with a binary response variable like deny one could use the following rule: if $Y_i = 1$ and $\widehat{P}(Y_i \vert X_{i1}, \dots, X_{ik}) > 0.5$ or if $Y_i = 0$ and $\widehat{P}(Y_i \vert X_{i1}, \dots, X_{ik}) < 0.5$, consider $Y_i$ correctly predicted. Otherwise $Y_i$ is said to be incorrectly predicted. The measure of fit is the share of correctly predicted observations. The downside of such an approach is that it does not mirror the quality of the prediction: whether $\widehat{P}(Y_i = 1 \vert X_{i1}, \dots, X_{ik}) = 0.51$ or $\widehat{P}(Y_i = 1 \vert X_{i1}, \dots, X_{ik}) = 0.99$ is not reflected; we just predict $Y_i = 1$.⁸
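This rule is straightforward to apply to a fitted model. A sketch, assuming denyprobit2 is the augmented Probit model estimated earlier in the chapter and that deny in HMDA is coded as a 0/1 variable:

```r
# predicted probabilities from the fitted Probit model
probs <- predict(denyprobit2, type = "response")

# classify using the 0.5 threshold
pred <- as.numeric(probs > 0.5)

# share of correctly predicted observations
mean(pred == HMDA$deny)
```

Note that this single number hides the asymmetry typical of such data: a model may classify most observations correctly simply by predicting the majority outcome for everyone.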
An alternative to the latter are so-called pseudo-$R^2$ measures. In order to measure the quality of the fit, these measures compare the value of the maximized (log-)likelihood of the model with all regressors (the full model) to the likelihood of a model with no regressors (the null model, a regression on a constant only).
For example, consider a Probit regression. The pseudo-$R^2$ is given by

$$\text{pseudo-}R^2 = 1 - \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}$$

where $f^{max}_j \in [0,1]$ denotes the maximized likelihood for model $j$.
The reasoning behind this is that the maximized likelihood increases as additional regressors are added to the model, similarly to the decrease in $SSR$ when regressors are added in a linear regression model. If the full model has a similar maximized likelihood as the null model, the full model does not really improve upon a model that uses only the information in the dependent variable, so $\text{pseudo-}R^2 \approx 0$. If the full model fits the data very well, the maximized likelihood should be close to $1$ such that $\ln(f^{max}_{full}) \approx 0$ and hence $\text{pseudo-}R^2 \approx 1$. See Appendix 11.2 of the book for more on MLE and pseudo-$R^2$ measures.
summary() does not report the pseudo-$R^2$ for models estimated by glm() but we can use the entries residual deviance (deviance) and null deviance (null.deviance) instead. These are computed as

$$\text{deviance} = -2 \times \left[\ln(f^{max}_{saturated}) - \ln(f^{max}_{full})\right]$$

and

$$\text{null deviance} = -2 \times \left[\ln(f^{max}_{saturated}) - \ln(f^{max}_{null})\right]$$

where $f^{max}_{saturated}$ is the maximized likelihood for a model which assumes that each observation has its own parameter (there are $n+1$ parameters to be estimated which leads to a perfect fit). For models with a binary dependent variable, it holds that

$$\text{pseudo-}R^2 = 1 - \frac{\text{deviance}}{\text{null deviance}} = 1 - \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}.$$
We now compute the pseudo-$R^2$ for the augmented Probit model of mortgage denial.
# compute pseudo-R2 for the probit model of mortgage denial
pseudoR2 <- 1 - (denyprobit2$deviance) / (denyprobit2$null.deviance)
pseudoR2
## [1] 0.08594259
Another way to obtain the pseudo-$R^2$ is to estimate the null model using glm() and extract the maximized log-likelihoods for both the null and the full model using the function logLik().
# compute the null model
denyprobit_null <- glm(formula = deny ~ 1,
                       family = binomial(link = "probit"),
                       data = HMDA)
# compute the pseudo-R2 using 'logLik'
1 - logLik(denyprobit2)[1]/logLik(denyprobit_null)[1]
## [1] 0.08594259
This is in contrast to the case of a numeric dependent variable where we use the squared errors for assessment of the quality of the prediction.↩