
11.3 Estimation and Inference in the Logit and Probit Models

So far, nothing has been said about how Logit and Probit models are estimated by statistical software. This matters because both models are nonlinear in the parameters and thus cannot be estimated using OLS. Instead, one relies on maximum likelihood estimation (MLE). Another approach is estimation by nonlinear least squares (NLS).

Nonlinear Least Squares

Consider the multiple regression Probit model \[\begin{align} E(Y_i\vert X_{1i}, \dots, X_{ki}) = P(Y_i=1\vert X_{1i}, \dots, X_{ki}) = \Phi(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}). \tag{11.8} \end{align}\]

Similarly to OLS, NLS estimates the parameters \(\beta_0,\beta_1,\dots,\beta_k\) by minimizing the sum of squared mistakes \[\sum_{i=1}^n\left[ Y_i - \Phi(b_0 + b_1 X_{1i} + \dots + b_k X_{ki}) \right]^2.\] NLS estimation is a consistent approach that produces estimates which are normally distributed in large samples. In R, the function nls() from the package stats provides algorithms for solving nonlinear least squares problems. However, NLS is inefficient: there are estimation techniques with a smaller variance, which is why we will not dwell on this topic any further.
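
To illustrate, the following is a minimal sketch that fits a simple Probit model of mortgage denial by NLS using nls(). It assumes the HMDA data from the package AER used throughout this chapter; the 0/1 variable deny01 is created here for illustration, and the starting values are ad hoc and may need tuning for convergence.

# a minimal sketch: NLS estimation of a simple Probit model
# (assumes the 'HMDA' data from the package 'AER'; starting values are ad hoc)
library(AER)
data(HMDA)

# recode the factor 'deny' as a 0/1 variable
HMDA$deny01 <- as.numeric(HMDA$deny == "yes")

# minimize the sum of squared mistakes over (b0, b1)
probit_nls <- nls(deny01 ~ pnorm(b0 + b1 * pirat),
                  data = HMDA,
                  start = list(b0 = 0, b1 = 0))
coef(probit_nls)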

Maximum Likelihood Estimation

In MLE, we seek to estimate the unknown parameters by choosing them such that the likelihood of drawing the observed sample is maximized. This probability is measured by means of the likelihood function, the joint probability distribution of the data treated as a function of the unknown parameters. Put differently, the maximum likelihood estimates of the unknown parameters are the values that result in the model which is most likely to have produced the observed data. It turns out that MLE is more efficient than NLS.
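
For intuition, the following sketch maximizes the Probit log-likelihood directly using optim(). This is for illustration only and is not the algorithm used by glm() (see below); it assumes the HMDA data and the 0/1 variable deny01 defined in the sketch above.

# a minimal sketch: Probit MLE by direct maximization of the log-likelihood
# (illustration only; assumes 'HMDA' and 'deny01' as defined above)
neg_loglik <- function(b) {
  p <- pnorm(b[1] + b[2] * HMDA$pirat)
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)  # guard against log(0)
  -sum(HMDA$deny01 * log(p) + (1 - HMDA$deny01) * log(1 - p))
}

probit_mle <- optim(par = c(0, 0), fn = neg_loglik, method = "BFGS")
probit_mle$par

The resulting estimates should be close to those obtained by glm(deny ~ pirat, family = binomial(link = "probit"), data = HMDA).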

As maximum likelihood estimates are normally distributed in large samples, statistical inference for coefficients in nonlinear models like Logit and Probit regression can be made using the same tools that are used for linear regression models: we can compute \(t\)-statistics and confidence intervals.

Many software packages use an MLE algorithm for the estimation of nonlinear models. In R, the function glm() uses an algorithm called iteratively reweighted least squares.
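
As a combined example of estimation and inference, the sketch below fits the simple Probit model with glm() and computes robust coefficient tests and confidence intervals; it assumes the package AER (which attaches lmtest and sandwich) and the HMDA data.

# a minimal sketch: Probit estimation by glm() and subsequent inference
# (assumes the packages 'AER', 'lmtest' and 'sandwich' are attached)
probit_glm <- glm(deny ~ pirat,
                  family = binomial(link = "probit"),
                  data = HMDA)

# robust coefficient tests and 95% confidence intervals
coeftest(probit_glm, vcov. = vcovHC(probit_glm, type = "HC1"))
confint(probit_glm, level = 0.95)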

Measures of Fit

It is important to be aware that the usual \(R^2\) and \(\bar{R}^2\) are invalid for nonlinear regression models. The reason is simple: both measures assume that the relation between the dependent and the explanatory variable(s) is linear, which obviously does not hold for Probit and Logit models. Thus \(R^2\) need not lie between \(0\) and \(1\), and it has no meaningful interpretation. Nevertheless, statistical software sometimes reports these measures anyway.

There are many measures of fit for nonlinear regression models, and there is no consensus as to which one should be reported. The situation is further complicated by the fact that there is no measure of fit that is generally meaningful. For models with a binary response variable like \(deny\), one could use the following rule: if \(Y_i = 1\) and \(\widehat{P(Y_i = 1\vert X_{1i}, \dots, X_{ki})} > 0.5\) or if \(Y_i = 0\) and \(\widehat{P(Y_i = 1\vert X_{1i}, \dots, X_{ki})} < 0.5\), consider \(Y_i\) correctly predicted; otherwise, \(Y_i\) is said to be incorrectly predicted. The measure of fit is the share of correctly predicted observations. The downside of this approach is that it does not reflect the quality of the prediction: whether \(\widehat{P(Y_i = 1\vert X_{1i}, \dots, X_{ki})} = 0.51\) or \(\widehat{P(Y_i = 1\vert X_{1i}, \dots, X_{ki})} = 0.99\) makes no difference, we simply predict \(Y_i = 1\).¹
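
As an illustration, the following sketch computes the share of correctly predicted observations for the augmented Probit model of mortgage denial; it assumes the model object denyprobit2 from the previous section and that deny in HMDA is coded 0/1, as earlier in this chapter.

# a sketch: share of correctly predicted observations for 'denyprobit2'
# (assumes 'denyprobit2' and a 0/1 coding of 'HMDA$deny')
predicted_prob <- predict(denyprobit2, type = "response")
predicted_class <- as.numeric(predicted_prob > 0.5)
mean(predicted_class == HMDA$deny)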

An alternative to the latter are so-called pseudo-\(R^2\) measures. To measure the quality of the fit, these compare the value of the maximized (log-)likelihood of the model with all regressors (the full model) to that of a model with no regressors (the null model, a regression on a constant only).

For example, consider a Probit regression. The \(\text{pseudo-}R^2\) is given by \[\text{pseudo-}R^2 = 1 - \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}\] where \(f^{max}_j \in [0,1]\) denotes the maximized likelihood for model \(j\).

The reasoning behind this is that the maximized likelihood increases as additional regressors are added to the model, similarly to the decrease in \(SSR\) when regressors are added in a linear regression model. If the full model has a similar maximized likelihood as the null model, the full model does not really improve upon a model that uses only the information in the dependent variable, so \(\text{pseudo-}R^2 \approx 0\). If the full model fits the data very well, the maximized likelihood should be close to \(1\) such that \(\ln(f^{max}_{full}) \approx 0\) and \(\text{pseudo-}R^2 \approx 1\). See Appendix 11.2 of the book for more on MLE and pseudo-\(R^2\) measures.

summary() does not report \(\text{pseudo-}R^2\) for models estimated by glm() but we can use the entries residual deviance (deviance) and null deviance (null.deviance) instead. These are computed as \[\text{deviance} = -2 \times \left[\ln(f^{max}_{saturated}) - \ln(f^{max}_{full}) \right]\] and

\[\text{null deviance} = -2 \times \left[\ln(f^{max}_{saturated}) - \ln(f^{max}_{null}) \right]\] where \(f^{max}_{saturated}\) is the maximized likelihood of a model which assumes that each observation has its own parameter (there are \(n\) parameters to be estimated, one per observation, which leads to a perfect fit). For models with a binary dependent variable, the perfect fit implies \(f^{max}_{saturated} = 1\) and hence \(\ln(f^{max}_{saturated}) = 0\), so that \[\text{pseudo-}R^2 = 1 - \frac{\text{deviance}}{\text{null deviance}} = 1- \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}.\]

We now compute the \(\text{pseudo-}R^2\) for the augmented Probit model of mortgage denial.

# compute pseudo-R2 for the probit model of mortgage denial
pseudoR2 <- 1 - (denyprobit2$deviance) / (denyprobit2$null.deviance)
pseudoR2
## [1] 0.08594259

Another way to obtain the \(\text{pseudo-}R^2\) is to estimate the null model using glm() and extract the maximized log-likelihoods for both the null and the full model using the function logLik().

# compute the null model
denyprobit_null <- glm(formula = deny ~ 1, 
                       family = binomial(link = "probit"), 
                       data = HMDA)

# compute the pseudo-R2 using 'logLik'
1 - logLik(denyprobit2)[1]/logLik(denyprobit_null)[1]
## [1] 0.08594259

  1. This is in contrast to the case of a numeric dependent variable, where the squared prediction errors are used to assess the quality of a prediction.