4.5 Inference for model parameters

The assumptions on which the logistic model is constructed allow us to specify the asymptotic distribution of the random vector $\hat{\boldsymbol\beta}$. Again, the distribution is derived conditionally on the sample predictors $\mathbf{X}_1,\ldots,\mathbf{X}_n$. In other words, we assume that the randomness of $Y$ comes only from $Y|(X_1=x_1,\ldots,X_k=x_k)\sim\mathrm{Ber}(p(\mathbf{x}))$ and not from the predictors. To denote this, we employ lowercase for the sample predictors $\mathbf{x}_1,\ldots,\mathbf{x}_n$.

There is an important difference between the inference results for the linear model and for logistic regression:

  • In linear regression the inference is exact. This is due to the nice properties of the normal distribution, least squares estimation, and linearity. As a consequence, the distributions of the coefficients are exactly known, provided the assumptions hold.
  • In logistic regression the inference is asymptotic. This means that the distributions of the coefficients are unknown except for large sample sizes $n$, for which we have approximations. The reason is the greater complexity of the model in terms of non-linearity. This is the usual situation for the majority of regression models.

4.5.1 Distributions of the fitted coefficients

The distribution of $\hat{\boldsymbol\beta}$ is given by the asymptotic theory of maximum likelihood estimation:
$$\hat{\boldsymbol\beta}\sim\mathcal{N}_{k+1}\big(\boldsymbol\beta,\, I(\boldsymbol\beta)^{-1}\big),\tag{4.10}$$
where $\sim$ must be understood as "approximately distributed as [...] when $n\to\infty$" for the rest of this chapter. $I(\boldsymbol\beta)$ is known as the Fisher information matrix, and receives that name because it measures the information available in the sample for estimating $\boldsymbol\beta$. Therefore, the larger the matrix is, the more precise the estimation of $\boldsymbol\beta$ is, because that results in smaller variances in (4.10). The inverse of the Fisher information matrix is
$$I(\boldsymbol\beta)^{-1}=(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1},\tag{4.11}$$
where $\mathbf{V}$ is a diagonal matrix containing the different variances for each $Y_i$ (remember that $p(\mathbf{x})=1/(1+e^{-(\beta_0+\beta_1x_1+\cdots+\beta_kx_k)})$):
$$\mathbf{V}=\begin{pmatrix}p(\mathbf{x}_1)(1-p(\mathbf{x}_1)) & & & \\ & p(\mathbf{x}_2)(1-p(\mathbf{x}_2)) & & \\ & & \ddots & \\ & & & p(\mathbf{x}_n)(1-p(\mathbf{x}_n))\end{pmatrix}.$$
In the case of the multiple linear regression, $I(\boldsymbol\beta)^{-1}=\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ (see (3.6)), so the presence of $\mathbf{V}$ here reveals the heteroskedasticity of the model.

The interpretation of (4.10) and (4.11) gives some useful insights into what affects the quality of the estimation:

  • Bias. The estimates are asymptotically unbiased.

  • Variance. It depends on:

    • Sample size $n$. It is hidden inside $\mathbf{X}^T\mathbf{V}\mathbf{X}$. As $n$ grows, the precision of the estimators increases.
    • Weighted predictor sparsity, $(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}$. The more sparse (disperse) the predictors are (small $|(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}|$), the more precise $\hat{\boldsymbol\beta}$ is.

The precision of $\hat{\boldsymbol\beta}$ is affected by the value of $\boldsymbol\beta$, which is hidden inside $\mathbf{V}$. This contrasts sharply with the linear model, where the precision of the least squares estimator was not affected by the value of the unknown coefficients (see (3.6)). The reason is partially due to the heteroskedasticity of logistic regression, which makes the variance of $Y$ depend on the logistic curve and hence on $\boldsymbol\beta$.

Figure 4.8: Illustration of the randomness of the fitted coefficients $(\hat\beta_0,\hat\beta_1)$ and the influence of $n$, $(\beta_0,\beta_1)$ and $s_x^2$. The sample predictors $x_1,\ldots,x_n$ are fixed and new responses $Y_1,\ldots,Y_n$ are generated each time from a logistic model $Y|X=x\sim\mathrm{Ber}(p(x))$. Application also available here.
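
A small simulation in the spirit of Figure 4.8 can be run directly in R. The sketch below (the seed, the true coefficients, the sample sizes, and the function name simula_beta1 are arbitrary choices of ours) keeps the sample predictors fixed, regenerates the responses from the logistic model, and shows how the spread of $\hat\beta_1$ shrinks as $n$ grows:

# Spread of the fitted slope for several sample sizes (illustrative sketch)
set.seed(123456)
beta0 <- 0.3; beta1 <- 1  # true coefficients (arbitrary)
simula_beta1 <- function(n, M = 500) {
  x <- rnorm(n)  # sample predictors, kept fixed across the M replicates
  replicate(M, {
    y <- rbinom(n, size = 1, prob = 1 / (1 + exp(-(beta0 + beta1 * x))))
    coef(glm(y ~ x, family = "binomial"))[2]  # fitted slope
  })
}
# Standard deviation of the fitted slopes decreases with n
sapply(c(50, 100, 500), function(n) sd(simula_beta1(n)))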

Similar to linear regression, the problem with (4.10) and (4.11) is that $\mathbf{V}$ is unknown in practice because it depends on $\boldsymbol\beta$. Plugging in the estimate $\hat{\boldsymbol\beta}$ for $\boldsymbol\beta$ in $\mathbf{V}$ results in $\hat{\mathbf{V}}$. Now we can use $\hat{\mathbf{V}}$ to get
$$\frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\sim\mathcal{N}(0,1),\qquad \hat{\mathrm{SE}}(\hat\beta_j)^2=v_j,\tag{4.12}$$
where $v_j$ is the $j$-th element of the diagonal of $(\mathbf{X}^T\hat{\mathbf{V}}\mathbf{X})^{-1}$. The LHS of (4.12) is the Wald statistic for $\beta_j$, $j=0,\ldots,k$. These statistics are employed for building confidence intervals and hypothesis tests.
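
The objects in (4.11) and (4.12) can be illustrated numerically. The sketch below uses simulated data (the seed, coefficients, and object names are arbitrary): it builds $\hat{\mathbf{V}}$ from the fitted probabilities and checks that $(\mathbf{X}^T\hat{\mathbf{V}}\mathbf{X})^{-1}$ coincides (up to numerical error) with the variance-covariance matrix returned by vcov, whose diagonal square roots are the standard errors reported by summary:

# Simulated data from a logistic model (arbitrary seed and coefficients)
set.seed(12345)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = 1 / (1 + exp(-(0.5 + 1 * x))))
fit <- glm(y ~ x, family = "binomial")

# (X^T V_hat X)^{-1} by hand, from the design matrix and fitted probabilities
X <- model.matrix(fit)
p_hat <- fitted(fit)
V_hat <- diag(p_hat * (1 - p_hat))
solve(t(X) %*% V_hat %*% X)

# It matches the variance-covariance matrix that R computes...
vcov(fit)

# ...and the square roots of its diagonal are the standard errors in (4.12)
sqrt(diag(vcov(fit)))
summary(fit)$coefficients[, "Std. Error"]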

4.5.2 Confidence intervals for the coefficients

Thanks to (4.12), we can compute the $100(1-\alpha)\%$ CI for the coefficient $\beta_j$, $j=0,\ldots,k$:
$$\big(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)\,z_{\alpha/2}\big),\tag{4.13}$$
where $z_{\alpha/2}$ is the $\alpha/2$-upper quantile of the $\mathcal{N}(0,1)$. In case we are interested in the CI for $e^{\beta_j}$, we can simply take the exponential of the above CI. So the $100(1-\alpha)\%$ CI for $e^{\beta_j}$, $j=0,\ldots,k$, is
$$e^{\big(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)\,z_{\alpha/2}\big)}.$$
Of course, this CI is not the same as $\big(e^{\hat\beta_j}\pm e^{\hat{\mathrm{SE}}(\hat\beta_j)\,z_{\alpha/2}}\big)$, which is not a CI for $e^{\beta_j}$.

Let’s see how we can compute the CIs. We return to the challenger dataset, so in case you do not have it loaded, you can download it here. We analyze the CI for the coefficients of fail.field ~ temp.

# Fit model
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)

# Confidence intervals at 95%
confint(nasa)
## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept)  1.3364047 17.7834329
## temp        -0.9237721 -0.1089953

# Confidence intervals at other levels
confint(nasa, level = 0.90)
## Waiting for profiling to be done...
##                    5 %       95 %
## (Intercept)  2.2070301 15.7488590
## temp        -0.8222858 -0.1513279

# Confidence intervals for the factors affecting the odds
exp(confint(nasa))
## Waiting for profiling to be done...
##                 2.5 %       97.5 %
## (Intercept) 3.8053375 5.287456e+07
## temp        0.3970186 8.967346e-01

In this example, the 95% confidence interval for $\beta_0$ is $(1.3364, 17.7834)$ and for $\beta_1$ is $(-0.9238, -0.1090)$. For $e^{\beta_0}$ and $e^{\beta_1}$, the CIs are $(3.8053, 5.2874\times 10^7)$ and $(0.3970, 0.8967)$, respectively. Therefore, we can say with a 95% confidence that:

  • When temp=0, the probability of fail.field=1 is significantly larger than the probability of fail.field=0 (using the CI for $\beta_0$). Indeed, fail.field=1 is between 3.8053 and $5.2874\times 10^7$ times more likely than fail.field=0 (using the CI for $e^{\beta_0}$).
  • temp has a significantly negative effect on the probability of fail.field=1 (using the CI for $\beta_1$). Indeed, each unit increase in temp produces a reduction of the odds of fail.field=1 by a factor between 0.3970 and 0.8967 (using the CI for $e^{\beta_1}$).

Compute and interpret the CIs for the exponentiated coefficients, at level α=0.05, for the following regressions (challenger dataset):

  • fail.field ~ temp + pres.field
  • fail.nozzle ~ temp + pres.nozzle
  • fail.field ~ temp + pres.nozzle
  • fail.nozzle ~ temp + pres.field
The interpretation of the variables is given above Table 4.1.

4.5.3 Testing on the coefficients

The distributions in (4.12) also allow us to conduct formal hypothesis tests on the coefficients $\beta_j$, $j=0,\ldots,k$. For example, the test for significance
$$H_0:\beta_j=0$$
for $j=0,\ldots,k$. The test of $H_0:\beta_j=0$ with $1\leq j\leq k$ is especially interesting, since it allows us to answer whether the variable $X_j$ has a significant effect on $\mathbb{P}[Y=1]$. The statistic used for testing for significance is the Wald statistic
$$\frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)},$$
which is asymptotically distributed as a $\mathcal{N}(0,1)$ under the veracity of the null hypothesis. $H_0$ is tested against the bilateral alternative hypothesis $H_1:\beta_j\neq 0$.
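
For the nasa model fitted above, these Wald statistics and their two-sided p-values can be recovered by hand (a minimal check; the object names beta_hat, se_hat, z, and p_val are ours). They coincide with the "z value" and "Pr(>|z|)" columns of summary, discussed next:

# Wald statistics and two-sided p-values by hand for the nasa model
beta_hat <- coef(nasa)
se_hat <- sqrt(diag(vcov(nasa)))  # asymptotic standard errors
z <- beta_hat / se_hat            # Wald statistics for H0: beta_j = 0
p_val <- 2 * pnorm(-abs(z))       # two-sided p-values from the N(0, 1)
cbind(z, p_val)
summary(nasa)$coefficients        # same "z value" and "Pr(>|z|)" columns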

The tests for significance are built into the summary function. However, a note of caution is required when applying the rule of thumb:

Is the CI for $\beta_j$ below (above) 0 at level $\alpha$?

  • Yes → reject $H_0$ at level $\alpha$.
  • No → the criterion is not conclusive.

The significances given in summary and the output of confint are slightly incoherent, and the previous rule of thumb does not apply. The reason is that MASS's confint uses a more sophisticated method (profile likelihood) to compute the CIs for $\beta_j$, rather than the asymptotic distribution behind the Wald statistic in (4.12).

By changing confint to R's default confint.default, the results will be completely coherent with the significances in summary, and the rule of thumb remains valid. For the contents of this course we prefer confint.default due to its better interpretability.

To illustrate this we consider the regression of fail.field ~ temp + pres.field:

# Significances with asymptotic approximation for the standard errors
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial",
             data = challenger)
summary(nasa2)
## 
## Call:
## glm(formula = fail.field ~ temp + pres.field, family = "binomial", 
##     data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2109  -0.6081  -0.4292   0.3498   2.0913  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  6.642709   4.038547   1.645   0.1000  
## temp        -0.435032   0.197008  -2.208   0.0272 *
## pres.field   0.009376   0.008821   1.063   0.2878  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 19.078  on 20  degrees of freedom
## AIC: 25.078
## 
## Number of Fisher Scoring iterations: 5

# CIs with asymptotic approximation - coherent with summary
confint.default(nasa2, level = 0.90)
##                      5 %        95 %
## (Intercept) -0.000110501 13.28552771
## temp        -0.759081468 -0.11098301
## pres.field  -0.005132393  0.02388538
confint.default(nasa2, level = 0.99)
##                   0.5 %      99.5 %
## (Intercept) -3.75989977 17.04531697
## temp        -0.94249107  0.07242659
## pres.field  -0.01334432  0.03209731
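
The intervals given by confint.default are precisely the Wald CIs in (4.13). As a minimal check (the object names are ours), they can be reproduced by hand for nasa2 at the 90% level:

# Hand computation of the Wald CIs (4.13) for nasa2 at the 90% level
alpha <- 0.10
beta_hat <- coef(nasa2)
se_hat <- sqrt(diag(vcov(nasa2)))
z_alpha2 <- qnorm(1 - alpha / 2)  # upper alpha/2 quantile of the N(0, 1)
cbind(beta_hat - z_alpha2 * se_hat, beta_hat + z_alpha2 * se_hat)
# Same output as confint.default(nasa2, level = 0.90) above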

# CIs with profile likelihood - incoherent with summary
confint(nasa2, level = 0.90) # intercept still significant
## Waiting for profiling to be done...
##                      5 %        95 %
## (Intercept)  0.945372123 14.93392497
## temp        -0.845250023 -0.16532086
## pres.field  -0.004184814  0.02602181
confint(nasa2, level = 0.99) # temp still significant
## Waiting for profiling to be done...
##                   0.5 %      99.5 %
## (Intercept) -1.86541750 21.49637422
## temp        -1.17556090 -0.04317904
## pres.field  -0.01164943  0.03836968

For the previous exercise, check the differences between using confint and confint.default for computing the CIs.