5.3 Inference for model parameters

The assumptions on which a generalized linear model is constructed allow us to specify the asymptotic distribution of the random vector \(\hat{\boldsymbol{\beta}}\) through the theory of MLE. Again, the distribution is derived conditionally on the predictors’ sample \(\mathbf{X}_1,\ldots,\mathbf{X}_n.\) In other words, we assume that the randomness of \(Y\) comes only from \(Y|(X_1=x_1,\ldots,X_p=x_p)\) and not from the predictors.

For ease of exposition, we will focus only on the logistic model rather than on the general case. The conceptual differences are minor, but the simplification in notation and the gain in intuition are important.

There is an important difference between the inference results for the linear model and for logistic regression:

  • In linear regression the inference is exact. This is due to the nice properties of the normal distribution, least squares estimation, and linearity. As a consequence, the distributions of the coefficients are known exactly, provided that the model assumptions hold.
  • In generalized linear models the inference is asymptotic. This means that the distributions of the coefficients are unknown except for large sample sizes \(n,\) for which we have approximations165. The reason is the higher complexity of the model in terms of nonlinearity. This is the usual situation for the majority of regression models166.

5.3.1 Distributions of the fitted coefficients

The distribution of \(\hat{\boldsymbol{\beta}}\) is given by the asymptotic theory of MLE:

\[\begin{align} \hat{\boldsymbol{\beta}}\stackrel{a}{\sim}\mathcal{N}_{p+1}\left(\boldsymbol{\beta},\mathcal{I}(\boldsymbol{\beta})^{-1}\right) \tag{5.26} \end{align}\]

where \(\stackrel{a}{\sim} [\ldots]\) means “asymptotically distributed as \([\ldots]\) when \(n\to\infty\)” and

\[\begin{align*} \mathcal{I}(\boldsymbol{\beta}):=-\mathbb{E}\left[\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}'}\right] \end{align*}\]

is the Fisher information matrix. The name comes from the fact that it measures the information available in the sample for estimating \(\boldsymbol{\beta}\). The “larger” the matrix is (in the sense of larger eigenvalues), the more precise the estimation of \(\boldsymbol{\beta}\) is, because a larger information matrix translates into smaller variances in (5.26).

The inverse of the Fisher information matrix is167

\[\begin{align} \mathcal{I}(\boldsymbol{\beta})^{-1}=(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}, \tag{5.27} \end{align}\]

where \(\mathbf{V}=\mathrm{diag}(V_1,\ldots,V_n)\) and \(V_i=\mathrm{logistic}(\eta_i)(1-\mathrm{logistic}(\eta_i)),\) with \(\eta_i=\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip}.\) In the case of multiple linear regression, \(\mathcal{I}(\boldsymbol{\beta})^{-1}=\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\) (see (2.11)), so the presence of \(\mathbf{V}\) here is a consequence of the heteroskedasticity of the logistic model.
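
To make (5.27) more tangible, the following minimal sketch computes \((\mathbf{X}'\hat{\mathbf{V}}\mathbf{X})^{-1}\) by hand and compares it with the asymptotic covariance matrix stored in a fitted glm object. It assumes the nasa logistic fit of fail.field on temp employed in the case study of Section 5.3.4.

# Estimate of I(beta)^{-1} from (5.27), assuming the fit
# nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
X <- model.matrix(nasa)            # design matrix, including the intercept column
p_hat <- fitted(nasa)              # logistic(eta_i) evaluated at the fitted coefficients
V_hat <- diag(p_hat * (1 - p_hat)) # V-hat
solve(t(X) %*% V_hat %*% X)        # (X' V-hat X)^{-1}
vcov(nasa)                         # should coincide, up to a small numerical error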

The interpretation of (5.26) and (5.27) gives some useful insights into what affects the quality of the estimation:

  • Bias. The estimates are asymptotically unbiased.

  • Variance. It depends on:

    • Sample size \(n\). Hidden inside \(\mathbf{X}'\mathbf{V}\mathbf{X}.\) As \(n\) grows, the precision of the estimators increases.
    • Weighted predictor sparsity \((\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}\). The more “dispersed”168 the predictors are, the more precise \(\hat{\boldsymbol{\beta}}\) is. When \(p=1,\) \(\mathbf{X}'\mathbf{V}\mathbf{X}\) is a weighted version of \(s_x^2.\)

Figure 5.7 helps to visualize these insights.

The precision of \(\hat{\boldsymbol{\beta}}\) is affected by the value of \(\boldsymbol{\beta},\) which is hidden inside \(\mathbf{V}.\) This contrasts sharply with the linear model, where the precision of the least squares estimator was not affected by \(\boldsymbol{\beta}\) (see (2.11)). The reason is partially the heteroskedasticity of the logistic model, which makes the variance of \(Y\) depend on the logistic curve and hence on \(\boldsymbol{\beta}.\)
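
The following minimal simulation sketch mimics the setting of Figure 5.7: the predictors’ sample is kept fixed, new responses are generated from a simple logistic model, and the variability of the fitted coefficients is compared for two sample sizes. The parameter values \(\beta_0=0.5\) and \(\beta_1=2,\) as well as the number of replicates, are arbitrary choices for illustration.

# Variability of (beta0-hat, beta1-hat) for a fixed predictors' sample,
# in the spirit of Figure 5.7 (beta0 = 0.5 and beta1 = 2 are arbitrary)
set.seed(12345)
beta0 <- 0.5; beta1 <- 2
sim_coefs <- function(n, M = 500) {
  x <- rnorm(n) # fixed predictors' sample
  t(replicate(M, {
    y <- rbinom(n = n, size = 1, prob = plogis(beta0 + beta1 * x))
    coef(glm(y ~ x, family = "binomial"))
  }))
}
# The variances of the fitted coefficients shrink as n grows
apply(sim_coefs(n = 50), 2, var)
apply(sim_coefs(n = 200), 2, var)

Varying beta0, beta1, or the dispersion of x in the sketch above allows exploring the remaining insights.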

Similar to linear regression, the problem with (5.26) and (5.27) is that \(\mathbf{V}\) is unknown in practice because it depends on \(\boldsymbol{\beta}.\) Plugging the estimate \(\hat{\boldsymbol{\beta}}\) into \(\mathbf{V}\) in place of \(\boldsymbol{\beta}\) gives the estimator \(\hat{\mathbf{V}}.\) Now we can use \(\hat{\mathbf{V}}\) to get

\[\begin{align} \frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\stackrel{a}{\sim} \mathcal{N}(0,1),\quad\hat{\mathrm{SE}}(\hat\beta_j)^2:=v_j\tag{5.28} \end{align}\]

where

\[\begin{align*} v_j\text{ is the }j\text{-th element of the diagonal of }(\mathbf{X}'\hat{\mathbf{V}}\mathbf{X})^{-1}. \end{align*}\]

The LHS of (5.28) is the Wald statistic for \(\beta_j,\) \(j=0,\ldots,p.\) These statistics are employed for building marginal confidence intervals and hypothesis tests, in a completely analogous way to how the \(t\)-statistics operate in linear regression.
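
As a quick check of (5.28), the Wald statistics can be computed by hand from the fitted coefficients and their estimated standard errors. This is a minimal sketch that again assumes the nasa model from the case study of Section 5.3.4.

# Wald statistics beta_j-hat / SE-hat(beta_j-hat), assuming the nasa fit
beta_hat <- coef(nasa)
se_hat <- sqrt(diag(vcov(nasa))) # square roots of v_j, j = 0, ..., p
beta_hat / se_hat                # matches the "z value" column of summary(nasa)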

Figure 5.7: Illustration of the randomness of the fitted coefficients \((\hat\beta_0,\hat\beta_1)\) and the influence of \(n,\) \((\beta_0,\beta_1)\) and \(s_x^2.\) The predictors’ sample \(x_1,\ldots,x_n\) is fixed and new responses \(Y_1,\ldots,Y_n\) are generated each time from a simple logistic model \(Y|X=x\sim\mathrm{Ber}(\mathrm{logistic}(\beta_0+\beta_1x)).\) Application available here.

5.3.2 Confidence intervals for the coefficients

Thanks to (5.28), we can build the \(100(1-\alpha)\%\) CI for the coefficient \(\beta_j,\) \(j=0,\ldots,p\):

\[\begin{align} \left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}\right)\tag{5.29} \end{align}\]

where \(z_{\alpha/2}\) is the \(\alpha/2\)-upper quantile of the \(\mathcal{N}(0,1)\). If we are interested in the CI for \(e^{\beta_j},\) we can simply exponentiate the limits of the above CI.169 So the \(100(1-\alpha)\%\) CI for \(e^{\beta_j},\) \(j=0,\ldots,p,\) is

\[\begin{align*} e^{\left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}\right)}. \end{align*}\]

Of course, this CI is not the same as \(\left(e^{\hat\beta_j}\pm e^{\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}}\right),\) which is not a valid CI for \(e^{\beta_j}\)!
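
As an illustration of (5.29), these CIs can be reproduced by hand from the coefficients and their standard errors. A minimal sketch, again assuming the nasa model from the case study below:

# 95% CIs for beta_j and exp(beta_j) from (5.29), assuming the nasa fit
alpha <- 0.05
beta_hat <- coef(nasa)
se_hat <- sqrt(diag(vcov(nasa)))
z_alpha2 <- qnorm(1 - alpha / 2) # alpha/2-upper quantile of N(0, 1)
ci <- cbind(beta_hat - z_alpha2 * se_hat, beta_hat + z_alpha2 * se_hat)
ci      # matches confint.default(nasa)
exp(ci) # matches exp(confint.default(nasa)), a valid CI for exp(beta_j)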

5.3.3 Testing on the coefficients

The distributions in (5.28) also allow us to conduct a formal hypothesis test on the coefficients \(\beta_j,\) \(j=0,\ldots,p.\) For example, the test for significance:

\[\begin{align*} H_0:\beta_j=0 \end{align*}\]

for \(j=0,\ldots,p.\) The test of \(H_0:\beta_j=0\) with \(1\leq j\leq p\) is especially interesting, since it allows us to answer whether the variable \(X_j\) has a significant effect on \(Y\). The statistic used to test for significance is the Wald statistic

\[\begin{align*} \frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)}, \end{align*}\]

which is asymptotically distributed as a \(\mathcal{N}(0,1)\) under the (veracity of the) null hypothesis. \(H_0\) is tested against the two-sided alternative hypothesis \(H_1:\beta_j\neq 0.\)
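
The two-sided \(p\)-values reported by summary can be recovered from these Wald statistics. A minimal sketch, once more assuming the nasa model from the case study below:

# Two-sided p-values for H0: beta_j = 0 from the Wald statistics,
# assuming the nasa fit
z <- coef(nasa) / sqrt(diag(vcov(nasa)))
2 * pnorm(-abs(z)) # matches the "Pr(>|z|)" column of summary(nasa)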

Is the CI for \(\beta_j\) below (above) \(0\) at level \(\alpha\)?

  • Yes \(\rightarrow\) reject \(H_0\) at level \(\alpha.\) Conclude \(X_j\) has a significant negative (positive) effect on \(Y\) at level \(\alpha.\)
  • No \(\rightarrow\) the criterion is not conclusive.

The tests for significance are built into the summary function. However, due to discrepancies between summary and confint, a note of caution is required when applying the previous rule of thumb for rejecting \(H_0\) in terms of the CI.

The significances given in summary and the output of confint (which, for glm fits, relies on the profile-likelihood method provided by the MASS package) are slightly incoherent, and the previous rule of thumb does not apply. The reason is that confint builds the confidence intervals from the profile likelihood of \(\beta_j,\) a more sophisticated approach, and not from the asymptotic distribution behind the Wald statistic.

By switching from confint to confint.default, which implements the Wald-based CI (5.29), the resulting intervals become completely coherent with the significances in summary, and the rule of thumb remains valid. For the contents of this course we prefer confint.default due to its better interpretability. This point is exemplified in the next section.

5.3.4 Case study application

Let’s compute the summary of the nasa model in order to address the significance of the coefficients. In view of the fitted logistic curve and the summary of the model, we can conclude that lower temperatures increased the probability of an O-ring incident (Q2). Indeed, the confidence intervals for the coefficients show a significantly negative effect of temp at level \(\alpha=0.05\):

# Summary of the model
summary(nasa)
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5

# Confidence intervals at 95%
confint.default(nasa)
##                   2.5 %      97.5 %
## (Intercept) -0.08865488 15.25614140
## temp        -0.79694430 -0.03634877

# Confidence intervals at other levels
confint.default(nasa, level = 0.90)
##                    5 %        95 %
## (Intercept)  1.1448638 14.02262275
## temp        -0.7358025 -0.09749059

# Confidence intervals for the factors affecting the odds
exp(confint.default(nasa))
##                 2.5 %       97.5 %
## (Intercept) 0.9151614 4.223359e+06
## temp        0.4507041 9.643039e-01

The coefficient for temp is significant at \(\alpha=0.05\), whereas the intercept is not (it is at \(\alpha=0.10\)). The \(95\%\) confidence interval for \(\beta_0\) is \((-0.0887, 15.2561)\) and for \(\beta_1\) is \((-0.7969, -0.0363).\) For \(e^{\beta_0}\) and \(e^{\beta_1},\) the CIs are \((0.9151, 4.2234\times10^6)\) and \((0.4507, 0.9643),\) respectively. Therefore, we can say with \(95\%\) confidence that:

  • When temp=0, the probability of fail.field=1 is not significantly larger than the probability of fail.field=0 (using the CI for \(\beta_0\)): fail.field=1 is between \(0.9151\) and \(4.2234\times10^6\) times more likely than fail.field=0 (using the CI for \(e^{\beta_0}\)).
  • temp has a significantly negative effect on the probability of fail.field=1 (using the CI for \(\beta_1\)). Indeed, each unit increase in temp multiplies the odds of fail.field=1 by a factor between \(0.4507\) and \(0.9643\) (using the CI for \(e^{\beta_1}\)), that is, it reduces them.

This completes the answers to Q1 and Q2.

We conclude by illustrating the incoherence between summary and confint.

# Significances with asymptotic approximation for the standard errors
summary(nasa)
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5

# CIs with asymptotic approximation -- coherent with summary
confint.default(nasa, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) -0.08865488 15.25614140
## temp        -0.79694430 -0.03634877
confint.default(nasa, level = 0.99)
##                  0.5 %      99.5 %
## (Intercept) -2.4994971 17.66698362
## temp        -0.9164425  0.08314945

# CIs with profile likelihood -- incoherent with summary
confint(nasa, level = 0.95) # intercept significant, unlike in summary
##                  2.5 %     97.5 %
## (Intercept)  1.3364047 17.7834329
## temp        -0.9237721 -0.1089953
confint(nasa, level = 0.99) # temp significant, unlike in summary
##                  0.5 %      99.5 %
## (Intercept) -0.3095128 22.26687651
## temp        -1.1479817 -0.02994011

  165. That work quite well in practice and deliver many valuable insights.

  166. The linear model is an exception in terms of exact and simple inference, not the rule.

  167. Recall expression (5.23) for the general case of \(\mathcal{I}(\boldsymbol{\beta}).\)

  168. Understood as small \(|(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}|.\)

  169. Because \(e^{\hat\beta_j-\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}}\leq e^{\beta_j}\leq e^{\hat\beta_j+\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}}\iff \hat\beta_j-\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}\leq \beta_j \leq\hat\beta_j+\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2},\) since the exponential is a monotone increasing function.