5.4 Prediction

Prediction in generalized linear models focuses mainly on predicting the values of the conditional mean

\[\begin{align*} \mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p]=g^{-1}(\eta)=g^{-1}(\beta_0+\beta_1x_1+\cdots+\beta_px_p) \end{align*}\]

by means of \(\hat{\eta}:=\hat\beta_0+\hat\beta_1x_1+\cdots+\hat\beta_px_p\) and not on predicting the conditional response. The reason is that confidence intervals, the main difference between both kinds of prediction, depend heavily on the family we are considering for the response.[1]

For the logistic model, the prediction of the conditional response follows immediately from \(\mathrm{logistic}(\hat{\eta})\):

\[\begin{align*} \hat{Y}|(X_1=x_1,\ldots,X_p=x_p)=\left\{ \begin{array}{ll} 1,&\text{with probability }\mathrm{logistic}(\hat{\eta}),\\ 0,&\text{with probability }1-\mathrm{logistic}(\hat{\eta}).\end{array}\right. \end{align*}\]

As a consequence, we can predict \(Y\) as \(1\) if \(\mathrm{logistic}(\hat{\eta})>\frac{1}{2}\) and as \(0\) otherwise.
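The classification rule above can be illustrated with a minimal sketch. It assumes simulated data and hypothetical names (`x`, `y`, `mod`, `xNew`) that are not part of the case study, and it computes \(\hat{\eta}\) by hand from the estimated coefficients.

```r
# Simulated data (hypothetical, not the challenger dataset)
set.seed(42)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(-0.5 + 2 * x))
mod <- glm(y ~ x, family = binomial)

# Estimated linear predictor eta_hat = beta0_hat + beta1_hat * x at new points
beta <- coef(mod)
xNew <- c(-2, 0, 2)
etaHat <- unname(beta[1] + beta[2] * xNew)

# Estimated probabilities logistic(eta_hat); plogis is the logistic function
pHat <- plogis(etaHat)

# Predict Y as 1 if the estimated probability exceeds 1/2, and as 0 otherwise
yHat <- as.integer(pHat > 0.5)

# Same rule on the log-odds scale: logistic(eta_hat) > 1/2 iff eta_hat > 0
all(yHat == as.integer(etaHat > 0))
```

Since the logistic function is increasing and \(\mathrm{logistic}(0)=\frac{1}{2}\), thresholding the probability at \(\frac{1}{2}\) amounts to checking the sign of \(\hat{\eta}\).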

To make predictions and compute CIs in practice we use predict. There are two differences with respect to its use for lm:

  • The argument type. type = "link" returns \(\hat{\eta}\) (the log-odds in the logistic model), type = "response" returns \(g^{-1}(\hat{\eta})\) (the probabilities in the logistic model). Observe that type = "response" has a different behavior than predict for lm, where it returned the predictions for the conditional response.
  • There is no interval argument for using predict with glm. That means that the computation of CIs for prediction is not implemented and has to be done manually from the standard errors returned when se.fit = TRUE (see Section 5.4.1).
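The two points above can be checked with a short sketch on simulated data. The names `x`, `y`, `mod`, `etaHat`, and `pHat` are hypothetical and only serve the illustration.

```r
# Simulated data (hypothetical, not the challenger dataset)
set.seed(1)
x <- runif(50, min = -3, max = 3)
y <- rbinom(50, size = 1, prob = plogis(1 - x))
mod <- glm(y ~ x, family = binomial)
newdata <- data.frame(x = c(-1, 0, 1))

# type = "link" (the default) returns the log-odds eta_hat
etaHat <- predict(mod, newdata = newdata, type = "link")

# type = "response" returns the probabilities logistic(eta_hat)
pHat <- predict(mod, newdata = newdata, type = "response")

# Both scales are related through the inverse link function
all.equal(pHat, plogis(etaHat))

# There is no interval argument, but se.fit = TRUE returns a list that
# contains the standard errors of the fit in the chosen scale
pred <- predict(mod, newdata = newdata, type = "link", se.fit = TRUE)
pred$fit
pred$se.fit
```

Requesting the standard errors in the link scale is convenient because a CI built there can be mapped to the probability scale with the inverse link, which keeps its endpoints inside \((0,1).\)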

Figure 5.8 gives an interactive visualization of the CIs for the conditional probability in simple logistic regression. Their interpretation is very similar to the CIs for the conditional mean in the simple linear model, see Section 2.5 and Figure 2.15.

Figure 5.8: Illustration of the CIs for the conditional probability in the simple logistic regression. Application available here.

5.4.1 Case study application

Let’s compute the probability of having at least one incident with the O-rings on the launch day (this answers Q3):

predict(nasa, newdata = data.frame(temp = -0.6), type = "response")
##        1 
## 0.999604

Recall that there is a serious problem of extrapolation in the prediction, which makes it less precise (or more variable). But this extrapolation, together with the evidence raised by the simple analysis we did, should have been a strong argument for postponing the launch.

Since it is a bit cumbersome to compute the CIs for the conditional probability, we can code the function predictCIsLogistic to do it automatically.

# Function for computing the predictions and CIs for the conditional probability
predictCIsLogistic <- function(object, newdata, level = 0.95) {

  # Compute predictions in the log-odds
  pred <- predict(object = object, newdata = newdata, se.fit = TRUE)

  # CI in the log-odds
  za <- qnorm(p = (1 - level) / 2)
  lwr <- pred$fit + za * pred$se.fit
  upr <- pred$fit - za * pred$se.fit

  # Transform to probabilities
  fit <- 1 / (1 + exp(-pred$fit))
  lwr <- 1 / (1 + exp(-lwr))
  upr <- 1 / (1 + exp(-upr))

  # Return a matrix with column names "fit", "lwr" and "upr"
  result <- cbind(fit, lwr, upr)
  colnames(result) <- c("fit", "lwr", "upr")
  return(result)

}
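The interval coded in predictCIsLogistic can be written explicitly. As a sketch, assume that \(\hat{\eta}\) is approximately normal (the usual large-sample approximation, which is an assumption here and not something verified in this section). Let \(z_{\alpha/2}\) denote the upper \(\alpha/2\)-quantile of a \(\mathcal{N}(0,1)\), so that za in the code equals \(-z_{\alpha/2}\), and let \(\hat{\mathrm{se}}(\hat{\eta})\) be the standard error returned in pred$se.fit. The CI at level \(1-\alpha\) in the log-odds is

\[\begin{align*} \left(\hat{\eta}-z_{\alpha/2}\,\hat{\mathrm{se}}(\hat{\eta}),\ \hat{\eta}+z_{\alpha/2}\,\hat{\mathrm{se}}(\hat{\eta})\right). \end{align*}\]

Since \(\mathrm{logistic}(x)=1/(1+e^{-x})\) is increasing, applying it to both endpoints gives the CI for the conditional probability:

\[\begin{align*} \left(\mathrm{logistic}\big(\hat{\eta}-z_{\alpha/2}\,\hat{\mathrm{se}}(\hat{\eta})\big),\ \mathrm{logistic}\big(\hat{\eta}+z_{\alpha/2}\,\hat{\mathrm{se}}(\hat{\eta})\big)\right). \end{align*}\]

The resulting interval always lies within \((0,1)\) and is, in general, not symmetric about \(\mathrm{logistic}(\hat{\eta}).\)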

Let’s apply the function to our model:

# Data for which we want a prediction
newdata <- data.frame(temp = -0.6)

# Prediction of the conditional log-odds, the default
predict(nasa, newdata = newdata, type = "link")
##        1 
## 7.833731

# Prediction of the conditional probability
predict(nasa, newdata = newdata, type = "response")
##        1 
## 0.999604

# Simple call
predictCIsLogistic(nasa, newdata = newdata)
##        fit       lwr       upr
## 1 0.999604 0.4838505 0.9999999
# The CI is wide because there are no data around temp = -0.6, which
# makes the prediction more variable (and also because we only have
# 23 observations)

Finally, let’s answer Q4 and see what was the probability of having at least one incident with the O-rings if the launch was postponed until the temperature was above \(11.67\) degrees Celsius.

# Estimated probability for launching at 53 degrees Fahrenheit
predictCIsLogistic(nasa, newdata = data.frame(temp = 11.67))
##         fit       lwr       upr
## 1 0.9382822 0.3504908 0.9976707

The maximum predicted probability is \(0.94.\) Notice that this is the maximum over the temperatures compatible with the suggestion of launching above \(11.67\) degrees Celsius, since the predicted probability decreases as the temperature increases. The probability of having at least one incident[2] with the O-rings is still very high.

For the challenger dataset, do the following:

  1. Regress fail.nozzle on temp and pres.nozzle.
  2. Compute the predicted probability of fail.nozzle=1 for temp \(=15\) and pres.nozzle \(=200.\) What is the predicted probability for fail.nozzle=0?
  3. Compute the confidence interval for the two predicted probabilities at level \(95\%.\)

  1. For example, the CI for the conditional response in the logistic model is not very informative, as it can only be \(\{0\},\) \(\{1\},\) or \(\{0,1\}.\) Predictions and CIs for the conditional response are carried out on a model-by-model basis.

  2. Whether at this temperature it would have been a fatal incident or not is left to speculation.