## 4.6 Prediction

Prediction in logistic regression focuses mainly on predicting the values of the logistic curve

$$p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\cdots+\beta_kx_k)}}$$

by means of

$$\hat p(x_1,\ldots,x_k)=\hat{\mathbb{P}}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\hat\beta_0+\hat\beta_1x_1+\cdots+\hat\beta_kx_k)}}.$$

From the perspective of the linear model, this is the same as predicting the conditional mean (not the conditional response) of the response, except that now the conditional mean is also a conditional probability. The prediction of the conditional response is not so interesting since it follows immediately from $\hat p(x_1,\ldots,x_k)$:

$$\hat{Y}|(X_1=x_1,\ldots,X_k=x_k)=\left\{\begin{array}{ll}1,&\text{with probability }\hat p(x_1,\ldots,x_k),\\0,&\text{with probability }1-\hat p(x_1,\ldots,x_k).\end{array}\right.$$

As a consequence, we can predict $Y$ as $1$ if $\hat p(x_1,\ldots,x_k)>\frac{1}{2}$ and as $0$ if $\hat p(x_1,\ldots,x_k)<\frac{1}{2}$.
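The plug-in classification rule described above can be sketched in R as follows. This is only a minimal illustration on synthetic data: the objects `x`, `y`, `mod`, `newx`, `pHat`, and `yHat`, as well as the coefficients used to simulate, are assumptions made for the example and are unrelated to the `challenger` analysis.

```r
# Synthetic data (illustrative assumption, not the challenger dataset)
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))

# Fit a simple logistic regression
mod <- glm(y ~ x, family = binomial)

# Estimated conditional probabilities at three new locations
newx <- data.frame(x = c(-2, 0, 2))
pHat <- predict(mod, newdata = newx, type = "response")

# Plug-in prediction of Y: 1 if pHat > 1/2, 0 otherwise
# (a tie at exactly 1/2 is sent to 0 here, which is an arbitrary choice)
yHat <- as.integer(pHat > 0.5)
cbind(newx, pHat, yHat)
```

Since the logistic function is increasing and equals $\frac{1}{2}$ at zero, thresholding the probability at $\frac{1}{2}$ is equivalent to thresholding the log-odds at $0$.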

Let’s focus then on how to make predictions and compute CIs in practice with `predict`. Similarly to the linear model, the objects required for `predict` are: first, the output of `glm`; second, a `data.frame` containing the locations $\mathbf{x}=(x_1,\ldots,x_k)$ where we want to predict $p(x_1,\ldots,x_k)$. However, there are two differences with respect to the use of `predict` for `lm`:

• The argument `type`. `type = "link"` (the default) gives the predictions in the log-odds scale, that is, it returns $\log\frac{\hat p(x_1,\ldots,x_k)}{1-\hat p(x_1,\ldots,x_k)}$. `type = "response"` gives the predictions in the probability scale $[0,1]$, that is, it returns $\hat p(x_1,\ldots,x_k)$.
• There is no `interval` argument in `predict` for `glm` objects, so CIs for the prediction are not returned directly. They have to be built from the standard errors that `predict` gives with `se.fit = TRUE`.
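The relation between the two values of `type` can be checked numerically. The sketch below uses synthetic data and hypothetical object names (`x`, `y`, `mod`, `newx`, `eta`, `p`), all of them assumptions made for illustration. It verifies that the `"response"` prediction is the logistic transformation of the `"link"` prediction, using `plogis` and `qlogis`, the logistic distribution function and its inverse in base R.

```r
# Synthetic data (illustrative assumption)
set.seed(1)
x <- runif(100, min = -3, max = 3)
y <- rbinom(100, size = 1, prob = plogis(1 - 2 * x))
mod <- glm(y ~ x, family = binomial)

# Predictions in both scales at the same locations
newx <- data.frame(x = c(-1, 0.5, 2))
eta <- predict(mod, newdata = newx, type = "link")     # log-odds
p <- predict(mod, newdata = newx, type = "response")   # probabilities

# The response is the logistic transformation of the link...
all.equal(plogis(eta), p)

# ...and the link is the log-odds of the response
all.equal(eta, log(p / (1 - p)))
all.equal(eta, qlogis(p))
```

For instance, the log-odds $7.833731$ obtained for the `nasa` model below corresponds to $1/(1+e^{-7.833731})\approx 0.999604$.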

Since computing the CIs by hand is a bit cumbersome, the function `predictCIsLogistic`, coded below, computes them automatically.
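The construction behind such a function can be summarized as follows. This is a sketch that assumes the usual asymptotic normality of the maximum likelihood estimates; the symbols $\hat\eta$, $\hat{\mathrm{se}}$, $\hat\eta_{\mathrm{lwr}}$, and $\hat\eta_{\mathrm{upr}}$ are notation introduced here for convenience. Denote by $\hat\eta(\mathbf{x})=\hat\beta_0+\hat\beta_1x_1+\cdots+\hat\beta_kx_k$ the predicted log-odds, by $\hat{\mathrm{se}}(\hat\eta(\mathbf{x}))$ its estimated standard error, and by $z_{\alpha/2}$ the upper $\alpha/2$-quantile of a $\mathcal{N}(0,1)$. An approximate CI at level $1-\alpha$ for the log-odds is

$$(\hat\eta_{\mathrm{lwr}},\hat\eta_{\mathrm{upr}})=\left(\hat\eta(\mathbf{x})-z_{\alpha/2}\,\hat{\mathrm{se}}(\hat\eta(\mathbf{x})),\;\hat\eta(\mathbf{x})+z_{\alpha/2}\,\hat{\mathrm{se}}(\hat\eta(\mathbf{x}))\right).$$

Because the logistic function is increasing, applying it to both endpoints preserves the coverage and gives an approximate CI for $p(x_1,\ldots,x_k)$ that always lies inside $[0,1]$:

$$\left(\frac{1}{1+e^{-\hat\eta_{\mathrm{lwr}}}},\;\frac{1}{1+e^{-\hat\eta_{\mathrm{upr}}}}\right).$$

Note that `za` in `predictCIsLogistic` is the lower quantile, that is, $-z_{\alpha/2}$, which is why it is added for the lower limit and subtracted for the upper one.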

# Data for which we want a prediction
# Important! You have to name the column with the predictor name!
newdata <- data.frame(temp = -0.6)

# Prediction of the conditional log-odds - the default
predict(nasa, newdata = newdata, type = "link")
##        1
## 7.833731

# Prediction of the conditional probability
predict(nasa, newdata = newdata, type = "response")
##        1
## 0.999604

# Function for computing the predictions and CIs for the conditional probability
predictCIsLogistic <- function(object, newdata, level = 0.95) {

  # Compute predictions in the log-odds (type = "link" is the default)
  pred <- predict(object = object, newdata = newdata, type = "link",
                  se.fit = TRUE)

  # CI in the log-odds (za is the lower quantile, hence negative)
  za <- qnorm(p = (1 - level) / 2)
  lwr <- pred$fit + za * pred$se.fit
  upr <- pred$fit - za * pred$se.fit

  # Transform to probabilities
  fit <- 1 / (1 + exp(-pred$fit))
  lwr <- 1 / (1 + exp(-lwr))
  upr <- 1 / (1 + exp(-upr))

  # Return a matrix with column names "fit", "lwr" and "upr"
  result <- cbind(fit, lwr, upr)
  colnames(result) <- c("fit", "lwr", "upr")
  return(result)

}

# Simple call
predictCIsLogistic(nasa, newdata = newdata)
##        fit       lwr       upr
## 1 0.999604 0.4838505 0.9999999
# The CI is wide because there are no data around temp = -0.6, which
# makes the prediction more variable (and also because we only have
# 23 observations)
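The same computation works for several prediction locations at once, since `newdata` may have any number of rows. The sketch below illustrates this on synthetic data. The function is a condensed copy of the one above, repeated only so that the example runs on its own, and the objects `x`, `y`, `mod`, `grid`, and `cis` are hypothetical names assumed for illustration.

```r
# Condensed copy of predictCIsLogistic, so that this sketch is self-contained
predictCIsLogistic <- function(object, newdata, level = 0.95) {
  pred <- predict(object, newdata = newdata, type = "link", se.fit = TRUE)
  za <- qnorm(p = (1 - level) / 2)
  cbind(fit = plogis(pred$fit),
        lwr = plogis(pred$fit + za * pred$se.fit),
        upr = plogis(pred$fit - za * pred$se.fit))
}

# Synthetic data (illustrative assumption, not the challenger dataset)
set.seed(123)
x <- rnorm(50)
y <- rbinom(50, size = 1, prob = plogis(0.5 + x))
mod <- glm(y ~ x, family = binomial)

# CIs for the conditional probability on a grid of predictor values
grid <- data.frame(x = seq(-4, 4, by = 2))
cis <- predictCIsLogistic(mod, newdata = grid)

# One row per location, with the width of each CI
cbind(grid, cis, width = cis[, "upr"] - cis[, "lwr"])
```

Comparing the `width` column across the grid shows how the uncertainty of the prediction changes with the location. Each interval contains its point prediction and stays inside $[0,1]$ by construction.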

For the `challenger` dataset, do the following:

• Regress `fail.nozzle` on `temp` and `pres.nozzle`.
• Compute the predicted probability of `fail.nozzle = 1` for `temp = 15` and `pres.nozzle = 200`. What is the predicted probability of `fail.nozzle = 0`?
• Compute the CIs for the two predicted probabilities at the 95% level.

Finally, Figure 4.9 gives an interactive visualization of the CIs for the conditional probability in simple logistic regression. Their interpretation is very similar to the CIs for the conditional mean in the simple linear model, see Section 2.6 and Figure 2.23.

Figure 4.9: Illustration of the CIs for the conditional probability in the simple logistic regression.