3.5 Prediction

As in the simple linear model, the forecast of $Y$ from $\mathbf{X}=\mathbf{x}$ (this is, $X_1=x_1,\ldots,X_k=x_k$ ) is approached by two different ways:

Inference on the conditional mean of $Y$ given $\mathbf{X}=\mathbf{x}$ , $\mathbb{E}[Y|\mathbf{X}=\mathbf{x}]$ . This is a deterministic quantity, which equals $\beta_0+\beta_1x_1+\cdots+\beta_{k}x_k$ .
Prediction of the conditional response $Y|\mathbf{X}=\mathbf{x}$ . This is a random variable distributed as $\mathcal{N}(\beta_0+\beta_1x_1+\cdots+\beta_{k}x_k,\sigma^2)$ .

The prediction and computation of CIs can be done with the R function predict (unfortunately, there is no R Commander shortcut for this one). The objects required for predict are: first, the output of lm; second, a data.frame containing the locations $\mathbf{x}=(x_1,\ldots,x_k)$ where we want to predict $\beta_0+\beta_1x_1+\cdots+\beta_{k}x_k$ . The prediction is $\hat\beta_0+\hat\beta_1x_1+\cdots+\hat\beta_{k}x_k$

It is mandatory to name the columns of the data frame with the same names of the predictors used in lm. Otherwise predict will generate an error, see below.

To illustrate the use of predict, we return to the wine dataset.

# Fit a linear model for the price on WinterRain, HarvestRain and AGST
modelW <- lm(Price ~ WinterRain + HarvestRain + AGST, data = wine)
summary(modelW)
## 
## Call:
## lm(formula = Price ~ WinterRain + HarvestRain + AGST, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62816 -0.17923  0.02274  0.21990  0.62859 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.9506001  1.9694011  -2.514  0.01940 *  
## WinterRain   0.0012820  0.0005765   2.224  0.03628 *  
## HarvestRain -0.0036242  0.0009646  -3.757  0.00103 ** 
## AGST         0.7123192  0.1087676   6.549 1.11e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3436 on 23 degrees of freedom
## Multiple R-squared:  0.7407, Adjusted R-squared:  0.7069 
## F-statistic:  21.9 on 3 and 23 DF,  p-value: 6.246e-07

# Data for which we want a prediction
# Important! You have to name the column with the predictor name!
weather <- data.frame(WinterRain = 500, HarvestRain = 123,
                      AGST = 18)

## Prediction of the mean

# Prediction of the mean at 95% - the defaults
predict(modelW, newdata = weather)
##        1 
## 8.066342

# Prediction of the mean with 95% confidence interval (the default)
# CI: (lwr, upr)
predict(modelW, newdata = weather, interval = "confidence")
##        fit      lwr      upr
## 1 8.066342 7.714178 8.418507
predict(modelW, newdata = weather, interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 8.066342 7.714178 8.418507

# Other levels
predict(modelW, newdata = weather, interval = "confidence", level = 0.90)
##        fit      lwr      upr
## 1 8.066342 7.774576 8.358108
predict(modelW, newdata = weather, interval = "confidence", level = 0.99)
##        fit      lwr      upr
## 1 8.066342 7.588427 8.544258

## Prediction of the response

# Prediction of the mean at 95% - the defaults
predict(modelW, newdata = weather)
##        1 
## 8.066342

# Prediction of the response with 95% confidence interval (the default)
# CI: (lwr, upr)
predict(modelW, newdata = weather, interval = "prediction")
##        fit      lwr      upr
## 1 8.066342 7.273176 8.859508
predict(modelW, newdata = weather, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 8.066342 7.273176 8.859508

# Other levels
predict(modelW, newdata = weather, interval = "prediction", level = 0.90)
##        fit      lwr      upr
## 1 8.066342 7.409208 8.723476
predict(modelW, newdata = weather, interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 8.066342 6.989951 9.142733

# Predictions for several values
weather2 <- data.frame(WinterRain = c(500, 200), HarvestRain = c(123, 200),
                       AGST = c(17, 18))
predict(modelW, newdata = weather2, interval = "prediction")
##        fit      lwr      upr
## 1 7.354023 6.613835 8.094211
## 2 7.402691 6.533945 8.271437

For the wine dataset, do the following:

Regress WinterRain on HarvestRain and AGST. Name the fitted model modExercise.
Compute the estimate for the conditional mean of WinterRain for HarvestRain $=123.0$ and AGST $=16.15$ . What is the CI at $\alpha=0.01$ ?
Compute the estimate for the conditional response for HarvestRain $=125.0$ and AGST $=15$ . What is the CI at $\alpha=0.10$ ?
Check that modExercise$fitted.values is the same as predict(modExercise, newdata = data.frame(HarvestRain = wine$HarvestRain, AGST = wine$AGST)). Why is so?

Similarities and differences in the prediction of the conditional mean $\mathbb{E}[Y|\mathbf{X}=\mathbf{x}]$ and conditional response $Y|\mathbf{X}=\mathbf{x}$ :

Similarities. The estimate is the same, $\hat y=\hat\beta_0+\hat\beta_1x_1+\cdots+\hat\beta_kx_k$ . Both CI are centered in $\hat y$ .
Differences. $\mathbb{E}[Y|\mathbf{X}=\mathbf{x}]$ is deterministic and $Y|\mathbf{X}=\mathbf{x}$ is random. Therefore, the variance is larger for the prediction of $Y|\mathbf{X}=\mathbf{x}$ than for the prediction of $\mathbb{E}[Y|\mathbf{X}=\mathbf{x}]$ .