2.5 Prediction
The forecast of from (this is, ) is approached in two different ways:
- Through the estimation of the conditional mean of given This is a deterministic quantity, which equals
- Through the prediction of the conditional response This is a random variable distributed as
There are similarities and differences in the prediction of the conditional mean and conditional response which we highlight next:
- Similarities. The estimate is the same,36 The CIs for both quantities are centered in
- Differences. is deterministic and is a random variable. The prediction of the latter is noisier, because it has to take into account the randomness of Therefore, the variance is larger for the prediction of than for the prediction of This has a direct consequence on the length of the prediction intervals, which are longer for than for
The inspection of the CIs for the conditional mean and conditional response in the simple linear model offers great insight into the previous similarities and differences, and also on what components affect precisely the quality of the prediction:
- The CI for the conditional mean is
- The CI for the conditional response is
Figure 2.15: Illustration of the CIs for the conditional mean and response. Note how the width of the CIs is influenced by especially for the conditional mean (the conditional response has a constant term affecting the width). Application available here.
Notice the dependence of both CIs on and each of them with a clear effect on the resulting length of the interval. Note also the high similarity between (2.18) and (2.19) (both intervals are centered at and have a similar variance) and its revealing unique difference: the extra in (2.19),37 consequence of the “extra randomness” of the conditional response with respect to the conditional mean. Figure 2.15 helps to visualize these concepts and the difference between CIs interactively.
2.5.1 Case study application
The prediction and the computation of prediction CIs can be done with predict
. The objects required for predict
are: first, an lm
object; second, a data.frame
containing the locations where we want to predict The prediction is and the CIs returned are either (2.18) or (2.19).
It is mandatory to name the columns of the data.frame
with the same names of the predictors used in
lm
. Otherwise predict
will throw an error.
# Fit a linear model for the price on WinterRain, HarvestRain, and AGST
modWine4 <- lm(Price ~ WinterRain + HarvestRain + AGST, data = wine)
summary(modWine4)
##
## Call:
## lm(formula = Price ~ WinterRain + HarvestRain + AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62816 -0.17923 0.02274 0.21990 0.62859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.9506001 1.9694011 -2.514 0.01940 *
## WinterRain 0.0012820 0.0005765 2.224 0.03628 *
## HarvestRain -0.0036242 0.0009646 -3.757 0.00103 **
## AGST 0.7123192 0.1087676 6.549 1.11e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3436 on 23 degrees of freedom
## Multiple R-squared: 0.7407, Adjusted R-squared: 0.7069
## F-statistic: 21.9 on 3 and 23 DF, p-value: 6.246e-07
# Data for which we want a prediction
# Important! You have to name the column with the predictor's name!
weather <- data.frame(WinterRain = 500, HarvestRain = 123, AGST = 18)
weatherBad <- data.frame(500, 123, 18)
# Prediction of the mean
# Prediction of the mean at 95% -- the defaults
predict(modWine4, newdata = weather)
## 1
## 8.066342
predict(modWine4, newdata = weatherBad) # Error
## Error in eval(predvars, data, env): object 'WinterRain' not found
# Prediction of the mean with 95% confidence interval (the default)
# CI: (lwr, upr)
predict(modWine4, newdata = weather, interval = "confidence")
## fit lwr upr
## 1 8.066342 7.714178 8.418507
predict(modWine4, newdata = weather, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 8.066342 7.714178 8.418507
# Other levels
predict(modWine4, newdata = weather, interval = "confidence", level = 0.90)
## fit lwr upr
## 1 8.066342 7.774576 8.358108
predict(modWine4, newdata = weather, interval = "confidence", level = 0.99)
## fit lwr upr
## 1 8.066342 7.588427 8.544258
# Prediction of the response
# Prediction of the mean at 95% -- the defaults
predict(modWine4, newdata = weather)
## 1
## 8.066342
# Prediction of the response with 95% confidence interval
# CI: (lwr, upr)
predict(modWine4, newdata = weather, interval = "prediction")
## fit lwr upr
## 1 8.066342 7.273176 8.859508
predict(modWine4, newdata = weather, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 8.066342 7.273176 8.859508
# Other levels
predict(modWine4, newdata = weather, interval = "prediction", level = 0.90)
## fit lwr upr
## 1 8.066342 7.409208 8.723476
predict(modWine4, newdata = weather, interval = "prediction", level = 0.99)
## fit lwr upr
## 1 8.066342 6.989951 9.142733
# Predictions for several values
weather2 <- data.frame(WinterRain = c(500, 200), HarvestRain = c(123, 200),
AGST = c(17, 18))
predict(modWine4, newdata = weather2, interval = "prediction")
## fit lwr upr
## 1 7.354023 6.613835 8.094211
## 2 7.402691 6.533945 8.271437
For the wine
dataset, do the following:
- Regress
WinterRain
onHarvestRain
andAGST
. Name the fitted modelmodExercise
. - Compute the estimate for the conditional mean of
WinterRain
forHarvestRain
andAGST
What is the CI at ? - Compute the estimate for the conditional response for
HarvestRain
andAGST
What is the CI at ? - Check that
modExercise$fitted.values
is the same aspredict(modExercise, newdata = data.frame(HarvestRain = wine$HarvestRain, AGST = wine$AGST))
. Why is this so?