CHAPTER 5 Inference on the Slope and Mean Response, and Prediction of New Observations
Aside from point estimation of the parameters and the ANOVA for the regression relation, we extend our inference to confidence intervals and hypothesis tests for each parameter.
We also make inferences about the mean response of Y given values of the regressors. Prediction of new responses, one of the main goals of regression once the model has been built, is also discussed.
5.1 Inference on the Slope Parameters
Confidence Interval Estimation for Slope Parameters
Confidence interval estimates of the parameters give a better picture than point estimates (e.g. BLUE, UMVUE). The width of these confidence intervals is a measure of the overall quality of the regression line.
A possible pivotal quantity we can use to construct confidence intervals for the \(\beta_j\)s is given by the following: \[ \frac{\hat{\beta}_j-\beta_j}{\widehat{s.e.(\hat{\beta}_j)}}=\frac{\hat{\beta}_j-\beta_j}{\sqrt{MSE\underline{e}_{j+1}'(\textbf{X}'\textbf{X})^{-1}\underline{e}_{j+1}}}\sim t_{(n-p)} \] Recall Section 4.3 for the form of \(\widehat{s.e.(\hat{\beta}_j)}\).
Theorem 5.1 A \((1-\alpha)100\%\) CI for \(\beta_j\) is given by \[ \left( \hat{\beta}_j \mp t_{(\frac{\alpha}{2},n-p)}\widehat{s.e.(\hat{\beta_j})} \right) \]
Notes:
- These confidence intervals have the usual frequentist interpretation.
- That is, if we are to take repeated samples of the same size at the same levels of the independent variables, and construct \((1 − \alpha)100\%\) confidence intervals for \(\beta_j\) for each sample, then approximately \((1 − \alpha)100\%\) of these intervals will contain the true value of \(\beta_j\).
Illustration in R
Let’s use again the Anscombe dataset with the following variables:
- \(y\): income
- \(x_1\): education
- \(x_2\): young
- \(x_3\): urban
library(carData)
data(Anscombe)
mod_anscombe <- lm(income~education + young + urban, data = Anscombe)
In R, the confidence intervals of the regression parameters are not displayed together with the other metrics in the summary() output.
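The summary output below can be reproduced with a call along these lines, using the mod_anscombe object fitted above:
summary(mod_anscombe)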
##
## Call:
## lm(formula = income ~ education + young + urban, data = Anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -438.54 -145.21 -41.65 119.20 739.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3011.4633 609.4769 4.941 1.03e-05 ***
## education 7.6313 0.8798 8.674 2.56e-11 ***
## young -6.8544 1.6617 -4.125 0.00015 ***
## urban 1.7692 0.2591 6.828 1.49e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 259.7 on 47 degrees of freedom
## Multiple R-squared: 0.7979, Adjusted R-squared: 0.785
## F-statistic: 61.86 on 3 and 47 DF, p-value: 2.384e-16
The confint() function computes confidence intervals for one or more parameters in a fitted model. The 95% confidence intervals are shown by default.
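The intervals shown below can be obtained with a call like the following (using the mod_anscombe object from earlier; level = 0.95 is the default):
confint(mod_anscombe, level = 0.95)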
## 2.5 % 97.5 %
## (Intercept) 1785.353882 4237.572728
## education 5.861365 9.401299
## young -10.197214 -3.511526
## urban 1.247888 2.290464
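As a check on Theorem 5.1, the interval for a single coefficient can also be computed by hand from the fitted model. A minimal sketch for the education slope, assuming the mod_anscombe object from above and using the fact that vcov() on an lm fit returns \(MSE(\textbf{X}'\textbf{X})^{-1}\):
# manual 95% CI for the education slope, following Theorem 5.1
est   <- unname(coef(mod_anscombe)["education"])
se    <- unname(sqrt(diag(vcov(mod_anscombe))["education"]))   # estimated s.e. of the slope
tcrit <- qt(0.975, df = df.residual(mod_anscombe))             # t_(alpha/2, n-p)
c(lower = est - tcrit * se, upper = est + tcrit * se)
The result should match the education row of the confint() output above.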
Hypothesis Testing for Individual Slope Parameters
Testing the parameters and reporting p-values may give us faster interpretations than confidence intervals.
Recall that the statistic \(\frac{\hat{\beta}_j-\beta_j}{\widehat{s.e.(\hat{\beta}_j)}}\) follows a \(t_{(n-p)}\) distribution.
Definition 5.1 (T-test for a Regression Slope) For testing \(Ho:\beta_j=0\) vs \(Ha: \beta_j\neq0\), the test statistic is \[ T=\frac{\hat{\beta_j}}{\widehat{s.e.(\hat{\beta_j})}}\sim t_{(n-p)}, \quad \text{under } Ho \] The critical region of the test is \(|T|>t_{\alpha/2,(n-p)}\)
This is somewhat similar to the section [Testing one slope parameter equal to 0]; however, instead of t-tests, F-tests are used in the next chapter.
Illustration in R
Results of the hypothesis tests on individual coefficients are straightforward to obtain in R. Just use the summary() function on an lm object, and the estimated coefficients, standard errors, t statistics, and p-values are already summarized.
##
## Call:
## lm(formula = income ~ education + young + urban, data = Anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -438.54 -145.21 -41.65 119.20 739.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3011.4633 609.4769 4.941 1.03e-05 ***
## education 7.6313 0.8798 8.674 2.56e-11 ***
## young -6.8544 1.6617 -4.125 0.00015 ***
## urban 1.7692 0.2591 6.828 1.49e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 259.7 on 47 degrees of freedom
## Multiple R-squared: 0.7979, Adjusted R-squared: 0.785
## F-statistic: 61.86 on 3 and 47 DF, p-value: 2.384e-16
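To connect this output with Definition 5.1, the coefficient table can be extracted directly and each \(|t|\) value compared with the two-sided critical value \(t_{(\alpha/2, n-p)}\). A small sketch, again using the mod_anscombe object:
# t statistics and p-values for the individual slope tests
summary(mod_anscombe)$coefficients
# two-sided critical value at alpha = 0.05, with n - p = 47 df
qt(0.975, df = df.residual(mod_anscombe))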
If the model is \(y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\varepsilon\), after performing the 3 t-tests for \(j=1,2,3\), what are the 3 conclusions at the 0.05 level of significance?
Again, the t-tests presented here are for individual slope parameters. Although similar in nature and goal, the hypothesis tests in the next chapter will be used to show whether one model is indeed better than another.
5.2 Inference about the mean response
One of the major goals usually is to estimate the mean for one or more probability distributions of \(Y\).
For given values of the independent variables, say \(\textbf{x}_h\), the expected or mean response is \(E(Y_h) = \textbf{x}'_h\boldsymbol{\beta}\).
In this section, we want to make inferences about the mean \(E(Y_h)\) for given values of the independent variable.
Point Estimation
Definition 5.2 (Estimated Mean Response)
The estimated mean response corresponding to \(\textbf{x}_h\), denoted by \(\widehat{Y}_h\), is
\[
\widehat{Y}_h=\textbf{x}'_h\hat{\boldsymbol{\beta}}
\]
Note: Not to be confused with the Fitted Values in Definition 4.1. Here, instead of using the same set of \(\textbf{x}\) values that were used in fitting, we can now use new values of the independent variables \(\textbf{x}_h\) to estimate the expected response \(E(Y_h)\).
Theorem 5.2 The estimated mean response \(\hat{Y}_h = \textbf{x}^′_h \boldsymbol{\hat{\beta}}\) is the BLUE for \(E (Y_h)\).
Exercise 5.1 Prove Theorem 5.2
Hint
Recall result and remarks from the Gauss-Markov Theorem.
Show first that for any vector of constants \(\boldsymbol{\lambda}\) and any other linear unbiased estimator \(\overset{\sim}{\boldsymbol{\beta}}\), \(Var(\boldsymbol{\lambda}'\overset{\sim}{\boldsymbol{\beta}})\geq Var(\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}})\).
(This implies that a linear combination of the OLS estimators \(\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_k\) is the Best Linear Unbiased Estimator for the same linear combination of \(\beta_0,\beta_1,...,\beta_k\).)
Then change notations to match the notations in Theorem 5.2.
Confidence Interval Estimation
For the normal error regression model, \[ \widehat{Y}_h = \textbf{x}^′_h \boldsymbol{\hat{\beta}}\sim Normal(\textbf{x}^′_h \boldsymbol{\beta}, \sigma^2\textbf{x}^′_h(\textbf{X}'\textbf{X})^{-1}\textbf{x}_h) \]Based on this distribution, a possible pivot to create a CI estimator for \(E(Y_h)\) is given by: \[ \frac{\hat{Y}_h-\textbf{x}^′_h \boldsymbol{\beta}} {\sqrt{MSE\cdot\textbf{x}^′_h(\textbf{X}'\textbf{X})^{-1}\textbf{x}_h}} \sim t_{(n-p)} \]
Theorem 5.3 A \((1-\alpha)100\%\) confidence interval for \(E(Y_h)\) \[ \left(\hat{Y}_h \mp t_{\frac{\alpha}{2}, n-p} \cdot \sqrt{MSE \cdot \textbf{x}'_h(\textbf{X}'\textbf{X})^{-1}\textbf{x}_h} \right) \]
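The interval in Theorem 5.3 can be computed directly from an lm fit. A minimal sketch using the mod_anscombe object from Section 5.1 at a hypothetical \(\textbf{x}_h\) (the values of education, young, and urban below are ours, chosen only for illustration); recall that vcov() returns \(MSE(\textbf{X}'\textbf{X})^{-1}\):
# 95% CI for E(Y_h) at a hypothetical x_h (order must match coef(mod_anscombe))
xh <- c(1, education = 200, young = 350, urban = 700)
yh <- sum(xh * coef(mod_anscombe))                  # estimated mean response
se <- sqrt(sum(xh * (vcov(mod_anscombe) %*% xh)))   # sqrt(MSE * x_h'(X'X)^{-1} x_h)
yh + c(-1, 1) * qt(0.975, df.residual(mod_anscombe)) * se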
Visualization for Simple Linear Regression
The geom_smooth layer of a ggplot object shows the fitted least squares line (specified by the parameter method = "lm") as well as the confidence bands (se = T). The default confidence level is level = 0.95.
library(ggplot2)
library(carData)
ggplot(data = Anscombe, aes(x=education, y = income))+
geom_point()+
geom_smooth(method = "lm", se = T) +
theme_bw()
Hypothesis Testing
Following previous results, we have the following test about the expected or mean response.
Theorem 5.4 The following can be used to test the hypotheses:
\[ Ho:E(Y_h)=d\quad\text{vs}\quad Ha:E(Y_h) \neq d \]
Test Statistic: \[ t=\frac{\hat{Y}_h-d} {\sqrt{MSE\cdot\textbf{x}^′_h(\textbf{X}'\textbf{X})^{-1}\textbf{x}_h}} \]
Critical Region: Reject \(Ho\) if \[ |t|>t_{(\alpha/2,n-p)} \]
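Continuing with the same hypothetical \(\textbf{x}_h\) as in the sketch after Theorem 5.3, the test statistic and two-sided p-value for, say, \(Ho: E(Y_h) = 3000\) (the value 3000 is ours, for illustration only) can be computed as:
xh <- c(1, education = 200, young = 350, urban = 700)
yh <- sum(xh * coef(mod_anscombe))
se <- sqrt(sum(xh * (vcov(mod_anscombe) %*% xh)))
t_stat <- (yh - 3000) / se
2 * pt(-abs(t_stat), df = df.residual(mod_anscombe))   # two-sided p-value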
5.3 Prediction of a new observation
In the case that we are not estimating a parameter but predicting a particular value of the random variable \(Y_i\) for a particular \(\textbf{x}_i\), we can use the concept of prediction intervals. In this case, we are predicting an individual outcome drawn from the distribution of \(Y\).
The new observation on \(Y\) is viewed as the result of a new trial, independent of the trials on which the regression analysis is based. We also assume that the underlying regression model applicable for the basic sample data continues to be appropriate for the new observation.
Case 1: All parameters are known
If all parameters are known, take note that \[ Y_i\sim N(\textbf{x}_i'\boldsymbol{\beta},\sigma^2) \] A possible pivot to create a prediction interval is: \[ \frac{Y_i-\textbf{x}_i'\boldsymbol{\beta}}{\sigma}\sim N(0,1) \] From this, we can now have the following prediction interval:
Theorem 5.5 A prediction interval for \(Y_{h(new)}\) is given by \[ (\textbf{x}_h'\boldsymbol{\beta} \mp z_{\alpha/2}\sigma) \]
Case 2: All parameters are unknown
Assumptions:
- parameters have already been estimated (but true values are still unknown)
- the given \(\textbf{x}_h\) must be independent of the previous sample
Theorem 5.6 A prediction interval for \(Y_{h(new)}\) is given by: \[ (\textbf{x}_h'\hat{\boldsymbol{\beta}} \mp t_{(\alpha/2,n-p)}\sqrt{MSE\cdot[\textbf{x}^′_h(\textbf{X}'\textbf{X})^{-1}\textbf{x}_h+1]}) \]
Remarks:
- Why is there a "+1" in Case 2?
  - In predicting a new observation, there are two sources of variability, since you are technically estimating two things:
    - the mean response \(E(Y_h)\), and
    - the new observation based on the estimated mean response.
  - For the case of unknown parameters, prediction intervals are therefore wider than confidence intervals.
- How to interpret prediction intervals?
  - Under the assumption that the additional observation is obtained in the same way as you obtained your past data, we can interpret \((1 − \alpha)\) as the probability that the new observation is inside the prediction interval.
  - This makes sense since \(Y_{h(new)}\) is a random variable, in contrast to confidence intervals, where the constructed interval is for a fixed parameter value that is not random.
- Take note again of the caution against using the model when \(\textbf{x}_h\) is not inside the "scope" of the model!
Illustration of Predicting a Response in R
Recall the Anscombe dataset. We use education, young, and urban as linear predictors of income.
library(carData)
data(Anscombe)
mod <- lm(income ~ education + urban + young, data = Anscombe)
summary(mod)
##
## Call:
## lm(formula = income ~ education + urban + young, data = Anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -438.54 -145.21 -41.65 119.20 739.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3011.4633 609.4769 4.941 1.03e-05 ***
## education 7.6313 0.8798 8.674 2.56e-11 ***
## urban 1.7692 0.2591 6.828 1.49e-08 ***
## young -6.8544 1.6617 -4.125 0.00015 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 259.7 on 47 degrees of freedom
## Multiple R-squared: 0.7979, Adjusted R-squared: 0.785
## F-statistic: 61.86 on 3 and 47 DF, p-value: 2.384e-16
Now, suppose we want to predict the income of 2 new states, and data for education, the young ratio, and the urban ratio are available. The data are stored in the dataframe new_states.
new_state1 <- c(education = 190, young = 350, urban = 500)
new_state2 <- c(education = 300, young = 275, urban = 250)
new_states <- data.frame(rbind(new_state1, new_state2))
new_states
## education young urban
## new_state1 190 350 500
## new_state2 300 275 250
Note that the new dataframe new_states has the same columns required for the predictors stored in the lm object mod.
We can estimate the expected income, which is the same as predicting the income, of the two new states given the predictors using the fitted model mod and the function predict(). The confidence interval or prediction interval can be shown using the interval = argument.
# estimate of expected income with 95% confidence intervals
predict(mod, new_states, interval = "confidence", level = 0.95)
## fit lwr upr
## new_state1 2946.975 2828.350 3065.600
## new_state2 3858.205 3351.406 4365.004
# prediction of income with 95% prediction intervals
predict(mod, new_states, interval = "prediction", level = 0.95)
## fit lwr upr
## new_state1 2946.975 2411.320 3482.63
## new_state2 3858.205 3130.401 4586.01
Notice that the confidence interval of \(E(Y_{h(new)})\) is narrower than the prediction interval of \(Y_{h(new)}\).
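As a check on the "+1" term in Theorem 5.6, the first row of the prediction-interval output above can be reproduced by hand (the order of the entries in \(\textbf{x}_h\) below follows coef(mod): intercept, education, urban, young):
# manual 95% prediction interval for new_state1
xh      <- c(1, education = 190, urban = 500, young = 350)
yh      <- sum(xh * coef(mod))
MSE     <- summary(mod)$sigma^2
se_pred <- sqrt(sum(xh * (vcov(mod) %*% xh)) + MSE)   # MSE * [x_h'(X'X)^{-1} x_h + 1]
yh + c(-1, 1) * qt(0.975, df.residual(mod)) * se_pred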
Exercise 5.2 In terms of interpretation, what is the difference between the Expected Value and the Predicted Value in regression?
5.4 The Inverse Prediction Problem
Definition 5.3 A regression model of \(Y\) on \(X\) may be used to predict the value of \(X\) which gave rise to a new observation \(Y\) . This is known as an inverse prediction.
Example: A regression analysis of the amount of decrease in cholesterol level \((Y)\) achieved with a given dosage of a new drug \((X)\) has been conducted, based on observations for 50 patients. A physician is treating a new patient for whom the cholesterol level should decrease to amount \(Y_{h(new)}\). It is desired to estimate the appropriate dosage level \(X_{h(new)}\) to be administered to bring about the needed cholesterol decrease \(Y_{h(new)}\).
The model is assumed to be \(Y_i=\beta_0+\beta X_i+\varepsilon_i\) and the estimated regression line is \(\hat{Y}_i=\hat{\beta}_0+\hat{\beta}X_i\).
If a new observation \(Y_{h(new)}\) becomes available, a natural point estimator for level \(X_{h(new)}\) which gave rise to \(Y_{h(new)}\) is
\[ \hat{X}_{h(new)}=\frac{\hat{Y}_{h(new)}-\hat{\beta}_0}{\hat{\beta}} \]
Illustrations:
- Suppose \(Y\) = mortality rate of grouper juveniles and \(X\) = concentration of heavy metal (lead), and the estimated regression model is \(\hat{Y}=30+0.60X\). What should be the value of \(X\) so \(\hat{Y}=50\) is attained?
- Suppose \(Y\) = sales and \(X\) = expenditure on print advertisement. Suppose that a 10% increase in sales is targeted; the current sales level is at 100, so a 10% increase brings it to 110. Can this be supported by an increase in expenditure on print advertisement? Suppose \(\hat{Y} = 100 + 2.5X\); find \(X\) when \(\hat{Y} = 110\).
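Applying the point estimator \(\hat{X}_{h(new)}=(\hat{Y}_{h(new)}-\hat{\beta}_0)/\hat{\beta}\) to the two illustrations above:
# grouper example: X such that Y-hat = 50, given Y-hat = 30 + 0.60X
(50 - 30) / 0.60
# advertising example: X such that Y-hat = 110, given Y-hat = 100 + 2.5X
(110 - 100) / 2.5
That is, about 33.33 and 4, respectively, in the units of each \(X\).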
Remarks:
Estimates for \(X\) are functions of the parameter estimates. Standard errors for these predictions can be approximated from the standard errors of the least squares estimates of the regression coefficients.
However, since the standard errors of the predictions depend on the original regression model, one may encounter problems in using them for prediction intervals. Sometimes the resulting upper and lower bounds can be very wide or otherwise ill-behaved.
The inverse prediction problem is also known as a calibration problem since it is applicable when inexpensive, quick, and approximate measurements \(Y\) are related to precise, often expensive, and time-consuming measurements \(X\) based on \(n\) observations.
The resulting regression model is used to estimate the precise measurement \(X\) for a new approximate measurement \(Y\). One can easily extend the notion of the calibration problem to two or more independent variables. However, calibration should be done to one independent variable only. That is, hold other independent variables constant while finding the value \(X_j\) for the \(j^{th}\) independent variable.
The inverse prediction problem has aroused controversy among statisticians. Some statisticians have suggested that inverse predictions should be made in direct fashion by regressing \(X\) on \(Y\). That is, use the \(X\) as dependent variable, and use \(Y\) as the independent variable. This regression is called inverse regression.
5.5 Assessment of Predictive Ability of Model
Selecting a model based on metrics that are calculated using observations not included in the estimation helps prevent overfitting.
Cross Validation using Train and Test Sets
In this section, we perform a cross validation method to assess how good the model is in predicting a set of new observations. Often, we first split the dataset into a training set and a test set.
Definition 5.4 (Train and Test Sets)
- Train dataset is a subset of the original dataset that we use to create the model.
- Test dataset is a subset of the original dataset that we assume to be new or unobserved, which we use to predict values of the dependent variable to assess the model.
Remarks:
- It is common to split the data 70-30: 70% of the dataset will be used for training the model, and the rest will be used for testing the model.
- In linear regression, the observations belonging to the train or test set are chosen randomly (SRS).
- Prior to splitting, we already assumed that observations in the dataset were sampled independently. Even if we take a sample without replacement (SRSWOR) from the already collected dataset, the independence assumption will not be violated if we regress using the training set.
Illustration in R:
Choosing the observations at random that will belong to the train and test datasets. (The split below is random; call set.seed() beforehand if you want it to be reproducible.)
# obtaining random numbers, 70% of the number of observations
index <- sample(1:nrow(Anscombe),
size = round(nrow(Anscombe)*0.7,0))
index
## [1] 35 43 44 42 47 3 4 22 45 31 27 10 50 16 48 24 18 28 26 37 36 21 34 2 33
## [26] 14 6 23 29 51 9 41 11 40 15 17
# splitting the dataset using the random numbers generated
library(carData)
train <- Anscombe[index,]
test <- Anscombe[-index,]
Test dataset:
  | education | income | young | urban |
---|---|---|---|---|
ME | 189 | 2824 | 350.7 | 508 |
RI | 180 | 3549 | 327.1 | 871 |
NY | 261 | 4151 | 326.2 | 856 |
NJ | 214 | 3954 | 333.5 | 889 |
IL | 189 | 3981 | 348.9 | 830 |
MI | 233 | 3675 | 369.2 | 738 |
SD | 187 | 2876 | 368.7 | 446 |
NE | 148 | 3239 | 349.9 | 615 |
VA | 180 | 3068 | 353.0 | 631 |
FL | 191 | 3191 | 336.0 | 805 |
TN | 137 | 2579 | 342.8 | 588 |
TX | 155 | 3029 | 369.4 | 797 |
MT | 238 | 2942 | 368.9 | 534 |
NV | 225 | 3957 | 385.1 | 809 |
CA | 273 | 3968 | 348.4 | 909 |
Train dataset:
  | education | income | young | urban |
---|---|---|---|---|
AR | 134 | 2322 | 351.9 | 500 |
NM | 227 | 2651 | 421.5 | 698 |
AZ | 207 | 3027 | 387.5 | 796 |
CO | 192 | 3340 | 358.1 | 785 |
WA | 215 | 3688 | 341.3 | 726 |
VT | 230 | 3072 | 348.5 | 322 |
MA | 168 | 3835 | 335.3 | 846 |
DE | 248 | 3795 | 375.9 | 722 |
UT | 201 | 2790 | 412.4 | 804 |
KY | 140 | 2645 | 349.3 | 523 |
NC | 155 | 2664 | 354.1 | 450 |
OH | 172 | 3509 | 354.5 | 753 |
AK | 372 | 4146 | 439.7 | 484 |
IO | 234 | 3265 | 343.8 | 572 |
OR | 233 | 3317 | 332.7 | 671 |
DC | 246 | 4425 | 352.1 | 1000 |
ND | 177 | 2730 | 369.1 | 443 |
SC | 149 | 2380 | 376.7 | 476 |
WV | 149 | 2470 | 328.8 | 390 |
OK | 135 | 2880 | 329.8 | 680 |
LA | 162 | 2634 | 389.6 | 661 |
KA | 196 | 3303 | 339.9 | 661 |
MS | 130 | 2081 | 385.2 | 445 |
NH | 169 | 3259 | 345.9 | 564 |
AL | 112 | 2337 | 362.2 | 584 |
WI | 209 | 3363 | 360.7 | 659 |
CT | 193 | 4256 | 341.0 | 774 |
MD | 247 | 3742 | 364.1 | 766 |
GA | 156 | 2781 | 370.6 | 603 |
HI | 212 | 3513 | 382.9 | 831 |
PA | 201 | 3419 | 326.2 | 715 |
WY | 238 | 3190 | 365.6 | 605 |
IN | 194 | 3412 | 359.3 | 649 |
ID | 170 | 2668 | 367.7 | 541 |
MN | 262 | 3341 | 365.4 | 664 |
MO | 177 | 3257 | 336.1 | 701 |
Creating a model using the training dataset.
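The output below can be produced along these lines (the object name mod_train is ours):
mod_train <- lm(income ~ education + young + urban, data = train)
summary(mod_train)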
##
## Call:
## lm(formula = income ~ education + young + urban, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -464.4 -145.7 4.1 117.4 704.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3513.3056 641.5381 5.476 4.97e-06 ***
## education 8.4617 0.9685 8.737 5.54e-10 ***
## young -8.6524 1.8044 -4.795 3.60e-05 ***
## urban 1.7514 0.3010 5.819 1.84e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252 on 32 degrees of freedom
## Multiple R-squared: 0.8201, Adjusted R-squared: 0.8032
## F-statistic: 48.62 on 3 and 32 DF, p-value: 5.078e-12
Now, using the model that we trained from the training dataset, we predict the income in the test dataset.
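A call like the following produces the predictions shown below (it assumes the mod_train object from the sketch above; the name pred_income is ours):
pred_income <- predict(mod_train, newdata = test)
pred_income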
## ME RI NY NJ IL MI SD NE
## 2967.911 3731.721 4398.633 3995.569 3547.447 3582.986 2686.657 2815.307
## VA FL TN TX MT NV CA
## 3087.282 3632.199 2736.372 3024.579 3270.598 3502.072 4400.917
We now compare the predicted values and the actual values in the test dataset.
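A comparison data frame like the one shown below can be assembled as follows (continuing the sketch above; the object name test_prediction matches the one used in the SST/SSE computation later in this section):
test_prediction <- data.frame(actual.income    = test$income,
                              predicted.income = pred_income)
test_prediction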
  | actual.income | predicted.income |
---|---|---|
ME | 2824 | 2967.911 |
RI | 3549 | 3731.721 |
NY | 4151 | 4398.633 |
NJ | 3954 | 3995.569 |
IL | 3981 | 3547.447 |
MI | 3675 | 3582.986 |
SD | 2876 | 2686.657 |
NE | 3239 | 2815.307 |
VA | 3068 | 3087.282 |
FL | 3191 | 3632.199 |
TN | 2579 | 2736.372 |
TX | 3029 | 3024.579 |
MT | 2942 | 3270.598 |
NV | 3957 | 3502.072 |
CA | 3968 | 4400.917 |
We can visualize the relationship of the actual and predicted values using a scatterplot, with a \(y=x\) line.
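One way to draw this plot, assuming the test_prediction data frame from the sketch above:
library(ggplot2)
ggplot(test_prediction, aes(x = actual.income, y = predicted.income)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +   # the y = x reference line
  theme_bw()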
Through visual inspection, the model created using the training dataset seems to predict the income well, since the predicted values and the actual values are close to each other.
Remarks:
- After obtaining predicted values on the test dataset, we compare them against the actual values in the test dataset.
- Some metrics that can be used to compare the predicted vs actual values are the \(RMSE\), \(MAE\), and \(MAPE\), among others. Computations of these metrics are available in the R package Metrics.
- These metrics will not be discussed further in Stat 136, but you can explore them on your own.
- You can also calculate the SST and SSE using the test set and the predicted values.
# SST of the actual income in the test set
SST_income <- sum((test_prediction$actual.income - mean(test_prediction$actual.income))^2)
# SSE
SSE_income <- sum((test_prediction$predicted.income - test_prediction$actual.income)^2)
1- SSE_income/SST_income
## [1] 0.6660822
About 66.61% of the variation of income in the test set can be explained by the model that was estimated using the train set.
Prediction Error Sum of Squares (PRESS)
Definition 5.5 (Prediction Error Sum of Squares) The PRESS statistic involves regressing using all observations but the \(i^{th}\) observation to obtain a predictor of \(Y_i\) and forming the sum of squared differences between the observed and predicted values for \(i = 1, 2, . . . , n\).
1. Delete the \(i^{th}\) observation (its values on the independent and dependent variables).
2. Fit the proposed regression model to the remaining \((n − 1)\) observations to obtain \(\hat{\boldsymbol{\beta}}_{(i)}\), the estimate of \(\boldsymbol{\beta}\) when observation \(i\) is removed from the dataset.
3. Use the fitted model to predict \(Y_i\) by \(\hat{Y}_{(i)}=\textbf{x}_{i}'\hat{\boldsymbol{\beta}}_{(i)}\), where \(\textbf{x}_i\) is the vector of values of the independent variables for observation \(i\).
4. Obtain the predictive discrepancy \((Y_i − \hat{Y}_{(i)})\).
5. Repeat steps 1 to 4 for \(i=1,...,n\).
6. Calculate the sum of squares of the predictive discrepancies: \[ PRESS=\sum_{i=1}^n(Y_i-\hat{Y}_{(i)})^2 \]
Notes:
The PRESS statistic is a leave-one-out refitting and prediction method, as described in Allen (1971). It is related to the jackknife approach to model validation.
The PRESS statistic tells you how well the model will predict new data that is not included in creating the model.
The smaller the value of PRESS, the better.
PRESS has the advantage of providing a lot of detailed information about the stability of various fitted models over the data space.
Disadvantages
A major disadvantage is the enormous amount of computation required.
The range of values of PRESS depends on the scaling and range of values of the dependent variable, which sometimes makes it hard to interpret on its own. You need to compare it to the PRESS of other models for it to be interpretable.
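For models fit by ordinary least squares, the leave-one-out predictions do not actually require refitting the model \(n\) times: the deleted residual equals \(e_i/(1-h_{ii})\), where \(h_{ii}\) is the \(i^{th}\) hat (leverage) value. A minimal sketch using the mod object fitted earlier to the full Anscombe data:
# PRESS via the deleted-residual identity e_i / (1 - h_ii)
sum((resid(mod) / (1 - hatvalues(mod)))^2)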
Illustration in R
The ols_press() function in the olsrr package calculates the PRESS statistic for a model, but not for all possible regressions. We can simply calculate it manually for each candidate regression model.
## [1] 17392465
## [1] 4793604
## [1] 4194578
Predicted \(R^2\)
PRESS can also be used to calculate the predicted \(R^2\) (denoted by \(R_{pred}^2\)), which is generally more intuitive to interpret than PRESS itself.
Definition 5.6 (Predicted R-squared) The predicted \(R^2\) is defined as
\[ R^2_{pred}=1-\frac{PRESS}{SST} \]
Like the coefficient of determination, the predicted \(R^2\) is at most 1; unlike \(R^2\), however, it can be negative when PRESS exceeds SST.
The predicted \(R^2\) is a useful way in validating predictive ability of the model without splitting the data into training and test sets.
You can use the ols_pred_rsq() function from the olsrr package to compute the \(R^2_{pred}\) of a model.
## [1] 0.7325135
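The same quantity can also be computed directly from Definition 5.6, using the PRESS identity introduced earlier and the total sum of squares of income; for the full model mod, this should agree with the ols_pred_rsq() value shown above:
PRESS <- sum((resid(mod) / (1 - hatvalues(mod)))^2)
SST   <- sum((Anscombe$income - mean(Anscombe$income))^2)
1 - PRESS / SST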