CHAPTER 7 Variable Selection Procedures
In a regression problem, there are many possible models, and one should use logic and sound judgment to arrive at the most useful one.
To quote the British statistician George Box:
“All models are wrong, but some are useful.”
In this Chapter, Sections 7.1 and 7.2 discuss variable selection procedures, while Section 4.6 shows a possible preliminary step before fitting models that improves the interpretability of coefficients.
Suppose we want to establish a linear regression equation for a particular dependent variable \(Y\) in terms of the independent variables \(X_1, X_2, . . . , X_k\) (or some transformations of the independent variables).
Two opposing criteria of selecting a resultant equation are usually involved:
- To make the equation useful for predictive purposes we should want our model to include as many regressors as possible so that reliable fitted values can be determined.
- Because of the costs involved in obtaining information on a large number of regressor variables and subsequently monitoring them, we should like the equation to include as few regressors as possible.
The compromise between these is what is usually called “selection of independent variables”.
Some of the independent variables can be screened out for the following reasons:
- they may not be fundamental to the problem
- they are subject to large errors, and
- they effectively duplicate another variable in the list.
In the following procedures, we will use R functions from the olsrr package.
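A minimal setup sketch, assuming olsrr is installed from CRAN:

# install.packages("olsrr")   # install once, if not yet installed
library(olsrr)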
7.1 All-Possible-Regression Approach (APR)
If you have a dataset with \(k\) candidate independent variables \(X_1,X_2,...,X_k\), how many possible linear regression models can be formed (excluding the intercept-only model)?
\[\begin{align} \text{1 independent variable} &: y=\beta_0+\beta_1x_1, \quad y=\beta_0+\beta_2x_2,\quad...\quad, y=\beta_0+\beta_kx_k\\ \text{2 independent variables}&:y=\beta_0 + \beta_1x_1+\beta_2x_2, \quad ...\quad, y=\beta_0+\beta_{k-1}x_{k-1}+\beta_kx_k\\ \vdots &\\ \text{k independent variables}&:y = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k \end{align}\]
Answer:
There are \(2^k-1\) possible combinations.
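As a quick check of this count, a small sketch in R (the value of k here is arbitrary):

# Number of candidate models for k = 3 predictors (excluding the intercept-only model)
k <- 3
sum(choose(k, 1:k))   # 3 + 3 + 1 = 7
2^k - 1               # same count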
If we are selecting the “best” regression from all possible regressions, we need criteria for what constitutes the “best” model.
The possible criteria for comparing and selecting regression models with the all-possible-regression selection procedure are \(R_p^2\), \(MSE_p\) or \(R_a^2\), and \(C_p\), where \(p\) is the number of parameters in the model.
Definition 7.1 (R-squared Criterion) The \(R_p^2\) criterion calls for an examination of the coefficient of multiple determination \(R^2\) in order to select one or several subsets of the independent variables.
The models with the highest \(R^2\) for each number of independent variables are identified, and the best model among these is chosen based on parsimony.
The intent here is to find the point where adding more independent variables is not worthwhile because it leads to a very small increase in \(R_p^2\).
However, since \(R^2\) never decreases as independent variables are added, and \(\max (R_p^2)\) is always attained by the full model, this criterion is biased towards models with more independent variables. It is suggested to use the adjusted \(R^2\) instead.
Definition 7.2 (Adjusted R-squared or MSE Criterion) The \(R_a^2\) criterion (or the \(MSE_p\) criterion) uses the adjusted coefficient of multiple determination in selecting the model.
The \(R_a^2\) takes into account the number of parameters in the model through the degrees of freedom, hence penalizes models with more independent variables.
Note that \(R_a^2\) increases if and only if the mean square error decreases. Hence, \(R_a^2\) and \(MSE\) are equivalent criteria.
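This equivalence follows from writing the adjusted coefficient of determination in terms of the mean square error, with \(p\) parameters in the model:

\[
R_a^2 = 1-\frac{SSE/(n-p)}{SST/(n-1)} = 1-\frac{MSE}{SST/(n-1)}
\]

Since \(SST/(n-1)\) is fixed for a given dataset, \(R_a^2\) increases exactly when \(MSE\) decreases.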
Definition 7.3 (Mallows' Cp Criterion) Mallows’ \(C_p\) statistic is an estimator of the measure \(\Gamma_p\), which is the size of the bias that is introduced into the predicted responses by having an underspecified model.
This appeals nicely to common sense and is developed from considerations of the proper compromise between excessive bias incurred when one underfits (chooses too few model terms) and excessive prediction variance produced when one overfits (has redundancies in the model).
The \(C_p\) for a particular subset model is an estimator of the following quantity:
\[\begin{align} \Gamma_p&=\frac{1}{\sigma^2}\left\{\sum_{i=1}^n[E(\hat{Y}_i)-E(Y_i)]^2 + \sum_{i=1}^nVar(\hat{Y}_i)\right\}\\ &=\frac{1}{\sigma^2}\left\{\sum_{i=1}^n[BIAS(\hat{Y}_i)]^2 + \sum_{i=1}^nVar(\hat{Y}_i)\right\} \end{align}\]
\(\Gamma_p\) is called the standardized measure of the total variation in the predicted responses.
We can see here that if the bias is 0, \(\Gamma_p\) attains its smallest possible value, which is \(p\), the number of parameters in the model, since \(\sum_{i=1}^n Var(\hat{Y}_i)=p\sigma^2\).
An estimator of \(\Gamma_p\) is \(C_p\), which is defined as follows: \[\begin{align} C_p=\frac{SSE_p}{MSE(X_1,X_2,...,X_k)}-(n-2p)=\frac{SSE_R}{MSE_F}-(n-2p) \end{align}\]
In using this criterion, look for models where the \(C_p\) value is small and near \(p\).
Notes:
Since the computation of the \(C_p\) criterion involves the \(MSE\) of the model containing all potential independent variables, this metric requires careful development of the set of potential independent variables. Some suggestions:
- express independent variables in appropriate form (linear, quadratic, etc.)
- exclude clearly useless variables so that \(MSE(X_1, X_2, ..., X_k)\), the MSE of the full model, provides an unbiased estimate of the error variance \(\sigma^2\).
The model containing all candidate independent variables always has \(C_p = p\), since \(SSE(R)=SSE(F)\):
\[ \begin{align} \frac{SSE(R)}{MSE(F)}-(n-2p)&=\frac{SSE(F)}{SSE(F)/(n-p)}-n+2p\\ &=n-p-n+2p\\ &=p \end{align} \]
Limitation: we cannot use this to evaluate the model containing all candidate independent variables.
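As a sketch of how \(C_p\) can be computed from the formula above, using the mtcars data (introduced in Section 7.2) purely for illustration:

# Mallows' Cp for one candidate subset model, computed by hand
full_fit <- lm(mpg ~ ., data = mtcars)          # model with all candidate predictors
sub_fit  <- lm(mpg ~ wt + cyl, data = mtcars)   # one candidate subset model
n  <- nrow(mtcars)
p  <- length(coef(sub_fit))                     # parameters in the subset model (p = 3)
SSE_R <- sum(residuals(sub_fit)^2)              # SSE of the reduced (subset) model
MSE_F <- sum(residuals(full_fit)^2) / df.residual(full_fit)   # MSE of the full model
Cp <- SSE_R / MSE_F - (n - 2 * p)
Cp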
Definition 7.4 The Akaike Information Criterion (AIC) is an information criterion that uses the number of parameters \(p\) and the likelihood function \(L\) to balance the trade-off between the goodness of fit of the model and the simplicity of the model.
\[ AIC=2p-2\ln L(\hat{\boldsymbol{\beta}}_p\mid\textbf{Y},\textbf{X}_p) \]
- AIC rewards models with good fitting through the likelihood function.
- AIC penalizes models with a larger number of parameters \(p\) to prevent overfitting.
- Model with lowest \(AIC\) is preferred.
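A small illustration of how AIC can be computed from the log-likelihood, using the mtcars data (introduced in Section 7.2). Note that base R's AIC() counts the error variance as an additional parameter alongside the regression coefficients:

# Manual AIC versus base R's AIC() for a simple model
m <- lm(mpg ~ wt, data = mtcars)
p <- length(coef(m)) + 1           # coefficients plus the error variance
aic_manual <- 2 * p - 2 * as.numeric(logLik(m))
aic_manual                         # about 166.03
AIC(m)                             # same value

This agrees with the aic value reported for wt in the forward-selection output later in this chapter.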
Illustration in R:
Create an lm object containing all candidate independent variables. The ols_step_all_possible() function from the olsrr package shows the values of the different criteria for each of the \(2^k-1=7\) possible regression models. Access the $result element to obtain a data frame of the criterion values.
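A minimal sketch of this workflow, assuming a data frame df with a response y and the three candidate predictors shown in the table below (df and y are placeholder names):

# Fit the model containing all candidate predictors
apr_full <- lm(y ~ education + young + urban, data = df)

# Selection criteria for all 2^3 - 1 = 7 candidate models;
# $result returns the criteria as a data frame
ols_step_all_possible(apr_full)$result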
 | mindex | n | predictors | rsquare | adjr | rmse | predrsq | cp | aic | sbic | sbc | msep | fpe | apc | hsp
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
3 | 1 | 1 | urban | 0.4698527 | 0.4590334 | 403.7443 | 0.4192797 | 76.30922 | 762.8115 | 615.0464 | 768.6069 | 8653084 | 176316.32 | 0.5734246 | 3534.643 |
1 | 2 | 1 | education | 0.4456595 | 0.4343464 | 412.8539 | 0.3931971 | 81.93642 | 765.0873 | 617.2070 | 770.8828 | 9047966 | 184362.49 | 0.5995928 | 3695.946 |
2 | 3 | 1 | young | 0.0263608 | 0.0064906 | 547.1509 | -0.1091104 | 179.46291 | 793.8136 | 644.7821 | 799.6091 | 15891776 | 323812.81 | 1.0531200 | 6491.530 |
5 | 4 | 2 | education urban | 0.7247758 | 0.7133081 | 290.9052 | 0.6943138 | 19.01558 | 731.3775 | 585.3431 | 739.1048 | 4587799 | 95204.04 | 0.3096273 | 1913.084 |
4 | 5 | 2 | education young | 0.5975156 | 0.5807454 | 351.7893 | 0.5592806 | 48.61556 | 750.7610 | 602.8835 | 758.4883 | 6709139 | 139225.16 | 0.4527949 | 2797.669 |
6 | 6 | 2 | young urban | 0.4744752 | 0.4525784 | 401.9802 | 0.2945371 | 77.23405 | 764.3648 | 615.4573 | 772.0921 | 8760138 | 181786.61 | 0.5912154 | 3652.922 |
7 | 7 | 3 | education young urban | 0.7979314 | 0.7850334 | 249.2628 | 0.7325135 | 4.00000 | 717.6195 | 573.5542 | 727.2787 | 3441570 | 72707.61 | 0.2364633 | 1465.647 |
The complete list of criteria in the output of ols_step_all_possible() is as follows:
- rsquare - \(R^2\) of the model
- adjr - \(R_a^2\), the adjusted R-squared
- predrsq - predicted R-square
- cp - Mallows' \(C_p\)
- aic - Akaike Information Criterion
- sbic - Sawa Bayesian Information Criterion
- sbc - Schwarz Bayesian Information Criterion
- msep - estimated MSE of prediction, assuming multivariate normality
- fpe - Final Prediction Error
- apc - Amemiya Prediction Criterion
- hsp - Hocking's \(S_p\) Criterion
A major disadvantage of the all-possible-regression selection procedure is the amount of computation required: all \(2^k-1\) models must be fitted, and it may be difficult for an investigator to examine them all.
7.2 Automatic Search Procedures (ASP)
Automatic search procedures apply algorithms with explicit criteria for including or excluding variables. Three procedures are demonstrated here.
In the examples, we use the mtcars dataset to find predictors of mpg, the fuel efficiency of cars in miles per gallon.
 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
Forward Selection Procedure
Definition 7.5 (Forward Selection) Forward selection is a variable selection technique used to build a model by adding a predictor variable one at a time based on a specified criterion.
In forward selection, different criteria can be used to determine which variable should be added at each step. The following is an example algorithm that relies on the concept of the general linear test and sequential F-tests from the previous Chapter.
1. Specify a significance level for entering, say \(\alpha_E\).
2. Perform a simple linear regression involving the potential variable most highly correlated with \(Y\). If the overall F-value is insignificant, the procedure terminates and we adopt the equation \(\hat{Y} = \bar{Y}\) as best; otherwise, proceed.
3. The partial F-value (or F-to-enter) is calculated for every variable not yet in the model, treated as though it were the last to enter the regression equation.
4. (selection condition) The highest partial F-value, say \(F_H\), is compared with \(F_{\alpha_E}\), the critical value given \(\alpha_E\).
   - If \(F_H>F_{\alpha_E}\) (or p-value < \(\alpha_E\)), add the variable \(x_H\), which gave rise to \(F_H\), to the equation and recompute the regression equation. Go back to step 3.
   - If \(F_H\leq F_{\alpha_E}\) (or p-value \(\geq \alpha_E\)), stop. Adopt the regression equation as calculated.
Illustration of Forward Selection in R
As an initial inspection (though not required), we check which variable has the highest correlation with mpg.
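One way to obtain the correlation matrix shown below:

# Pairwise correlations among all variables in mtcars, rounded to 4 decimals
round(cor(mtcars), 4)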
 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
mpg | 1.0000 | -0.8522 | -0.8476 | -0.7762 | 0.6812 | -0.8677 | 0.4187 | 0.6640 | 0.5998 | 0.4803 | -0.5509 |
cyl | -0.8522 | 1.0000 | 0.9020 | 0.8324 | -0.6999 | 0.7825 | -0.5912 | -0.8108 | -0.5226 | -0.4927 | 0.5270 |
disp | -0.8476 | 0.9020 | 1.0000 | 0.7909 | -0.7102 | 0.8880 | -0.4337 | -0.7104 | -0.5912 | -0.5556 | 0.3950 |
hp | -0.7762 | 0.8324 | 0.7909 | 1.0000 | -0.4488 | 0.6587 | -0.7082 | -0.7231 | -0.2432 | -0.1257 | 0.7498 |
drat | 0.6812 | -0.6999 | -0.7102 | -0.4488 | 1.0000 | -0.7124 | 0.0912 | 0.4403 | 0.7127 | 0.6996 | -0.0908 |
wt | -0.8677 | 0.7825 | 0.8880 | 0.6587 | -0.7124 | 1.0000 | -0.1747 | -0.5549 | -0.6925 | -0.5833 | 0.4276 |
qsec | 0.4187 | -0.5912 | -0.4337 | -0.7082 | 0.0912 | -0.1747 | 1.0000 | 0.7445 | -0.2299 | -0.2127 | -0.6562 |
vs | 0.6640 | -0.8108 | -0.7104 | -0.7231 | 0.4403 | -0.5549 | 0.7445 | 1.0000 | 0.1683 | 0.2060 | -0.5696 |
am | 0.5998 | -0.5226 | -0.5912 | -0.2432 | 0.7127 | -0.6925 | -0.2299 | 0.1683 | 1.0000 | 0.7941 | 0.0575 |
gear | 0.4803 | -0.4927 | -0.5556 | -0.1257 | 0.6996 | -0.5833 | -0.2127 | 0.2060 | 0.7941 | 1.0000 | 0.2741 |
carb | -0.5509 | 0.5270 | 0.3950 | 0.7498 | -0.0908 | 0.4276 | -0.6562 | -0.5696 | 0.0575 | 0.2741 | 1.0000 |
Among the predictors, wt has the highest correlation (in absolute value) with mpg, followed by cyl. In the forward selection procedure, we therefore expect the algorithm to select wt in the first step.
# Define the full model (with all predictor variables)
full_model <- lm(mpg ~ ., data = mtcars)
# Perform forward selection using F test at alpha = 0.05 as criterion
forward_model <- ols_step_forward_p(full_model, p_val = 0.05,
progress=TRUE, details = TRUE)
## Forward Selection Method
## ------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
## 7. vs
## 8. am
## 9. gear
## 10. carb
##
##
## Step => 0
## Model => mpg ~ 1
## R2 => 0
##
## Initiating stepwise selection...
##
## Selection Metrics Table
## ---------------------------------------------------------------
## Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
## ---------------------------------------------------------------
## wt 0.00000 0.753 0.745 166.029
## cyl 0.00000 0.726 0.717 169.306
## disp 0.00000 0.718 0.709 170.209
## hp 0.00000 0.602 0.589 181.239
## drat 2e-05 0.464 0.446 190.800
## vs 3e-05 0.441 0.422 192.147
## am 0.00029 0.360 0.338 196.484
## carb 0.00108 0.304 0.280 199.181
## gear 0.00540 0.231 0.205 202.364
## qsec 0.01708 0.175 0.148 204.588
## ---------------------------------------------------------------
##
## Step => 1
## Selected => wt
## Model => mpg ~ wt
## R2 => 0.753
##
## Selection Metrics Table
## ---------------------------------------------------------------
## Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
## ---------------------------------------------------------------
## cyl 0.00106 0.830 0.819 156.010
## hp 0.00145 0.827 0.815 156.652
## qsec 0.00150 0.826 0.814 156.720
## vs 0.01293 0.801 0.787 161.095
## carb 0.02565 0.792 0.778 162.440
## disp 0.06362 0.781 0.766 164.168
## drat 0.33085 0.761 0.744 166.968
## gear 0.73267 0.754 0.737 167.898
## am 0.98791 0.753 0.736 168.029
## ---------------------------------------------------------------
##
## Step => 2
## Selected => cyl
## Model => mpg ~ wt + cyl
## R2 => 0.83
##
## Selection Metrics Table
## ---------------------------------------------------------------
## Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
## ---------------------------------------------------------------
## hp 0.14002 0.843 0.826 155.477
## carb 0.15154 0.842 0.826 155.617
## qsec 0.21106 0.840 0.822 156.190
## gear 0.50752 0.833 0.815 157.499
## disp 0.53322 0.833 0.815 157.558
## vs 0.74973 0.831 0.813 157.892
## am 0.89334 0.830 0.812 157.989
## drat 0.99032 0.830 0.812 158.010
## ---------------------------------------------------------------
##
##
## No more variables to be added.
##
## Variables Selected:
##
## => wt
## => cyl
When using the ols_step_forward_p function, selection is based on the t-test of the parameter of a candidate independent variable, given that the variables selected in previous steps are already in the model. Recall that this t-test is equivalent to the partial F-test of the general linear test when a single variable is added.
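The final fitted model can then be examined with summary(); the sketch below assumes the final lm object is stored in the model component of the returned object, as in recent versions of olsrr:

# Inspect the coefficients of the model selected by forward selection
summary(forward_model$model)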
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2893 -1.5512 -0.4684 1.5743 6.1004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
## wt -3.1910 0.7569 -4.216 0.000222 ***
## cyl -1.5078 0.4147 -3.636 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
## F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
After wt and cyl enter the model, the partial F-test (general linear test) for adding any single remaining variable is no longer significant at \(\alpha_E = 0.05\). Hence, the forward selection algorithm stops.
Therefore, the final selected model using forward selection procedure is:
\[ \widehat{mpg}=39.686-3.191(wt) - 1.508(cyl) \]
Backward Elimination Procedure
Definition 7.6 (Backward Elimination) Backward elimination builds a regression model from a set of candidate predictor variables by removing predictors one at a time based on a criterion, in a stepwise manner, until no more variables can be removed.
The following is an example algorithm that relies on the concept of the general linear test and partial F-tests from the previous Chapter.
1. Specify a significance level for staying, say \(\alpha_S\).
2. Compute a regression equation containing all potential independent variables. If the overall F-value is insignificant at \(\alpha_S\), the procedure terminates and we adopt the equation \(\hat{Y} = \bar{Y}\) as best; otherwise, proceed.
3. The partial F-value (called F-to-remove or F-to-stay) is calculated for every variable, treated as though it were the last variable to enter the regression equation.
4. (elimination condition) The lowest partial F-value, say \(F_L\), is compared with \(F_{\alpha_S}\), the critical value given \(\alpha_S\).
   - If \(F_L\leq F_{\alpha_S}\), delete the variable \(x_L\), which gave rise to \(F_L\), from the model and recompute the regression equation with the remaining variables. Go back to step 3.
   - If \(F_L > F_{\alpha_S}\), stop. Adopt the regression equation as calculated.
Illustration of Backward Elimination in R:
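The output below can be produced by a call along these lines, mirroring the forward-selection example (argument names may differ slightly across olsrr versions):

# Perform backward elimination using the partial F-test (p-value) criterion
backward_model <- ols_step_backward_p(full_model, p_val = 0.05,
                                      progress = TRUE, details = TRUE)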
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
## 7. vs
## 8. am
## 9. gear
## 10. carb
##
##
## Step => 0
## Model => mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## R2 => 0.869
##
## Initiating stepwise selection...
##
## Step => 1
## Removed => cyl
## Model => mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## R2 => 0.86894
##
## Step => 2
## Removed => vs
## Model => mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## R2 => 0.86871
##
## Step => 3
## Removed => carb
## Model => mpg ~ disp + hp + drat + wt + qsec + am + gear
## R2 => 0.8681
##
## Step => 4
## Removed => gear
## Model => mpg ~ disp + hp + drat + wt + qsec + am
## R2 => 0.86671
##
## Step => 5
## Removed => drat
## Model => mpg ~ disp + hp + wt + qsec + am
## R2 => 0.86374
##
## Step => 6
## Removed => disp
## Model => mpg ~ hp + wt + qsec + am
## R2 => 0.85785
##
## Step => 7
## Removed => hp
## Model => mpg ~ wt + qsec + am
## R2 => 0.84966
##
##
## No more variables to be removed.
##
## Variables Removed:
##
## => cyl
## => vs
## => carb
## => gear
## => drat
## => disp
## => hp
From the 10 potential regressors, 7 were removed.
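As before, the fitted final model can be inspected with summary(), assuming it is stored in the model component of the returned object:

# Summary of the model retained after backward elimination
summary(backward_model$model)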
##
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Using backward elimination procedure, the selected model is: \[ \widehat{mpg}=9.618 - 3.917(wt) + 1.226(qsec) + 2.936(am) \]
Stepwise Selection Procedure
Definition 7.7 (Stepwise Selection) The stepwise selection procedure builds a regression model from a set of candidate predictor variables by entering and removing predictors based on a criterion, in a stepwise manner, until no more variables can be entered or removed.
There are times when a variable \(X_a\) already in the model becomes a weak predictor of \(Y\) once another variable \(X_b\) is added; stepwise selection can then remove \(X_a\).
The following is a stepwise selection algorithm that involves both forward and backward selection procedures.
1. Specify the significance level for entry, \(\alpha_E\), and for staying, \(\alpha_S\).
2. Compute a first-order simple linear regression equation involving the potential variable that explains the largest proportion of variation in \(Y\) (i.e., is most highly correlated with the response). If the overall F-value is not significant, the procedure terminates and we adopt the equation \(\hat{Y}=\bar{Y}\) as best; otherwise, proceed.
3. (entry condition) If there are no other candidates for entry into the model, the procedure terminates and we adopt the regression equation as calculated. Otherwise, the partial F-values (or F-to-enter) for all variables not in the regression at this stage are calculated. The highest F-to-enter, \(F_H\), is compared with \(F_{\alpha_E}\).
   - If \(F_H > F_{\alpha_E}\), add the variable \(x_H\), which gave rise to \(F_H\), to the model. Go to the next step.
   - If \(F_H \leq F_{\alpha_E}\), stop. Adopt the regression as calculated.
4. (elimination condition) Recompute the regression equation in the potential variables included in the model at this stage. The overall regression is checked for significance, the changes in \(R^2\), \(MSE\), and \(C_p\) values are noted, and the partial F-values (or F-to-stay) for all variables now in the equation are examined. The lowest F-to-stay, \(F_L\), is compared with \(F_{\alpha_S}\).
   - If \(F_L \leq F_{\alpha_S}\), delete the variable \(x_L\), which gave rise to \(F_L\), from the model. Reenter step 4.
   - If \(F_L > F_{\alpha_S}\), go back to step 3.
Illustration of Stepwise Selection in R:
stepwise_1 <- ols_step_both_p(full_model,
p_enter = 0.05, p_remove = 0.05,
progress = TRUE, details=TRUE)
## Stepwise Selection Method
## -------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
## 7. vs
## 8. am
## 9. gear
## 10. carb
##
##
## Step => 0
## Model => mpg ~ 1
## R2 => 0
##
## Initiating stepwise selection...
##
## Step => 1
## Selected => wt
## Model => mpg ~ wt
## R2 => 0.753
##
## Step => 2
## Selected => cyl
## Model => mpg ~ wt + cyl
## R2 => 0.83
##
##
## No more variables to be added or removed.
In this example, no variable was removed after wt and cyl were included. Hence, the final model is the same as the result of the forward selection procedure.
You can also force a variable to be included, regardless of the selection criterion. For example, suppose we want to keep drat in the model.
stepwise_2 <- ols_step_both_p(full_model, include = "drat",
p_enter = 0.05, p_remove = 0.05,
progress = TRUE, details = TRUE)
## Stepwise Selection Method
## -------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
## 7. vs
## 8. am
## 9. gear
## 10. carb
##
##
## Step => 0
## Model => mpg ~ drat
## R2 => 0.464
##
## Initiating stepwise selection...
##
## Step => 1
## Selected => wt
## Model => mpg ~ drat + wt
## R2 => 0.761
##
## Step => 2
## Selected => qsec
## Model => mpg ~ drat + wt + qsec
## R2 => 0.837
##
##
## No more variables to be added or removed.
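The summary below can be obtained in the same way, assuming the final lm object is stored in the model component of the returned object:

# Summary of the stepwise model with drat forced into the equation
summary(stepwise_2$model)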
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1152 -1.8273 -0.2696 1.0502 5.5010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3945 8.0689 1.412 0.16892
## drat 1.6561 1.2269 1.350 0.18789
## wt -4.3978 0.6781 -6.485 5.01e-07 ***
## qsec 0.9462 0.2616 3.616 0.00116 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.56 on 28 degrees of freedom
## Multiple R-squared: 0.837, Adjusted R-squared: 0.8196
## F-statistic: 47.93 on 3 and 28 DF, p-value: 3.723e-11
Note:
- The order in which the independent variables enter the equation in the stepwise selection procedure does not reflect the importance of the independent variables in the model.
- \(\alpha_S \geq \alpha_E\), since the criterion for adding a new variable should be stricter than the criterion for letting a variable stay.
Notes on Variable Selection
Which variable selection procedure should I use?
It depends on the aspect of the model that you want to focus on or improve.
- Prediction Accuracy - R-squared, Adjusted R-Squared
- Precise estimates of the fitted values - Mallows' \(C_p\), PRESS procedure
- Final model includes significant coefficients - automatic search procedures using partial F-test as criterion