CHAPTER 7 Variable Selection Procedures
In a regression problem, there are many possible models, and one should use logic and sound judgment to arrive at the most useful one.
To quote the British statistician George Box:
“All models are wrong, but some are useful.”
In this Chapter, Sections 7.1 and 7.2 discuss variable selection procedures, while Section 4.6 shows a possible preliminary step before fitting models that improves the interpretability of coefficients.
Suppose we want to establish a linear regression equation for a particular dependent variable \(Y\) in terms of the independent variables \(X_1, X_2, . . . , X_k\) (or some transformations of the independent variables).
Two opposing criteria of selecting a resultant equation are usually involved:
- To make the equation useful for predictive purposes we should want our model to include as many regressors as possible so that reliable fitted values can be determined.
- Because of the costs involved in obtaining information on a large number of regressor variables and subsequently monitoring them, we should like the equation to include as few regressors as possible.
The compromise between these is what is usually called “selection of independent variables”.
Some of the independent variables can be screened out for the following reasons:
- they may not be fundamental to the problem
- they are subject to large errors, and
- they effectively duplicate another variable in the list.
In the following procedures, we will use R functions from the olsrr package.
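A minimal setup sketch, assuming olsrr is installed from CRAN:

# install.packages("olsrr")   # install once, if not yet installed
library(olsrr)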
7.1 All-Possible-Regression Approach (APR)
If you have a dataset with \(k\) candidate independent variables \(X_1,X_2,...,X_k\), how many possible linear regression models can be formed (excluding the intercept-only model)?
\[\begin{align} \text{1 independent variable} &: y=\beta_0+\beta_1x_1, \quad y=\beta_0+\beta_2x_2,\quad...\quad, y=\beta_0+\beta_kx_k\\ \text{2 independent variables}&:y=\beta_0 + \beta_1x_1+\beta_2x_2, \quad ...\quad, y=\beta_0+\beta_{k-1}x_{k-1}+\beta_kx_k\\ \vdots &\\ \text{k independent variables}&:y = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k \end{align}\]
Answer:
There are \(2^k-1\) possible combinations.
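As a quick check of this count, a small sketch in R (the value of k here is arbitrary):

# Number of candidate models for k = 3 predictors (excluding the intercept-only model)
k <- 3
sum(choose(k, 1:k))   # 3 + 3 + 1 = 7
2^k - 1               # same count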
If we are selecting the “best” regression from all possible regressions, we need criteria for what constitutes the “best” model.
The possible criteria for comparing and selecting regression models with the all-possible-regression selection procedure are \(R_p^2\), \(MSE_p\) or \(R_a^2\), and \(C_p\), where \(p\) is the number of parameters in the model.
Definition 7.1 (R-squared Criterion) The \(R_p^2\) criterion calls for an examination of the coefficient of multiple determination \(R^2\) in order to select one or several subsets of the independent variables.
The models with the highest \(R^2\) for each number of independent variables are identified, and the best model among these is chosen based on parsimony.
The intent here is to find the point where adding more independent variables is not worthwhile because it leads to a very small increase in \(R_p^2\).
However, since \(R^2\) never decreases as independent variables are added, and \(\max (R_p^2)\) is always attained by the full model, this criterion is biased towards models with more independent variables. It is suggested to use the adjusted \(R^2\) instead.
Definition 7.2 (Adjusted R-squared or MSE Criterion) The \(R_a^2\) criterion (or the \(MSE_p\) criterion) uses the adjusted coefficient of multiple determination in selecting the model.
The \(R_a^2\) takes into account the number of parameters in the model through the degrees of freedom, hence penalizes models with more independent variables.
Note that \(R_a^2\) increases if and only if the mean square error decreases. Hence, \(R_a^2\) and \(MSE\) are equivalent criteria.
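This equivalence follows from writing the adjusted coefficient of determination in terms of the mean square error, with \(p\) parameters in the model:

\[
R_a^2 = 1-\frac{SSE/(n-p)}{SST/(n-1)} = 1-\frac{MSE}{SST/(n-1)}
\]

Since \(SST/(n-1)\) is fixed for a given dataset, \(R_a^2\) increases exactly when \(MSE\) decreases.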
Definition 7.3 (Mallows' Cp Criterion) Mallows’ \(C_p\) statistic is an estimator of the measure \(\Gamma_p\), which is the size of the bias that is introduced into the predicted responses by having an underspecified model.
This appeals nicely to common sense and is developed from considerations of the proper compromise between excessive bias incurred when one underfits (chooses too few model terms) and excessive prediction variance produced when one overfits (has redundancies in the model).
The \(C_p\) for a particular subset model is an estimator of the following quantity:
\[\begin{align} \Gamma_p&=\frac{1}{\sigma^2}\left\{\sum_{i=1}^n[E(\hat{Y}_i)-E(Y_i)]^2 + \sum_{i=1}^nVar(\hat{Y}_i)\right\}\\ &=\frac{1}{\sigma^2}\left\{\sum_{i=1}^n[BIAS(\hat{Y}_i)]^2 + \sum_{i=1}^nVar(\hat{Y}_i)\right\} \end{align}\]
\(\Gamma_p\) is called the standardized measure of the total variation in the predicted responses.
We can see here that if the bias is 0, \(\Gamma_p\) attains its smallest possible value, which is \(p\), the number of parameters in the model, since \(\sum_{i=1}^n Var(\hat{Y}_i)=p\sigma^2\).
An estimator of \(\Gamma_p\) is \(C_p\), which is defined as follows: \[\begin{align} C_p=\frac{SSE_p}{MSE(X_1,X_2,...,X_k)}-(n-2p)=\frac{SSE_R}{MSE_F}-(n-2p) \end{align}\]
In using this criterion, look for models where the \(C_p\) value is small and near \(p\).
Notes:
Since the computation of the \(C_p\) criterion involves the \(MSE\) of the model containing all potential independent variables, this metric requires careful development of the set of potential independent variables. Some suggestions:
- express independent variables in appropriate form (linear, quadratic, etc.)
- exclude clearly useless variables so that \(MSE(X_1, X_2, ..., X_k)\), the MSE of the full model, provides an unbiased estimate of the error variance \(\sigma^2\).
The model containing all candidate independent variables always has \(C_p = p\), since \(SSE(R)=SSE(F)\):
\[ \begin{align} \frac{SSE(R)}{MSE(F)}-(n-2p)&=\frac{SSE(F)}{SSE(F)/(n-p)}-n+2p\\ &=n-p-n+2p\\ &=p \end{align} \]
Limitation: we cannot use this to evaluate the model containing all candidate independent variables.
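As a sketch of how \(C_p\) can be computed from the formula above, using the mtcars data (introduced in Section 7.2) purely for illustration:

# Mallows' Cp for one candidate subset model, computed by hand
full_fit <- lm(mpg ~ ., data = mtcars)          # model with all candidate predictors
sub_fit  <- lm(mpg ~ wt + cyl, data = mtcars)   # one candidate subset model
n  <- nrow(mtcars)
p  <- length(coef(sub_fit))                     # parameters in the subset model (p = 3)
SSE_R <- sum(residuals(sub_fit)^2)              # SSE of the reduced (subset) model
MSE_F <- sum(residuals(full_fit)^2) / df.residual(full_fit)   # MSE of the full model
Cp <- SSE_R / MSE_F - (n - 2 * p)
Cp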
Definition 7.4 The Akaike Information Criterion (AIC) is an information criterion that uses the number of parameters \(p\) and the likelihood function \(L\) to balance the trade-off between the goodness of fit of the model and the simplicity of the model.
\[ AIC=2p-2\ln L(\hat{\boldsymbol{\beta}}_p\mid\textbf{Y},\textbf{X}_p) \]
- AIC rewards models with good fitting through the likelihood function.
- AIC penalizes models with a larger number of parameters \(p\) to prevent overfitting.
- Model with lowest \(AIC\) is preferred.
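A small illustration of how AIC can be computed from the log-likelihood, using the mtcars data (introduced in Section 7.2). Note that base R's AIC() counts the error variance as an additional parameter alongside the regression coefficients:

# Manual AIC versus base R's AIC() for a simple model
m <- lm(mpg ~ wt, data = mtcars)
p <- length(coef(m)) + 1           # coefficients plus the error variance
aic_manual <- 2 * p - 2 * as.numeric(logLik(m))
aic_manual                         # about 166.03
AIC(m)                             # same value

This agrees with the aic value reported for wt in the forward-selection output later in this chapter.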
Illustration in R:
Create an lm object containing all candidate independent variables. The ols_step_all_possible() function from the olsrr package shows the values of the different criteria for each of the \(2^k-1=7\) possible regression models. Access the $result element to obtain a data frame of the criterion values.
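A minimal sketch of this workflow, assuming a data frame df with a response y and the three candidate predictors shown in the table below (df and y are placeholder names):

# Fit the model containing all candidate predictors
apr_full <- lm(y ~ education + young + urban, data = df)

# Selection criteria for all 2^3 - 1 = 7 candidate models;
# $result returns the criteria as a data frame
ols_step_all_possible(apr_full)$result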
 | mindex | n | predictors | rsquare | adjr | rmse | predrsq | cp | aic | sbic | sbc | msep | fpe | apc | hsp
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
3 | 1 | 1 | urban | 0.4698527 | 0.4590334 | 403.7443 | 0.4192797 | 76.30922 | 762.8115 | 615.0464 | 768.6069 | 8653084 | 176316.32 | 0.5734246 | 3534.643 |
1 | 2 | 1 | education | 0.4456595 | 0.4343464 | 412.8539 | 0.3931971 | 81.93642 | 765.0873 | 617.2070 | 770.8828 | 9047966 | 184362.49 | 0.5995928 | 3695.946 |
2 | 3 | 1 | young | 0.0263608 | 0.0064906 | 547.1509 | -0.1091104 | 179.46291 | 793.8136 | 644.7821 | 799.6091 | 15891776 | 323812.81 | 1.0531200 | 6491.530 |
5 | 4 | 2 | education urban | 0.7247758 | 0.7133081 | 290.9052 | 0.6943138 | 19.01558 | 731.3775 | 585.3431 | 739.1048 | 4587799 | 95204.04 | 0.3096273 | 1913.084 |
4 | 5 | 2 | education young | 0.5975156 | 0.5807454 | 351.7893 | 0.5592806 | 48.61556 | 750.7610 | 602.8835 | 758.4883 | 6709139 | 139225.16 | 0.4527949 | 2797.669 |
6 | 6 | 2 | young urban | 0.4744752 | 0.4525784 | 401.9802 | 0.2945371 | 77.23405 | 764.3648 | 615.4573 | 772.0921 | 8760138 | 181786.61 | 0.5912154 | 3652.922 |
7 | 7 | 3 | education young urban | 0.7979314 | 0.7850334 | 249.2628 | 0.7325135 | 4.00000 | 717.6195 | 573.5542 | 727.2787 | 3441570 | 72707.61 | 0.2364633 | 1465.647 |
The complete list of criteria in the output of ols_step_all_possible() is as follows:
- rsquare - \(R^2\) of the model
- adjr - \(R_a^2\), the adjusted R-squared
- predrsq - predicted R-square
- cp - Mallows' \(C_p\)
- aic - Akaike Information Criterion
- sbic - Sawa Bayesian Information Criterion
- sbc - Schwarz Bayesian Information Criterion
- msep - estimated MSE of prediction, assuming multivariate normality
- fpe - Final Prediction Error
- apc - Amemiya Prediction Criterion
- hsp - Hocking's \(S_p\) Criterion
A major disadvantage of the all-possible-regression selection procedure is the amount of computation required: all \(2^k-1\) models must be fitted, and it may be difficult for an investigator to examine them all.
7.2 Automatic Search Procedures (ASP)
Automatic search procedures apply algorithms with explicit criteria for including or excluding variables. Three procedures are demonstrated here.
In the examples, we use the mtcars dataset to find predictors of mpg, the fuel efficiency of cars in miles per gallon.
 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
Forward Selection Procedure
Definition 7.5 (Forward Selection) Forward selection is a variable selection technique used to build a model by adding a predictor variable one at a time based on a specified criterion.
In forward selection, different criteria can be used to determine which variable should be added at each step. The following is an example algorithm that relies on the concept of the general linear test and sequential F-tests from the previous Chapter.
1. Specify a significance level for entering, say \(\alpha_E\).
2. Perform a simple linear regression involving the potential variable most highly correlated with \(Y\). If the overall F-value is insignificant, the procedure terminates and we adopt the equation \(\hat{Y} = \bar{Y}\) as best; otherwise, proceed.
3. The partial F-value (or F-to-enter) is calculated for every variable not yet in the model, treated as though it were the last to enter the regression equation.
4. (selection condition) The highest partial F-value, say \(F_H\), is compared with \(F_{\alpha_E}\), the critical value given \(\alpha_E\).
   - If \(F_H>F_{\alpha_E}\) (or p-value < \(\alpha_E\)), add the variable \(x_H\), which gave rise to \(F_H\), to the equation and recompute the regression equation. Go back to step 3.
   - If \(F_H\leq F_{\alpha_E}\) (or p-value \(\geq \alpha_E\)), stop. Adopt the regression equation as calculated.
Illustration of Forward Selection in R
As an initial inspection (though not required), we check which variable has the highest correlation with mpg.
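One way to obtain the correlation matrix shown below:

# Pairwise correlations among all variables in mtcars, rounded to 4 decimals
round(cor(mtcars), 4)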
 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
mpg | 1.0000 | -0.8522 | -0.8476 | -0.7762 | 0.6812 | -0.8677 | 0.4187 | 0.6640 | 0.5998 | 0.4803 | -0.5509 |
cyl | -0.8522 | 1.0000 | 0.9020 | 0.8324 | -0.6999 | 0.7825 | -0.5912 | -0.8108 | -0.5226 | -0.4927 | 0.5270 |
disp | -0.8476 | 0.9020 | 1.0000 | 0.7909 | -0.7102 | 0.8880 | -0.4337 | -0.7104 | -0.5912 | -0.5556 | 0.3950 |
hp | -0.7762 | 0.8324 | 0.7909 | 1.0000 | -0.4488 | 0.6587 | -0.7082 | -0.7231 | -0.2432 | -0.1257 | 0.7498 |
drat | 0.6812 | -0.6999 | -0.7102 | -0.4488 | 1.0000 | -0.7124 | 0.0912 | 0.4403 | 0.7127 | 0.6996 | -0.0908 |
wt | -0.8677 | 0.7825 | 0.8880 | 0.6587 | -0.7124 | 1.0000 | -0.1747 | -0.5549 | -0.6925 | -0.5833 | 0.4276 |
qsec | 0.4187 | -0.5912 | -0.4337 | -0.7082 | 0.0912 | -0.1747 | 1.0000 | 0.7445 | -0.2299 | -0.2127 | -0.6562 |
vs | 0.6640 | -0.8108 | -0.7104 | -0.7231 | 0.4403 | -0.5549 | 0.7445 | 1.0000 | 0.1683 | 0.2060 | -0.5696 |
am | 0.5998 | -0.5226 | -0.5912 | -0.2432 | 0.7127 | -0.6925 | -0.2299 | 0.1683 | 1.0000 | 0.7941 | 0.0575 |
gear | 0.4803 | -0.4927 | -0.5556 | -0.1257 | 0.6996 | -0.5833 | -0.2127 | 0.2060 | 0.7941 | 1.0000 | 0.2741 |
carb | -0.5509 | 0.5270 | 0.3950 | 0.7498 | -0.0908 | 0.4276 | -0.6562 | -0.5696 | 0.0575 | 0.2741 | 1.0000 |
Among the predictors, wt has the highest correlation (in absolute value) with mpg, followed by cyl. In the forward selection procedure, we therefore expect the algorithm to select wt in the first step.
# Define the full model (with all predictor variables)
full_model <- lm(mpg ~ ., data = mtcars)
# Perform forward selection using F test at alpha = 0.05 as criterion
forward_model <- ols_step_forward_p(full_model, p_val = 0.05,
progress=TRUE, details = TRUE)
## Forward Selection Method
## ------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
## 7. vs
## 8. am
## 9. gear
## 10. carb
##
##
## Step => 0
## Model => mpg ~ 1
## R2 => 0
##
## Initiating stepwise selection...
##
## Selection Metrics Table
## ---------------------------------------------------------------
## Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
## ---------------------------------------------------------------
## wt 0.00000 0.753 0.745 166.029
## cyl 0.00000 0.726 0.717 169.306
## disp 0.00000 0.718 0.709 170.209
## hp 0.00000 0.602 0.589 181.239
## drat 2e-05 0.464 0.446 190.800
## vs 3e-05 0.441 0.422 192.147
## am 0.00029 0.360 0.338 196.484
## carb 0.00108 0.304 0.280 199.181
## gear 0.00540 0.231 0.205 202.364
## qsec 0.01708 0.175 0.148 204.588
## ---------------------------------------------------------------
##
## Step => 1
## Selected => wt
## Model => mpg ~ wt
## R2 => 0.753
##
## Selection Metrics Table
## ---------------------------------------------------------------
## Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
## ---------------------------------------------------------------
## cyl 0.00106 0.830 0.819 156.010
## hp 0.00145 0.827 0.815 156.652
## qsec 0.00150 0.826 0.814 156.720
## vs 0.01293 0.801 0.787 161.095
## carb 0.02565 0.792 0.778 162.440
## disp 0.06362 0.781 0.766 164.168
## drat 0.33085 0.761 0.744 166.968
## gear 0.73267 0.754 0.737 167.898
## am 0.98791 0.753 0.736 168.029
## ---------------------------------------------------------------
##
## Step => 2
## Selected => cyl
## Model => mpg ~ wt + cyl
## R2 => 0.83
##
## Selection Metrics Table
## ---------------------------------------------------------------
## Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
## ---------------------------------------------------------------
## hp 0.14002 0.843 0.826 155.477
## carb 0.15154 0.842 0.826 155.617
## qsec 0.21106 0.840 0.822 156.190
## gear 0.50752 0.833 0.815 157.499
## disp 0.53322 0.833 0.815 157.558
## vs 0.74973 0.831 0.813 157.892
## am 0.89334 0.830 0.812 157.989
## drat 0.99032 0.830 0.812 158.010
## ---------------------------------------------------------------
##
##
## No more variables to be added.
##
## Variables Selected:
##
## => wt
## => cyl
When using the ols_step_forward_p function, selection is based on the t-test of the parameter of a candidate independent variable, given that the variables selected in previous steps are already in the model. Recall that this t-test is equivalent to the partial F-test of the general linear test when a single variable is added.
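The final fitted model can then be examined with summary(); the sketch below assumes the final lm object is stored in the model component of the returned object, as in recent versions of olsrr:

# Inspect the coefficients of the model selected by forward selection
summary(forward_model$model)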
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2893 -1.5512 -0.4684 1.5743 6.1004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
## wt -3.1910 0.7569 -4.216 0.000222 ***
## cyl -1.5078 0.4147 -3.636 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
## F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
After wt and cyl enter the model, the partial F-test (general linear test) for adding any single remaining variable is no longer significant at \(\alpha_E = 0.05\). Hence, the forward selection algorithm stops.
Therefore, the final selected model using forward selection procedure is:
\[ \widehat{mpg}=39.686-3.191(wt) - 1.508(cyl) \]
Backward Elimination Procedure
Definition 7.6 (Backward Elimination) Backward elimination builds a regression model from a set of candidate predictor variables by removing predictors one at a time based on a criterion, in a stepwise manner, until no more variables can be removed.
The following is an example algorithm that relies on the concept of the general linear test and partial F-tests from the previous Chapter.
1. Specify a significance level for staying, say \(\alpha_S\).
2. Compute a regression equation containing all potential independent variables. If the overall F-value is insignificant at \(\alpha_S\), the procedure terminates and we adopt the equation \(\hat{Y} = \bar{Y}\) as best; otherwise, proceed.
3. The partial F-value (called F-to-remove or F-to-stay) is calculated for every variable, treated as though it were the last variable to enter the regression equation.
4. (elimination condition) The lowest partial F-value, say \(F_L\), is compared with \(F_{\alpha_S}\), the critical value given \(\alpha_S\).
   - If \(F_L\leq F_{\alpha_S}\), delete the variable \(x_L\), which gave rise to \(F_L\), from the model and recompute the regression equation with the remaining variables. Go back to step 3.
   - If \(F_L > F_{\alpha_S}\), stop. Adopt the regression equation as calculated.
Illustration of Backward Elimination in R:
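The output below can be produced by a call along these lines, mirroring the forward-selection example (argument names may differ slightly across olsrr versions):

# Perform backward elimination using the partial F-test (p-value) criterion
backward_model <- ols_step_backward_p(full_model, p_val = 0.05,
                                      progress = TRUE, details = TRUE)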
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
## 7. vs
## 8. am
## 9. gear
## 10. carb
##
##
## Step => 0
## Model => mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## R2 => 0.869
##
## Initiating stepwise selection...
##
## Step => 1
## Removed => cyl
## Model => mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## R2 => 0.86894
##
## Step => 2
## Removed => vs
## Model => mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## R2 => 0.86871
##
## Step => 3
## Removed => carb
## Model => mpg ~ disp + hp + drat + wt + qsec + am + gear
## R2 => 0.8681
##
## Step => 4
## Removed => gear
## Model => mpg ~ disp + hp + drat + wt + qsec + am
## R2 => 0.86671
##
## Step => 5
## Removed => drat
## Model => mpg ~ disp + hp + wt + qsec + am
## R2 => 0.86374
##
## Step => 6
## Removed => disp
## Model => mpg ~ hp + wt + qsec + am
## R2 => 0.85785
##
## Step => 7
## Removed => hp
## Model => mpg ~ wt + qsec + am
## R2 => 0.84966
##
##
## No more variables to be removed.
##
## Variables Removed:
##
## => cyl
## => vs
## => carb
## => gear
## => drat
## => disp
## => hp
From the 10 potential regressors, 7 were removed.
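As before, the fitted final model can be inspected with summary(), assuming it is stored in the model component of the returned object:

# Summary of the model retained after backward elimination
summary(backward_model$model)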
##
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Using backward elimination procedure, the selected model is: \[ \widehat{mpg}=9.618 - 3.917(wt) + 1.226(qsec) + 2.936(am) \]
Stepwise Selection Procedure
Definition 7.7 (Stepwise Selection) The stepwise selection procedure builds a regression model from a set of candidate predictor variables by entering and removing predictors based on a criterion, in a stepwise manner, until no more variables can be entered or removed.
There are times when a variable \(X_a\) already in the model becomes a weak predictor of \(Y\) once another variable \(X_b\) is added; stepwise selection can then remove \(X_a\).
The following is a stepwise selection algorithm that involves both forward and backward selection procedures.
1. Specify the significance level for entry, \(\alpha_E\), and for staying, \(\alpha_S\).
2. Compute a first-order simple linear regression equation involving the potential variable that explains the largest proportion of variation in \(Y\) (i.e., is most highly correlated with the response). If the overall F-value is not significant, the procedure terminates and we adopt the equation \(\hat{Y}=\bar{Y}\) as best; otherwise, proceed.
3. (entry condition) If there are no other candidates for entry into the model, the procedure terminates and we adopt the regression equation as calculated. Otherwise, the partial F-values (or F-to-enter) for all variables not in the regression at this stage are calculated. The highest F-to-enter, \(F_H\), is compared with \(F_{\alpha_E}\).
   - If \(F_H > F_{\alpha_E}\), add the variable \(x_H\), which gave rise to \(F_H\), to the model. Go to the next step.
   - If \(F_H \leq F_{\alpha_E}\), stop. Adopt the regression as calculated.
4. (elimination condition) Recompute the regression equation in the potential variables included in the model at this stage. The overall regression is checked for significance, the changes in \(R^2\), \(MSE\), and \(C_p\) values are noted, and the partial F-values (or F-to-stay) for all variables now in the equation are examined. The lowest F-to-stay, \(F_L\), is compared with \(F_{\alpha_S}\).
   - If \(F_L \leq F_{\alpha_S}\), delete the variable \(x_L\), which gave rise to \(F_L\), from the model. Reenter step 4.
   - If \(F_L > F_{\alpha_S}\), go back to step 3.
Illustration of Stepwise Selection in R:
stepwise_1 <- ols_step_both_p(full_model,
p_enter = 0.05, p_remove = 0.05,
progress = TRUE, details=TRUE)
## Stepwise Selection Method
## -------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
## 7. vs
## 8. am
## 9. gear
## 10. carb
##
##
## Step => 0
## Model => mpg ~ 1
## R2 => 0
##
## Initiating stepwise selection...
##
## Step => 1
## Selected => wt
## Model => mpg ~ wt
## R2 => 0.753
##
## Step => 2
## Selected => cyl
## Model => mpg ~ wt + cyl
## R2 => 0.83
##
##
## No more variables to be added or removed.
In this example, no variable was removed after wt and cyl were included. Hence, the final model is the same as the result of the forward selection procedure.
You can also force a variable to be included, regardless of the selection criterion. For example, suppose we want to keep drat in the model.
stepwise_2 <- ols_step_both_p(full_model, include = "drat",
p_enter = 0.05, p_remove = 0.05,
progress = TRUE, details = TRUE)
## Stepwise Selection Method
## -------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
## 7. vs
## 8. am
## 9. gear
## 10. carb
##
##
## Step => 0
## Model => mpg ~ drat
## R2 => 0.464
##
## Initiating stepwise selection...
##
## Step => 1
## Selected => wt
## Model => mpg ~ drat + wt
## R2 => 0.761
##
## Step => 2
## Selected => qsec
## Model => mpg ~ drat + wt + qsec
## R2 => 0.837
##
##
## No more variables to be added or removed.
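The summary below can be obtained in the same way, assuming the final lm object is stored in the model component of the returned object:

# Summary of the stepwise model with drat forced into the equation
summary(stepwise_2$model)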
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1152 -1.8273 -0.2696 1.0502 5.5010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3945 8.0689 1.412 0.16892
## drat 1.6561 1.2269 1.350 0.18789
## wt -4.3978 0.6781 -6.485 5.01e-07 ***
## qsec 0.9462 0.2616 3.616 0.00116 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.56 on 28 degrees of freedom
## Multiple R-squared: 0.837, Adjusted R-squared: 0.8196
## F-statistic: 47.93 on 3 and 28 DF, p-value: 3.723e-11
Note:
- The order in which the independent variables enter the equation in the stepwise selection procedure does not reflect the importance of the independent variables in the model.
- \(\alpha_S \geq \alpha_E\), since the criterion for adding a new variable should be stricter than the criterion for letting a variable stay.
Notes on Variable Selection
Which variable selection procedure should I use?
It depends on the aspect of the model that you want to focus on or improve.
- Prediction Accuracy - R-squared, Adjusted R-Squared
- Precise estimates of the fitted values - Mallows' \(C_p\), PRESS procedure
- Final model includes significant coefficients - automatic search procedures using partial F-test as criterion