CHAPTER 7 Variable Selection Procedures

In a regression problem, there are many possible models, and one must use logic and sound judgment to arrive at the most useful one.

To quote the British statistician George Box:

“All models are wrong, but some are useful.”

In this chapter, Sections 7.1 and 7.2 discuss variable selection procedures, while Section 4.6 shows a possible preliminary step before fitting models that improves the interpretability of coefficients.

Suppose we want to establish a linear regression equation for a particular dependent variable \(Y\) in terms of the independent variables \(X_1, X_2, . . . , X_k\) (or some transformations of the independent variables).

 

Two opposing criteria of selecting a resultant equation are usually involved:

  1. To make the equation useful for predictive purposes, we want the model to include as many regressors as possible so that reliable fitted values can be determined.
  2. Because of the costs involved in obtaining information on a large number of regressor variables and subsequently monitoring them, we should like the equation to include as few regressors as possible.

The compromise between these is what is usually called “selection of independent variables”.

 

Some of the independent variables can be screened out for the following reasons:

  1. they may not be fundamental to the problem,
  2. they may be subject to large errors, and
  3. they effectively duplicate another variable in the list.

In the following procedures, we will use R functions from the olsrr package.

library(olsrr)

7.1 All-Possible-Regression Approach (APR)

If you have a dataset with \(k\) possible independent variables \(X_1,X_2,...,X_k\), how many possible linear regression models can be made (excluding the intercept-only model)?

\[\begin{align} \text{1 independent variable} &: y=\beta_0+\beta_1x_1, \quad y=\beta_0+\beta_2x_2,\quad...\quad, y=\beta_0+\beta_kx_k\\ \text{2 independent variables}&:y=\beta_0 + \beta_1x_1+\beta_2x_2, \quad ...\quad, y=\beta_0+\beta_{k-1}x_{k-1}+\beta_kx_k\\ \vdots &\\ \text{k variables}&:y = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k \end{align}\]
Answer

There are \(2^k-1\) possible combinations.
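For example, with \(k = 3\) candidate predictors (as in the Anscombe illustration below), counting the subsets by size confirms this:

k <- 3
sum(choose(k, 1:k))   # 3 + 3 + 1 = 7 models, the same as 2^k - 1
2^k - 1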

If we are selecting the “best” regression from all possible regressions, we need criteria for what constitutes “best”.

The possible criteria for comparing and selecting regression models with the all-possible-regression selection procedure are \(R_p^2\), \(MSE_p\) or \(R_a^2\), and \(C_p\), where \(p\) is the number of parameters in the model.


Definition 7.1 (R-squared Criterion) The \(R_p^2\) criterion calls for an examination of the coefficient of multiple determination \(R^2\) in order to select one or several subsets of the independent variables.

  • The models with the highest \(R^2\) for each number of independent variables are selected. The best model among these is then chosen based on parsimony.

  • The intent here is to find the point where adding more independent variables is not worthwhile because it leads to a very small increase in \(R_p^2\).

  • However, since \(R^2\) is expected to increase as the number of independent variables increases, and \(\max (R_p^2)\) is always attained by the full model, this criterion is biased towards models with more independent variables. It is suggested to use the adjusted \(R^2\) instead.


Definition 7.2 (Adjusted R-squared or MSE Criterion) The \(R_a^2\) criterion (or the \(MSE_p\) criterion) uses the adjusted coefficient of multiple determination in selecting the model.

  • The \(R_a^2\) takes into account the number of parameters in the model through the degrees of freedom, hence penalizes models with more independent variables.

  • Note that \(R_a^2\) increases if and only if the mean square error decreases. Hence, \(R_a^2\) and \(MSE\) are equivalent criteria.
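To see this equivalence, write \(R_a^2\) in terms of \(MSE_p\) (here \(SSTO\) denotes the total sum of squares):

\[ R_a^2 = 1-\frac{SSE_p/(n-p)}{SSTO/(n-1)} = 1-\frac{n-1}{SSTO}\,MSE_p \]

Since \(SSTO/(n-1)\) is the same for every candidate model on a given dataset, maximizing \(R_a^2\) is equivalent to minimizing \(MSE_p\).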


Definition 7.3 (Mallows' Cp Criterion) Mallows’ \(C_p\) statistic is an estimator of the measure \(\Gamma_p\), which is the size of the bias that is introduced into the predicted responses by having an underspecified model.

This appeals nicely to common sense and is developed from considerations of the proper compromise between excessive bias incurred when one underfits (chooses too few model terms) and excessive prediction variance produced when one overfits (has redundancies in the model).

The \(C_p\) for a particular subset model is an estimate of the following:

\[\begin{align} \Gamma_p&=\frac{1}{\sigma^2}\left\{\sum_{i=1}^n[E(\hat{Y}_i)-E(Y_i)]^2 + \sum_{i=1}^nVar(\hat{Y}_i)\right\}\\ &=\frac{1}{\sigma^2}\left\{\sum_{i=1}^n[BIAS(\hat{Y}_i)]^2 + \sum_{i=1}^nVar(\hat{Y}_i)\right\} \end{align}\]

\(\Gamma_p\) is called the standardized measure of the total variation in the predicted response.

We can see here that if the bias is 0, \(\Gamma_p\) achieves its smallest possible value, which is \(p\), the number of parameters (including the intercept) in the linear model.

An estimator of \(\Gamma_p\) is \(C_p\), which is defined as follows: \[\begin{align} C_p=\frac{SSE_p}{MSE(X_1,X_2,...,X_k)}-(n-2p)=\frac{SSE_R}{MSE_F}-(n-2p) \end{align}\]

In using this criterion, look for models where the \(C_p\) value is small and near \(p\).

Notes:

  • Since computation of \(C_p\) criterion involves \(MSE\) of the model containing all potential independent variables, this metric requires careful development of the set of potential independent variables. Some suggestions:

    • express independent variables in appropriate form (linear, quadratic, etc.)
    • exclude useless variables so that \(MSE(x_1, x_2, ..., x_k)\) or MSE of the full model, provides an unbiased estimate of the error variance \(\sigma^2\).
  • The model containing all candidate independent variables will always have \(C_p = p\) exactly, since \(SSE(R)=SSE(F)\):

    \[ \begin{align} \frac{SSE(R)}{MSE(F)}-(n-2p)&=\frac{SSE(F)}{SSE(F)/(n-p)}-n+2p\\ &=n-p-n+2p\\ &=p \end{align} \]

Limitation: we cannot use this criterion to evaluate the model containing all candidate independent variables, since its \(C_p\) equals \(p\) regardless of how well it fits.


Definition 7.4 The Akaike Information Criterion (AIC) is an information criterion that uses the number of parameters \(p\) and the likelihood function \(L\) to balance the trade-off between the goodness-of-fit and the simplicity of the model.

\[ AIC=2p-2\ln L(\hat{\boldsymbol{\beta}}_p|\textbf{Y},\textbf{X}_p) \]

  • AIC rewards models with good fitting through the likelihood function.
  • AIC penalizes models with more parameters \(p\) to prevent overfitting.
  • Model with lowest \(AIC\) is preferred.
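To see the form this takes in regression, note that under the normal-error linear model the maximized likelihood depends on the data only through \(SSE_p\). A sketch of the reduction, with \(\sigma^2\) replaced by its maximum likelihood estimate \(SSE_p/n\) and with constants that are the same for every model on the same data collected into \(c\):

\[\begin{align} -2\ln L(\hat{\boldsymbol{\beta}}_p|\textbf{Y},\textbf{X}_p) &= n\ln(2\pi)+n\ln\left(\frac{SSE_p}{n}\right)+n\\ AIC &= n\ln\left(\frac{SSE_p}{n}\right)+2p+c \end{align}\]

Software differs in whether \(\sigma^2\) is counted as an additional parameter, so reported AIC values may differ by a constant shift, but the ranking of models fitted to the same data is unaffected.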

Illustration in R:

Create an lm object containing all possible independent variables.

library(carData)
data("Anscombe")
anscombe_full <- lm(income ~ ., data = Anscombe)

The ols_step_all_possible() function from the olsrr package shows the values of different criteria for selecting the best model from the \(2^k-1=7\) possible regression models. Append $result to return a data frame of the values of the different criteria.

ols_step_all_possible(anscombe_full)$result
mindex n predictors rsquare adjr rmse predrsq cp aic sbic sbc msep fpe apc hsp
3 1 1 urban 0.4698527 0.4590334 403.7443 0.4192797 76.30922 762.8115 615.0464 768.6069 8653084 176316.32 0.5734246 3534.643
1 2 1 education 0.4456595 0.4343464 412.8539 0.3931971 81.93642 765.0873 617.2070 770.8828 9047966 184362.49 0.5995928 3695.946
2 3 1 young 0.0263608 0.0064906 547.1509 -0.1091104 179.46291 793.8136 644.7821 799.6091 15891776 323812.81 1.0531200 6491.530
5 4 2 education urban 0.7247758 0.7133081 290.9052 0.6943138 19.01558 731.3775 585.3431 739.1048 4587799 95204.04 0.3096273 1913.084
4 5 2 education young 0.5975156 0.5807454 351.7893 0.5592806 48.61556 750.7610 602.8835 758.4883 6709139 139225.16 0.4527949 2797.669
6 6 2 young urban 0.4744752 0.4525784 401.9802 0.2945371 77.23405 764.3648 615.4573 772.0921 8760138 181786.61 0.5912154 3652.922
7 7 3 education young urban 0.7979314 0.7850334 249.2628 0.7325135 4.00000 717.6195 573.5542 727.2787 3441570 72707.61 0.2364633 1465.647

The complete list of criteria in the output of the function ols_step_all_possible() is as follows:

  • rsquare - \(R^2\) of the model
  • adjr - \(R_a^2\), the adjusted R-squared
  • predrsq - predicted R square
  • cp - Mallows’ \(C_p\)
  • aic - Akaike Information Criterion
  • sbic - Sawa Bayesian Information Criterion
  • sbc - Schwarz Bayes Information Criterion
  • msep - Estimated MSE of prediction, assuming multivariate normality
  • fpe - Final Prediction Error
  • apc - Amemiya Prediction Criterion
  • hsp - Hocking’s \(S_p\) Criterion
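As a check on the formula in Definition 7.3, the cp value reported above for the education + urban subset can be reproduced by hand. This is a minimal sketch using the Anscombe data and the anscombe_full model already created; the last line should match the value 19.016 shown in the table.

n <- nrow(Anscombe)
red <- lm(income ~ education + urban, data = Anscombe)             # reduced model
p <- length(coef(red))                                             # parameters in the reduced model
SSE_R <- sum(resid(red)^2)                                         # SSE of the reduced model
MSE_F <- sum(resid(anscombe_full)^2) / df.residual(anscombe_full)  # MSE of the full model
SSE_R / MSE_F - (n - 2 * p)                                        # Mallows' Cp for this subset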

A major disadvantage of the all-possible-regression selection procedure is the amount of computation required: all \(2^k-1\) models must be fitted, and it may be difficult for an investigator to evaluate them all.


7.2 Automatic Search Procedures (ASP)

Automatic search procedures apply algorithms with conditions or criteria for including or excluding variables. Three procedures are demonstrated here.

In the examples, we use the mtcars dataset to find predictors of mpg, or the fuel efficiency of cars in miles per gallon.

# Load the mtcars dataset and view it
data(mtcars)
mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Forward Selection Procedure

Definition 7.5 (Forward Selection) Forward selection is a variable selection technique used to build a model by adding a predictor variable one at a time based on a specified criterion.

In forward selection, there are different criteria for determining which variable should be added. The following is an example algorithm that relies on the concept of the general linear test and sequential F-tests from the previous chapter.

  1. Specify a significance level for entering, say \(\alpha_E\)

  2. Perform a simple linear regression involving the potential variable most highly correlated with \(Y\).

    • If the overall F-value is insignificant, the procedure terminates and we adopt the equation \(\hat{Y} = \bar{Y}\) as best;

    • otherwise, proceed.

  3. The partial F-value (or F-to-enter) is calculated for every variable, treated as though it were the last to enter the regression equation.

  4. (selection condition) The highest partial F-value, say \(F_H\) , is compared with \(F_{\alpha_E}\), which is the critical value given \(\alpha_E\).

    • If \(F_H>F_{\alpha_E}\) (or p-value < \(\alpha_E\) ) add the variable \(x_H\) , which gave rise to \(F_H\) , to the equation and recompute the regression equation. Go back to step 3.
    • If \(F_H\leq F_{\alpha_E}\) (or p-value \(\geq \alpha_E\)), stop. Adopt the regression equation as calculated.
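Before the olsrr illustration, here is a minimal base-R sketch of the algorithm above, using add1(), whose test = "F" option reports the F-to-enter (and its p-value) for each candidate. This is only a sketch, not the olsrr implementation; \(\alpha_E = 0.05\) is assumed and the mtcars data introduced above are used.

alpha_E <- 0.05
scope <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
current <- lm(mpg ~ 1, data = mtcars)                 # start from the intercept-only model
repeat {
  cand <- add1(current, scope = scope, test = "F")    # F-to-enter for every candidate
  cand <- cand[rownames(cand) != "<none>", ]
  if (nrow(cand) == 0) break                          # no candidates left to add
  best <- which.min(cand[["Pr(>F)"]])
  if (cand[["Pr(>F)"]][best] >= alpha_E) break        # highest F-to-enter is not significant
  current <- update(current, as.formula(paste(". ~ . +", rownames(cand)[best])))
}
formula(current)                                      # compare with the olsrr result below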
Illustration of Forward Selection in R

As an initial inspection (not required by the procedure), we check which variable has the highest correlation with mpg.

cor(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
mpg 1.0000 -0.8522 -0.8476 -0.7762 0.6812 -0.8677 0.4187 0.6640 0.5998 0.4803 -0.5509
cyl -0.8522 1.0000 0.9020 0.8324 -0.6999 0.7825 -0.5912 -0.8108 -0.5226 -0.4927 0.5270
disp -0.8476 0.9020 1.0000 0.7909 -0.7102 0.8880 -0.4337 -0.7104 -0.5912 -0.5556 0.3950
hp -0.7762 0.8324 0.7909 1.0000 -0.4488 0.6587 -0.7082 -0.7231 -0.2432 -0.1257 0.7498
drat 0.6812 -0.6999 -0.7102 -0.4488 1.0000 -0.7124 0.0912 0.4403 0.7127 0.6996 -0.0908
wt -0.8677 0.7825 0.8880 0.6587 -0.7124 1.0000 -0.1747 -0.5549 -0.6925 -0.5833 0.4276
qsec 0.4187 -0.5912 -0.4337 -0.7082 0.0912 -0.1747 1.0000 0.7445 -0.2299 -0.2127 -0.6562
vs 0.6640 -0.8108 -0.7104 -0.7231 0.4403 -0.5549 0.7445 1.0000 0.1683 0.2060 -0.5696
am 0.5998 -0.5226 -0.5912 -0.2432 0.7127 -0.6925 -0.2299 0.1683 1.0000 0.7941 0.0575
gear 0.4803 -0.4927 -0.5556 -0.1257 0.6996 -0.5833 -0.2127 0.2060 0.7941 1.0000 0.2741
carb -0.5509 0.5270 0.3950 0.7498 -0.0908 0.4276 -0.6562 -0.5696 0.0575 0.2741 1.0000

The variable wt has the highest correlation with mpg in absolute value, followed by cyl. We therefore expect the forward selection algorithm to select wt in the first step.

# Define the full model (with all predictor variables)
full_model <- lm(mpg ~ ., data = mtcars)

# Perform forward selection using F test at alpha = 0.05 as criterion
forward_model <- ols_step_forward_p(full_model, p_val = 0.05,
                                    progress=TRUE, details = TRUE)
## Forward Selection Method 
## ------------------------
## 
## Candidate Terms: 
## 
## 1. cyl 
## 2. disp 
## 3. hp 
## 4. drat 
## 5. wt 
## 6. qsec 
## 7. vs 
## 8. am 
## 9. gear 
## 10. carb 
## 
## 
## Step   => 0 
## Model  => mpg ~ 1 
## R2     => 0 
## 
## Initiating stepwise selection... 
## 
##                     Selection Metrics Table                     
## ---------------------------------------------------------------
## Predictor    Pr(>|t|)    R-Squared    Adj. R-Squared      AIC   
## ---------------------------------------------------------------
## wt            0.00000        0.753             0.745    166.029 
## cyl           0.00000        0.726             0.717    169.306 
## disp          0.00000        0.718             0.709    170.209 
## hp            0.00000        0.602             0.589    181.239 
## drat            2e-05        0.464             0.446    190.800 
## vs              3e-05        0.441             0.422    192.147 
## am            0.00029        0.360             0.338    196.484 
## carb          0.00108        0.304             0.280    199.181 
## gear          0.00540        0.231             0.205    202.364 
## qsec          0.01708        0.175             0.148    204.588 
## ---------------------------------------------------------------
## 
## Step      => 1 
## Selected  => wt 
## Model     => mpg ~ wt 
## R2        => 0.753 
## 
##                     Selection Metrics Table                     
## ---------------------------------------------------------------
## Predictor    Pr(>|t|)    R-Squared    Adj. R-Squared      AIC   
## ---------------------------------------------------------------
## cyl           0.00106        0.830             0.819    156.010 
## hp            0.00145        0.827             0.815    156.652 
## qsec          0.00150        0.826             0.814    156.720 
## vs            0.01293        0.801             0.787    161.095 
## carb          0.02565        0.792             0.778    162.440 
## disp          0.06362        0.781             0.766    164.168 
## drat          0.33085        0.761             0.744    166.968 
## gear          0.73267        0.754             0.737    167.898 
## am            0.98791        0.753             0.736    168.029 
## ---------------------------------------------------------------
## 
## Step      => 2 
## Selected  => cyl 
## Model     => mpg ~ wt + cyl 
## R2        => 0.83 
## 
##                     Selection Metrics Table                     
## ---------------------------------------------------------------
## Predictor    Pr(>|t|)    R-Squared    Adj. R-Squared      AIC   
## ---------------------------------------------------------------
## hp            0.14002        0.843             0.826    155.477 
## carb          0.15154        0.842             0.826    155.617 
## qsec          0.21106        0.840             0.822    156.190 
## gear          0.50752        0.833             0.815    157.499 
## disp          0.53322        0.833             0.815    157.558 
## vs            0.74973        0.831             0.813    157.892 
## am            0.89334        0.830             0.812    157.989 
## drat          0.99032        0.830             0.812    158.010 
## ---------------------------------------------------------------
## 
## 
## No more variables to be added.
## 
## Variables Selected: 
## 
## => wt 
## => cyl

When using the ols_step_forward_p function, selection is based on the t-test of the parameter of a candidate independent variable, given the variables already selected into the model. Recall that the t-test for a single parameter is equivalent to the partial F-test (general linear test) for adding that variable last.
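A quick check of this equivalence with the variables selected here: the partial F-statistic for adding cyl to the wt-only model equals the square of cyl's t statistic in the two-variable fit.

m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + cyl, data = mtcars)
anova(m1, m2)                              # partial F-test for adding cyl last
coef(summary(m2))["cyl", "t value"]^2      # equals the F statistic above, (-3.636)^2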

# show final model
summary(forward_model$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2893 -1.5512 -0.4684  1.5743  6.1004 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8185 
## F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

After wt and cyl entered the model, the partial F-test (general linear test) for each remaining variable was no longer significant, so adding any other variable would not contribute much to the model. Hence, the forward selection algorithm stopped.

Therefore, the final model selected using the forward selection procedure is:

\[ \widehat{mpg}=39.686-3.191(wt) - 1.508(cyl) \]
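As a usage example, the selected model can now be used for prediction; for a hypothetical car weighing 3,000 lbs (wt = 3) with cyl = 6:

predict(forward_model$model, newdata = data.frame(wt = 3, cyl = 6))
# 39.686 - 3.191(3) - 1.508(6), or about 21.1 mpg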


Backward Elimination Procedure

Definition 7.6 (Backward Elimination) Backward elimination builds a regression model from a set of candidate predictor variables by removing predictors one at a time based on a criterion, until no remaining variable meets the removal criterion.

The following is an example algorithm that relies on the concept of the general linear test and partial F-tests from the previous chapter.

  1. Specify a significance level for staying, say \(\alpha_S\).

  2. Compute a regression equation containing all potential independent variables. If the overall F-value is insignificant at \(\alpha_S\), the procedure terminates and we adopt the equation \(\hat{Y} = \bar{Y}\) as best; otherwise, proceed.

  3. The partial F-value (called F-to-remove or F-to-stay) is calculated for every variable, treated as though it were the last variable to enter the regression equation.

  4. (elimination condition) The lowest partial F-value, say \(F_L\), is compared with \(F_{\alpha_S}\), the critical value given \(\alpha_S\).

    • If \(F_L\leq F_{\alpha_S}\), delete the variable \(x_L\), which gave rise to \(F_L\), from the model and recompute the regression equation in the remaining variables. Go back to step 3.
    • If \(F_L > F_{\alpha_S}\), stop. Adopt the regression equation as calculated.
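As with forward selection, this algorithm can be sketched in base R using drop1(), whose test = "F" option reports the F-to-stay for each variable currently in the model. Again this is only a sketch, not the olsrr implementation shown next; \(\alpha_S = 0.05\) is assumed.

alpha_S <- 0.05
current <- lm(mpg ~ ., data = mtcars)               # start from the full model
repeat {
  cand <- drop1(current, test = "F")                # F-to-stay for every variable
  cand <- cand[rownames(cand) != "<none>", ]
  if (nrow(cand) == 0) break                        # no variables left to remove
  worst <- which.max(cand[["Pr(>F)"]])
  if (cand[["Pr(>F)"]][worst] <= alpha_S) break     # even the lowest F-to-stay is significant
  current <- update(current, as.formula(paste(". ~ . -", rownames(cand)[worst])))
}
formula(current)                                    # compare with the olsrr result below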
Illustration of Backward Elimination in R:
backward_model <- ols_step_backward_p(full_model, p_val = 0.05, 
                                      progress = TRUE, details=TRUE)
## Backward Elimination Method 
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. cyl 
## 2. disp 
## 3. hp 
## 4. drat 
## 5. wt 
## 6. qsec 
## 7. vs 
## 8. am 
## 9. gear 
## 10. carb 
## 
## 
## Step   => 0 
## Model  => mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb 
## R2     => 0.869 
## 
## Initiating stepwise selection... 
## 
## Step     => 1 
## Removed  => cyl 
## Model    => mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb 
## R2       => 0.86894 
## 
## Step     => 2 
## Removed  => vs 
## Model    => mpg ~ disp + hp + drat + wt + qsec + am + gear + carb 
## R2       => 0.86871 
## 
## Step     => 3 
## Removed  => carb 
## Model    => mpg ~ disp + hp + drat + wt + qsec + am + gear 
## R2       => 0.8681 
## 
## Step     => 4 
## Removed  => gear 
## Model    => mpg ~ disp + hp + drat + wt + qsec + am 
## R2       => 0.86671 
## 
## Step     => 5 
## Removed  => drat 
## Model    => mpg ~ disp + hp + wt + qsec + am 
## R2       => 0.86374 
## 
## Step     => 6 
## Removed  => disp 
## Model    => mpg ~ hp + wt + qsec + am 
## R2       => 0.85785 
## 
## Step     => 7 
## Removed  => hp 
## Model    => mpg ~ wt + qsec + am 
## R2       => 0.84966 
## 
## 
## No more variables to be removed.
## 
## Variables Removed: 
## 
## => cyl 
## => vs 
## => carb 
## => gear 
## => drat 
## => disp 
## => hp

Of the 10 potential regressors, 7 were removed, leaving wt, qsec, and am in the model.

summary(backward_model$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Using the backward elimination procedure, the selected model is: \[ \widehat{mpg}=9.618 - 3.917(wt) + 1.226(qsec) + 2.936(am) \]


Stepwise Selection Procedure

Definition 7.7 (Stepwise Selection) The stepwise selection procedure builds a regression model from a set of candidate predictor variables by entering and removing predictors based on a criterion, in a stepwise manner, until no variable can be entered or removed anymore.

There are times when a variable \(X_a\) that is already in the model becomes a weak predictor of \(Y\) after another variable \(X_b\) is added.

The following is a stepwise selection algorithm that involves both forward and backward selection procedures.

  1. Specify the significance level for entry, \(\alpha_E\), and for staying \(\alpha_S\).

  2. Compute a first-order simple linear regression equation involving the potential variable that explains the largest proportion of variation in \(Y\) (is most highly correlated with the response). If the overall F-value is not significant, the procedure terminates and we adopt the equation \(\hat{Y}=\bar{Y}\) as best; otherwise, proceed.

  3. (entry condition) If there are no other candidates for entry into the model, the procedure terminates and we adopt the regression equation as calculated. Otherwise, the partial F-values (or F-to-enter) for all variables not in regression at this stage are calculated. The highest F-to-enter, \(F_H\) , is compared with \(F_{\alpha_E}\)

    • If \(F_H > F_{\alpha_E}\), add the variable \(x_H\) , which gave rise to \(F_H\) , to the model. Go to the next step.
    • If \(F_H \leq F_{\alpha_E}\), stop. Adopt the regression as calculated.
  4. (elimination condition) Recompute the regression equation in the potential variables included in the model at this stage. The overall regression is checked for significance, the changes in \(R^2\), \(MSE\), and \(C_p\) values are noted, and the partial F-values (or F-to-stay) for all variables now in the equation are examined. The lowest F-to-stay, \(F_L\), is compared with \(F_{\alpha_S}\).

    • If \(F_L \leq F_{\alpha_S}\), delete the variable \(x_L\), which gave rise to \(F_L\), from the model. Repeat step 4.

    • If \(F_L > F_{\alpha_S}\), go back to step 3.
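Combining the two earlier sketches gives a minimal stepwise loop, alternating the entry condition (via add1()) with the elimination condition (via drop1()). This is only a sketch of the idea, not the olsrr implementation; \(\alpha_E = \alpha_S = 0.05\) is assumed.

alpha_E <- 0.05
alpha_S <- 0.05
scope <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
current <- lm(mpg ~ 1, data = mtcars)
repeat {
  # entry condition: add the candidate with the highest F-to-enter, if significant
  adds <- add1(current, scope = scope, test = "F")
  adds <- adds[rownames(adds) != "<none>", ]
  if (nrow(adds) == 0 || min(adds[["Pr(>F)"]]) >= alpha_E) break
  current <- update(current, as.formula(paste(". ~ . +", rownames(adds)[which.min(adds[["Pr(>F)"]])])))
  # elimination condition: remove any variable whose F-to-stay is no longer significant
  repeat {
    drops <- drop1(current, test = "F")
    drops <- drops[rownames(drops) != "<none>", ]
    if (nrow(drops) == 0 || max(drops[["Pr(>F)"]]) <= alpha_S) break
    current <- update(current, as.formula(paste(". ~ . -", rownames(drops)[which.max(drops[["Pr(>F)"]])])))
  }
}
formula(current)   # compare with the olsrr stepwise result below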

Illustration of Stepwise Selection in R:
stepwise_1 <- ols_step_both_p(full_model, 
                                  p_enter = 0.05, p_remove = 0.05,
                                  progress = TRUE, details=TRUE)
## Stepwise Selection Method 
## -------------------------
## 
## Candidate Terms: 
## 
## 1. cyl 
## 2. disp 
## 3. hp 
## 4. drat 
## 5. wt 
## 6. qsec 
## 7. vs 
## 8. am 
## 9. gear 
## 10. carb 
## 
## 
## Step   => 0 
## Model  => mpg ~ 1 
## R2     => 0 
## 
## Initiating stepwise selection... 
## 
## Step      => 1 
## Selected  => wt 
## Model     => mpg ~ wt 
## R2        => 0.753 
## 
## Step      => 2 
## Selected  => cyl 
## Model     => mpg ~ wt + cyl 
## R2        => 0.83 
## 
## 
## No more variables to be added or removed.

In this example, no variable was removed after wt and cyl were included. Hence, the final model is the same as the result of the forward selection procedure.

You can also force a variable to be included, regardless of the result of the criterion. For example, suppose we want to include drat in the model.

stepwise_2 <- ols_step_both_p(full_model, include = "drat",
                              p_enter = 0.05, p_remove = 0.05,
                              progress = TRUE, details = TRUE)
## Stepwise Selection Method 
## -------------------------
## 
## Candidate Terms: 
## 
## 1. cyl 
## 2. disp 
## 3. hp 
## 4. drat 
## 5. wt 
## 6. qsec 
## 7. vs 
## 8. am 
## 9. gear 
## 10. carb 
## 
## 
## Step   => 0 
## Model  => mpg ~ drat 
## R2     => 0.464 
## 
## Initiating stepwise selection... 
## 
## Step      => 1 
## Selected  => wt 
## Model     => mpg ~ drat + wt 
## R2        => 0.761 
## 
## Step      => 2 
## Selected  => qsec 
## Model     => mpg ~ drat + wt + qsec 
## R2        => 0.837 
## 
## 
## No more variables to be added or removed.
summary(stepwise_2$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1152 -1.8273 -0.2696  1.0502  5.5010 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3945     8.0689   1.412  0.16892    
## drat          1.6561     1.2269   1.350  0.18789    
## wt           -4.3978     0.6781  -6.485 5.01e-07 ***
## qsec          0.9462     0.2616   3.616  0.00116 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.56 on 28 degrees of freedom
## Multiple R-squared:  0.837,  Adjusted R-squared:  0.8196 
## F-statistic: 47.93 on 3 and 28 DF,  p-value: 3.723e-11

Note:

  • The order in which the independent variables enter the equation in the stepwise selection procedure does not reflect the importance of the independent variables in the model.
  • \(\alpha_S \geq \alpha_E\), since the criterion for entering a new variable should be stricter than the criterion for allowing a variable to stay.

Notes on Variable Selection

Which variable selection procedure should I use?

It depends on the aspect that you want to focus on or improve:

  • Prediction Accuracy - R-squared, Adjusted R-Squared
  • Precise estimates of the fitted values - Mallows' \(C_p\), PRESS procedure
  • Final model includes significant coefficients - automatic search procedures using partial F-test as criterion
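One simple way to weigh these aspects is to put the models selected earlier side by side; the sketch below assumes the forward_model, backward_model, and stepwise_1 objects created above are still in the workspace.

# Side-by-side comparison of the selected models on adjusted R-squared and AIC
fits <- list(forward = forward_model$model,
             backward = backward_model$model,
             stepwise = stepwise_1$model)
data.frame(adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
           aic = sapply(fits, AIC))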
