Part 8 Variable Selection
Here we discuss a few different approaches to variable selection in R.
8.1 Data: Airbnb listings
We will use the airbnb_small.csv
dataset, which contains information on 50 Airbnb listings in Florence, Italy from 2021. There are 24 variables for each listing, covering hosts, location, price, property details, and reviews. Our goal will be to build a model that predicts the price of an Airbnb property from the remaining variables.
Two useful packages are olsrr
and glmnet
, which have convenient built-in methods for variable selection. Let’s start by loading these packages, setting our working directory, and reading in the file:
set.seed(100)
library(olsrr)
library(glmnet)
library(tidyverse)
library(mltools)
library(data.table)
airbnb = read.csv("airbnb_small.csv")
8.2 Stepwise regression
When we have a large number of potential variables to choose from for regression, simply throwing all of them into the model might make the model too complex. High model complexity has two drawbacks:
- it makes interpreting and explaining the model to others more complicated, and
- it risks overfitting the model to data, which can erode predictive performance.
One way of controlling model complexity in linear regression is called stepwise regression.
The key to stepwise regression is a performance metric called the Akaike information criterion (AIC). The AIC of a regression model measures how well the model balances complexity (i.e., the number of variables included in the regression) and goodness-of-fit (i.e., how well the chosen set of variables fit the data). In general, a regression model with a lower AIC is viewed as achieving a “better” balance between complexity and goodness-of-fit. The AIC is therefore widely used as a guide for finding a small number of variables with high explanatory power.
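As a quick illustration of how the AIC is reported in R (using the built-in mtcars dataset rather than the Airbnb data, so the snippet is self-contained), the AIC() function computes the criterion for any fitted lm() model:

```r
# Compare two models of fuel economy: AIC rewards fit but penalizes extra variables
small_model <- lm(mpg ~ wt, data = mtcars)
big_model <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

AIC(small_model)
AIC(big_model)
```

A lower AIC indicates a better complexity/fit trade-off, so the bigger model "wins" only if its extra variables improve the fit enough to offset the complexity penalty.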
8.2.1 Stepwise regression with forward selection
One approach to stepwise regression is called forward selection. The general idea is that we start with a “blank” model, add whichever variable improves (i.e., decreases) the AIC the most, and repeat until the AIC can’t be improved any further.
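Base R's step() function implements this same idea (it also scores candidate models by AIC). A minimal self-contained sketch on the built-in mtcars data, starting from a "blank" intercept-only model:

```r
# Start from an intercept-only model and let forward selection grow it,
# choosing from a scope of five candidate predictors
null_model <- lm(mpg ~ 1, data = mtcars)
forward_fit <- step(null_model,
                    scope = ~ wt + hp + disp + drat + qsec,
                    direction = "forward", trace = 0)
formula(forward_fit)  # the variables that survived forward selection
```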
Building a stepwise model in R is similar to building a standard linear regression model, but takes one extra step.
First we run a regular linear regression on the airbnb data using lm()
, with Price as the dependent variable:
model <- lm(Price ~ ., data = airbnb)
As a benchmark, let’s look at the results of the full linear regression model:
summary(model)
##
## Call:
## lm(formula = Price ~ ., data = airbnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.613 -10.105 -0.221 12.286 49.415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13516.3486 56832.9384 0.238 0.8144
## HostResponseTimewithin a day -475.6676 213.3597 -2.229 0.0374 *
## HostResponseTimewithin a few hours -493.5318 203.8891 -2.421 0.0251 *
## HostResponseTimewithin an hour -455.0459 207.1368 -2.197 0.0400 *
## HostResponseRate 390.4220 220.2179 1.773 0.0915 .
## HostAcceptRate 31.5768 39.6043 0.797 0.4346
## Superhost -7.6700 13.5703 -0.565 0.5782
## HostListings 0.7634 0.2810 2.716 0.0133 *
## NeighbourhoodCentro Storico 4.9871 37.1483 0.134 0.8945
## NeighbourhoodGavinana Galluzzo -186.4095 1012.4128 -0.184 0.8558
## NeighbourhoodIsolotto Legnaia -103.9420 83.4928 -1.245 0.2276
## NeighbourhoodRifredi -28.8847 43.0517 -0.671 0.5099
## Latitude 60.4395 1238.8568 0.049 0.9616
## Longitude -1464.6477 1062.1974 -1.379 0.1832
## RoomTypeHotel room 50.2613 33.7877 1.488 0.1525
## RoomTypePrivate room -5.8018 24.4563 -0.237 0.8149
## Accomodates 12.1546 6.7288 1.806 0.0859 .
## Bathrooms -19.7126 13.0087 -1.515 0.1453
## Bedrooms 29.1729 15.6235 1.867 0.0766 .
## Beds -2.6974 10.8947 -0.248 0.8070
## MinNights 1.0589 2.6414 0.401 0.6928
## AvgRating -51.2645 75.8444 -0.676 0.5068
## RatingAccuracy -31.8468 35.4207 -0.899 0.3793
## RatingClean 11.9607 63.9980 0.187 0.8536
## RatingCheckIn 22.1832 88.4108 0.251 0.8044
## RatingCommunication 85.7820 75.2299 1.140 0.2676
## RatingLocation -55.6150 43.7572 -1.271 0.2183
## RatingValue 102.3121 55.8141 1.833 0.0817 .
## InstantBook -17.9653 16.8469 -1.066 0.2990
## ReviewsPerMonth -4.2126 6.0649 -0.695 0.4953
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.55 on 20 degrees of freedom
## Multiple R-squared: 0.8794, Adjusted R-squared: 0.7045
## F-statistic: 5.028 on 29 and 20 DF, p-value: 0.0002066
Then we send our stored model to the ols_step_forward_aic()
function (which comes from the olsrr
package), which will automatically perform the stepwise regression:
forward <- ols_step_forward_aic(model)
To see the full set of results and the variables that are added at each step, we simply use print()
:
print(forward)
##
## Selection Summary
## ------------------------------------------------------------------------------
## Variable AIC Sum Sq RSS R-Sq Adj. R-Sq
## ------------------------------------------------------------------------------
## Bedrooms 531.846 67591.153 108128.467 0.38465 0.37183
## HostListings 518.117 96776.482 78943.138 0.55074 0.53163
## Accomodates 515.520 103710.752 72008.868 0.59021 0.56348
## Bathrooms 511.999 111238.486 64481.134 0.63305 0.60043
## Neighbourhood 509.511 123439.723 52279.897 0.70248 0.64443
## ReviewsPerMonth 503.142 131497.257 44222.363 0.74834 0.69171
## MinNights 502.423 133837.507 41882.113 0.76165 0.70054
## ------------------------------------------------------------------------------
The output above tells us that the first variable added to the model is Bedrooms, and that this one-variable model achieves \(AIC = 532\) and \(R^2 = 0.385\). Then a two-variable linear model with both Bedrooms and HostListings achieves \(AIC = 518\) and \(R^2 = 0.551\), and so on. Note that the output above lists only 7 of the 23 possible independent variables we could have used.
Next, visualizing AIC is straightforward using plot()
:
plot(forward)
8.2.2 Stepwise regression with backward elimination
Another way of performing stepwise regression is backward elimination, which is conceptually similar to forward selection. In backward elimination, we start with a “full” model that contains all possible variables, and then we identify the variable whose removal from the model improves the AIC the most, and repeat until the AIC cannot be improved further.
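The same base-R step() sketch from before works for backward elimination by starting from the full model instead of the blank one (again on mtcars, so the snippet is self-contained):

```r
# Start from the full model and let backward elimination prune it
full_model <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)
backward_fit <- step(full_model, direction = "backward", trace = 0)
formula(backward_fit)  # the variables that survived elimination
```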
The R code for backward elimination is similar: first we run the full linear regression model using lm()
, and then we use ols_step_backward_aic()
:
backward <- ols_step_backward_aic(model)
Like in forward selection, we can print the model to see the result of each step:
print(backward)
##
##
## Backward Elimination Summary
## ----------------------------------------------------------------------------
## Variable AIC RSS Sum Sq R-Sq Adj. R-Sq
## ----------------------------------------------------------------------------
## Full Model 506.367 21194.135 154525.485 0.87939 0.70450
## Latitude 504.373 21196.657 154522.963 0.87937 0.71854
## RatingClean 502.458 21232.830 154486.790 0.87917 0.73087
## Beds 500.568 21279.825 154439.795 0.87890 0.74200
## RatingCheckIn 499.030 21477.352 154242.268 0.87777 0.75046
## MinNights 497.612 21728.535 153991.085 0.87635 0.75764
## AvgRating 496.242 22004.159 153715.461 0.87478 0.76400
## HostAcceptRate 495.231 22443.497 153276.123 0.87228 0.76821
## Superhost 494.898 23204.834 152514.786 0.86794 0.76890
## ----------------------------------------------------------------------------
In the table above, we can see the names of 8 variables that are eliminated from the model. In this example, the final model from backward elimination achieves \(AIC = 495\) and \(R^2 = 0.868\). Note that backward elimination arrives at a different final model than forward selection in this case.
Similar to before, we can use plot()
to visualize the result of the stepwise regression based on backward elimination:
plot(backward)
8.3 LASSO regression
Another popular method for variable selection is LASSO regression. In general, LASSO is used when dealing with a huge number of potential variables, in which case stepwise regression can be very slow. (There are other differences that we won’t get into here.)
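Concretely, LASSO fits the regression coefficients by minimizing the usual sum of squared errors plus a penalty on the coefficients' absolute sizes:

\[
\min_{\beta_0, \beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
\]

The penalty parameter \(\lambda \geq 0\) controls model complexity: as \(\lambda\) grows, more coefficients are shrunk exactly to zero, and the corresponding variables drop out of the model.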
8.3.1 Building a LASSO regression model
LASSO regression can be done in R using the glmnet()
function. In order to apply it to the Airbnb data, we first need to do a little bit of data manipulation. In particular, we need to manually convert the columns with character strings (i.e., words) into R factors, and then convert the factors into 0 or 1 binary variables using “one-hot” encoding. The code below does this:
airbnb$HostResponseTime <- as.factor(airbnb$HostResponseTime)
airbnb$RoomType <- as.factor(airbnb$RoomType)
airbnb$Neighbourhood <- as.factor(airbnb$Neighbourhood)
airbnbone <- one_hot(as.data.table(airbnb))
If we take a look at the new data frame, we can see that it now has 33 columns, due to the conversion of the factor variables into 0-1 dummy variables. (It is commented out below to simplify HTML output, but you can run it in R Markdown to preview the new data frame.)
#head(airbnbone)
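To see exactly what one_hot() does, here is a tiny self-contained example on a made-up factor column (not the Airbnb data):

```r
library(mltools)
library(data.table)

# A single factor column with three levels
toy <- data.table(RoomType = as.factor(c("Entire home/apt",
                                         "Private room",
                                         "Hotel room")))
one_hot(toy)  # one 0/1 column per factor level, e.g. RoomType_Private room
```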
Next, we need to split the dataframe into independent and dependent variables. (We have to do this because the glmnet()
function we will use for LASSO regression expects the data to be in a different format from the stepwise regression functions we used above). For simplicity we will use x and y to denote the independent and dependent variables, respectively:
x <- as.matrix(subset(airbnbone, select = -c(Price)))
y <- as.matrix(subset(airbnbone, select = c(Price)))
Now that the data is formatted, we can call glmnet()
to run a lasso regression:
lasso <- glmnet(x, y)
Printing the results displays the number of variables selected at each value of \(\lambda\) (Df = degrees of freedom):
print(lasso)
##
## Call: glmnet(x = x, y = y)
##
## Df %Dev Lambda
## 1 0 0.00 36.770
## 2 2 7.19 33.500
## 3 2 13.22 30.520
## 4 2 18.24 27.810
## 5 3 23.99 25.340
## 6 3 29.94 23.090
## 7 3 34.88 21.040
## 8 3 38.98 19.170
## 9 3 42.38 17.470
## 10 3 45.21 15.920
## 11 3 47.55 14.500
## 12 3 49.50 13.210
## 13 4 51.49 12.040
## 14 5 53.80 10.970
## 15 7 56.68 9.995
## 16 7 59.60 9.108
## 17 8 62.07 8.298
## 18 9 64.32 7.561
## 19 9 66.40 6.889
## 20 9 68.13 6.277
## 21 9 69.57 5.720
## 22 9 70.76 5.212
## 23 10 71.76 4.749
## 24 12 72.66 4.327
## 25 13 73.62 3.942
## 26 15 74.55 3.592
## 27 16 75.50 3.273
## 28 16 76.37 2.982
## 29 16 77.09 2.717
## 30 17 77.70 2.476
## 31 17 78.33 2.256
## 32 17 78.85 2.056
## 33 18 79.32 1.873
## 34 18 79.76 1.707
## 35 18 80.13 1.555
## 36 18 80.43 1.417
## 37 18 80.68 1.291
## 38 18 80.89 1.176
## 39 18 81.07 1.072
## 40 19 81.24 0.977
## 41 19 81.45 0.890
## 42 19 81.63 0.811
## 43 20 81.78 0.739
## 44 21 81.96 0.673
## 45 23 82.12 0.613
## 46 24 82.58 0.559
## 47 25 82.94 0.509
## 48 25 83.25 0.464
## 49 26 83.52 0.423
## 50 25 84.08 0.385
## 51 25 84.57 0.351
## 52 27 84.97 0.320
## 53 26 85.37 0.291
## 54 27 85.69 0.266
## 55 27 86.02 0.242
## 56 27 86.30 0.220
## 57 27 86.54 0.201
## 58 27 86.79 0.183
## 59 27 86.97 0.167
## 60 26 87.12 0.152
## 61 26 87.24 0.138
## 62 26 87.34 0.126
## 63 26 87.42 0.115
## 64 26 87.49 0.105
## 65 27 87.55 0.095
## 66 28 87.61 0.087
## 67 28 87.66 0.079
## 68 28 87.70 0.072
## 69 28 87.73 0.066
## 70 28 87.76 0.060
## 71 28 87.78 0.055
## 72 28 87.80 0.050
## 73 28 87.82 0.045
## 74 28 87.84 0.041
## 75 28 87.85 0.038
## 76 28 87.86 0.034
## 77 28 87.87 0.031
## 78 28 87.87 0.028
## 79 28 87.88 0.026
## 80 28 87.89 0.024
## 81 28 87.89 0.022
## 82 28 87.89 0.020
## 83 28 87.90 0.018
## 84 28 87.90 0.016
## 85 28 87.90 0.015
## 86 28 87.90 0.014
## 87 30 87.91 0.012
## 88 30 87.91 0.011
## 89 30 87.91 0.010
## 90 30 87.91 0.009
## 91 30 87.91 0.008
## 92 30 87.91 0.008
## 93 30 87.91 0.007
## 94 30 87.91 0.006
The table above represents 94 different regression models obtained for 94 different values of \(\lambda\). The R function glmnet()
automatically decides which values of \(\lambda\) to test. For example, we can see that when \(\lambda\) is around 5, the LASSO selects 9 variables, and when \(\lambda\) is around 1, the regression selects 19 variables. (The middle column (%Dev) is a measure of goodness-of-fit that we won’t worry about.)
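glmnet can also draw the full "path" of each coefficient as \(\lambda\) varies, which makes this table easier to digest. A self-contained sketch on synthetic data (the Airbnb x and y would work the same way):

```r
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)      # 10 candidate predictors
z <- 3 * X[, 1] - 2 * X[, 2] + rnorm(100)  # only two truly matter

fit <- glmnet(X, z)
plot(fit, xvar = "lambda", label = TRUE)   # each curve is one coefficient's path
```

Reading the plot from right to left (large \(\lambda\) to small), coefficients enter the model one by one as the penalty relaxes.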
To actually see the model output for each value of \(\lambda\), we use the coef()
function. For example, if we want to see the results of the LASSO regression using \(\lambda = 10\), we can write
coef(lasso, s = 10)
## 33 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 30.8880917
## HostResponseTime_a few days or more .
## HostResponseTime_within a day .
## HostResponseTime_within a few hours -7.7209538
## HostResponseTime_within an hour .
## HostResponseRate .
## HostAcceptRate .
## Superhost .
## HostListings 0.3786573
## Neighbourhood_Campo di Marte .
## Neighbourhood_Centro Storico 2.0365337
## Neighbourhood_Gavinana Galluzzo .
## Neighbourhood_Isolotto Legnaia .
## Neighbourhood_Rifredi .
## Latitude .
## Longitude .
## RoomType_Entire home/apt .
## RoomType_Hotel room .
## RoomType_Private room .
## Accomodates 7.4069029
## Bathrooms -0.4116008
## Bedrooms 18.6908176
## Beds .
## MinNights .
## AvgRating .
## RatingAccuracy .
## RatingClean .
## RatingCheckIn .
## RatingCommunication .
## RatingLocation .
## RatingValue .
## InstantBook .
## ReviewsPerMonth -1.2924361
The variables selected are those with non-zero coefficients. You can try varying the \(\lambda\) parameter in coef()
to see how it changes the variables that are selected along with their coefficients.
8.3.2 Cross-validation of LASSO regression
It is important to remember that \(\lambda\) is a parameter that we need to choose before building a LASSO regression model. The glmnet()
function builds many different regression models by varying \(\lambda\) (in our case, the variable lasso
is storing the results of 94 different regression models).
Of course, usually we want to identify one good model, and the output doesn’t tell us anything about what value of \(\lambda\) we should use. So how do we select the “best” value of \(\lambda\) in the first place?
The standard approach to picking \(\lambda\) is cross-validation. Recall that the main idea behind cross-validation is as follows:
- Pick a value of \(\lambda\),
- Randomly split the dataset into 10 different segments called folds,
- Build the LASSO regression model using data from 9 folds, and measure its prediction error on the 10th fold that was “left out” from the model building phase,
- Repeat by rotating through all 10 folds, and then calculate the average prediction error over the 10 folds. This produces a single number – the cross-validation error – which we can interpret as a measure of model performance (under the value of \(\lambda\) that we have selected).
- Repeat steps 1 through 4 for many different values of \(\lambda\), and pick the value that achieves the desired balance between predictive performance and model complexity (discussed more below).
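The steps above can be sketched by hand for a single value of \(\lambda\) (on synthetic data, so the snippet runs on its own):

```r
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
z <- 3 * X[, 1] + rnorm(100)

lambda_try <- 0.5                                 # step 1: pick a lambda
folds <- sample(rep(1:10, length.out = nrow(X)))  # step 2: assign rows to folds
errs <- sapply(1:10, function(k) {
  fit <- glmnet(X[folds != k, ], z[folds != k])   # step 3: fit on 9 folds
  pred <- predict(fit, newx = X[folds == k, ], s = lambda_try)
  mean(abs(z[folds == k] - pred))                 # error on the held-out fold
})
mean(errs)  # step 4: the cross-validation error for this lambda
```

Repeating this over a grid of \(\lambda\) values completes step 5.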
Luckily, the glmnet
package comes with a built-in function for cross validation, so we can perform the above steps in a single line!
cv.lasso <- cv.glmnet(x, y, type.measure = "mae")
Next, we can use plot()
to visualize the cross-validation errors:
plot(cv.lasso)
The horizontal axis of the plot shows different values of \(\lambda\) (by default, this is given in terms of \(log(\lambda)\) instead of \(\lambda\), which just makes the results easier to visualize). The vertical axis reports the associated cross-validation error. Across the top of the plot, we can also see the number of variables that are selected at each value of \(\lambda\). The dashed line on the left indicates the value of \(\lambda\) that achieves the minimum cross-validation error, which is what we would usually choose as our “final” model.
To print the values of \(\lambda\) corresponding to the minimum cross-validation error, we write
print(cv.lasso)
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mae")
##
## Measure: Mean Absolute Error
##
## Lambda Index Measure SE Nonzero
## min 4.749 23 28.13 5.133 10
## 1se 12.040 13 32.93 5.209 4
The table above shows that the cross-validation error is minimized at a value of \(\lambda = 4.7\), which produces a mean absolute error of $28 and selects 10 variables. A general rule of thumb is to select the value of \(\lambda\) that minimizes the error, but ultimately, the decision of which value of \(\lambda\) to use for the final model depends on your preference for how to balance model complexity and prediction error.
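Once cv.glmnet() has run, the stored \(\lambda\) values can be passed to coef() by name to extract the corresponding final model. A self-contained sketch on synthetic data:

```r
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
z <- 3 * X[, 1] - 2 * X[, 2] + rnorm(100)

cvfit <- cv.glmnet(X, z, type.measure = "mae")
coef(cvfit, s = "lambda.min")  # coefficients at the error-minimizing lambda
coef(cvfit, s = "lambda.1se")  # a sparser alternative within one SE of the minimum
```

The "lambda.1se" model trades a small amount of prediction error for fewer variables, which is often preferable when interpretability matters.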
Forward selection, backward elimination, and LASSO regression are three common approaches to variable selection. An advantage of forward selection and backward elimination is that they do not require choosing a value of \(\lambda\) through cross-validation, which LASSO does. An advantage of LASSO is that it can be computationally much faster than stepwise regression when there are a large number of variables to choose from.
Regardless of which approach we use, it is important to remember there is no “one-size-fits-all” approach to variable selection. These methods should be used as helpful guides only, and we should always aim to incorporate background knowledge and domain expertise into the variable selection process wherever possible.