Part 8 Variable Selection

Here we discuss a few different approaches to variable selection in R.

8.1 Data: Airbnb listings

We will use the airbnb_small.csv dataset, which contains information on 50 Airbnb listings in Florence, Italy from 2021. There are 24 variables for each listing, covering hosts, location, price, property details, and reviews. Our goal will be to build a model that predicts the price of a listing from the remaining variables.

Two useful packages are olsrr and glmnet, which have convenient built-in methods for variable selection. Let's start by setting a random seed, loading these packages (along with a few others we will use for data manipulation), and reading in the file:

set.seed(100)
library(olsrr)
library(glmnet)
library(tidyverse)
library(mltools)
library(data.table)

airbnb = read.csv("airbnb_small.csv")

8.2 Stepwise regression

When we have a large number of potential variables to choose from for regression, simply throwing all of them into the model might make the model too complex. High model complexity has two drawbacks:

  1. it makes interpreting and explaining the model to others more complicated, and
  2. it risks overfitting the model to data, which can erode predictive performance.

One way of controlling model complexity in linear regression is called stepwise regression.

The key to stepwise regression is a performance metric called the Akaike information criterion (AIC). The AIC of a regression model measures how well the model balances complexity (i.e., the number of variables included in the regression) and goodness-of-fit (i.e., how well the chosen set of variables fit the data). In general, a regression model with a lower AIC is viewed as achieving a “better” balance between complexity and goodness-of-fit. The AIC is therefore widely used as a guide for finding a small number of variables with high explanatory power.
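
For reference, the standard definition of the AIC for a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\) is

\[
AIC = 2k - 2\ln(\hat{L}).
\]

The \(2k\) term penalizes complexity and the \(-2\ln(\hat{L})\) term rewards goodness-of-fit, so adding a variable only lowers the AIC if the improvement in fit outweighs the added complexity.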

8.2.1 Stepwise regression with forward selection

One approach to stepwise regression is called forward selection. The general idea is that we start with a “blank” model, add whichever variable improves (i.e., decreases) the AIC the most, and repeat until the AIC can’t be improved any further.

Building a stepwise model in R is similar to building a standard linear regression model, but takes one extra step. First, we run a regular linear regression on the airbnb data using lm(), with Price as the dependent variable:

model <- lm(Price ~ ., data = airbnb)

As a benchmark, let’s look at the results of the full linear regression model:

summary(model)
## 
## Call:
## lm(formula = Price ~ ., data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.613 -10.105  -0.221  12.286  49.415 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                        13516.3486 56832.9384   0.238   0.8144  
## HostResponseTimewithin a day        -475.6676   213.3597  -2.229   0.0374 *
## HostResponseTimewithin a few hours  -493.5318   203.8891  -2.421   0.0251 *
## HostResponseTimewithin an hour      -455.0459   207.1368  -2.197   0.0400 *
## HostResponseRate                     390.4220   220.2179   1.773   0.0915 .
## HostAcceptRate                        31.5768    39.6043   0.797   0.4346  
## Superhost                             -7.6700    13.5703  -0.565   0.5782  
## HostListings                           0.7634     0.2810   2.716   0.0133 *
## NeighbourhoodCentro Storico            4.9871    37.1483   0.134   0.8945  
## NeighbourhoodGavinana Galluzzo      -186.4095  1012.4128  -0.184   0.8558  
## NeighbourhoodIsolotto Legnaia       -103.9420    83.4928  -1.245   0.2276  
## NeighbourhoodRifredi                 -28.8847    43.0517  -0.671   0.5099  
## Latitude                              60.4395  1238.8568   0.049   0.9616  
## Longitude                          -1464.6477  1062.1974  -1.379   0.1832  
## RoomTypeHotel room                    50.2613    33.7877   1.488   0.1525  
## RoomTypePrivate room                  -5.8018    24.4563  -0.237   0.8149  
## Accomodates                           12.1546     6.7288   1.806   0.0859 .
## Bathrooms                            -19.7126    13.0087  -1.515   0.1453  
## Bedrooms                              29.1729    15.6235   1.867   0.0766 .
## Beds                                  -2.6974    10.8947  -0.248   0.8070  
## MinNights                              1.0589     2.6414   0.401   0.6928  
## AvgRating                            -51.2645    75.8444  -0.676   0.5068  
## RatingAccuracy                       -31.8468    35.4207  -0.899   0.3793  
## RatingClean                           11.9607    63.9980   0.187   0.8536  
## RatingCheckIn                         22.1832    88.4108   0.251   0.8044  
## RatingCommunication                   85.7820    75.2299   1.140   0.2676  
## RatingLocation                       -55.6150    43.7572  -1.271   0.2183  
## RatingValue                          102.3121    55.8141   1.833   0.0817 .
## InstantBook                          -17.9653    16.8469  -1.066   0.2990  
## ReviewsPerMonth                       -4.2126     6.0649  -0.695   0.4953  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.55 on 20 degrees of freedom
## Multiple R-squared:  0.8794, Adjusted R-squared:  0.7045 
## F-statistic: 5.028 on 29 and 20 DF,  p-value: 0.0002066
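
It is also worth noting the AIC of this full model, which we can compute directly with R's built-in AIC() function (an optional aside; the stepwise functions below compute the AIC for us):

# AIC of the full model; this should match (up to rounding) the "Full Model"
# AIC reported in the backward elimination summary later in this section
AIC(model)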

Then we send our stored model to the ols_step_forward_aic() function (which comes from the olsrr package), which will automatically perform the stepwise regression:

forward <- ols_step_forward_aic(model)

To see the full set of results and the variables that are added at each step, we simply use print():

print(forward)
## 
##                               Selection Summary                                
## ------------------------------------------------------------------------------
## Variable             AIC        Sum Sq         RSS         R-Sq      Adj. R-Sq 
## ------------------------------------------------------------------------------
## Bedrooms           531.846     67591.153    108128.467    0.38465      0.37183 
## HostListings       518.117     96776.482     78943.138    0.55074      0.53163 
## Accomodates        515.520    103710.752     72008.868    0.59021      0.56348 
## Bathrooms          511.999    111238.486     64481.134    0.63305      0.60043 
## Neighbourhood      509.511    123439.723     52279.897    0.70248      0.64443 
## ReviewsPerMonth    503.142    131497.257     44222.363    0.74834      0.69171 
## MinNights          502.423    133837.507     41882.113    0.76165      0.70054 
## ------------------------------------------------------------------------------

The output above tells us that the first variable added to the model is Bedrooms, and that this one-variable model achieves \(AIC = 532\) and \(R^2 = 0.385\). Then a two-variable linear model with both Bedrooms and HostListings achieves \(AIC = 518\) and \(R^2 = 0.551\), and so on. Note that the output above lists only 7 of the 23 possible independent variables we could have used.
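
If we want to work with the selected model directly (for example, to inspect its coefficients or make predictions), one simple option is to refit it with lm() using the seven variables listed in the summary above:

# Refit the model chosen by forward selection (the 7 variables listed above)
forward_model <- lm(Price ~ Bedrooms + HostListings + Accomodates + Bathrooms +
                      Neighbourhood + ReviewsPerMonth + MinNights, data = airbnb)
summary(forward_model)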

Next, we can visualize how the AIC changes at each step using plot():

plot(forward)

8.2.2 Stepwise regression with backward elimination

Another way of performing stepwise regression is backward elimination, which is conceptually similar to forward selection. In backward elimination, we start with a “full” model that contains all possible variables, and then we identify the variable whose removal from the model improves the AIC the most, and repeat until the AIC cannot be improved further.

The R code for backward elimination is similar: first we run the full linear regression model using lm(), and then we use ols_step_backward_aic():

backward <- ols_step_backward_aic(model)

As with forward selection, we can print the result to see what happens at each step:

print(backward)
## 
## 
##                          Backward Elimination Summary                         
## ----------------------------------------------------------------------------
## Variable            AIC         RSS         Sum Sq       R-Sq      Adj. R-Sq 
## ----------------------------------------------------------------------------
## Full Model        506.367    21194.135    154525.485    0.87939      0.70450 
## Latitude          504.373    21196.657    154522.963    0.87937      0.71854 
## RatingClean       502.458    21232.830    154486.790    0.87917      0.73087 
## Beds              500.568    21279.825    154439.795    0.87890      0.74200 
## RatingCheckIn     499.030    21477.352    154242.268    0.87777      0.75046 
## MinNights         497.612    21728.535    153991.085    0.87635      0.75764 
## AvgRating         496.242    22004.159    153715.461    0.87478      0.76400 
## HostAcceptRate    495.231    22443.497    153276.123    0.87228      0.76821 
## Superhost         494.898    23204.834    152514.786    0.86794      0.76890 
## ----------------------------------------------------------------------------

In the table above, we can see the names of 8 variables that are eliminated from the model. In this example, the final model from backward elimination achieves \(AIC = 495\) and \(R^2 = 0.868\). Note that backward elimination arrives at a different final model than forward selection in this case.
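
As with forward selection, if we want the final backward-elimination model as a standalone lm() fit, one simple option is to drop the eight eliminated variables from the full model using update():

# Refit the final backward elimination model by removing the 8 eliminated variables
backward_model <- update(model, . ~ . - Latitude - RatingClean - Beds - RatingCheckIn -
                           MinNights - AvgRating - HostAcceptRate - Superhost)
summary(backward_model)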

Similar to before, we can use plot() to visualize the result of the stepwise regression based on backward elimination:

plot(backward)
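
As an aside, base R also provides the step() function, which performs the same kind of AIC-based stepwise search without any extra packages (forward selection is also supported through its scope and direction arguments). A minimal sketch for backward elimination, which should select a similar set of variables, is:

# Backward elimination with base R's step(); AIC is the default criterion
# and trace = 0 suppresses the step-by-step printout
base_backward <- step(model, direction = "backward", trace = 0)
summary(base_backward)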

8.3 LASSO regression

Another popular method for variable selection is LASSO regression. In general, LASSO is used when dealing with a huge number of potential variables, in which case stepwise regression can be very slow. (There are other differences that we won’t get into here).
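
Briefly, LASSO works by adding a penalty term to the usual least-squares objective. With coefficients \(\beta_0, \beta_1, \dots, \beta_p\), it minimizes

\[
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,
\]

where \(\lambda \ge 0\) controls the strength of the penalty. Larger values of \(\lambda\) force more coefficients to be exactly zero (so fewer variables are selected), while \(\lambda = 0\) gives back ordinary least squares.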

8.3.1 Building a LASSO regression model

LASSO regression can be done in R using the glmnet() function. In order to apply it to the Airbnb data, we first need to do a little bit of data manipulation. In particular, we need to manually convert the columns with character strings (i.e., words) into R factors, and then convert the factors into 0 or 1 binary variables using “one-hot” encoding. The code below does this:

airbnb$HostResponseTime <- as.factor(airbnb$HostResponseTime)  
airbnb$RoomType <- as.factor(airbnb$RoomType)  
airbnb$Neighbourhood <- as.factor(airbnb$Neighbourhood) 

airbnbone <- one_hot(as.data.table(airbnb))

If we take a look at the new data frame, we can see that it now has 33 columns, due to the conversion of the factor variables into 0-1 dummy variables. (It is commented out below to simplify HTML output, but you can run it in R Markdown to preview the new data frame.)

#head(airbnbone) 

Next, we need to split the data frame into independent and dependent variables. (We have to do this because the glmnet() function we will use for LASSO regression expects matrices as inputs, unlike the stepwise regression functions above, which work directly with a data frame.) For simplicity we will use x and y to denote the independent and dependent variables, respectively:

x <- as.matrix(subset(airbnbone, select = -c(Price)))
y <- as.matrix(subset(airbnbone, select = c(Price)))

Now that the data is formatted, we can call glmnet() to run a LASSO regression. (By default, glmnet() uses the LASSO penalty, which corresponds to its argument alpha = 1, so we do not need to specify anything beyond x and y.)

lasso <- glmnet(x, y)

Printing the results displays the number of variables selected at each value of \(\lambda\) (the Df column, i.e., the number of variables with non-zero coefficients):

print(lasso)
## 
## Call:  glmnet(x = x, y = y) 
## 
##    Df  %Dev Lambda
## 1   0  0.00 36.770
## 2   2  7.19 33.500
## 3   2 13.22 30.520
## 4   2 18.24 27.810
## 5   3 23.99 25.340
## 6   3 29.94 23.090
## 7   3 34.88 21.040
## 8   3 38.98 19.170
## 9   3 42.38 17.470
## 10  3 45.21 15.920
## 11  3 47.55 14.500
## 12  3 49.50 13.210
## 13  4 51.49 12.040
## 14  5 53.80 10.970
## 15  7 56.68  9.995
## 16  7 59.60  9.108
## 17  8 62.07  8.298
## 18  9 64.32  7.561
## 19  9 66.40  6.889
## 20  9 68.13  6.277
## 21  9 69.57  5.720
## 22  9 70.76  5.212
## 23 10 71.76  4.749
## 24 12 72.66  4.327
## 25 13 73.62  3.942
## 26 15 74.55  3.592
## 27 16 75.50  3.273
## 28 16 76.37  2.982
## 29 16 77.09  2.717
## 30 17 77.70  2.476
## 31 17 78.33  2.256
## 32 17 78.85  2.056
## 33 18 79.32  1.873
## 34 18 79.76  1.707
## 35 18 80.13  1.555
## 36 18 80.43  1.417
## 37 18 80.68  1.291
## 38 18 80.89  1.176
## 39 18 81.07  1.072
## 40 19 81.24  0.977
## 41 19 81.45  0.890
## 42 19 81.63  0.811
## 43 20 81.78  0.739
## 44 21 81.96  0.673
## 45 23 82.12  0.613
## 46 24 82.58  0.559
## 47 25 82.94  0.509
## 48 25 83.25  0.464
## 49 26 83.52  0.423
## 50 25 84.08  0.385
## 51 25 84.57  0.351
## 52 27 84.97  0.320
## 53 26 85.37  0.291
## 54 27 85.69  0.266
## 55 27 86.02  0.242
## 56 27 86.30  0.220
## 57 27 86.54  0.201
## 58 27 86.79  0.183
## 59 27 86.97  0.167
## 60 26 87.12  0.152
## 61 26 87.24  0.138
## 62 26 87.34  0.126
## 63 26 87.42  0.115
## 64 26 87.49  0.105
## 65 27 87.55  0.095
## 66 28 87.61  0.087
## 67 28 87.66  0.079
## 68 28 87.70  0.072
## 69 28 87.73  0.066
## 70 28 87.76  0.060
## 71 28 87.78  0.055
## 72 28 87.80  0.050
## 73 28 87.82  0.045
## 74 28 87.84  0.041
## 75 28 87.85  0.038
## 76 28 87.86  0.034
## 77 28 87.87  0.031
## 78 28 87.87  0.028
## 79 28 87.88  0.026
## 80 28 87.89  0.024
## 81 28 87.89  0.022
## 82 28 87.89  0.020
## 83 28 87.90  0.018
## 84 28 87.90  0.016
## 85 28 87.90  0.015
## 86 28 87.90  0.014
## 87 30 87.91  0.012
## 88 30 87.91  0.011
## 89 30 87.91  0.010
## 90 30 87.91  0.009
## 91 30 87.91  0.008
## 92 30 87.91  0.008
## 93 30 87.91  0.007
## 94 30 87.91  0.006

The table above represents 94 different regression models obtained for 94 different values of \(\lambda\). The R function glmnet() automatically decides which values of \(\lambda\) to test. For example, we can see that when \(\lambda\) is around 5, the LASSO selects 9 variables, and when \(\lambda\) is around 1, it selects 18 to 19 variables. (The middle column, %Dev, is the percentage of deviance explained, a goodness-of-fit measure that we won't worry about here.)
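
One convenient way to see how the coefficients shrink toward zero as \(\lambda\) grows is to plot the coefficient paths, where each curve corresponds to one variable:

# Plot each coefficient as a function of log(lambda);
# label = TRUE prints variable indices next to the curves
plot(lasso, xvar = "lambda", label = TRUE)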

To actually see the model output for a particular value of \(\lambda\), we use the coef() function and pass the desired value to its s argument. For example, if we want to see the results of the LASSO regression using \(\lambda = 10\), we can write

coef(lasso, s = 10)
## 33 x 1 sparse Matrix of class "dgCMatrix"
##                                             s1
## (Intercept)                         30.8880917
## HostResponseTime_a few days or more  .        
## HostResponseTime_within a day        .        
## HostResponseTime_within a few hours -7.7209538
## HostResponseTime_within an hour      .        
## HostResponseRate                     .        
## HostAcceptRate                       .        
## Superhost                            .        
## HostListings                         0.3786573
## Neighbourhood_Campo di Marte         .        
## Neighbourhood_Centro Storico         2.0365337
## Neighbourhood_Gavinana Galluzzo      .        
## Neighbourhood_Isolotto Legnaia       .        
## Neighbourhood_Rifredi                .        
## Latitude                             .        
## Longitude                            .        
## RoomType_Entire home/apt             .        
## RoomType_Hotel room                  .        
## RoomType_Private room                .        
## Accomodates                          7.4069029
## Bathrooms                           -0.4116008
## Bedrooms                            18.6908176
## Beds                                 .        
## MinNights                            .        
## AvgRating                            .        
## RatingAccuracy                       .        
## RatingClean                          .        
## RatingCheckIn                        .        
## RatingCommunication                  .        
## RatingLocation                       .        
## RatingValue                          .        
## InstantBook                          .        
## ReviewsPerMonth                     -1.2924361

The variables selected are those with non-zero coefficients. You can try varying the \(\lambda\) parameter in coef() to see how it changes the variables that are selected along with their coefficients.

8.3.2 Cross-validation of LASSO regression

It is important to remember that \(\lambda\) is a parameter that we need to choose before building a LASSO regression model. The glmnet() function builds many different regression models by varying \(\lambda\) (in our case, the variable lasso is storing the results of 94 different regression models).

Of course, usually we want to identify one good model, and the output doesn’t tell us anything about what value of \(\lambda\) we should use. So how do we select the “best” value of \(\lambda\) in the first place?

The standard approach to picking \(\lambda\) is cross-validation. Recall that the main idea behind cross-validation is as follows:

  1. Pick a value of \(\lambda\),
  2. Randomly split the dataset into 10 different segments called folds,
  3. Build the LASSO regression model using data from 9 folds, and measure its prediction error on the 10th fold that was “left out” from the model building phase,
  4. Repeat by rotating through all 10 folds, and then calculate the average prediction error over the 10 folds. This produces a single number – the cross-validation error – which we can interpret as a measure of model performance (under the value of \(\lambda\) that we have selected).
  5. Repeat steps 1 through 4 for many different values of \(\lambda\), and pick the value that achieves the desired balance between predictive performance and model complexity (discussed more below).

Luckily, the glmnet package comes with a built-in function for cross-validation, so we can perform the above steps in a single line!

cv.lasso <- cv.glmnet(x, y, type.measure = "mae")

Next, we can use plot() to visualize the cross-validation errors:

plot(cv.lasso)

The horizontal axis of the plot shows different values of \(\lambda\) (by default, this is given in terms of \(\log(\lambda)\) instead of \(\lambda\), which just makes the results easier to visualize). The vertical axis reports the associated cross-validation error. Across the top of the plot, we can also see the number of variables that are selected at each value of \(\lambda\). The dashed vertical line on the left marks the value of \(\lambda\) that achieves the minimum cross-validation error, which is what we would usually choose for our "final" model; the second dashed line marks the largest \(\lambda\) whose error is within one standard error of that minimum, which gives a simpler (more conservative) model.

To print the values of \(\lambda\) corresponding to the minimum cross-validation error and to the one-standard-error rule, we write

print(cv.lasso)
## 
## Call:  cv.glmnet(x = x, y = y, type.measure = "mae") 
## 
## Measure: Mean Absolute Error 
## 
##     Lambda Index Measure    SE Nonzero
## min  4.749    23   28.13 5.133      10
## 1se 12.040    13   32.93 5.209       4

The table above shows that the cross-validation error is minimized at \(\lambda = 4.7\), which produces a mean absolute error of about 28 (in the units of Price) and selects 10 variables. The 1se row reports the largest \(\lambda\) whose error is within one standard error of that minimum (here \(\lambda = 12.0\), which selects only 4 variables). A general rule of thumb is to select the value of \(\lambda\) that minimizes the error, but ultimately, the decision of which value of \(\lambda\) to use for the final model depends on your preference for how to balance model complexity and prediction error.
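
Once we have settled on a value of \(\lambda\), the cv.lasso object makes it easy to pull out the corresponding model. For example, we can extract the coefficients and generate fitted prices at the error-minimizing \(\lambda\) (using s = "lambda.1se" instead would give the one-standard-error model):

# Coefficients of the LASSO model at the lambda that minimizes cross-validation error
coef(cv.lasso, s = "lambda.min")

# Predicted prices for the listings in x at the same lambda (first few rows)
head(predict(cv.lasso, newx = x, s = "lambda.min"))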

Forward selection, backward elimination, and LASSO regression are three common approaches to variable selection. An advantage of forward selection and backward elimination is that they do not require choosing a value of \(\lambda\) through cross-validation, which LASSO does. An advantage of LASSO is that it can be computationally much faster than stepwise regression when there are a large number of variables to choose from.

Regardless of which approach we use, it is important to remember there is no “one-size-fits-all” approach to variable selection. These methods should be used as helpful guides only, and we should always aim to incorporate background knowledge and domain expertise into the variable selection process wherever possible.