12 Regularization
Up to this point we’ve been fitting various types of models (linear regression, logistic regression, kNN) to predict some continuous or categorical target. We’ve been working with fairly simple datasets that don’t have a huge number of features and do have a good number of observations. Together this makes the models easy to fit and interpret.
But in the age of big data we often have a massive number of features. It’s easy to collect so many details: if you’re Netflix you’re taking in every user’s age, gender, income, how long they watch a TV show, the last show they watched, the number of action shows watched, the number of romantic comedies watched, the time of day they last watched, the number of times a week they watch, how long they hovered over something, and hundreds of other things. In this case, how do we know which features to use? Should we just include them all? That isn’t a good solution, as the more features you include the more flexible your model is, so you may start seeing high variance and thus high test error. Even if test error stays reasonable, what if you have 50 features that are significant? How do you figure out which ones are the most important? Other times we may have lots of features but relatively few rows of data. What do we do in this case?
Chapter 6 of your book deals solely with this issue: how do we find the optimal model when there are thousands or millions of potential combinations of features, each of which would need to be fit with its own model? We’re going to cover model regularization in detail, but I want to briefly cover the other methods here even though we’re not going to apply them beyond this chapter.
12.1 Dealing with lots of features - a few methods in brief
Let’s use the Olympic dataset that’s part of the ade4
package. This is a dataset where you try to predict the score in an Olympic decathlon based on performance in the 10 events. There are only 33 observations but 10 features, which makes it ideal for illustrating how having lots of features can be a problem: it’s very easy to overfit.
library(ade4)
Load the data, convert it to a data frame, and check it out. We see that we have the 10 different events and then a score feature.
data("olympic")
oly <- data.frame(olympic)
glimpse(oly)
## Rows: 33
## Columns: 11
## $ tab.100 <dbl> 11.25, 10.87, 11.18, 10.62, 11.02, 10.83, 11.18, 11.05, 11...
## $ tab.long <dbl> 7.43, 7.45, 7.44, 7.38, 7.43, 7.72, 7.05, 6.95, 7.12, 7.28...
## $ tab.poid <dbl> 15.48, 14.97, 14.20, 15.02, 12.92, 13.58, 14.12, 15.34, 14...
## $ tab.haut <dbl> 2.27, 1.97, 1.97, 2.03, 1.97, 2.12, 2.06, 2.00, 2.03, 1.97...
## $ tab.400 <dbl> 48.90, 47.71, 48.29, 49.06, 47.44, 48.34, 49.34, 48.21, 49...
## $ tab.110 <dbl> 15.13, 14.46, 14.81, 14.72, 14.40, 14.18, 14.39, 14.36, 14...
## $ tab.disq <dbl> 49.28, 44.36, 43.66, 44.80, 41.20, 43.06, 41.68, 41.32, 42...
## $ tab.perc <dbl> 4.7, 5.1, 5.2, 4.9, 5.2, 4.9, 5.7, 4.8, 4.9, 5.2, 4.8, 4.7...
## $ tab.jave <dbl> 61.32, 61.76, 64.16, 64.04, 57.46, 52.18, 61.60, 63.00, 66...
## $ tab.1500 <dbl> 268.95, 273.02, 263.20, 285.11, 256.64, 274.07, 291.20, 26...
## $ score <dbl> 8488, 8399, 8328, 8306, 8286, 8272, 8216, 8189, 8180, 8167...
12.1.1 Subset selection
One method of reducing the number of features involves fitting models and then adding or removing features based on whether they improve fit. You could fit all possible combinations of features, but this rapidly explodes into fitting thousands or more models. Simpler methods add features one at a time (forward stepwise selection) or remove them one at a time (backward stepwise selection) based on whether they improve fit. The forward and backward methods are conceptually similar. For forward selection, we’d start by fitting the following model.
lm(score ~ tab.100, data = oly)
Then we’d add another feature. If it improves the \(R^2\) (or some other model fit measure of choice) then we’d keep that extra feature. Say we fit this model.
lm(score ~ tab.100 + tab.long, data = oly)
Great, we can say that helped, so let’s add another feature.
lm(score ~ tab.100 + tab.long + tab.poid, data = oly)
Let’s say that tab.poid
didn’t help, so we remove it and replace it with something else.
lm(score ~ tab.100 + tab.long + tab.haut, data = oly)
While in theory this is less computationally intensive than fitting all possible models, it’s also clunky and doesn’t guarantee that you wind up with the best model. As we’ve seen in earlier chapters, the relative importance of one feature can depend on the inclusion of another, so the order in which features are added and removed influences what model you wind up with. You might have found a different “best” model if you had started down a different path by fitting a different feature first. Finally, there is debate about whether repeatedly fitting different combinations amounts to multiple hypothesis testing. The problem with fitting models in such a non-specific way, repeatedly, to the same dataset is that you can stumble upon significant p-values by chance (indeed, using a p = 0.05 threshold we’d expect a coefficient to appear significantly non-zero 5% of the time even when there is no true effect).
Point being, although this method is out there, it has its own set of issues that you should be aware of!
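If you do want to try it rather than adding features by hand, base R’s step() function automates the process. Below is a minimal sketch of forward selection on the decathlon data; note that step() adds features based on AIC (covered in the next section) rather than \(R^2\), and whatever model it lands on should be treated with the caveats above in mind.
# a minimal sketch of forward stepwise selection with base R's step()
null_mod <- lm(score ~ 1, data = oly)   # intercept-only starting point
full_mod <- lm(score ~ ., data = oly)   # all 10 events
forward_mod <- step(null_mod,
                    scope = list(lower = formula(null_mod),
                                 upper = formula(full_mod)),
                    direction = "forward",
                    trace = 0)           # set trace = 1 to print each step
summary(forward_mod)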
12.1.2 Penalizing models for having too many features
12.1.2.1 Adjusted \(R^2\)
Another way to ensure you don’t have too many features in your model is to fit some subset of models, but then judge them using a fit metric that adds a penalty for each extra parameter.
We remember our \(R^2\) measure which is calculated from your total sum of squares (\(TSS\)) and the residual sum of squares (\(RSS\)) of the fitted model \[R^2 = 1 - \frac{RSS}{TSS}\]
Well, there’s also an Adjusted \(R^2\) measure which penalizes based on the number of features, \(d\). You can see from the formula below that as more features are included, \(n - d - 1\) shrinks, so the \(RSS/(n-d-1)\) term grows unless the added feature reduces the \(RSS\) enough to compensate. Thus, your adjusted \(R^2\) will be smaller than the non-adjusted \(R^2\), and it only increases when a new feature genuinely improves the fit. \[Adjusted\:R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\]
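To see the penalty in action, we can pull the adjusted \(R^2\) from summary() and also compute it by hand from the formula. This is just a quick check on the full decathlon model.
# adjusted R^2: reported by summary() and computed by hand from the formula
fit <- lm(score ~ ., data = oly)
summary(fit)$r.squared       # plain R^2
summary(fit)$adj.r.squared   # adjusted R^2 (a bit lower)

n   <- nrow(oly)
d   <- length(coef(fit)) - 1 # number of features (intercept not counted)
rss <- sum(residuals(fit)^2)
tss <- sum((oly$score - mean(oly$score))^2)
1 - (rss / (n - d - 1)) / (tss / (n - 1))  # matches summary()$adj.r.squared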
12.1.2.2 AIC, BIC, DIC, Cp
AIC, BIC, and DIC are other commonly used methods that penalize your model’s fit based on the number of features. These are all scores that you calculate when fitting different models, and then use to compare the models; the best model is the one with the lowest score. These methods are very commonly used in academia (I’ve published lots of papers using them), but less so for non-academic purposes.
The AIC formula is below (the others are very similar, so just refer to your book). The key thing to notice is that the AIC grows for every additional parameter. Thus, given that you judge fit based on which model has the lowest AIC, a model with an extra parameter must have a sufficiently lower RSS (and thus a better fit) to offset the extra-parameter penalty. \[AIC = \frac{1}{n\hat\sigma^2}(RSS + 2d\hat\sigma^2)\]
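In R you rarely compute this by hand; the built-in AIC() and BIC() functions will score any fitted lm for you. They use the likelihood-based definition rather than the RSS form above, so the numbers differ, but the interpretation is the same: lower is better.
# comparing two candidate models with the built-in AIC() and BIC()
mod_small <- lm(score ~ tab.100 + tab.long, data = oly)
mod_big   <- lm(score ~ ., data = oly)
AIC(mod_small)
AIC(mod_big)   # lower AIC = preferred model after the parameter penalty
BIC(mod_small) # BIC works the same way but penalizes extra parameters harder
BIC(mod_big)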
12.2 Shrinkage methods - AKA Regularization
The class of models that we’re going to focus on use what’s called regularization. Remember that when fitting a linear regression model the goal is to find the fit line that has the smallest residual sum of squares \(RSS\)?
The closer the estimated value \(\hat{y}_i\) is to the actual value \(y_i\), the lower the \(RSS\). You can minimize this value using the formulas for \(\beta_1\) and \(\beta_0\) that we learned back in our linear regression lesson. The point here is that the goal is to minimize \(RSS\) \[\sum_{i = 1}^{n} (y_i - \hat{y}_i)^2 = RSS\] Well, what regularization does is make it so you instead minimize \(RSS\) PLUS another value that gets bigger as your \(\beta\) estimates grow. This biases the model in favor of smaller, more conservative \(\beta\) coefficients, which will likely have lower variance than the non-regularized regression model.
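To make the “RSS plus a penalty” idea concrete, here’s a tiny illustration (not how the models are actually fit; glmnet does the real minimization for us later) that computes a penalized objective for an ordinary lm() fit with an arbitrary penalty strength. The squared-coefficient penalty used here is the ridge version we’ll meet next.
# illustration: RSS plus a penalty that grows with the size of the betas
fit    <- lm(score ~ ., data = oly)
rss    <- sum(residuals(fit)^2)
betas  <- coef(fit)[-1]           # the intercept is not penalized
lambda <- 10                      # arbitrary penalty strength for illustration
rss                               # plain least squares objective
rss + lambda * sum(betas^2)       # penalized objective: bigger betas cost more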
The two models we’re going to learn about are Ridge Regression and Lasso Regression. These both rely on this concept of shrinking your \(\beta\) coefficients through a penalty. This shrinking allows you to fit all your features without the worries of overfitting, and thus having high variance and poor test error, or having to do some sort of stepwise selection.
Let’s start with Ridge Regression
12.2.1 Ridge Regression
Ridge regression fits a regression model by minimizing the following. The \(RSS\) part is the same, but you now have to minimize the right part of the equation as well. The \(\sum_{j=1}^{p}\beta_{j}^2\) term says that part of what it’s trying to minimize is the sum of your squared \(\beta\) coefficients. Thus it’ll try to shrink your \(\beta\)’s as part of fitting the model. The other key part is \(\lambda\) (lambda). \(\lambda\) is what’s known as a tuning parameter or hyperparameter. I’ll get to how we find our optimal \(\lambda\) next, but for now the key thing to realize is that if we put in a small \(\lambda\), the model won’t be penalized much for large \(\beta\)’s, and thus won’t shrink them very much. Indeed, if \(\lambda\) is zero then it’s just regular old least squares regression! But as \(\lambda\) grows the model will penalize your \(\beta\)’s more and more and thus will shrink your coefficients towards zero. This is pretty cool, as it lets you fit all your features while forcing the features with little importance to have little to no effect (though, as we’ll see, only Lasso will actually push coefficients all the way to zero).
\[RSS + \lambda\sum_{j=1}^{p}\beta_{j}^2\]
A quick note on \(\lambda\) before fitting
As mentioned above, how do we find the optimal \(\lambda\) value for our model? Too small and we won’t penalize enough, giving high variance on our test data; too large and we’ll penalize too much, giving high bias on our test data. Both result in high test error, which is bad. The ideal way to tackle this is to test a range of \(\lambda\) values and see which one gives the lowest error. Thus we are ‘tuning’ the parameter to find its ideal value. Luckily the package we’re going to use can not only take a range of \(\lambda\) values, but will cross-validate them as well, all in one go!
12.2.1.1 Fitting our Ridge Regression model
Let’s use the Olympic decathlon data for this example. We’re going to start by splitting our data, then finding our optimal \(\lambda\), then using that \(\lambda\) value to fit our model. We’ll be using the package glmnet
which we’ll load now.
library(glmnet)
12.2.1.1.1 Splitting our data
Although glmnet is great at fitting models, it doesn’t allow you to specify your models in the target ~ features
syntax that we’re used to. Instead you need your features (x) to be a matrix. We can use the function model.matrix()
to convert our data frame to a matrix of features. The nice part is that this will also one-hot encode any categorical variables for you.
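As a quick aside, here’s a toy example (a made-up three-row data frame) showing what that one-hot encoding looks like; the decathlon data happens to be all numeric, so nothing changes there.
# toy example: model.matrix() turns a factor into 0/1 dummy columns
toy <- data.frame(y     = c(10, 20, 30),
                  size  = c(1.2, 3.4, 5.6),
                  color = factor(c("red", "blue", "red")))
model.matrix(y ~ ., data = toy)[, -1]  # 'color' becomes a 0/1 dummy column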
Below we use the model.matrix()
function, feeding it the formula of “target ~ .” so the model knows what is the target and what isn’t. You also specify where to find the data, of course. The [,-1]
is removing the intercept that the function will calculate. Check the book for why we don’t need this. We also just grab our target value score
and assign it to target.
set.seed(888) # set seed for reproducibility
features <- model.matrix(score~., data = oly)[,-1] # convert features to matrix
target <- oly$score # get target

split_index <- createDataPartition(oly$score, p = 0.7, list = F) # create split_index
#split data as usual.
#Note we don't need to select column for features as we already removed the target!
features_train_m <- features[split_index,]
features_test_m <- features[-split_index,]
target_train_m <- target[split_index]
target_test_m <- target[-split_index]
12.2.1.2 Finding \(\lambda\)
Let’s find our \(\lambda\) now. First we’ll specify an exponential sequence that ranges from tiny (0.01) to huge (10^10). We do this by making a sequence from -2 to 10 with 1,000 evenly spaced values and then raising 10 to each value in the sequence. Go play with the individual pieces of code directly in the console to better understand how this works; for example, go enter seq(-2, 10, length = 100). For now, we’ll just generate our vector of lambda values.
lambda_vals <- 10^seq(-2, 10, length = 1000)
lambda_vals[c(1,2,3, 998,999,1000)] # look at first and last three to check
## [1] 1.000000e-02 1.028045e-02 1.056876e-02 9.461848e+09 9.727203e+09
## [6] 1.000000e+10
Now we’ll use the cv.glmnet()
function to fit our model across the varying lambdas. We specify our features first, then our target. The alpha = 0
is telling glmnet()
that you want this to be ridge regression (alpha = 1
is for LASSO). We can then call the lambda.min
from our model object to give us the \(\lambda\) value that has the lowest \(MSE\). We can ignore the warning.
ridge_cv <- cv.glmnet(features_train_m, target_train_m, alpha = 0, lambda = lambda_vals, grouped = FALSE)
bestlam <- ridge_cv$lambda.min
bestlam # value of lambda that minimizes error
## [1] 2.90043
12.2.1.3 Using our optimal lambda to fit the model and generate predictions
Let’s use that \(\lambda\) to fit our model and make predictions. We’ll first specify a general model with just glmnet()
, then use our trusty predict()
function to generate target predictions using our test features. We can tell it what \(\lambda\) to use when we specify s
. One thing that differs here is that we don’t tell it our test features using newdata =
. Instead we use newx =
.
# fit model
ridge_mod <- glmnet(features_train_m, target_train_m, alpha = 0)

# get predictions
ridge_preds <- predict(ridge_mod, s = bestlam, newx = features_test_m)
# calculate MSE
mean((ridge_preds - target_test_m)^2)
## [1] 4900.913
And there we have the test \(MSE\) of our ridge regression model! In order to know if this is good or bad we’ll want to cross validate it and compare to other models, which we’ll get to later.
12.2.1.4 Exploring our \(\beta\) coefficients and the effect of \(\lambda\)
Let’s take a minute to show how our \(\lambda\) value influences shrinkage of our \(\beta\)’s.
First, to get our coefficients from a glmnet()
model we need to use predict()
with some slight modification. We can see how performance in each event influences overall score in the decathlon. tab.100
is the time in the 100m dash. It’s not surprising that for every extra second it takes someone to run it, their score drops by 247.75 points. And for every extra meter someone throws the javelin, their score increases by 15.15 points.
<- predict(ridge_mod, type = "coefficients", s = bestlam)[1:11,]
ridge_coef ridge_coef
## (Intercept) tab.100 tab.long tab.poid tab.haut tab.400
## 8892.77251 -247.74824 227.14713 59.12515 897.85635 -47.26363
## tab.110 tab.disq tab.perc tab.jave tab.1500
## -108.95045 18.07661 288.76504 15.15348 -5.63988
Let’s show what happens when we increase our \(\lambda\). Currently our lambda is
bestlam
## [1] 2.90043
This is really small and suggests that with the current split we are not having to regularize very much. We can rerun the above but instead specify a larger \(\lambda\) in the s =
argument. Notice how a lot of our \(\beta\) coefficients shrink closer to zero when \(\lambda\) = 100. That is forcing them to have less importance.
<- predict(ridge_mod, type = "coefficients", s = 100)[1:11,]
ridge_coef ridge_coef
## (Intercept) tab.100 tab.long tab.poid tab.haut tab.400
## 8936.540199 -237.586430 223.727287 64.995875 841.785737 -48.036237
## tab.110 tab.disq tab.perc tab.jave tab.1500
## -105.975171 15.066914 257.297538 13.366074 -4.663467
Let’s try a huge \(\lambda\). We see that most of our \(\beta\)’s are near zero now.
<- predict(ridge_mod, type = "coefficients", s = 1000000)[1:11,]
ridge_coef ridge_coef
## (Intercept) tab.100 tab.long tab.poid tab.haut
## 7.855720e+03 -9.931595e-34 8.884643e-34 2.375378e-34 2.525939e-33
## tab.400 tab.110 tab.disq tab.perc tab.jave
## -1.960888e-34 -5.001708e-34 4.291940e-35 8.180136e-34 3.534778e-35
## tab.1500
## -9.505518e-36
What about when \(\lambda\) = 0?
<- predict(ridge_mod, type = "coefficients", s = 0)[1:11,]
ridge_coef ridge_coef
## (Intercept) tab.100 tab.long tab.poid tab.haut tab.400
## 8892.77251 -247.74824 227.14713 59.12515 897.85635 -47.26363
## tab.110 tab.disq tab.perc tab.jave tab.1500
## -108.95045 18.07661 288.76504 15.15348 -5.63988
Those are virtually identical to the coefficients from a regular multiple regression model! So, the smaller the \(\lambda\) the less the shrinkage, while the bigger the \(\lambda\) the greater the shrinkage. To confirm, let’s fit a plain least squares regression on the same training data and compare the coefficients.
#bind training data back together
oly_full <- data.frame(features_train_m, target_train_m)
oly_full <- oly_full %>% rename(score = 'target_train_m') # rename
lr_mod <- lm(score ~ ., data = oly_full) # fit
summary(lr_mod)$coefficients[,1] # get beta coefficients
## (Intercept) tab.100 tab.long tab.poid tab.haut tab.400
## 8863.518740 -252.615719 227.809833 47.257327 905.330092 -43.096275
## tab.110 tab.disq tab.perc tab.jave tab.1500
## -115.480683 21.511068 312.386062 16.551194 -6.417975
12.2.2 Lasso Regression
Lasso works much like ridge regression. It still uses \(\lambda\) in the same way, but it differs in how it penalizes the \(\beta\) coefficients. Instead of using \(\lambda\sum_{j=1}^{p}\beta_{j}^2\) it uses \(\lambda\sum_{j=1}^{p}|\beta_{j}|\). Because of this it can push coefficients all the way to zero, thus selecting features for you. Be sure to read about how this is done in the book itself.
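A quick way to see this shrinking-to-zero behaviour (reusing the training matrices from the ridge section) is glmnet’s built-in coefficient path plot, which traces each \(\beta\) as \(\lambda\) grows.
# coefficient paths for the lasso: each line is one beta, shrinking (and
# eventually hitting exactly zero) as lambda increases
lasso_path <- glmnet(features_train_m, target_train_m, alpha = 1)
plot(lasso_path, xvar = "lambda", label = TRUE)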
Let’s fit our Lasso model
12.2.2.1 Finding \(\lambda\)… again
We need to find the optimal \(\lambda\) again. We’ll use the same process as before. The only change now is that to fit a lasso model we specify alpha = 1
in our cv.glmnet()
function.
lambda_vals <- 10^seq(-2, 10, length = 1000)
lambda_vals[c(1,2,3, 998,999,1000)] # look at first and last three to check
## [1] 1.000000e-02 1.028045e-02 1.056876e-02 9.461848e+09 9.727203e+09
## [6] 1.000000e+10
lasso_cv <- cv.glmnet(features_train_m, target_train_m, alpha = 1, lambda = lambda_vals, grouped = FALSE)
bestlam <- lasso_cv$lambda.min
bestlam # value of lambda that minimizes error
## [1] 0.01
12.2.2.2 Calculating MSE of our Lasso model
We follow the same fitting procedure as before. Here our Lasso model appears to be doing better than our ridge model, but given that the dataset is so small we’ll want to cross-validate these results a bunch of times. We’ll do that a bit later.
# fit model
lasso_mod <- glmnet(features_train_m, target_train_m, alpha = 1)

# get predictions with cross validated lambda
lasso_preds <- predict(lasso_mod, s = bestlam, newx = features_test_m)
# calculate MSE
mean((lasso_preds - target_test_m)^2)
## [1] 4804.733
12.2.2.3 Playing with \(\lambda\)
Just to illustrate how our Lasso model will actually zero out coefficients, let’s input a \(\lambda\) of 200. Doing so, we see that most of our \(\beta\) coefficients are now exactly zero. This means the model doesn’t even try to predict with the unimportant features.
<- predict(lasso_mod, type = "coefficients", s = 200)[1:11,]
lasso_coef lasso_coef
## (Intercept) tab.100 tab.long tab.poid tab.haut tab.400
## 5586.54641 0.00000 217.23221 27.57721 0.00000 0.00000
## tab.110 tab.disq tab.perc tab.jave tab.1500
## 0.00000 0.00000 70.27826 0.00000 0.00000
12.2.3 Cross-validating our models
Of course, we learned last week that we should calculate an average test error for our models as randomness in the splits can cause variation in test error. Thus, by looking at many splits you gain more confidence in the accuracy of your models.
12.2.4 Making our split and empty data frame
First, we’ll make our split index but set times = 100 so we generate 100 split indexes. We’re doing 100 folds because the sample size is so small that we’ll likely see lots of variation in error across samples; doing more folds helps us hone in on a more stable measure.
split_index <- createDataPartition(oly$score, p = 0.7, list = F, times = 100)
Now we’ll make our empty data frame
comp_df <- data.frame(matrix(ncol = 4, nrow = ncol(split_index)))
colnames(comp_df) <- c('lin_reg_mse', 'ridge_mse', 'lasso_mse', 'fold')
12.2.4.1 Making our for loop
Below is a loop that splits our data, runs the models, and fills our empty data frame with their errors. You’ll notice that I also include code to fit a regular linear regression model so we can see whether regularization is helping over that.
for(i in 1:ncol(split_index)) {
  # split for linear regression
  features_train <- oly[split_index[,i], !(colnames(oly) %in% c('score'))]
  features_test <- oly[-split_index[,i], !(colnames(oly) %in% c('score'))]
  target_train <- oly[split_index[,i], c('score')]
  target_test <- oly[-split_index[,i], c('score')]

  # preprocess for linear regression
  oly_pp <- preProcess(features_train, method = c('scale'))
  features_train <- predict(oly_pp, newdata = features_train)
  features_test <- predict(oly_pp, newdata = features_test)

  # join and rename features and target
  training <- data.frame(features_train, target_train)
  training <- training %>%
    rename(score = target_train)

  # fit linear regression
  reg_mod <- lm(score ~ ., data = training)
  reg_pred <- predict(reg_mod, newdata = features_test)

  # calculate MSE and add to DF
  reg_mse <- mean((target_test - reg_pred)^2)
  comp_df[i, 'lin_reg_mse'] <- reg_mse
  comp_df[i, 'fold'] <- i

  # generate x and y (this doesn't have to be here but will leave for clarity)
  features <- model.matrix(score~., oly)[,-1]
  target <- oly$score
  lambda <- 10^seq(10, -2, length = 100)

  # split x and y
  features_train_m <- features[split_index[,i],]
  features_test_m <- features[-split_index[,i],]
  target_train_m <- target[split_index[,i]]
  target_test_m <- target[-split_index[,i]]

  # fit ridge model
  ridge_mod <- glmnet(features_train_m, target_train_m, alpha = 0)
  cv_out <- cv.glmnet(features_train_m, target_train_m, alpha = 0, lambda = lambda, grouped = FALSE)
  bestlam <- cv_out$lambda.min
  ridge_pred <- predict(ridge_mod, s = bestlam, newx = features_test_m)

  # calculate error and add
  ridge_mse <- mean((target_test_m - ridge_pred)^2)
  comp_df[i, 'ridge_mse'] <- ridge_mse

  # same for lasso
  lasso_mod <- glmnet(features_train_m, target_train_m, alpha = 1)
  cv_out <- cv.glmnet(features_train_m, target_train_m, alpha = 1, lambda = lambda, grouped = FALSE)
  bestlam <- cv_out$lambda.min
  lasso_pred <- predict(lasso_mod, s = bestlam, newx = features_test_m)
  lasso_mse <- mean((lasso_pred - target_test_m)^2)
  comp_df[i, 'lasso_mse'] <- lasso_mse
}
12.2.4.2 Graphing our results
Let’s do some data wrangling to get our data into an easier-to-read format for ggplot. Currently the format is ‘wide’, such that each error measure has its own column.
head(comp_df)
## lin_reg_mse ridge_mse lasso_mse fold
## 1 431.4795 876.8839 463.8869 1
## 2 343.0930 334.8495 307.8185 2
## 3 312.6817 654.1425 329.3818 3
## 4 1269.7818 1316.6070 1198.3101 4
## 5 5798.1215 4533.6053 5572.5744 5
## 6 3698.2821 4470.8438 4494.1759 6
But ggplot (and most plotting tools) works better if the data is in ‘long’ format, such that error type is its own factor with its associated measure. We can use the gather()
function that’s part of tidyverse
to ‘gather’ our columns into a single one.
data_long <- gather(data = comp_df, key = type, value = error, c('lin_reg_mse', 'ridge_mse', 'lasso_mse'), factor_key = TRUE)
data_long[c(1,2,3,101,102,103,201,202,203),] # check out a few values
## fold type error
## 1 1 lin_reg_mse 431.4795
## 2 2 lin_reg_mse 343.0930
## 3 3 lin_reg_mse 312.6817
## 101 1 ridge_mse 876.8839
## 102 2 ridge_mse 334.8495
## 103 3 ridge_mse 654.1425
## 201 1 lasso_mse 463.8869
## 202 2 lasso_mse 307.8185
## 203 3 lasso_mse 329.3818
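As an aside, newer versions of tidyr recommend pivot_longer() over gather(); a roughly equivalent call on the same comp_df would be the following, and it produces the same long data frame (row order aside).
# same reshaping with pivot_longer(), gather()'s newer replacement
data_long_alt <- comp_df %>%
  pivot_longer(cols = c('lin_reg_mse', 'ridge_mse', 'lasso_mse'),
               names_to = 'type', values_to = 'error') %>%
  mutate(type = factor(type, levels = c('lin_reg_mse', 'ridge_mse', 'lasso_mse')))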
This is a better format as we can plot different lines by color by just specifying color = type
inside our aesthetic call.
ggplot(data_long,
aes(x = fold, y = error, color = type)) +
geom_line() +
labs(x = 'Fold Number', y = 'Mean Squared Error') +
theme_classic()
And let’s take the mean \(MSE\) of each model while we’re at it.
mean(comp_df$lin_reg_mse)
## [1] 3031.229
mean(comp_df$ridge_mse)
## [1] 2893.818
mean(comp_df$lasso_mse)
## [1] 3130.283
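Since we already have the long-format data frame, we could also get all three means in one call (just a convenience).
# mean MSE per model type, computed from the long-format data
data_long %>%
  group_by(type) %>%
  summarize(mean_mse = mean(error))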
Looks like our ridge did the best. Let’s see what the actual mean error in our predictions is by taking the square root of the mean square error.
sqrt(mean(comp_df$ridge_mse))
## [1] 53.79422
12.2.4.3 Interpreting our figure and wrapping up.
So, it looks like our ridge regression generally performs the best; it has the lowest average \(MSE\) over our 100 folds. On average our score predictions are off by about 54 points when using ridge regression. We can also see from the graph that the \(MSE\) estimates for all models jump around a ton. This shouldn’t surprise us given that we have only 33 observations total, so our test set is only 8 values with the current split proportion! It is all too easy both to overfit a model with such limited training data and to get strange test sets that are really hard to predict. Both of these cause test \(MSE\) to jump around from fold to fold.
Although this is a case where ridge regression improves on run-of-the-mill least squares linear regression, that won’t always be the case. In other datasets you might find that Lasso models fit better. Your book covers more detail regarding situations where one might be better than the other. From this lesson you should now understand how to fit these models and what the goals of each model type are.
It’s worth wrapping up by saying that just because these models have a number of benefits over regular least squares linear regression, it doesn’t mean you should default to them. One downside is the extra step of finding \(\lambda\). The bigger downside, in my opinion, is that although you get \(\beta\) coefficients, you don’t readily get their confidence intervals and p-values. If your goal is to make inferences about which features explain a pattern, rather than to predict new data points, then these regularization methods might not be the best choice. So, all these models are just tools, each with its own pros and cons. By learning these pros and cons you’ll be able to flexibly gain the best inference or predictive ability from a dataset rather than blindly following a cookbook.