7 Machine Learning 1: Shrinkage Estimation

In this exercise you familiarize yourself with the details behind shrinkage regression methods such as Ridge and Lasso. Although R provides excellent interfaces which perform the estimation flawlessly, you are first asked to implement Ridge and Lasso regression estimators from scratch before moving on to the package glmnet.

7.1 Introduction to penalized regressions

Exercises:

  • Load the macro_predictors table from the tidy_finance.sqlite database. Create a variable y that contains the market excess returns from the Goyal-Welch dataset (rp_div) and a matrix X that contains the remaining macroeconomic variables except the column month. You will use X and y to perform penalized regressions. Try to run a simple linear regression of y on X to analyze the relationship between the macroeconomic variables and market returns. Which problems do you encounter?
  • Write a function that requires three inputs, y (a \(T\) vector), X (a \((T \times K)\) matrix), and lambda, and which returns the Ridge estimator (a \(K\) vector) for a given penalization parameter \(\lambda\). Recall that the intercept should not be penalized. Therefore, your function should allow you to indicate whether \(X\) contains a vector of ones as its first column, which should be exempt from the \(L_2\) penalty.
  • Compute the \(L_2\) norm (\(\beta'\beta\)) of the regression coefficients from the predictive regression in the previous exercise for a range of \(\lambda\)’s and illustrate the effect of the penalization in a suitable figure.
  • Now, write a function that requires three inputs, y (a \(T\) vector), X (a \((T \times K)\) matrix), and lambda, and which returns the Lasso estimator (a \(K\) vector) for a given penalization parameter \(\lambda\). Recall that the intercept should not be penalized. Therefore, your function should allow you to indicate whether \(X\) contains a vector of ones as its first column, which should be exempt from the \(L_1\) penalty.
  • After you are really sure you understand what Ridge and Lasso regression are doing, familiarize yourself with the documentation of the package glmnet. It is a thoroughly tested and well-established package that provides efficient code to compute the penalized regression coefficients not only for Ridge and Lasso but also for combinations thereof, commonly called elastic nets.

Solutions:

# Load required packages
library(RSQLite)
library(tidyverse)

# Read in the data
tidy_finance <- dbConnect(SQLite(), "../Lectures/data/tidy_finance.sqlite", extended_types = TRUE)

macro_predictors <- tbl(tidy_finance, "macro_predictors") %>%
  collect()

y <- macro_predictors$rp_div
X <- macro_predictors %>%
  select(-month, -rp_div) %>%
  as.matrix()

# OLS for the Goyal-Welch data fails because X is not of full rank
c(ncol(X), Matrix::rankMatrix(X))
## [1] 13 11

As for OLS, the objective is to minimize the sum of squared errors \((y - X\beta)'(y - X\beta)\), now under the condition that \(\sum\limits_{j=2}^K \beta_j^2 \leq t(\lambda)\) if an intercept is present. This condition can be rewritten as \(\beta'A\beta \leq t(\lambda)\), where \(A = \begin{pmatrix}0 & 0 & \ldots & 0\\ 0 & 1 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & 1\end{pmatrix}\), i.e., the identity matrix with the first diagonal element set to zero. Otherwise, the condition is simply \(\beta'I_K\beta \leq t(\lambda)\), where \(I_K\) is an identity matrix of size \((K \times K)\). In its Lagrangian form, the problem is to minimize \((y - X\beta)'(y - X\beta) + \lambda\beta'A\beta\), which has the well-known closed-form solution \(\hat\beta_\text{ridge} = (X'X + \lambda A)^{-1}X'y\) implemented by the function below.

ridge_regression <- function(y, X, lambda = 0, intercept = FALSE) {
  K <- ncol(X)
  A <- diag(K) # penalty matrix (identity)
  if (intercept) {
    A[1, ] <- 0 # exempt the intercept (first column of X) from the penalty
  }
  # Closed-form ridge estimator: (X'X + lambda * A)^(-1) X'y
  coeffs <- solve(t(X) %*% X + lambda * A) %*% t(X) %*% y
  return(coeffs)
}
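
As a quick sanity check on simulated data (this is just an illustration, not part of the exercise): for lambda = 0 and a full-rank design matrix, the ridge estimator coincides with OLS.

# Simulated example: with lambda = 0 and full-rank X, ridge equals OLS
set.seed(1234)
X_sim <- cbind(1, matrix(rnorm(300), ncol = 3)) # intercept plus three regressors
y_sim <- drop(X_sim %*% c(0.5, 1, -1, 2) + rnorm(100))

cbind(
  ridge_regression(y_sim, X_sim, lambda = 0, intercept = TRUE),
  coef(lm(y_sim ~ X_sim - 1))
)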

Below, I apply the function to the data set from the previous exercise to illustrate the output of the function ridge_regression().

rc <- c(NA, ridge_regression(y, X, lambda = 1))

# Add an intercept term
rc_i <- ridge_regression(y, cbind(1, X), lambda = 1, intercept = TRUE)
cbind(rc, rc_i)
##            rc     rc_i
##            NA  0.10070
## dp   -0.01197 -0.00133
## dy    0.01744  0.02265
## ep   -0.00639  0.00230
## de   -0.00559 -0.00364
## svar -0.00556 -0.00526
## bm    0.00377 -0.02857
## ntis -0.01399 -0.00867
## tbl  -0.04288 -0.05277
## lty  -0.02118 -0.03452
## ltr   0.04861  0.04604
## tms   0.02170  0.01825
## dfy   0.00495  0.00555
## infl -0.00648 -0.00592

Finally, the following code sequence computes the \(L_2\) norm of the ridge regression coefficients for a given \(\lambda\). Note that, in order to implement this evaluation in a tidy manner, vectorized functions are important. R provides great ways to vectorize a function; to learn more, Hadley Wickham’s book Advanced R is a highly recommended read.

# Small helper function to extract the L_2 norm of the ridge regression coefficients
l2_norm <- function(lambda) sum(ridge_regression(y = y, X = X, lambda = lambda)^2)

l2_norm <- Vectorize(l2_norm) # In order to use the function within a `mutate` operation, it needs to be vectorized

tibble(lambda = seq(from = 0.01, to = 20, by = 0.1)) %>%
  mutate(norm = l2_norm(lambda)) %>%
  ggplot(aes(x = lambda, y = norm)) +
  geom_line() +
  labs(x = "Lambda", y = "L2 norm")
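
Alternatively, instead of Vectorize(), you can map the (non-vectorized) helper over the grid of \(\lambda\)’s, for instance with purrr, which is attached as part of the tidyverse. A minimal sketch:

# Same computation without Vectorize(): map over the lambda grid directly
tibble(lambda = seq(from = 0.01, to = 20, by = 0.1)) %>%
  mutate(norm = map_dbl(lambda, ~ sum(ridge_regression(y = y, X = X, lambda = .x)^2)))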

To compute the coefficients of a linear regression with a penalty on the sum of the absolute values of the regression coefficients, numerical optimization routines are required. Recall that the objective is to minimize the sum of squared residuals, \(\hat\varepsilon'\hat\varepsilon\), under the constraint that \(\sum\limits_{j=2}^K|\beta_j|\leq t(\lambda)\). Make sure you familiarize yourself with the way numerical optimization in R works: we first define the objective function (objective_lasso()), which has the parameter vector we aim to optimize over as its first argument. The main function lasso_regression() then simply minimizes this helper function with optim().

objective_lasso <- function(beta, y, X, lambda, intercept) {
  residuals <- y - X %*% beta
  sse <- sum(residuals^2) # sum of squared residuals
  penalty <- sum(abs(beta)) # L_1 penalty
  if (intercept) {
    penalty <- penalty - abs(beta[1]) # exempt the intercept from the penalty
  }
  return(sse + lambda * penalty)
}

lasso_regression <- function(y, X, lambda = 0, intercept = FALSE) {
  K <- ncol(X)
  beta_init <- rep(0, K) # start the numerical optimization at beta = 0
  return(optim(
    par = beta_init, fn = objective_lasso,
    y = y, X = X, lambda = lambda, intercept = intercept
  )$par)
}
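
By default, optim() uses the Nelder-Mead algorithm, which is a sensible choice here because the \(L_1\) penalty makes the objective non-differentiable at zero. If you run into convergence issues for larger \(K\), one possible tweak (the value of maxit below is only an illustrative guess, not part of the original solution) is to raise the iteration limit via the control argument:

# Variant of lasso_regression() with a higher iteration limit for optim()
lasso_regression_robust <- function(y, X, lambda = 0, intercept = FALSE) {
  optim(
    par = rep(0, ncol(X)), fn = objective_lasso,
    y = y, X = X, lambda = lambda, intercept = intercept,
    control = list(maxit = 5000)
  )$par
}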

rc <- c(NA, lasso_regression(y, X, lambda = 0.01))

# Add an intercept term
rc_i <- lasso_regression(y, cbind(1, X), lambda = 0.01, intercept = TRUE)
cbind(rc, rc_i)
##              rc     rc_i
##  [1,]        NA  0.01623
##  [2,] -0.000976 -0.00208
##  [3,] -0.000963  0.00225
##  [4,]  0.001870  0.00361
##  [5,]  0.000227  0.00266
##  [6,] -0.092782 -0.18322
##  [7,]  0.009676  0.00551
##  [8,] -0.037432 -0.01990
##  [9,] -0.088572 -0.13273
## [10,] -0.045513 -0.01411
## [11,]  0.116790  0.14961
## [12,]  0.122108  0.13362
## [13,]  0.030072  0.01779
## [14,] -0.002820 -0.02764

Finally, as in the previous example with Ridge regression, I illustrate how a larger penalization term \(\lambda\) affects the \(L_1\) norm of the regression coefficients.

l1_norm <- function(lambda) sum(abs(lasso_regression(y = y, X = X, lambda = lambda)))
l1_norm <- Vectorize(l1_norm) # In order to use the function within a `mutate` operation, it needs to be vectorized

tibble(lambda = seq(from = 0, to = 0.5, by = 0.05)) %>%
  mutate(norm = l1_norm(lambda)) %>%
  ggplot(aes(x = lambda, y = norm)) +
  geom_line() +
  labs(x = "Lambda", y = "L1 norm")

While the code above should work, well-tested R packages are available that provide a much more reliable and faster implementation. Thus, you can safely use the package glmnet. As a first step, the following code creates the sequence of regression coefficients for the Goyal-Welch dataset, which should be identical to the example we discussed in the lecture slides.

library(glmnet)
# Lasso and Ridge regression

fit_ridge <- glmnet(X, y, alpha = 0) # Ridge: pure L2 penalty

broom::tidy(fit_ridge) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = lambda, y = estimate, color = term)) +
  geom_line() +
  geom_hline(data = . %>% filter(lambda == min(lambda)), aes(yintercept = estimate, color = term), linetype = "dotted") +
  theme_minimal() +
  scale_x_log10() +
  labs(x = "Lambda", y = "Estimate")

Note the function argument alpha. glmnet() allows a more flexible combination of \(L_1\) and \(L_2\) penalties on the regression coefficients. Pure ridge estimation is implemented with alpha = 0; the lasso requires alpha = 1.

fit_lasso <- glmnet(X, y, alpha = 1) # Lasso: pure L1 penalty

broom::tidy(fit_lasso) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = lambda, y = estimate, color = term)) +
  geom_line() +
  geom_hline(data = . %>% filter(lambda == min(lambda)), aes(yintercept = estimate, color = term), linetype = "dotted") +
  theme_minimal() +
  scale_x_log10() +
  labs(x = "Lambda", y = "Estimate")

Note in the figure how Lasso discards all variables for high values of \(\lambda\) and then gradually incorporates more predictors. It seems like stock variance is selected first. As expected, for \(\lambda \rightarrow 0\), the lasso coefficients converge towards the OLS estimates (illustrated with dotted lines).
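
Any alpha strictly between 0 and 1 yields an elastic net. Going slightly beyond the exercise, a minimal sketch of an elastic net with an equal mix of both penalties, where \(\lambda\) is chosen by cross-validation via cv.glmnet():

# Elastic net with alpha = 0.5; cv.glmnet() picks lambda by 10-fold cross-validation
fit_elastic_net <- cv.glmnet(X, y, alpha = 0.5)
fit_elastic_net$lambda.min # penalty value that minimizes the cross-validated error
coef(fit_elastic_net, s = "lambda.min")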

7.2 Factor selection via machine learning

In this chapter you get to know tidymodels, a collection of packages for modeling and machine learning (ML) using tidyverse principles. From a finance perspective, you will apply penalized regressions to understand which factors and macroeconomic predictors may help to explain the cross-section of industry returns.

Exercises:

  • In this analysis, we use four different data sources. Load the monthly Fama-French 3 factor returns, the monthly q-factor returns from Hou, Xue, and Zhang (2014), the macroeconomic predictors from Welch and Goyal (2008) and monthly portfolio returns from 10 different industries according to the definition from Kenneth French’s homepage as test assets. Your data should contain 22 columns of regressors with the 13 macro variables and 8 factor returns for each month.
  • Provide meaningful summary statistics for the test assets’ excess returns.
  • Familiarize yourself with the tidymodels workflow. First, restrict yourself to just one industry, e.g. Manufacturing. Use the function initial_time_split from the rsample package to split the sample into a training and a test set.
  • Preprocess the data by creating a recipe which performs the following steps: First, the aim is to explain the industry excess returns as a function of all predictors. Exclude the column month from the analysis. Include all interaction terms between factors and macroeconomic predictors. Demean and scale each regressor such that the standard deviation is one.
  • Build a linear_reg model with tidymodels with a fixed penalty term that performs lasso regression. Create a workflow that first applies the recipe steps to the training data and then fits the model. Illustrate the predicted industry returns for both the in-sample and the out-of-sample period.
  • Next, tune the model such that the penalty term is chosen flexibly by cross-validation. For that purpose, update the model such that both penalty and mixture are flexible tuning parameters. Use the function time_series_cv to generate a time-series cross-validation sample which allows you to tune the model with 20 random samples of length five years and a validation period of four years. Tune the model over a grid of possible penalty and mixture values and visualize the mean squared prediction errors for the industry returns across the range of tuning parameters.
  • Finally, write a function to parallelize the entire workflow. The function should split the data into a training and a test set, create a cross-validation scheme, and tune a lasso model for different penalty values. Then, select the best model in terms of MSPE in the validation set. Apply the function to every individual industry and illustrate which factors and macroeconomic variables are selected by the Lasso for each industry.

Solutions: All solutions are provided in the book chapter Factor selection via machine learning and the lecture slides.
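
For orientation, below is a minimal skeleton of the tidymodels workflow described above for a single industry. It is only a sketch: the data object data_manufacturing, the column names ret and month, and the "factor"/"macro" name patterns used for the interactions are assumptions that need to be adapted to your own data.

library(tidymodels)

# Split the sample into a training and a test period (initial_time_split keeps the time order)
split <- initial_time_split(data_manufacturing, prop = 4 / 5)

# Recipe: drop month, add factor x macro interactions, then demean and scale
rec <- recipe(ret ~ ., data = training(split)) %>%
  step_rm(month) %>%
  step_interact(terms = ~ contains("factor"):contains("macro")) %>%
  step_normalize(all_predictors())

# Lasso with a fixed penalty (mixture = 1 corresponds to the pure L1 penalty)
lasso_model <- linear_reg(penalty = 0.001, mixture = 1) %>%
  set_engine("glmnet")

lasso_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lasso_model) %>%
  fit(data = training(split))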

7.3 Empirical Asset Pricing via Machine Learning

This exercise should help you fight your way through an actual academic application of machine learning methods in asset pricing. The exercise guides you step by step through replicating the empirical framework, and eventually some of the main results, of the paper Empirical Asset Pricing via Machine Learning by Shihao Gu, Bryan Kelly, and Dacheng Xiu.

  • Start reading the paper from Section 2 (“An empirical study of U.S. equities”). In their study, the authors aim to use different machine learning procedures to approximate the overarching empirical model \(E_t\left(r_{i, t+1}\right) = g^\star(z_{i,t})\) as defined in the paper. The returns are monthly total individual equity returns from CRSP for all firms listed in the NYSE, AMEX, and NASDAQ. The sample starts in March 1957 (the start date of the S&P 500) and ends in December 2016, totaling 60 years. Prepare the data from CRSP according to these requirements.
  • One major contribution of the paper is to implement the predictive framework with a huge collection of stock-level predictive characteristics based on the cross-section of stock returns literature. Table A.6 in the Internet Appendix provides the details of all these characteristics. You don’t have to generate each of the characteristics yourself; instead, download the data from Dacheng Xiu’s homepage.

Exercises:

  • Download and extract the firm characteristics from the link above. Read the readme file and process the data the way the authors explain it in footnote 29 of the paper: “We cross-sectionally rank all stock characteristics period-by-period and map these ranks into the [-1,1] interval following Kelly, Pruitt, and Su (2019) and Freyberger, Neuhierl, and Weber (2020).”
  • Merge the dataset with the CRSP file and the macroeconomic predictors following the variable definitions detailed in Welch and Goyal (2008).
  • Replace missing values with the cross-sectional median of each characteristic at each month.
  • Then, process the column sic2, which currently contains categorical data and should be transformed into a matrix with 74 columns: the value in a cell should be 1 if the permno corresponds to the specific industry classification and 0 otherwise.
  • In the original paper, the authors inflate the dataset by considering every possible interaction between a macroeconomic variable and a firm characteristic. How could you implement this? (Note: the dataset is huge already without interaction terms. For the purpose of this lecture, feel free to skip this final step.)

Solutions:

First, we download the (large) dataset directly from within R. Presumably you want to store the output in an .sqlite file, for instance in tidy_finance.sqlite or in a new database.

library(lubridate) # for ymd() and floor_date(); attached automatically with recent tidyverse versions

characteristics <- read_csv("https://www.dropbox.com/s/htkvj1hzabyucwk/datashare.csv?dl=1")

characteristics <- characteristics %>%
  mutate(DATE = ymd(DATE),
         DATE = floor_date(DATE, "month")) %>%
  rename("month" = "DATE") %>%
  rename_with(~paste0("characteristic_", .), -c(permno, month, sic2))

The cross-sectional ranking may be somewhat time-consuming. The idea is really just that at each date, the cross-section of each predictor is ranked and mapped into the interval from -1 to 1, such that the maximum value is 1 and the minimum value is close to -1. The function below explicitly handles NA values such that they do not tamper with the ranking.

rank_transform <- function(x){
  rx <- rank(x, na.last = TRUE) # NAs receive the highest ranks ...
  non_nas <- sum(!is.na(x))
  rx[rx > non_nas] <- NA # ... and are set back to NA after the ranking
  2 * (rx / non_nas - 0.5) # map ranks into the interval (-1, 1]
}
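
A quick illustration on a toy vector (made-up input, not from the dataset): the largest observation maps to 1, and the NA keeps its position.

rank_transform(c(3, NA, 1, 2))
## [1]  1.0000000         NA -0.3333333  0.3333333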

characteristics <- characteristics %>%
  group_by(month) %>%
  mutate(across(-c(permno, sic2), rank_transform))

To merge with the monthly CRSP data (in order to get excess returns) we can rely on the “usual” workflow. For the sake of portfolio sorts I retain mktcap_lag in the sample.

tidy_finance <- dbConnect(SQLite(), "../Lectures/data/tidy_finance.sqlite", 
                          extended_types = TRUE)

crsp_monthly <- tbl(tidy_finance, "crsp_monthly") %>%
  select(month, permno, mktcap_lag, ret_excess) %>% 
  collect()

characteristics <- characteristics %>%
  inner_join(crsp_monthly, by = c("month", "permno"))

Now we merge everything with the Goyal-Welch dataset we have been working with before to capture the relevance of macroeconomic variables. A lot of the processing below is just a repetition of the earlier preparation steps.

macro_predictors <- tbl(tidy_finance, "macro_predictors") %>%
  select(-rp_div) %>%
  collect() %>% 
  rename_with(~ paste0("macro_", .), -month) 

characteristics <- characteristics %>%
  inner_join(macro_predictors, by = "month") %>%
  arrange(permno, month) %>%
  select(permno, month, ret_excess, mktcap_lag, sic2, contains("macro"), contains("characteristic"))

The code below replaces missing values with the cross-sectional median of each characteristic at each month (which, after the rank transformation, should be close to zero). One of the coauthors of the paper actually just states that missing values are set to zero.

replace_nas <- function(x){
  med <- median(x, na.rm = TRUE)
  x[is.na(x)] <- med # replace NAs with the cross-sectional median
  x[is.na(x)] <- 0   # if the entire cross-section is NA, the median is NA as well, so fall back to 0
  return(x)
}

characteristics <- characteristics %>%
  group_by(month) %>%
  mutate(across(contains("characteristic"), replace_nas))
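
The exercise also asks you to expand the categorical column sic2 into 0/1 indicator columns. One possible sketch, not part of the pipeline stored below: the object name characteristics_with_dummies and the industry_ prefix are purely illustrative, and missing industry codes end up in a column industry_NA.

# Expand sic2 into one indicator column per industry code
characteristics_with_dummies <- characteristics %>%
  ungroup() %>%
  mutate(value = 1, sic2 = paste0("industry_", sic2)) %>%
  pivot_wider(names_from = sic2, values_from = value, values_fill = 0)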

Finally, I store the pre-processed data in a new database, tidy_finance_ML.sqlite, for your convenience. You can find the file on Absalon and use it for your own research and replication attempts.

tidy_finance_ML <- dbConnect(SQLite(), "../Lectures/data/tidy_finance_ML.sqlite", 
                          extended_types = TRUE)

characteristics %>%
  ungroup() %>% 
  dbWriteTable(tidy_finance_ML, "stock_characteristics_monthly", ., overwrite = TRUE)