3 CRSP and Asset Pricing

3.1 Cleaning and preparing the CRSP sample

A large part of the empirical asset pricing literature focuses on US stock markets, more specifically on the extensive data on prices and firm characteristics provided by the Center for Research in Security Prices (CRSP). As a student at KU, you have access to their vast database. To get familiar with the dataset, I provide the file raw_crsp.csv on Absalon, which contains monthly observations for all stocks covered in the CRSP universe. The sample ranges from December 1925 until December 2019. A large part of the following computations is based on Empirical Asset Pricing: The Cross Section of Stock Returns (excerpts provided in Absalon) and a lot of the code has been prepared by my great colleague Christoph Scheuch. Moreover, we will make use of data provided on Kenneth R. French’s homepage. I already downloaded the file F-F_Research_Data_Factors.csv for you and provide it on Absalon, but you are encouraged to take a deeper look at his very informative homepage.

3.1.1 Exercises

  • Read in the file raw_crsp.csv and make yourself familiar with the structure of the data. Filter out all observations that do not have a share code of 10 or 11 (a documentation of the variables is provided here). Make sure there are no duplicates in your dataset.
  • Replicate the two figures provided in the lecture slides: i) Create a time-series of the number of stocks listed on NASDAQ, NYSE and AMEX. ii) Illustrate the time series of total market values (inflation adjusted) based on industry classification siccd. The book chapters in Absalon (Empirical Asset Pricing) provide a precise walk-through if you need help.
  • Follow the procedure described in Chapter 7.2 of the textbook (Empirical Asset Pricing) to compute returns adjusted for delisting.
  • Download and prepare time series for the risk-free rate and the market return from Kenneth French’s data library. Illustrate the evolution of the risk-free rate over time.
  • Create summary statistics for the cross-section of returns, excess returns, adjusted returns, and adjusted excess returns. There are many ways to do this; one way is to compute summary statistics for the cross-section on a monthly basis and then to report the time-series averages of these values.
  • Assume you invested 100 USD at the beginning of the sample in the market portfolio and reinvested all distributed dividends. Illustrate the time-series of your wealth.

3.1.2 Solutions

Note: the file is huge (roughly 4 million rows) and it may take some time to read in the entire file.

library(tidyverse)
library(lubridate) # for working with dates
# Read in CRSP data
crsp <- read_csv("raw_crsp.csv", col_types = cols(.default = col_character()))

read_csv usually does a great job at recognizing the data types of the individual columns. To avoid any parsing issues, however, I have chosen to read in the file with the default that all columns contain character data and to change the column types in the next step. You may note that the dataset actually contains 64 variables, many more than we need for this exercise.

# Clean the file
crsp <- crsp %>%
  janitor::clean_names() %>%                 # Replace column names with lower case
  transmute(permno = as.integer(permno),     # security identifier
            date = ymd(date),                # month identifier
            ret = as.numeric(ret) * 100,     # return (convert to percent)
            shrout = as.numeric(shrout),     # shares outstanding (in thousands)
            altprc = as.numeric(altprc),     # last traded price in a month
            exchcd = as.integer(exchcd),     # exchange code
            shrcd = as.integer(shrcd),       # share code
            siccd = as.integer(siccd),       # industry code
            dlret = as.numeric(dlret) * 100, # delisting return (converted to percent)
            dlstcd = as.integer(dlstcd)      # delisting code
  ) %>%
  mutate(date = floor_date(date, "month"))   # set every date to the first day of its month (YYYY-MM-01) for consistent matching across time series

Next, as is common in the literature, we further restrict our sample and keep only US listed stocks. The command distinct allows us to remove any two rows which have the same permno-date combination and are thus duplicates. The option .keep_all = TRUE retains the remaining 8 columns.

crsp <- crsp %>%
  filter(shrcd %in% c(10, 11)) %>%               # keep only US listed stocks
  distinct(permno, date, .keep_all = TRUE) %>%   # remove duplicates
  select(-shrcd)

As described in the textbook, the exchanges on which stocks are listed are recorded in the field exchcd. Values 1 and 31 stand for NYSE listed stocks, 2 and 32 for AMEX, and 3 and 33 for NASDAQ. There are plenty of other trading venues; for the sake of brevity I aggregate the remaining ones as Other. Next, broad industry classifications are provided by the siccd code. The classification below differentiates between 10 different industries.

crsp <- crsp %>%
  mutate(mktcap = abs(shrout * altprc) / 1000,   # market cap in millions of USD; altprc is negative when it is a bid/ask average, hence abs()
         mktcap = if_else(mktcap == 0, as.numeric(NA), mktcap), # replace a market cap of 0 with NA
         exchange = case_when(exchcd %in% c(1, 31) ~ "NYSE",
                              exchcd %in% c(2, 32) ~ "AMEX",
                              exchcd %in% c(3, 33) ~ "NASDAQ", # Relabel exchange codes
                              TRUE ~ "Other"),
         industry = case_when(siccd >= 1 & siccd <= 999 ~ "Agriculture", # Create industry codes
                              siccd >= 1000 & siccd <= 1499 ~ "Mining",
                              siccd >= 1500 & siccd <= 1799 ~ "Construction",
                              siccd >= 2000 & siccd <= 3999 ~ "Manufacturing",
                              siccd >= 4000 & siccd <= 4999 ~ "Transportation",
                              siccd >= 5000 & siccd <= 5199 ~ "Wholesale",
                              siccd >= 5200 & siccd <= 5999 ~ "Retail",
                              siccd >= 6000 & siccd <= 6799 ~ "Finance",
                              siccd >= 7000 & siccd <= 8999 ~ "Services",
                              siccd >= 9000 & siccd <= 9999 ~ "Public",
                              TRUE ~ "Missing"))
# Visualize number of securities (per exchange)
crsp %>%
  count(exchange, date) %>%
  ggplot(aes(x = date, y = n, group = exchange, linetype = exchange)) +
  geom_line(aes(color = exchange)) +
  labs(x = "", y = "Number of Securities", color = "Exchange", linetype = "Exchange") +
  scale_x_date(expand = c(0, 0),
               date_breaks = "10 years",
               date_labels = "%Y") +
  theme_classic()

Figure 3.1: Number of securities in CRSP sample

The NYSE has by far the longest tradition on US equity markets; in more recent decades, however, NASDAQ has listed an increasing number of stocks.

As described in the lecture slides, we adjust for changes in purchasing power with a measure of the consumer price index. The data is provided on Absalon in exactly the same way I downloaded it. See the slides or the textbook for more information.

cpi <- read_csv("data/CPIAUCNS.csv")
cpi <- cpi %>% # Clean CPI data
  rename("date" = DATE,
         "cpi" = CPIAUCNS) %>%
  mutate(cpi = cpi/dplyr::last(cpi))
crsp <- crsp %>%
  left_join(cpi, by = "date") # Join CPI data and CRSP sample

Note that in the figure below I illustrate total market capitalization in billions of USD, where all values are expressed in terms of December 2020 purchasing power.

# Visualize market capitalization by industry
crsp %>%
  filter(!is.na(industry)) %>%
  group_by(industry, date) %>%
  summarize(mktcap = sum(mktcap / cpi, na.rm = TRUE) / 1000) %>%
  ggplot(aes(x = date, y = mktcap, group = industry)) +
  geom_line(aes(color = industry, linetype = industry)) +
  labs(x = "", y = "Total Market Value (Billions of Dec 2020 Dollars)",
       color = "Industry", linetype = "Industry") +
  scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") +
  theme_classic()

Figure 3.2: Market capitalization by industry in CRSP sample

While stock returns are an easy concept to grasp, they are hard to measure once a stock is delisted and publicly available quotes are no longer available. CRSP provides some additional information on the form and reason of the delisting, but it remains hard to measure stock returns during such periods. Bali, Engle, and Murray propose the following adjustment, which assigns a return of -30% or -100%, depending on the reason for the delisting, whenever CRSP does not provide a return for the delisting period.

# compute stock returns
crsp <- crsp %>%
  mutate(ret.adj = case_when(!is.na(dlret) ~ dlret, # ret.adj follows the delisting adjustment described in Bali, Engle, and Murray
                             is.na(dlret) & (dlstcd %in% c(500, 520, 580, 584) |
                                               (dlstcd >= 551 & dlstcd <= 574)) ~ -30,
                             is.na(dlret) & !is.na(dlstcd) ~ -100, # all remaining delistings without a reported return
                             TRUE ~ ret)) %>%
  select(-c(dlret, dlstcd))

Next, I downloaded data on market returns and the risk-free rate from Kenneth French’s homepage. The dataset provided on Absalon has been cleaned somewhat beforehand, so you are advised to repeat this analysis on your own to really understand how French’s data library is structured. I join the CRSP returns with the market and risk-free returns to compute excess returns later on.

factors_ff <- read_csv("F-F_Research_Data_Factors.csv", skip = 3) %>%
  rename(date = X1, MKT = `Mkt-RF`) %>%
  janitor::clean_names() %>%
  mutate(date = ymd(parse_date_time(date, "%Y%m")))%>%
  na.omit()

crsp <- crsp %>%
  left_join(factors_ff, by = "date") # Join with CRSP sample
factors_ff %>%
  ggplot(aes(x=date, y = rf)) +
  geom_point() +
  theme_bw() +
  labs(x="", y ="Risk-free rate (FF) in %")

Figure 3.3: Risk-free rate from Kenneth French’s database

Above I illustrate the risk-free rate as provided by Kenneth French. Finally, we compute some summary statistics for the excess returns as well as their (delisting) adjusted counterparts. The code below first computes summary statistics on the cross-section of stocks for each date and then simply reports the time-series average of each value. Note: To create the tables, the packages knitr and kableExtra have to be installed.

crsp_summaries <- crsp %>%
  mutate(ret.exc = ret - rf, # Compute excess returns
         ret.adj.exc = ret.adj - rf) %>%
  group_by(date) %>% # Compute summary statistics by month (cross-section of returns)
  summarise_at(vars(ret.adj.exc, ret.adj, ret.exc, ret),
               list(mean = mean,
                    sd = sd,
                    min = min,
                    q05 = ~quantile(., 0.05, na.rm = TRUE),
                    q25 = ~quantile(., 0.25, na.rm = TRUE),
                    median = median,
                    q75 = ~quantile(., 0.75, na.rm = TRUE),
                    q95 = ~quantile(., 0.95, na.rm = TRUE),
                    max = max,
                    n = ~as.numeric(sum(!is.na(.)))), na.rm = TRUE) %>%
  select(-date) %>%
  mutate_all(~if_else(!is.finite(.), as.double(NA), .)) %>%
  summarise_all(mean, na.rm = TRUE) %>% # Compute time series means
  pivot_longer(everything()) %>% # improve formatting
  separate(name, into = c("var", "stat"), sep = "_") %>%
  pivot_wider(names_from = stat, values_from = value)

crsp_summaries %>%
  kableExtra::kable("markdown", digits = 2) # Return table which is used in the slides
| var         | mean |    sd |    min |    q05 |   q25 | median |  q75 |   q95 |    max |       n |
|:------------|-----:|------:|-------:|-------:|------:|-------:|-----:|------:|-------:|--------:|
| ret.adj.exc | 0.86 | 13.68 | -67.63 | -16.68 | -5.79 |  -0.13 | 6.05 | 20.95 | 208.27 | 3133.86 |
| ret.adj     | 1.12 | 13.66 | -67.25 | -16.40 | -5.52 |   0.13 | 6.31 | 21.19 | 207.87 | 3136.53 |
| ret.exc     | 0.99 | 13.43 | -56.25 | -16.49 | -5.70 |  -0.04 | 6.16 | 21.07 | 197.13 | 3117.27 |
| ret         | 1.25 | 13.42 | -55.93 | -16.21 | -5.42 |   0.23 | 6.41 | 21.31 | 196.77 | 3119.93 |

Finally, another representation of the evolution of the market portfolio. The figure below shows the cumulative log returns from the beginning of the sample and thus illustrates the dynamics of 100 USD invested in the US stock market over time. The self-financing option assumes you borrowed the 100 USD at the risk-free rate and rolled over your debt ever since.

# Plot cumulative excess returns of mkt
factors_ff %>%
  mutate("Self-financing" = 100 * log(1 + mkt / 100),
         "Financed" = 100 * log(1 + (rf + mkt) / 100)) %>%
  pivot_longer("Self-financing":"Financed") %>%
  group_by(name) %>%
  mutate(value = cumsum(value)) %>%
  ggplot(aes(x = date, y = value, color = name)) +
  geom_line() +
  labs(x = "", y = "Cumulative portfolio value") +
  theme_minimal()

Figure 3.4: Cumulative portfolio values

Note: From here it is easy to create the file exercise_clean_crsp_returns.rds from exercise 2.2. The code below has been used to generate the time series of returns.

# Filter and clean
crsp <- crsp %>%
  drop_na(permno, date, ret.adj) %>%
  filter(mktcap > 0) %>%
  select(permno, date, ret.adj)

# Prepare CRSP returns for exercise
date_seq <- tibble(req_date = seq(ymd('1950-01-01'), ymd('2019-12-01'), by = '1 month'))

crsp_final <- crsp %>% inner_join(date_seq, by = c("date" = "req_date"))
crsp_final <- crsp_final %>%
  group_by(permno) %>%
  filter(n() == 840)

crsp_final <- crsp_final %>%
  mutate(m = mean(ret.adj), sd = sd(ret.adj)) %>%
  ungroup() %>%
  filter(m < quantile(m, 0.95), # some filtering to avoid annoying outliers
         sd < quantile(sd, 0.95)) %>%
  select(-m, -sd)

crsp_final %>% write_rds("exercise_clean_crsp_returns.rds")

3.2 Beta in the CRSP sample and Portfolio sorts

This exercise is concerned with estimating \(\beta_i\) for the cross-section of US equity returns based on the CRSP sample. You will estimate the exposure of individual stocks with respect to changes in the market portfolio. First, compute \(\beta_i\) for an individual stock to get familiar with linear regression in R. Then, we run rolling window regressions for the entire CRSP sample. To conduct a rolling window estimation you can either employ a for loop or exploit the functionalities of the convenient slider package.

3.2.1 Exercises

  • Read in the cleaned CRSP data set crsp_clean_sample.rds from exercise 2.5 and compute the market beta \(\beta_\text{AAPL}\) of ticker AAPL (permno == 14594). You can use the function lm() for that purpose (alternatively: compute the well-known OLS estimate \((X'X)^{-1}X'Y\) on your own).
  • For monthly data, it is common to compute \(\beta_i\) based on a rolling window of length 5 years. Implement a rolling procedure that estimates each asset’s market beta every month based on the last 60 observations. You can either use the package slider or a simple for loop for that. (Note: this is going to be a time-consuming computational task.)
  • Provide summary statistics for the cross-section of estimated betas.
  • What is the theoretical prediction of CAPM with respect to the relationship between market beta and expected returns? What would you expect if you create portfolios based on beta (you create a high- and a low-beta portfolio each month and track the performance over time)? How should the expected returns differ between high and low beta portfolios?
  • Verify/test your hypothesis by implementing such a portfolio sort procedure. You can follow these sub-steps: i) For each month, compute the beta deciles of the cross-section. ii) Use these deciles as thresholds which define for each asset into which portfolio it belongs (if the estimated beta of asset \(i\) lies between the 20% and the 30% quantile, it belongs to the third portfolio). iii) Weight the assets within each portfolio either equally or relative to their market capitalization. iv) Compute the return of each portfolio in the next period. v) Repeat this procedure for each month.

3.2.2 Solutions

# install.packages("slider")
library(tidyverse) # for grammar
library(lubridate) # for working with dates
library(slider)     # for rolling window operations (https://github.com/DavisVaughan/slider)

We will make use of the slider package later on. Make sure it is installed before you run the code below.

First, I read in the data and extract the relevant variables. To estimate the CAPM regression equation \[r_{i,t} - r_{f,t} = \alpha_i + \beta_i\left(r_{m,t} - r_{f,t}\right) + \varepsilon_{i,t}\] the excess returns of the individual asset \(i\) and the excess market returns are required.

# Read in the cleaned CRSP data set
crsp <- read_rds("crsp_clean_sample.rds")

R provides a whole machinery to estimate (linear) models with the function lm(). lm() requires a formula as input which is specified in a compact symbolic form. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. In addition to standard linear models, lm() provides a lot of flexibility; you should check out the documentation for more information. Below I restrict the data to the time series of observations in CRSP that correspond to Apple’s stock (the permno of Apple is 14594) and compute \(\alpha_i\) as well as \(\beta_i\) based on the asset’s and the market’s excess returns.

crsp <- crsp %>%
  mutate(ret_excess = ret.adj - rf,
         mkt_excess = mkt - rf) %>%
  select(permno, date, ret_excess, mkt_excess, mktcap)

# Individual CAPM Regression
fit <- lm(ret_excess ~ mkt_excess, data = crsp %>%
     filter(permno == "14594")) # APPLE PERMNO
summary(fit)
## 
## Call:
## lm(formula = ret_excess ~ mkt_excess, data = crsp %>% filter(permno == 
##     "14594"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.023 -12.758  -3.502   7.772  62.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.9217     2.8074   0.684    0.497
## mkt_excess    0.9453     0.9027   1.047    0.300
## 
## Residual standard error: 20.1 on 53 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.02028,    Adjusted R-squared:  0.00179 
## F-statistic: 1.097 on 1 and 53 DF,  p-value: 0.2997

lm() returns an object of class lm which contains all information one usually cares about with linear models. summary() returns an easy-to-understand overview of the estimated parameters; coefficients(fit) would return only the estimated coefficients. The output above indicates that Apple moves closely with the market: \(\hat\beta\) is close to one. What the output also shows, however, is that the estimation uncertainty is huge and we cannot reject the \(H_0\) that \(\beta\) is actually zero. What does this imply for the expected excess returns according to the CAPM?
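
As a cross-check, the exercise also suggests computing the well-known OLS estimate \((X'X)^{-1}X'Y\) by hand. A minimal sketch (not part of the original solution; the objects apple, X_apple and y_apple are names chosen only for illustration) which should reproduce coefficients(fit):

# Manual OLS estimate of alpha and beta for permno 14594
apple <- crsp %>%
  filter(permno == "14594") %>%
  drop_na(ret_excess, mkt_excess)

X_apple <- cbind(1, apple$mkt_excess)                    # regressor matrix with a column of ones
y_apple <- apple$ret_excess
solve(t(X_apple) %*% X_apple) %*% t(X_apple) %*% y_apple # (X'X)^{-1} X'y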

Next, we scale the estimation of \(\beta\) to a whole different level and perform rolling window estimations for the entire CRSP sample. To do this, I first define a little helper function capm_regression which makes use of the code snippet above.

# Rolling window estimation
capm_regression <- function(x, check = 24) {
  # A simple helper that estimates the CAPM beta with lm() if there are at least `check` (default: 24) observations
  if (nrow(x) < check) { # "feasible" regressions require at least `check` observations
    return(tibble(beta = NA))
  } else {
    reg <- lm(ret_excess ~ mkt_excess, data = x)
    return(tibble(beta = reg$coefficients[2]))
  }
}

Note that I included a test if the time-series contains at least 24 observations to avoid huge fluctuations in case the time-series is just too short. All this function returns is the market beta regression coefficient.

The next function wraps everything together: the slider package comes with the convenient function slide_index_dfr() which takes a tibble as its first argument, a date index (x$date) as its second argument, and applies a function (capm_regression(.x)) to the subset of data .x that falls within the rolling window. The length of the window is provided via .before = months(60). Thus, for each month, the rolling window CAPM regression coefficient \(\hat\beta\) is stored in a new column which I call beta.

rolling_capm_regression <- function(x, window = 60) {
  # slide across dates
  betas <- slide_index_dfr(x,
                           x$date,
                           ~capm_regression(.x),
                           .before = months(window),
                           .complete = FALSE)

  # collect output
  x %>% mutate(beta =  betas$beta)
}
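
The exercise also mentions a plain for loop as an alternative to slider. Below is a minimal sketch of that route for a single security (the function name rolling_capm_loop and the argument min_obs are hypothetical; single_stock is assumed to be the date-sorted data of one permno with columns ret_excess and mkt_excess):

rolling_capm_loop <- function(single_stock, window = 60, min_obs = 24) {
  n <- nrow(single_stock)
  betas <- rep(NA_real_, n)
  for (t in seq_len(n)) {
    idx <- max(1, t - window + 1):t  # indices of the last `window` months up to month t
    if (length(idx) >= min_obs) {    # same feasibility check as in capm_regression()
      betas[t] <- coef(lm(ret_excess ~ mkt_excess, data = single_stock[idx, ]))[2]
    }
  }
  single_stock %>% mutate(beta = betas)
}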

The final step is just to let R do the rest and to wait (probably quite some time) until the large number of regressions is completed. Note below that the computation is performed separately for each permno.

betas_monthly <- crsp %>%
  na.omit() %>%
  group_by(permno) %>%
  arrange(permno, date) %>%
  group_modify(~rolling_capm_regression(.x)) %>%
  na.omit()

What do the estimated betas look like? There are many ways to provide summary statistics. The code snippet below computes cross-sectional summary statistics (for each date) and then reports the time-series averages. In case you are wondering: the package knitr provides amazing features to transform R output into ready-to-publish formats (it also supports LaTeX).

betas_monthly %>%
  group_by(date) %>%
  summarise(mean = mean(beta),
            sd = sd(beta),
            skew = moments::skewness(beta),
            kurt = moments::kurtosis(beta),
            min = min(beta),
            q25 = quantile(beta, 0.25),
            q50 = quantile(beta, 0.50),
            q75 = quantile(beta, 0.75),
            max = max(beta),
            n = n()) %>%
  ungroup() %>%
  select(-date) %>%
  summarise_all(list(~mean(., na.rm = TRUE))) %>%
  knitr::kable("pipe", digits = 2)
| mean |   sd | skew | kurt |   min |  q25 | q50 |  q75 |  max |       n |
|-----:|-----:|-----:|-----:|------:|-----:|----:|-----:|-----:|--------:|
| 1.16 | 0.64 | 0.49 |  7.3 | -2.17 | 0.73 | 1.1 | 1.53 | 5.11 | 2710.21 |

The table above shows that we computed betas for on average 2710 firms per month (after removing NAs). The typical firm has a mean market beta of about \(1\) but there is a lot of variation. We also see that only very few firms exhibit a negative beta.
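
To back up the last claim, a quick check (a small addition to the original solution) computes the share of negative beta estimates:

# Fraction of stock-month observations with a negative estimated beta
betas_monthly %>%
  ungroup() %>%
  summarise(share_negative = mean(beta < 0))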

The next section implements portfolio sorts based on beta. The idea is to create portfolios of assets with a comparable characteristic such as market beta. The same idea can also be applied to sorting based on size, book-to-market values, or any other firm characteristic. First, I create new variables with the suffix _f1 which track the (excess) returns of an asset and of the market in the following month. These values will be used to track the portfolio performance.

# Next, we construct portfolio sorts

# We add the (excess) returns for the subsequent month of each stock to the sample.
# Note that we use beginning of month dates to ensure correct matching of dates

betas_monthly <- betas_monthly %>%
  arrange(permno, date) %>%
  group_by(permno) %>%
  mutate(ret_excess_f1 = lead(ret_excess, 1),
         mkt_excess_f1 = lead(mkt_excess, 1))

Next, I compute the sorting thresholds. In this case, I want to create 10 portfolios of assets with similar betas. For that purpose I first compute the beta deciles for each month. Firms will be assigned to individual portfolios based on these thresholds.

# calculate quantiles
percentiles <- seq(0.1, 0.9, 0.1)
percentiles_names <- map_chr(percentiles, ~paste0("q", .x*100))
percentiles_funs <- map(percentiles,
                        ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = percentiles_names)

quantiles <- betas_monthly %>%
  group_by(date) %>%
  summarise_at(vars(beta), lst(!!!percentiles_funs))

quantiles %>%
  pivot_longer(-date, names_to = "Quantile") %>%
  ggplot(aes(x = date, y = value, color = Quantile)) + geom_line() + theme_bw() +
  labs(x="", y = "Beta")

Figure 3.5: Beta quantiles in the CRSP sample

The figure above illustrates the thresholds for market beta. There is some time-series variation, which we account for by recomputing the quantiles each month. Next, we sort each asset into its portfolio on a monthly basis. The portfolio weights are usually chosen to be either equal (in that case the portfolio return is just the average return of all assets in the portfolio) or proportional to market capitalization (in that case the portfolio return is a weighted mean). The code below computes value-weighted returns; an equally weighted variant is sketched after the code.

# sort all stocks into decile portfolios
portfolios <- betas_monthly %>%
  left_join(quantiles, by = "date") %>%
  mutate(portfolio = case_when(beta <= q10 ~ 1L,
                               beta > q10 & beta <= q20 ~ 2L,
                               beta > q20 & beta <= q30 ~ 3L,
                               beta > q30 & beta <= q40 ~ 4L,
                               beta > q40 & beta <= q50 ~ 5L,
                               beta > q50 & beta <= q60 ~ 6L,
                               beta > q60 & beta <= q70 ~ 7L,
                               beta > q70 & beta <= q80 ~ 8L,
                               beta > q80 & beta <= q90 ~ 9L,
                               beta > q90 ~ 10L))

portfolios_ts <- portfolios %>%
  filter(date <= "2019-12-01") %>%
  mutate(portfolio = as.character(portfolio)) %>%
  group_by(portfolio, date) %>%
  summarize(ret_vw = weighted.mean(ret_excess_f1, mktcap, na.rm = TRUE),
            ret_mkt = mean(mkt_excess_f1, na.rm = TRUE)) %>%
  na.omit() %>%
  ungroup()
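
For completeness, a minimal sketch of the equally weighted alternative mentioned above (the object name portfolios_ts_ew is hypothetical; the original solution reports only value-weighted returns):

portfolios_ts_ew <- portfolios %>%
  filter(date <= "2019-12-01") %>%
  mutate(portfolio = as.character(portfolio)) %>%
  group_by(portfolio, date) %>%
  summarize(ret_ew = mean(ret_excess_f1, na.rm = TRUE),  # equal weights: simple average across stocks
            ret_mkt = mean(mkt_excess_f1, na.rm = TRUE)) %>%
  na.omit() %>%
  ungroup()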

As an additional portfolio I include a self-financing portfolio which finances a long position in high beta stocks (“10”) with a short position in low beta stocks (“1”).

# 10-1 portfolio
portfolios_ts_101 <- portfolios_ts %>%
  filter(portfolio %in% c("1", "10")) %>%
  pivot_wider(names_from = portfolio,
              values_from = ret_vw) %>%
  mutate(ret_vw = `10` - `1`,
         portfolio = "10-1") %>%
  select(portfolio, date, ret_vw, ret_mkt)

# combine everything
portfolio_returns <- bind_rows(portfolios_ts, portfolios_ts_101) %>%
  mutate(portfolio = factor(portfolio, levels = c(as.character(seq(1, 10, 1)), "10-1")))

What is the performance of these portfolios? I provide two evaluation metrics: first the average returns and then the CAPM risk-adjusted alphas per portfolio. Recall that, according to the CAPM, high beta portfolios should earn higher expected returns than low beta portfolios (if the market risk premium is positive). However, after adjusting for the covariance with the market portfolio, all portfolios should exhibit a zero alpha, such that only non-diversifiable risk-taking is rewarded.

average_ret <- portfolio_returns %>%
  group_by(portfolio) %>%
  arrange(date) %>%
  do(model = lm(paste0("ret_vw ~ 1"), data = .)) %>%
  mutate(nw_stderror = sqrt(diag(sandwich::NeweyWest(model, lag = 6)))) %>%
  mutate(broom::tidy(model)) %>%
  ungroup() %>%
  mutate(nw_tstat = estimate / nw_stderror) %>%
  select(estimate, nw_tstat) %>%
  t()

# estimate capm alpha per portfolio
average_capm_alpha <- portfolio_returns %>%
  group_by(portfolio) %>%
  arrange(date) %>%
  do(model = lm(paste0("ret_vw ~ 1 + ret_mkt"), data = .)) %>%
  mutate(nw_stderror = sqrt(diag(sandwich::NeweyWest(model, lag = 6))[1])) %>%
  mutate(estimate = coefficients(model)[1]) %>%
  mutate(nw_tstat = estimate / nw_stderror) %>%
  select(estimate, nw_tstat) %>%
  t()

# construct output table
out <- rbind(average_ret, average_capm_alpha)
colnames(out) <-c(as.character(seq(1, 10, 1)), "10-1")
rownames(out) <- c("Excess Return", "t-Stat" , "CAPM Alpha", "t-Stat")

out %>% knitr::kable(digits = 2, "pipe")
|               |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9 |   10 |  10-1 |
|:--------------|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|------:|
| Excess Return | 0.46 | 0.48 | 0.62 | 0.63 | 0.67 | 0.67 | 0.64 | 0.64 | 0.64 | 0.67 |  0.23 |
| t-Stat        | 2.66 | 2.69 | 3.56 | 3.29 | 3.11 | 3.03 | 2.70 | 2.45 | 2.16 | 1.98 |  0.78 |
| CAPM Alpha    | 0.25 | 0.22 | 0.32 | 0.29 | 0.29 | 0.24 | 0.18 | 0.13 | 0.06 | 0.01 | -0.53 |
| t-Stat        | 1.81 | 1.58 | 2.77 | 2.55 | 2.45 | 2.07 | 1.53 | 1.01 | 0.44 | 0.08 | -2.50 |

Note: the code above implements standard errors based on Newey-West adjustments for autocorrelation in the error terms. What do the portfolio performances reveal? A big surprise, at least if you believe in the CAPM: there is no indication that average excess returns increase with beta. In fact, the self-financing portfolio yields an average excess return of 0.23, which is not significantly different from zero. Thus, there is a strong signal that the CAPM story alone is not sufficient to explain the cross-section of returns.

3.3 Estimating and testing the CAPM

This exercise is closely aligned to the slides on the capital asset pricing model and asks you to estimate the market beta for some portfolios of assets sorted by industry, to compute optimal portfolio weights based on the CAPM factor structure and to test the validity of the CAPM.

3.3.1 Exercises

  • Download monthly return data for 10 industry portfolios from Kenneth French’s data library and prepare it so that you can work with the file in R (I advise you to download the .csv file and to take a look at the file to remove unnecessary rows). Also, retrieve the market and risk-free returns from Kenneth French’s data library. Provide some summary statistics for the industry portfolio (excess) returns.
  • For each industry portfolio, estimate the CAPM model with simple OLS. What are the \(\alpha\)’s? What are the \(\beta\)’s? What does the CAPM imply for \(\alpha\)?
  • Perform a joint test on the validity of the CAPM based on the finite sample distribution of a Wald test proposed by MacKinlay (1987) and Gibbons, Ross, and Shanken (1989). What do you conclude?
  • Suppose the CAPM would hold. How could you make use of the factor structure of the returns to compute portfolio weights on the efficient frontier? What is the advantage of this procedure? Compare the portfolio weights of risky assets for a desired annualized rate of return of 10% resulting from using the sample mean and covariance, \(\hat\mu\) and \(\hat\Sigma\), as input into the optimization approach with the corresponding weights based on the moments of the return distribution implied by the CAPM.

3.3.2 Solutions

You can download the .csv manually from the homepage or from within R following this blog post. I cleaned the file by removing the annual and equally-weighted returns from the file. This raw version is also provided on Absalon.

# Load required packages
library(tidyverse)
library(lubridate)

# Read in the data

returns <- read_csv("../Empirics/data/10_Industry_Portfolios.csv", skip = 11) %>%
  rename(date = X1) %>%
  # parse_date_time is a convenient way to transform the YYYYMM convention
  # in Kenneth French's files into dates R can work with
  mutate(date = ymd(parse_date_time(date, "%Y%m"))) %>%
  na.omit()

# Merge with the market factor and the risk-free rate to compute excess returns
factors_ff <- read_csv("../Empirics/data/F-F_Research_Data_Factors.csv", skip = 3) %>%
  rename(date = X1, MKT = `Mkt-RF`) %>%
  janitor::clean_names() %>%
  mutate(date = ymd(parse_date_time(date, "%Y%m")))%>%
  na.omit()

After reading in the files I join them together to compute the excess returns. The line mutate_at(vars(-date), ~ . - rf) below performs a mutation on each column except date and subtracts the value of the column rf from the original value.

returns <- returns %>%
  left_join(factors_ff, by = "date") %>%
  select(-smb, -hml) %>%
  mutate_at(vars(-date), ~ .-rf) %>%
  select(-rf) %>%
  na.omit()

Next, I provide some summary statistics for each ticker, including mkt, the market excess returns.

returns %>%
  pivot_longer(-date, names_to = "ticker") %>%
  group_by(ticker) %>%
  summarise(mean = 12*mean(value), # annualized values
            sd = sqrt(12)*sd(value),
            sharpe = mean/sd)
| ticker |  mean |    sd | sharpe |
|:-------|------:|------:|-------:|
| Durbl  | 10.39 | 27.14 |   0.38 |
| Enrgy  |  8.14 | 21.73 |   0.37 |
| HiTec  | 10.45 | 25.03 |   0.42 |
| Hlth   |  9.77 | 19.23 |   0.51 |
| Manuf  |  8.94 | 21.63 |   0.41 |
| mkt    |  4.83 | 18.61 |   0.26 |
| NoDur  |  8.28 | 15.89 |   0.52 |
| Other  |  7.60 | 22.21 |   0.34 |
| Shops  |  9.22 | 20.17 |   0.46 |
| Telcm  |  7.01 | 15.93 |   0.44 |
| Utils  |  7.18 | 19.03 |   0.38 |

The implication of the CAPM is that all elements of the \((N \times 1)\) vector \(\alpha\) are zero in the joint regression framework \[Z_t = \alpha + \beta Z_{m,t} + \varepsilon_t\] where \(\beta\) is the \((N \times 1)\) vector of market betas (if \(\alpha\) is zero, then \(m\) is the tangency portfolio). The standard ordinary least squares (OLS) estimator delivers \[\hat\alpha = \hat\mu - \hat\beta\hat\mu_m\] with \[\hat\beta = \frac{\sum (Z_t - \hat\mu)(Z_{m,t} - \hat\mu_m)}{\sum (Z_{m,t} - \hat\mu_m)^2}.\] Alternatively, let \(X = (1, Z_m)\) be a \((T \times 2)\) matrix; then \((X'X)^{-1}X'Z\) is the \((2 \times N)\) matrix of coefficients. You can also use lm to compute the same coefficients, but for the sake of brevity I provide the explicit estimation below.

Z <- returns %>% select(-date, -mkt) %>% as.matrix()
Z_m <- returns %>% pull(mkt)

X <- cbind(1, Z_m)
mle_estimate <- solve(t(X)%*%X) %*% t(X)%*%Z
alphas <- matrix(mle_estimate[1, ], nr = 1)

# Exactly the same can be computed with lm(Z~Z_m)

The estimated parameters are:

|             | NoDur | Durbl | Manuf | Enrgy | HiTec | Telcm | Shops | Hlth | Utils | Other |
|:------------|------:|------:|------:|------:|------:|------:|------:|-----:|------:|------:|
| (Intercept) |  0.39 |  0.36 |  0.30 |  0.32 |  0.38 |  0.32 |  0.38 | 0.48 |  0.29 |  0.18 |
| Z_m         |  0.75 |  1.25 |  1.12 |  0.89 |  1.22 |  0.66 |  0.96 | 0.83 |  0.76 |  1.12 |

To evaluate the significance of the individual parameters you can either implement the equations provided in Chapter 4 of The Econometrics of Financial Markets (available in Absalon) or directly exploit the functionality provided by lm (which, in principle, also allows you to compute standard errors in the presence of heteroscedasticity). As an example, see the code below.

fit <- lm(Z ~ Z_m)
fit <- summary(fit)
fit$`Response HiTec`
## 
## Call:
## lm(formula = HiTec ~ Z_m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9742  -1.8207  -0.0544   1.6076  14.9026 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.37904    0.09007   4.208 2.78e-05 ***
## Z_m          1.22194    0.01673  73.056  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.019 on 1128 degrees of freedom
## Multiple R-squared:  0.8255, Adjusted R-squared:  0.8254 
## F-statistic:  5337 on 1 and 1128 DF,  p-value: < 2.2e-16

The estimated parameters reveal that all industry portfolios exhibit a positive alpha (you can convince yourself that they are significantly different from 0). This indicates that each of the portfolios yields returns which are not fully explained by the exposure to market fluctuations. The estimated betas range from 0.66 (telecommunications) to 1.25 (durable goods) and reflect that industries are affected differently by fluctuations in the economy.
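
As a quick way to convince yourself of the significance of the alphas, you can loop over the per-response summaries (a small sketch, assuming the summary object fit created above):

# Alpha estimates and t-statistics for all 10 industry portfolios
sapply(fit, function(m) m$coefficients["(Intercept)", c("Estimate", "t value")]) %>%
  round(2)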

Next, we test whether all alphas are jointly different from zero. If that is the case, it implies that the market portfolio is not the efficient tangency portfolio. The following code just follows the equations outlined in the slides: MacKinlay (1987) and Gibbons, Ross, and Shanken (1989) developed the finite-sample distribution of the test statistic \(J\), which yields \[J = \frac{T-N-1}{N}\left(1 + \frac{\hat\mu_m^2}{\hat\sigma_m^2}\right)^{-1}\hat\alpha'\hat\Sigma^{-1}\hat\alpha\] where \(\hat\Sigma = Cov(\hat\varepsilon)\) and \(\hat\mu_m\), \(\hat\sigma_m^2\) denote the sample mean and variance of the market excess returns.

# Perform Wald test
N <- ncol(Z)
T <- nrow(Z)
mle_sigma <- cov(Z - X %*% mle_estimate)
test_statistic <- (T-N-1)/N * (1 + mean(Z_m)^2/sd(Z_m)^2)^(-1) * alphas %*% solve(mle_sigma) %*% t(alphas)
test_statistic
##          [,1]
## [1,] 32.93979

Under the null hypothesis, \(J\) is unconditionally distributed as a central \(F\) with \(N\) degrees of freedom in the numerator and \(T-N-1\) degrees of freedom in the denominator. R provides extensive functions to sample from probability distributions, to evaluate quantiles, and to evaluate the probability mass at a specific point. You can check ?rnorm for a description based on the normal distribution.

1 - pf(test_statistic, N, T - N - 1) # Compute p-value
##      [,1]
## [1,]    0
qf(0.95, N, T - N - 1) # 95% critical value

The test clearly rejects the CAPM. In other words, the estimated alphas are too large to be consistent with our chosen proxy for the market portfolio being the efficient tangency portfolio.

As pointed out in the slides, the assumption of a factor structure in the returns substantially reduces the number of parameters to estimate. Recall from Exercise 2.3 how efficient portfolio weights for a desired expected return are computed.

efficient_portfolio <- function(mu, sigma, desired_return = 10/12){
  N <- length(mu)
  iota <- rep(1, N)
  wmvp <- solve(sigma) %*% iota
  wmvp <- wmvp/sum(wmvp)
  # Compute efficient portfolio weights for given level of expected return
  C <- as.numeric(t(iota)%*%solve(sigma)%*%iota)
  D <- as.numeric(t(iota)%*%solve(sigma)%*%mu)
  E <- as.numeric(t(mu)%*%solve(sigma)%*%mu)
  lambda_tilde <- as.numeric(2*(desired_return -D/C)/(E-D^2/C))
  weff <- wmvp + lambda_tilde/2*(solve(sigma)%*%mu - D/C*solve(sigma)%*%iota)
  return(weff)
}

Next we compute the moments of the return distribution in two ways.

# Compute sample moments
mu <- returns %>% select(-date, -mkt) %>% colMeans()
sigma <- returns %>% select(-date, -mkt) %>% cov()

# Same with CAPM
# Compute restricted betas
constrained_estimate <- solve(t(Z_m)%*%Z_m) %*% t(Z_m)%*%Z

mu_m <- mean(returns$mkt) # average market risk premium
rf_m <- mean(factors_ff$rf) # average risk-free rate

mu_capm <- rf_m + constrained_estimate * mu_m
mu_capm <- as.numeric(mu_capm)
sigma_eps <- apply(Z - Z_m %*% constrained_estimate, 2, var) # column-wise residual variances (diagonal of the CAPM-implied covariance)
sigma_m <- var(returns$mkt)

sigma_capm <- sigma_m * t(constrained_estimate) %*% constrained_estimate + diag(sigma_eps)

The resulting portfolio weights differ because the two sets of input moments differ:

w <- efficient_portfolio(mu, sigma)
w_capm <- efficient_portfolio(mu_capm, sigma_capm)

cbind(w, w_capm)
##              [,1]        [,2]
## NoDur  0.76066404 -0.17898124
## Durbl  0.19253393  0.29859232
## Manuf -0.24936416  0.43480983
## Enrgy  0.22018162  0.01389309
## HiTec  0.20148549  0.36365099
## Telcm  0.20022415 -0.23072028
## Shops  0.07288366  0.10376855
## Hlth   0.45092041 -0.03944451
## Utils -0.02311254 -0.09580907
## Other -0.82641661  0.33024032

3.4 Fama MacBeth Regressions

3.4.1 Exercises

This exercise asks you to implement Fama-MacBeth regressions. For a more detailed theoretical treatment, check out the corresponding chapter of the textbook The Econometrics of Financial Markets (available in Absalon).

  • Before starting with implementing the regression, write a short function to winsorize a variable. Winsorizing means to simply replace values which are above or below a pre-specified quantile with that quantile. It is common in the asset pricing literature to winsorize the independent variable before running regressions so it is a good thing to implement winsorization once.
  • Estimate the market risk premium based on Fama MacBeth regressions. To do this, for each cross-section of returns (e.g., each month), project the realized returns on the estimated betas from the previous exercise and then aggregate the estimates in the time dimension. That is, for each month \(t\), estimate \[Z_t = \gamma_{0,t}\iota + \gamma_{1,t}\beta + \eta_t\] and then compute the time-series mean of \(\hat\gamma_{0,t}\) and \(\hat\gamma_{1,t}\).

3.4.2 Solutions

The methodology of winsorizing is straightforward to implement (although also somewhat arbitrary). The function below implements winsorizing in a pipeable (%>%) fashion.

winsorize <- function(x, cut = 0.005){
  cut_point_top <- quantile(x, 1 - cut, na.rm = TRUE)
  cut_point_bottom <- quantile(x, cut, na.rm = TRUE)
  i <- which(x >= cut_point_top)
  x[i] <- cut_point_top
  j <- which(x <= cut_point_bottom)
  x[j] <- cut_point_bottom
  return(x)
}

# prepare and winsorize data
data_nested <- betas_monthly %>%
  drop_na(ret_excess_f1, beta) %>%
  group_by(date) %>%
  mutate(beta = winsorize(beta)) %>%
  nest()

Note that the tibble betas_monthly is the output of the rolling window estimation in Exercise 2.6. After winsorizing we start with the “first” step of the Fama-MacBeth regression. For each cross-section (that is, for each month) we regress the (one-month-ahead) excess returns on the (pre-estimated) beta. Note the use of the tidyverse package broom which provides an easy way to extract summary statistics for the individual regressions. If it is unclear what nest() and map() are doing, check out the chapter Many models in Hadley Wickham’s book.

# perform cross-sectional regressions for each month
cross_sectional_regs <- data_nested %>%
  mutate(model = map(data, ~lm(ret_excess_f1 ~ beta, data = .x))) %>%
  mutate(tidy = map(model, broom::tidy),
         n = map_dbl(model, stats::nobs))

# extract average coefficient estimates
fama_macbeth_coefs <- cross_sectional_regs %>%
  unnest(tidy) %>%
  group_by(term) %>%
  summarize(coefficient = mean(estimate))

Above we compute the time-series averages of the estimated regression coefficients \(\hat\gamma_0\) and \(\hat\gamma_1\). The remaining code implements Newey-West standard errors for the average coefficient estimates and proceeds with some housekeeping to present the regression results in a nice way.

# compute newey-west standard errors of average coefficient estimates
newey_west_std_errors <- cross_sectional_regs %>%
  unnest(tidy) %>%
  group_by(term) %>%
  arrange(date) %>%
  group_modify(~enframe(sqrt(diag(sandwich::NeweyWest(lm(estimate ~ 1, data = .x), lag = 6))))) %>%
  select(term, nw_std_error = value)

# put coefficient estimates and standard errors together and compute t-statistics
fama_macbeth_coefs <- fama_macbeth_coefs %>%
  left_join(newey_west_std_errors, by = "term") %>%
  mutate(nw_t_stat = coefficient / nw_std_error) %>%
  select(term, coefficient, nw_t_stat) %>%
  pivot_longer(cols = c(coefficient, nw_t_stat), names_to = "statistic") %>%
  mutate(statistic = paste(term, statistic, sep = " ")) %>%
  select(-term)

# extract average number of observations
fama_macbeth_stats <- cross_sectional_regs %>%
  ungroup() %>%
  summarize(n = mean(n)) %>%
  pivot_longer(n, names_to = "statistic")

# combine desired output and return results
out <- rbind(fama_macbeth_coefs, fama_macbeth_stats)

out %>% knitr::kable(digits = 2, "pipe")
| statistic               |   value |
|:------------------------|--------:|
| (Intercept) coefficient |    0.79 |
| (Intercept) nw_t_stat   |    3.94 |
| beta coefficient        |    0.08 |
| beta nw_t_stat          |    0.68 |
| n                       | 2692.88 |

What do the results show? On the one hand, the intercept is positive and significantly different from zero, which already contradicts the CAPM prediction of a zero intercept. On the other hand, the coefficient on the market beta, which corresponds to the market risk premium, is not significantly different from zero. Although slightly positive (which is what we would expect from a priced risk factor), these regression results indicate that the market factor is not priced.

3.5 “A comprehensive look at the empirical performance of equity premium prediction”

In this easy exercise you will familiarize yourself with the influential paper “A comprehensive look at the empirical performance of equity premium prediction” by Amit Goyal and Ivo Welch (2008). On his homepage, Amit Goyal provides the underlying data up to 2019. Your task is to download, clean, and prepare the data to replicate the main results in the paper. We will then use this dataset as a benchmark example to get started with the implementation of various machine learning methods for equity premium prediction.

3.5.1 Exercises

  • Download the updated data (up to 2019) from Amit Goyal’s homepage and read-in the monthly observations. Read section 1.1 in the paper carefully and try to understand how the variables are created. More specifically, you should create the following variables (taken from the paper):
    1. (svar): Stock Variance is computed as sum of squared daily returns on S&P 500. Daily returns for 1871 to 1926 are obtained from Bill Schwert while daily returns from 1926 to 2005 are obtained from CRSP
    2. Cross-Sectional Premium (csp): The cross-sectional beta premium measures the relative valuations of high- and low-beta stocks. We obtained this variable directly from Sam Thompson. This variable is available from May 1937 to December 2002
    3. Net Equity Expansion (ntis) is the ratio of twelve-month moving sums of net issues by NYSE listed stocks divided by the total market capitalization of NYSE stocks
    4. Long Term Yield (lty): Long-term government bond yields for the period 1919 to 1925 is the U.S. Yield On Long-Term United States Bonds series from NBER’s Macrohistory database. Yields from 1926 to 2005 are from Ibbotson’s Stocks, Bonds, Bills and Inflation Yearbook
    5. Long Term Rate of Return (ltr): Long-term government bond returns for the period 1926 to 2005 are from Ibbotson’s Stocks, Bonds, Bills and Inflation Yearbook
    6. The Term Spread (tms) is the difference between the long term yield on government bonds and the T-bill
    7. Inflation (infl): Inflation is the Consumer Price Index (All Urban Consumers) for the period 1919 to 2005 from the Bureau of Labor Statistics. Because inflation information is released only in the following month, in our monthly regressions we inserted one month of waiting before use.
  • Create a tibble which contains the time-series of dependent variables (stock excess returns) and the host of independent variables. The sample period should start from 1957.

3.5.2 Solution

The data is provided at http://www.hec.unil.ch/agoyal/docs/PredictorData2019.xlsx. You can download it, store it locally, and then read it in, e.g., with the function read_xlsx from the readxl package. Alternatively, you can perform all these steps from within R by just calling

# install.packages("rio")
welch_data <- rio::import("http://www.hec.unil.ch/agoyal/docs/PredictorData2019.xlsx")

If you stored the file locally instead, you can read it in with the following line of code.

welch_data <- readxl::read_xlsx("PredictorData2019.xlsx")

Next we perform the usual cleaning steps: the column yyyymm comes in an inconvenient format and it may require some time to figure out how to compute the various variables.

library(tidyverse)
library(lubridate)

welch_data <- welch_data %>%
  mutate(yyyymm = ymd(paste0(yyyymm,"01"))) %>%
  rename("date" = "yyyymm") %>%
  mutate_at(vars(-date), ~as.numeric(.)) %>%
  filter(date >= "1957-03-01")

welch_data <- welch_data  %>%
  mutate(IndexDiv = Index + D12,
         logret = log(IndexDiv) - log(lag(IndexDiv)),
         Rfree = log(Rfree + 1),
         rp_div = logret - Rfree,
         dp = log(D12) - log(Index),               # Dividend Price ratio
         dy = log(D12) - log(lag(Index)),          # Dividend Yield
         ep = log(E12) - log(Index),               # Earnings Price ratio
         de = log(D12) - log(E12),                 # Dividend Payout Ratio
         tms = lty - tbl,                           # Term Spread
         dfy = BAA - AAA) %>%                      # Default yield spread
  select(date, rp_div, dp, dy, ep, de, svar, csp, bm = `b/m`, ntis, tbl, lty, ltr, tms, dfy, infl) %>%
  na.omit()

3.6 Shrinkage Estimation

In this exercise you are supposed to familiarize yourself with the details behind shrinkage regression methods such as Ridge and Lasso. Although R provides amazing interfaces which perform the estimation flawlessly, you are first asked to implement Ridge and Lasso regression estimators from scratch before moving on to using the package glmnet next.

3.6.1 Exercises

  • Write a function that requires three inputs, y (a \(T\) vector), X (a \((T \times K)\) matrix), and lambda and which returns the Ridge estimator (a \(K\) vector) for given penalization parameter \(\lambda\). Recall that the intercept should not be penalized. Therefore, your function should allow to indicate whether \(X\) contains a vector of ones as first column which should be exempt from the \(L_2\) penalty.
  • Compute the \(L_2\) norm (\(\beta'\beta\)) of the regression coefficients based on the predictive regression from the previous exercise for a range of \(\lambda\)’s and illustrate the effect of the penalization in a suitable figure.
  • Now, write a function that requires three inputs, y (a \(T\) vector), X (a \((T \times K)\) matrix), and lambda, and which returns the Lasso estimator (a \(K\) vector) for a given penalization parameter \(\lambda\). Recall that the intercept should not be penalized. Therefore, your function should allow you to indicate whether \(X\) contains a vector of ones as first column which should be exempt from the \(L_1\) penalty.
  • After you are really sure you understand what Ridge and Lasso regression are doing, familiarize yourself with the documentation of the package glmnet. It is a thoroughly tested and well-established package that provides efficient code to compute penalized regression coefficients not only for Ridge and Lasso but also for combinations thereof, commonly called elastic nets.
  • The choice of the hyperparameter \(\lambda\) should be based on the predictive performance within a validation sample. Split the entire sample into a training, a validation, and a final test set. The test set should not be used for any computations but only to evaluate the out-of-sample predictive performance. One unified framework for such machine learning methods is the still very experimental tidymodels library which consists of a number of packages that aim to improve machine learning tasks. You are advised to familiarize yourself with their great online tutorials to learn more about tidymodels. Use the sample split to choose the optimal penalty parameter \(\lambda\) for Ridge regression and show that penalization provides effective means against overfitting.

3.6.2 Solutions

  • As for OLS, the objective is to minimize the sum of squared errors \((y - X\beta)'(y - X\beta)\), now under the condition that \(\sum\limits_{j=2}^K \beta_j^2 \leq t(\lambda)\) if an intercept is present. This can be rewritten as \(\beta'A\beta \leq t(\lambda)\) where \(A = \begin{pmatrix}0 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1\end{pmatrix}\). Otherwise, the condition is simply that \(\beta'I_K\beta \leq t(\lambda)\) where \(I_K\) is an identity matrix of size \((K \times K)\). The resulting closed-form estimator, implemented below, is \(\hat\beta(\lambda) = (X'X + \lambda A)^{-1}X'y\).
ridge_regression <- function(y, X, lambda = 0, intercept = FALSE){
  K <- ncol(X)
  A <- diag(K)  
  if(intercept){
    A[1,] <- 0
  }
  coeffs <- solve(t(X)%*%X + lambda * A) %*%t(X)%*%y
  return(coeffs)
}

Below, I apply the function to the data set from the previous exercise to illustrate the output of the function ridge_regression().

y <- welch_data$rp_div
X <- welch_data %>% select(-date, -rp_div) %>% as.matrix()

# OLS for the Welch data fails because X is not of full rank
c(ncol(X), Matrix::rankMatrix(X))
## [1] 14 12
rc <- c(NA, ridge_regression(y, X, lambda = 1))

# Add an intercept term
rc_i <- ridge_regression(y, cbind(1, X), lambda = 1, intercept = TRUE)
cbind(rc, rc_i)
##                rc             
##                NA  0.068450076
## dp   -0.239343694 -0.231886874
## dy    0.361579150  0.364407261
## ep   -0.124960781 -0.118116681
## de   -0.114382913 -0.113770193
## svar -0.014811852 -0.014348260
## csp   0.001835536  0.002063538
## bm   -0.004624298 -0.026906940
## ntis -0.021298568 -0.017659394
## tbl  -0.013561551 -0.018478176
## lty  -0.001868527 -0.008946509
## ltr   0.059168150  0.058178639
## tms   0.011693024  0.009531667
## dfy   0.005863087  0.006118038
## infl -0.003994673 -0.003826009

Finally, the following code sequence computes the \(L_2\) norm of the ridge regression coefficients for a given \(\lambda\). Note that in order to implement this evaluation in a tidy manner, vectorized functions are important! R provides great ways to vectorize a function, to learn more, Hadley Wickham’s book Advanced R is a highly recommended read!

l2_norm <- function(lambda) sum(ridge_regression(y = y, X = X, lambda =  lambda)^2) # small helper function to extract the L_2 norm of the ridge regression coefficients

l2_norm <- Vectorize(l2_norm) # In order to use the function within a `mutate` operation, it needs to be vectorized

seq(from = 0.01, to =20, by =0.1) %>% 
  as_tibble() %>% 
  mutate(norm = l2_norm(value)) %>% 
  ggplot(aes(x=value, y = norm))+ geom_line() +
  labs(x = "Lambda", y = " L2 norm")

Figure 3.6: Ridge regression coefficients for different penalty terms

To compute the coefficients of a linear regression with a penalty on the sum of the absolute values of the regression coefficients, numerical optimization routines are required. Recall, the objective is to minimize the sum of squared residuals, \(\hat\varepsilon'\hat\varepsilon\), under the constraint that \(\sum\limits_{j=2}^K|\beta_j|\leq t(\lambda)\). Make sure you familiarize yourself with the way numerical optimization in R works: we first define the objective function (objective_lasso()) which has the parameter we aim to optimize as its first argument. The main function lasso_regression() then only calls this helper function.

objective_lasso <- function(beta, y, X, lambda, intercept){
  residuals <- y - X%*%beta
  sse <- sum(residuals^2)
  penalty <- sum(abs(beta))
  if(intercept){
    penalty <- penalty - abs(beta[1])
  }
  return(sse + lambda * penalty)
  }

lasso_regression = function(y, X, lambda = 0, intercept = FALSE){
  K <- ncol(X)
  beta_init <- rep(0, K)
  return(optim(par = beta_init, fn = objective_lasso, y = y, X = X, lambda = lambda, intercept = intercept)$par)
}

rc <- c(NA, lasso_regression(y, X, lambda = 0.01))

# Add an intercept term
rc_i <- lasso_regression(y, cbind(1, X), lambda = 0.01, intercept = TRUE)
cbind(rc, rc_i)
##                 rc         rc_i
##  [1,]           NA -0.015481329
##  [2,] -0.002309567 -0.007432227
##  [3,]  0.018950587  0.007577932
##  [4,] -0.019328955 -0.007565038
##  [5,] -0.041474430 -0.018622341
##  [6,] -0.873890375 -1.668871810
##  [7,]  0.473858053  0.398850112
##  [8,]  0.004826174  0.008262646
##  [9,] -0.239587377 -0.405821201
## [10,]  0.002021899 -0.072684496
## [11,] -0.416048940 -0.374240823
## [12,]  0.256429715  0.368949682
## [13,]  0.335641082  0.226014373
## [14,]  0.230499721  1.327395136
## [15,]  0.011043346  0.019841244

Finally, as in the previous example with Ridge regression I illustrate how a larger penalization term \(\lambda\) affects the \(L_1\) norm of the regression coefficients.

l1_norm <- function(lambda) sum(abs(lasso_regression(y = y, X = X, lambda =  lambda)))
l1_norm <- Vectorize(l1_norm) # In order to use the function within a `mutate` operation, it needs to be vectorized

seq(from = 0, to =0.5, by =0.05) %>% 
  as_tibble() %>% 
  mutate(norm = l1_norm(value)) %>% 
  ggplot(aes(x=value, y = norm))+ geom_line() +
  labs(x = "Lambda", y = " L1 norm")

Figure 3.7: Lasso regression coefficients for different penalty terms

While the code above should work, there are well-tested R packages available that provide a much more reliable and faster implementation. Thus, you can safely use the package glmnet. As a first step, the following code creates the sequence of regression coefficients for the Goyal-Welch dataset, which should be identical to the example we discussed in the lecture slides.

library(glmnet)
# Lasso and Ridge regression

fit_ridge <- glmnet(X, y, # Model
                    alpha = 0)

broom::tidy(fit_ridge) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x=lambda, y = estimate, color = term)) +
  geom_line() +
  geom_hline(data = . %>% filter(lambda == min(lambda)), aes(yintercept = estimate, color = term), linetype = "dotted") + 
  theme_minimal() +
  scale_x_log10() +
  labs(x = "Lambda", y = "Estimate")

Figure 3.8: Ridge regression coefficients for different penalty terms

Note the function argument alpha. glmnet() allows for a more flexible combination of \(L_1\) and \(L_2\) penalization of the regression coefficients. The pure ridge estimation is implemented with alpha = 0, while the lasso requires alpha = 1.

fit_lasso <- glmnet(X, y, # Model
                       alpha = 1)

broom::tidy(fit_lasso) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x=lambda, y = estimate, color = term)) +
  geom_line() +
  geom_hline(data = . %>% filter(lambda == min(lambda)), aes(yintercept = estimate, color = term), linetype = "dotted") + 
  theme_minimal() +
  scale_x_log10() +
  labs(x = "Lambda", y = "Estimate")

Figure 3.9: Lasso regression coefficients for different penalty terms

Note in the figure how Lasso discards all variables for high values of \(\lambda\) and then gradually incorporates more predictors. It seems like stock variance is selected first. As expected, for \(\lambda \rightarrow 0\), the lasso coefficients converge towards the OLS estimates (illustrated with dotted lines).

Next we split the sample using the function initial_time_split from the rsample package (you can just install tidymodels to get access to all the related packages). The split takes the last 20% of the data as the test set, which will not be used for any model tuning. The remaining sample is split into two even parts which are used for hyperparameter selection.

#install.packages("tidymodels")
library(tidymodels)

data <- rsample::initial_time_split(welch_data, prop = 0.8)

test_file <- rsample::testing(data) # Don't touch again until you have selected your parameters!
data <- rsample::initial_time_split(training(data), prop = 0.5)

training_x <- training(data) %>% select(-date, -rp_div) %>% as.matrix()
validation_x <- testing(data) %>% select(-date, -rp_div) %>% as.matrix()

Next I perform the typical hyperparameter tuning task: The idea is to choose a grid of possible values for the hyperparameter (in this case, \(\lambda\)). Then, fit the model using the training set and evaluate the goodness of fit (e.g. the mean-squared prediction error) on the validation set. The hyperparameter that provides the best fit on the validation set is the one you choose.

# Compute Lasso for range of lambdas

lambdas <- fit_ridge$lambda
lambdas <- lambdas[lambdas > min(lambdas[lambdas != min(lambdas)])] # drop the two smallest values of the lambda grid

accuracy <- purrr::map_df(lambdas, function(k){
  fit <- glmnet(training_x, training(data)$rp_div, lambda = k, alpha = 0)
  y_hat <- predict(fit, training_x)
  training_error <- mean((y_hat - training(data)$rp_div)^2)
  
  y_hat <- predict(fit, validation_x)
  validation_error <- mean((y_hat - testing(data)$rp_div)^2)
  
  tibble(lambda = k, 
         train = training_error, 
         validation = validation_error)
})

p1 <- accuracy %>% 
  pivot_longer(-lambda, names_to = "Model") %>% 
  filter(lambda > min(lambda)) %>%
  ggplot(aes(x = lambda, y = value, color = Model)) + 
  theme_minimal() + 
  geom_point() +
  scale_x_log10() +
  labs(x = latex2exp::TeX("Penalization ($\\lambda$)"),
       y = "MSPE")

final_fit <- glmnet(training_x, training(data)$rp_div, 
              lambda = accuracy %>% filter(validation == min(validation)) %>% pull(lambda), 
              alpha = 0)
y_hat_final_fit <- predict(final_fit, test_file %>% select(-date, -rp_div) %>% as.matrix())
p1 + geom_hline(yintercept = mean((y_hat_final_fit - test_file$rp_div)^2))

Figure 3.10: Hyperparameter tuning

The figure illustrates two main insights: First, imposing penalization harms the fit within the training set; the more we penalize, the worse the in-sample fit gets. This is not true, however, for the validation set: for less restrictive models, overfitting becomes an issue which hurts the out-of-sample performance of the ridge regression. Instead, the validation prediction error is smallest for a hyperparameter \(\lambda>0\)! The black line shows the out-of-sample prediction error on the test set, which has not been used at all for hyperparameter tuning and thus resembles an estimate of the actual predictive performance of the ridge regression.

3.7 Regression trees and random forests

Gu, Kelly and Xiu (2019) propose regression trees and especially random forests as suitable tools for factor and predictor selection in asset pricing. Basic regression trees partition a data set into smaller subgroups and then fit a simple constant for each observation in the subgroup: the predicted constant is the average response value of all observations that fall in that subgroup. First, it is important to realize that the partitioning of variables is done in a top-down, greedy fashion. This just means that a partition performed earlier in the tree will not change based on later partitions. But how are these partitions made?
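
Formally, for a single split the tree searches over predictor indices \(j\) and thresholds \(s\) to minimize the total within-partition sum of squared errors,
\[\min_{j,\,s}\;\sum_{i:\,x_{ij} < s}\left(y_i - \bar{y}_1\right)^2 + \sum_{i:\,x_{ij} \geq s}\left(y_i - \bar{y}_2\right)^2,\]
where \(\bar{y}_1\) and \(\bar{y}_2\) denote the average response within the two resulting partitions. This is exactly the objective you will implement manually in the solutions below.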

R packages provide very efficient algorithms to generate predictions based on random forests, but in this exercise you will first be asked to implement a simple function which builds a tree with predetermined depth. Afterwards, we will use packages like rpart or randomForest to increase the efficiency of the estimation process.

3.7.1 Exercises

  • Write a function that takes a vector y and a matrix of predictors X as inputs and returns a characterization of the relevant parameters of a regression tree with 1 branch.
  • Create a function that allows you to generate predictions for a new matrix of predictors newX based on the estimated regression tree. The slide "How do regression trees work?" provides a step-by-step description of how to do this.
  • Use the package rpart to grow a tree based on the Goyal-Welch data set as shown in the lecture
  • Make use of a training and a test set to choose the optimal depth (number of sample splits) of the tree
  • Grow a random forest and evaluate the predictive performance relative to the Lasso regression above

3.7.2 Solutions

Coding a regression tree manually is somewhat cumbersome but good practice to make sure you really understand what is required for this task. In the following, I first define a function rss which, in its most basic form, just takes two vectors xsort and y as input. The input threshold then defines a partition of the observations based on whether \(x_i \geq \text{threshold}\) or not. Then, I compute the residual sum of squares (note: we do not standardize by \(T\), to avoid giving relatively more weight to small clusters).

#objective function: find threshold to minimize overall sum of squared errors
rss <- function(threshold, xsort, y, optim = TRUE){
    partition_1 <- y[xsort >= threshold]
    partition_2 <- y[xsort < threshold]
    m1 <- sum((partition_1 - mean(partition_1))^2)
    m2 <- sum((partition_2 - mean(partition_2))^2)
    
    if(optim) return(m1 + m2)
    if(!optim) return(c(threshold, mean(partition_1), mean(partition_2)))
}

Next, we search for the variable on which the partition split should occur. For that purpose we simply loop through all columns of X and use optimize, R's built-in routine for univariate optimization, to identify the RSS-minimizing break point for each variable.

val <- list()
for(j in 1:ncol(X)){
  # Find the RSS-minimizing threshold for each candidate splitting variable
  sol <- optimize(f = rss, lower = quantile(X[, j], 0.01), upper = quantile(X[, j], 0.99), xsort = X[, j], y = y)  
  val[[j]] <- sol
}

The remainder is housekeeping: identify the variable at which the sample split generates the minimum RSS. Our final function rf_predict then uses the output of our search algorithm and a new matrix newX as input to derive new predictions.

selected_variable <- val %>% 
  bind_rows() %>% 
  mutate(variable = colnames(X),
         id = 1:ncol(X)) %>% 
  filter(objective == min(objective))


rf_predict <- function(newX, tree = selected_variable) {
  # Recover the threshold and the two partition means implied by the selected split
  relevant_values <- rss(tree$minimum,
                         X[, tree$id],
                         y = y,
                         optim = FALSE)
  
  newx <- newX[, tree$id]
  output <- newx
  output[newx >= relevant_values[1]] <- relevant_values[2]
  output[newx < relevant_values[1]] <- relevant_values[3]
  return(output)
}

To illustrate the resulting predictions, I sample 300 observations from our original dataset and let rf_predict generate return predictions. As we have only one splitting variable, every observation is assigned to one of the two partitions. Thus, this very simple model generates only two different predictions.

newX <- X[sample.int(nrow(X), 300), ] # draw 300 random rows from the predictor matrix
kableExtra::kable(table(rf_predict(newX)))
Var1                    Freq
-0.0228835360507263       17
0.00325337128357212      283

Before moving forward to the packages rpart and randomForest, make sure you know how you would use the code above to introduce a second branch!
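
To give you an idea, here is a minimal sketch (not a full implementation) of how a second branch could be grown: rerun the same search within each of the two partitions implied by the first split, reusing rss, X, y and selected_variable from above.

# Indicator for the partition implied by the first split
first_split <- X[, selected_variable$id] >= selected_variable$minimum

# Helper: find the best single split within a subsample
grow_branch <- function(X_sub, y_sub) {
  val <- list()
  for (j in 1:ncol(X_sub)) {
    val[[j]] <- optimize(f = rss,
                         lower = quantile(X_sub[, j], 0.01),
                         upper = quantile(X_sub[, j], 0.99),
                         xsort = X_sub[, j], y = y_sub)
  }
  val %>%
    bind_rows() %>%
    mutate(variable = colnames(X_sub), id = 1:ncol(X_sub)) %>%
    filter(objective == min(objective))
}

# One additional split within each of the two partitions
branch_high <- grow_branch(X[first_split, , drop = FALSE], y[first_split])
branch_low  <- grow_branch(X[!first_split, , drop = FALSE], y[!first_split])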

Regression trees with R are straightforward to implement and to understand. See the code below which has been used in the lecture.

#install.packages("rpart")
library(rpart)
model <- rpart(
  rp_div ~ ., 
  data = welch_data %>% select(-date), 
  control = rpart.control(maxdepth = 3))

plot(model, compress = TRUE)
text(model, cex = 0.7, fancy = TRUE, all = TRUE)

Figure 3.11: Regression tree illustration

Similar to the hyperparameter tuning for the Lasso, the depth of the tree is a parameter which can be chosen optimally using a training and a validation dataset. Below, I implement such a simple search over a grid of possible numbers of splits.

# Regression tree hyperparameter tuning
depth <- 1:15
accuracy <- purrr::map_df(depth, function(k){
  fit <- rpart(
    rp_div ~ ., 
    data = training(data) %>% select(-date), 
    control = rpart.control(maxdepth = k))
  
  y_hat <- predict(fit, training(data))
  training_error <- mean((y_hat - training(data)$rp_div)^2)
  
  y_hat <- predict(fit, testing(data))
  validation_error <- mean((y_hat - testing(data)$rp_div)^2)
  
  tibble(depth = k, 
         train = training_error, 
         validation = validation_error)
})
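
Analogous to the selection of \(\lambda\) above, the depth with the smallest validation error can then be extracted (a small sketch):

best_depth <- accuracy %>%
  filter(validation == min(validation)) %>%
  pull(depth) # depth that minimizes the validation error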

Finally, the code below generates the random forest predictions presented during the lecture.

# Random Forests
library(randomForest)
set.seed(2021)
fit <- randomForest(
  rp_div ~.,  
  data = training(data) %>% select(-date), 
  xtest = testing(data) %>% select(-date, -rp_div), 
  ytest = testing(data) %>% pull(rp_div),
  keep.forest=TRUE)

y_hat_random_forest <- predict(fit, test_file)
test_error <- mean((y_hat_random_forest - test_file$rp_div)^2)

p1 + 
  geom_hline(yintercept = mean((y_hat_final_fit - test_file$rp_div)^2)) + 
  geom_hline(yintercept = test_error, color = "gray")

Figure 3.12: Mean-squared prediction error: random forest

3.8 Neural networks

  • Fitting neural networks is a tedious job because neural networks are among the most heavily parametrized statistical methods. Luckily, it has become remarkably easy in recent years to exploit the rich coding infrastructure of tensorflow in R. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well. To get started, you will need some further system adjustments. For that purpose, you can follow this easy tutorial; a rough sketch of the typical commands follows after this list.
  • For a lot of very interesting applications, you are advised to read through at least a few of these exemplary sessions which provide an overview of the rich infrastructure you can exploit.
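
As a rough sketch, the setup typically boils down to the commands below; the exact steps depend on your system, so treat this only as an indication and follow the tutorial if anything fails.

# install.packages("keras")   # R interface to Keras/TensorFlow
# library(keras)
# install_keras()             # sets up a Python environment with TensorFlow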

3.8.1 Exercises

  • Use keras to initialize a sequential neural network which can take the 14 predictors from the Goyal-Welch dataset as input, contains at least one hidden layer, and generates continuous predictions. This sounds harder than it is: see a simple regression example here. How many parameters does the neural network you aim to fit have?
  • Next, compile the object. It is important that you specify a loss function. In our case, we focus on the mean squared error (“mse”).
  • Finally, fit the neural network. This may take some time. Make sure the optimizer uses the training dataset for fitting and the validation set for hyperparameter tuning. What is the mean squared prediction error of your first fitted neural network on the Goyal-Welch dataset? Play around with the network architecture, node dropout, or parameter penalization to see if you can beat the Lasso regression.

3.8.2 Solutions

While the package keras provides an extraordinary API (which even uses the pipe %>%), the setup costs to implement neural networks are nevertheless substantial, especially if you have never used tensorflow before. In the following, I provide a minimal code example of how to initialize and then how to compile a model, but there is no doubt that in order to really familiarize yourself with the subject matter, the tutorials and examples linked to above will be very important.

keras divides the work of training a neural network into three steps: defining the architecture, compiling, and training. To give you some idea of the architecture of the network, I added an example below. The object model represents a sequential neural network (there are no recurrent elements included, but keras could handle this as well!). First, the input layer should accept vectors of length 14, the number of predictors used in the Goyal-Welch dataset. Then, I include a hidden layer with 64 units and the ReLU activation function. As discussed in class, common tools to avoid overfitting are to either introduce regularization on the parameters (which I implement for the second hidden layer) or random dropout, which sets nodes to zero in order to avoid putting too much weight on irrelevant nodes. Both features can easily be implemented with keras as you see below. Recall that we want to predict market premia; the output of the network should thus be a simple one-unit layer with a linear activation function. If you would like to perform classification tasks, the final layer could instead include multiple nodes (e.g. 10 nodes if you want to predict the value of hand-written digits based on the MNIST dataset). Alternatively, if you want to predict binary outcomes, another activation function could make sense as well.

library(keras)

model <- keras_model_sequential() %>% 
  layer_flatten(input_shape = 14) %>% 
  layer_dense(units = 64, activation = "relu") %>%
#  layer_dropout(0.6) %>%
  layer_dense(units = 16, 
              activation = "relu", 
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dense(1)

summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## flatten (Flatten)                   (None, 14)                      0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 64)                      960         
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 16)                      1040        
## ________________________________________________________________________________
## dense (Dense)                       (None, 1)                       17          
## ================================================================================
## Total params: 2,017
## Trainable params: 2,017
## Non-trainable params: 0
## ________________________________________________________________________________

Next, the model is compiled into machine-readable code which will be handled by tensorflow. The compile function requires you to specify the loss function; in our case the most natural loss function is mse, the mean squared error. When dealing with categorical predictions, one may consider other loss functions, e.g. based on the cross-entropy of the predictions. The option metrics just specifies what kind of output the console should provide during the (often very time-consuming) training process.

model %>% 
  compile(
    loss = "mse",
    optimizer = optimizer_rmsprop(),
    metrics = "mean_absolute_error")

Final step: train the model. This works very easily with keras: just use the fit() function. Note how I pass the training and validation sets. The option epochs determines how many passes through the training data the optimizer makes while updating the parameters. Note that there is no analytic solution for the weights which minimize the loss; instead, numerical optimization procedures which are to some extent stochastic have been found to work very well for this class of training algorithms. The basic idea is that the model should be trained long enough such that the optimizer really reaches a (local) minimum. Note: unlike in the tidyverse, %>% fit() changes the object model in place, without the need to assign a new variable.

model %>% 
  fit(
    x = training_x, 
    y = training(data) %>% pull(rp_div),
    validation_data = list(validation_x, 
                           testing(data) %>% pull(rp_div)),
    epochs = 3000)

The prediction step is easy and similar to prediction with all other machine learning tools considered so far.

predictions <- predict(model, test_file %>% select(-date,-rp_div) %>% as.matrix())

p1 + 
  geom_hline(yintercept = mean((y_hat_final_fit - test_file$rp_div)^2)) + 
  geom_hline(yintercept = test_error, color = "gray") +
  geom_hline(yintercept = mean((predictions - test_file$rp_div)^2), color = "red")

Figure 3.13: Mean-squared prediction error: neural network

Note the final performance of the trained neural network in the test set - can you improve upon that?
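
One possible direction, as a sketch rather than a tuned model: activate a dropout layer and add early stopping so that training halts once the validation loss stops improving (the dropout rate and patience below are arbitrary choices).

model2 <- keras_model_sequential() %>% 
  layer_flatten(input_shape = 14) %>% 
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.4) %>% # randomly sets 40% of the nodes to zero during training
  layer_dense(units = 16, 
              activation = "relu", 
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dense(1)

model2 %>% 
  compile(loss = "mse",
          optimizer = optimizer_rmsprop(),
          metrics = "mean_absolute_error")

model2 %>% 
  fit(x = training_x, 
      y = training(data) %>% pull(rp_div),
      validation_data = list(validation_x, testing(data) %>% pull(rp_div)),
      epochs = 3000,
      callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 50)))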

3.9 Empirical Asset Pricing via Machine Learning

This exercise should help you fight your way through an actual academic application of machine learning methods in asset pricing. The exercises guide you step-by-step through replicating the empirical framework, and eventually also the main results, of the paper Empirical Asset Pricing via Machine Learning by Shihao Gu, Bryan Kelly and Dacheng Xiu.

  • Start reading the paper at Section 2 (“An empirical study of U.S. equities”). In their study, the authors aim to use different machine learning procedures to approximate the overarching empirical model \(E_t\left(r_{i, t+1}\right) = g^\star(z_{i,t})\) as defined in the paper. The returns are monthly total individual equity returns from CRSP for all firms listed in the NYSE, AMEX, and NASDAQ. The sample starts in March 1957 (the start date of the S&P 500) and ends in December 2016, totaling 60 years. Prepare the data from CRSP according to these requirements.
  • To calculate individual excess returns, the authors claim they “obtain the Treasury-bill rate to proxy for the risk-free rate”. Use the risk-free rate provided by Kenneth French’s data library to compute individual excess returns.
  • One major contribution of the paper is to implement the predictive framework with a huge collection of stock-level predictive characteristics based on the cross-section of stock returns literature. Table A.6 in the Internet Appendix provides the details of all these characteristics. You don’t have to generate each of the characteristics; instead, download the data from Dacheng Xiu’s homepage.

3.9.1 Exercises

  • Extract the downloaded data, read the readme file, and process the data the way the authors explain it in footnote 29: “We cross-sectionally rank all stock characteristics period-by-period and map these ranks into the [-1,1] interval following Kelly, Pruitt, and Su (2019) and Freyberger, Neuhierl, and Weber (2020).” Also, process the column sic2, which currently contains categorical data and should be transformed into a matrix with 74 columns: the values in the cells should be 1 if the permno corresponds to the specific industry classification and 0 otherwise.
  • Lastly, merge the dataset with the eight macroeconomic predictors following the variable definitions detailed in Welch and Goyal (2008), including dividend-price ratio (dp), earnings-price ratio (ep), book-to-market ratio (bm), net equity expansion (ntis), Treasury-bill rate (tbl), term spread (tms), default spread (dfy), and stock variance (svar).
  • In the original paper, the authors inflate the dataset by considering any possible interaction between macroeconomic variable and firm characteristic. How could you implement this? (Note: the dataset is huge already without interaction terms. For the purpose of this lecture, feel free to skip this final step.)

3.9.2 Solutions

First I start with the CRSP dataset. We already cleaned the raw file (provided on Absalon) and computed adjusted returns in a previous exercise. Note that I make use of the command vroom::vroom, which provides probably the fastest lazy read-in available in R so far.

# install.packages("vroom")
library(tidyverse) # for grammar
library(lubridate) # for working with dates
library(vroom) # To import REALLY large datasets

# Read in CRSP data
crsp <- vroom("data/raw_crsp.csv")

# Clean the file
crsp <- crsp %>%
  janitor::clean_names() %>%                 # Replace column names with lower case
  transmute(permno = as.integer(permno),     # security identifier
            date = ymd(date),                # month identifier
            ret = as.numeric(ret) * 100,     # return (convert to percent)
            shrout = as.numeric(shrout),     # shares outstanding (in thousands)
            altprc = as.numeric(altprc),     # last traded price in a month
            exchcd = as.integer(exchcd),     # exchange code
            shrcd = as.integer(shrcd),       # share code
            siccd = as.integer(siccd),       # industry code
            dlret = as.numeric(dlret) * 100, # delisting return (converted to percent)
            dlstcd = as.integer(dlstcd)      # delisting code
  ) %>%
  mutate(date = floor_date(date, "month"))   # consistent naming convention: YYYY-MM-DD for each time series

return_data <- crsp %>%
  filter(date > "1957-02-01",                    # Sample begins in March 1957 and ends in December 2016
         date <= "2016-12-31") %>%
  filter(shrcd %in% c(10, 11)) %>%               # keep only US listed stocks
  filter(exchcd %in% c(1, 31, 2, 32, 3, 33)) %>% # listed at NYSE, AMEX or NASDAQ
  distinct(permno, date, .keep_all = TRUE)%>%    # remove duplicates
  select(-shrcd) %>%
  mutate(ret.adj = case_when(!is.na(dlret) ~ dlret, # adjusted returns based on the procedure described in Bali, Engle and Murray
                             is.na(dlret) & (dlstcd %in% c(500, 520, 580, 584) |
                                               (dlstcd >= 551 & dlstcd <= 574)) ~ -30, # special delisting codes first ...
                             is.na(dlret) & !is.na(dlstcd) ~ -100,                     # ... all other missing delisting returns are set to -100%
                             TRUE ~ ret)) %>%
  select(permno, date, ret.adj) %>%
  na.omit()

Next, read in the file with the stock characteristics. You can check that it makes a huge difference whether you use read_csv() or vroom::vroom().
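
For instance, a purely illustrative timing comparison (assuming the file sits in your working directory); keep in mind that vroom reads lazily, so part of the work is deferred until the columns are actually used:

system.time(readr::read_csv("datashare.csv"))
system.time(vroom::vroom("datashare.csv"))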

# Data from Dacheng Xiu's homepage

predictor_data <- vroom("datashare.csv")
predictor_data <- predictor_data %>%
  mutate(DATE = ymd(DATE),
         DATE = floor_date(DATE, "month")) %>%
  rename("date" = "DATE")

return_data <- return_data %>%
  inner_join(predictor_data, by = c("date", "permno"))

The cross-sectional ranking may be somewhat time-consuming. The idea is really just that at each date the cross-section of each predictor is rank-transformed and rescaled so that the values lie in the interval [-1, 1].

# Cross-sectionally rank all characteristics period-by-period and map into the [-1, 1] interval

rank_transform <- function(x){
  rx <- rank(x, na.last = TRUE)
  non_nas <- sum(!is.na(x))
  rx[rx>non_nas] <- NA
  2*(rx/non_nas - 0.5)
}
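
A quick sanity check on a toy vector illustrates the mapping (missing values stay missing):

rank_transform(c(3, 1, 2, NA)) # returns approximately 1, -0.33, 0.33, NA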

clean_data <- return_data %>%
  group_by(date) %>%
  mutate_at(vars(-permno, -date, -ret.adj, -sic2), rank_transform)

One-hot encoding is a common task in machine learning. Instead of working with the industry classification codes in one column, we inflate this column such that for each value of sic2 we create a new column which is 1 if the permno-date combination indicates the firm operates within the specific industry and 0 otherwise. There are many ways to do this in R; below I simply use a creative version of pivot_wider for that task (an alternative sketch follows the code).

# One-hot encoding of SIC2 variable (for more in depth information see here: https://datatricks.co.uk/one-hot-encoding-in-r-three-simple-methods)

clean_data <- clean_data %>%
  pivot_wider(values_from = sic2, names_from = sic2, names_prefix = "sic.") %>%
  mutate_at(vars(contains("sic")), function(x) if_else(is.na(x), 0, 1)) %>%
  select(-sic.NA)
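
As an alternative, base R's model.matrix achieves the same one-hot encoding; a minimal sketch on a toy example (the objects toy and the sic. column names are purely illustrative):

toy <- tibble(permno = 1:4, sic2 = c(10, 20, 10, 35))
dummies <- model.matrix(~ factor(sic2) - 1, data = toy)          # one dummy column per industry code
colnames(dummies) <- paste0("sic.", levels(factor(toy$sic2)))
bind_cols(toy %>% select(permno), as_tibble(dummies))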

Now we merge everything with the Goyal-Welch dataset we have been working with before, to capture the relevance of macroeconomic variables. A lot of the processing below is just a repetition of the earlier preparation steps.

# Prepare Welch Goyal dataset (retrieved from http://www.hec.unil.ch/agoyal/docs/PredictorData2019.xlsx)
library(readxl) # for xlsx file import
# Read in dataset
welch_data <- readxl::read_xlsx("PredictorData2019.xlsx")

welch_data <- welch_data %>%
  janitor::clean_names() %>%
  mutate(yyyymm = ymd(paste0(yyyymm,"01"))) %>%
  rename("date" = "yyyymm") %>%
  mutate_at(vars(-date), ~as.numeric(.)) %>%
  filter(date >= "1957-03-01") %>%
  select(-crsp_s_pvw, -crsp_s_pvwx)

clean_data <- left_join(clean_data, welch_data, by = "date")
clean_data <- clean_data %>% ungroup()

clean_data <- clean_data %>%
  group_by(permno) %>%
  arrange(permno, date) %>%
  mutate(ret.adj = ret.adj - rfree) %>% # compute excess returns (the variable to predict)
  ungroup()

The code below replaces missing values of each characteristic with its cross-sectional median in that month (which, after the rank transformation, should be roughly zero). One of the coauthors of the paper actually also just states that NA values are set to zero.

# Replace missing values with the cross-sectional median at each month for each stock
replace_nas <- function(x){
  med <- median(x, na.rm = TRUE)
  x[is.na(x)] <- med
  x[is.na(x)] <- 0 # if the median itself is NA (the entire cross-section is missing), set the values to zero
  return(x)
}
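
Two toy calls illustrate why the second line inside the function is needed (it catches the case where an entire characteristic is missing in a given month):

replace_nas(c(-0.5, NA, 0.5)) # the NA is replaced by the median, here 0
replace_nas(c(NA, NA))        # if all values are missing, the median is NA as well, so everything is set to 0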

clean_data <- clean_data %>%
  group_by(date) %>%
  mutate_at(vars(-permno, -date, -ret.adj), replace_nas)

The final step is to generate interaction terms. This is a step which will probably break on most normal computers, and thus I do not execute the code below. On the computer cluster I usually use for large-scale computations, tasks like the one below work, however.

clean_data <- clean_data %>%
  ungroup() %>%
  select(permno, date, ret.adj, mvel1, beta, infl, ntis) %>% 
  pivot_longer(infl:ntis, names_to = "macro", values_to = "macro_value") %>% 
  pivot_longer(mvel1:beta) %>% 
  transmute(permno, date, ret.adj, 
            name = paste0(macro, "_", name), 
            s = macro_value * value) %>% 
  pivot_wider(names_from = name, values_from = s)

clean_data %>% ungroup() %>% write_rds("clean_sample.rds")