# 3 CRSP and Asset Pricing

## 3.1 Cleaning and preparing the CRSP sample

A large part of the empirical asset pricing literature focuses on US stock markets and, more specifically, on the extensive data on prices and firm characteristics provided by the Center for Research in Security Prices (CRSP). As a student at KU, you have access to their vast database. To get familiar with the dataset, I provide the file `raw_crsp.csv` on Absalon, which contains monthly observations for all stocks covered in the CRSP universe. The sample ranges from December 1925 until December 2019. A large part of the following computations is based on *Empirical Asset Pricing: The Cross Section of Stock Returns* (excerpts provided in Absalon) and a lot of the code has been prepared by my great colleague Christoph Scheuch. Moreover, we will make use of data provided on Kenneth R. French's homepage. I already downloaded the file `F-F_Research_Data_Factors.csv` for you and provide it on Absalon, but you are encouraged to take a deeper look at his very informative homepage.

### 3.1.1 Exercises

- Read in the file `raw_crsp.csv` and make yourself familiar with the structure of the data. Keep only observations with share code 10 or 11 (a documentation of the variables is provided here). Make sure there are no duplicates in your dataset.
- Replicate the two figures provided in the lecture slides: i) Create a time series of the number of stocks listed on NASDAQ, NYSE, and AMEX. ii) Illustrate the time series of total market values (inflation adjusted) based on the industry classification `siccd`. The book chapters in Absalon (Empirical Asset Pricing) provide a precise walk-through if you need help.
- Follow the procedure described in Chapter 7.2 of the textbook (Empirical Asset Pricing) to compute returns adjusted for delisting.
- Download and prepare time series for the risk-free rate and the market return from Kenneth French's data library. Illustrate the evolution of the risk-free rate over time.
- Create summary statistics for the cross-section of returns, excess returns, adjusted returns, and adjusted excess returns. There are many ways to do this; one way is to compute summary statistics for the cross-section on a monthly basis and then report the time-series averages of these values.
- Assume you invested 100 USD at the beginning of the sample in the market portfolio and reinvested all distributed dividends. Illustrate the time series of your wealth.

### 3.1.2 Solutions

**Note: the file is huge (roughly 4 million rows) and it may take some time to read in the entire file.**

```
library(tidyverse)
library(lubridate) # for working with dates
# Read in CRSP data
crsp <- read_csv("raw_crsp.csv", col_types = cols(.default = col_character()))
```

`read_csv` usually does a great job of recognizing the data types of the individual columns. To avoid any parsing issues, however, I have chosen to read in the file with the default that all columns contain character data and to change the column types in the next step. You may note that the dataset actually contains 64 variables, many more than we need for this exercise.

```
# Clean the file
crsp <- crsp %>%
  janitor::clean_names() %>% # replace column names with lower case
  transmute(permno = as.integer(permno), # security identifier
            date = ymd(date), # month identifier
            ret = as.numeric(ret) * 100, # return (convert to percent)
            shrout = as.numeric(shrout), # shares outstanding (in thousands)
            altprc = as.numeric(altprc), # last traded price in a month
            exchcd = as.integer(exchcd), # exchange code
            shrcd = as.integer(shrcd), # share code
            siccd = as.integer(siccd), # industry code
            dlret = as.numeric(dlret) * 100, # delisting return (convert to percent)
            dlstcd = as.integer(dlstcd)) %>% # delisting code
  mutate(date = floor_date(date, "month")) # consistent naming convention: YYYY-MM-DD for each time series
```

Next, as is common in the literature, we further restrict our sample and keep only US listed stocks. The command `distinct` removes any two rows which share the same `permno-date` combination and are thus duplicates. The option `.keep_all = TRUE` retains the remaining 8 columns.
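The effect of `distinct` with `.keep_all = TRUE` can be illustrated on a small toy tibble (hypothetical values, not CRSP data):

```r
library(dplyr)

# Two rows share the same permno-date combination
toy <- tibble(permno = c(1L, 1L, 2L),
              date = as.Date(c("2000-01-01", "2000-01-01", "2000-01-01")),
              ret = c(1.5, 9.9, -0.3))

# Keep one row per permno-date pair, retaining all other columns
deduped <- toy %>% distinct(permno, date, .keep_all = TRUE)
deduped
```

Note that `distinct` keeps the first occurrence in row order, so when duplicates differ in the remaining columns, the result depends on how the data is sorted.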

```
crsp <- crsp %>%
  filter(shrcd %in% c(10, 11)) %>% # keep only US listed stocks
  distinct(permno, date, .keep_all = TRUE) %>% # remove duplicates
  select(-shrcd)
```

As described in the textbook, stocks' listing exchanges are recorded in the field `exchcd`. Values `1` and `31` stand for NYSE listed stocks, `2` and `32` for AMEX, and `3` and `33` for NASDAQ. There are plenty of other trading venues; for the sake of brevity I aggregate the remaining ones as `Other`. Next, industry classifications on a broad level are provided by the `siccd` code. The classification below differentiates between 10 different industries.

```
crsp <- crsp %>%
  mutate(mktcap = abs(shrout * altprc) / 1000, # market cap in millions of dollars (shares outstanding * last reported trading price)
         mktcap = if_else(mktcap == 0, as.numeric(NA), mktcap), # replace stocks with 0 market cap with NA
         exchange = case_when(exchcd %in% c(1, 31) ~ "NYSE", # relabel exchange codes
                              exchcd %in% c(2, 32) ~ "AMEX",
                              exchcd %in% c(3, 33) ~ "NASDAQ",
                              TRUE ~ "Other"),
         industry = case_when(siccd >= 1 & siccd <= 999 ~ "Agriculture", # create industry codes
                              siccd >= 1000 & siccd <= 1499 ~ "Mining",
                              siccd >= 1500 & siccd <= 1799 ~ "Construction",
                              siccd >= 2000 & siccd <= 3999 ~ "Manufacturing",
                              siccd >= 4000 & siccd <= 4999 ~ "Transportation",
                              siccd >= 5000 & siccd <= 5199 ~ "Wholesale",
                              siccd >= 5200 & siccd <= 5999 ~ "Retail",
                              siccd >= 6000 & siccd <= 6799 ~ "Finance",
                              siccd >= 7000 & siccd <= 8999 ~ "Services",
                              siccd >= 9000 & siccd <= 9999 ~ "Public",
                              TRUE ~ "Missing"))
```

```
# Visualize number of securities (per exchange)
crsp %>%
  count(exchange, date) %>%
  ggplot(aes(x = date, y = n, group = exchange, linetype = exchange)) +
  geom_line(aes(color = exchange)) +
  labs(x = "", y = "Number of Securities", color = "Exchange", linetype = "Exchange") +
  scale_x_date(expand = c(0, 0),
               date_breaks = "10 years",
               date_labels = "%Y") +
  theme_classic()
```

The NYSE has by far the longest tradition on US equity markets; in more recent decades, however, NASDAQ has listed an increasing number of stocks.

As described in the lecture slides, we adjust for changes in purchasing power with a measure of the consumer price index. The data is provided on Absalon in exactly the same way I downloaded it. See the slides or the textbook for more information.

```
cpi <- read_csv("data/CPIAUCNS.csv")
```

```
# Clean CPI data
cpi <- cpi %>%
  rename("date" = DATE,
         "cpi" = CPIAUCNS) %>%
  mutate(cpi = cpi / dplyr::last(cpi))

# Join CPI data and CRSP sample
crsp <- crsp %>%
  left_join(cpi, by = "date")
```

Note that in the figure below I illustrate market capitalization in billions of USD, where all values represent purchasing power as of December 2020.
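The purchasing-power adjustment boils down to dividing nominal values by a CPI series normalized to its last observation; a minimal numeric sketch with made-up CPI values:

```r
library(dplyr)

# Hypothetical CPI levels; normalizing by the last value expresses everything
# in end-of-sample purchasing power
cpi_raw <- c(50, 75, 100)
cpi <- cpi_raw / dplyr::last(cpi_raw) # 0.50 0.75 1.00

nominal <- c(10, 10, 10) # constant nominal market cap
real <- nominal / cpi    # 20.0, 13.3, 10.0 in end-of-sample dollars
real
```

A constant nominal value thus shrinks in real terms as the price level rises toward the end of the sample.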

```
# Visualize market capitalization by industry
crsp %>%
  filter(!is.na(industry)) %>%
  group_by(industry, date) %>%
  summarize(mktcap = sum(mktcap / cpi, na.rm = TRUE) / 1000) %>%
  ggplot(aes(x = date, y = mktcap, group = industry)) +
  geom_line(aes(color = industry, linetype = industry)) +
  labs(x = "", y = "Total Market Value (Billions of Dec 2020 Dollars)",
       color = "Industry", linetype = "Industry") +
  scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") +
  theme_classic()
```

While stock returns are an easy concept to grasp, they are hard to measure once stocks undergo delisting procedures and publicly available quotes are no longer available. While CRSP provides some additional information on the form and reason of a delisting, it remains hard to measure stock returns during such periods. Bali, Engle, and Murray propose the following adjustment, which assigns a return of either -30% or -100%, depending on the reason for the delisting, in cases where CRSP does not provide a return during the delisting period.

```
# compute stock returns
crsp <- crsp %>%
  # ret.adj computes adjusted returns based on the procedure described in Bali, Engle, and Murray
  # note: the -30 branch must precede the -100 catch-all since case_when evaluates in order
  mutate(ret.adj = case_when(!is.na(dlret) ~ dlret,
                             is.na(dlret) & !is.na(dlstcd) &
                               (dlstcd %in% c(500, 520, 580, 584) |
                                  (dlstcd >= 551 & dlstcd <= 574)) ~ -30,
                             is.na(dlret) & !is.na(dlstcd) ~ -100,
                             TRUE ~ ret)) %>%
  select(-c(dlret, dlstcd))
```

Next, I downloaded data on market returns and the risk-free rate from Kenneth French's homepage. The dataset provided on Absalon has been somewhat cleaned beforehand; you are therefore advised to repeat this analysis on your own to really understand how French's data library is structured. I join the CRSP returns with the market and risk-free returns to compute excess returns later on.

```
factors_ff <- read_csv("F-F_Research_Data_Factors.csv", skip = 3) %>%
  rename(date = X1, MKT = `Mkt-RF`) %>%
  janitor::clean_names() %>%
  mutate(date = ymd(parse_date_time(date, "%Y%m"))) %>%
  na.omit()

# Join with CRSP sample
crsp <- crsp %>%
  left_join(factors_ff, by = "date")
```

```
factors_ff %>%
  ggplot(aes(x = date, y = rf)) +
  geom_point() +
  theme_bw() +
  labs(x = "", y = "Risk-free rate (FF) in %")
```

Above I illustrate the risk-free rate as provided by Kenneth French. Finally, we compute some summary statistics for the excess returns as well as their (delisting-)adjusted counterparts. The code below first computes summary statistics on the cross-section of stocks for each date and then simply reports the time-series average of each value. **Note**: To create the tables, the packages `knitr` and `kableExtra` have to be installed.

```
crsp_summaries <- crsp %>%
  mutate(ret.exc = ret - rf, # compute excess returns
         ret.adj.exc = ret.adj - rf) %>%
  group_by(date) %>% # compute summary statistics by month (cross-section of returns)
  summarise_at(vars(ret.adj.exc, ret.adj, ret.exc, ret),
               list(mean = mean,
                    sd = sd,
                    min = min,
                    q05 = ~quantile(., 0.05, na.rm = TRUE),
                    q25 = ~quantile(., 0.25, na.rm = TRUE),
                    median = median,
                    q75 = ~quantile(., 0.75, na.rm = TRUE),
                    q95 = ~quantile(., 0.95, na.rm = TRUE),
                    max = max,
                    n = ~as.numeric(sum(!is.na(.)))), na.rm = TRUE) %>%
  select(-date) %>%
  mutate_all(~if_else(!is.finite(.), as.double(NA), .)) %>%
  summarise_all(mean, na.rm = TRUE) %>% # compute time-series means
  pivot_longer(everything()) %>% # improve formatting
  separate(name, into = c("var", "stat"), sep = "_") %>%
  pivot_wider(names_from = stat, values_from = value)

crsp_summaries %>%
  kableExtra::kable("markdown", digits = 2) # return the table used in the slides
```

| var | mean | sd | min | q05 | q25 | median | q75 | q95 | max | n |
|---|---|---|---|---|---|---|---|---|---|---|
| ret.adj.exc | 0.86 | 13.68 | -67.63 | -16.68 | -5.79 | -0.13 | 6.05 | 20.95 | 208.27 | 3133.86 |
| ret.adj | 1.12 | 13.66 | -67.25 | -16.40 | -5.52 | 0.13 | 6.31 | 21.19 | 207.87 | 3136.53 |
| ret.exc | 0.99 | 13.43 | -56.25 | -16.49 | -5.70 | -0.04 | 6.16 | 21.07 | 197.13 | 3117.27 |
| ret | 1.25 | 13.42 | -55.93 | -16.21 | -5.42 | 0.23 | 6.41 | 21.31 | 196.77 | 3119.93 |

Finally, another representation of the evolution of the market portfolio. The figure below shows the cumulative log returns from the beginning of the sample and thus illustrates the dynamics of 100 USD invested in the US stock market throughout time. The self-financing option assumes you borrowed the 100 USD at the risk-free rate and rolled over your debt since then.
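Cumulating log returns by summation is equivalent to taking the log of the compounded gross return; a quick sketch with made-up monthly returns:

```r
# Hypothetical monthly returns in percent
ret <- c(10, -5, 3)

# 100 * log gross returns
log_ret <- 100 * log(1 + ret / 100)
value <- cumsum(log_ret) # cumulative log return path

# summing logs equals the log of the product of gross returns
tail(value, 1)
100 * log(1.10 * 0.95 * 1.03)
```

This is why plotting `cumsum` of log returns is a convenient way to display long-run compounded performance.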

```
# Plot cumulative excess returns of mkt
factors_ff %>%
  mutate("Self-financing" = 100 * log(1 + mkt / 100),
         "Financed" = 100 * log(1 + (rf + mkt) / 100)) %>%
  pivot_longer("Self-financing":"Financed") %>%
  group_by(name) %>%
  mutate(value = cumsum(value)) %>%
  ggplot(aes(x = date, y = value, color = name)) +
  geom_line() +
  labs(x = "", y = "Cumulative portfolio value") +
  theme_minimal()
```

**Note:** From here it is easy to create the file `exercise_clean_crsp_returns.rds` from exercise 2.2. The code below has been used to generate the time series of returns.

```
# Filter and clean
crsp <- crsp %>%
  drop_na(permno, date, ret.adj) %>%
  filter(mktcap > 0) %>%
  select(permno, date, ret.adj)

# Prepare CRSP returns for exercise
date_seq <- tibble(req_date = seq(ymd('1950-01-01'), ymd('2019-12-01'), by = '1 month'))

crsp_final <- crsp %>% inner_join(date_seq, by = c("date" = "req_date"))
crsp_final <- crsp_final %>%
  group_by(permno) %>%
  filter(n() == 840) # keep only stocks with a complete return history
crsp_final <- crsp_final %>%
  mutate(m = mean(ret.adj), sd = sd(ret.adj)) %>%
  ungroup() %>%
  filter(m < quantile(m, 0.95), # some filtering to avoid annoying outliers
         sd < quantile(sd, 0.95)) %>%
  select(-m, -sd)

crsp_final %>% write_rds("exercise_clean_crsp_returns.rds")
```

## 3.2 Beta in the CRSP sample and Portfolio sorts

This exercise is concerned with estimating \(\beta_i\) for the cross-section of US equity returns based on the CRSP sample.
You will estimate the exposure of individual stocks with respect to changes in the market portfolio. First, compute \(\beta_i\) for an individual stock to get familiar with linear regression in `R`. Then, we run rolling window regressions for the entire CRSP sample. To conduct a rolling window estimation you can either employ a `for` loop or exploit the functionalities of the convenient `slider` package.

### 3.2.1 Exercises

- Read in the cleaned CRSP data set `crsp_clean_sample.rds` from exercise 2.5 and compute the market beta \(\beta_\text{AAPL}\) of ticker `AAPL` (`permno == 14594`). You can use the function `lm()` for that purpose (alternatively: compute the well-known OLS estimate \((X'X)^{-1}X'Y\) on your own).
- For monthly data, it is common to compute \(\beta_i\) based on a rolling window of length 5 years. Implement a rolling procedure that estimates each asset's market beta every month based on the last 60 observations. You can either use the package `slider` or a simple `for` loop for that. (**Note**: this is going to be a time-consuming computational task.)
- Provide summary statistics for the cross-section of estimated betas.
- What is the theoretical prediction of the CAPM with respect to the relationship between market beta and expected returns? What would you expect if you create portfolios based on beta (you create a high- and a low-beta portfolio each month and track the performance over time)? How should the expected returns differ between high- and low-beta portfolios?
- Verify/test your hypothesis by implementing such a portfolio sort procedure. You can follow these sub-steps: i) For each month, compute the beta deciles of the cross-section. ii) Use these deciles as thresholds that define for each asset into which portfolio it belongs (if the estimated beta of asset \(i\) lies between the 20% and 30% deciles, it belongs to the third portfolio). iii) Weight the assets within each portfolio either equally or relative to their market capitalization. iv) Compute the return of each portfolio in the next period. v) Repeat this for each month.

### 3.2.2 Solutions

```
# install.packages("slider")
library(tidyverse) # for grammar
library(lubridate) # for working with dates
library(slider) # for rolling window operations (https://github.com/DavisVaughan/slider)
```

We will make use of the `slider` package later on. Make sure it is installed before you run the code below.

First, I read in the data and extract the relevant variables. To estimate the CAPM regression equation \[r_{i,t} - r_{f,t} = \alpha_i + \beta_i\left(r_{m,t} - r_{f,t}\right) + \varepsilon_{i,t}\] the excess returns of the individual asset \(i\) and the excess market returns are required.

```
# Read in the cleaned CRSP data set
crsp <- read_rds("crsp_clean_sample.rds")
```

`R` provides a whole machinery to estimate (linear) models with the function `lm()`. `lm()` requires a formula as input, which is specified in a compact symbolic form. An expression of the form `y ~ model` is interpreted as a specification that the response `y` is modelled by a linear predictor specified symbolically by `model`. Such a model consists of a series of terms separated by `+` operators. In addition to **standard** linear models, `lm()` provides a lot of flexibility; you should check out the documentation for more information.
Below I restrict the data to the time series of observations in CRSP that correspond to Apple's stock (the `permno` of Apple is `14594`) and compute \(\alpha_i\) as well as \(\beta_i\) based on the asset's and the market's excess returns.

```
crsp <- crsp %>%
  mutate(ret_excess = ret.adj - rf,
         mkt_excess = mkt - rf) %>%
  select(permno, date, ret_excess, mkt_excess, mktcap)

# Individual CAPM regression
fit <- lm(ret_excess ~ mkt_excess, data = crsp %>%
            filter(permno == "14594")) # Apple's permno
summary(fit)
```

```
##
## Call:
## lm(formula = ret_excess ~ mkt_excess, data = crsp %>% filter(permno ==
## "14594"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.023 -12.758 -3.502 7.772 62.483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9217 2.8074 0.684 0.497
## mkt_excess 0.9453 0.9027 1.047 0.300
##
## Residual standard error: 20.1 on 53 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.02028, Adjusted R-squared: 0.00179
## F-statistic: 1.097 on 1 and 53 DF, p-value: 0.2997
```

`lm()` returns an object of class `lm` which contains all information one usually cares about with linear models. `summary()` returns an easy-to-understand overview of the estimated parameters; `coefficients(fit)` would return only the estimated coefficients. The output above indicates that Apple moves closely with the market: \(\beta\) is close to one. What the output also shows, however, is that the estimation uncertainty is huge and we cannot reject the \(H_0\) that \(\beta\) is actually zero. What does this imply for the expected excess returns according to the CAPM?
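As the exercise suggests, the same coefficients can be computed by hand via the closed-form OLS estimate \((X'X)^{-1}X'Y\). A minimal sketch on simulated (made-up) data, verifying that both routes agree:

```r
set.seed(1)

# Simulated excess returns (hypothetical numbers, not CRSP data)
mkt_excess <- rnorm(100, mean = 0.5, sd = 4)
ret_excess <- 0.3 + 0.9 * mkt_excess + rnorm(100, sd = 5)

# Closed-form OLS: design matrix with an intercept column
X <- cbind(1, mkt_excess)
ols <- solve(t(X) %*% X) %*% t(X) %*% ret_excess

# Same coefficients via lm()
fit <- lm(ret_excess ~ mkt_excess)

all.equal(unname(ols[, 1]), unname(coef(fit)))
```

Both approaches deliver numerically identical \(\hat\alpha\) and \(\hat\beta\); `lm()` additionally provides standard errors and diagnostics for free.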

Next, we scale the estimation of \(\beta\) to a whole different level and perform rolling window estimations for the entire CRSP sample. To do this, I first define a little helper function `capm_regression` which makes use of the code snippet above.

```
# Rolling window estimation
capm_regression <- function(x, check = 24) {
  # A simple function that computes the CAPM beta using lm() if the number of observations is sufficient
  if (nrow(x) < check) { # "feasible" regressions require at least `check` observations
    return(tibble(beta = NA))
  } else {
    reg <- lm(ret_excess ~ mkt_excess, data = x)
    return(tibble(beta = reg$coefficients[2]))
  }
}
```

Note that I included a check whether the time series contains at least 24 observations, to avoid huge fluctuations in case the time series is just too short. All this function returns is the market beta regression coefficient.

The next function wraps everything together: the `slider` package comes with the convenient function `slide_index_dfr()` which takes as its first argument a tibble, as its second argument a (date) index (`x$date`), and performs a function (`capm_regression(.x)`) on the subset of data `.x` that falls within the rolling window. The length of the window is provided via `.before = months(60)`. Thus, for each month, the rolling window CAPM regression coefficient \(\hat\beta\) is stored in a new column which I call `beta`.

```
rolling_capm_regression <- function(x, window = 60) {
  # slide across dates
  betas <- slide_index_dfr(x,
                           x$date,
                           ~capm_regression(.x),
                           .before = months(window),
                           .complete = FALSE)
  # collect output
  x %>% mutate(beta = betas$beta)
}
```

The final step is just to let `R` do the rest and to wait (probably quite some time) until the large number of regressions is completed. Note below that the computation is performed by `permno`.

```
betas_monthly <- crsp %>%
  na.omit() %>%
  group_by(permno) %>%
  arrange(permno, date) %>%
  group_modify(~rolling_capm_regression(.x)) %>%
  na.omit()
```
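As noted earlier, a plain `for` loop can replace `slider` for the rolling estimation. A sketch on simulated data (hypothetical returns of a single stock), using the fact that with one regressor the OLS slope is \(\text{Cov}(x, y)/\text{Var}(x)\):

```r
set.seed(42)

# Simulated excess returns for a single stock (made-up numbers)
n_months <- 120
mkt_excess <- rnorm(n_months, mean = 0.5, sd = 4)
ret_excess <- 1.2 * mkt_excess + rnorm(n_months, sd = 6)

window <- 60
betas <- rep(NA_real_, n_months)
for (t in window:n_months) {
  idx <- (t - window + 1):t # the last 60 observations
  betas[t] <- cov(mkt_excess[idx], ret_excess[idx]) / var(mkt_excess[idx])
}

# the first estimate matches the lm() slope on the same window
all.equal(betas[window],
          unname(coef(lm(ret_excess[1:window] ~ mkt_excess[1:window]))[2]))
```

The loop is easier to reason about but has to be wrapped over every `permno` by hand, which is what `group_modify` plus `slider` automates above.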

What do the estimated betas look like? There are many ways to provide summary statistics. The code snippet below computes some cross-sectional summary statistics (for each date) and then reports the time-series averages. In case you are wondering: the package `knitr` provides amazing features to transform R output into ready-to-publish formats (it also supports LaTeX).

```
betas_monthly %>%
  group_by(date) %>%
  summarise(mean = mean(beta),
            sd = sd(beta),
            skew = moments::skewness(beta),
            kurt = moments::kurtosis(beta),
            min = min(beta),
            q25 = quantile(beta, 0.25),
            q50 = quantile(beta, 0.50),
            q75 = quantile(beta, 0.75),
            max = max(beta),
            n = n()) %>%
  ungroup() %>%
  select(-date) %>%
  summarise_all(list(~mean(., na.rm = TRUE))) %>%
  knitr::kable("pipe", digits = 2)
```

| mean | sd | skew | kurt | min | q25 | q50 | q75 | max | n |
|---|---|---|---|---|---|---|---|---|---|
| 1.16 | 0.64 | 0.49 | 7.3 | -2.17 | 0.73 | 1.1 | 1.53 | 5.11 | 2710.21 |

The table above shows that we computed betas for, on average, 2710 firms per month (after removing `NA`s). The typical firm has a mean market beta of about \(1\), but there is a lot of variation. We also see that only very few firms exhibit a negative beta.

The next section implements portfolio sorts based on beta. The idea is to create portfolios of assets with a comparable characteristic such as market beta; the same approach can also be translated to sorting based on size, book-to-market values, or any other firm characteristic.
First, I create new variables with the suffix `_f1` which track the returns of an asset in the following month. These values will be used to track the portfolio performance.

```
# Next, we construct portfolio sorts
# We add the (excess) returns for the subsequent month of each stock to the sample.
# Note that we use beginning of month dates to ensure correct matching of dates
betas_monthly <- betas_monthly %>%
  arrange(permno, date) %>%
  group_by(permno) %>%
  mutate(ret_excess_f1 = lead(ret_excess, 1),
         mkt_excess_f1 = lead(mkt_excess, 1))
```

Next, I compute the sorting thresholds. In this case, I want to create 10 portfolios of assets with similar betas. For that purpose I first compute the beta deciles for each month. Firms will be assigned to individual portfolios based on these thresholds.

```
# calculate quantiles
percentiles <- seq(0.1, 0.9, 0.1)
percentiles_names <- map_chr(percentiles, ~paste0("q", .x * 100))
percentiles_funs <- map(percentiles,
                        ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = percentiles_names)

quantiles <- betas_monthly %>%
  group_by(date) %>%
  summarise_at(vars(beta), lst(!!!percentiles_funs))

quantiles %>%
  pivot_longer(-date, names_to = "Quantile") %>%
  ggplot(aes(x = date, y = value, color = Quantile)) +
  geom_line() +
  theme_bw() +
  labs(x = "", y = "Beta")
```

The figure above illustrates the thresholds for market beta. There is some time-series variation, which we account for by recomputing the quantiles each month. Next, we sort each asset into its portfolio on a monthly basis. The implied portfolio weights are usually chosen to be either equal (in that case the portfolio return is just the average of all assets' returns) or relative to market capitalization (in that case the portfolio return is a weighted mean).
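The difference between the two weighting schemes can be seen in a toy cross-section (made-up numbers for a single month):

```r
# Next-month returns (percent) and market caps (millions) of three stocks
ret <- c(2, -1, 4)
mktcap <- c(100, 50, 850)

mean(ret)                  # equal-weighted portfolio return: 1.67
weighted.mean(ret, mktcap) # value-weighted return: 3.55
```

The value-weighted return is dominated by the large-cap stock, which is why value-weighted sorts are less sensitive to the behavior of tiny firms.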

```
# sort all stocks into decile portfolios
portfolios <- betas_monthly %>%
  left_join(quantiles, by = "date") %>%
  mutate(portfolio = case_when(beta <= q10 ~ 1L,
                               beta > q10 & beta <= q20 ~ 2L,
                               beta > q20 & beta <= q30 ~ 3L,
                               beta > q30 & beta <= q40 ~ 4L,
                               beta > q40 & beta <= q50 ~ 5L,
                               beta > q50 & beta <= q60 ~ 6L,
                               beta > q60 & beta <= q70 ~ 7L,
                               beta > q70 & beta <= q80 ~ 8L,
                               beta > q80 & beta <= q90 ~ 9L,
                               beta > q90 ~ 10L))

portfolios_ts <- portfolios %>%
  filter(date <= "2019-12-01") %>%
  mutate(portfolio = as.character(portfolio)) %>%
  group_by(portfolio, date) %>%
  summarize(ret_vw = weighted.mean(ret_excess_f1, mktcap, na.rm = TRUE),
            ret_mkt = mean(mkt_excess_f1, na.rm = TRUE)) %>%
  na.omit() %>%
  ungroup()
```

As an additional portfolio I include a self-financing portfolio which finances a long position in high beta stocks (“10”) with a short position in low beta stocks (“1”).

```
# 10-1 portfolio
portfolios_ts_101 <- portfolios_ts %>%
  filter(portfolio %in% c("1", "10")) %>%
  pivot_wider(names_from = portfolio,
              values_from = ret_vw) %>%
  mutate(ret_vw = `10` - `1`,
         portfolio = "10-1") %>%
  select(portfolio, date, ret_vw, ret_mkt)

# combine everything
portfolio_returns <- bind_rows(portfolios_ts, portfolios_ts_101) %>%
  mutate(portfolio = factor(portfolio, levels = c(as.character(seq(1, 10, 1)), "10-1")))
```

What is the performance of these portfolios? I provide two evaluation metrics: first the average returns, then the CAPM risk-adjusted alphas per portfolio. Recall that high-beta portfolios should, according to the CAPM, earn higher expected returns (if the market risk premium is positive) than low-beta assets. However, after adjusting for the covariance with the market portfolio, all portfolios should deliver a zero alpha, such that only non-diversifiable risk-taking activities are rewarded.

```
# average excess return per portfolio
average_ret <- portfolio_returns %>%
  group_by(portfolio) %>%
  arrange(date) %>%
  do(model = lm(paste0("ret_vw ~ 1"), data = .)) %>%
  mutate(nw_stderror = sqrt(diag(sandwich::NeweyWest(model, lag = 6)))) %>%
  mutate(broom::tidy(model)) %>%
  ungroup() %>%
  mutate(nw_tstat = estimate / nw_stderror) %>%
  select(estimate, nw_tstat) %>%
  t()

# estimate CAPM alpha per portfolio
average_capm_alpha <- portfolio_returns %>%
  group_by(portfolio) %>%
  arrange(date) %>%
  do(model = lm(paste0("ret_vw ~ 1 + ret_mkt"), data = .)) %>%
  mutate(nw_stderror = sqrt(diag(sandwich::NeweyWest(model, lag = 6))[1])) %>%
  mutate(estimate = coefficients(model)[1]) %>%
  mutate(nw_tstat = estimate / nw_stderror) %>%
  select(estimate, nw_tstat) %>%
  t()

# construct output table
out <- rbind(average_ret, average_capm_alpha)
colnames(out) <- c(as.character(seq(1, 10, 1)), "10-1")
rownames(out) <- c("Excess Return", "t-Stat", "CAPM Alpha", "t-Stat")
out %>% knitr::kable(digits = 2, "pipe")
```

|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 10-1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Excess Return | 0.46 | 0.48 | 0.62 | 0.63 | 0.67 | 0.67 | 0.64 | 0.64 | 0.64 | 0.67 | 0.23 |
| t-Stat | 2.66 | 2.69 | 3.56 | 3.29 | 3.11 | 3.03 | 2.70 | 2.45 | 2.16 | 1.98 | 0.78 |
| CAPM Alpha | 0.25 | 0.22 | 0.32 | 0.29 | 0.29 | 0.24 | 0.18 | 0.13 | 0.06 | 0.01 | -0.53 |
| t-Stat | 1.81 | 1.58 | 2.77 | 2.55 | 2.45 | 2.07 | 1.53 | 1.01 | 0.44 | 0.08 | -2.50 |

**Note:** the code above implements standard errors based on Newey-West adjustments for autocorrelation in the error terms.
What do the portfolio performances reveal? A big surprise, at least if you believe in the CAPM: there is no indication that average excess returns increase with beta. In fact, the self-financing portfolio yields an average excess return of 0.23, which is not significantly different from zero. Thus, there is a strong signal that the CAPM story alone is not sufficient to explain the cross-section of returns.

## 3.3 Estimating and testing the CAPM

This exercise is closely aligned to the slides on the capital asset pricing model and asks you to estimate the market beta for some portfolios of assets sorted by industry, to compute optimal portfolio weights based on the CAPM factor structure and to test the validity of the CAPM.

### 3.3.1 Exercises

- Download monthly return data for 10 industry portfolios from Kenneth French's data library and prepare it to work with the file in R (I advise you to download the `.csv` file and take a look inside to remove unnecessary rows). Also, retrieve the market and risk-free returns from Kenneth French's data library. Provide some summary statistics for the industry portfolio (excess) returns.
- For each industry portfolio, estimate the CAPM model with simple OLS. What are the \(\alpha\)'s? What are the \(\beta\)'s? What does the CAPM imply for \(\alpha\)?
- Perform a joint test on the validity of the CAPM based on the finite-sample distribution of a Wald test proposed by MacKinlay (1987) and Gibbons, Ross, and Shanken (1989). What do you conclude?
- Suppose the CAPM would hold. How could you make use of the factor structure of the returns to compute portfolio weights on the efficient frontier? What is the advantage of this procedure? Compare the portfolio weights of risky assets for a desired annualized rate of return of 10% resulting from using the sample mean and covariance, \(\hat\mu\) and \(\hat\Sigma\), as input into the optimization approach with the corresponding weights based on the moments of the return distribution implied by the CAPM.

### 3.3.2 Solutions

You can download the `.csv` manually from the homepage or from within `R` following this blog post. I cleaned the file by removing the annual and equally-weighted returns. This raw version is also provided on Absalon.

```
# Load required packages
library(tidyverse)
library(lubridate)
# Read in the data
returns <- read_csv("../Empirics/data/10_Industry_Portfolios.csv", skip = 11) %>%
  rename(date = X1) %>%
  # parse_date_time is a convenient way to transform the YYYYMM convention
  # in Kenneth French's files into dates R can work with
  mutate(date = ymd(parse_date_time(date, "%Y%m"))) %>%
  na.omit()

# Merge with the market factor and the risk-free rate to compute excess returns
factors_ff <- read_csv("../Empirics/data/F-F_Research_Data_Factors.csv", skip = 3) %>%
  rename(date = X1, MKT = `Mkt-RF`) %>%
  janitor::clean_names() %>%
  mutate(date = ymd(parse_date_time(date, "%Y%m"))) %>%
  na.omit()
```

After reading the files, I join both together to compute the excess returns. The line `mutate_at(vars(-date), ~ . - rf)` below performs a mutation on each column except `date` and subtracts the value of the column `rf` from the original value.
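A toy illustration of this `mutate_at` pattern on a hypothetical return table (column names made up):

```r
library(dplyr)

df <- tibble(date = as.Date("2000-01-01") + 0:1,
             ind_a = c(2.0, 3.0), # industry returns in percent
             ind_b = c(1.0, -1.0),
             rf = c(0.5, 0.4))

# subtract rf from every column except date (rf itself becomes zero)
df_excess <- df %>% mutate_at(vars(-date), ~ . - rf)
df_excess
```

Note that `rf` is also selected and is therefore mapped to zero, which is consistent with the pipeline below dropping it afterwards via `select(-rf)`. (In current dplyr, `mutate(across(...))` is the successor of `mutate_at`.)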

```
returns <- returns %>%
  left_join(factors_ff, by = "date") %>%
  select(-smb, -hml) %>%
  mutate_at(vars(-date), ~ . - rf) %>%
  select(-rf) %>%
  na.omit()
```

Next, I provide some summary statistics for each ticker, including `mkt`, the market excess return.

```
returns %>%
  pivot_longer(-date, names_to = "ticker") %>%
  group_by(ticker) %>%
  summarise(mean = 12 * mean(value), # annualized values
            sd = sqrt(12) * sd(value),
            sharpe = mean / sd)
```

| ticker | mean | sd | sharpe |
|---|---|---|---|
| Durbl | 10.39 | 27.14 | 0.38 |
| Enrgy | 8.14 | 21.73 | 0.37 |
| HiTec | 10.45 | 25.03 | 0.42 |
| Hlth | 9.77 | 19.23 | 0.51 |
| Manuf | 8.94 | 21.63 | 0.41 |
| mkt | 4.83 | 18.61 | 0.26 |
| NoDur | 8.28 | 15.89 | 0.52 |
| Other | 7.60 | 22.21 | 0.34 |
| Shops | 9.22 | 20.17 | 0.46 |
| Telcm | 7.01 | 15.93 | 0.44 |
| Utils | 7.18 | 19.03 | 0.38 |

The implication of the CAPM is that all elements of the \((N \times 1)\) vector \(\alpha\) are zero in the joint regression framework
\[Z_t = \alpha + \beta Z_{m,t} + \varepsilon_t\]
where \(\beta\) is the \((N \times 1)\) vector of market betas (if \(\alpha\) is zero, then \(m\) is the tangency portfolio). The standard ordinary least squares (OLS) estimator then delivers
\[\hat\alpha = \hat\mu - \hat\beta\hat\mu_m\] with \[\hat\beta = \frac{\sum (Z_t - \hat\mu)(Z_{m,t} - \hat\mu_m)}{\sum (Z_{m,t} - \hat\mu_m)^2}.\]
Alternatively, let \(X = (1, Z_m)\) be a \((T \times 2)\) matrix; then \((X'X)^{-1}X'Z\) is the \((2 \times N)\) matrix of coefficients. You can also use `lm` to compute the same coefficients but for the sake of brevity I provide the explicit estimation below.

```
Z <- returns %>% select(-date, -mkt) %>% as.matrix()
Z_m <- returns %>% pull(mkt)

X <- cbind(1, Z_m)
mle_estimate <- solve(t(X) %*% X) %*% t(X) %*% Z
alphas <- matrix(mle_estimate[1, ], nr = 1)
alphas
# Exactly the same can be computed with lm(Z~Z_m)
```

The estimated parameters:

|  | NoDur | Durbl | Manuf | Enrgy | HiTec | Telcm | Shops | Hlth | Utils | Other |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 0.39 | 0.36 | 0.30 | 0.32 | 0.38 | 0.32 | 0.38 | 0.48 | 0.29 | 0.18 |
| Z_m | 0.75 | 1.25 | 1.12 | 0.89 | 1.22 | 0.66 | 0.96 | 0.83 | 0.76 | 1.12 |

To evaluate the significance of the individual parameters you can either implement the equations provided in Chapter 4 of *The Econometrics of Financial Markets* (available in Absalon) or directly exploit the functionality provided by `lm` (which, in principle, also allows computing standard errors in the presence of heteroscedasticity). As an example, see the code below.

```
fit <- lm(Z ~ Z_m)
fit <- summary(fit)
fit$`Response HiTec`
```

```
##
## Call:
## lm(formula = HiTec ~ Z_m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9742 -1.8207 -0.0544 1.6076 14.9026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.37904 0.09007 4.208 2.78e-05 ***
## Z_m 1.22194 0.01673 73.056 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.019 on 1128 degrees of freedom
## Multiple R-squared: 0.8255, Adjusted R-squared: 0.8254
## F-statistic: 5337 on 1 and 1128 DF, p-value: < 2.2e-16
```

The estimated parameters reveal that all industry portfolios exhibit a positive alpha (you can convince yourself that they are significantly different from 0). This indicates that each of the portfolios yields returns which are not fully explained by exposure to market fluctuations. The estimated betas range from 0.66 (telecommunication) to 1.25 (durable goods) and reflect that industries are affected differently by fluctuations in the economy.

Next, we test if all alphas are jointly different from zero. If that is the case, it implies that the market portfolio is not the efficient tangency portfolio. The following code just follows the equations outlined in the slides: MacKinlay (1987) and Gibbons, Ross, and Shanken (1989) derived the finite-sample distribution of the test statistic \(J\), which yields \[J = \frac{T-N-1}{N}\left(1 + \frac{\hat \mu_m^2}{\hat\sigma_m^2}\right)^{-1}\hat\alpha'\hat\Sigma^{-1}\hat\alpha\] where \(\hat\Sigma = Cov(\hat\varepsilon)\).

```
# Perform Wald test
N <- ncol(Z)
T <- nrow(Z)
mle_sigma <- cov(Z - X %*% mle_estimate)
test_statistic <- (T-N-1)/N * (1 + mean(Z_m)^2/sd(Z_m)^2)^(-1) * alphas %*% solve(mle_sigma) %*% t(alphas)
test_statistic
```

```
## [,1]
## [1,] 32.93979
```

Under the null hypothesis, \(J\) is unconditionally distributed central \(F\) with \(N\) degrees of freedom in the numerator and \((T-N-1)\) degrees of freedom in the denominator. `R` provides extensive functions to sample from probability distributions, to evaluate quantiles, and to compute the probability mass at a specific point. You can check `?rnorm` for a description based on the normal distribution.
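For instance, for the normal distribution the prefixes `d`, `p`, `q`, and `r` give the density, the distribution function, the quantile function, and random draws; the same naming scheme carries over to the \(F\) distribution (`df()`, `pf()`, `qf()`, `rf()`) used below:

```r
dnorm(0)     # density of N(0,1) at zero: 1/sqrt(2*pi), approx. 0.399
pnorm(1.96)  # P(X <= 1.96), approx. 0.975
qnorm(0.975) # the inverse: the 97.5% quantile, approx. 1.96
set.seed(1)
rnorm(3)     # three random draws from N(0,1)
```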

`1 - pf(test_statistic, N, T - N - 1) # Compute p-value`

```
## [,1]
## [1,] 0
```

`qf(0.95, N, T - N - 1) # 95% critical value`

The test clearly rejects the CAPM. In other words, the estimated alphas are too large to be consistent with our chosen proxy for the market portfolio being the efficient tangency portfolio.

As pointed out in the slides, the assumption of a factor structure in the returns reduces the number of parameters to estimate substantially. Recall from Exercise 2.3 how the efficient tangent portfolio is computed.

```
efficient_portfolio <- function(mu, sigma, desired_return = 10/12){
  N <- length(mu)
  iota <- rep(1, N)
  wmvp <- solve(sigma) %*% iota
  wmvp <- wmvp/sum(wmvp)
  # Compute efficient portfolio weights for given level of expected return
  C <- as.numeric(t(iota)%*%solve(sigma)%*%iota)
  D <- as.numeric(t(iota)%*%solve(sigma)%*%mu)
  E <- as.numeric(t(mu)%*%solve(sigma)%*%mu)
  lambda_tilde <- as.numeric(2*(desired_return - D/C)/(E-D^2/C))
  weff <- wmvp + lambda_tilde/2*(solve(sigma)%*%mu - D/C*solve(sigma)%*%iota)
  return(weff)
}
```

Next we compute the moments of the return distribution in two ways.

```
# Compute sample moments
mu <- returns %>% select(-date, -mkt) %>% colMeans()
sigma <- returns %>% select(-date, -mkt) %>% cov()

# Same with CAPM
# Compute restricted betas
constrained_estimate <- solve(t(Z_m)%*%Z_m) %*% t(Z_m)%*%Z

mu_m <- mean(returns$mkt) # average market risk premium
rf_m <- mean(factors_ff$rf) # average risk-free rate

mu_capm <- rf_m + constrained_estimate * mu_m
mu_capm <- as.numeric(mu_capm)
sigma_eps <- apply(Z - Z_m %*% constrained_estimate, 2, sd) # Column-wise residual standard deviation
sigma_m <- var(returns$mkt)

sigma_capm <- sigma_m * t(constrained_estimate) %*% constrained_estimate + diag(sigma_eps)
```

The resulting portfolio weights differ because the inputs changed.

```
w <- efficient_portfolio(mu, sigma)
w_capm <- efficient_portfolio(mu_capm, sigma_capm)

cbind(w, w_capm)
```

```
## [,1] [,2]
## NoDur 0.76066404 -0.17898124
## Durbl 0.19253393 0.29859232
## Manuf -0.24936416 0.43480983
## Enrgy 0.22018162 0.01389309
## HiTec 0.20148549 0.36365099
## Telcm 0.20022415 -0.23072028
## Shops 0.07288366 0.10376855
## Hlth 0.45092041 -0.03944451
## Utils -0.02311254 -0.09580907
## Other -0.82641661 0.33024032
```

## 3.4 Fama-MacBeth Regressions

### 3.4.1 Exercises

This exercise asks you to implement Fama-MacBeth regressions. For a more detailed theoretical treatment, check out the corresponding chapter of the textbook *The Econometrics of Financial Markets* (available in Absalon).

- Before implementing the regression, write a short function to winsorize a variable. Winsorizing simply replaces values above or below a pre-specified quantile with that quantile. It is common in the asset pricing literature to winsorize the independent variables before running regressions, so it is worthwhile to implement winsorization once.
- Estimate the market risk premium based on Fama MacBeth regressions. To do this, for each cross-section of returns (e.g., each month), project the realized returns on the estimated betas from the previous exercise and then aggregate the estimates in the time dimension. That is, for each month \(t\), estimate \[Z_t = \gamma_{0,t}\iota + \gamma_{1,t}\beta + \eta_t\] and then compute the time-series mean of \(\hat\gamma_{0,t}\) and \(\hat\gamma_{1,t}\).

### 3.4.2 Solutions

The methodology of winsorizing is straightforward to implement (although also somewhat arbitrary). The function below implements winsorizing in a pipeable (`%>%`) fashion.

```
winsorize <- function(x, cut = 0.005){
  cut_point_top <- quantile(x, 1 - cut, na.rm = TRUE)
  cut_point_bottom <- quantile(x, cut, na.rm = TRUE)
  i <- which(x >= cut_point_top)
  x[i] <- cut_point_top
  j <- which(x <= cut_point_bottom)
  x[j] <- cut_point_bottom
  return(x)
}
# prepare and winsorize data
data_nested <- betas_monthly %>%
  drop_na(ret_excess_f1, beta) %>%
  group_by(date) %>%
  mutate(beta = winsorize(beta)) %>%
  nest()
```

Note that the tibble `betas_monthly` is the output of the rolling-window estimation in Exercise 2.6. After winsorizing, we start with the "first" step of the Fama-MacBeth regression: for each cross-section (that is, for each month) we regress the (one-month-ahead) excess returns on the (pre-estimated) betas. Note the use of the `tidyverse` package `broom`, which provides an easy way to extract summary statistics from the individual regression estimations. If it is unclear what `nest()` and `map()` are doing, check out the chapter `Many models` in Hadley Wickham's book.

```
# perform cross-sectional regressions for each month
cross_sectional_regs <- data_nested %>%
  mutate(model = map(data, ~lm(ret_excess_f1 ~ beta, data = .x))) %>%
  mutate(tidy = map(model, broom::tidy),
         n = map_dbl(model, stats::nobs))
# extract average coefficient estimates
fama_macbeth_coefs <- cross_sectional_regs %>%
  unnest(tidy) %>%
  group_by(term) %>%
  summarize(coefficient = mean(estimate))
```

Above we compute the time-series averages of the estimated regression coefficients \(\hat\gamma_0\) and \(\hat\gamma_1\). The remaining code implements Newey-West standard errors of the average coefficient estimates and proceeds with some housekeeping to present the regression results in a nice way.

```
# compute newey-west standard errors of average coefficient estimates
newey_west_std_errors <- cross_sectional_regs %>%
  unnest(tidy) %>%
  group_by(term) %>%
  arrange(date) %>%
  group_modify(~enframe(sqrt(diag(sandwich::NeweyWest(lm(estimate ~ 1, data = .x), lag = 6))))) %>%
  select(term, nw_std_error = value)
# put coefficient estimates and standard errors together and compute t-statistics
fama_macbeth_coefs <- fama_macbeth_coefs %>%
  left_join(newey_west_std_errors, by = "term") %>%
  mutate(nw_t_stat = coefficient / nw_std_error) %>%
  select(term, coefficient, nw_t_stat) %>%
  pivot_longer(cols = c(coefficient, nw_t_stat), names_to = "statistic") %>%
  mutate(statistic = paste(term, statistic, sep = " ")) %>%
  select(-term)
# extract average number of observations
fama_macbeth_stats <- cross_sectional_regs %>%
  ungroup() %>%
  summarize(n = mean(n)) %>%
  pivot_longer(n, names_to = "statistic")
# combine desired output and return results
out <- rbind(fama_macbeth_coefs, fama_macbeth_stats)

out %>% knitr::kable(digits = 2, "pipe")
```

statistic | value |
---|---|
(Intercept) coefficient | 0.79 |
(Intercept) nw_t_stat | 3.94 |
beta coefficient | 0.08 |
beta nw_t_stat | 0.68 |
n | 2692.88 |

What do the results show? On the one hand, the intercept is significantly different from zero (its Newey-West t-statistic is 3.94), although the CAPM implies a zero intercept for excess returns. On the other hand, the coefficient on the market beta, which is the estimated market risk premium, is not significantly different from zero. Although slightly positive (which is what we would expect from a priced risk factor), these regression results indicate that the market factor is not priced.

## 3.6 Shrinkage Estimation

In this exercise you are supposed to familiarize yourself with the details behind shrinkage regression methods such as Ridge and Lasso. Although `R` provides amazing interfaces which perform the estimation flawlessly, you are first asked to implement the Ridge and Lasso regression estimators from scratch before moving on to the package `glmnet`.

### 3.6.1 Exercises

- Write a function that requires three inputs, `y` (a \(T\) vector), `X` (a \((T \times K)\) matrix), and `lambda`, and which returns the **Ridge** estimator (a \(K\) vector) for a given penalization parameter \(\lambda\). Recall that the intercept should not be penalized. Therefore, your function should allow you to indicate whether \(X\) contains a vector of ones as its first column, which should be exempt from the \(L_2\) penalty.
- Compute the \(L_2\) norm (\(\beta'\beta\)) of the regression coefficients based on the predictive regression from the previous exercise for a range of \(\lambda\)'s and illustrate the effect of the penalization in a suitable figure.
- Now, write a function that requires three inputs, `y` (a \(T\) vector), `X` (a \((T \times K)\) matrix), and `lambda`, and which returns the **Lasso** estimator (a \(K\) vector) for a given penalization parameter \(\lambda\). Recall that the intercept should not be penalized. Therefore, your function should allow you to indicate whether \(X\) contains a vector of ones as its first column, which should be exempt from the \(L_1\) penalty.
- After you are really sure you understand what Ridge and Lasso regression are doing, familiarize yourself with the documentation of the package `glmnet`. It is a thoroughly tested and well-established package that provides efficient code to compute the penalized regression coefficients not only for Ridge and Lasso but also for combinations thereof, commonly called *elastic nets*.
- The choice of the hyperparameter \(\lambda\) should be based on the predictive performance within a validation sample. Split the entire sample into a training, a validation, and a final test set. The test set should not be used for any computations but only to evaluate the out-of-sample predictive performance. One unified framework for such machine learning methods is the **still very experimental** `tidymodels` library, which consists of a number of packages that aim to streamline machine learning tasks. You are advised to familiarize yourself with their great online tutorials to learn more about `tidymodels`. Use the sample split to choose the optimal penalty parameter \(\lambda\) for Ridge regression and show that penalization provides effective protection against overfitting.

### 3.6.2 Solutions

- As for OLS, the objective is to minimize the sum of squared errors \((y - X\beta)'(y - X\beta)\) under the condition that \(\sum\limits_{j=2}^K \beta_j^2 \leq t(\lambda)\) **if** an intercept is present. This can be rewritten as \(\beta'A\beta \leq t(\lambda)\) where \(A = \begin{pmatrix}0 & 0 & \ldots & 0\\ 0 & 1 & \ldots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 & \ldots & 1\end{pmatrix}\) is the \((K \times K)\) identity matrix with its first diagonal element set to zero. Otherwise, the condition is simply that \(\beta'I_K\beta \leq t(\lambda)\) where \(I_K\) is the identity matrix of size \((K \times K)\).
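Working with the Lagrangian form of this problem (a standard derivation: the constraint is replaced by the penalty \(\lambda\beta'A\beta\)) shows why a single linear solve suffices below. Minimizing \((y - X\beta)'(y - X\beta) + \lambda\beta'A\beta\) and setting the derivative with respect to \(\beta\) to zero yields
\[-2X'y + 2X'X\beta + 2\lambda A\beta = 0 \quad\Leftrightarrow\quad \hat\beta_{\text{ridge}} = (X'X + \lambda A)^{-1}X'y.\]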

```
ridge_regression <- function(y, X, lambda = 0, intercept = FALSE){
  K <- ncol(X)
  A <- diag(K)
  if(intercept){
    A[1, ] <- 0
  }
  coeffs <- solve(t(X)%*%X + lambda * A) %*% t(X)%*%y
  return(coeffs)
}
```
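As a quick check of two limiting properties of the estimator (run on simulated full-rank data with made-up coefficients, not the Welch data; the function is restated so the snippet is self-contained): for \(\lambda = 0\) the Ridge estimator coincides with OLS, and a larger \(\lambda\) shrinks the \(L_2\) norm of the penalized coefficients:

```r
# restatement of ridge_regression() from above, for a self-contained example
ridge_regression <- function(y, X, lambda = 0, intercept = FALSE){
  K <- ncol(X)
  A <- diag(K)
  if(intercept) A[1, ] <- 0
  solve(t(X) %*% X + lambda * A) %*% t(X) %*% y
}

set.seed(2021)
X_sim <- cbind(1, matrix(rnorm(200), ncol = 2)) # full-rank design with intercept column
y_sim <- X_sim %*% c(0.5, 1, -1) + rnorm(100)

b_ols   <- solve(t(X_sim) %*% X_sim) %*% t(X_sim) %*% y_sim
b_ridge <- ridge_regression(y_sim, X_sim, lambda = 0, intercept = TRUE)
all.equal(b_ols, b_ridge) # TRUE: no penalty, identical to OLS

l2 <- function(l) sum(ridge_regression(y_sim, X_sim, lambda = l, intercept = TRUE)[-1]^2)
l2(10) < l2(0) # TRUE: penalization shrinks the slope coefficients
```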

Below, I apply the function to the data set from the previous exercise to illustrate the output of the function `ridge_regression()`.

```
y <- welch_data$rp_div
X <- welch_data %>% select(-date, -rp_div) %>% as.matrix()

# OLS for the Welch data fails because X is not of full rank
c(ncol(X), Matrix::rankMatrix(X))
```

`## [1] 14 12`

```
rc <- c(NA, ridge_regression(y, X, lambda = 1))

# Add an intercept term
rc_i <- ridge_regression(y, cbind(1, X), lambda = 1, intercept = TRUE)
cbind(rc, rc_i)
```

```
## rc rc_i
## NA 0.068450076
## dp -0.239343694 -0.231886874
## dy 0.361579150 0.364407261
## ep -0.124960781 -0.118116681
## de -0.114382913 -0.113770193
## svar -0.014811852 -0.014348260
## csp 0.001835536 0.002063538
## bm -0.004624298 -0.026906940
## ntis -0.021298568 -0.017659394
## tbl -0.013561551 -0.018478176
## lty -0.001868527 -0.008946509
## ltr 0.059168150 0.058178639
## tms 0.011693024 0.009531667
## dfy 0.005863087 0.006118038
## infl -0.003994673 -0.003826009
```

Finally, the following code sequence computes the \(L_2\) norm of the ridge regression coefficients for a given \(\lambda\). Note that in order to implement this evaluation in a tidy manner, vectorized functions are important! `R` provides great ways to vectorize a function; to learn more, Hadley Wickham's book *Advanced R* is a highly recommended read!

```
# small helper function to extract the L_2 norm of the ridge regression coefficients
l2_norm <- function(lambda) sum(ridge_regression(y = y, X = X, lambda = lambda)^2)
l2_norm <- Vectorize(l2_norm) # In order to use the function within a `mutate` operation, it needs to be vectorized

seq(from = 0.01, to = 20, by = 0.1) %>%
  as_tibble() %>%
  mutate(norm = l2_norm(value)) %>%
  ggplot(aes(x = value, y = norm)) + geom_line() +
  labs(x = "Lambda", y = "L2 norm")
```

To compute the coefficients of a linear regression with a penalty on the sum of the absolute values of the regression coefficients, numerical optimization routines are required. Recall, the objective is to minimize the sum of squared residuals \(\hat\varepsilon'\hat\varepsilon\) under the constraint that \(\sum\limits_{j=2}^K|\beta_j|\leq t(\lambda)\). Make sure you familiarize yourself with the way numerical optimization in `R` works: we first define the objective function (`objective_lasso()`), which has the parameter we aim to optimize as its first argument. The main function `lasso_regression()` then only calls this helper function.

```
objective_lasso <- function(beta, y, X, lambda, intercept){
  residuals <- y - X%*%beta
  sse <- sum(residuals^2)
  penalty <- sum(abs(beta))
  if(intercept){
    penalty <- penalty - abs(beta[1])
  }
  return(sse + lambda * penalty)
}

lasso_regression <- function(y, X, lambda = 0, intercept = FALSE){
  K <- ncol(X)
  beta_init <- rep(0, K)
  return(optim(par = beta_init, fn = objective_lasso, y = y, X = X, lambda = lambda, intercept = intercept)$par)
}

rc <- c(NA, lasso_regression(y, X, lambda = 0.01))

# Add an intercept term
rc_i <- lasso_regression(y, cbind(1, X), lambda = 0.01, intercept = TRUE)
cbind(rc, rc_i)
```

```
## rc rc_i
## [1,] NA -0.015481329
## [2,] -0.002309567 -0.007432227
## [3,] 0.018950587 0.007577932
## [4,] -0.019328955 -0.007565038
## [5,] -0.041474430 -0.018622341
## [6,] -0.873890375 -1.668871810
## [7,] 0.473858053 0.398850112
## [8,] 0.004826174 0.008262646
## [9,] -0.239587377 -0.405821201
## [10,] 0.002021899 -0.072684496
## [11,] -0.416048940 -0.374240823
## [12,] 0.256429715 0.368949682
## [13,] 0.335641082 0.226014373
## [14,] 0.230499721 1.327395136
## [15,] 0.011043346 0.019841244
```

Finally, as in the previous example with Ridge regression, I illustrate how a larger penalization term \(\lambda\) affects the \(L_1\) norm of the regression coefficients.

```
l1_norm <- function(lambda) sum(abs(lasso_regression(y = y, X = X, lambda = lambda)))
l1_norm <- Vectorize(l1_norm) # In order to use the function within a `mutate` operation, it needs to be vectorized

seq(from = 0, to = 0.5, by = 0.05) %>%
  as_tibble() %>%
  mutate(norm = l1_norm(value)) %>%
  ggplot(aes(x = value, y = norm)) + geom_line() +
  labs(x = "Lambda", y = "L1 norm")
```

While the code above should work, there are well-tested `R` packages available that provide a much more reliable and faster implementation. Thus, you can safely use the package `glmnet`. As a first step, the following code creates the sequence of regression coefficients for the Goyal-Welch dataset, which should be identical to the example we discussed in the lecture slides.

```
library(glmnet)
# Lasso and Ridge regression
fit_ridge <- glmnet(X, y, # Model
                    alpha = 0)
broom::tidy(fit_ridge) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = lambda, y = estimate, color = term)) +
  geom_line() +
  geom_hline(data = . %>% filter(lambda == min(lambda)), aes(yintercept = estimate, color = term), linetype = "dotted") +
  theme_minimal() +
  scale_x_log10() +
  labs(x = "Lambda", y = "Estimate")
```

Note the function argument `alpha`. `glmnet()` allows a more flexible combination of \(L_1\) and \(L_2\) penalization on the regression coefficients. *Pure* ridge estimation is implemented with `alpha = 0`; lasso requires `alpha = 1`.

```
fit_lasso <- glmnet(X, y, # Model
                    alpha = 1)
broom::tidy(fit_lasso) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = lambda, y = estimate, color = term)) +
  geom_line() +
  geom_hline(data = . %>% filter(lambda == min(lambda)), aes(yintercept = estimate, color = term), linetype = "dotted") +
  theme_minimal() +
  scale_x_log10() +
  labs(x = "Lambda", y = "Estimate")
```

Note in the figure how Lasso discards all variables for high values of \(\lambda\) and then gradually incorporates more predictors as \(\lambda\) decreases. It seems like stock variance is selected first. As expected, for \(\lambda \rightarrow 0\), the Lasso coefficients converge towards the OLS estimates (illustrated with dotted lines).

Next, we split the sample using the function `initial_time_split` from the `rsample` package (you can just install `tidymodels` to get access to all the related packages). The split takes the last 20% of the data as the test set, which will **not** be used for any model tuning. The remaining sample is split into two even parts which are used for hyperparameter selection.

```
#install.packages("tidymodels")
library(tidymodels)
data <- rsample::initial_time_split(welch_data, prop = 0.8)

test_file <- rsample::testing(data) # Don't touch again until you selected your parameters!
data <- rsample::initial_time_split(training(data), prop = 0.5)

training_x <- training(data) %>% select(-date, -rp_div) %>% as.matrix()
validation_x <- testing(data) %>% select(-date, -rp_div) %>% as.matrix()
```

Next, I perform the typical hyperparameter tuning task: the idea is to choose a grid of possible values for the hyperparameter (in this case, \(\lambda\)). Then, fit the model using the training set and evaluate the goodness of fit (e.g., the mean squared prediction error) using the *validation* set. You then choose the hyperparameter that provides the best fit in the validation set.

```
# Compute ridge regression for a range of lambdas
lambdas <- fit_ridge$lambda
lambdas <- lambdas[lambdas > min(lambdas[lambdas != min(lambdas)])]

accuracy <- purrr::map_df(lambdas, function(k){
  fit <- glmnet(training_x, training(data)$rp_div, lambda = k, alpha = 0)
  y_hat <- predict(fit, training_x)
  training_error <- mean((y_hat - training(data)$rp_div)^2)

  y_hat <- predict(fit, validation_x)
  validation_error <- mean((y_hat - testing(data)$rp_div)^2)

  tibble(lambda = k,
         train = training_error,
         validation = validation_error)
})
p1 <- accuracy %>%
  pivot_longer(-lambda, names_to = "Model") %>%
  filter(lambda > min(lambda)) %>%
  ggplot(aes(x = lambda, y = value, color = Model)) +
  theme_minimal() +
  geom_point() +
  scale_x_log10() +
  labs(x = latex2exp::TeX("Penalization ($\\lambda$)"),
       y = "MSPE")

final_fit <- glmnet(training_x, training(data)$rp_div,
                    lambda = accuracy %>% filter(validation == min(validation)) %>% pull(lambda),
                    alpha = 0)
y_hat_final_fit <- predict(final_fit, test_file %>% select(-date, -rp_div) %>% as.matrix())
p1 + geom_hline(yintercept = mean((y_hat_final_fit - test_file$rp_div)^2))
```

The figure illustrates two main insights: First, imposing penalization harms the fit within the training set; the more we penalize, the worse the in-sample fit gets. This is not true, however, for the validation set: for less restrictive models, overfitting becomes an issue which affects the out-of-sample performance of the ridge regression. Instead, the validation prediction error is smallest for a hyperparameter \(\lambda>0\)! The black line shows the out-of-sample prediction error on the test set, which has not been used at all for hyperparameter tuning and thus resembles an estimate of the actual predictive performance of the ridge regression.

## 3.7 Regression trees and random forests

Gu, Kelly, and Xiu (2019) propose regression trees and especially random forests as suitable tools for factor and predictor selection in asset pricing. Basic regression trees partition a data set into smaller subgroups and then fit a simple constant for each observation in the subgroup. The predicted constant is the average response value of all observations that fall in that subgroup. First, it is important to realize that the partitioning of the variables is done in a top-down, greedy fashion. This just means that a partition performed earlier in the tree will not change based on later partitions. But how are these partitions made?
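The answer can be illustrated with a toy example (hypothetical simulated data, not the Goyal-Welch set): for a single predictor, the split threshold is chosen such that fitting one constant (the group mean) on each side minimizes the total residual sum of squares:

```r
set.seed(7)
# toy data: the outcome jumps at x = 0.5
x_toy <- runif(200)
y_toy <- ifelse(x_toy < 0.5, 1, 3) + rnorm(200, sd = 0.1)

# total RSS of a candidate split: one constant (the mean) per partition
split_rss <- function(threshold){
  left  <- y_toy[x_toy <  threshold]
  right <- y_toy[x_toy >= threshold]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

grid <- seq(0.05, 0.95, by = 0.01)
best_split <- grid[which.min(sapply(grid, split_rss))]
best_split # close to the true break point of 0.5
```

This grid search over one variable is exactly the building block that the solution below generalizes by looping over all candidate predictors.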

`R` packages provide very efficient algorithms to generate predictions based on random forests, but in this exercise you will first be asked to implement a simple function which builds a tree with a predetermined depth. Afterward, we will use packages like `rpart` or `randomForest` to increase the efficiency of the estimation process.

### 3.7.1 Exercises

- Write a function that takes `y` and a matrix of predictors `X` as inputs and returns a characterization of the relevant parameters of a regression tree with **1** branch.
- Create a function that generates predictions for a new matrix of predictors `newX` based on the estimated regression tree. The slide "How do regression trees work?" provides a step-by-step description of how to do this.
- Use the package `rpart` to *grow* a tree based on the Goyal-Welch data set as shown in the lecture.
- Make use of a training and a test set to choose the optimal depth (number of sample splits) of the tree.
- Grow a random forest and evaluate the predictive performance relative to the Lasso regression above.

### 3.7.2 Solutions

Coding a regression tree manually is somewhat cumbersome but good practice to make sure you really understand what is required for this task.
In the following, I first define a function `rss` which, in its most basic form, just takes two vectors `xsort` and `y` as input. The input `threshold` then defines a partition of the observations based on whether \(x_i \geq \text{threshold}\) or not. Then, I compute the residual sum of squares (note: we do not standardize by \(T\), to avoid that small clusters receive relatively more weight).

```
# objective function: find threshold to minimize overall sum of squared errors
rss <- function(threshold, xsort, y, optim = TRUE){
  partition_1 <- y[xsort >= threshold]
  partition_2 <- y[xsort < threshold]
  m1 <- sum((partition_1 - mean(partition_1))^2)
  m2 <- sum((partition_2 - mean(partition_2))^2)

  if(optim) return(m1 + m2)
  if(!optim) return(c(threshold, mean(partition_1), mean(partition_2)))
}
```

Next, we search for the variable at which the partition split should occur. For that purpose we simply loop through all columns of `X` and use `optimize`, a simple wrapper around `optim` for univariate optimization, to identify the RSS-minimizing break point for each variable.

```
val <- c()
for(j in 1:ncol(X)){
  sol <- optimize(f = rss, lower = quantile(X[,j], 0.01), upper = quantile(X[,j], 0.99), xsort = X[,j], y = y)
  val[[j]] <- sol
}
```

The remainder is housekeeping: identify the variable at which the sample split generates the minimum RSS. Our final function `rf_predict` then uses the output of our search algorithm and a new matrix `newX` as input to derive new predictions.

```
selected_variable <- val %>%
  bind_rows() %>%
  mutate(variable = colnames(X),
         id = 1:ncol(X)) %>%
  filter(objective == min(objective))

rf_predict <- function(newX, tree = selected_variable) {
  relevant_values <- rss(selected_variable$minimum,
                         X[, selected_variable$id],
                         y = y,
                         optim = FALSE)
  newx <- newX[, selected_variable$id]
  output <- newx
  output[newx >= relevant_values[1]] <- relevant_values[2]
  output[newx < relevant_values[1]] <- relevant_values[3]
  return(output)
}
```

To illustrate the resulting predictions, I sample 300 observations from our original dataset and let `rf_predict` generate return predictions. As we have only one splitting variable, every observation is associated with one of the two partitions. Thus, this very simple model only generates two different predictions.

```
newX <- X[sample.int(300),]
kableExtra::kable(table(rf_predict(newX)))
```

Var1 | Freq |
---|---|
-0.0228835360507263 | 17 |
0.00325337128357212 | 283 |

Before moving forward to the packages `rpart` and `randomForest`, make sure you know how you would use the code above to introduce a second branch!

Regression trees with `R` are straightforward to implement and to understand. See the code below, which has been used in the lecture.

```
#install.packages("rpart")
library(rpart)
model <- rpart(
  rp_div ~ .,
  data = welch_data %>% select(-date),
  control = rpart.control(maxdepth = 3))
plot(model, compress = TRUE)
text(model, cex = 0.7, fancy = TRUE, all = TRUE)
```

Similar to the hyperparameter tuning for Lasso, the *depth* of the tree is a parameter which can be chosen optimally using a training and a validation dataset. I implement such a simple search over a grid of possible numbers of splits below.

```
# Regression tree hyperparameter tuning
depth <- 1:15
accuracy <- purrr::map_df(depth, function(k){
  fit <- rpart(
    rp_div ~ .,
    data = training(data) %>% select(-date),
    control = rpart.control(maxdepth = k))
  y_hat <- predict(fit, training(data))
  training_error <- mean((y_hat - training(data)$rp_div)^2)

  y_hat <- predict(fit, testing(data))
  validation_error <- mean((y_hat - testing(data)$rp_div)^2)

  tibble(depth = k,
         train = training_error,
         validation = validation_error)
})
```

Finally, below is the code to generate the random forest predictions presented during the lecture.

```
# Random Forests
library(randomForest)
set.seed(2021)
fit <- randomForest(
  rp_div ~ .,
  data = training(data) %>% select(-date),
  xtest = testing(data) %>% select(-date, -rp_div),
  ytest = testing(data) %>% pull(rp_div),
  keep.forest = TRUE)
y_hat_random_forest <- predict(fit, test_file)
test_error <- mean((y_hat_random_forest - test_file$rp_div)^2)

p1 +
  geom_hline(yintercept = mean((y_hat_final_fit - test_file$rp_div)^2)) +
  geom_hline(yintercept = test_error, color = "gray")
```

## 3.8 Neural networks

- Fitting neural networks is a tedious job because neural networks are among the most heavily parametrized statistical methods. Luckily, it has become remarkably easy during recent years to exploit the rich coding infrastructure of `tensorflow` in `R`. TensorFlow was originally developed by researchers and engineers on the Google Brain team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural network research, but the system is general enough to be applicable in a wide variety of other domains as well. To get started, you will need some further system adjustments. For that purpose, you can follow this easy tutorial.
- For a lot of very interesting applications, you are advised to read through at least a few of these exemplary sessions which provide an overview of the rich infrastructure you can exploit.

### 3.8.1 Exercises

- Use `keras` to initialize a sequential neural network which can take the 14 predictors from the Goyal-Welch dataset as input, contains at least one hidden layer, and generates continuous predictions. **This sounds harder than it is:** see a simple regression example here. How many parameters does the neural network you aim to fit have?
- Next, compile the object. It is important that you specify a loss function. In our case, we focus on the mean squared error ("mse").
- Finally, fit the neural network. This may take some time. Make sure the optimizer uses the training dataset for fitting and the validation set for hyperparameter tuning. What is the mean squared prediction error of your first fitted neural network on the Goyal-Welch dataset? Play around with the network architecture, node dropout, or parameter penalization to see if you can beat the Lasso regression.

### 3.8.2 Solutions

While the package `keras` provides an extraordinary API (which even uses the pipe `%>%`), the setup costs to implement neural networks are nevertheless substantial, especially if you have never used `tensorflow` before. In the following, I provide a minimal code example of how to initialize and then compile a model, but there is no doubt that in order to really familiarize yourself with the subject matter, the tutorials and examples linked above will be very important.

`keras` divides the work of training a neural network into three steps: defining the architecture, compiling, and training. To give you some idea of the architecture of a network, I added an example below: the object `model` represents a sequential neural network (there are no recursive elements included, but `keras` could handle those as well!). First, the input layer should allow vectors of length 14, the number of predictors used in the Goyal-Welch dataset. Then, I include a hidden layer with 64 units and the ReLU activation function. As discussed in class, common tools to avoid overfitting are to either introduce regularization on the parameters (which I implement for the second hidden layer) or random dropout, which sets nodes to zero in order to avoid putting too much weight on irrelevant nodes. Both features can easily be implemented with `keras`, as you see below. Recall that we want to predict market premia. The output of the network should thus be a simple one-unit layer with a linear activation function. If you would like to perform classification tasks, the final layer could also include multiple nodes (e.g., 10 nodes if you want to predict the value of hand-written digits based on the MNIST dataset). Alternatively, if you want to predict binary outcomes, another activation function could make sense as well.

```
library(keras)
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = 14) %>%
  layer_dense(units = 64, activation = "relu") %>%
  # layer_dropout(0.6) %>%
  layer_dense(units = 16,
              activation = "relu",
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dense(1)
summary(model)
```

```
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## flatten (Flatten) (None, 14) 0
## ________________________________________________________________________________
## dense_2 (Dense) (None, 64) 960
## ________________________________________________________________________________
## dense_1 (Dense) (None, 16) 1040
## ________________________________________________________________________________
## dense (Dense) (None, 1) 17
## ================================================================================
## Total params: 2,017
## Trainable params: 2,017
## Non-trainable params: 0
## ________________________________________________________________________________
```
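As a sanity check on the architecture, the parameter counts reported by `summary(model)` can be verified by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, while the flatten layer has no parameters. A small sketch:

```r
# Parameters of a dense layer: one weight per input-unit pair plus one bias per unit
dense_params <- function(n_in, n_out) n_in * n_out + n_out

dense_params(14, 64) # first hidden layer: 14 * 64 + 64 = 960
dense_params(64, 16) # second hidden layer: 64 * 16 + 16 = 1040
dense_params(16, 1)  # output layer: 16 * 1 + 1 = 17
# total: 960 + 1040 + 17 = 2017 trainable parameters
```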

Next, the model is compiled into machine-readable code which will be handled by `tensorflow`. The `compile` function requires you to specify the loss function. In our case the most natural loss function is `mse`, the mean squared error. When dealing with categorical predictions, one may consider other loss functions, e.g. based on the entropy of the predictions. The option `metrics` simply specifies what kind of output the console should provide during the (often very time-consuming) training process.

```
model %>%
  compile(
    loss = "mse",
    optimizer = optimizer_rmsprop(),
    metrics = "mean_absolute_error")
```

Final step: train the model. This works very easily with `keras`, just use the `fit()` function. Note how I pass in the training and validation sets. The option `epochs` determines the number of times the parameters should be updated. Note that there is no analytic solution for the weights which minimize the loss; instead, numerical optimization procedures which are to some extent stochastic have been found to work very well for this class of training algorithms. The basic idea is that the model should be trained long enough such that the optimizer really reaches a (local) minimum. **Note:** unlike in the tidyverse, `%>% fit()` changes the object `model` without the need to assign a new variable.

```
model %>%
  fit(
    x = training_x,
    y = training(data) %>% pull(rp_div),
    validation_data = list(validation_x,
                           testing(data) %>% pull(rp_div)),
    epochs = 3000)
```

The prediction step is easy and similar to prediction with all other machine learning tools considered so far.

```
predictions <- predict(model, test_file %>% select(-date, -rp_div) %>% as.matrix())

p1 +
  geom_hline(yintercept = mean((y_hat_final_fit - test_file$rp_div)^2)) +
  geom_hline(yintercept = test_error, color = "gray") +
  geom_hline(yintercept = mean((predictions - test_file$rp_div)^2), color = "red")
```

Note the final performance of the trained neural network in the test set - can you improve upon that?

## 3.9 Empirical Asset Pricing via Machine Learning

This exercise should help you fight your way through an actual academic application of machine learning methods in asset pricing. The exercises guide you step-by-step through replicating the empirical framework and, eventually, the main results of the paper Empirical Asset Pricing via Machine Learning by Shihao Gu, Bryan Kelly and Dacheng Xiu.

- Start reading the paper at Section 2 (“An empirical study of U.S. equities”). In their study, the authors aim to use different machine learning procedures to approximate the overarching empirical model \(E_t\left(r_{i, t+1}\right) = g^\star(z_{i,t})\) as defined in the paper. The returns are monthly total individual equity returns from CRSP for all firms listed on the NYSE, AMEX, and NASDAQ. The sample starts in March 1957 (the start date of the S&P 500) and ends in December 2016, totaling 60 years. **Prepare** the data from CRSP according to these requirements.
- To calculate individual excess returns, the authors claim they “obtain the Treasury-bill rate to proxy for the risk-free rate”. Use the risk-free rate provided by Kenneth French’s data library to compute individual excess returns.
- One major contribution of the paper is to implement the predictive framework with a huge collection of stock-level predictive characteristics based on the cross-section of stock returns literature. Table A.6 in the Internet Appendix provides the details of all these characteristics. You don’t have to generate each of the characteristics; instead, **download** the data from Dacheng Xiu’s homepage.

### 3.9.1 Exercises

- Extract the downloaded data, read the **readme** file, and process the data the way the authors explain it in footnote 29: “We cross-sectionally rank all stock characteristics period-by-period and map these ranks into the [-1,1] interval following Kelly, Pruitt, and Su (2019) and Freyberger, Neuhierl, and Weber (2020).” Also, **process** the column `sic2`, which currently contains categorical data and should be transformed into a matrix with 74 columns: the value in a cell should be 1 if the `permno` corresponds to the specific industry classification and 0 otherwise.
- Lastly, merge the dataset with the eight macroeconomic predictors following the variable definitions detailed in Welch and Goyal (2008): dividend-price ratio (dp), earnings-price ratio (ep), book-to-market ratio (bm), net equity expansion (ntis), Treasury-bill rate (tbl), term spread (tms), default spread (dfy), and stock variance (svar).
- In the original paper, the authors inflate the dataset by considering any possible interaction between macroeconomic variables and firm characteristics. How could you implement this? (**Note: the dataset is huge already without interaction terms. For the purpose of this lecture, feel free to skip this final step.**)

### 3.9.2 Solutions

First, I start with the CRSP dataset. We already cleaned the raw file (provided on Absalon) and computed adjusted returns in a previous exercise. Note that I make use of the command `vroom::vroom`, which provides probably the fastest lazy read-in currently available in R.

```
# install.packages("vroom")
library(tidyverse) # for grammar
library(lubridate) # for working with dates
library(vroom)     # to import REALLY large datasets

# Read in CRSP data
crsp <- vroom("data/raw_crsp.csv")

# Clean the file
crsp <- crsp %>%
  janitor::clean_names() %>% # replace column names with lower case
  transmute(permno = as.integer(permno),     # security identifier
            date = ymd(date),                # month identifier
            ret = as.numeric(ret) * 100,     # return (convert to percent)
            shrout = as.numeric(shrout),     # shares outstanding (in thousands)
            altprc = as.numeric(altprc),     # last traded price in a month
            exchcd = as.integer(exchcd),     # exchange code
            shrcd = as.integer(shrcd),       # share code
            siccd = as.integer(siccd),       # industry code
            dlret = as.numeric(dlret) * 100, # delisting return (convert to percent)
            dlstcd = as.integer(dlstcd)      # delisting code
  ) %>%
  mutate(date = floor_date(date, "month")) # consistent naming convention: YYYY-MM-DD

return_data <- crsp %>%
  filter(date > "1957-02-01",      # sample begins in March 1957
         date <= "2016-12-31") %>% # and ends in December 2016
  filter(shrcd %in% c(10, 11)) %>% # keep only US listed stocks
  filter(exchcd %in% c(1, 31, 2, 32, 3, 33)) %>% # listed on NYSE, AMEX or NASDAQ
  distinct(permno, date, .keep_all = TRUE) %>%   # remove duplicates
  select(-shrcd) %>%
  # ret.adj computes adjusted returns based on the procedure described in
  # Bali, Engle, and Murray; note that case_when() evaluates its conditions
  # in order, so the -30 cases have to come before the -100 catch-all
  mutate(ret.adj = case_when(
    !is.na(dlret) ~ dlret,
    is.na(dlret) & !is.na(dlstcd) &
      (dlstcd %in% c(500, 520, 580, 584) |
         (dlstcd >= 551 & dlstcd <= 574)) ~ -30,
    is.na(dlret) & !is.na(dlstcd) ~ -100,
    TRUE ~ ret)) %>%
  select(permno, date, ret.adj) %>%
  na.omit()
```
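To see the delisting adjustment in action, here is a quick check on some entirely made-up observations. Since `case_when()` evaluates its conditions in order, the performance-related delisting codes (500, 520, 551-574, 580, 584) have to be matched to -30 before the -100 catch-all for the remaining delisting codes:

```r
library(dplyr)

# Entirely made-up observations: a regular return, a reported delisting
# return, a performance-related delisting code, and some other delisting code
toy <- tibble(ret    = c(2.5, 1.0, 0.0, 0.0),
              dlret  = c(NA, -15, NA, NA),
              dlstcd = c(NA, 500L, 580L, 100L))

toy %>%
  mutate(ret.adj = case_when(
    !is.na(dlret) ~ dlret,
    is.na(dlret) & !is.na(dlstcd) &
      (dlstcd %in% c(500, 520, 580, 584) |
         (dlstcd >= 551 & dlstcd <= 574)) ~ -30,
    is.na(dlret) & !is.na(dlstcd) ~ -100,
    TRUE ~ ret)) %>%
  pull(ret.adj)
# 2.5 -15 -30 -100
```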

Next, read in the file with the stock characteristics. You can check that it makes a huge difference whether you use `read_csv()` or `vroom::vroom()`.

```
# Data from Dacheng Xiu's homepage
predictor_data <- vroom("datashare.csv")

predictor_data <- predictor_data %>%
  mutate(DATE = ymd(DATE),
         DATE = floor_date(DATE, "month")) %>%
  rename("date" = "DATE")

return_data <- return_data %>%
  inner_join(predictor_data, by = c("date", "permno"))
```

The cross-sectional ranking may be somewhat time-consuming. The idea is really just that at each date the cross section of each predictor is rescaled such that the maximum value is 1 and the minimum value is -1.

```
# Cross-sectionally rank all characteristics period-by-period and map into the [-1, 1] interval
rank_transform <- function(x){
  rx <- rank(x, na.last = TRUE)
  non_nas <- sum(!is.na(x))
  rx[rx > non_nas] <- NA
  2 * (rx / non_nas - 0.5)
}
clean_data <- return_data %>%
  group_by(date) %>%
  mutate_at(vars(-permno, -date, -ret.adj, -sic2), rank_transform)
```
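To see what `rank_transform()` does, here it is applied to a toy vector (numbers made up): only the cross-sectional ranks, not the values themselves, determine the output, and missing values stay missing because `rank(x, na.last = TRUE)` pushes the `NA`s to the highest ranks, which are then reset to `NA`:

```r
# rank_transform as defined above, applied to toy vectors
rank_transform <- function(x){
  rx <- rank(x, na.last = TRUE)  # NAs receive the highest ranks
  non_nas <- sum(!is.na(x))
  rx[rx > non_nas] <- NA         # set the ranks of the NAs back to NA
  2 * (rx / non_nas - 0.5)       # map ranks into the unit interval around 0
}

rank_transform(c(0.3, -1.2, 7.0)) # 0.333 -0.333 1
rank_transform(c(0.3, NA, 7.0))   # 0 NA 1
```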

One-hot encoding is a common task in machine learning. Instead of working with the industry classification codes in one column, we inflate this column such that for each value of `sic2` we create a new column which is 1 if the permno-date combination indicates that the firm operates within the specific industry and 0 otherwise. There are many ways to do this in R; below I simply use a creative application of `pivot_wider` for that task.

```
# One-hot encoding of the sic2 variable (for more in-depth information see
# https://datatricks.co.uk/one-hot-encoding-in-r-three-simple-methods)
clean_data <- clean_data %>%
  pivot_wider(values_from = sic2, names_from = sic2, names_prefix = "sic.") %>%
  mutate_at(vars(contains("sic")), function(x) if_else(is.na(x), 0, 1)) %>%
  select(-sic.NA)
```
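On a toy example (made-up `permno` and `sic2` values) the `pivot_wider` trick works as follows: each distinct `sic2` value becomes a column, the cell holds the code where it applies and `NA` elsewhere, and the final `mutate_at` turns this into 1/0 indicators:

```r
library(tidyverse)

# Made-up permnos and two-digit SIC codes
toy <- tibble(permno = 1:3, sic2 = c(10, 35, 10))

toy %>%
  pivot_wider(values_from = sic2, names_from = sic2, names_prefix = "sic.") %>%
  mutate_at(vars(contains("sic")), function(x) if_else(is.na(x), 0, 1))
# permno 1 and 3: sic.10 = 1, sic.35 = 0; permno 2: sic.10 = 0, sic.35 = 1
```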

Now we merge everything with the Goyal-Welch dataset we have been working with before to capture the relevance of macroeconomic variables. A lot of the processing below is just a repetition of the earlier preparation steps.

```
# Prepare Welch-Goyal dataset (retrieved from http://www.hec.unil.ch/agoyal/docs/PredictorData2019.xlsx)
library(readxl) # for xlsx file import

# Read in dataset
welch_data <- readxl::read_xlsx("PredictorData2019.xlsx")

welch_data <- welch_data %>%
  janitor::clean_names() %>%
  mutate(yyyymm = ymd(paste0(yyyymm, "01"))) %>%
  rename("date" = "yyyymm") %>%
  mutate_at(vars(-date), ~as.numeric(.)) %>%
  filter(date >= "1957-03-01") %>%
  select(-crsp_s_pvw, -crsp_s_pvwx)

clean_data <- left_join(clean_data, welch_data, by = "date")
clean_data <- clean_data %>% ungroup()

clean_data <- clean_data %>%
  group_by(permno) %>%
  arrange(permno, date) %>%
  mutate(ret.adj = ret.adj - rfree) %>% # variable to predict: excess return
  ungroup()
```
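One caveat when assembling the eight predictors: the term spread and the default spread are not stored directly in the Welch-Goyal file but are derived from the raw series, tms = lty - tbl and dfy = BAA - AAA (see the variable definitions in Welch and Goyal, 2008). A minimal sketch with made-up yields (the lower-case column names are assumptions about what `janitor::clean_names()` produces for your file):

```r
library(dplyr)

# Made-up yields; the columns lty/tbl/baa/aaa mimic the cleaned Welch-Goyal names
toy <- tibble(lty = 0.05, tbl = 0.02, baa = 0.08, aaa = 0.06)

toy %>%
  mutate(tms = lty - tbl,  # term spread: long-term yield minus T-bill rate
         dfy = baa - aaa)  # default spread: BAA minus AAA corporate bond yield
# tms = 0.03, dfy = 0.02
```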

The code below replaces missing values with the cross-sectional median at each month (which, after the rank transformation, should be close to zero). One of the coauthors of the paper indeed states that NA values are simply set to zero.

```
# Replace missing values with the cross-sectional median at each month
replace_nas <- function(x){
  med <- median(x, na.rm = TRUE)
  x[is.na(x)] <- med
  x[is.na(x)] <- 0
  return(x)
}
clean_data <- clean_data %>%
  group_by(date) %>%
  mutate_at(vars(-permno, -date, -ret.adj), replace_nas)
```
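A quick toy illustration of `replace_nas()` (made-up numbers): missing entries get the cross-sectional median, and if the entire cross section is missing, the median itself is `NA`, so the second assignment falls back to zero:

```r
replace_nas <- function(x){
  med <- median(x, na.rm = TRUE)
  x[is.na(x)] <- med # impute the cross-sectional median
  x[is.na(x)] <- 0   # fallback if the whole cross section is missing
  return(x)
}

replace_nas(c(-1, NA, 0.5, 1)) # the NA becomes the median 0.5
replace_nas(c(NA, NA))         # everything missing: both entries become 0
```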

The final step is to generate the interaction terms. This step will probably break on most normal computers, and thus I do not execute the code below. On the computer cluster I usually use for large-scale computations, however, tasks like this one do work.

```
clean_data <- clean_data %>%
  ungroup() %>%
  select(permno, date, ret.adj, mvel1, beta, infl, ntis) %>%
  pivot_longer(infl:ntis, names_to = "macro", values_to = "macro_value") %>%
  pivot_longer(mvel1:beta) %>%
  transmute(permno, date, ret.adj,
            name = paste0(macro, "_", name),
            s = macro_value * value) %>%
  pivot_wider(names_from = name, values_from = s)
```
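The double `pivot_longer` construction is easiest to see on a toy example (all numbers made up): two firm characteristics and two macro series yield four interaction columns, one per macro-characteristic pair:

```r
library(tidyverse)

# Made-up data: two characteristics (size, beta), two macro series (infl, ntis)
toy <- tibble(permno = 1, date = "2000-01-01",
              size = 2, beta = 3, infl = 10, ntis = 100)

toy %>%
  pivot_longer(infl:ntis, names_to = "macro", values_to = "macro_value") %>%
  pivot_longer(size:beta) %>%
  transmute(permno, date,
            name = paste0(macro, "_", name),
            s = macro_value * value) %>%
  pivot_wider(names_from = name, values_from = s)
# infl_size = 20, infl_beta = 30, ntis_size = 200, ntis_beta = 300
```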

```
clean_data %>% ungroup() %>% write_rds("clean_sample.rds")
```