# 5 Working with Lobster

## 5.1 Read in and process Lobster files

This exercise mainly requires you to familiarize yourself with Lobster. As a member of KU you can retrieve data from Lobster, an online limit order book data tool that provides easy-to-use, high-quality limit order book data for the entire universe of NASDAQ-traded stocks. I requested some of the data via their online interface and stored it on Absalon before running the code below.
Lobster files contain **all** trading messages (order submissions, cancellations, trades, …) for a pre-specified ticker that went through NASDAQ. The sample file in Absalon contains the entire orderbook for ticker `SAP` up to level 5 on one trading day, December 8th, 2020.
LOBSTER generates a ‘message’ and an ‘orderbook’ file for each active trading day of a selected ticker. The ‘orderbook’ file contains the evolution of the limit order book up to the requested number of levels. The ‘message’ file contains indicators for the type of event causing an update of the limit order book in the requested price range. All events are timestamped in seconds after midnight, with decimal precision of at least milliseconds and up to nanoseconds, depending on the requested period.

Both the ‘message’ and ‘orderbook’ files are provided in the .csv format and can easily be read with any statistical software package.
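
Because time comes as seconds after midnight, it helps to see the conversion to a proper timestamp once in isolation. A minimal sketch (the raw value `36000.123456` is a made-up example, not taken from the sample file):

```r
# Lobster reports time as seconds after midnight of the trading day.
ts_raw <- 36000.123456            # hypothetical value: 10:00:00 plus a fraction
date <- as.Date("2020-12-08")     # trading day of the sample file

# Anchoring the seconds-after-midnight value at midnight of the trading day
# yields a regular POSIXct timestamp.
ts <- as.POSIXct(ts_raw, origin = date, tz = "GMT")
format(ts, "%Y-%m-%d %H:%M:%OS6")
```

The same idea is applied below via `mutate()` when reading in the message file.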

We will make use of the convenient `tidyverse` package `lubridate` to handle timestamps in R.

```
# install.packages("tidyverse")
# install.packages("lubridate")
library(tidyverse)
library(lubridate)
```

### 5.1.1 Exercises

- Familiarize yourself with the output documentation of Lobster files on their homepage. Download the two files `SAP_2020-12-08_34200000_57600000_message_5.csv` and `SAP_2020-12-08_34200000_57600000_orderbook_5.csv` from Absalon and read them in using `read_csv`. Lobster files always come with the same naming convention `ticker_date_34200000_57600000_filetype_level.csv`, where `filetype` denotes either `message` or the corresponding `orderbook` snapshots.
- Merge the two files after applying some intuitive naming convention such that you can work with one file which traces all information about messages and the corresponding orderbook snapshots.
- Familiarize yourself with every step in the function `process_orderbook()`. The function performs essential cleaning of the data by making sure we filter out trading halts and aggregate order execution messages which may have triggered multiple limit orders. You can (and should) use this function whenever you work with raw orderbook data from Lobster to be able to actually use the resulting orderbook. In the solution part I provide some further explanations of the actual code.
- Write a function that processes the orderbooks by applying some basic cleaning steps:
  - Identify potential trading stops (consult the Lobster documentation for this) and discard all observations that have been recorded during the window when trading was interrupted.
  - Sometimes, messages related to the daily opening and closing auction are included in the Lobster file (especially during days when trading stops earlier, e.g. before holidays). Remove such observations (opening auctions are identified as messages with `type == 6` and `ID == -1`, and closing auctions can be identified as messages with `type == 6` and `ID == -2`).
  - Replace “empty” slots in the orderbook with `NA`s in case there was no liquidity quoted at higher levels of the orderbook. In that case Lobster reports negative bid prices and large ask prices (above 199999).
  - Remove observations with seemingly crossed prices (when the best ask is below the best bid).
  - Market orders are sometimes executed against multiple limit orders. In that case, Lobster records multiple transactions with identical timestamps. Aggregate these transactions and generate the corresponding trade size and the (size-weighted) execution price. Also, retain the orderbook snapshot *after* the transaction has been executed.
  - The direction of a trade is hard to evaluate (not observable) if the transaction has been executed against a hidden order (`type == 5`). Create a proxy for the direction based on the executed price relative to the last observed orderbook snapshot.

- Use the function to clean the SAP Lobster output.
- Summarize the orderbook by extracting the average midquote \(q_t = (a_t + b_t)/2\) (where \(a_t\) and \(b_t\) denote the best ask and the best bid), the average spread \(S_t = (a_t - b_t)\) (values below are computed in basis points relative to the concurrent midquote), the aggregate volume in USD and the quoted depth (in USD, within 5 basis points of the concurrent best price on each side of the orderbook) for each hour of the day.

- Provide a meaningful illustration of the orderbook dynamics during the particular trading day.
*This is intentionally a very open question: There is so much information within an orderbook that it is impossible to highlight everything worth considering. I encourage you to post your visualization on the discussion board in Absalon.*

### 5.1.2 Solutions

I use the following code to read in (general) Lobster files - just change `ticker`, `date` and `level` in the code snippet below.

```
# Initial information
ticker <- "SAP"
date <- as.Date("2020-12-08")
level <- 5

# Lobster files always follow the same naming convention.
messages_filename <- paste0(ticker, "_", date, "_34200000_57600000_message_", level, ".csv")
orderbook_filename <- paste0(ticker, "_", date, "_34200000_57600000_orderbook_", level, ".csv")
```

Note my own naming convention for column names. The `col_types` argument makes sure that there is easy processing afterwards.
By default, `ts` denotes the time in seconds since midnight (decimals are precise up to the nanosecond level) and `price` always comes as the dollar price times 10,000. `type` denotes the message type: `4`, for instance, corresponds to the execution of a visible order. The remaining variables are explained in more detail here.
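
For reference, the complete set of `type` codes can be kept at hand as a small lookup table. The labels below are my paraphrase of the Lobster output documentation, so double-check them against the official description on the Lobster homepage:

```r
library(tibble)

# Lobster event type codes (paraphrased; consult the Lobster documentation)
message_types <- tribble(
  ~type, ~meaning,
  1L, "Submission of a new limit order",
  2L, "Cancellation (partial deletion of a limit order)",
  3L, "Deletion (total removal of a limit order)",
  4L, "Execution of a visible limit order",
  5L, "Execution of a hidden limit order",
  6L, "Cross trade, e.g. opening or closing auction",
  7L, "Trading halt indicator"
)
```

Joining this table onto the messages, e.g. via `left_join(message_types, by = "type")`, makes the raw message stream much easier to inspect.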

Next, the corresponding orderbook snapshots contain `price` and quoted `size` for each of the `5` levels.

Each message is associated with the corresponding orderbook snapshot at that point in time.
After merging the `message` and `orderbook` files, the entire data thus looks as follows:

```
messages_raw <- read_csv(messages_filename,
                         col_names = c("ts", "type", "order_id", "m_size", "m_price", "direction", "null"),
                         col_types = cols(
                           ts = col_double(),
                           type = col_integer(),
                           order_id = col_integer(),
                           m_size = col_double(),
                           m_price = col_double(),
                           direction = col_integer(),
                           null = col_skip())) %>%
  mutate(ts = as.POSIXct(ts, origin = date, tz = "GMT", format = "%Y-%m-%d %H:%M:%OS6"),
         m_price = m_price / 10000) # Note the conversion of the time stamp to actually useful information
```

The code below is flexible enough to handle Lobster files with any number of levels. Each column contains the corresponding prices and sizes on each side of the orderbook. More information is available in the Lobster documentation. The output contains one file with all information available in the orderbook.

```
orderbook_raw <- read_csv(orderbook_filename,
                          col_names = paste(rep(c("ask_price", "ask_size", "bid_price", "bid_size"), level), rep(1:level, each = 4), sep = "_"),
                          cols(.default = col_double())) %>%
  mutate_at(vars(contains("price")), ~ . / 10000)

orderbook <- bind_cols(messages_raw, orderbook_raw)
orderbook
```

```
## # A tibble: 250,437 x 26
## ts type order_id m_size m_price direction ask_price_1
## <dttm> <int> <int> <dbl> <dbl> <int> <dbl>
## 1 2020-12-08 09:30:00 3 31551464 350 123. 1 123.
## 2 2020-12-08 09:30:00 3 31556960 1478 123. 1 123.
## 3 2020-12-08 09:30:00 1 31585356 181 123. -1 123.
## 4 2020-12-08 09:30:00 3 31556712 181 123. -1 123.
## 5 2020-12-08 09:30:00 1 31585856 1770 123. 1 123.
## 6 2020-12-08 09:30:00 1 31586632 195 123. -1 123.
## 7 2020-12-08 09:30:00 2 31585856 787 123. 1 123.
## 8 2020-12-08 09:30:00 3 31529708 205 123. 1 123.
## 9 2020-12-08 09:30:00 3 31537012 214 123. 1 123.
## 10 2020-12-08 09:30:00 1 31604124 393 123. 1 123.
## # ... with 250,427 more rows, and 19 more variables: ask_size_1 <dbl>,
## # bid_price_1 <dbl>, bid_size_1 <dbl>, ask_price_2 <dbl>, ask_size_2 <dbl>,
## # bid_price_2 <dbl>, bid_size_2 <dbl>, ask_price_3 <dbl>, ask_size_3 <dbl>,
## # bid_price_3 <dbl>, bid_size_3 <dbl>, ask_price_4 <dbl>, ask_size_4 <dbl>,
## # bid_price_4 <dbl>, bid_size_4 <dbl>, ask_price_5 <dbl>, ask_size_5 <dbl>,
## # bid_price_5 <dbl>, bid_size_5 <dbl>
```

You see that the orderbook contains 250,437 observations. While the columns `ts` to `direction` help to understand the specific messages, the remaining columns contain a snapshot of the orderbook at that particular point in time.
In line with the exercise, the function `process_orderbook()` performs a number of essential steps. You can call it from within `R` with the following command.

```
process_orderbook <- function(orderbook){

  # Did a trading halt happen?
  halt_index <- orderbook %>%
    filter(type == 7 & direction == -1 & m_price == -1/10000 | type == 7 & direction == -1 & m_price == 1/10000)

  while (nrow(halt_index) > 1) {
    # Filter out messages that occurred in between trading halts
    cat("Trading halt detected")
    orderbook <- orderbook %>%
      filter(ts < halt_index$ts[1] | ts > halt_index$ts[2])
    halt_index <- halt_index %>% filter(row_number() > 2)
  }

  # Discard everything before type 6 & ID -1 and everything after type 6 & ID -2
  opening_auction <- orderbook %>% filter(type == 6,
                                          order_id == -1) %>% .$ts
  closing_auction <- orderbook %>% filter(type == 6,
                                          order_id == -2) %>% .$ts

  if(length(opening_auction) != 1) opening_auction <- orderbook %>% select(ts) %>% head(1) %>% .$ts - seconds(1)
  if(length(closing_auction) != 1) closing_auction <- orderbook %>% select(ts) %>% tail(1) %>% .$ts + seconds(1)

  orderbook <- orderbook %>% filter(ts > opening_auction & ts < closing_auction)
  orderbook <- orderbook %>% filter(type != 6 & type != 7)

  # Replace "empty" slots in orderbook (0 volume) with NA prices
  orderbook <- orderbook %>%
    mutate_at(vars(contains("bid_price")), list(~replace(., . < 0, NA))) %>%
    mutate_at(vars(contains("ask_price")), list(~replace(., . >= 199999, NA))) %>%
    filter(ask_price_1 > bid_price_1)

  # Replace "unreasonable" prices from orderbook
  # orderbook <- orderbook %>%
  #   mutate_at(vars(contains("ask_price")), ~if_else(. > ask_price_1*(1 + 5/100), as.numeric(NA), .)) %>%
  #   mutate_at(vars(contains("bid_price")), ~if_else(. < bid_price_1*(1 - 5/100), as.numeric(NA), .))

  # Merge transactions with unique timestamp
  trades <- orderbook %>%
    filter(type == 4 | type == 5) %>%
    select(ts:direction)

  trades <- inner_join(trades,
                       orderbook %>%
                         group_by(ts) %>%
                         filter(row_number() == 1) %>%
                         ungroup() %>%
                         transmute(ts, ask_price_1, bid_price_1,
                                   midquote = ask_price_1/2 + bid_price_1/2,
                                   lag_midquote = lag(midquote)),
                       by = "ts")

  trades <- trades %>%
    mutate(direction = case_when(type == 5 & m_price < lag_midquote ~ 1, # direction = 1 if execution price is below current midquote
                                 type == 5 & m_price > lag_midquote ~ -1,
                                 type == 4 ~ as.double(direction),
                                 TRUE ~ as.double(NA)))

  trade_m <- trades %>%
    group_by(ts) %>%
    summarise(type = dplyr::last(type),
              order_id = NA,
              m_price = sum(m_price * m_size) / sum(m_size),
              m_size = sum(m_size),
              direction = dplyr::last(direction))

  trade_m <- inner_join(trade_m,
                        orderbook %>%
                          select(ts, ask_price_1:last_col()) %>%
                          group_by(ts) %>%
                          filter(row_number() == n()),
                        by = "ts")

  orderbook <- orderbook %>% filter(type != 4,
                                    type != 5) %>%
    bind_rows(trade_m) %>%
    arrange(ts) %>%
    mutate(direction = if_else(type == 4 | type == 5, direction, as.double(NA)))

  return(orderbook)
}
```

You can process the raw data from before by simply calling:

`orderbook <- process_orderbook(orderbook)`

Next, I compute the midquote, bid-ask spread, aggregate volume and depth (the amount of tradeable units in the orderbook):

- Midquote \(q_t = (a_t + b_t)/2\) (where \(a_t\) and \(b_t\) denote the best ask and the best bid)
- Spread \(S_t = (a_t - b_t)\) (values below are computed in basis points relative to the concurrent midquote)
- Volume is the aggregate sum of traded units of the stock. I do not differentiate between hidden (`type == 5`) and visible volume.

```
orderbook <- orderbook %>%
  mutate(midquote = ask_price_1/2 + bid_price_1/2,
         spread = (ask_price_1 - bid_price_1)/midquote * 10000,
         volume = if_else(type == 4 | type == 5, m_size * m_price, 0))
```

As a last step, the `depth` of the orderbook denotes the number of assets that can be traded without moving the quoted price more than a given range (measured in basis points) from the concurrent midquote. The function below takes care of the slightly involved computations.

```
compute_depth <- function(df, side = "bid", bp = 0){
  if(side == "bid"){
    value_bid <- (1 - bp/10000) * df %>% select("bid_price_1")
    index_bid <- df %>% select(contains("bid_price")) %>%
      mutate_all(function(x) {x >= value_bid})
    sum_vector <- (df %>% select(contains("bid_size")) * index_bid) %>% rowSums()
  } else {
    value_ask <- (1 + bp/10000) * df %>% select("ask_price_1")
    index_ask <- df %>% select(contains("ask_price")) %>%
      mutate_all(function(x) {x <= value_ask})
    sum_vector <- (df %>% select(contains("ask_size")) * index_ask) %>% rowSums()
  }
  return(sum_vector)
}

orderbook <- orderbook %>% mutate(depth_b_5 = midquote * compute_depth(orderbook, bp = 5),
                                  depth_a_5 = midquote * compute_depth(orderbook, bp = 5, side = "ask"))
```

Almost there! The snippet below splits the data into 60 minute intervals and computes the averages of the computed summary statistics.

```
orderbook_summary <- orderbook %>%
  mutate(ts_minute = floor_date(ts, "1 hour")) %>%
  select(midquote:ts_minute) %>%
  group_by(ts_minute) %>%
  mutate(messages = n(),
         volume = sum(volume)) %>%
  summarise_all(mean)
```

`orderbook_summary %>% knitr::kable()`

| ts_minute           | midquote | spread   | volume  | depth_b_5 | depth_a_5 | messages |
|---------------------|----------|----------|---------|-----------|-----------|----------|
| 2020-12-08 09:00:00 | 123.1138 | 5.027067 | 2327438 | 705791.50 | 785250.16 | 75840 |
| 2020-12-08 10:00:00 | 123.0940 | 4.177430 | 3018837 | 637904.15 | 688300.21 | 113783 |
| 2020-12-08 11:00:00 | 123.4496 | 4.074974 | 1504755 | 585285.96 | 539908.49 | 50766 |
| 2020-12-08 12:00:00 | 123.7538 | 8.736876 | 1195042 | 45367.95 | 48192.92 | 1870 |
| 2020-12-08 13:00:00 | 123.7922 | 5.419168 | 1398596 | 47210.76 | 40231.32 | 1787 |
| 2020-12-08 14:00:00 | 123.9048 | 5.661990 | 1146187 | 42046.84 | 42393.34 | 2031 |
| 2020-12-08 15:00:00 | 123.8382 | 6.993637 | 2615158 | 40208.25 | 70597.35 | 3733 |

Here we go: during the first trading hour on December 8th, 2020, around 75,000 messages related to the orderbook of SAP were processed by NASDAQ. The quoted spread was on average around 4bp. On average, roughly 2 to 3 million USD worth of SAP shares were traded during each hour. Quoted liquidity 5 basis points around the midquote seems rather small relative to the tremendous amount of trading activity during this trading day.

Finally, some visualization of the data at hand: The code below shows the dynamics of the traded prices (red points) and the quoted prices at the higher levels of the orderbook.

```
orderbook_trades <- orderbook %>%
  filter(type == 4 | type == 5) %>%
  select(ts, m_price)

orderbook_quotes <- orderbook %>%
  mutate(ts_new = floor_date(ts, "minute")) %>%
  group_by(ts_new) %>%
  summarise_all(dplyr::last) %>%
  select(-ts_new) %>%
  mutate(id = row_number()) %>%
  select(ts, id, matches("bid|ask")) %>%
  gather(level, price, -ts, -id) %>%
  separate(level, into = c("side", "variable", "level"), sep = "_") %>%
  mutate(level = as.numeric(level)) %>%
  pivot_wider(names_from = variable, values_from = price)

ggplot() +
  theme_bw() +
  geom_point(data = orderbook_quotes, aes(x = ts, y = price, color = as_factor(level), size = size/max(size)), alpha = 0.1) +
  geom_point(data = orderbook_trades, aes(x = ts, y = m_price), color = 'red') +
  labs(title = "Orderbook Dynamics",
       y = "Price",
       x = "") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "none") +
  scale_y_continuous()
```

## 5.2 Working with many files

In this exercise you will work with a longer time series of Lobster data, again for the stock SAP. To do so, I provide a zipped folder `SAP.7z` on Absalon which contains daily Lobster messages for the period from January 1st, 2020 until March 8th, 2021. You will analyze typical market-microstructure properties and dynamics such as the intra-daily evolution of the bid-ask spread and compute daily realized volatility measures. For more information and follow-up analysis, consult this blog post on my homepage.

### 5.2.1 Exercises

- Download the file `SAP.7z` from Absalon. Unzip it from within `R` using the package `archive`.
- Write a function that performs the following useful task: Identify all files within one folder which comply with the Lobster file naming convention. Then, extract dates, ticker and levels out of the file name. Consult the section on string manipulation in R for Data Science for help.
- For **each** of the existing files, run the process from the previous exercise: Read in and clean the data. Then, return a tibble which denotes the trading volume, average quoted spread, return, a measure for the realized volatility and the number of messages *within each 5 minute window*. The average quoted spread should be computed as a weighted average based on the time it takes until the orderbook is updated.
- Illustrate the intra-daily dynamics by computing averages across dates for each 5 minute window for each of the variables considered above. Also, indicate the uncertainty by additionally plotting meaningful time-series quantiles.

### 5.2.2 Solutions

Unzipping `7z` files can be somewhat painful, especially if you are a Windows user. The `R` package `archive` provides a useful function for that task.

```
# install.packages("archive")
library(archive)
directory <- "SAP" # target directory

archive_extract("SAP_2020-10-01_2020-12-31_1.7z",
                dir = directory)
```

The next few lines of code are good practice for learning regular expressions in R, which can be incredibly helpful when working with strings or with many files, as in the following example. The tibble extracts the relevant information based on the Lobster filenames, so you will be able to use this code snippet also for more general purposes, e.g. when working with multiple tickers.

```
files <- tibble(files = dir(directory, full.names = TRUE)) %>%
  mutate(ticker = gsub(paste0(directory, "/(.*)_(.*)_3420.*0_(.*)_(.*).csv"), "\\1", files),
         date = gsub(paste0(directory, "/(.*)_(.*)_3420.*0_(.*)_(.*).csv"), "\\2", files),
         date = as.Date(date),
         type = gsub(paste0(directory, "/(.*)_(.*)_3420.*0_(.*)_(.*).csv"), "\\3", files),
         level = gsub(paste0(directory, "/(.*)_(.*)_3420.*0_(.*)_(.*).csv"), "\\4", files))

existing_files <- files %>%
  group_by(ticker, date, level) %>%
  filter(n() == 2) %>%
  summarise_all(dplyr::first) %>%
  ungroup()
```

Note that I only keep ticker-date-level combinations for which 2 files are available: the orderbook and the messages from Lobster. In total, the data contains observations for 296 trading days. The next function is a general wrapper to read in files, clean the data using the `process_orderbook()` function from above and then summarize the relevant information. Note the weighted mean when it comes to averaging spreads. The final part including `full_grid` simply makes sure that we can compare the function output across many different days even if trading finishes early or if we, for whatever reason, do not have observations on particular trading days.

```
compute_summaries <- function(n, directory, existing_files){

  ticker <- existing_files %>% filter(row_number() == n) %>% .$ticker
  date <- existing_files %>% filter(row_number() == n) %>% .$date
  level <- existing_files %>% filter(row_number() == n) %>% .$level

  # Read in messages
  messages_filename <- paste0(directory, "/", ticker, "_", date, "_34200000_57600000_message_", level, ".csv")
  orderbook_filename <- paste0(directory, "/", ticker, "_", date, "_34200000_57600000_orderbook_", level, ".csv")

  messages_raw <- read_csv(messages_filename,
                           col_names = c("ts", "type", "order_id", "m_size", "m_price", "direction", "null"),
                           col_types = cols(
                             ts = col_double(),
                             type = col_integer(),
                             order_id = col_integer(),
                             m_size = col_double(),
                             m_price = col_double(),
                             direction = col_integer(),
                             null = col_skip())) %>%
    mutate(ts = as.POSIXct(ts, origin = date, tz = "GMT", format = "%Y-%m-%d %H:%M:%OS6"),
           m_price = m_price / 10000)

  orderbook_raw <- read_csv(orderbook_filename,
                            col_names = paste(rep(c("ask_price", "ask_size", "bid_price", "bid_size"), level), rep(1:level, each = 4), sep = "_"),
                            cols(.default = col_double())) %>%
    mutate_at(vars(contains("price")), ~ . / 10000)

  orderbook <- bind_cols(messages_raw, orderbook_raw)

  opening_auction <- orderbook %>% filter((type == 6 & order_id == -1) | row_number() == 1) %>% filter(row_number() == n()) %>% .$ts
  closing_auction <- orderbook %>% filter((type == 6 & order_id == -2) | row_number() == n()) %>% filter(row_number() == 1) %>% .$ts

  orderbook <- process_orderbook(orderbook)

  orderbook_summaries <- orderbook %>% transmute(ts,
                                                 midquote = ask_price_1/2 + bid_price_1/2,
                                                 trading_volume = if_else(type == 4 | type == 5, m_price * m_size, as.double(NA)),
                                                 spread = 10000 * (ask_price_1 - bid_price_1)/midquote,
                                                 return = log(midquote) - lag(log(midquote))) %>%
    mutate(ts_latency = as.numeric(lead(ts)) - as.numeric(ts),
           ts_latency = if_else(is.na(ts_latency), 0, ts_latency)) %>%
    mutate(ts_minute = floor_date(ts, "5 minutes")) %>%
    select(-ts) %>%
    group_by(ts_minute) %>%
    summarise(midquote = dplyr::last(midquote),
              trading_volume = sum(trading_volume, na.rm = TRUE),
              trading_volume = if_else(is.na(trading_volume), as.double(0), trading_volume) / 1000000,
              n = n(),
              spread = weighted.mean(spread, ts_latency),
              rv = 100 * sqrt(12 * 6.5) * sum(return^2))

  full_grid <- tibble(ts_minute = seq(from = as.POSIXct(paste0(date, " 09:30:00"), tz = "GMT"),
                                      to = as.POSIXct(paste0(date, " 16:00:00"), tz = "GMT"),
                                      by = "5 min"))

  orderbook_summaries <- left_join(full_grid,
                                   orderbook_summaries,
                                   by = "ts_minute") %>%
    fill(midquote, spread) %>%
    mutate_at(vars(n, trading_volume, rv), ~replace(., is.na(.), 0)) %>%
    mutate(date = as.Date(date),
           ticker = ticker,
           return = 100 * (log(midquote) - log(lag(midquote)))) %>%
    select(ts = ts_minute,
           date, ticker,
           everything())

  return(orderbook_summaries)
}
```

Next, I evaluate each of the files available. You can also consider a `for` loop or nesting based on the tibble `existing_files`. The code below executes the function `compute_summaries` for each *valid* trading day and stores the output in a convenient tibble. In the final step I make use of the package `hms`, which provides functionalities beyond the `lubridate` package to work with timestamps. I summarize the variables of interest by extracting the average (across trading days) as well as the corresponding time-series quantiles. The final figure illustrates the intra-daily dynamics. Note the use of `geom_ribbon`, which is a nice way to illustrate confidence intervals.
Depending on the size and number of Lobster files, processing the code below may be somewhat time consuming. For illustrative purposes I used parallelization based on the package `doFuture`.

```
library("doFuture")
registerDoFuture()
plan(multisession) ## on MS Windows
options(future.rng.onMisuse = "ignore")

processed_data <- foreach(i = 1:nrow(existing_files)) %dopar% {
  tryCatch({
    compute_summaries(i, directory, existing_files)
  },
  error = function(e){cat("error", i, "\n")})
}

processed_data <- processed_data %>% bind_rows()
```

```
# install.packages("hms")
processed_data %>%
  select(-midquote) %>%
  mutate(time = hms::as_hms(ts)) %>%
  pivot_longer(trading_volume:return) %>%
  group_by(ticker, time, name) %>%
  na.omit() %>%
  summarise(m = mean(value),
            q5 = quantile(value, 0.05),
            q95 = quantile(value, 0.95),
            q25 = quantile(value, 0.25),
            q75 = quantile(value, 0.75)) %>%
  ggplot(aes(x = time, y = m)) +
  geom_line() +
  geom_ribbon(aes(ymin = q5, ymax = q95), alpha = 0.1) +
  geom_ribbon(aes(ymin = q25, ymax = q75), alpha = 0.1) +
  facet_wrap(~name, scales = "free_y", ncol = 1) +
  labs(x = "", y = "")
```

```
processed_data %>%
  select(-date, -ticker) %>%
  pivot_longer(midquote:last_col()) %>%
  ggplot(aes(x = ts, y = value)) +
  geom_point(size = 0.1) +
  facet_wrap(~name, scales = "free_y", ncol = 1) +
  labs(x = "", y = "")
```

## 5.3 Brownian Motion and realized volatility

In this exercise you will analyze the empirical properties of the (geometric) Brownian motion based on a simulation study. Such simulation studies are useful for many applications in continuous-time finance as well, e.g. for option pricing.

Finance theory suggests that prices must be so-called semimartingales. The most commonly used semimartingale is the Geometric Brownian Motion (GBM), where we assume that the logarithm of the stock price \(S_t\), \(X_t = \log(S_t)\), follows the dynamics \[X_t = X_0 + \mu t + \sigma W_t\] where \(\mu\) and \(\sigma\) are constants and \(W_t\) is a Brownian motion.
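
For the simulation exercises below it is useful to spell out what this model implies for discrete increments: over a time step of length \(\Delta\), the increment of the Brownian motion is \(\mathcal{N}(0, \Delta)\)-distributed and independent of the past, so \[X_{t+\Delta} - X_t = \mu \Delta + \sigma (W_{t+\Delta} - W_t) \sim \mathcal{N}\left(\mu \Delta, \sigma^2 \Delta\right).\] A sample path on a grid can therefore be generated as a cumulative sum of independent normal draws.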

### 5.3.1 Exercises

- Write a function that simulates (log) stock prices according to the model above. Recall that the Brownian motion has independent increments which are normally distributed. The function should generate a path of the log stock price with a pre-defined number of observations (granularity), drift term and volatility. Plot one sample path of the implied (log) stock price.
- Use the function to simulate a large number of geometric Brownian motions. Illustrate the quadratic variation of the process.
- Compute the realized volatility for each simulated process and illustrate the sampling distribution.
- Finally, consider the infill asymptotics. Show the effect of sampling once per day, once per hour, once per minute, and so on, on the variance of the realized volatility.

### 5.3.2 Solutions

For this exercise I make use of some functionalities from the `xts` and `highfrequency` packages, particularly for aggregation at different frequencies.

```
library(tidyverse)
library(lubridate)
library(xts)
library(highfrequency)
```

**Note: When loading xts and the tidyverse jointly you may run into serious issues if you did not take naming conflicts into account!** Especially the convenient functions `last` and `first` from `dplyr` may be affected. For that reason, I consistently use `dplyr::first` and `dplyr::last` in this chapter.

First, the function below is a rather general version of a simulated geometric Brownian motion. Note that there is no direct need to assign a trading date and hour; I just implement this because it provides output which is closer to actual data, and therefore parts of my code can be recycled for future projects. The function can be called with a specific seed to replicate the results. The relevant parameters are the initial value `start` and the parameters of the geometric Brownian motion `mu` and `sigma`. The input `steps` defines the granularity (sampling frequency) of the simulated Brownian motion.

```
simulate_geom_brownian_motion <- function(seed = 3010,
                                          steps = 1000,
                                          mu = 0,
                                          sigma = 1,
                                          start = 0){
  set.seed(seed)
  date <- Sys.Date() + days(seed)
  hour(date) <- 9
  minute(date) <- 30
  date_end <- date
  hour(date_end) <- 16
  minute(date_end) <- 0
  tibble(step = 1:(steps + 1),
         time = seq(from = date,
                    to = date_end,
                    length.out = steps + 1),
         price = cumsum(c(start, rnorm(steps, mean = mu/steps, sd = sigma/sqrt(steps)))))
}
```

Note the last line of the function, which is doing most of the work: We simply exploit the fact that increments of a Brownian motion are independent and normally distributed. The entire path is nothing but the cumulative sum of the increments.
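
As a quick sanity check of this logic, one can verify that the variance of the simulated terminal value scales with \(\sigma^2\). A minimal standalone sketch (the parameter values are arbitrary and chosen only for illustration):

```r
set.seed(42)
sigma <- 2
steps <- 1000
n_paths <- 5000

# The terminal value X_1 of a path with zero drift and start 0 is just the
# sum of the independent normal increments with sd sigma / sqrt(steps).
terminal <- replicate(n_paths,
                      sum(rnorm(steps, mean = 0, sd = sigma / sqrt(steps))))

var(terminal) # should come out close to sigma^2 = 4
```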

Next, I simulate the stock price with 1,000 intradaily observations. Recall that the geometric Brownian motion implies dynamics for the log of the stock price, which is why I adjust the plotted value on the \(y\)-axis.

```
tab <- simulate_geom_brownian_motion(start = log(18))
tab %>%
  ggplot(aes(x = time, y = exp(price))) +
  geom_line() + labs(x = "", y = "Stock price")
```

Next we scale the simulation up. The code below simulates 250 paths of the geometric Brownian motion, all with the same starting value, and illustrates the sampled (log) price paths.

```
true_sigma <- 2.4
steps <- 1000

simulation <- tibble(iteration = 1:250)
simulation <- simulation %>%
  mutate(values = map(iteration, simulate_geom_brownian_motion, sigma = true_sigma, steps = steps)) %>%
  unnest(values)

simulation %>%
  ggplot(aes(x = step, y = price, group = iteration)) +
  geom_line(alpha = 0.3) +
  labs(x = "Iteration", y = "(log) price")
```

Next we investigate some of the properties of the geometric Brownian motion. First, by definition, the sample variation of \(X_t = \int\limits_0^t\mu dt + \int\limits_0^t\sigma dW_t\) is \(\int\limits_0^t\sigma^2 dt = \sigma^2 t\). The figure below illustrates the sample variation over time and shows that the variance across all simulated (log) prices increases linearly in \(t\), as expected. For the simulation I used \(\sigma = 2.4\) (`true_sigma`), which corresponds to the slope of the line (adjusted by the number of observations).

```
simulation %>%
  group_by(step) %>%
  summarise(sd = sd(price)) %>%
  ggplot(aes(x = step, y = sd^2)) +
  geom_line() +
  geom_abline(aes(slope = true_sigma^2/steps, intercept = 0)) +
  labs(x = "Iteration", y = "Sample variation")
```

What about the realized volatility? As defined in the lecture, the realized variance \[RV = \sum\limits_{i=1}^n \Delta X_{t_i}^2\] converges to the integrated variance (in the absence of jumps, etc.). The histogram below illustrates the sampling distribution of the realized variance which, as expected, centers around `true_sigma^2`, the integrated variance of the simulated geometric Brownian motion. You can change the value `true_sigma` above and analyze the effect.

```
simulation %>%
  group_by(iteration) %>%
  summarise(rv = sum(((price - lag(price))^2), na.rm = TRUE)) %>%
  ggplot(aes(x = rv)) +
  geom_histogram(bins = 100) +
  labs(x = "Realized Volatility") +
  geom_vline(aes(xintercept = true_sigma^2), color = "red", linetype = "dotted")
```

Finally, some more analysis regarding the infill asymptotics. As should be clear by now, sampling at the highest possible frequency ensures that the realized volatility recovers the integrated variance of the (log) price process. In other words, we can measure the variance. The following few lines of code illustrate the effect: Whereas we simulated our process at a high frequency, we now sample sub-grids at different frequencies, ranging from every second to once per day. The figure then illustrates the daily realized volatilities relative to the integrated variance of the process. Not surprisingly, the dispersion around the true parameter decreases substantially with higher sampling frequency.

Note that the code below also illustrates how to work with the `xts` package and how to switch between the `tidyverse` and the `highfrequency` data format requirements. This is arguably a bit cumbersome, but in my experience it is worth exploiting the functionalities of both worlds instead of recoding everything in a tidy manner.

```
infill_asymptotics <- function(tab, sampling_frequency = 5){
  tab_xts <- as.xts(tab %>% select(price),
                    order.by = tab %>% pull(time))
  tab_xts <- aggregatePrice(tab_xts,
                            alignBy = "minutes", alignPeriod = sampling_frequency,
                            tz = "GMT", marketOpen = "09:30:00")
  tab_xts %>%
    fortify.zoo %>%
    as_tibble %>%
    janitor::clean_names() %>%
    rename("ts" = index) %>%
    summarise(rv = sum((price - lag(price))^2, na.rm = TRUE)) %>%
    as.numeric()
}

rvs <- simulation %>%
  group_by(iteration) %>%
  nest(data = c(step, time, price)) %>%
  mutate("1 Second" = map_dbl(data, infill_asymptotics, 1/60),
         "1 Minute" = map_dbl(data, infill_asymptotics, 1),
         "5 Minutes" = map_dbl(data, infill_asymptotics, 5),
         "15 Minutes" = map_dbl(data, infill_asymptotics, 15),
         "1 hour" = map_dbl(data, infill_asymptotics, 60),
         "1 day" = map_dbl(data, infill_asymptotics, 6.5 * 60))

rvs %>%
  select(-data) %>%
  ungroup() %>%
  pivot_longer(-iteration,
               names_to = "Sampling Frequency",
               values_to = "rv") %>%
  mutate(`Sampling Frequency` = as.factor(`Sampling Frequency`),
         `Sampling Frequency` = fct_reorder(`Sampling Frequency`, rv, sd)) %>%
  ggplot(aes(x = sqrt(rv),
             fill = `Sampling Frequency`)) +
  geom_density() +
  geom_vline(aes(xintercept = true_sigma), linetype = "dotted") +
  facet_wrap(~`Sampling Frequency`, scales = "free_y") +
  labs(x = "(Log) Realized Volatility", y = "")
```

## 5.4 Refresh-time sampling and realized covariances

This collection of code chunks should provide you with some advanced methods for data handling suitable for using LOBSTER data with the `highfrequency` package. Due to storage limitations, I do not provide the raw data itself, but the code should work in general if you follow some easy workflows to download your data from LOBSTER and import it into RStudio.

### 5.4.1 Exercises

- Assume you want to compute time series of realized covariance matrices based on LOBSTER data. For that purpose I, for instance, requested data for 6 stocks, *UNH, MCD, HD, GS, CRM, AMGN*, for the period from March 1st, 2021 until March 25th, 2021 from LOBSTER. All files are stored in ticker-specific folders with the common LOBSTER naming convention. Write a script that generates a tibble which contains all existing files suitable for the following read-in.
- Write code in a tidy fashion that reads in all available files for a pre-specified day, cleans the LOBSTER files and generates an `xts` object of log midquote prices.
- Perform refresh-time sampling and compute the realized variance-covariance matrix.

### 5.4.2 Solutions

LOBSTER downloads always come in one (zipped) folder per ticker. As described above, I first write a small script that scans through all folders which comply with the naming convention to identify the available tickers, dates and the level information.

`# install.packages("xts")`

```
# Read in the files in some useful manner
all_files <- dir("../Skripts/Part_2", pattern = "_1", full.names = TRUE) %>%
  map(function(.) tibble(files = dir(., full.names = TRUE))) %>%
  bind_rows() %>%
  mutate(ticker = gsub(".*/(.*)_(.*)_3420.*0_(.*)_(.*).csv", "\\1", files),
         date = gsub(".*/(.*)_(.*)_3420.*0_(.*)_(.*).csv", "\\2", files),
         date = as.Date(date),
         type = gsub(".*/(.*)_(.*)_3420.*0_(.*)_(.*).csv", "\\3", files),
         level = gsub(".*/(.*)_(.*)_3420.*0_(.*)_(.*).csv", "\\4", files))
```
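To see what the regular expression above extracts, apply it to a single hypothetical path that follows the LOBSTER naming convention (the path itself is made up for illustration):

```r
file <- "data/UNH_1/UNH_2021-03-01_34200000_57600000_message_1.csv"
pattern <- ".*/(.*)_(.*)_3420.*0_(.*)_(.*).csv"

ticker <- gsub(pattern, "\\1", file)         # "UNH"
date <- as.Date(gsub(pattern, "\\2", file))  # 2021-03-01
type <- gsub(pattern, "\\3", file)           # "message"
level <- gsub(pattern, "\\4", file)          # "1"
```

The four capture groups correspond exactly to ticker, date, file type and level; the interval `34200000_57600000` (09:30 to 16:00 in milliseconds after midnight) is matched but discarded.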

Next, I restrict myself to a specific date. Needless to say, you can run the code without filtering by date and instead additionally group (nest) by date to perform multiple cross-asset computations simultaneously. Be warned, however, that this poses substantial challenges with respect to the required memory.

```
# Specify date
selected_date <- as.Date("2021-03-01")

midquote_data <- all_files %>%
  group_by(ticker, date, level) %>%
  filter(date == selected_date) %>%
  nest(data = c(-ticker))
```

Next, I write a simple function that reads in the LOBSTER files, cleans them using the already documented function `process_orderbook`, and returns an `xts` object with the time series of midquotes. Note that for this purpose the package `xts` needs to be installed.

```
compute_log_midquotes <- function(file_tibble){

  level <- as.integer((file_tibble$level)[[1]])
  date <- (file_tibble$date)[[1]]

  # Read in messages and orderbook snapshots
  messages_filename <- file_tibble %>% filter(type == "message") %>% pull(files)
  orderbook_filename <- file_tibble %>% filter(type == "orderbook") %>% pull(files)

  messages_raw <- read_csv(messages_filename,
                           col_names = c("ts", "type", "order_id", "m_size", "m_price", "direction", "null"),
                           col_types = cols(
                             ts = col_double(),
                             type = col_integer(),
                             order_id = col_integer(),
                             m_size = col_double(),
                             m_price = col_double(),
                             direction = col_integer(),
                             null = col_skip())) %>%
    mutate(ts = as.POSIXct(ts, origin = date, tz = "GMT", format = "%Y-%m-%d %H:%M:%OS6"),
           m_price = m_price / 10000)

  orderbook_raw <- read_csv(orderbook_filename,
                            col_names = paste(rep(c("ask_price", "ask_size", "bid_price", "bid_size"), level),
                                              rep(1:level, each = 4), sep = "_"),
                            col_types = cols(.default = col_double())) %>%
    mutate_at(vars(contains("price")), ~ . / 10000)

  orderbook <- bind_cols(messages_raw, orderbook_raw)
  opening_auction <- orderbook %>% filter((type == 6 & order_id == -1) | row_number() == 1) %>% filter(row_number() == n()) %>% .$ts
  closing_auction <- orderbook %>% filter((type == 6 & order_id == -2) | row_number() == n()) %>% filter(row_number() == 1) %>% .$ts

  orderbook <- process_orderbook(orderbook)
  orderbook <- orderbook %>%
    transmute(ts,
              midquote = ask_price_1 / 2 + bid_price_1 / 2) %>%
    filter(midquote > 0) %>%
    na.omit()
  orderbook <- xts::as.xts(orderbook %>% select(midquote),
                           order.by = orderbook %>% pull(ts))
  return(orderbook)
}
```
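One detail in the function above deserves emphasis: LOBSTER timestamps are seconds after midnight, so passing the trading date as `origin` to `as.POSIXct()` turns them into full timestamps. A minimal base-R illustration with a made-up timestamp value:

```r
date <- as.Date("2021-03-01")
ts <- 34200.5   # seconds after midnight; 34200 s corresponds to the 09:30:00 market open
stamp <- as.POSIXct(ts, origin = date, tz = "GMT")
format(stamp, "%Y-%m-%d %H:%M:%OS1")  # "2021-03-01 09:30:00.5"
```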

The function `compute_log_midquotes` can thus be applied in a tidy manner via `map`. As a result, the object `midquote_data` contains *all* time series of midquotes for each available ticker on the selected day.

```
midquote_data <- midquote_data %>%
  mutate(midquote = map(data, compute_log_midquotes))

midquote_data <- midquote_data %>%
  mutate(observations = map_dbl(midquote, nrow)) %>%
  filter(observations > 0 | is.na(observations)) %>%
  ungroup()

midquote_data
```

```
## # A tibble: 6 x 4
## ticker data midquote observations
## <chr> <list> <list> <dbl>
## 1 AMGN <grouped_df[,4] [2 x 4]> <xts[,1] [53,268 x 1]> 53268
## 2 CRM <grouped_df[,4] [2 x 4]> <xts[,1] [268,158 x 1]> 268158
## 3 GS <grouped_df[,4] [2 x 4]> <xts[,1] [56,978 x 1]> 56978
## 4 HD <grouped_df[,4] [2 x 4]> <xts[,1] [62,769 x 1]> 62769
## 5 MCD <grouped_df[,4] [2 x 4]> <xts[,1] [49,338 x 1]> 49338
## 6 UNH <grouped_df[,4] [2 x 4]> <xts[,1] [49,815 x 1]> 49815
```

The final piece of code simply shows how to work with the clean structure in `midquote_data`. More specifically, I make use of the fact that the column `midquote` is a list of `xts` elements, which is the desired input for most multivariate functions in the `highfrequency` package. To perform refresh-time sampling, for instance, and to compute the realized covariance matrix, only 4 lines of code are required; the rest is simple housekeeping.

```
real_cov_matrix <- midquote_data %>%
  pull(midquote) %>%
  highfrequency::refreshTime() %>%
  rCov(makeReturns = TRUE)

colnames(real_cov_matrix) <- midquote_data %>% pull(ticker)
rownames(real_cov_matrix) <- midquote_data %>% pull(ticker)
real_cov_matrix
```

```
## AMGN CRM GS HD MCD
## AMGN 1.094213e-04 1.340901e-05 7.978317e-06 1.339822e-05 6.853033e-06
## CRM 1.340901e-05 3.844353e-04 1.386922e-05 2.318136e-05 1.499004e-05
## GS 7.978317e-06 1.386922e-05 2.843359e-04 7.798312e-06 1.642862e-06
## HD 1.339822e-05 2.318136e-05 7.798312e-06 2.178838e-04 1.659505e-05
## MCD 6.853033e-06 1.499004e-05 1.642862e-06 1.659505e-05 1.431650e-04
## UNH 9.855284e-06 5.543091e-07 7.613928e-06 5.143702e-06 1.260943e-05
## UNH
## AMGN 9.855284e-06
## CRM 5.543091e-07
## GS 7.613928e-06
## HD 5.143702e-06
## MCD 1.260943e-05
## UNH 2.385806e-04
```
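For intuition on what `refreshTime()` does: the next refresh time is the first point at which *every* series has delivered at least one fresh observation, so the slowest asset dictates the sampling grid. The following simplified base-R sketch illustrates the idea (my own illustration, not the `highfrequency` implementation):

```r
refresh_times <- function(timestamp_list) {
  out <- numeric(0)
  current <- -Inf
  repeat {
    # For each asset: earliest observation strictly after the last refresh time
    nxt <- sapply(timestamp_list, function(ts) {
      candidates <- ts[ts > current]
      if (length(candidates) == 0) NA_real_ else min(candidates)
    })
    if (any(is.na(nxt))) break  # some asset has run out of observations
    current <- max(nxt)         # the slowest asset determines the refresh time
    out <- c(out, current)
  }
  out
}

# Two assets with asynchronous, made-up observation times (in seconds)
refresh_times(list(a = c(1, 2, 5, 9), b = c(2, 3, 8, 10)))  # 2 5 9
```

At each refresh time, every asset's most recent midquote is then carried forward, which is what makes the subsequent covariance computation well-defined.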

As a simple application, the global minimum-variance portfolio weights are proportional to the inverse covariance matrix applied to a vector of ones, normalized to sum to one:

```
w <- solve(real_cov_matrix) %*% rep(1, ncol(real_cov_matrix))
w / sum(w)
```

```
## [,1]
## AMGN 0.30702005
## CRM 0.07241542
## GS 0.12071794
## HD 0.13301906
## MCD 0.22990253
## UNH 0.13692501
```