5 Working with Lobster

5.1 Read in and process Lobster files

This exercise mainly requires you to familiarize yourself with LOBSTER. As a member of KU you can retrieve data from LOBSTER, an online limit order book data tool that provides easy-to-use, high-quality limit order book data for the entire universe of NASDAQ-traded stocks. I requested some of the data through their online interface and stored it on Absalon before running the code below. LOBSTER files contain all trading messages (order submissions, cancellations, trades, …) for a pre-specified ticker that went through NASDAQ. The sample file on Absalon contains the entire orderbook for the ticker SAP up to level 5 on one trading day, December 8th, 2020. LOBSTER generates a ‘message’ and an ‘orderbook’ file for each active trading day of a selected ticker. The ‘orderbook’ file contains the evolution of the limit order book up to the requested number of levels. The ‘message’ file contains indicators for the type of event causing an update of the limit order book in the requested price range. All events are timestamped in seconds after midnight, with decimal precision of at least milliseconds and up to nanoseconds, depending on the requested period.

Both the ‘message’ and ‘orderbook’ files are provided in the .csv format and can easily be read with any statistical software package.

We will make use of the convenient tidyverse package lubridate to handle timestamps in R.

# install.packages("tidyverse")
# install.packages("lubridate")
library(tidyverse)
library(lubridate)

5.1.1 Exercises

  • Familiarize yourself with the output documentation of LOBSTER files on their homepage. Download the two files SAP_2020-12-08_34200000_57600000_message_5.csv and SAP_2020-12-08_34200000_57600000_orderbook_5.csv from Absalon and read them in using read_csv(). LOBSTER files always follow the naming convention ticker_date_34200000_57600000_filetype_level.csv, where filetype denotes either the message file or the corresponding orderbook snapshots.
  • Merge the two files after applying some intuitive naming convention such that you can work with one file which traces all information about messages and the corresponding orderbook snapshots. Familiarize yourself with every step in the function process_orderbook(). The function performs essential cleaning of the data by filtering out trading halts and aggregating order execution messages which may have triggered multiple limit orders. You can (and should) use this function whenever you work with raw orderbook data from LOBSTER to be able to actually use the resulting orderbook. In the solution part I provide some further explanations of the actual code.
  • Write a function that processes the orderbooks by applying some basic cleaning steps:
    1. Identify potential trading stops (consult the Lobster documentation for this) and discard all observations that have been recorded during the window when trading was interrupted.
    2. Sometimes, messages related to the daily opening and closing auctions are included in the LOBSTER file (especially on days when trading stops earlier, e.g. before holidays). Remove such observations (opening auctions are identified as messages with `type == 6` and `ID == -1`; closing auctions can be identified as messages with `type == 6` and `ID == -2`).
    3. Replace “empty” slots in the orderbook with `NA`s in case there was no liquidity quoted at higher levels of the orderbook. In that case LOBSTER reports negative bid prices and very large ask prices (above 199999).
    4. Remove observations with seemingly crossed prices (when the best ask is below the best bid).
    5. Market orders are sometimes executed against multiple limit orders. In that case, Lobster records multiple transactions with identical timestamp. Aggregate these transactions and generate the corresponding trade size and the (size weighted) execution price. Also, retain the orderbook snapshot after the transaction has been executed.
    6. The direction of a trade is hard to evaluate (not observable) if the transaction has been executed against a hidden order (message `type == 5`). Create a proxy for the direction based on the executed price relative to the last observed orderbook snapshot.
  • Use the function to clean the SAP Lobster output.
  • Summarize the orderbook by extracting the average midquote \(q_t = (a_t + b_t)/2\) (where \(a_t\) and \(b_t\) denote the best ask and best bid), the average spread \(S_t = (a_t - b_t)\) (the values below are computed in basis points relative to the concurrent midquote), the aggregate volume in USD, and the quoted depth in USD (within 5 basis points of the concurrent best price on each side of the orderbook) for each hour of the day.
  • Provide a meaningful illustration of the orderbook dynamics during the particular trading day. This is intentionally a very open question: There is so much information within an orderbook that it is impossible to highlight everything worth considering. I encourage you to post your visualization on the discussion board in Absalon.

5.1.2 Solutions

I use the following code to read in (general) LOBSTER files - just change ticker, date and level in the code snippet below.

# Initial information
ticker <- "SAP"
date <- as.Date("2020-12-08")
level <- 5

# Lobster files always follow the same naming convention.

messages_filename <- paste0(ticker, "_", date,"_34200000_57600000_message_", level, ".csv")
orderbook_filename <- paste0(ticker, "_", date,"_34200000_57600000_orderbook_", level, ".csv")

Note my own naming convention for the column names. The col_types argument makes sure that processing afterwards is easy. By default, ts denotes the time in seconds after midnight (decimals are precise up to the nanosecond level) and prices are reported in units of 10,000 USD. type denotes the message type: 4, for instance, corresponds to the execution of a visible order. The remaining variables are explained in more detail in the LOBSTER output documentation.
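
As a quick sanity check of these units, the following (purely hypothetical) raw values illustrate the conversion that the read-in code below performs: a timestamp of 34200.1 seconds after midnight corresponds to 09:30:00.1, and a raw price of 1234500 corresponds to 123.45 USD.

# Hypothetical raw LOBSTER values, only to illustrate the units
raw_ts <- 34200.1      # seconds after midnight
raw_price <- 1234500   # USD price multiplied by 10,000

as.POSIXct(raw_ts, origin = as.Date("2020-12-08"), tz = "GMT")
## [1] "2020-12-08 09:30:00 GMT"
# (the fractional second is stored internally but not printed by default)
raw_price / 10000
## [1] 123.45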

messages_raw <- read_csv(messages_filename,
                         col_names = c("ts", "type", "order_id", "m_size", "m_price", "direction", "null"),
                         col_types = cols(
                           ts = col_double(),
                           type = col_integer(),
                           order_id = col_integer(),
                           m_size = col_double(),
                           m_price = col_double(),
                           direction = col_integer(),
                           null = col_skip())) %>%
  mutate(ts = as.POSIXct(ts, origin = date, tz  ="GMT", format = "%Y-%m-%d %H:%M:%OS6"),
         m_price = m_price / 10000) # Note the conversion of the time stamp to actually useful information

Next, the corresponding orderbook snapshots contain the price and the quoted size for each of the 5 levels. The code below is flexible enough to handle LOBSTER files with any number of levels: each column contains the corresponding prices and sizes on each side of the orderbook, so the merged output contains all information available in the orderbook. More information is available in the LOBSTER documentation.

orderbook_raw <- read_csv(orderbook_filename,
                          col_names = paste(rep(c("ask_price", "ask_size", "bid_price", "bid_size"), level), rep(1:level, each = 4), sep = "_"),
                          cols(.default = col_double())) %>%
  mutate_at(vars(contains("price")), ~./10000) 

Each message is associated with the corresponding orderbook snapshot at that point in time. After merging the message and orderbook files, the entire data set looks as follows:

orderbook <- bind_cols(messages_raw, orderbook_raw)
orderbook
## # A tibble: 250,437 x 26
##    ts                   type order_id m_size m_price direction ask_price_1
##    <dttm>              <int>    <int>  <dbl>   <dbl>     <int>       <dbl>
##  1 2020-12-08 09:30:00     3 31551464    350    123.         1        123.
##  2 2020-12-08 09:30:00     3 31556960   1478    123.         1        123.
##  3 2020-12-08 09:30:00     1 31585356    181    123.        -1        123.
##  4 2020-12-08 09:30:00     3 31556712    181    123.        -1        123.
##  5 2020-12-08 09:30:00     1 31585856   1770    123.         1        123.
##  6 2020-12-08 09:30:00     1 31586632    195    123.        -1        123.
##  7 2020-12-08 09:30:00     2 31585856    787    123.         1        123.
##  8 2020-12-08 09:30:00     3 31529708    205    123.         1        123.
##  9 2020-12-08 09:30:00     3 31537012    214    123.         1        123.
## 10 2020-12-08 09:30:00     1 31604124    393    123.         1        123.
## # ... with 250,427 more rows, and 19 more variables: ask_size_1 <dbl>,
## #   bid_price_1 <dbl>, bid_size_1 <dbl>, ask_price_2 <dbl>, ask_size_2 <dbl>,
## #   bid_price_2 <dbl>, bid_size_2 <dbl>, ask_price_3 <dbl>, ask_size_3 <dbl>,
## #   bid_price_3 <dbl>, bid_size_3 <dbl>, ask_price_4 <dbl>, ask_size_4 <dbl>,
## #   bid_price_4 <dbl>, bid_size_4 <dbl>, ask_price_5 <dbl>, ask_size_5 <dbl>,
## #   bid_price_5 <dbl>, bid_size_5 <dbl>

You see that the orderbook contains 250,437 observations. While the columns ts to direction help to understand the specific messages, the remaining columns contain a snapshot of the orderbook at that particular point in time. In line with the exercise, the function process_orderbook() performs a number of essential cleaning steps and is defined as follows.

process_orderbook <- function(orderbook){
  
  # Did a trading halt happen?
  halt_index <- orderbook %>% 
    filter((type == 7 & direction == -1 & m_price == -1/10000) |
             (type == 7 & direction == -1 & m_price == 1/10000))
  
  while (nrow(halt_index) > 1) {
    # Filter out messages that occurred in between trading halts
    cat("Trading halt detected")
    orderbook <- orderbook %>% 
      filter(ts < halt_index$ts[1] | ts > halt_index$ts[2]) 
    halt_index <- halt_index %>% filter(row_number() > 2)
  } 
  
  # Discard everything before type 6 & ID -1 and everything after type 6 & ID -2
  
  opening_auction <- orderbook %>% filter(type == 6, 
                                          order_id == -1) %>% .$ts
  
  closing_auction <- orderbook %>% filter(type == 6, 
                                          order_id == -2) %>% .$ts
  
  if(length(opening_auction) != 1) opening_auction <- orderbook %>% select(ts) %>% head(1) %>% .$ts - seconds(1)
  if(length(closing_auction) != 1) closing_auction <- orderbook %>% select(ts) %>% tail(1) %>% .$ts + seconds(1)
  
  orderbook <- orderbook %>% filter(ts > opening_auction & ts < closing_auction)
  orderbook <- orderbook %>% filter(type != 6 & type != 7)
  
  # Replace "empty" slots in orderbook (0 volume) with NA prices
  orderbook <- orderbook %>% 
    mutate_at(vars(contains("bid_price")), list(~replace(., . < 0, NA))) %>% 
    mutate_at(vars(contains("ask_price")), list(~replace(., . >= 199999, NA))) %>% 
    filter(ask_price_1 > bid_price_1)
  
  # Replace "unreasonable" prices from orderbook
  #orderbook <- orderbook %>% 
  #  mutate_at(vars(contains("ask_price")), ~if_else(. > ask_price_1*(1 + 5/100), as.numeric(NA), .)) %>%
  #  mutate_at(vars(contains("bid_price")), ~if_else(. < bid_price_1*(1 - 5/100), as.numeric(NA), .)) 
  
  # Merge transactions with unique timestamp
  trades <- orderbook %>% 
    filter(type == 4 | type == 5) %>% 
    select(ts:direction)
  
  trades <- inner_join(trades,
                       orderbook %>% 
                         group_by(ts) %>% 
                         filter(row_number()==1) %>% 
                         ungroup() %>%
                         transmute(ts, ask_price_1, bid_price_1, 
                                   midquote = ask_price_1/2 + bid_price_1/2, 
                                   lag_midquote = lag(midquote)),
                       by = "ts")  
  
  trades <- trades %>% 
    mutate(direction = case_when(type == 5 & m_price < lag_midquote ~ 1, # direction = 1 if execution price is below the last observed midquote
                                 type == 5 & m_price > lag_midquote ~ -1,
                                 type == 4 ~ as.double(direction),
                                 TRUE ~ as.double(NA)))
  
  trade_m <- trades %>% 
    group_by(ts) %>% 
    summarise(type = dplyr::last(type),
              order_id = NA,
              m_price = sum(m_price*m_size)/sum(m_size),
              m_size = sum(m_size),
              direction = dplyr::last(direction))
  
  trade_m <- inner_join(trade_m, 
                        orderbook %>%
                          select(ts, ask_price_1:last_col()) %>% 
                          group_by(ts) %>%
                          filter(row_number() == n()),
                        by = "ts")
  
  orderbook <-  orderbook %>% filter(type != 4,
                                     type != 5) %>% 
    bind_rows(trade_m) %>% 
    arrange(ts) %>%
    mutate(direction = if_else(type==4|type==5, direction, as.double(NA)))
  
  return(orderbook)
}

You can process the raw data from before by simply calling:

orderbook <- process_orderbook(orderbook)

Next, I compute the midquote, bid-ask spread, aggregate volume and depth (the amount of tradeable units in the orderbook):

  • Midquote \(q_t = (a_t + b_t)/2\) (where \(a_t\) and \(b_t\) denote the best ask and best bid)
  • Spread \(S_t = (a_t - b_t)\) (values below are computed in basis points relative to the concurrent midquote)
  • Volume is the aggregate sum of traded units of the stock. I do not differentiate between hidden (type == 5) and visible volume.

orderbook <- orderbook %>%
  mutate(midquote = ask_price_1/2 + bid_price_1/2,
         spread = (ask_price_1 - bid_price_1)/midquote * 10000,
         volume = if_else(type == 4 | type == 5, m_size*m_price, 0))

As a last step, the depth of the orderbook denotes the number of shares that can be traded without moving the quoted price by more than a given range (measured in basis points) away from the concurrent best quote on the respective side of the book. The function below takes care of the slightly involved computations.
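
In the notation of the function below, the depth on the bid side within \(\kappa\) basis points of the best bid is (a sketch of the computation; the ask side is analogous)
\[\text{depth}^{bid}(\kappa) = \sum_{\ell=1}^{L} \text{bid\_size}_\ell \cdot 1\big\{\text{bid\_price}_\ell \geq (1 - \kappa/10000)\,\text{bid\_price}_1\big\},\]
where \(L\) is the number of levels in the file. The USD depth used afterwards multiplies this share depth by the concurrent midquote.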

compute_depth <- function(df, side = "bid", bp = 0){
  if(side == "bid"){
    # Price threshold: bp basis points below the best bid
    value_bid <- (1-bp/10000)*df %>% select("bid_price_1")
    # Indicator for levels whose bid price is within the threshold
    index_bid <- df %>% select(contains("bid_price")) %>%
      mutate_all(function(x) {x >= value_bid})
    # Sum the quoted sizes across all qualifying levels
    sum_vector <- (df %>% select(contains("bid_size"))*index_bid) %>% rowSums()
  }else{
    # Price threshold: bp basis points above the best ask
    value_ask <- (1+bp/10000)*df %>% select("ask_price_1")
    index_ask <- df %>% select(contains("ask_price")) %>%
      mutate_all(function(x) {x <= value_ask})
    sum_vector <- (df %>% select(contains("ask_size"))*index_ask) %>% rowSums()
  }
  return(sum_vector)
}

orderbook <- orderbook %>% mutate(depth_b_5 = midquote * compute_depth(orderbook, bp = 5),
                                  depth_a_5 = midquote * compute_depth(orderbook, bp = 5, side="ask"))

Almost there! The snippet below splits the data into 60-minute intervals and computes the averages of the summary statistics defined above.

orderbook_summary <- orderbook %>%
  mutate(ts_minute = floor_date(ts, "1 hour")) %>%
  select(midquote:ts_minute) %>%
  group_by(ts_minute) %>%
  mutate(messages = n(),
         volume = sum(volume)) %>%
  summarise_all(mean)
orderbook_summary %>% knitr::kable()
ts_minute             midquote    spread   volume  depth_b_5  depth_a_5  messages
2020-12-08 09:00:00   123.1138  5.027067  2327438  705791.50  785250.16     75840
2020-12-08 10:00:00   123.0940  4.177430  3018837  637904.15  688300.21    113783
2020-12-08 11:00:00   123.4496  4.074974  1504755  585285.96  539908.49     50766
2020-12-08 12:00:00   123.7538  8.736876  1195042   45367.95   48192.92      1870
2020-12-08 13:00:00   123.7922  5.419168  1398596   47210.76   40231.32      1787
2020-12-08 14:00:00   123.9048  5.661990  1146187   42046.84   42393.34      2031
2020-12-08 15:00:00   123.8382  6.993637  2615158   40208.25   70597.35      3733

Here we go: during the first trading hour on December 8th, 2020, more than 75,000 messages related to the orderbook of SAP were processed by NASDAQ. The quoted spread was on average around 4 to 5 bp. On average, roughly 2 to 3 million USD worth of SAP shares were traded during each hour. Quoted liquidity within 5 basis points of the best quotes seems rather small relative to the tremendous amount of trading activity during this trading day.

Finally, some visualization of the data at hand: The code below shows the dynamics of the traded prices (red points) and the quoted prices at the higher levels of the orderbook.

orderbook_trades <- orderbook %>%
  filter(type==4|type==5) %>%
  select(ts, m_price)

orderbook_quotes <- orderbook %>%
    mutate(ts_new = floor_date(ts, "minute")) %>%
    group_by(ts_new) %>%
    summarise_all(dplyr::last) %>%
    select(-ts_new) %>%
    mutate(id = row_number()) %>%
    select(ts, id, matches("bid|ask")) %>%
    gather(level, price, -ts, -id) %>%
    separate(level, into=c("side","variable","level"), sep="_") %>%
    mutate(level = as.numeric(level))  %>%
    pivot_wider(names_from = variable, values_from = price)

ggplot() +
  theme_bw() +
  geom_point(data = orderbook_quotes, aes(x=ts, y=price, color=as_factor(level), size = size/max(size)), alpha = 0.1)+
  geom_point(data = orderbook_trades, aes(x=ts, y=m_price), color='red') +
  labs(title="Orderbook Dynamics",
       y="Price",
       x="") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position ="none") +
  scale_y_continuous()

Figure 5.1: Visualization of the orderbook

5.2 Working with many files

In this exercise you will work with a longer time series of LOBSTER data, again for the stock SAP. In order to do so, I provide a zipped folder `SAP.7z` on Absalon which contains daily LOBSTER messages for the period from January 1st, 2020 until March 8th, 2021. You will analyze typical market-microstructure properties and dynamics such as the intra-daily evolution of the bid-ask spread, and compute daily realized volatility measures. For more information and follow-up analysis, consult this blog post on my homepage.

5.2.1 Exercises

  • Download the file `SAP.7z` from Absalon. Unzip it from within R using the package `archive`.
  • Write a function that performs the following useful task: Identify all files within one folder which comply with the Lobster file naming convention. Then, extract dates, ticker and levels out of the file name. Consult the section on string manipulation in R for Data Science for help.
  • For each of the existing files, repeat the procedure from the previous exercise: read in and clean the data. Then, return a tibble which contains the trading volume, average quoted spread, return, a measure of realized volatility and the number of messages within each 5-minute window. The average quoted spread should be computed as a weighted average based on the time it takes until the orderbook is updated (see the sketch after this list).
  • Illustrate the intra-daily dynamics by computing averages across dates for each 5 minute window for each of the variables considered above. Also, indicate the uncertainty by additionally plotting meaningful time-series quantiles.
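
As a sketch of what is meant by the time-weighted average: if the quoted spread takes the value \(S_{t_i}\) from the orderbook update at time \(t_i\) until the next update at \(t_{i+1}\), the average quoted spread over an interval is
\[\bar{S} = \frac{\sum_i S_{t_i}\,(t_{i+1} - t_i)}{\sum_i (t_{i+1} - t_i)},\]
which is exactly what weighted.mean() with the inter-update durations as weights computes in the solution below.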

5.2.2 Solutions

Unzipping `7z` files can be somewhat painful, especially if you are a Windows user. The R package archive provides a useful function for that task.

# install.packages("archive")
library(archive)

directory <- "SAP" # target directory

archive_extract("SAP_2020-10-01_2020-12-31_1.7z",
                dir = directory)

The next few lines of code are good practice for learning regular expressions in R, which can be incredibly helpful when working with strings or with many files, as in the following example. The tibble extracts the relevant information from the LOBSTER filenames, so you will be able to use this code snippet for more general purposes as well, e.g. when working with multiple tickers.

files <- tibble(files = dir(directory, full.names = TRUE)) %>%
  mutate(ticker = gsub(paste0(directory, "/(.*)_(.*)_3420.*0_(.*)_(.*).csv"), "\\1", files),
       date = gsub(paste0(directory, "/(.*)_(.*)_3420.*0_(.*)_(.*).csv"), "\\2", files),
       date = as.Date(date),
       type = gsub(paste0(directory, "/(.*)_(.*)_3420.*0_(.*)_(.*).csv"), "\\3", files),
       level = gsub(paste0(directory, "/(.*)_(.*)_3420.*0_(.*)_(.*).csv"), "\\4", files))

existing_files <- files %>%
  group_by(ticker, date, level) %>%
  filter(n() == 2) %>%
  summarise_all(dplyr::first) %>%
  ungroup()

Note that I only keep ticker-date-level combinations for which two files are available: the orderbook and the messages from LOBSTER. In total, the data contains observations for 296 trading days. The next function is a general wrapper to read in the files, clean the data using the process_orderbook() function from above and then summarize the relevant information. Note the weighted mean when it comes to averaging spreads. The final part involving full_grid simply makes sure that we can compare the function output across many different days, even if trading finishes early or if we, for whatever reason, do not have observations on particular trading days.

compute_summaries <- function(n, directory, existing_files){

  ticker <- existing_files %>% filter(row_number() == n) %>% .$ticker
  date <- existing_files %>% filter(row_number() == n) %>% .$date
  level <- existing_files %>% filter(row_number() == n) %>% .$level

  # Read in Messages
  messages_filename <- paste0(directory, "/", ticker, "_", date,"_34200000_57600000_message_", level, ".csv")
  orderbook_filename <- paste0(directory, "/", ticker, "_", date,"_34200000_57600000_orderbook_", level, ".csv")

  messages_raw <- read_csv(messages_filename,
                           col_names = c("ts", "type", "order_id", "m_size", "m_price", "direction", "null"),
                           col_types = cols(
                             ts = col_double(),
                             type = col_integer(),
                             order_id = col_integer(),
                             m_size = col_double(),
                             m_price = col_double(),
                             direction = col_integer(),
                             null = col_skip())) %>%
    mutate(ts = as.POSIXct(ts, origin = date, tz  ="GMT", format = "%Y-%m-%d %H:%M:%OS6"),
           m_price = m_price / 10000)

  orderbook_raw <- read_csv(orderbook_filename,
                            col_names = paste(rep(c("ask_price", "ask_size", "bid_price", "bid_size"), level), rep(1:level, each = 4), sep = "_"),
                            cols(.default = col_double())) %>%
    mutate_at(vars(contains("price")), ~./10000)

  orderbook <- bind_cols(messages_raw, orderbook_raw)
  opening_auction <- orderbook %>% filter((type ==6& order_id ==-1) | row_number()==1) %>% filter(row_number() == n()) %>% .$ts
  closing_auction <- orderbook %>% filter((type ==6& order_id ==-2) | row_number()==n()) %>% filter(row_number()==1) %>% .$ts

  orderbook <- process_orderbook(orderbook)

  orderbook_summaries <- orderbook %>% transmute(ts,
                                                 midquote = ask_price_1/2 + bid_price_1/2,
                                                 trading_volume = if_else(type ==4 | type ==5, m_price * m_size, as.double(NA)),
                                                 spread = 10000 * (ask_price_1 - bid_price_1)/midquote,
                                                 return = log(midquote) - lag(log(midquote))) %>%
    mutate(ts_latency = as.numeric(lead(ts)) - as.numeric(ts),
           ts_latency = if_else(is.na(ts_latency), 0, ts_latency)) %>%
    mutate(ts_minute = floor_date(ts, "5 minutes")) %>%
    select(-ts) %>%
    group_by(ts_minute) %>%
    summarise(midquote = dplyr::last(midquote),
              trading_volume = sum(trading_volume, na.rm = TRUE),
              trading_volume = if_else(is.na(trading_volume), as.double(0), trading_volume) / 1000000,
              n = n(),
              spread = weighted.mean(spread, ts_latency),
              rv = 100 * sqrt(12 * 6.5) * sum(return ^2))

  full_grid <- tibble(ts_minute = seq(from = as.POSIXct(paste0(date, "09:30:00"), tz = "GMT"),
                                      to = as.POSIXct(paste0(date, "16:00:00"), tz = "GMT"),
                                      "5 min"))

  orderbook_summaries <- left_join(full_grid,
                                   orderbook_summaries,
                                   by = "ts_minute") %>%
    fill(midquote, spread) %>%
    mutate_at(vars(n, trading_volume, rv), ~replace(., is.na(.), 0)) %>%
    mutate(date = as.Date(date),
           ticker = ticker,
           return = 100 * (log(midquote) - log(lag(midquote)))) %>%
    select(ts = ts_minute,
           date, ticker, everything())
  return(orderbook_summaries)
}

Next, I evaluate each of the available files. You can also consider a for loop or nesting based on the tibble existing_files. The code below executes the function compute_summaries for each valid trading day and stores the output in a convenient tibble. In the final step I make use of the package hms, which provides functionality beyond the lubridate package for working with timestamps. I summarize the variables of interest by extracting the average (across trading days) as well as the corresponding time-series quantiles. The final figure illustrates the intra-daily dynamics. Note the use of geom_ribbon, which is a nice way to illustrate confidence intervals. Depending on the size and number of LOBSTER files, processing the code below may be somewhat time consuming. For illustrative purposes I use parallelization based on the package `doFuture`.

library("doFuture")
registerDoFuture()
plan(multisession)  ## on MS Windows
options(future.rng.onMisuse="ignore")

processed_data <- foreach(i =  1:nrow(existing_files)) %dopar% {
  tryCatch({
    compute_summaries(i, directory, existing_files)
  }, error = function(e){cat("error", i, "\n")})
}

processed_data <- processed_data %>% bind_rows()
# install.packages("hms")
processed_data %>%
  select(-midquote) %>%
  mutate(time = hms::as_hms(ts)) %>%
  pivot_longer(trading_volume:return) %>%
  group_by(ticker, time, name) %>%
  na.omit() %>%
  summarise(m = mean(value),
            q5 = quantile(value, 0.05),
            q95 = quantile(value, 0.95),
            q25 = quantile(value, 0.25),
            q75 = quantile(value, 0.75)) %>%
  ggplot(aes(x = time, y = m)) +
  geom_line() +
  geom_ribbon(aes(ymin = q5, ymax = q95), alpha = 0.1) +
  geom_ribbon(aes(ymin = q25, ymax = q75), alpha = 0.1) +
  facet_wrap(~name, scales = "free_y", ncol = 1) +
  labs(x = "", y ="")

Figure 5.2: Intraday pattern

processed_data %>%
  select(-date, -ticker) %>%
  pivot_longer(midquote:last_col()) %>%
  ggplot(aes(x = ts, y = value)) +
    geom_point(size = 0.1) +
    facet_wrap(~name, scales = "free_y", ncol = 1) +
    labs(x = "", y ="")

Figure 5.3: High-frequency data in the long run

5.3 Brownian Motion and realized volatility

In this exercise you will analyze the empirical properties of the (geometric) Brownian motion based on a simulation study. As a matter of fact, such simulation studies can be useful for many applications in continuous-time finance as well, e.g. for option pricing.

Finance theory suggests that prices must be so-called semimartingales. The most commonly used semimartingale is the geometric Brownian motion (GBM), where we assume that the logarithm of the stock price \(S_t\), \(X_t = \log(S_t)\), follows the dynamics \[X_t = X_0 + \mu t + \sigma W_t,\] where \(\mu\) and \(\sigma\) are constants and \(W_t\) is a Brownian motion.
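
For the simulation below it is useful to spell out what this model implies on a discrete, equidistant grid \(0 = t_0 < t_1 < \dots < t_n\) with spacing \(\Delta = t_{i+1} - t_i\): the increments are independent and normally distributed, \[X_{t_{i+1}} - X_{t_i} = \mu\Delta + \sigma(W_{t_{i+1}} - W_{t_i}) \sim N(\mu\Delta, \sigma^2\Delta),\] so a path is obtained by cumulating \(n\) independent \(N(\mu\Delta, \sigma^2\Delta)\) draws. This is exactly what the solution code does via rnorm(steps, mean = mu/steps, sd = sigma/sqrt(steps)) and cumsum().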

5.3.1 Exercises

  • Write a function that simulates (log) stock prices according to the model above. Recall that the Brownian motion has independent increments which are normally distributed. The function should generate a path of the log stock price with a pre-defined number of observations (granularity), drift term and volatility. Plot one sample path of the implied (log) stock price.
  • Use the function to simulate a large number of geometric Brownian motions. Illustrate the quadratic variation of the process.
  • Compute the realized volatility for each simulated process and illustrate the sampling distribution.
  • Finally, consider the infill asymptotics. Show the effect of sampling once per day, once per hour, once per minute and so on on the variance of the realized volatility.

5.3.2 Solutions

For this exercise I make use of some functionalities from the xts and highfrequency packages, particularly for aggregation at different frequencies.

library(tidyverse)
library(lubridate)
library(xts)
library(highfrequency)

Note: When loading xts and the tidyverse jointly, you may run into serious issues if you do not take naming conflicts into account! In particular, the convenient functions last and first from dplyr may be affected. For that reason, I consistently use dplyr::first and dplyr::last in this chapter.
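
As a minimal illustration (assuming xts is attached after dplyr, as in this chapter), R announces the masking when the package is loaded, and fully qualifying the function removes any ambiguity:

library(dplyr)
library(xts) # prints a note that first() and last() from dplyr are masked

# Being explicit about the namespace makes the intent unambiguous
tibble(type = c(1L, 4L, 5L)) %>%
  summarise(last_type = dplyr::last(type))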

First, the function below is a rather general version of a simulated geometric Brownian motion. Note that there is no direct need to assign a trading date and hour; I just implement this because it provides output which is closer to actual data, so parts of my code can be recycled for future projects. The function can be called with a specific seed to replicate the results. The relevant parameters are the initial value start and the parameters of the geometric Brownian motion, mu and sigma. The input steps defines the granularity (sampling frequency) of the simulated Brownian motion.

simulate_geom_brownian_motion <- function(seed = 3010,
                                          steps = 1000,
                                          mu = 0,
                                          sigma = 1,
                                          start = 0){
  set.seed(seed)
  date <- Sys.Date() + days(seed)
  hour(date) <- 9
  minute(date) <- 30
  date_end <- date
  hour(date_end) <- 16
  minute(date_end) <- 0

  tibble(step = 1:(steps+1),
         time =   seq(from = date,
                      to = date_end,
                      length.out = steps + 1),
         price = cumsum(c(start, rnorm(steps, mean = mu/steps, sd = sigma/sqrt(steps)))))
}

Note the last line of the function, which does most of the work: we simply exploit the fact that increments of a Brownian motion are independent and normally distributed. The entire path is nothing but the cumulative sum of the increments.

Next I simulate the stock price with 1000 intradaily observations. Recall that the geometric Brownian motion implies dynamics for the log of the stock price, which is why I plot exp(price) on the \(y\)-axis.

tab <- simulate_geom_brownian_motion(start = log(18))
tab %>%
  ggplot(aes(x=time, y = exp(price))) +
  geom_line() + labs(x="", y = "Stock price")

Figure 5.4: Simulated brownian motion

Next we scale the simulation up. The code below simulates 250 paths of the geometric Brownian motion, all with the same starting value. The figure illustrates the sampled (log) price paths.

true_sigma <- 2.4
steps <- 1000

simulation <- tibble(iteration = 1:250)
simulation <- simulation %>%
  mutate(values = map(iteration, simulate_geom_brownian_motion, sigma = true_sigma, steps = steps)) %>%
  unnest(values)

simulation %>%
  ggplot(aes(x = step, y = price, group = iteration)) +
  geom_line(alpha = 0.3) +
  labs(x = "Iteration", y = "(log) price")

Figure 5.5: Simulated brownian motion

Next we investigate some of the properties of the geometric Brownian motion. First, by definition, the quadratic variation of \(X_t = \int\limits_0^t\mu\, ds + \int\limits_0^t\sigma\, dW_s\) is \(\int\limits_0^t\sigma^2\, ds = \sigma^2 t\). The figure below illustrates the sample variance over time and shows that the variance across all simulated (log) prices increases linearly in \(t\), as expected. For the simulation I used \(\sigma = 2.4\) (true_sigma), so \(\sigma^2\) corresponds to the slope of the line (adjusted by the number of steps).

simulation %>%
  group_by(step) %>%
  summarise(sd = sd(price)) %>%
  ggplot(aes(x = step, y = sd^2)) +
  geom_line() +
  geom_abline(aes(slope = true_sigma^2/steps, intercept = 0)) +
  labs(x = "Iteration", y = "Sample variation")

Figure 5.6: Quadratic variation

What about the realized volatility? As defined in the lecture, the realized variance \[RV = \sum\limits_{i=1}^n (X_{t_i} - X_{t_{i-1}})^2\] converges to the integrated variance (in the absence of jumps, etc.). The histogram below illustrates the sampling distribution of the realized variance which, as expected, centers around true_sigma^2, the integrated variance of the simulated geometric Brownian motion. You can change the value of true_sigma above and analyze the effect.

simulation %>%
  group_by(iteration) %>%
  summarise(rv = sum(((price - lag(price))^2), na.rm = TRUE)) %>%
  ggplot(aes(x = rv)) +
  geom_histogram(bins = 100) +
  labs(x = "Realized Volatility") +
  geom_vline(aes(xintercept = true_sigma^2), color = "red", linetype = "dotted")

Figure 5.7: Realized volatilities

Finally, some more analysis regarding the infill asymptotics. As should be clear by now, sampling at the highest possible frequency ensures that the realized volatility recovers the integrated variance of the (log) price process. In other words, we can measure the variance. The following few lines of code illustrate the effect: whereas we simulated our process at a high frequency, we now construct sub-samples at different frequencies, ranging from every second to once per day. The figure then illustrates the daily realized volatilities relative to the true volatility of the process. Not surprisingly, the dispersion around the true parameter decreases substantially with a higher sampling frequency.

Note that the code below also illustrates how to work with the xts package and how to switch between the tidyverse and the highfrequency data format requirements. This is arguably a bit cumbersome, but in my experience it is worth exploiting the functionalities of both worlds instead of recoding everything in a tidy manner.

infill_asymptotics <- function(tab, sampling_frequency = 5){
  # Convert the simulated tibble into an xts object ordered by the timestamps
  tab_xts <- as.xts(tab %>% select(price),
                    order.by = tab %>% pull(time))
  # Sub-sample the price path at the desired frequency (in minutes)
  tab_xts <- aggregatePrice(tab_xts,
                            alignBy = "minutes", alignPeriod = sampling_frequency,
                            tz = "GMT", marketOpen = "09:30:00")

  # Back to a tibble: compute the realized variance of the sub-sampled path
  tab_xts %>%
    fortify.zoo %>%
    as_tibble %>%
    janitor::clean_names() %>%
    rename("ts" = index) %>%
    summarise(rv = sum((price - lag(price))^2, na.rm = TRUE)) %>%
    as.numeric()
}

rvs <- simulation %>%
  group_by(iteration) %>%
  nest(data = c(step, time, price)) %>%
  mutate("1 Second" = map_dbl(data, infill_asymptotics, 1/60),
         "1 Minute" = map_dbl(data, infill_asymptotics, 1),
         "5 Minutes" = map_dbl(data, infill_asymptotics, 5),
         "15 Minutes" = map_dbl(data, infill_asymptotics, 15),
         "1 hour" = map_dbl(data, infill_asymptotics, 60),
         "1 day" = map_dbl(data, infill_asymptotics, 6.5 * 60))

rvs %>%
    select(-data) %>%
    ungroup() %>%
    pivot_longer(-iteration,
                 names_to = "Sampling Frequency",
                 values_to = "rv") %>%
    mutate(`Sampling Frequency` = as.factor(`Sampling Frequency`),
           `Sampling Frequency` = fct_reorder(`Sampling Frequency`, rv, sd)) %>%
    ggplot(aes(x = sqrt(rv),
               fill = `Sampling Frequency`)) +
    geom_density() +
    geom_vline(aes(xintercept = true_sigma), linetype ="dotted") +
    facet_wrap(~`Sampling Frequency`, scales = "free_y") +
    labs(x = "(Log) Realized Volatility", y = "")

Figure 5.8: Infill asymptotics

5.4 Refresh-time sampling and realized covariances

This collection of code chunks should provide you with some advanced methods of data handling suitable for using LOBSTER data with the highfrequency package. Due to storage limitations, I do not provide the raw data itself, but I provide you with some code that should work in general if you follow some easy workflows to get your data downloaded from LOBSTER and imported into RStudio.

5.4.1 Exercises

  • Assume you want to compute a time series of realized covariance matrices based on LOBSTER data. For that purpose I requested, for instance, data for 6 stocks, UNH, MCD, HD, GS, CRM and AMGN, for the period from March 1st, 2021 until March 25th, 2021 from LOBSTER. All files are stored in ticker-specific folders with the common LOBSTER naming convention. Write a script that generates a tibble which contains all existing files suitable for the subsequent read-in.
  • Write code in a tidy fashion that reads in all available files for a pre-specified day, cleans the LOBSTER files and generates an xts object of log midquote prices.
  • Perform refresh-time sampling and compute the realized variance-covariance matrix.

5.4.2 Solutions

LOBSTER downloads always come in one (zipped) folder per ticker. As described above, I first write a small script that scans through all folders which comply with the naming convention to identify the available tickers, dates and the level information.

# install.packages("xts")
# Read in the files in some useful manner
all_files <- dir("../Skripts/Part_2", pattern= "_1", full.names = TRUE) %>% 
  map(function(.) tibble(files = dir(., full.names = TRUE))) %>% 
  bind_rows() %>% 
  mutate(ticker = gsub(".*/(.*)_(.*)_3420.*0_(.*)_(.*).csv", "\\1", files),
         date = gsub(".*/(.*)_(.*)_3420.*0_(.*)_(.*).csv", "\\2", files),
         date = as.Date(date),
         type = gsub(".*/(.*)_(.*)_3420.*0_(.*)_(.*).csv", "\\3", files),
         level = gsub(".*/(.*)_(.*)_3420.*0_(.*)_(.*).csv", "\\4", files)) 

Next, I restrict myself to a specific date. Needless to say, you can simply run the code without filtering and instead additionally group (nest) by date to perform multiple cross-asset computations simultaneously. You should be warned, however, that this poses substantial challenges with respect to the required memory.

# Specify date
selected_date <- as.Date("2021-03-01")

midquote_data <- all_files %>% 
  group_by(ticker, date, level) %>% 
  filter(date == selected_date) %>% 
  nest(data = c(-ticker))

Next I write a simple function that reads in the LOBSTER files, cleans them using the already documented function process_orderbook and returns an xts object with the time series of midquotes. Note that for this purpose the package xts needs to be installed.

compute_log_midquotes <- function(file_tibble){
  
  level <- (file_tibble$level)[[1]]
  date <- (file_tibble$date)[[1]]
  
  # Read in Messages
  messages_filename <- file_tibble %>% filter(type == "message") %>% pull(files) 
  orderbook_filename <- file_tibble %>% filter(type == "orderbook") %>% pull(files) 
  
  messages_raw <- read_csv(messages_filename,
                           col_names = c("ts", "type", "order_id", "m_size", "m_price", "direction", "null"),
                           col_types = cols(
                             ts = col_double(),
                             type = col_integer(),
                             order_id = col_integer(),
                             m_size = col_double(),
                             m_price = col_double(),
                             direction = col_integer(),
                             null = col_skip())) %>%
    mutate(ts = as.POSIXct(ts, origin = date, tz  ="GMT", format = "%Y-%m-%d %H:%M:%OS6"),
           m_price = m_price / 10000)
  
  orderbook_raw <- read_csv(orderbook_filename,
                            col_names = paste(rep(c("ask_price", "ask_size", "bid_price", "bid_size"), level), rep(1:level, each = 4), sep = "_"),
                            cols(.default = col_double())) %>%
    mutate_at(vars(contains("price")), ~./10000)
  
  orderbook <- bind_cols(messages_raw, orderbook_raw)
  opening_auction <- orderbook %>% filter((type ==6& order_id ==-1) | row_number()==1) %>% filter(row_number() == n()) %>% .$ts
  closing_auction <- orderbook %>% filter((type ==6& order_id ==-2) | row_number()==n()) %>% filter(row_number()==1) %>% .$ts
  
  orderbook <- process_orderbook(orderbook)
  orderbook  <- orderbook %>% transmute(ts,
                                        midquote = ask_price_1/2 + bid_price_1/2) %>% 
    filter(midquote > 0) %>% 
    na.omit() 
  orderbook <- xts::as.xts(orderbook %>% select(midquote),
                   order.by = orderbook %>% pull(ts))
  
  return(orderbook)
}

The function compute_log_midquotes can be applied to each ticker in a tidy manner via map(). As a result, the object midquote_data contains the time series of midquotes for each available ticker on the selected day.

midquote_data <- midquote_data %>% 
  mutate(midquote = map(data, compute_log_midquotes))

midquote_data <- midquote_data %>% 
  mutate(observations = map_dbl(midquote, nrow)) %>% 
  filter(observations > 0 | is.na(observations)) %>% 
  ungroup()

midquote_data
## # A tibble: 6 x 4
##   ticker data                     midquote                observations
##   <chr>  <list>                   <list>                         <dbl>
## 1 AMGN   <grouped_df[,4] [2 x 4]> <xts[,1] [53,268 x 1]>         53268
## 2 CRM    <grouped_df[,4] [2 x 4]> <xts[,1] [268,158 x 1]>       268158
## 3 GS     <grouped_df[,4] [2 x 4]> <xts[,1] [56,978 x 1]>         56978
## 4 HD     <grouped_df[,4] [2 x 4]> <xts[,1] [62,769 x 1]>         62769
## 5 MCD    <grouped_df[,4] [2 x 4]> <xts[,1] [49,338 x 1]>         49338
## 6 UNH    <grouped_df[,4] [2 x 4]> <xts[,1] [49,815 x 1]>         49815

The final piece of code simply shows how to work with the clean structure in midquote_data. More specifically, I make use of the fact that the column midquote is a list of xts elements, which is the desired input for most multivariate functions in the highfrequency package. To perform refresh-time sampling, for instance, and to compute the realized covariance matrix, only 4 lines of code are required; the rest is simple housekeeping.
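
As a brief reminder of what these two steps do (a sketch based on the standard definitions, not on the package internals): refresh times \(\tau_1, \tau_2, \dots\) are the points at which every asset has updated its midquote at least once since the previous refresh time, and on this synchronized grid the realized covariance estimator is the sum of outer products of the return vectors,
\[RCov = \sum_{i=1}^{n} r_i\, r_i^{\prime}, \qquad r_i = X_{\tau_i} - X_{\tau_{i-1}},\]
where \(X_{\tau_i}\) stacks the (log) midquotes of all assets at refresh time \(\tau_i\).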

real_cov_matrix <- midquote_data %>% 
  pull(midquote) %>% 
  highfrequency::refreshTime() %>% 
  rCov(makeReturns = TRUE)

colnames(real_cov_matrix) <- midquote_data %>% pull(ticker)
rownames(real_cov_matrix) <- midquote_data %>% pull(ticker)

real_cov_matrix
##              AMGN          CRM           GS           HD          MCD
## AMGN 1.094213e-04 1.340901e-05 7.978317e-06 1.339822e-05 6.853033e-06
## CRM  1.340901e-05 3.844353e-04 1.386922e-05 2.318136e-05 1.499004e-05
## GS   7.978317e-06 1.386922e-05 2.843359e-04 7.798312e-06 1.642862e-06
## HD   1.339822e-05 2.318136e-05 7.798312e-06 2.178838e-04 1.659505e-05
## MCD  6.853033e-06 1.499004e-05 1.642862e-06 1.659505e-05 1.431650e-04
## UNH  9.855284e-06 5.543091e-07 7.613928e-06 5.143702e-06 1.260943e-05
##               UNH
## AMGN 9.855284e-06
## CRM  5.543091e-07
## GS   7.613928e-06
## HD   5.143702e-06
## MCD  1.260943e-05
## UNH  2.385806e-04
# As a simple application: the implied global minimum variance portfolio weights
w <- solve(real_cov_matrix) %*% rep(1, ncol(real_cov_matrix))
w / sum(w)
##            [,1]
## AMGN 0.30702005
## CRM  0.07241542
## GS   0.12071794
## HD   0.13301906
## MCD  0.22990253
## UNH  0.13692501