11 Working with Lobster

11.1 Read in and process Lobster files

This exercise mainly requires you to familiarize yourself with Lobster. As a member of KU you can retrieve data from Lobster, an online tool that provides easy-to-use, high-quality limit order book data for the entire universe of NASDAQ-traded stocks. I requested some of the data via their online interface and stored it on Absalon before running the code below. Lobster files contain all trading messages (order submissions, cancellations, trades, …) for a pre-specified ticker that went through NASDAQ. The sample file on Absalon contains the entire orderbook history for ticker SAP up to level 5 during January 2022. Lobster generates a ‘message’ and an ‘orderbook’ file for each active trading day of a selected ticker. The ‘orderbook’ file contains the evolution of the limit order book up to the requested number of levels. The ‘message’ file contains indicators for the type of event causing an update of the limit order book in the requested price range. All events are time-stamped in seconds after midnight, with decimal precision of at least milliseconds and up to nanoseconds, depending on the requested period.

Both the ‘message’ and ‘orderbook’ files are provided in the .csv format and can easily be read with any statistical software package.

**Exercises:**

  • Familiarize yourself with the output documentation of Lobster files on their homepage. Download the two files SAP_2020-12-08_34200000_57600000_message_5.csv and SAP_2020-12-08_34200000_57600000_orderbook_5.csv from Absalon and read them in using read_csv. Lobster files always come with the same naming convention ticker_date_34200000_57600000_filetype_level.csv, where filetype denotes either the message file or the corresponding orderbook snapshots. Take notes on what you think are important cleaning steps to arrive at a useful dataset which comprises order book and trading information.

  • Merge the two files after applying an intuitive naming convention such that you can work with one file which traces all information about messages and the corresponding orderbook snapshots. Try to create a function `process_orderbook()` which performs essential cleaning of the data by making sure you filter out trading halts and aggregate order execution messages which may have triggered multiple limit orders. More specifically, the function should complete the following generic steps:

    1. Identify potential trading stops (consult the Lobster documentation for this) and discard all observations that have been recorded during the window when trading was interrupted.
    2. Sometimes, messages related to the daily opening and closing auction are included in the Lobster file (especially on days when trading stops earlier, e.g. before holidays). Remove such observations (opening auctions are identified as messages with `type == 6` and `ID == -1`, closing auctions as messages with `type == 6` and `ID == -2`).
    3. Replace “empty” slots in the orderbook with `NA`s in case there was no liquidity quoted at higher levels of the orderbook. In that case Lobster reports negative bid prices and implausibly large ask prices (999999 or above after dividing by 10,000).
    4. Remove observations with seemingly crossed prices (when the best ask is below the best bid).
    5. Market orders are sometimes executed against multiple limit orders. In that case, Lobster records multiple transactions with identical time stamp. Aggregate these transactions and generate the corresponding trade size and the (size weighted) execution price. Also, retain the orderbook snapshot after the transaction has been executed.
    6. The direction of a trade is hard to evaluate (not observable) if the transaction has been executed against a hidden order (`type == 5`). Create a proxy for the direction based on the executed price relative to the last observed orderbook snapshot.

You can (should) use this function whenever you work with raw orderbook data from Lobster to be able to actually use the resulting orderbook. In the solution part I provide some further explanations of my own implementation.

  • Use the function to clean the SAP Lobster output.
  • Summarize the orderbook by extracting the average midquote \(q_t = (a_t + b_t)/2\) (where \(a_t\) and \(b_t\) denote the best ask and best bid), the average spread \(S_t = (a_t - b_t)/q_t\) (in basis points relative to the concurrent midquote), the aggregate volume in USD, and the quoted depth (in million USD, within 5 basis points of the concurrent best price on each side of the orderbook) for each hour of the day.
  • Provide a meaningful illustration of the orderbook dynamics during the particular trading day. This is intentionally a very open question: there is so much information within an orderbook that it is impossible to highlight everything worth considering. I encourage you to post your visualization on the discussion board in Absalon.

**Solutions:**

I use the following code to read in (general) Lobster files - just change ticker, date, and level in the code snippet below.

files_available <- (archive::archive("../Lectures/data/SAP_2022-01-01_2022-01-31_5.7z")) %>%
  mutate(ticker = gsub("(.*)_(.*)_342.*_(.*).csv", "\\1", path),
         date = gsub("(.*)_(.*)_342.*_(.*).csv", "\\2", path),
         level = gsub("(.*)_(.*)_342.*_(.*).csv", "\\3", path))

files_available
## # A tibble: 40 x 5
##   path                                                 size date     ticker level
##   <chr>                                               <int> <chr>    <chr>  <chr>
## 1 SAP_2022-01-28_34200000_57600000_message_5.csv          0 2022-01~ SAP    5    
## 2 SAP_2022-01-28_34200000_57600000_orderbook_5.csv        0 2022-01~ SAP    5    
## 3 SAP_2022-01-03_34200000_57600000_message_5.csv   14902108 2022-01~ SAP    5    
## 4 SAP_2022-01-03_34200000_57600000_orderbook_5.csv 38240040 2022-01~ SAP    5    
## 5 SAP_2022-01-04_34200000_57600000_message_5.csv   13248413 2022-01~ SAP    5    
## # ... with 35 more rows
level <- 5

Lobster files always follow the same naming convention, thus the code above can be recycled in a more general fashion for other Lobster files as well. Next, we read in data for one(!) of the 20 trading days.
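
Before doing so, note that because the naming convention is fixed, a small helper can construct the expected file name for any ticker, date, and level. This is just a sketch; the helper lobster_file_name is my own and not needed for the archive-based workflow above.

lobster_file_name <- function(ticker, date, filetype = "message", level = 5) {
  # Follows the convention ticker_date_34200000_57600000_filetype_level.csv
  paste(ticker, date, "34200000", "57600000", filetype, paste0(level, ".csv"), sep = "_")
}

lobster_file_name("SAP", "2022-01-03", "orderbook", 5)
## [1] "SAP_2022-01-03_34200000_57600000_orderbook_5.csv"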

messages_raw <- read_csv(archive::archive_read("../Lectures/data/SAP_2022-01-01_2022-01-31_5.7z", 3),
                         col_names = c("ts", "type", "order_id", "m_size", "m_price", "direction", "null"),
                         col_types = cols(null = col_skip())) %>%
  mutate(ts = as.POSIXct(ts, origin = as.Date("2022-01-07"), tz = "GMT", format = "%Y-%m-%d %H:%M:%OS6"),
         m_price = m_price / 10000) # Note the conversion of the time stamp to actually useful information

messages_raw %>%
  filter(type == 4 | type == 5) %>%
  ggplot(aes(x = ts, y = m_price)) +
  geom_step() +
  geom_point(data = messages_raw %>% tail(1), color = "red", size = 3) +
  labs(x = "", y = "USD", title = "Traded prices: SAP, January 3rd, 2022")

Note my own naming convention for the column names. The argument col_types makes sure that processing afterwards is easy. By default, ts denotes the time in seconds after midnight (with decimals precise up to the nanosecond level) and prices come as dollar prices multiplied by 10,000. type denotes the message type: 4, for instance, corresponds to the execution of a visible order. The remaining variables are explained in more detail in the Lobster documentation.
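
As a quick illustration of these conventions, consider the following made-up raw values (a minimal sketch, not part of the actual data; the origin date is arbitrary):

raw_price <- 1402500 # Lobster reports 140.25 USD as 1402500
raw_price / 10000
## [1] 140.25

raw_ts <- 34200.5 # seconds after midnight, i.e. half a second after 09:30:00
as.POSIXct(raw_ts, origin = as.Date("2022-01-03"), tz = "GMT")
## [1] "2022-01-03 09:30:00 GMT"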

Next, the corresponding orderbook snapshots contain price and quoted size for each of the 5 levels.

Each message is associated with the corresponding orderbook snapshot at that point in time. After merging message and orderbook files, the entire data thus looks as follows

The code below is rather flexible and can handle Lobster files with a different number of levels. Each column contains the corresponding prices and sizes on each side of the orderbook. More information is available in the Lobster documentation. The output is one file with all information available in the orderbook.

orderbook_raw <- read_csv(archive::archive_read("../Lectures/data/SAP_2022-01-01_2022-01-31_5.7z", 4),
                          col_names = paste(rep(c("ask_price", "ask_size", "bid_price", "bid_size"), level), rep(1:level, each = 4), sep = "_"),
                          col_types = cols(.default = col_double())) %>%
  mutate(across(contains("price"), ~ . / 10000))

orderbook <- bind_cols(messages_raw, orderbook_raw) 
orderbook
## # A tibble: 316,379 x 26
##   ts                   type order_id m_size m_price direction ask_price_1
##   <dttm>              <dbl>    <dbl>  <dbl>   <dbl>     <dbl>       <dbl>
## 1 2022-01-07 09:30:00     3 34351556    500    140.        -1        140.
## 2 2022-01-07 09:30:00     1 34456580    177    140.         1        140.
## 3 2022-01-07 09:30:00     3 34390184    100    140.        -1        140.
## 4 2022-01-07 09:30:00     1 34457908    160    140.        -1        140.
## 5 2022-01-07 09:30:00     3 34302936    100    140.         1        140.
## # ... with 316,374 more rows, and 19 more variables: ask_size_1 <dbl>,
## #   bid_price_1 <dbl>, bid_size_1 <dbl>, ask_price_2 <dbl>, ask_size_2 <dbl>,
## #   bid_price_2 <dbl>, bid_size_2 <dbl>, ask_price_3 <dbl>, ask_size_3 <dbl>,
## #   bid_price_3 <dbl>, bid_size_3 <dbl>, ask_price_4 <dbl>, ask_size_4 <dbl>,
## #   bid_price_4 <dbl>, bid_size_4 <dbl>, ask_price_5 <dbl>, ask_size_5 <dbl>,
## #   bid_price_5 <dbl>, bid_size_5 <dbl>

You see that the orderbook contains 316379 observations. While the columns ts to direction help to understand the specific messages, the remaining columns contain a snapshot of the orderbook at that particular point in time.
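
To get a first overview of the message composition, it can be helpful to simply tabulate the message types (a small sketch; the interpretation of each type is documented on the Lobster homepage):

# type 1: new limit order, 2: partial cancellation, 3: deletion,
# 4: execution of a visible order, 5: execution of a hidden order,
# 6: auction/cross trade, 7: trading halt indicator
orderbook %>%
  count(type, sort = TRUE)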

Next, Lobster files sometimes contain messages related to the opening and closing auctions (in particular on days when trading hours deviate from the regular times). Following the Lobster documentation, these times can be identified as in the code below.

opening_auction <- orderbook %>%
  filter((type == 6 & order_id == -1) | row_number() == 1) %>%
  filter(row_number() == n()) %>%
  .$ts
closing_auction <- orderbook %>%
  filter((type == 6 & order_id == -2) | row_number() == n()) %>%
  filter(row_number() == 1) %>%
  .$ts

c(opening_auction, closing_auction)
## [1] "2022-01-07 09:30:02 GMT" "2022-01-07 15:59:59 GMT"

In line with the exercise, the function process_orderbook() performs a number of essential cleaning steps. Its definition is given below.

process_orderbook <- function(orderbook) {
  
  # Did a trading halt happen?
  halt_index <- orderbook %>%
    filter((type == 7 & direction == -1 & m_price == -1 / 10000) |
             (type == 7 & direction == -1 & m_price == 1 / 10000))
  
  while (nrow(halt_index) > 1) {
    # Filter out messages that occurred in between trading halts
    cat("Trading halt detected")
    orderbook <- orderbook %>%
      filter(ts < halt_index$ts[1] | ts > halt_index$ts[2])
    halt_index <- halt_index %>% filter(row_number() > 2)
  }
  
  # Discard everything before type 6 & ID -1 and everything after type 6 & ID -2
  
  opening_auction <- orderbook %>%
    filter(
      type == 6,
      order_id == -1
    ) %>%
    .$ts
  
  closing_auction <- orderbook %>%
    filter(
      type == 6,
      order_id == -2
    ) %>%
    .$ts
  
  if (length(opening_auction) != 1) {
    opening_auction <- orderbook %>%
      select(ts) %>%
      head(1) %>%
      .$ts - seconds(0.1)
  }
  if (length(closing_auction) != 1) {
    closing_auction <- orderbook %>%
      select(ts) %>%
      tail(1) %>%
      .$ts + seconds(0.1)
  }
  
  orderbook <- orderbook %>% filter(ts > opening_auction & ts < closing_auction)
  orderbook <- orderbook %>% filter(type != 6 & type != 7)
  
  # Replace "empty" slots in orderbook (0 volume) with NA prices
  orderbook <- orderbook %>%
    mutate(across(contains("bid_price"), ~replace(., . < 0, NA)),
           across(contains("ask_price"), ~replace(., . >= 999999, NA))) 

  # Remove crossed orderbook observations
  orderbook <- orderbook %>%
    filter(ask_price_1 > bid_price_1)
  
  # Merge transactions with unique time stamp
  trades <- orderbook %>%
    filter(type == 4 | type == 5) %>%
    select(ts:direction)
  
  trades <- inner_join(trades,
                       orderbook %>%
                         group_by(ts) %>%
                         filter(row_number() == 1) %>%
                         ungroup() %>%
                         transmute(ts,
                                   ask_price_1, bid_price_1,
                                   midquote = ask_price_1 / 2 + bid_price_1 / 2,
                                   lag_midquote = lag(midquote)
                         ),
                       by = "ts"
  )
  
  trades <- trades %>%
    mutate(direction = case_when(
      type == 5 & m_price < lag_midquote ~ 1, # lobster convention: direction = 1 if executed against a limit buy order 
      type == 5 & m_price > lag_midquote ~ -1,
      type == 4 ~ as.double(direction),
      TRUE ~ as.double(NA)
    ))
  
  # Aggregate transactions with size and volume weighted price
  trade_aggregated <- trades %>%
    group_by(ts) %>%
    summarise(
      type = last(type),
      order_id = NA,
      m_price = sum(m_price * m_size) / sum(m_size),
      m_size = sum(m_size),
      direction = last(direction)
    )
  
  # Merge trades with last observed orderbook snapshot
  trade_aggregated <- inner_join(trade_aggregated,
                        orderbook %>%
                          select(ts, ask_price_1:last_col()) %>%
                          group_by(ts) %>%
                          filter(row_number() == n()),
                        by = "ts"
  )
  
  orderbook <- orderbook %>%
    filter(
      type != 4,
      type != 5
    ) %>%
    bind_rows(trade_aggregated) %>%
    arrange(ts) %>%
    mutate(direction = if_else(type == 4 | type == 5, direction, as.double(NA)))
  
  return(orderbook)
}

You can process the raw data from before by simply calling:

orderbook <- process_orderbook(orderbook)
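
A quick sanity check after cleaning, e.g. counting the remaining messages and the aggregated trades, might look as follows (a minimal sketch):

orderbook %>%
  summarise(n_messages = n(),
            n_trades = sum(type == 4 | type == 5, na.rm = TRUE))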

The function compute_depth below computes the number of quoted shares within bp basis points of the corresponding best price, which can be used as a measure of liquidity.

compute_depth <- function(df,
                          side = "bid",
                          bp = 0) {
  
  # Computes depth (in contract) based on orderbook snapshots
  if (side == "bid") {
    value_bid <- (1 - bp / 10000) * df %>% select("bid_price_1")
    index_bid <- df %>%
      select(contains("bid_price")) %>%
      mutate_all(function(x) {
        if_else(is.na(x),
                FALSE,
                x >= value_bid
        )
      })
    
    sum_vector <- (df %>% select(contains("bid_size")) * index_bid) %>% rowSums(na.rm = TRUE)
  } else {
    value_ask <- (1 + bp / 10000) * df %>% select("ask_price_1")
    index_ask <- df %>%
      select(contains("ask_price")) %>%
      mutate_all(function(x) {
        if_else(is.na(x),
                FALSE,
                x <= value_ask
        )
      })
    sum_vector <- (df %>% select(contains("ask_size")) * index_ask) %>% rowSums(na.rm = TRUE)
  }
  return(sum_vector)
}
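
The function can then be applied to each orderbook snapshot, for instance to compute the quoted depth (in shares) within 5 basis points of the best price on each side. This is just a usage sketch; the column names depth_bid_5bp and depth_ask_5bp are my own choice:

orderbook %>%
  mutate(depth_bid_5bp = compute_depth(., side = "bid", bp = 5),
         depth_ask_5bp = compute_depth(., side = "ask", bp = 5)) %>%
  select(ts, depth_bid_5bp, depth_ask_5bp)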

Next, I compute the midquote, bid-ask spread, aggregate volume, and depth (the amount of tradeable units quoted in the orderbook):

  • Midquote \(q_t = (a_t + b_t)/2\) (where \(a_t\) and \(b_t\) denote the best ask and best bid)
  • Spread \(S_t = (a_t - b_t)\) (values below are computed in basis points relative to the concurrent midquote)
  • Volume is the aggregate sum of traded units of the stock. I do not differentiate between hidden (type == 5) and visible volume.
  • Depth at the best level.

I aggregate the information into buckets of 5 minutes, using time-weighting for the order book variables.

orderbook_summaries <- orderbook %>%
  transmute(ts,
    midquote = ask_price_1 / 2 + bid_price_1 / 2,
    trading_volume = if_else(type == 4 | type == 5, m_price * m_size, as.double(NA)),
    depth0_bid = bid_size_1,
    depth0_ask = ask_size_1,
    spread = 10000 * (ask_price_1 - bid_price_1) / midquote
  ) %>%
  mutate(
    ts_latency = as.numeric(lead(ts)) - as.numeric(ts), # time until the next message
    ts_latency = if_else(is.na(ts_latency), 0, ts_latency), # set to 0 for the last message
    ts_minute = floor_date(ts, "5 minutes")
  ) %>% # create 5 minute buckets (e.g. messages from 10:30:00 - 10:34:59.xx belong into 10:30:00)
  group_by(ts_minute) %>%
  summarise(
    midquote = last(midquote),
    n_trades = sum(!is.na(trading_volume)),
    n = n(), # number of messages
    trading_volume = sum(trading_volume, na.rm = TRUE),
    trading_volume = if_else(is.na(trading_volume), as.double(0), trading_volume),
    depth0_bid = weighted.mean(depth0_bid, ts_latency),
    depth0_ask = weighted.mean(depth0_ask, ts_latency),
    spread = weighted.mean(spread, ts_latency)
  )

orderbook_summaries %>% knitr::kable()
ts_minute midquote n_trades n trading_volume depth0_bid depth0_ask spread
2022-01-07 09:30:00 140 61 12926 1313466 484.9 534.3 3.62
2022-01-07 09:35:00 140 23 14844 602060 669.3 583.9 4.33
2022-01-07 09:40:00 140 24 11150 299088 761.1 427.0 4.31
2022-01-07 09:45:00 139 43 12891 711751 777.4 556.6 4.54
2022-01-07 09:50:00 139 45 20975 643746 713.1 719.7 4.76
2022-01-07 09:55:00 139 46 17916 666569 748.4 627.6 4.79
2022-01-07 10:00:00 140 37 17288 640948 749.1 671.1 5.05
2022-01-07 10:05:00 140 42 14336 517071 835.2 520.4 4.93
2022-01-07 10:10:00 140 32 11862 609344 805.3 795.8 5.33
2022-01-07 10:15:00 140 24 11218 218798 493.2 633.2 4.59
2022-01-07 10:20:00 140 69 10213 1392951 567.3 651.8 4.31
2022-01-07 10:25:00 140 17 12386 193138 470.3 805.5 4.54
2022-01-07 10:30:00 140 27 12563 443163 446.8 715.0 4.43
2022-01-07 10:35:00 140 44 10616 408664 435.0 426.2 3.61
2022-01-07 10:40:00 140 91 10862 1516834 554.0 547.6 4.04
2022-01-07 10:45:00 140 28 12057 249738 467.3 758.1 4.51
2022-01-07 10:50:00 140 36 11627 292742 406.6 746.2 4.32
2022-01-07 10:55:00 140 19 8978 363035 299.5 787.5 3.61
2022-01-07 11:00:00 140 45 8117 852378 178.2 950.5 2.79
2022-01-07 11:05:00 140 29 8625 788170 454.5 820.7 4.08
2022-01-07 11:10:00 140 57 9471 1794017 366.2 513.4 3.13
2022-01-07 11:15:00 140 36 9019 547094 277.8 470.2 2.79
2022-01-07 11:20:00 140 40 7751 742033 528.3 593.9 3.72
2022-01-07 11:25:00 141 58 12209 1843627 526.0 532.6 4.10
2022-01-07 11:30:00 140 47 2706 1043222 219.2 521.8 6.24
2022-01-07 11:35:00 140 41 501 926211 240.0 203.4 8.53
2022-01-07 11:40:00 140 16 207 82531 172.9 166.4 8.63
2022-01-07 11:45:00 140 19 212 201726 136.2 188.0 5.78
2022-01-07 11:50:00 140 10 405 111357 82.4 56.0 6.04
2022-01-07 11:55:00 140 12 398 142420 153.9 169.9 6.21
2022-01-07 12:00:00 140 10 234 143158 138.3 118.7 6.09
2022-01-07 12:05:00 140 14 188 171964 141.6 45.6 6.90
2022-01-07 12:10:00 140 14 157 129363 188.3 72.1 5.57
2022-01-07 12:15:00 140 17 192 191348 108.7 67.7 3.55
2022-01-07 12:20:00 140 9 143 63614 120.0 43.7 3.94
2022-01-07 12:25:00 140 21 210 173409 134.5 121.0 4.88
2022-01-07 12:30:00 140 26 286 407699 138.3 164.6 6.99
2022-01-07 12:35:00 140 32 300 586622 50.1 121.2 5.66
2022-01-07 12:40:00 140 8 198 82985 229.2 75.8 5.94
2022-01-07 12:45:00 140 10 188 27437 210.3 78.8 5.80
2022-01-07 12:50:00 140 13 255 122189 287.7 102.6 4.92
2022-01-07 12:55:00 140 6 257 59784 170.7 113.2 4.16
2022-01-07 13:00:00 140 13 240 168846 255.5 167.2 4.83
2022-01-07 13:05:00 140 2 137 6995 151.8 142.3 6.02
2022-01-07 13:10:00 140 3 310 19176 192.3 122.3 7.01
2022-01-07 13:15:00 140 5 139 21009 161.9 220.5 6.59
2022-01-07 13:20:00 140 10 189 84217 212.7 26.1 3.50
2022-01-07 13:25:00 140 9 147 79495 250.4 156.4 2.68
2022-01-07 13:30:00 140 12 196 112770 237.5 100.0 2.34
2022-01-07 13:35:00 140 26 340 322772 172.4 119.7 3.00
2022-01-07 13:40:00 141 18 483 121353 273.7 88.3 6.59
2022-01-07 13:45:00 141 17 247 172959 106.2 125.5 4.52
2022-01-07 13:50:00 141 7 178 40787 250.9 71.0 5.11
2022-01-07 13:55:00 141 17 270 193457 167.6 153.5 4.62
2022-01-07 14:00:00 141 10 127 72340 284.7 149.9 4.61
2022-01-07 14:05:00 141 16 348 154616 398.3 195.5 6.84
2022-01-07 14:10:00 141 32 285 362107 278.1 110.2 4.68
2022-01-07 14:15:00 141 14 283 176133 191.0 145.9 4.06
2022-01-07 14:20:00 141 25 375 242762 129.5 125.2 2.63
2022-01-07 14:25:00 141 17 204 196725 138.8 61.1 4.58
2022-01-07 14:30:00 141 12 201 77879 182.6 74.2 3.71
2022-01-07 14:35:00 141 30 406 330383 170.1 45.8 4.19
2022-01-07 14:40:00 141 11 247 142360 163.0 88.4 8.42
2022-01-07 14:45:00 141 17 240 155433 390.8 48.0 4.63
2022-01-07 14:50:00 141 27 358 381499 296.6 86.1 5.42
2022-01-07 14:55:00 141 23 492 263047 86.6 75.9 3.00
2022-01-07 15:00:00 141 19 577 123330 145.2 146.9 5.22
2022-01-07 15:05:00 141 14 285 199390 173.4 124.1 3.25
2022-01-07 15:10:00 141 23 430 332351 303.3 18.7 2.59
2022-01-07 15:15:00 141 41 600 451828 218.5 137.3 2.48
2022-01-07 15:20:00 141 56 517 806482 180.7 174.1 2.08
2022-01-07 15:25:00 141 39 582 559393 287.2 111.1 1.74
2022-01-07 15:30:00 141 64 1028 739473 197.9 116.7 2.30
2022-01-07 15:35:00 141 65 864 886603 263.3 110.9 2.13
2022-01-07 15:40:00 141 57 1252 742247 206.2 103.2 2.51
2022-01-07 15:45:00 142 35 1065 242671 355.4 164.6 2.84
2022-01-07 15:50:00 141 154 2458 2367376 193.4 150.7 2.50
2022-01-07 15:55:00 141 140 2142 1857272 245.8 160.7 2.00

Finally, some visualization of the data at hand: The code below shows the dynamics of the traded prices (red points) and the quoted prices at the higher levels of the orderbook.

orderbook_trades <- orderbook %>%
  filter(type==4|type==5) %>%
  select(ts, m_price)

orderbook_quotes <- orderbook %>%
    mutate(ts_new = floor_date(ts, "minute")) %>%
    group_by(ts_new) %>%
    summarise_all(dplyr::last) %>%
    select(-ts_new) %>%
    mutate(id = row_number()) %>%
    select(ts, id, matches("bid|ask")) %>%
    gather(level, price, -ts, -id) %>%
    separate(level, into=c("side","variable","level"), sep="_") %>%
    mutate(level = as.numeric(level))  %>%
    pivot_wider(names_from = variable, values_from = price)

ggplot() +
  theme_bw() +
  geom_point(data = orderbook_quotes, aes(x=ts, y=price, color=as_factor(level), size = size/max(size)), alpha = 0.1)+
  geom_point(data = orderbook_trades, aes(x=ts, y=m_price), color='red') +
  labs(title="Orderbook Dynamics",
       y="Price",
       x="") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position ="none") +
  scale_y_continuous()

11.2 Brownian Motion and realized volatility

In this exercise you will analyze the empirical properties of the (geometric) Brownian motion based on a simulation study. Such simulation studies are useful for many applications in continuous-time finance as well, e.g. option pricing.

Finance theory suggests that prices must be so-called semimartingales. The most commonly used semimartingale is the geometric Brownian motion (GBM), where we assume that the logarithm of the stock price \(S_t\), \(X_t = \log(S_t)\), follows the dynamics \[X_t = X_0 + \mu t + \sigma W_t\] where \(\mu\) and \(\sigma\) are constants and \(W_t\) is a Brownian motion.
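
For the simulation part below it is useful to recall how the increments of this process are distributed: over an interval of length \(\Delta\) we have \[X_{t+\Delta} - X_t = \mu \Delta + \sigma (W_{t+\Delta} - W_t) \sim N\left(\mu \Delta, \sigma^2 \Delta\right),\] and increments over disjoint intervals are independent. This is exactly the property the simulation exploits.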

**Exercises:**

  • Write a function that simulates (log) stock prices according to the model above. Recall that the Brownian motion has independent increments which are normally distributed. The function should generate a path of the log stock price with a pre-defined number of observations (granularity), drift term, and volatility. Plot one sample path of the implied (log) stock price.
  • Use the function to simulate a large number of geometric Brownian motions. Illustrate the quadratic variation of the process.
  • Compute the realized volatility for each simulated process and illustrate the sampling distribution.
  • Finally, consider the infill asymptotics. Show the effect of sampling once per day, once per hour, once per minute, and so on, on the variance of the realized volatility.

**Solutions:**

For this exercise I make use of some functionalities from the xts and highfrequency packages, particularly for aggregation at different frequencies.

Note: When loading xts and the tidyverse jointly you may run into serious issues if you did not take naming conflicts into account! Especially the convenient functions last and first from dplyr may be affected. For that purpose, I consistently use dplyr::first and dplyr::last in this chapter.
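
One way to make such conflicts explicit is sketched below (the conflicted package is optional and not used in the remainder of the code; explicit namespacing as in dplyr::last works just as well):

library(conflicted)
# Turn ambiguous calls into errors and declare the preferred version once
conflict_prefer("first", "dplyr")
conflict_prefer("last", "dplyr")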

First, the function below is a rather general implementation of a simulated geometric Brownian motion (in logs). Note that there is no direct need to assign a trading date and hour; I just implement this because it provides output which is closer to actual data and therefore parts of my code can be recycled for future projects. The function can be called with a specific seed to replicate the results. The relevant parameters are the initial value start and the parameters of the geometric Brownian motion mu and sigma. The input steps defines the granularity (sampling frequency) of the simulated Brownian motion.

simulate_geom_brownian_motion <- function(seed = 3010,
                                          steps = 1000,
                                          mu = 0,
                                          sigma = 1,
                                          start = 0){
  set.seed(seed)
  date <- Sys.Date() + days(seed)
  hour(date) <- 9
  minute(date) <- 30
  date_end <- date
  hour(date_end) <- 16
  minute(date_end) <- 0

  tibble(step = 1:(steps+1),
         time =   seq(from = date,
                      to = date_end,
                      length.out = steps + 1),
         price = cumsum(c(start, rnorm(steps, mean = mu/steps, sd = sigma/sqrt(steps)))))
}

Note the last line of the function, which is doing most of the work: we simply exploit the fact that increments of a Brownian motion are independent and normally distributed. The entire path is nothing but the cumulative sum of the increments.
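
As a quick check of this logic, the same construction can be written down in a few lines, and one can verify on a single path that the sum of squared increments is close to \(\sigma^2\) (a minimal sketch with \(\sigma = 1\), independent of the function above):

set.seed(1)
n_steps <- 1000
increments <- rnorm(n_steps, mean = 0, sd = 1 / sqrt(n_steps))
path <- cumsum(increments) # the simulated Brownian path on [0, 1]
sum(increments^2) # close to sigma^2 = 1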

Next, I simulate the stock price with 1000 intradaily observations. Recall that the geometric Brownian motion implies dynamics for the log of the stock price, which is why I plot the exponential of the simulated values on the \(y\)-axis.

tab <- simulate_geom_brownian_motion(start = log(18))
tab %>%
  ggplot(aes(x=time, y = exp(price))) +
  geom_line() + labs(x="", y = "Stock price")

Next, we scale the simulation up. The code below simulates 250 paths of the geometric Brownian motion, all with the same starting value. The figure illustrates the sampled (log) price paths.

true_sigma <- 2.4
steps <- 1000

simulation <- tibble(iteration = 1:250)
simulation <- simulation %>%
  mutate(values = map(iteration, simulate_geom_brownian_motion, sigma = true_sigma, steps = steps)) %>%
  unnest(values)

simulation %>%
  ggplot(aes(x = step, y = price, group = iteration)) +
  geom_line(alpha = 0.3) +
  labs(x = "Iteration", y = "(log) price")

Next, we investigate some of the properties of the geometric Brownian motion. First, by definition, the quadratic variation of \(X_t = \int\limits_0^t\mu\, ds + \int\limits_0^t\sigma\, dW_s\) is \(\int\limits_0^t\sigma^2\, ds = \sigma^2 t\). The figure below illustrates this over time and shows that the variance across all simulated (log) prices also increases linearly in \(t\), as expected. For the simulation I used \(\sigma = 2.4\) (true_sigma), so \(\sigma^2\) corresponds to the slope of the line (adjusted by the number of observations).

simulation %>%
  group_by(step) %>%
  summarise(sd = sd(price)) %>%
  ggplot(aes(x = step, y = sd^2)) +
  geom_line() +
  geom_abline(aes(slope = true_sigma^2/steps, intercept = 0)) +
  labs(x = "Iteration", y = "Sample variation")

What about the realized volatility? As defined in the lecture, the realized variance \[RV = \sum\limits_{i=1}^n \left(X_{t_i} - X_{t_{i-1}}\right)^2\] converges to the integrated variance (in the absence of jumps, etc.). The histogram below illustrates the sampling distribution of the realized variance which, as expected, centers around true_sigma^2, the integrated variance of the simulated geometric Brownian motion. You can change the value true_sigma above and analyze the effect.

simulation %>%
  group_by(iteration) %>%
  summarise(rv = sum(((price - lag(price))^2), na.rm = TRUE)) %>%
  ggplot(aes(x = rv)) +
  geom_histogram(bins = 100) +
  labs(x = "Realized Volatility") +
  geom_vline(aes(xintercept = true_sigma^2), color = "red", linetype = "dotted")

Finally, some more analysis regarding the infill asymptotics. As should be clear by now, sampling at the highest possible frequency ensures that the realized volatility recovers the integrated variance of the (log) price process. In other words, we can measure the variance. The following few lines of code illustrate the effect: whereas we simulated our process at a high frequency, we now sample the paths at different frequencies, ranging from every second to once per day. The figure then illustrates the daily realized volatilities relative to the integrated variance of the process. Not surprisingly, the dispersion around the true parameter decreases substantially with higher sampling frequency.

Note that the code below also illustrates how to work with the xts package and how to switch between the tidyverse and the highfrequency data format requirements. This is arguably a bit cumbersome, but in my experience it is worth exploiting the functionalities of both worlds instead of recoding everything in a tidy manner.

infill_asymptotics <- function(tab, sampling_frequency = 5) {
  tab_xts <- as.xts(tab %>% select(price),
                    order.by = tab %>% pull(time))
  tab_xts <- aggregatePrice(tab_xts,
                            alignBy = "minutes", alignPeriod = sampling_frequency,
                            tz = "GMT", marketOpen = "09:30:00")

  tab_xts %>%
    fortify.zoo() %>%
    as_tibble() %>%
    janitor::clean_names() %>%
    rename("ts" = index) %>%
    summarise(rv = sum((price - lag(price))^2, na.rm = TRUE)) %>%
    as.numeric()
}

rvs <- simulation %>%
  group_by(iteration) %>%
  nest(data = c(step, time, price)) %>%
  mutate("1 Second" = map_dbl(data, infill_asymptotics, 1/60),
         "1 Minute" = map_dbl(data, infill_asymptotics, 1),
         "5 Minutes" = map_dbl(data, infill_asymptotics, 5),
         "15 Minutes" = map_dbl(data, infill_asymptotics, 15),
         "1 hour" = map_dbl(data, infill_asymptotics, 60),
         "1 day" = map_dbl(data, infill_asymptotics, 6.5 * 60))

rvs %>%
    select(-data) %>%
    ungroup() %>%
    pivot_longer(-iteration,
                 names_to = "Sampling Frequency",
                 values_to = "rv") %>%
    mutate(`Sampling Frequency` = as.factor(`Sampling Frequency`),
           `Sampling Frequency` = fct_reorder(`Sampling Frequency`, rv, sd)) %>%
    ggplot(aes(x = sqrt(rv),
               fill = `Sampling Frequency`)) +
    geom_density() +
    geom_vline(aes(xintercept = true_sigma), linetype ="dotted") +
    facet_wrap(~`Sampling Frequency`, scales = "free_y") +
    labs(x = "(Log) Realized Volatility", y = "")