23 Decomposing High-frequency Time Series

In this chapter, we will work with ultra-high-frequency trading data that consists of time series measured with nanosecond precision. Our task is to decompose stock price into temporary and permanent components using state-space modeling.

In the next chapter, we will link the temporary and permanent stock price components with tweets to generate the two outcome variables: the temporary and permanent stock price impacts of firm-generated content (FGC) on Twitter.

23.1 Temporary and permanent price impacts

Price impact is defined as the impact on the variance of stock price.

A firm’s stock price reflects information relevant to the value of a firm, but is also distorted by noise. Estimating the proportion of stock price driven by information (permanent component) and the proportion driven by noise (temporary component) is a critical aspect of this analysis.

Temporary price impacts are short-term impacts that result in momentary changes in the price of a stock before it returns to its pre-FGC value. Temporary price impacts are often the result of uninformed trader activity.

Sources of noise include trading frictions due to low liquidity (liquidity being the ability to trade large quantities of a firm’s stock quickly with little or no price impact) and the activity of traders who lack adequate information about the value of a stock, known as “uninformed traders”.

Permanent price impacts often result in the price attaining an enduring new value after an event (e.g., FGC). This occurs when the event provides information that updates informed investor/trader expectations related to a firm’s long-term performance.

A crucial element in such analysis is knowing the event time (i.e., timestamp), which refers to the time at which an event occurs, such as the FGC dissemination time. An event (e.g., firm-generated content on Twitter) can generate a permanent price impact and a temporary price impact.

This study estimates the permanent and temporary price impacts of FGC by first conducting a state-space decomposition of firm stock price into its permanent and temporary components and then linking the changes in these components to individual pieces of FGC (i.e., tweets).

23.2 State-space modeling

State-space modeling is commonly used for decomposition of price.

It is a tool for describing a phenomenon that has an underlying system with a time-varying dynamical relationship. The state of the system at time $t$ is related to the state of the system at time $t-1$ . If the state of the system at time $t-1$ is known, then the state at $t$ can be inferred.

We cannot observe the true underlying state of the system; we only observe a noisy version of it. In its simplest form, we can specify a state equation and an observation equation to summarize this characteristic.

The observation equation describes how the underlying state is transformed (with noise added) into something that we directly measure.

$y(t) = \alpha(t) + \epsilon(t)$

The state equation describes how the system evolves from one time point to the next.

$\alpha(t) = \alpha(t-1) + \eta(t)$

It is assumed that $\epsilon(t)$ is normally distributed with mean 0 and covariance $H(t)$ , and that $\eta(t)$ is normally distributed with mean 0 and covariance $Q(t)$ .
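To make the two equations concrete, the sketch below simulates a short series from this simple model in base R. The sample size, the seed, and the two standard deviations are arbitrary values chosen for illustration, and the variances are held constant over time.

```r
# Simulate a local level model: a random-walk state observed with noise.
set.seed(1)               # arbitrary seed, for reproducibility only
n      <- 200             # number of time points (illustrative)
sd_eta <- 0.1             # sqrt(Q): standard deviation of the state disturbance
sd_eps <- 0.5             # sqrt(H): standard deviation of the observation noise

eta <- rnorm(n, mean = 0, sd = sd_eta)
eps <- rnorm(n, mean = 0, sd = sd_eps)

alpha <- cumsum(eta)      # state equation: alpha(t) = alpha(t-1) + eta(t), with alpha(0) = 0
y     <- alpha + eps      # observation equation: y(t) = alpha(t) + eps(t)
```

Here `alpha` is the state we would like to recover and `y` is what we actually get to observe.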

In this study, the structure of the state-space model for price is expressed through the observation and state equations below.

$v_{s,t,\tau} = m_{s,t,\tau} + i_{s,t,\tau}$

$m_{s,t,\tau} = m_{s,t,\tau-1} + u_{s,t,\tau}$

$v_{s,t,\tau} = \ln(p_{s,t,\tau})$

The model covers $S$ stocks, $T$ intraday periods, and $N$ intervals, indexed by $s = 1,...,S$, $\tau = 1,...,T$, and $t = 1,...,N$. Each interval $t$ is one second long (the time interval of the analysis), so $N$ equals the number of one-second intervals in a stock trading day. $p_{s,t,\tau}$ is the price of stock $s$ at interval $t$ and period $\tau$.

$m_{s,t,\tau}$ is a nonstationary permanent component of the price of stock $s$ at interval $t$ and period $\tau$ . $i_{s,t,\tau}$ is a stationary transitory component of the price of stock $s$ at interval $t$ and period $\tau$ . $u_{s,t,\tau}$ is an idiosyncratic disturbance error in the permanent price component of stock $s$ at interval $t$ and period $\tau$ .

We will estimate the variances $\sigma^{2}_{u,s,t}$ (permanent component) and $\sigma^{2}_{i,s,t}$ (temporary component) by maximum likelihood, with the likelihood constructed using the Kalman filter.

filtering

The main goal of state-space modeling is to gain knowledge of the unknown latent state $\alpha$ given the observations $y$. This is achieved with two important recursive algorithms, Kalman filtering and Kalman smoothing. The purpose of filtering is to update the estimate of the true state each time a new observation arrives. An example of filtering is adjusting the estimate of the unemployment level in the United States based on the latest data release from the Bureau of Labor Statistics, given the history of monthly unemployment data.

For our purpose, state-space modeling with a Kalman filter offers a solution to dealing with the unequal time intervals or irregular frequency inherent in intraday high-frequency transaction data like stock price movements. The Kalman filter facilitates the decomposition of any change in the time series (e.g., variance in the stock prices).
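The filtering recursion can be sketched in a few lines of base R for the simple model $y(t) = \alpha(t) + \epsilon(t)$, $\alpha(t) = \alpha(t-1) + \eta(t)$. This is a minimal illustration under stated assumptions, not the routine KFAS uses: the function name `kalman_local_level` and its arguments are hypothetical, the variances `H` and `Q` are treated as known and constant, and the prior for the first state (`a1`, `P1`) is supplied by the user. A missing observation is handled by skipping the update step, which is one way a filter copes with gaps in the series.

```r
# Kalman filter for the local level model with known, constant variances.
# y: observations (NA allowed); H: observation variance; Q: state variance;
# a1, P1: prior mean and variance of the first state (assumed values).
kalman_local_level <- function(y, H, Q, a1 = 0, P1 = 1e7) {
  n        <- length(y)
  filtered <- numeric(n)    # E[alpha(i) | y(1), ..., y(i)]
  P_filt   <- numeric(n)    # Var[alpha(i) | y(1), ..., y(i)]
  loglik   <- 0
  a <- a1                   # predicted state mean
  P <- P1                   # predicted state variance
  for (i in seq_len(n)) {
    if (is.na(y[i])) {
      # no observation: the prediction is the best available estimate
      filtered[i] <- a
      P_filt[i]   <- P
    } else {
      v  <- y[i] - a        # one-step prediction error
      Fv <- P + H           # variance of the prediction error
      K  <- P / Fv          # Kalman gain
      filtered[i] <- a + K * v
      P_filt[i]   <- P * (1 - K)
      # Gaussian log-likelihood contribution of this observation
      loglik <- loglik - 0.5 * (log(2 * pi) + log(Fv) + v^2 / Fv)
    }
    # prediction step for a random walk: same mean, variance grows by Q
    a <- filtered[i]
    P <- P_filt[i] + Q
  }
  list(filtered = filtered, P = P_filt, loglik = loglik)
}

# Usage with made-up numbers, including one missing observation
res <- kalman_local_level(c(10.0, 10.2, NA, 10.1), H = 0.04, Q = 0.01, a1 = 10, P1 = 1)
res$filtered
```

The accumulated `loglik` is the quantity that maximum likelihood estimation would maximize over `H` and `Q`.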

For an illustrative introduction of Kalman filtering, watch the video Kalman Filtering with Applications in Finance.

We will use the package KFAS for state-space modeling.

23.3 Preparing high-frequency trading data

To prepare the data for state-space modeling, we first read the stock data that we collected from the WRDS databases. For a single day of data (2022-01-03), this could look as follows:

library(readr)

# column names for the TAQ trade file
colnames <- c("DATE", "TIME_M", "EX", "SYM_ROOT", "SYM_SUFFIX", "TR_SCOND", "SIZE", "PRICE",
              "TR_STOP_IND", "TR_CORR", "TR_SEQNUM", "TR_ID", "TR_SOURCE", "TR_RF")

# read every column as character so that timestamps and codes are kept exactly as recorded
stocks <- read_delim("20220103.txt",  
                     delim = "\t", 
                     col_names = colnames,
                     col_types = cols(.default = col_character()))

cleaning data

Next, we will perform data cleaning using the criteria in Chordia, Roll, and Subrahmanyam (2001) and Ibikunle (2015), documented in the Data sections.

Chordia, T., Roll, R., & Subrahmanyam, A. (2001). Market Liquidity and Trading Activity. The Journal of Finance, 56(2), 501-530. https://doi.org/10.1111/0022-1082.00335.

Ibikunle, G. (2015). Opening and Closing Price Efficiency: Do Financial Markets Need the Call Auction? Journal of International Financial Markets, Institutions and Money, 34, 208-227. https://doi.org/10.1016/j.intfin.2014.11.014.

We will refer to DAILY TAQ CLIENT SPECIFICATION (2017) and the variable descriptions provided by WRDS as our codebooks.

For demo purposes, we will use a subset of stocks identified by the ticker “BR” on 2022-01-03 to illustrate the data cleaning process.
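The chapter does not show how the demo subset was created. A minimal sketch, assuming it is simply the rows whose SYM_ROOT equals “BR”, is given below on a toy table. The data frame `stocks_demo` and its values are made up for illustration.

```r
# Toy stand-in for the full trades table (values are made up)
stocks_demo <- data.frame(
  SYM_ROOT = c("A", "BR", "BR", "BRK"),
  PRICE    = c("160.10", "183.50", "182.70", "300.00"),
  stringsAsFactors = FALSE
)

# Keep only rows whose symbol root is exactly "BR" ("BRK" is a different ticker)
stocks_br_demo <- subset(stocks_demo, SYM_ROOT == "BR")
stocks_br_demo
```

Applied to the full table, the same call with `stocks` in place of `stocks_demo` would give a `stocks_br` object like the one used in the rest of this section.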

Previously, we acquired ultra-high-frequency trading data from the database TAQ - Millisecond Consolidated Trades in WRDS. Below is the codebook of the variables in these datasets.

DATE: Date of trade
TIME_M: Time of trade
EX: Exchange that issued the trade
SYM_ROOT: Security symbol root
TR_SCOND: Trade Sale Condition
SIZE: Volume of trade
PRICE: Price of trade

Here are the steps we will take to clean the data.

(1) Trades out of sequence and trades with special settlement conditions are purged.

A trade out of sequence is a transaction that printed late, which may include the following types of transactions: Cash (only) Market, Average Price Trade, Next Day (only) Market, and Sold.

The variable TR_SCOND defines trade sale condition. More than one code can be displayed in the field (up to 4 codes).

table(stocks_br$TR_SCOND)

## 
##  4 B  4 I    6    B    F  F I    I    M N  I    O  O I    Q    T   TI   TP 
##   21  108    1    1  290 2988 4525    3    1    1    1    3    6    7    5

We keep a trade if it is not a trade out of sequence and if it is not a trade with special settlement conditions. We go through each condition to filter the trades.

sale_cond <- c("A", "B", "N", "R", "Z")
exclude_trades <- apply(sapply(sale_cond, grepl, stocks_br$TR_SCOND), 1, any)
stocks_br <- stocks_br[!exclude_trades, ]

table(stocks_br$TR_SCOND)

## 
##  4 I    6    F  F I    I    M    O  O I    Q    T   TI   TP 
##  108    1  290 2988 4525    3    1    1    3    6    7    5

These conditions are denoted by the codes below.

A: Cash-Only Basis
B: Average Price Trade
N: Next day - Calls for delivery of securities on the first business day following the day of the contract
R: Seller - Delivery date is specified by the seller and must be between two and sixty calendar days following the day of the contract
Z: Sold Sale - A transaction that is reported to the tape at a time later than it occurred and when other trades occurred between the time of the transaction and its report time
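Because TR_SCOND can hold up to four codes in one field, the way the codes are matched matters. The sketch below uses a few field values taken from the table above to contrast exact matching with substring matching. The character-class pattern is one possible way to write the test and is equivalent to checking each code with grepl().

```r
tr_scond  <- c("4 B", "F I", "M N", "TI", "B", "I")   # sample TR_SCOND field values
sale_cond <- c("A", "B", "N", "R", "Z")               # codes to exclude

# Exact matching only flags fields that consist of a single excluded code
exact_match <- tr_scond %in% sale_cond

# Substring matching flags a field if any excluded code appears anywhere in it
pattern   <- paste0("[", paste(sale_cond, collapse = ""), "]")   # "[ABNRZ]"
any_match <- grepl(pattern, tr_scond)

data.frame(tr_scond, exact_match, any_match)
```

Exact matching misses “4 B” and “M N”, which is why the filter has to search inside the field.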

(2) Use only NYSE stocks to avoid any possibility of the results being influenced by differences in trading protocols.

EX is the exchange that issued the trade.

table(stocks_br$EX)

## 
##    A    B    C    D    H    J    K    M    N    P    T    U    V    X    Y    Z 
##   37   50   65 2807   75  269  262   10 2133  305 1895   56  638   40   84  765

N stands for the New York Stock Exchange (NYSE), one of the major US stock exchanges.

stocks_br <- subset(stocks_br, EX == "N")

table(stocks_br$EX)

## 
##    N 
## 2133

Below is a sample of the codes for the exchange on which a trade occurred:

A: American Stock Exchange (AMEX)
N: New York Stock Exchange (NYSE)
C: National Stock Exchange (NSX)
T/Q: NASDAQ
M: Chicago Stock Exchange (CHX)

An exchange is a marketplace where securities, commodities, derivatives, and other financial instruments are traded. The core function of an exchange is to ensure fair and orderly trading and the efficient dissemination of price information for any securities trading on that exchange.

The great majority of trades are completed through electronic means without regard to a physical location. This process has resulted in a substantial increase in high-frequency trading programs and the use of complex algorithms by traders on exchanges.

(3) To avoid the influence of unduly high-priced stocks, if the price is greater than $999, the stock is deleted from the sample.

# PRICE was read as character, so convert it before comparing
stocks_br <- subset(stocks_br, subset = as.numeric(PRICE) <= 999)
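Since every column was read as character, the price threshold should be applied to numeric values. The sketch below, with made-up prices, shows what happens when a character vector is compared with a number directly: R converts the number to text and compares the strings.

```r
price_chr <- c("183.5", "999", "1000", "1250.75")   # made-up prices stored as text

# String comparison: 999 is coerced to "999" and values are compared character by character
keep_as_text <- price_chr <= 999

# Numeric comparison: convert first, then compare
keep_as_number <- as.numeric(price_chr) <= 999

data.frame(price_chr, keep_as_text, keep_as_number)
```

Under the string comparison, “1000” and “1250.75” pass the filter because they start with “1”, which sorts before “9”.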

23.4 Time series

A state-space model operates on time series data. Therefore, after completing the data cleaning process, the next step is to represent the stock price as a time series.

Time series data are observations or measurements that are indexed according to time. The time index has a special ordering, which is a key property of time series data, and which distinguishes it from other types of data.

representing time series

Choice of time series representation in R is critical, because the choice affects not only how the data is stored, but also which functions will be available for processing, analyzing, and plotting our time series data.

To conduct time series analysis in R, specialized packages are often employed, which require an object class specifically designed for time series data. Prior to analysis, it is often necessary to transform the data format to ensure compatibility with these packages.

Two popular time series packages for R users are zoo and xts. xts is an extension of zoo. Both zoo and xts are capable of creating time indexes with millisecond precision. However, the timestamps in our dataset are measured with nanosecond resolution. To handle this level of precision, we will utilize the nanotime package, which is compatible with zoo. Therefore, zoo is the optimal option for our analysis.

`zoo` for handling time series data

zoo provides infrastructure for fundamental tasks in time series analysis, such as data reading, handling, aggregation, and transformation. But it does not provide time series modeling functionality.

zoo also provides conversion methods to and from many other time series classes on CRAN, so it is relatively easy to pass time series between those classes and zoo.

Given these two features, we rely on add-on packages alongside zoo for time series modeling.

We will use the function zoo() to format the time series data. A zoo series consists of ordered observations that are stored internally in a vector or matrix with an index attribute. The index must have one entry per observation (that is, per row of the matrix).
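As a small illustration of this structure, the sketch below builds a zoo series from three made-up log prices with a POSIXct index that is deliberately out of order. It uses only zoo(), index(), and coredata() from the zoo package.

```r
library(zoo)

# Made-up log prices with an unsorted time index
times  <- as.POSIXct(c("2022-01-03 09:30:18",
                       "2022-01-03 09:30:00",
                       "2022-01-03 09:30:52"), tz = "UTC")
prices <- c(5.2079, 5.2122, 5.2068)

z <- zoo(prices, order.by = times)   # observations are sorted by the index

index(z)      # the time index, now in increasing order
coredata(z)   # the observations, reordered to match the index
```

The index and the data stay paired, so the price recorded at 09:30:00 comes first even though it was supplied second.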

nanotime resolution

The variables that record timestamps of the transactions are DATE and TIME_M.

Both are character vectors.

head(stocks_br$DATE)

## [1] "20220103" "20220103" "20220103" "20220103" "20220103" "20220103"

class(stocks_br$DATE)

## [1] "character"

TIME_M is recorded to the nanosecond precision.

head(stocks_br$TIME_M)

## [1] "9:30:00.865423104" "9:30:00.865423104" "9:30:18.002543104" "9:30:52.405231104"
## [5] "9:31:14.199711488" "9:31:14.199952128"

class(stocks_br$TIME_M)

## [1] "character"

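To create a zoo object to represent the time series, we might think of formatting the index as a POSIXct object. However, POSIXct stores time as a double-precision number of seconds, which limits current timestamps to roughly microsecond resolution, so the nanosecond digits are lost.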

x <- "2022-01-03 09:00:35.093560347"
y <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%OS")
format(y, "%Y-%m-%d %H:%M:%OS6")

## [1] "2022-01-03 09:00:35.093560"
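The limit comes from floating-point arithmetic rather than from the formatting function. A short base R sketch, using the same second as the example above with the time zone fixed to UTC for reproducibility, shows how far apart neighbouring representable values are at this magnitude.

```r
# Seconds since 1970-01-01 for the timestamp, stored as a double
secs <- as.numeric(as.POSIXct("2022-01-03 09:00:35", tz = "UTC"))

# Gap between adjacent doubles at this magnitude (52 fraction bits)
spacing <- 2^(floor(log2(secs)) - 52)
spacing                      # about 2.4e-07 seconds

# Adding one nanosecond changes nothing, because 1e-9 is far below the gap
(secs + 1e-9) == secs
```

With a gap of about a quarter of a microsecond between representable values, two trades that differ only in their nanosecond digits can end up with the same POSIXct value.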

To set the index to nanosecond precision, we introduce the nanotime package and its conversion function nanotime().

library(zoo)
library(nanotime)

formatting time index to the nanosecond precision

The format for the nanotime index is “1970-01-01T00:00:00.000000001+00:00”.

Below we set up the index format that nanotime can recognize.

First, we fix the hour component in TIME_M. Hours before 10 a.m. are written with a single digit (e.g., “9:00:14”) and need to be padded to two digits (e.g., “09:00:14”). We start by dropping the fractional seconds (the last 10 characters) so that the remaining string can be parsed as a time.

stocks_br$TIME <- substr(stocks_br$TIME_M, 1, nchar(stocks_br$TIME_M) - 10) 

head(stocks_br$TIME)

## [1] "9:30:00" "9:30:00" "9:30:18" "9:30:52" "9:31:14" "9:31:14"

We’ll use as.POSIXct() to create a new datetime variable and fix the format.

# paste() inserts the space between date and time that the format string expects
stocks_br$TRADETIME <- as.POSIXct(paste(stocks_br$DATE, stocks_br$TIME), 
                               format = "%Y%m%d %H:%M:%S",
                               tz = "EST")

head(stocks_br$TRADETIME)

## [1] "2022-01-03 09:30:00 EST" "2022-01-03 09:30:00 EST" "2022-01-03 09:30:18 EST"
## [4] "2022-01-03 09:30:52 EST" "2022-01-03 09:31:14 EST" "2022-01-03 09:31:14 EST"

Paste the nanoseconds and the “+00:00” components to the datetime variable.

“+00:00” is a time zone offset that represents UTC. Since the time zone does not affect the analyses we are going to perform, we will simply include the UTC offset in the datetime string to ensure that the format matches in general.

stocks_br$DATETIME_str <- 
  paste0(stocks_br$TRADETIME, 
         substr(stocks_br$TIME_M, nchar(stocks_br$TIME_M)-9, nchar(stocks_br$TIME_M)), 
         "+00:00")

head(stocks_br$DATETIME_str)

## [1] "2022-01-03 09:30:00.865423104+00:00" "2022-01-03 09:30:00.865423104+00:00"
## [3] "2022-01-03 09:30:18.002543104+00:00" "2022-01-03 09:30:52.405231104+00:00"
## [5] "2022-01-03 09:31:14.199711488+00:00" "2022-01-03 09:31:14.199952128+00:00"
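The same kind of string can also be assembled with text operations alone, without passing through POSIXct. The sketch below is an alternative, not the approach used in this chapter: it pads the hour and reformats the date with regular expressions and uses the “T” separator from the nanotime format shown earlier. The second timestamp is made up to show a two-digit hour.

```r
date   <- c("20220103", "20220103")
time_m <- c("9:30:00.865423104", "14:05:07.000000001")   # second value is made up

# Left-pad a single-digit hour with a zero; two-digit hours are left unchanged
time_padded <- sub("^([0-9]):", "0\\1:", time_m)

# Rewrite YYYYMMDD as YYYY-MM-DD
date_iso <- sub("^([0-9]{4})([0-9]{2})([0-9]{2})$", "\\1-\\2-\\3", date)

# Combine date, time with nanoseconds, and the UTC offset
datetime_str <- paste0(date_iso, "T", time_padded, "+00:00")
datetime_str
```

The resulting strings can be passed to nanotime() in the same way as DATETIME_str.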

The nanoseconds are extracted from TIME_M using substr().

Finally, we use nanotime() to convert the strings into timestamps with nanosecond precision.

stocks_br$NANOTIME <- nanotime(stocks_br$DATETIME_str)

head(stocks_br$NANOTIME)

## [1] 2022-01-03T09:30:00.865423104+00:00 2022-01-03T09:30:00.865423104+00:00
## [3] 2022-01-03T09:30:18.002543104+00:00 2022-01-03T09:30:52.405231104+00:00
## [5] 2022-01-03T09:31:14.199711488+00:00 2022-01-03T09:31:14.199952128+00:00

creating time intervals

To build the state-space model, we also need to create a variable for time interval at the level of seconds to estimate changes in the components of price.

stocks_br$SEC <- as.POSIXct(floor(as.numeric(stocks_br$NANOTIME) / 1e9), 
                            origin = "1970-01-01", 
                            tz = "UTC")

head(stocks_br)

## # A tibble: 6 × 19
##   DATE     TIME_M    EX    SYM_ROOT SYM_SUFFIX TR_SCOND SIZE  PRICE TR_STOP_IND TR_CORR
##   <chr>    <chr>     <chr> <chr>    <chr>      <chr>    <chr> <chr> <chr>       <chr>  
## 1 20220103 9:30:00.… N     BR       <NA>       O        3416  183.… N           00     
## 2 20220103 9:30:00.… N     BR       <NA>       Q        3416  183.… N           00     
## 3 20220103 9:30:18.… N     BR       <NA>       F I      1     182.7 N           00     
## 4 20220103 9:30:52.… N     BR       <NA>       I        44    182.… N           00     
## 5 20220103 9:31:14.… N     BR       <NA>       F I      13    183   N           00     
## 6 20220103 9:31:14.… N     BR       <NA>       F I      7     183   N           00     
## # ℹ 9 more variables: TR_SEQNUM <chr>, TR_ID <chr>, TR_SOURCE <chr>, TR_RF <chr>,
## #   TIME <chr>, TRADETIME <dttm>, DATETIME_str <chr>, NANOTIME <nanotime>, SEC <dttm>
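Before fitting a model within each one-second interval, it can be useful to check how many trades each interval contains, because a model with two unknown variances cannot be estimated meaningfully from one or two observations. The sketch below uses made-up trade times expressed in seconds. The threshold of three trades is an arbitrary assumption for illustration, not a rule from the original study.

```r
# Made-up trade times in seconds since 1970-01-01, with fractional parts
trade_secs <- c(1641202200.865, 1641202200.866, 1641202218.002,
                1641202252.405, 1641202252.900, 1641202252.950)

sec_bucket     <- floor(trade_secs)      # same idea as floor(nanoseconds / 1e9)
trades_per_sec <- table(sec_bucket)      # number of trades in each one-second interval
trades_per_sec

# Intervals with fewer than 3 trades (arbitrary threshold, for illustration only)
sparse <- names(trades_per_sec)[trades_per_sec < 3]
sparse
```

On the real data, the same count could be obtained with table(stocks_br$SEC).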

23.5 Building state-space model

Now we are ready to create the state-space model to decompose stock prices. We will create a local level model following the original study. The authors provided SAS code for this step.

The local level model is a linear regression model that models the unobserved stochastic trend and irregular component. Because the local level model contains an unobserved component, it fits nicely into the state-space framework. Both the unobserved component and the unknown parameters can be estimated using the Kalman filter and maximum likelihood estimation.

We use the package KFAS to build a local level model in the state-space framework.

Below we use a tidyverse approach to compute the changes in price variances for every stock in every time interval in the entire time span of the project.

library(dplyr)
library(zoo)
library(KFAS)

# `stocks` must already contain LOGPRICE, NANOTIME, and SEC (see the code summary in 23.6)
stocks %>%
  # fit one state-space model per ticker and one-second interval
  group_by(SYM_ROOT, SEC) %>%
  group_modify(~ {
    
    # represent the log-price series as a numeric zoo object indexed by nanotime
    Y <- zoo(.x$LOGPRICE, order.by = .x$NANOTIME)
    
    # local level model with unknown state (Q) and observation (H) variances
    model_structural <- SSModel(Y ~ SSMtrend(degree = 1, Q = list(NA)), H = NA)
    fit_structural <- fitSSM(model_structural,
                             inits = c(Q = 0.9, H = 0.1),
                             method = "BFGS")
    
    # generate two new variables var_h, var_q
    var_h <- fit_structural$model$H[1, 1, 1]  # Extract observation variance (H)
    var_q <- fit_structural$model$Q[1, 1, 1]  # Extract state variance (Q)
    
    # add these variables to the original data frame
    .x %>% mutate(var_h = var_h, var_q = var_q)
  }) %>%
  ungroup()

The function SSModel() builds the model.

The first argument to the SSModel(Y ~ SSMtrend()) function is the formula which defines the observations (left side of tilde operator ~) and the structure of the state equation (right side of tilde ~).

In SSMtrend(degree = 1, Q = list(NA)), degree defines the degree of the polynomial component, where 1 corresponds to a local level model (see page 14 in KFAS: Exponential Family State Space Models in R).

SSModel() does not perform estimation of unknown parameters, which can be estimated by fitSSM() using maximum likelihood.

fitSSM() estimates the NA values in the time invariant covariance matrices H and Q. The NA values represent the unknown variance parameters $\sigma_\epsilon^{2}$ and $\sigma_\eta^{2}$ (see page 10 in KFAS reference manual on fitSSM()).

Estimates of the variance parameters can be extracted by fit_structural$model$H (temporary component) and fit_structural$model$Q (permanent component).
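To see these functions at work outside the grouped pipeline, the sketch below fits the same local level model to a simulated series whose true variances are known. The simulated data, the seed, and the starting values are assumptions made for illustration. To our understanding, the default update function in fitSSM() treats the initial values as log-variances, which is why they are passed through log() here; this is worth checking against the KFAS documentation.

```r
library(KFAS)

set.seed(123)                              # arbitrary seed
n     <- 300
alpha <- cumsum(rnorm(n, sd = 0.05))       # permanent component (random walk), true Q = 0.0025
y     <- alpha + rnorm(n, sd = 0.2)        # observed series, true H = 0.04

# Local level model with both variances marked as unknown (NA)
model <- SSModel(y ~ SSMtrend(degree = 1, Q = list(matrix(NA))), H = matrix(NA))

# Maximum likelihood estimation; starting values assumed to be on the log-variance scale
fit <- fitSSM(model, inits = log(c(0.01, 0.01)), method = "BFGS")

var_q <- fit$model$Q[1, 1, 1]              # estimated state (permanent) variance
var_h <- fit$model$H[1, 1, 1]              # estimated observation (temporary) variance
c(var_q = var_q, var_h = var_h)

fit$optim.out$convergence                  # 0 means optim() reported convergence
```

Comparing the estimates with the true values used in the simulation gives a quick check that the model is specified as intended.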

23.6 Code summary

In summary, here are the steps we’ve taken to prepare data and build the state-space model.

This task can be computationally intensive and may require a high-performance computing service to run the code.

library(readr)
library(dplyr)
library(purrr)
library(zoo)
library(nanotime)
library(KFAS)

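# a. Read file (column names are supplied, as in Section 23.3)
colnames <- c("DATE", "TIME_M", "EX", "SYM_ROOT", "SYM_SUFFIX", "TR_SCOND", "SIZE", "PRICE",
              "TR_STOP_IND", "TR_CORR", "TR_SEQNUM", "TR_ID", "TR_SOURCE", "TR_RF")
stocks <- read_delim("20220103.txt", 
                     delim = "\t",
                     col_names = colnames,
                     col_types = cols(.default = col_character()))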

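# b. Filter and clean data
stocks <- stocks %>%
  select(-SYM_SUFFIX) %>% 
  filter(EX == "N") %>%                      # NYSE trades only
  select(-EX) %>% 
  filter(!grepl("[ABNRZ]", TR_SCOND)) %>%    # drop sale-condition codes listed in Section 23.3, anywhere in the field
  filter(as.numeric(PRICE) <= 999)           # PRICE is character, so convert before comparing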

# c. Transform data 
# Create a nanotime object to represent time series
stocks <- stocks %>%
  mutate(
    LOGPRICE = log(as.numeric(PRICE)),
    TIME = substr(TIME_M, 1, nchar(TIME_M) - 10),
    TRADETIME = as.POSIXct(paste(DATE, TIME), format = "%Y%m%d %H:%M:%S", tz = "EST"),
    DATETIME_str = paste0(TRADETIME, substr(TIME_M, nchar(TIME_M) - 9, nchar(TIME_M)), "+00:00"),
    NANOTIME = nanotime(DATETIME_str),
    SEC = as.POSIXct(floor(as.numeric(NANOTIME) / 1e9), origin = "1970-01-01", tz = "UTC"),
    SIZE = as.numeric(SIZE)
  ) %>%
  select(-DATE, -TIME_M, -TIME, -TRADETIME, -DATETIME_str, -TR_SCOND) %>%
  arrange(SYM_ROOT, SEC, NANOTIME)

# d. Fit State-Space Model (SSM) by SYM_ROOT and SEC
qh <- stocks %>%
  
  # fit state-space model by ticker and time interval
  group_by(SYM_ROOT, SEC) %>%
  group_modify(~ {
    
    # represent the log-price series as a numeric zoo object indexed by nanotime
    Y <- zoo(.x$LOGPRICE, order.by = .x$NANOTIME)
    
    # local level model with unknown state (Q) and observation (H) variances
    model_structural <- SSModel(Y ~ SSMtrend(degree = 1, Q = list(NA)), H = NA)
    fit_structural <- fitSSM(model_structural,
                             inits = c(Q = 0.9, H = 0.1),
                             method = "BFGS")
    
    # generate two new variables var_h, var_q
    var_h <- fit_structural$model$H[1, 1, 1]  # Extract observation variance (H)
    var_q <- fit_structural$model$Q[1, 1, 1]  # Extract state variance (Q)
    
    # add these variables to the original data frame
    .x %>% mutate(var_h = var_h, var_q = var_q)
  }) %>%
  ungroup()