23 Decomposing High-frequency Time Series

In this chapter, we will work with ultra-high-frequency trading data that consists of time series measured with nanosecond precision. Our task is to decompose stock prices into temporary and permanent components using state-space modeling.

In the next chapter, we will link the temporary and permanent stock price components with tweets to generate the two outcome variables: the temporary and the permanent stock price impact of firm-generated content (FGC) on Twitter.

23.1 Temporary and permanent price impacts

In this study, price impact is defined as the effect of an event on the variance of a stock's price.

A firm’s stock price reflects information relevant to the value of a firm, but is also distorted by noise. Estimating the proportion of stock price driven by information (permanent component) and the proportion driven by noise (temporary component) is a critical aspect of this analysis.

Temporary price impacts are short-term impacts that result in momentary changes in the price of a stock before it returns to its pre-FGC value. Temporary price impacts are often the result of uninformed trader activity.

Sources of such noise include trading frictions due to low liquidity (liquidity being the ability to trade large quantities of a firm’s stock quickly with little or no price impact) and the activity of traders who lack adequate information about the value of a stock, known as “uninformed traders”.

Permanent price impacts often result in the price attaining an enduring new value after an event (e.g., FGC). This occurs when the event provides information that updates informed investor/trader expectations related to a firm’s long-term performance.

A crucial element in such analysis is knowing the event time (i.e., timestamp), which refers to the time at which an event occurs, such as the FGC dissemination time. An event (e.g., firm-generated content on Twitter) can generate a permanent price impact and a temporary price impact.

This study estimates the permanent and temporary price impacts of FGC by first conducting a state-space decomposition of firm stock price into its permanent and temporary components and then linking the changes in these components to individual pieces of FGC (i.e., tweets).

23.2 State-space model

State-space modeling is commonly used for decomposition of price.

It is a tool for describing a phenomenon that has an underlying system with a time-varying dynamical relationship. The state of the system at time \(t\) is related to the state of the system at time \(t-1\). If the state of the system at time \(t-1\) is known, then the state at \(t\) can be inferred.

We cannot observe the true underlying state of the system; instead, we observe a noisy version of it. In its simplest form, we can specify a state equation and an observation equation to summarize this characteristic.

The observation equation describes how the underlying state is transformed (with noise added) into something that we directly measure.

\[y(t) = \alpha(t) + \epsilon(t)\]

The state equation describes how the system evolves from one time point to the next.

\[\alpha(t) = \alpha(t-1) + \eta(t)\]

It is assumed that \(\epsilon(t)\) is normally distributed with mean 0 and covariance \(H(t)\), and that \(\eta(t)\) is normally distributed with mean 0 and covariance \(Q(t)\).
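
To make the two equations concrete, the following minimal sketch simulates data from them in base R. The sample size, the two standard deviations, the seed, and the starting value \(\alpha(0) = 0\) are illustrative assumptions, not values from the study.

```r
# Simulate a local level model: a random-walk state observed with noise.
# n, sigma_eta, sigma_eps and the seed are illustrative assumptions.
set.seed(1)
n         <- 200    # number of time points
sigma_eta <- 0.05   # sd of the state disturbance eta(t)
sigma_eps <- 0.20   # sd of the observation noise epsilon(t)

eta   <- rnorm(n, mean = 0, sd = sigma_eta)
alpha <- cumsum(eta)                                  # state: alpha(t) = alpha(t-1) + eta(t), alpha(0) = 0
y     <- alpha + rnorm(n, mean = 0, sd = sigma_eps)   # observation: y(t) = alpha(t) + epsilon(t)

# First differences of the state recover the disturbances.
all.equal(diff(alpha), eta[-1])
```

In practice only `y` is observed; `alpha` is the latent quantity that filtering and smoothing try to recover.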

In this study, the structure of the state-space model for price is expressed through the observation and state equations below.

\[v_{s,t,\tau} = m_{s,t,\tau} + i_{s,t,\tau}\]

\[m_{s,t,\tau} = m_{s,t,\tau-1} + u_{s,t,\tau}\]

\[v_{s,t,\tau} = \ln(p_{s,t,\tau})\]

The model covers \(S\) stocks, \(T\) intraday periods, and \(N\) intervals, with \(s = 1,...,S\), \(\tau = 1,...,T\), and \(t = 1,...,N\). Each interval \(t\) is one second long (the time interval of the analysis), so \(N\) equals the number of one-second intervals during a stock trading day. \(p_{s,t,\tau}\) is the price of stock \(s\) at interval \(t\) and period \(\tau\).

\(m_{s,t,\tau}\) is a nonstationary permanent component of the price of stock \(s\) at interval \(t\) and period \(\tau\). \(i_{s,t,\tau}\) is a stationary transitory component of the price of stock \(s\) at interval \(t\) and period \(\tau\). \(u_{s,t,\tau}\) is an idiosyncratic disturbance error in the permanent price component of stock \(s\) at interval \(t\) and period \(\tau\).

We will estimate the variances \(\sigma_{u,s,t}^{2}\) (permanent component) and \(\sigma_{i,s,t}^{2}\) (temporary component) by maximum likelihood, with the likelihood evaluated by the Kalman filter.
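
A short derivation shows why these two variances separate permanent from temporary price movements. It rests on assumptions that go beyond what the equations above state: suppose, for illustration only, that \(i_{s,t,\tau}\) is serially uncorrelated with variance \(\sigma_i^2\), that \(u_{s,t,\tau}\) has variance \(\sigma_u^2\), and that the two are uncorrelated (the subscripts \(s,t\) on the variances are suppressed). Differencing the observation equation and substituting the state equation gives

\[\Delta v_{s,t,\tau} = v_{s,t,\tau} - v_{s,t,\tau-1} = u_{s,t,\tau} + i_{s,t,\tau} - i_{s,t,\tau-1}\]

\[\mathrm{Var}(\Delta v_{s,t,\tau}) = \sigma_u^2 + 2\sigma_i^2\]

\[\mathrm{Cov}(\Delta v_{s,t,\tau}, \Delta v_{s,t,\tau-1}) = -\sigma_i^2\]

Under these assumptions, the permanent disturbance contributes to the variance of price changes once, while the transitory component contributes twice and induces negative first-order autocovariance, because a temporary deviation is reversed in the next period.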

filtering

The main goal of state-space modeling is to gain knowledge of the unknown latent state \(\alpha\) given the observations \(y\). This is achieved with two important recursive algorithms, Kalman filtering and Kalman smoothing. Filtering updates the estimate of the true state each time a new observation arrives. An example of filtering is adjusting the estimate of the unemployment level in the United States based on the latest data release from the Bureau of Labor Statistics, given the history of monthly unemployment data.
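
The filtering recursion can be sketched by hand for the simple model \(y(t) = \alpha(t) + \epsilon(t)\), \(\alpha(t) = \alpha(t-1) + \eta(t)\) with known, constant variances \(H\) and \(Q\). The helper `kalman_local_level()` and the simulated data below are hypothetical and written only for illustration; later in the chapter the KFAS package does this work, and it also estimates the variances.

```r
# Hand-written Kalman filter for a local level model with known variances.
# `kalman_local_level` is a hypothetical helper, not part of any package.
kalman_local_level <- function(y, H, Q, a1 = 0, P1 = 1e7) {
  n   <- length(y)
  att <- numeric(n)   # filtered estimates of alpha(k) given y(1), ..., y(k)
  Ptt <- numeric(n)   # variances of the filtered estimates
  a <- a1             # predicted state
  P <- P1             # predicted state variance (large value = vague prior)
  for (k in seq_len(n)) {
    v  <- y[k] - a          # prediction error
    Fv <- P + H             # variance of the prediction error
    K  <- P / Fv            # Kalman gain
    att[k] <- a + K * v     # update the estimate with the new observation
    Ptt[k] <- P * (1 - K)
    a <- att[k]             # a random-walk state is predicted to stay put
    P <- Ptt[k] + Q         # prediction uncertainty grows by Q
  }
  list(att = att, Ptt = Ptt)
}

# Simulated example (all numbers are assumptions chosen for illustration).
set.seed(1)
n     <- 200
alpha <- cumsum(rnorm(n, sd = 0.05))
y     <- alpha + rnorm(n, sd = 0.20)
filt  <- kalman_local_level(y, H = 0.20^2, Q = 0.05^2)

mse_raw  <- mean((y - alpha)^2)          # error of the raw observations
mse_filt <- mean((filt$att - alpha)^2)   # error of the filtered estimates
c(raw = mse_raw, filtered = mse_filt)
```

The gain `K` weighs each new observation against the current prediction: a noisy measurement (large `H`) moves the estimate only slightly, while a volatile state (large `Q`) makes the filter follow the data more closely.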

For our purpose, state-space modeling with a Kalman filter offers a solution to dealing with the unequal time intervals or irregular frequency inherent in intraday high-frequency transaction data like stock price movements. The Kalman filter facilitates the decomposition of any change in the time series (e.g., variance in the stock prices).

For an illustrative introduction of Kalman filtering, watch the video Kalman Filtering with Applications in Finance.

We will use the package KFAS for state-space modeling, which operates on time series data. Therefore, the first step of building the state-space model is to represent stock price in the format of time series.

23.3 Time series

Time series data are observations or measurements that are indexed according to time. The time index has a special ordering, which is a key property of time series data, and which distinguishes it from other types of data.

representing time series

Choice of time series representation in R is critical, because the choice affects not only how the data is stored, but also which functions will be available for processing, analyzing, and plotting our time series data.

To conduct time series analysis in R, specialized packages are often employed, which require an object class specifically designed for time series data. Prior to analysis, it is often necessary to transform the data format to ensure compatibility with these packages.

Two popular time series packages for R users are zoo and xts; xts is an extension of zoo. Both can create time indexes with millisecond precision. However, the timestamps in our dataset are measured with nanosecond resolution. To handle this level of precision, we will use the nanotime package, which is compatible with zoo. We therefore choose zoo for our analysis.

zoo for handling time series data

zoo provides infrastructure for fundamental tasks in time series analysis, such as data reading, handling, aggregation, and transformation. But it does not provide time series modeling functionality.

zoo also provides interfaces to other time series classes on CRAN, so it is relatively easy to pass time series information between those classes and zoo.

Given these two features, time series modeling is best done with add-on packages that work with zoo.

We will use the function zoo() to format the time series data. A zoo time series consists of ordered observations that are stored internally in a vector or matrix with an index attribute. The index must have as many elements as the matrix has rows (or as the vector has elements).
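
As a minimal sketch of this structure, the toy series below pairs three illustrative prices with three second-level timestamps. The values are made up for the example and are deliberately supplied out of order to show that zoo() sorts the observations by the index.

```r
library(zoo)

# Toy inputs (illustrative values only), given out of time order on purpose.
times  <- as.POSIXct(c("2022-01-03 04:00:07",
                       "2022-01-03 04:00:00",
                       "2022-01-03 04:00:01"), tz = "UTC")
prices <- c(570.00, 570.00, 568.86)

z <- zoo(prices, order.by = times)

index(z)      # the time index, now in increasing order
coredata(z)   # the observations, reordered to match the index
```

index() and coredata() return the two parts of the object separately, which is convenient when passing the data to a modeling package.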

Below is a sample of our trades data: the trades of ADBE (Adobe) on 2022-01-03, 86,624 records in total.

Let’s take a look at the ADBE subset.

head(adbe)
##         DATE            TIME_M EX SYM_ROOT TR_SCOND SIZE  PRICE TR_SEQNUM
## 1 2022-01-03 4:00:00.093560347  P     ADBE     @ TI   15 570.00      1896
## 2 2022-01-03 4:00:00.132580283  K     ADBE     @FTI   30 568.86      1911
## 3 2022-01-03 4:00:01.951219659  P     ADBE     @ TI    1 570.00      2008
## 4 2022-01-03 4:00:07.693415312  P     ADBE     @FTI    4 570.00      2054
## 5 2022-01-03 4:01:23.869859763  P     ADBE     @ TI   19 570.00      2335
## 6 2022-01-03 4:01:31.439216269  P     ADBE     @FTI    2 570.00      2358

Previously, we acquired this ultra-high-frequency trading data of ADBE and other stocks from the database TAQ - Millisecond Consolidated Trades in WRDS. Below is the codebook of the variables in these datasets.

  • DATE: Date of trade
  • TIME_M: Time of trade
  • EX: Exchange that issued the trade
  • SYM_ROOT: Security symbol root
  • TR_SCOND: Trade Sale Condition
  • SIZE: Volume of trade
  • PRICE: Price of trade
  • TR_SEQNUM: Trade Sequence Number

nanotime resolution

The variables that record timestamps of the transactions are DATE and TIME_M.

Both are character vectors.

head(adbe$DATE)
## [1] "2022-01-03" "2022-01-03" "2022-01-03" "2022-01-03" "2022-01-03" "2022-01-03"
class(adbe$DATE)
## [1] "character"

TIME_M is recorded to the nanosecond precision.

head(adbe$TIME_M)
## [1] "4:00:00.093560347" "4:00:00.132580283" "4:00:01.951219659" "4:00:07.693415312" "4:01:23.869859763"
## [6] "4:01:31.439216269"
class(adbe$TIME_M)
## [1] "character"

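To create a zoo object to represent the time series, a natural first idea is to format the index as a POSIXct object. However, POSIXct stores time as a double-precision number of seconds, which in practice gives roughly microsecond resolution, so the last digits of our nanosecond timestamps would be lost.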

x <- "2022-01-03 09:00:35.093560347"
y <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%OS")
format(y, "%Y-%m-%d %H:%M:%OS6")
## [1] "2022-01-03 09:00:35.093560"

To set the index to the nanotime precision, we introduce the nanotime package and the formatting helper nanotime().

library(zoo)
library(nanotime)

formatting time index to the nanotime precision

The format for the nanotime index is “1970-01-01T00:00:00.000000001+00:00”. Below we set up the index format that nanotime can recognize.

  1. Fix the hour components in TIME_M. Hours before 10 a.m. are currently represented with only one digit (e.g., “9:00:14”) and need to be formatted with two digits (e.g., “09:00:14”). We first strip the fractional seconds from TIME_M.
adbe$TIME <- substr(adbe$TIME_M, 1, nchar(adbe$TIME_M) - 10) 

head(adbe$TIME)
## [1] "4:00:00" "4:00:00" "4:00:01" "4:00:07" "4:01:23" "4:01:31"

We’ll use as.POSIXct() to create a new datetime variable and fix the format.

# paste() inserts the space between date and time that the format string expects.
adbe$DATETIME1 <- as.POSIXct(paste(adbe$DATE, adbe$TIME), 
                               format = "%Y-%m-%d %H:%M:%S",
                               tz = "EST")

head(adbe$DATETIME1)
## [1] "2022-01-03 04:00:00 EST" "2022-01-03 04:00:00 EST" "2022-01-03 04:00:01 EST" "2022-01-03 04:00:07 EST"
## [5] "2022-01-03 04:01:23 EST" "2022-01-03 04:01:31 EST"
  2. Paste the nanoseconds and the “+00:00” components to the datetime variable.

“+00:00” is a time zone offset that represents UTC. Since the time zone does not affect the analyses we are going to perform, we will simply include the UTC offset in the datetime string to ensure that the format matches in general.

adbe$DATETIME2 <- 
  paste0(adbe$DATETIME1, 
         substr(adbe$TIME_M, nchar(adbe$TIME_M)-9, nchar(adbe$TIME_M)), 
         "+00:00")

head(adbe$DATETIME2)
## [1] "2022-01-03 04:00:00.093560347+00:00" "2022-01-03 04:00:00.132580283+00:00"
## [3] "2022-01-03 04:00:01.951219659+00:00" "2022-01-03 04:00:07.693415312+00:00"
## [5] "2022-01-03 04:01:23.869859763+00:00" "2022-01-03 04:01:31.439216269+00:00"

The nanoseconds are extracted from TIME_M using substr().

  3. Use nanotime() to convert the strings into a time index with nanosecond precision.
adbe$DATETIME <- nanotime(adbe$DATETIME2)

head(adbe$DATETIME)
## [1] 2022-01-03T04:00:00.093560347+00:00 2022-01-03T04:00:00.132580283+00:00 2022-01-03T04:00:01.951219659+00:00
## [4] 2022-01-03T04:00:07.693415312+00:00 2022-01-03T04:01:23.869859763+00:00 2022-01-03T04:01:31.439216269+00:00
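
The string handling in steps 1 and 2 can also be sketched without the detour through POSIXct. The helper `pad_taq_time()` below is hypothetical and uses only base R: it zero-pads a one-digit hour with a regular expression and assembles a string in the layout shown at the start of this subsection. The two input times mimic TIME_M values and are illustrative.

```r
# `pad_taq_time` is a hypothetical helper, not part of nanotime or zoo.
pad_taq_time <- function(date, time_m) {
  # Prepend "0" when the hour has a single digit: "4:00:00.09" -> "04:00:00.09".
  time_padded <- sub("^([0-9]):", "0\\1:", time_m)
  # Assemble date, "T" separator, time, and the UTC offset.
  paste0(date, "T", time_padded, "+00:00")
}

stamps <- pad_taq_time("2022-01-03",
                       c("4:00:00.093560347", "15:59:59.000000001"))
stamps
# The resulting strings could then be passed to nanotime(), as in step 3.
```

Times that already have a two-digit hour are left unchanged, because the pattern only matches a single digit followed directly by a colon.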

23.4 Cleaning high-frequency trading data

To prepare the time series data for state-space modeling, we will follow the original paper and perform data cleaning using the criteria in Chordia, Roll, and Subrahmanyam (2001) and Ibikunle (2015), documented in the Data sections.

Chordia, T., Roll, R., & Subrahmanyam, A. (2001). Market Liquidity and Trading Activity. The Journal of Finance, 56(2), 501-530. https://doi.org/10.1111/0022-1082.00335.

Ibikunle, G. (2015). Opening and Closing Price Efficiency: Do Financial Markets Need the Call Auction? Journal of International Financial Markets, Institutions and Money, 34, 208-227. https://doi.org/10.1016/j.intfin.2014.11.014.

We will refer to DAILY TAQ CLIENT SPECIFICATION (2017) and the variable descriptions provided by WRDS as our codebooks.

Below we apply the cleaning steps from these two papers that are relevant to our project.

(1) Trades out of sequence and trades with special settlement conditions are purged.

A trade out of sequence is a transaction that printed late, which may include the following types of transactions: Cash (only) Market, Average Price Trade, Next Day (only) Market, and Sold.

The variable TR_SCOND defines trade sale condition. More than one code can be displayed in the field (up to 4 codes).

table(adbe$TR_SCOND)
## 
##     @  @  I  @  M  @  Q   @ T  @ TI  @ TP  @ TW  @ ZI    @4  @4 I  @4 W  @6 X  @7 V    @F  @F I   @FT  @FTI  @O X 
##  6191 44133     2     2    31  1032     7     3   263     1   627   126     1    15  2815 30768     4   588     1 
##  C  I  C TI  N  I   N T  N TW 
##     5     2     4     1     2

We keep a trade if it is not a trade out of sequence and if it is not a trade with special settlement conditions. We go through each condition to filter the trades.

sale_cond <- c("A", "B", "N", "R", "Z")
exclude_trades <- apply(sapply(sale_cond, grepl, adbe$TR_SCOND), 1, any)
adbe <- adbe[!exclude_trades, ]

table(adbe$TR_SCOND)
## 
##     @  @  I  @  M  @  Q   @ T  @ TI  @ TP  @ TW    @4  @4 I  @4 W  @6 X  @7 V    @F  @F I   @FT  @FTI  @O X  C  I 
##  6191 44133     2     2    31  1032     7     3     1   627   126     1    15  2815 30768     4   588     1     5 
##  C TI 
##     2

These conditions are denoted by the codes below.

  • A: Cash-Only Basis
  • B: Average Price Trade
  • N: Next day - Calls for delivery of securities on the first business day following the day of the contract
  • R: Seller - Delivery date is specified by the seller and must be between two and sixty calendar days following the day of the contract
  • Z: Sold Sale - A transaction that is reported to the tape at a time later than it occurred and when other trades occurred between the time of the transaction and its report time
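
Because TR_SCOND can hold up to four codes in one field, the filter has to look for each excluded code anywhere in the string. As an alternative sketch to the sapply() call above, the same check can be written with a single regular-expression character class. The toy data frame below is illustrative; its values are sale-condition strings of the kind seen in the table output.

```r
# Toy data (illustrative): a few sale-condition strings.
toy <- data.frame(TR_SCOND = c("@ TI", "@FTI", "@ ZI", "N T", "@F I", "C  I"),
                  stringsAsFactors = FALSE)

drop_codes <- c("A", "B", "N", "R", "Z")

# Build the character class "[ABNRZ]": matches if any excluded code appears.
pattern     <- paste0("[", paste(drop_codes, collapse = ""), "]")
is_excluded <- grepl(pattern, toy$TR_SCOND)

toy_clean <- toy[!is_excluded, , drop = FALSE]
toy_clean$TR_SCOND
```

Here "@ ZI" (contains Z) and "N T" (contains N) are dropped, while codes such as T, F, I, and C are unaffected.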

(2) Use only NYSE stocks to avoid any possibility of the results being influenced by differences in trading protocols.

ADBE was traded on the following exchanges on Jan 03.

table(adbe$EX)
## 
##     A     B     C     D     H     J     K     M     N     P     Q     U     V     X     Y     Z 
##   212  1132   447 27796   452  1137  6924   326   867  8479 26826   749  3517   821   758  5911

EX is the exchange that issued the trade. N stands for the New York Stock Exchange (NYSE), one of the major US stock exchanges.

adbe <- subset(adbe, EX == "N")

table(adbe$EX)
## 
##   N 
## 867

Below is a partial list of the codes for the exchange on which a trade occurred:

  • A: American Stock Exchange (AMEX)
  • N: New York Stock Exchange (NYSE)
  • C: National Stock Exchange (NSX)
  • T/Q: NASDAQ
  • M: Chicago Stock Exchange (CHX)

An exchange is a marketplace where securities, commodities, derivatives and other financial instruments are traded. The core function of an exchange is to ensure fair and orderly trading and the efficient dissemination of price information for any securities trading on that exchange.

The great majority of trades are completed through electronic means without regard to a physical location. This process has resulted in a substantial increase in high-frequency trading programs and the use of complex algorithms by traders on exchanges.

(3) To avoid the influence of unduly high-priced stocks, if the price is greater than $999, the stock is deleted from the sample.

adbe <- subset(adbe, subset = PRICE <= 999)

PRICE is price of trade.

(4) Remove any duplicates.

The trades file is sorted by symbol, time, and trade sequence number.

adbe <- adbe[!duplicated(adbe[, c("DATETIME","SYM_ROOT","TR_SEQNUM")]), ]

SYM_ROOT is the security symbol root. TR_SEQNUM is the trade sequence number.
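
A toy example shows what the duplicate check does. The rows below are invented for illustration (the timestamps are kept as plain strings and the sequence numbers are made up): the first two rows agree on all three key columns, so only the first of them is kept.

```r
# Toy data (illustrative values): rows 1 and 2 share the same key.
toy <- data.frame(
  DATETIME  = c("2022-01-03T09:30:00.193260212+00:00",
                "2022-01-03T09:30:00.193260212+00:00",
                "2022-01-03T09:30:04.016984663+00:00"),
  SYM_ROOT  = "ADBE",
  TR_SEQNUM = c(1340, 1340, 1564),
  PRICE     = c(567.03, 567.03, 567.44),
  stringsAsFactors = FALSE
)

key        <- c("DATETIME", "SYM_ROOT", "TR_SEQNUM")
toy_unique <- toy[!duplicated(toy[, key]), ]

nrow(toy_unique)
```

duplicated() flags the second and later occurrences of a key, so the first record of each trade survives.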

23.5 Building state-space model

Now we are ready to create the state-space model to decompose stock prices. We will create a local level model following the original study. The authors provided SAS code for this step.

The local level model describes a series as the sum of an unobserved stochastic trend (a level that follows a random walk) and an irregular component. Because the local level model contains an unobserved component, it fits nicely into the state-space framework. Both the unobserved component and the unknown parameters can be estimated using the Kalman filter and maximum likelihood estimation.

creating time intervals

To build the model, first, we create a variable for time interval at the level of seconds to estimate changes in the components of price.

adbe$SEC <- substr(adbe$DATETIME2, 1, 19)

adbe[1:10, c("SEC", "DATETIME", "SYM_ROOT", "PRICE")]
##                      SEC                            DATETIME SYM_ROOT  PRICE
## 1340 2022-01-03 09:30:00 2022-01-03T09:30:00.193260212+00:00     ADBE 567.03
## 1341 2022-01-03 09:30:00 2022-01-03T09:30:00.193262590+00:00     ADBE 567.67
## 1564 2022-01-03 09:30:04 2022-01-03T09:30:04.016984663+00:00     ADBE 567.44
## 1604 2022-01-03 09:30:04 2022-01-03T09:30:04.487710774+00:00     ADBE 567.45
## 1791 2022-01-03 09:30:08 2022-01-03T09:30:08.152652440+00:00     ADBE 568.37
## 2006 2022-01-03 09:30:17 2022-01-03T09:30:17.098855331+00:00     ADBE 569.50
## 2458 2022-01-03 09:30:43 2022-01-03T09:30:43.060779961+00:00     ADBE 571.48
## 2488 2022-01-03 09:30:44 2022-01-03T09:30:44.442738051+00:00     ADBE 570.68
## 2489 2022-01-03 09:30:44 2022-01-03T09:30:44.442840378+00:00     ADBE 570.60
## 2524 2022-01-03 09:30:50 2022-01-03T09:30:50.653853291+00:00     ADBE 571.42

creating groups by datetime, ticker, and time interval

Next, we create a grouping variable using datetime, ticker and time interval. The state-space model will run on each group.

id <- transform(adbe, ID = as.numeric(interaction(SYM_ROOT, SEC, drop = TRUE)))

head(id)
##      SYM_ROOT                 SEC  PRICE                            DATETIME ID
## 1340     ADBE 2022-01-03 09:30:00 567.03 2022-01-03T09:30:00.193260212+00:00  1
## 1341     ADBE 2022-01-03 09:30:00 567.67 2022-01-03T09:30:00.193262590+00:00  1
## 1564     ADBE 2022-01-03 09:30:04 567.44 2022-01-03T09:30:04.016984663+00:00  2
## 1604     ADBE 2022-01-03 09:30:04 567.45 2022-01-03T09:30:04.487710774+00:00  2
## 1791     ADBE 2022-01-03 09:30:08 568.37 2022-01-03T09:30:08.152652440+00:00  3
## 2006     ADBE 2022-01-03 09:30:17 569.50 2022-01-03T09:30:17.098855331+00:00  4
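
To see how interaction() forms the groups, the sketch below uses a small toy data frame. The second ticker, "MSFT", is a hypothetical addition for illustration, since the sample in this chapter contains only ADBE.

```r
# Toy data (illustrative): two tickers, two one-second intervals.
toy <- data.frame(
  SYM_ROOT = c("ADBE", "ADBE", "ADBE", "MSFT"),
  SEC      = c("2022-01-03 09:30:00", "2022-01-03 09:30:00",
               "2022-01-03 09:30:04", "2022-01-03 09:30:00"),
  stringsAsFactors = FALSE
)

# One factor level per observed ticker-by-second combination;
# drop = TRUE removes combinations that never occur in the data.
toy$ID <- as.numeric(interaction(toy$SYM_ROOT, toy$SEC, drop = TRUE))
toy
```

Rows that share both ticker and second receive the same ID, so the first two ADBE rows fall into one group and the data contain three groups in total. The numeric labels follow the factor level order, not the row order.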

representing stock price series

We create a zoo object to represent the stock price series.

id$LOGPRICE <- log(id$PRICE)
id$PRICE <- NULL

dfzoo <- zoo(id[,-4], order.by = id$DATETIME)

head(dfzoo)
##                                     SYM_ROOT SEC                 DATETIME                            LOGPRICE
## 2022-01-03T09:30:00.193260212+00:00 ADBE     2022-01-03 09:30:00 2022-01-03T09:30:00.193260212+00:00 6.340412
## 2022-01-03T09:30:00.193262590+00:00 ADBE     2022-01-03 09:30:00 2022-01-03T09:30:00.193262590+00:00 6.341540
## 2022-01-03T09:30:04.016984663+00:00 ADBE     2022-01-03 09:30:04 2022-01-03T09:30:04.016984663+00:00 6.341135
## 2022-01-03T09:30:04.487710774+00:00 ADBE     2022-01-03 09:30:04 2022-01-03T09:30:04.487710774+00:00 6.341153
## 2022-01-03T09:30:08.152652440+00:00 ADBE     2022-01-03 09:30:08 2022-01-03T09:30:08.152652440+00:00 6.342773
## 2022-01-03T09:30:17.098855331+00:00 ADBE     2022-01-03 09:30:17 2022-01-03T09:30:17.098855331+00:00 6.344759

obtaining changes in price variances following a tweet

We will use the package KFAS to build a local level model in the state-space framework.

library(KFAS)

Y <- dfzoo[,"LOGPRICE"]
model_structural <- SSModel(Y ~ SSMtrend(degree = 1, Q = list(matrix(NA))), H = matrix(NA))
fit_structural <- fitSSM(model_structural, inits = c(0, 0), method = "BFGS")
  1. The function SSModel() builds the model.

The first argument to SSModel() is a formula that defines the observations (left side of the tilde operator ~) and the structure of the state equation (right side of ~).

  2. In SSMtrend(degree = 1, Q = list(matrix(NA))), degree defines the degree of the polynomial trend component, where 1 corresponds to a local level model (see page 14 in KFAS: Exponential Family State Space Models in R).

  3. SSModel() does not estimate unknown parameters; these can be estimated by fitSSM() using maximum likelihood.

fitSSM() estimates the NA values in the time-invariant covariance matrices H and Q. The NA values represent the unknown variance parameters \(\sigma_\epsilon^{2}\) and \(\sigma_\eta^{2}\) (see page 10 in KFAS reference manual on fitSSM()).

  4. Estimates of the variance parameters can be extracted with fit_structural$model$H (temporary component) and fit_structural$model$Q (permanent component).
fit_structural$model$H
## , , 1
## 
##              [,1]
## [1,] 1.415346e-10
fit_structural$model$Q
## , , 1
## 
##              [,1]
## [1,] 4.819639e-07
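
To check that this workflow behaves as expected, it can help to run it on simulated data where the true variances are known. The sketch below is an illustration under assumptions, not the study's procedure: the simulated series, the seed, and the choice of starting values are all made up, and the comment on the scale of inits reflects our reading of the default fitSSM() behavior. It also uses KFS() to extract the smoothed level, which gives an estimate of the permanent component itself rather than only its variance.

```r
library(KFAS)

# Simulated log-price-like series (illustrative values only).
set.seed(42)
n     <- 500
level <- cumsum(rnorm(n, sd = 0.01))   # permanent component (random walk)
y     <- level + rnorm(n, sd = 0.02)   # observed series = level + noise

model <- SSModel(y ~ SSMtrend(degree = 1, Q = list(matrix(NA))),
                 H = matrix(NA))

# Starting values: to our understanding the default update function treats
# them as log-variances, so we start from the log of a data-based variance.
init <- rep(log(var(diff(y))), 2)
fit  <- fitSSM(model, inits = init, method = "BFGS")

est_H <- fit$model$H[1, 1, 1]   # estimated noise variance (true: 0.02^2)
est_Q <- fit$model$Q[1, 1, 1]   # estimated level variance (true: 0.01^2)
c(H = est_H, Q = est_Q)

# Smoothed estimate of the level, and the implied temporary component.
out        <- KFS(fit$model)
level_hat  <- as.numeric(out$alphahat)
transitory <- y - level_hat
```

Comparing `est_H` and `est_Q` with the squared standard deviations used in the simulation gives a quick sanity check before the model is applied to real trades.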

applying to all stocks and time intervals

So far we’ve been working with ADBE, applying the local level model to a single day of its data.

To compute the changes in price variances for every stock in every time interval in the entire time span of the project, we can use a loop and create a data frame to store the results for later analysis.

In the sample code below, dfvar would be a container to store output variances. id would be a grouping variable by datetime, ticker, and time interval. dfzoo would be a time series representation for each group in the sample.

# Container for the estimated variances; one row is filled in per group.
dfvar <- data.frame(sec = NA, symbol = NA, var_h = NA, var_q = NA)

for (i in unique(id$ID)) {
  print(i)  # progress indicator
  df <- subset(id, ID == i)
  
  # Time series representation of this group, then the local level model.
  dfzoo <- zoo(df[,-4], order.by = df$DATETIME)
  Y <- dfzoo[,"LOGPRICE"]
  model_structural <- SSModel(Y ~ SSMtrend(degree = 1, Q = list(matrix(NA))), H = matrix(NA))
  fit_structural <- fitSSM(model_structural, inits = c(0, 0), method = "BFGS")
  
  # Take the labels from the data frame so that plain values are stored.
  dfvar[i,"sec"] <- df$SEC[1]
  dfvar[i,"symbol"] <- df$SYM_ROOT[1]
  dfvar[i,"var_h"] <- fit_structural$model$H[,,1]  # temporary component
  dfvar[i,"var_q"] <- fit_structural$model$Q[,,1]  # permanent component
}

This task is computationally intensive and would require a high-performance computing service to run the code.