23 Decomposing High-frequency Time Series
In this chapter, we will work with ultra-high-frequency trading data that consists of time series measured with nanosecond precision. Our task is to decompose stock price into temporary and permanent components using state-space modeling.
In the next chapter, we will link the temporary and permanent stock price components to tweets to generate the two outcome variables: the temporary and permanent stock price impacts of firm-generated content (FGC) on Twitter.
23.1 Temporary and permanent price impacts
Price impact is defined as the impact on the variance of stock price.
A firm’s stock price reflects information relevant to the value of a firm, but is also distorted by noise. Estimating the proportion of stock price driven by information (permanent component) and the proportion driven by noise (temporary component) is a critical aspect of this analysis.
Temporary price impacts are short-term impacts that result in momentary changes in the price of a stock before it returns to its pre-FGC value. Temporary price impacts are often the result of uninformed trader activity.
Sources of this noise include trading frictions due to low levels of liquidity, defined as the ability to trade large quantities of a firm’s stock quickly with little or no price impact, and the activity of traders who lack adequate information regarding the value of a stock, known as “uninformed traders”.
Permanent price impacts often result in the price attaining an enduring new value after an event (e.g., FGC). This occurs when the event provides information that updates informed investor/trader expectations related to a firm’s long-term performance.
A crucial element in such analysis is knowing the event time (i.e., timestamp), which refers to the time at which an event occurs, such as the FGC dissemination time. An event (e.g., firm-generated content on Twitter) can generate a permanent price impact and a temporary price impact.
This study estimates the permanent and temporary price impacts of FGC by first conducting a state-space decomposition of firm stock price into its permanent and temporary components and then linking the changes in these components to individual pieces of FGC (i.e., tweets).
23.2 State-space model
State-space modeling is commonly used for decomposition of price.
It is a tool for describing a phenomenon that has an underlying system with a time-varying dynamical relationship. The state of the system at time \(t\) is related to the state of the system at time \(t-1\). If the state of the system at time \(t-1\) is known, then the state at \(t\) can be inferred.
We cannot observe the true underlying state of the system; we observe only a noisy version of it. In its simplest form, we can specify a state equation and an observation equation to summarize this characteristic.
The observation equation describes how the underlying state is transformed (with noise added) into something that we directly measure.
\[y(t) = \alpha(t) + \epsilon(t)\]
The state equation describes how the system evolves from one time point to the next.
\[\alpha(t) = \alpha(t-1) + \eta(t)\]
It is assumed that \(\epsilon(t)\) is normally distributed with mean 0 and covariance \(H(t)\), and that \(\eta(t)\) is normally distributed with mean 0 and covariance \(Q(t)\).
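To make the two equations concrete, the sketch below simulates a series from them in base R. It assumes constant variances (so \(H(t) = H\) and \(Q(t) = Q\)), a starting state of zero, and arbitrary illustrative values for the two variances; none of these choices come from the study.

```r
# Simulate the local level model:
#   state:        alpha(t) = alpha(t-1) + eta(t),  eta ~ N(0, Q)
#   observation:  y(t)     = alpha(t)   + eps(t),  eps ~ N(0, H)
set.seed(42)

n       <- 500
var_eta <- 0.01   # Q: variance of the state disturbance (illustrative value)
var_eps <- 0.04   # H: variance of the observation noise (illustrative value)

eta <- rnorm(n, mean = 0, sd = sqrt(var_eta))
eps <- rnorm(n, mean = 0, sd = sqrt(var_eps))

alpha <- cumsum(eta)   # random-walk state, assuming alpha(0) = 0
y     <- alpha + eps   # what we actually observe

head(cbind(alpha, y))
```

Because the first difference of the observed series equals \(\eta(t) + \epsilon(t) - \epsilon(t-1)\), its variance is \(Q + 2H\) under these assumptions, which is one way to see that both variances leave a trace in the observed data.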
In this study, the structure of the state-space model for price is expressed through the observation and state equations below.
\[v_{s,t,\tau} = m_{s,t,\tau} + i_{s,t,\tau}\]
\[m_{s,t,\tau} = m_{s,t,\tau-1} + u_{s,t,\tau}\]
\[v_{s,t,\tau} = \ln(p_{s,t,\tau})\]
The model covers \(S\) stocks, \(T\) intraday periods, and \(N\) intervals, indexed by \(s = 1,...,S\), \(\tau = 1,...,T\), and \(t = 1,...,N\). Each interval \(t\) is one second long (the time interval of the analysis), so \(N\) equals the number of one-second intervals during a stock trading day. \(p_{s,t,\tau}\) is the price of stock \(s\) at interval \(t\) and period \(\tau\).
\(m_{s,t,\tau}\) is a nonstationary permanent component of the price of stock \(s\) at interval \(t\) and period \(\tau\). \(i_{s,t,\tau}\) is a stationary transitory component of the price of stock \(s\) at interval \(t\) and period \(\tau\). \(u_{s,t,\tau}\) is an idiosyncratic disturbance error in the permanent price component of stock \(s\) at interval \(t\) and period \(\tau\).
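We will estimate \(\sigma_{u,s,t}^{2}\), the variance of the permanent disturbance \(u_{s,t,\tau}\) (permanent component), and \(\sigma_{i,s,t}^{2}\), the variance of the transitory component \(i_{s,t,\tau}\) (temporary component), by maximum likelihood, with the likelihood constructed using the Kalman filter.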
filtering
The main goal of state-space modeling is to gain knowledge of the unknown latent state \(\alpha\) given the observations \(y\). This is achieved with two important recursive algorithms, Kalman filtering and Kalman smoothing. The purpose of filtering is to update the estimate of the true state each time an additional observation arrives. An example of filtering is adjusting the estimate of the unemployment level in the United States based on the latest data release from the Bureau of Labor Statistics, given the history of monthly unemployment data.
For our purpose, state-space modeling with a Kalman filter offers a solution to dealing with the unequal time intervals or irregular frequency inherent in intraday high-frequency transaction data like stock price movements. The Kalman filter facilitates the decomposition of any change in the time series (e.g., variance in the stock prices).
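The filtering recursion for the simple model above can be written in a few lines of base R. This is a minimal sketch for illustration only, not the routine that KFAS uses internally: the function name kalman_local_level, the starting values a1 and p1, and the toy data are all assumptions. A missing observation (NA) is handled by skipping the update step, which is how the filter copes with gaps in an irregularly spaced series.

```r
# Kalman filter for the local level model with known variances.
# y: observations (NA allowed), var_eps: H, var_eta: Q,
# a1/p1: mean and variance of the initial state (a large p1 = vague prior).
kalman_local_level <- function(y, var_eps, var_eta, a1 = 0, p1 = 1e7) {
  n <- length(y)
  a_filt <- numeric(n)   # filtered state estimates
  p_filt <- numeric(n)   # filtered state variances
  a_pred <- a1
  p_pred <- p1
  for (t in seq_len(n)) {
    if (is.na(y[t])) {
      # No observation: carry the prediction forward unchanged
      a_filt[t] <- a_pred
      p_filt[t] <- p_pred
    } else {
      k <- p_pred / (p_pred + var_eps)           # Kalman gain
      a_filt[t] <- a_pred + k * (y[t] - a_pred)  # update with the new observation
      p_filt[t] <- (1 - k) * p_pred
    }
    # Predict the next state: a random walk keeps the mean, adds variance
    a_pred <- a_filt[t]
    p_pred <- p_filt[t] + var_eta
  }
  list(state = a_filt, variance = p_filt)
}

y_toy <- c(1.0, 2.0, NA, 4.0)
kalman_local_level(y_toy, var_eps = 1, var_eta = 1)
```

With no observation noise (var_eps = 0) the gain equals one and the filtered state simply tracks the data; as var_eps grows relative to var_eta, the filter trusts each new observation less.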
For an illustrative introduction of Kalman filtering, watch the video Kalman Filtering with Applications in Finance.
We will use the package KFAS for state-space modeling, which operates on time series data. Therefore, the first step of building the state-space model is to represent stock price in the format of time series.
23.3 Time series
Time series data are observations or measurements that are indexed according to time. The time index has a special ordering, which is a key property of time series data, and which distinguishes it from other types of data.
representing time series
Choice of time series representation in R is critical, because the choice affects not only how the data is stored, but also which functions will be available for processing, analyzing, and plotting our time series data.
To conduct time series analysis in R, specialized packages are often employed, which require an object class specifically designed for time series data. Prior to analysis, it is often necessary to transform the data format to ensure compatibility with these packages.
Two popular time series packages for R users are zoo and xts; xts is an extension of zoo. Both zoo and xts are capable of creating time indexes with millisecond precision. However, the timestamps in our dataset are measured with nanosecond resolution. To handle this level of precision, we will use the nanotime package, which is compatible with zoo. Therefore, zoo is the better option for our analysis.
zoo for handling time series data
zoo provides infrastructure for fundamental tasks in time series analysis, such as data reading, handling, aggregation, and transformation. But it does not provide time series modeling functionality.
zoo is an interface to all other time series packages on CRAN, and therefore it is relatively easy to pass time series information between other time series classes and zoo.
Given these two features, time series modeling is best done with add-on packages that work with zoo.
We will use the function zoo() to format the time series data. A time series in R consists of ordered observations that are stored internally in a vector or matrix with an index attribute. The index must have the same length as the number of rows in the matrix.
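As a small illustration of the data-plus-index structure, the sketch below builds a zoo series from three made-up prices and deliberately unordered timestamps. The values and the second-level POSIXct index are assumptions chosen for brevity; the chapter's actual index uses nanosecond precision.

```r
library(zoo)

# Toy data: three prices with timestamps given out of order
prices <- c(570.00, 568.86, 570.00)
stamps <- as.POSIXct(c("2022-01-03 04:00:01",
                       "2022-01-03 04:00:00",
                       "2022-01-03 04:00:07"), tz = "UTC")

# zoo() pairs each observation with its index and orders the series by that index
z <- zoo(prices, order.by = stamps)

index(z)      # the time index, now in ascending order
coredata(z)   # the observations, reordered to match the index
```

The helper functions index() and coredata() pull the two parts of the object back out, which is convenient when another package expects a plain vector.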
Below is a sample of our trades data. It includes the data of stock ADBE (Adobe) on 2022-01-03. There are 86624 records in it.
Let’s take a look at the ADBE subset.
##         DATE            TIME_M EX SYM_ROOT TR_SCOND SIZE  PRICE TR_SEQNUM
## 1 2022-01-03 4:00:00.093560347  P     ADBE     @ TI   15 570.00      1896
## 2 2022-01-03 4:00:00.132580283  K     ADBE     @FTI   30 568.86      1911
## 3 2022-01-03 4:00:01.951219659  P     ADBE     @ TI    1 570.00      2008
## 4 2022-01-03 4:00:07.693415312  P     ADBE     @FTI    4 570.00      2054
## 5 2022-01-03 4:01:23.869859763  P     ADBE     @ TI   19 570.00      2335
## 6 2022-01-03 4:01:31.439216269  P     ADBE     @FTI    2 570.00      2358
Previously, we acquired this ultra-high-frequency trading data of ADBE and other stocks from the database TAQ - Millisecond Consolidated Trades in WRDS. Below is the codebook of the variables in these datasets.
- DATE: Date of trade
- TIME_M: Time of trade
- EX: Exchange that issued the trade
- SYM_ROOT: Security symbol root
- TR_SCOND: Trade Sale Condition
- SIZE: Volume of trade
- PRICE: Price of trade
- TR_SEQNUM: Trade Sequence Number
nanotime resolution
The variables that record timestamps of the transactions are DATE and TIME_M. Both are character vectors.
## [1] "2022-01-03" "2022-01-03" "2022-01-03" "2022-01-03" "2022-01-03" "2022-01-03"
## [1] "character"
TIME_M is recorded to nanosecond precision.
## [1] "4:00:00.093560347" "4:00:00.132580283" "4:00:01.951219659" "4:00:07.693415312" "4:01:23.869859763"
## [6] "4:01:31.439216269"
## [1] "character"
To create a zoo object to represent the time series, we might think of formatting the index as a POSIXct object. However, POSIXct offers only about microsecond resolution.
x <- "2022-01-03 09:00:35.093560347"
y <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%OS")
format(y, "%Y-%m-%d %H:%M:%OS6")
## [1] "2022-01-03 09:00:35.093560"
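The reason for this limit is that POSIXct stores a timestamp as a double-precision number of seconds since 1970-01-01. The sketch below works out the spacing between adjacent representable values near the example timestamp; the variable names are illustrative, and the UTC time zone is an assumption made only to keep the arithmetic reproducible.

```r
# Seconds since the epoch for the example timestamp (whole seconds only)
t_sec <- as.numeric(as.POSIXct("2022-01-03 09:00:35", tz = "UTC"))

# A double has a 52-bit fraction, so the gap between neighbouring
# representable values near t_sec is 2^(exponent - 52) seconds
ulp <- 2^(floor(log2(t_sec)) - 52)
ulp                      # roughly 2.4e-07 seconds, i.e. about 238 nanoseconds

# Adding a single nanosecond is therefore lost entirely
(t_sec + 1e-9) == t_sec
```

Since timestamps in 2022 can only be told apart at steps of a few hundred nanoseconds, a POSIXct index cannot distinguish trades that are a handful of nanoseconds apart.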
To set the index to nanosecond precision, we introduce the nanotime package and its formatting helper nanotime().
formatting time index to the nanotime precision
The format for the nanotime index is “1970-01-01T00:00:00.000000001+00:00”. Below we set up the index in a format that nanotime can recognize.
- Fix the hour components in TIME_M. Hours before 10 a.m. are currently represented with only one digit (e.g., “9:00:14”), and they need to be formatted to display two digits (e.g., “09:00:14”).
## [1] "4:00:00" "4:00:00" "4:00:01" "4:00:07" "4:01:23" "4:01:31"
We’ll use as.POSIXct() to create a new datetime variable and fix the format.
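# adbe$TIME is assumed to hold the whole-second part of TIME_M shown above
# paste() separates date and time with a space, matching the format string
adbe$DATETIME1 <- as.POSIXct(paste(adbe$DATE, adbe$TIME),
                             format = "%Y-%m-%d %H:%M:%S",
                             tz = "EST")
head(adbe$DATETIME1)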
## [1] "2022-01-03 04:00:00 EST" "2022-01-03 04:00:00 EST" "2022-01-03 04:00:01 EST" "2022-01-03 04:00:07 EST"
## [5] "2022-01-03 04:01:23 EST" "2022-01-03 04:01:31 EST"
- Paste the nanoseconds and the “+00:00” components to the datetime variable.
“+00:00” is a time zone offset that represents UTC. Since the time zone does not affect the analyses we are going to perform, we will simply include the UTC offset in the datetime string to ensure that the format matches in general.
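# Append the fractional seconds (the last 10 characters of TIME_M, including
# the decimal point) and the UTC offset to the whole-second timestamp
adbe$DATETIME2 <-
  paste0(format(adbe$DATETIME1, "%Y-%m-%d %H:%M:%S"),
         substr(adbe$TIME_M, nchar(adbe$TIME_M) - 9, nchar(adbe$TIME_M)),
         "+00:00")
head(adbe$DATETIME2)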
## [1] "2022-01-03 04:00:00.093560347+00:00" "2022-01-03 04:00:00.132580283+00:00"
## [3] "2022-01-03 04:00:01.951219659+00:00" "2022-01-03 04:00:07.693415312+00:00"
## [5] "2022-01-03 04:01:23.869859763+00:00" "2022-01-03 04:01:31.439216269+00:00"
The nanoseconds are extracted from TIME_M using substr(), which takes the last ten characters of each value (the decimal point plus nine digits).
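The string-building steps so far can be tried end to end on a two-row toy data frame. This is a sketch under stated assumptions: the data frame trades and its second row are made up, the helper column TIME is derived here with sub(), and tz = "UTC" is used only because the time zone label does not change the formatted clock time.

```r
# Toy input mirroring the DATE and TIME_M columns (second row is made up)
trades <- data.frame(
  DATE   = c("2022-01-03", "2022-01-03"),
  TIME_M = c("4:00:00.093560347", "13:05:09.000000001"),
  stringsAsFactors = FALSE
)

# Whole-second part of the time, e.g. "4:00:00"
trades$TIME <- sub("\\.[0-9]+$", "", trades$TIME_M)

# Fractional part including the decimal point, e.g. ".093560347"
nanos <- sub("^[^.]*", "", trades$TIME_M)

# Parsing and re-formatting pads single-digit hours to two digits
dt <- as.POSIXct(paste(trades$DATE, trades$TIME),
                 format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Re-attach the nanoseconds and add the UTC offset
trades$DATETIME2 <- paste0(format(dt, "%Y-%m-%d %H:%M:%S"), nanos, "+00:00")
trades$DATETIME2
```

Splitting on the decimal point with sub() gives the same fractional part as counting characters from the end, and it does not depend on the fraction always having nine digits.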
- Use nanotime() to represent the time series with nanosecond precision.
## [1] 2022-01-03T04:00:00.093560347+00:00 2022-01-03T04:00:00.132580283+00:00 2022-01-03T04:00:01.951219659+00:00
## [4] 2022-01-03T04:00:07.693415312+00:00 2022-01-03T04:01:23.869859763+00:00 2022-01-03T04:01:31.439216269+00:00
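A conversion along these lines could look as follows. This is a sketch, not necessarily the exact call used to produce the output above: the two input strings are copied from the earlier output, and the space between date and time is replaced with “T” as a precaution so that the strings match the nanotime format quoted earlier exactly.

```r
library(nanotime)

# Datetime strings in the form built in the previous step
datetime_chr <- c("2022-01-03 04:00:00.093560347+00:00",
                  "2022-01-03 04:00:00.132580283+00:00")

# Match the documented layout: date, "T", time with nine decimals, offset
iso_chr <- sub(" ", "T", datetime_chr, fixed = TRUE)

# Parse into a nanosecond-resolution time vector
idx <- nanotime(iso_chr)

format(idx)   # all nine fractional digits are preserved
```

The resulting vector can then serve as the order.by index of a zoo object, in the same way a POSIXct vector would.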
23.4 Cleaning high-frequency trading data
To prepare the time series data for state-space modeling, we will follow the original paper and perform data cleaning using the criteria in Chordia, Roll, and Subrahmanyam (2001) and Ibikunle (2015), documented in the Data sections.
Chordia, T., Roll, R., & Subrahmanyam, A. (2001). Market Liquidity and Trading Activity. The Journal of Finance, 56(2), 501-530. https://doi.org/10.1111/0022-1082.00335.
Ibikunle, G. (2015). Opening and Closing Price Efficiency: Do Financial Markets Need the Call Auction? Journal of International Financial Markets, Institutions and Money, 34, 208-227. https://doi.org/10.1016/j.intfin.2014.11.014.
We will refer to DAILY TAQ CLIENT SPECIFICATION (2017) and the variable descriptions provided by WRDS as our codebooks.
Below we apply the parts of these two papers that are relevant to our project to clean the data.
(1) Trades out of sequence and trades with special settlement conditions are purged.
A trade out of sequence is a transaction that printed late, which may include the following types of transactions: Cash (only) Market, Average Price Trade, Next Day (only) Market, and Sold.
The variable TR_SCOND defines the trade sale condition. More than one code can be displayed in the field (up to 4 codes).
##
##     @    @ I    @ M    @ Q    @ T   @ TI   @ TP   @ TW   @ ZI     @4
##  6191  44133      2      2     31   1032      7      3    263      1
##  @4 I   @4 W   @6 X   @7 V     @F   @F I    @FT   @FTI   @O X    C I
##   627    126      1     15   2815  30768      4    588      1      5
##  C TI    N I    N T   N TW
##     2      4      1      2
We keep a trade if it is not a trade out of sequence and if it is not a trade with special settlement conditions. We go through each condition to filter the trades.
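# Codes for out-of-sequence trades and special settlement conditions
sale_cond <- c("A", "B", "N", "R", "Z")
# Flag a trade if any of these codes appears anywhere in TR_SCOND
exclude_trades <- grepl(paste0("[", paste(sale_cond, collapse = ""), "]"), adbe$TR_SCOND)
adbe <- adbe[!exclude_trades, ]
table(adbe$TR_SCOND)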
##
##     @    @ I    @ M    @ Q    @ T   @ TI   @ TP   @ TW     @4   @4 I
##  6191  44133      2      2     31   1032      7      3      1    627
##  @4 W   @6 X   @7 V     @F   @F I    @FT   @FTI   @O X    C I   C TI
##   126      1     15   2815  30768      4    588      1      5      2
These conditions are denoted by the codes below.
- A: Cash-Only Basis
- B: Average Price Trade
- N: Next day - Calls for delivery of securities on the first business day following the day of the contract
- R: Seller - Delivery date is specified by the seller and must be between two and sixty calendar days following the day of the contract
- Z: Sold Sale - A transaction that is reported to the tape at a time later than it occurred and when other trades occurred between the time of the transaction and its report time
(2) Use only NYSE stocks to avoid any possibility of the results being influenced by differences in trading protocols.
ADBE was traded on the following exchanges on January 3, 2022.
##
## A B C D H J K M N P Q U V X Y Z
## 212 1132 447 27796 452 1137 6924 326 867 8479 26826 749 3517 821 758 5911
EX is the exchange that issued the trade. N stands for “New York Stock Exchange (NYSE)”, one of the major US stock exchanges.
##
## N
## 867
Below is a sample of the codes for the exchange on which a trade occurred:
- A: American Stock Exchange (AMEX)
- N: New York Stock Exchange (NYSE)
- C: National Stock Exchange (NSX)
- T/Q: NASDAQ
- M: Chicago Stock Exchange (CHX)
An exchange is a marketplace where securities, commodities, derivatives and other financial instruments are traded. The core function of an exchange is to ensure fair and orderly trading and the efficient dissemination of price information for any securities trading on that exchange.
The great majority of trades are completed through electronic means without regard to a physical location. This process has resulted in a substantial increase in high-frequency trading programs and the use of complex algorithms by traders on exchanges.
(3) To avoid the influence of unduly high-priced stocks, if the price is greater than $999, the stock is deleted from the sample.
PRICE is the price of the trade.
(4) Remove any duplicates.
The trades file is sorted by symbol, time, and trade sequence number. SYM_ROOT is the security symbol root. TR_SEQNUM is the trade sequence number.
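Steps (2) to (4) can be sketched in base R on a small made-up data frame. The column TIME_SEC (seconds since midnight) and the ticker XYZ are hypothetical stand-ins introduced for this example, and treating a duplicate as a fully identical row is an assumption rather than the definition used in the cited papers.

```r
# Toy trades: one off-NYSE trade, one exact duplicate, one high-priced stock
trades <- data.frame(
  SYM_ROOT  = c("ADBE", "ADBE", "ADBE", "ADBE", "XYZ"),
  EX        = c("N", "Q", "N", "N", "N"),
  TIME_SEC  = c(34200.2, 34200.1, 34200.2, 34200.1, 34200.3),  # hypothetical column
  PRICE     = c(567.67, 567.00, 567.67, 567.03, 1200),
  TR_SEQNUM = c(11, 10, 11, 9, 12),
  stringsAsFactors = FALSE
)

# (2) Keep only trades executed on the NYSE
nyse <- trades[trades$EX == "N", ]

# (3) Drop every stock that ever trades above $999
high_priced <- unique(nyse$SYM_ROOT[nyse$PRICE > 999])
nyse <- nyse[!nyse$SYM_ROOT %in% high_priced, ]

# (4) Sort by symbol, time, and sequence number, then remove duplicated rows
nyse  <- nyse[order(nyse$SYM_ROOT, nyse$TIME_SEC, nyse$TR_SEQNUM), ]
clean <- nyse[!duplicated(nyse), ]
clean
```

Sorting before removing duplicates is not strictly required by duplicated(), but it leaves the cleaned file in the order described above, which later steps rely on.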
23.5 Building state-space model
Now we are ready to create the state-space model to decompose stock prices. We will create a local level model following the original study. The authors provided SAS code for this step.
The local level model is a structural time series model that represents the observed series as an unobserved stochastic trend (a random-walk level) plus an irregular component. Because the local level model contains an unobserved component, it fits naturally into the state-space framework. Both the unobserved component and the unknown parameters can be estimated using the Kalman filter and maximum likelihood estimation.
creating time intervals
To build the model, first, we create a variable for time interval at the level of seconds to estimate changes in the components of price.
## SEC DATETIME SYM_ROOT PRICE
## 1340 2022-01-03 09:30:00 2022-01-03T09:30:00.193260212+00:00 ADBE 567.03
## 1341 2022-01-03 09:30:00 2022-01-03T09:30:00.193262590+00:00 ADBE 567.67
## 1564 2022-01-03 09:30:04 2022-01-03T09:30:04.016984663+00:00 ADBE 567.44
## 1604 2022-01-03 09:30:04 2022-01-03T09:30:04.487710774+00:00 ADBE 567.45
## 1791 2022-01-03 09:30:08 2022-01-03T09:30:08.152652440+00:00 ADBE 568.37
## 2006 2022-01-03 09:30:17 2022-01-03T09:30:17.098855331+00:00 ADBE 569.50
## 2458 2022-01-03 09:30:43 2022-01-03T09:30:43.060779961+00:00 ADBE 571.48
## 2488 2022-01-03 09:30:44 2022-01-03T09:30:44.442738051+00:00 ADBE 570.68
## 2489 2022-01-03 09:30:44 2022-01-03T09:30:44.442840378+00:00 ADBE 570.60
## 2524 2022-01-03 09:30:50 2022-01-03T09:30:50.653853291+00:00 ADBE 571.42
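One simple way to derive such a second-level variable is to truncate the datetime string after the seconds. The sketch below assumes the timestamps are available as character strings in the format shown above; it is an illustration, not necessarily how the SEC column in the output was produced.

```r
# Nanosecond timestamps as character strings (copied from the output above)
datetime <- c("2022-01-03T09:30:00.193260212+00:00",
              "2022-01-03T09:30:00.193262590+00:00",
              "2022-01-03T09:30:04.016984663+00:00")

# Keep "YYYY-MM-DDTHH:MM:SS" (the first 19 characters), then swap "T" for a space
sec <- sub("T", " ", substr(datetime, 1, 19), fixed = TRUE)
sec
```

Truncating rather than rounding assigns every trade to the second in which it occurred, so the first two trades fall into the same interval.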
creating groups by datetime, ticker, and time interval
Next, we create a grouping variable using datetime, ticker and time interval. The state-space model will run on each group.
## SYM_ROOT SEC PRICE DATETIME ID
## 1340 ADBE 2022-01-03 09:30:00 567.03 2022-01-03T09:30:00.193260212+00:00 1
## 1341 ADBE 2022-01-03 09:30:00 567.67 2022-01-03T09:30:00.193262590+00:00 1
## 1564 ADBE 2022-01-03 09:30:04 567.44 2022-01-03T09:30:04.016984663+00:00 2
## 1604 ADBE 2022-01-03 09:30:04 567.45 2022-01-03T09:30:04.487710774+00:00 2
## 1791 ADBE 2022-01-03 09:30:08 568.37 2022-01-03T09:30:08.152652440+00:00 3
## 2006 ADBE 2022-01-03 09:30:17 569.50 2022-01-03T09:30:17.098855331+00:00 4
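A grouping variable of this kind can be built by numbering the distinct combinations of ticker and interval in order of first appearance. The toy data frame below is an assumption that mirrors the first rows of the output above.

```r
# Toy data with the ticker and second-level interval for each trade
trades <- data.frame(
  SYM_ROOT = rep("ADBE", 5),
  SEC      = c("2022-01-03 09:30:00", "2022-01-03 09:30:00",
               "2022-01-03 09:30:04", "2022-01-03 09:30:04",
               "2022-01-03 09:30:08"),
  stringsAsFactors = FALSE
)

# One key per ticker-interval combination
key <- paste(trades$SYM_ROOT, trades$SEC, sep = "|")

# Number the keys by order of first appearance
trades$ID <- match(key, unique(key))
trades
```

Because SEC already contains the date, the key separates the same second on different days as well as different tickers in the same second.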
representing stock price series
We create a zoo object to represent the stock price series.
id$LOGPRICE <- log(id$PRICE)
id$PRICE <- NULL
dfzoo <- zoo(id[,-4], order.by = id$DATETIME)
head(dfzoo)
## SYM_ROOT SEC DATETIME LOGPRICE
## 2022-01-03T09:30:00.193260212+00:00 ADBE 2022-01-03 09:30:00 2022-01-03T09:30:00.193260212+00:00 6.340412
## 2022-01-03T09:30:00.193262590+00:00 ADBE 2022-01-03 09:30:00 2022-01-03T09:30:00.193262590+00:00 6.341540
## 2022-01-03T09:30:04.016984663+00:00 ADBE 2022-01-03 09:30:04 2022-01-03T09:30:04.016984663+00:00 6.341135
## 2022-01-03T09:30:04.487710774+00:00 ADBE 2022-01-03 09:30:04 2022-01-03T09:30:04.487710774+00:00 6.341153
## 2022-01-03T09:30:08.152652440+00:00 ADBE 2022-01-03 09:30:08 2022-01-03T09:30:08.152652440+00:00 6.342773
## 2022-01-03T09:30:17.098855331+00:00 ADBE 2022-01-03 09:30:17 2022-01-03T09:30:17.098855331+00:00 6.344759
obtaining changes in price variances following a tweet
We will use the package KFAS to build a local level model in the state-space framework.
library(KFAS)
Y <- dfzoo[,"LOGPRICE"]
model_structural <- SSModel(Y ~ SSMtrend(degree = 1, Q = list(matrix(NA))), H = matrix(NA))
fit_structural <- fitSSM(model_structural, inits = c(0, 0), method = "BFGS")
- The function SSModel() builds the model.
The first argument to SSModel() is the formula, which defines the observations (left side of the tilde operator ~) and the structure of the state equation (right side of ~).
In SSMtrend(degree = 1, Q = list(matrix(NA))), degree defines the degree of the polynomial component, where 1 corresponds to a local level model (see page 14 in KFAS: Exponential Family State Space Models in R). SSModel() does not perform estimation of unknown parameters, which can be estimated by fitSSM() using maximum likelihood.
- fitSSM() estimates the NA values in the time-invariant covariance matrices H and Q. The NA values represent the unknown variance parameters \(\sigma_\epsilon^{2}\) and \(\sigma_\eta^{2}\) (see page 10 in the KFAS reference manual on fitSSM()).
- Estimates of the variance parameters can be extracted by fit_structural$model$H (temporary component) and fit_structural$model$Q (permanent component).
## , , 1
##
## [,1]
## [1,] 1.415346e-10
## , , 1
##
## [,1]
## [1,] 4.819639e-07
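The same workflow can be checked on simulated data, where the true components are known. The sketch below is an illustration under assumptions, not part of the original study: the simulated series, the starting values passed to inits, and the extra KFS() step that recovers the smoothed level are all choices made for this example.

```r
library(KFAS)

# Simulate a random-walk level observed with noise
set.seed(1)
n     <- 200
level <- cumsum(rnorm(n, sd = 0.05))   # permanent component
y     <- level + rnorm(n, sd = 0.10)   # observed series

# Local level model with both variances left unknown (NA)
model <- SSModel(y ~ SSMtrend(degree = 1, Q = list(matrix(NA))), H = matrix(NA))

# fitSSM() optimises over log-variances, so the starting values are on the log scale
fit <- fitSSM(model, inits = log(c(var(y), var(y))), method = "BFGS")

var_temporary <- fit$model$H[1, 1, 1]  # observation (noise) variance
var_permanent <- fit$model$Q[1, 1, 1]  # level (random-walk) variance

# Smoothed estimate of the level, and the remainder as the temporary part
smoothed  <- KFS(fit$model, smoothing = "state")
permanent <- as.numeric(smoothed$alphahat[, 1])
temporary <- y - permanent
```

Comparing var_temporary and var_permanent with the simulation settings (0.10 squared and 0.05 squared) gives a rough sense of how well the two variances can be separated with a few hundred observations.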
applying to all stocks and time intervals
So far we’ve been working with ADBE, and we applied the local level model to one day of its data.
To compute the changes in price variances for every stock in every time interval in the entire time span of the project, we can use a loop and create a data frame to store the results for later analysis.
In the sample code below, dfvar would be a container to store output variances. id would be a grouping variable by datetime, ticker, and time interval. dfzoo would be a time series representation for each group in the sample.
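# Container for the estimated variances, one row per group
dfvar <- data.frame(sec = NA, symbol = NA, var_h = NA, var_q = NA)
for (i in unique(id$ID)) {
  print(i)  # progress indicator
  df <- subset(id, ID == i)
  # Time series representation of this group (the fourth column, ID, is dropped)
  dfzoo <- zoo(df[, -4], order.by = df$DATETIME)
  Y <- dfzoo[, "LOGPRICE"]
  # Local level model with unknown variances H and Q
  model_structural <- SSModel(Y ~ SSMtrend(degree = 1, Q = list(matrix(NA))), H = matrix(NA))
  fit_structural <- fitSSM(model_structural, inits = c(0, 0), method = "BFGS")
  # Store the group identifiers and the estimated variances
  dfvar[i, "sec"] <- as.character(df$SEC[1])
  dfvar[i, "symbol"] <- as.character(df$SYM_ROOT[1])
  dfvar[i, "var_h"] <- fit_structural$model$H[, , 1]
  dfvar[i, "var_q"] <- fit_structural$model$Q[, , 1]
}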
This task is computationally intensive and would require a high-performance computing service to run the code.
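When a loop like this runs over many groups, it can help to collect the results group by group and to skip groups where the fit fails instead of stopping the whole run. The sketch below shows that pattern in base R. The function fit_one_group is a hypothetical stand-in that returns a simple sample variance, used here only so the example runs without the full data; in practice it would wrap the SSModel() and fitSSM() calls. The toy data frame is also made up.

```r
# Toy grouped data: group 3 has a single trade and cannot be fitted
id <- data.frame(
  ID       = c(1, 1, 2, 2, 3),
  SYM_ROOT = "ADBE",
  SEC      = c("09:30:00", "09:30:00", "09:30:04", "09:30:04", "09:30:08"),
  LOGPRICE = log(c(567.03, 567.67, 567.44, 567.45, 568.37)),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the model fit on one group
fit_one_group <- function(df) {
  if (nrow(df) < 2) stop("need at least two trades to estimate a variance")
  data.frame(id = df$ID[1], sec = df$SEC[1], symbol = df$SYM_ROOT[1],
             var_logprice = var(df$LOGPRICE), stringsAsFactors = FALSE)
}

# Return NULL instead of aborting when a group cannot be fitted
safe_fit <- function(df) tryCatch(fit_one_group(df), error = function(e) NULL)

# Split by group, fit each group, drop failures, and stack the rows
fits    <- lapply(split(id, id$ID), safe_fit)
results <- do.call(rbind, Filter(Negate(is.null), fits))
results
```

Building the result once with rbind avoids growing a data frame row by row inside the loop, and the list of groups produced by split() is also a natural unit for distributing the work across the cores of a computing cluster.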