24 Investigating the Temporary and Permanent Price Impacts

In this chapter, we will generate our two outcome variables, link tweets to trades data, and estimate the temporary and permanent stock price impacts of firm-generated content.

24.1 Computing outcome variables

First, we create the outcome variables that capture how a tweet changes the composition of price with regard to \(\sigma_{s,t}^{2_u}\) (i.e., the permanent component) and \(\sigma_{s,t}^{2_i}\) (i.e., the temporary component).

We will compute 30-second percentage absolute changes for both \(\sigma_{s,t}^{2_u}\) and \(\sigma_{s,t}^{2_i}\) following the definitions given in the paper.

Permanent price impact \(\Delta\sigma_{s,t}^{2_u}\)

\[ \Delta\sigma_{s,t}^{2_u} = \left\lvert{\frac{\sigma_{s,t+30s}^{2_u}-\sigma_{s,t-1}^{2_u}}{\sigma_{s,t-1}^{2_u}}}\right\rvert \]

Temporary price impact \(\Delta\sigma_{s,t}^{2_i}\)

\[ \Delta\sigma_{s,t}^{2_i} = \left\lvert{\frac{\sigma_{s,t+30s}^{2_i}-\sigma_{s,t-1}^{2_i}}{\sigma_{s,t-1}^{2_i}}}\right\rvert \]
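Both definitions are the same transformation applied to the two variance series. As a minimal sketch (the helper name pct_abs_change is ours, not from the paper), the change can be written as a small R function:

# percentage absolute change between the value one interval before (x_lag)
# and the value 30 seconds after (x_30s) a tweet
pct_abs_change <- function(x_lag, x_30s) {
  abs((x_30s - x_lag) / x_lag)
}

pct_abs_change(x_lag = 2e-8, x_30s = 3e-8)
## [1] 0.5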

We first load the sample price variances dataset var_daily. var_q is the permanent-component variance, var_h is the temporary-component variance, sec is the second-level time interval in string format, and sec_dt is the same interval in datetime format.

load("var_daily.Rdata")

head(var_daily)
##                   sec symbol        var_h        var_q              sec_dt
## 1 2022-01-03 09:30:00      A 1.698064e-13 1.490271e-08 2022-01-03 09:30:00
## 2 2022-01-03 09:30:00   ADBE 6.778815e-11 1.272249e-06 2022-01-03 09:30:00
## 3 2022-01-03 09:30:00    ADP 1.000000e+00 1.000000e+00 2022-01-03 09:30:00
## 4 2022-01-03 09:30:00   ANET 1.484803e-10 1.750938e-08 2022-01-03 09:30:00
## 5 2022-01-03 09:30:00     BR 1.698064e-13 1.490271e-08 2022-01-03 09:30:00
## 6 2022-01-03 09:30:00   CDAY 1.698064e-13 1.490271e-08 2022-01-03 09:30:00

Below we compute 30-second percentage absolute changes for permanent and temporary price components for each ticker.

library(dplyr)

var_daily <- var_daily %>% 
  arrange(symbol, sec) %>% 
  group_by(symbol) %>%
  mutate(# variances at the previous observed interval, i.e., t-1
         var_q_lag = lag(var_q),
         var_h_lag = lag(var_h),
         # timestamp 30 seconds after the current interval
         new_time  = sec_dt + 30,
         # variances recorded exactly 30 seconds later, if that interval exists
         var_q_30 = if_else(new_time %in% sec_dt, var_q[match(new_time, sec_dt)], NA),
         var_h_30 = if_else(new_time %in% sec_dt, var_h[match(new_time, sec_dt)], NA),
         # 30-second percentage absolute changes
         var_q_delta = abs((var_q_30 - var_q_lag) / var_q_lag),
         var_h_delta = abs((var_h_30 - var_h_lag) / var_h_lag)) %>%
  select(symbol, sec_dt, var_h_delta, var_q_delta)

\(\sigma_{s,t-1}^{2_u}\) is computed by lag(var_q); \(\sigma_{s,t-1}^{2_i}\) is computed by lag(var_h).

\(\sigma_{s,t+30s}^{2_u}\) is computed by if_else(new_time %in% sec_dt, var_q[match(new_time, sec_dt)], NA); \(\sigma_{s,t+30s}^{2_i}\) is computed by if_else(new_time %in% sec_dt, var_h[match(new_time, sec_dt)], NA).

Essentially, if_else(new_time %in% sec_dt, var_h[match(new_time, sec_dt)], NA) looks up, for each time interval, the value of var_h recorded exactly 30 seconds later, if such an interval exists. This can be seen more clearly in the sample code below.

First we have the temporary price components in var_h. We then search for pairs of cases in var_h based on their timestamps: sec_dt is the column of time intervals, and new_time is a column of timestamps shifted 30 seconds forward from sec_dt.

Next, if a case in new_time is found in sec_dt, match(new_time, sec_dt) returns the position of its first match in sec_dt. var_h[match()] grabs the corresponding values in var_h and stores them in var_h_30.

var_h <- c(5, 8, 6, 9, 10, 7, 12, 15)

sec_dt <- c(5, 10, 35, 40, 60, 70, 100, 110)
new_time <- sec_dt + 30 
var_h_30 <- if_else(new_time %in% sec_dt, var_h[match(new_time, sec_dt)], NA)

data.frame(var_h, sec_dt, new_time, var_h_30)
##   var_h sec_dt new_time var_h_30
## 1     5      5       35        6
## 2     8     10       40        9
## 3     6     35       65       NA
## 4     9     40       70        7
## 5    10     60       90       NA
## 6     7     70      100       12
## 7    12    100      130       NA
## 8    15    110      140       NA

24.2 Linking tweets to trades with inequality and rolling joins

Then, we link each individual intraday tweet to a corresponding transaction.

Each tweet-linked trade is a “tweet-trade”: the first trade that occurs after a tweet and within 60 seconds of it.

Let’s bring back our tweets dataset tweets.

load("tweets.Rdata")

Before we take the steps below to link tweets with trades, we first subset tweets posted during the NYSE core trading session between 9:30 am and 4:00 pm.

# weekdays
tweets$date <- substr(tweets$timestamp, 1, 10)
tweets$day <- weekdays(as.Date(tweets$date))
tweets <- subset(tweets, subset = !day %in% c("Saturday", "Sunday")) 

# NYSE core trading session: 9:30-16:00
tweets$hour <- substr(tweets$timestamp, 12, 19)
tweets <- subset(tweets, subset = hour >= "09:29:00" & hour <= "16:05:00") 

The last daily trade is assumed to occur no later than 4:05 pm, since transactions are commonly reported up to five minutes after the official close at 4:00 pm.
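As a quick sanity check (this step is ours, not part of the original workflow), we can confirm that only weekday, trading-session tweets remain:

# no Saturday/Sunday rows should remain, and hours should fall within 09:29-16:05
table(tweets$day)
range(tweets$hour)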

matching tweets with trades using inequality and rolling joins

We merge tweets with the price variances dataset var_daily in this step.

The id of the dataset tweets is the Twitter handle.

names(tweets)
## [1] "timestamp"      "handle"         "text"           "retweet_count"  "favorite_count" "pos"           
## [7] "neg"            "compet"         "cust"

The id of the price variances dataset var_daily is symbol (ticker).

head(var_daily)
## # A tibble: 6 × 4
## # Groups:   symbol [1]
##   symbol sec_dt              var_h_delta var_q_delta
##   <chr>  <dttm>                    <dbl>       <dbl>
## 1 A      2022-01-03 09:30:00          NA          NA
## 2 A      2022-01-03 09:30:02          NA          NA
## 3 A      2022-01-03 09:30:03          NA          NA
## 4 A      2022-01-03 09:30:06          NA          NA
## 5 A      2022-01-03 09:31:05          NA          NA
## 6 A      2022-01-03 09:31:06          NA          NA

We will bridge the two datasets tweets and var_daily using the table ticker_handle.

ticker_handle <- read.csv("ticker_handle.csv")

head(ticker_handle)
##   symbol    handle
## 1      A   Agilent
## 2    ACN Accenture
## 3   ADBE     Adobe
## 4    ADI  ADI_News
## 5    ADP       ADP
## 6   ADSK  autodesk

1. We merge tweets with ticker_handle to add the column symbol to the dataset tweets.
tweets <- tweets %>% left_join(ticker_handle, by = "handle")

head(tweets)
##             timestamp    handle
## 1 2022-01-31 13:00:08 Accenture
## 2 2022-01-31 12:00:03 Accenture
## 3 2022-01-28 15:00:07 Accenture
## 4 2022-01-28 14:00:00 Accenture
## 5 2022-01-28 11:00:13 Accenture
## 6 2022-01-27 15:45:33 Accenture
##                                                                                                                                                                                                                                                     text
## 1                                           Did you know? COVID-19 has eroded years of progress towards workplace gender equality. 😱 🤔 \n\nTune in to Change Conversations, where our hosts chat with leaders from and about ending the 'Shecession': 
## 2                                               In Norway, one innovative waste management company is breaking down plastic using enzymes, sorting papers with sensors—and revolutionizing recycling. The full story is in our podcast #BuiltForChange: 
## 3                                            Why are nearly 5,300,000 women out of work? 😱 🤔 \n\nTune in to part 1 of a special 2-part episode of Change Conversations, where our hosts chat with leaders from and to discuss Women, Work, and COVID: 
## 4                                                                                                           Not Wordle, just Accenture\n\n⬜️⬜️🟪⬜️⬜️⬜️⬜️\n⬜️⬜️⬜️🟪⬜️⬜️⬜️\n⬜️⬜️⬜️⬜️🟪⬜️⬜️\n⬜️⬜️⬜️⬜️⬜️🟪⬜️\n⬜️⬜️⬜️⬜️🟪⬜️⬜️\n⬜️⬜️⬜️🟪⬜️⬜️⬜️\n⬜️⬜️🟪⬜️⬜️⬜️⬜️
## 5                                                                                                                                             Our new report reveals 4 key tenets Continuum Competitors share to unleash competitiveness on the #cloud: 
## 6 We’re honored to receive a perfect score on Corporate Equality Index and be named a 2022 Best Place to Work for LGBTQ+ Equality. Thank you to our people helping us foster an inclusive culture where everyone can thrive. #PrideAtAccenture #CEI2022 
##   retweet_count favorite_count pos neg compet  cust symbol
## 1             1             19   0   1  FALSE FALSE    ACN
## 2             5             23   0   0  FALSE FALSE    ACN
## 3             5             17   0   1  FALSE FALSE    ACN
## 4            30            215   0   0  FALSE FALSE    ACN
## 5             0             10   1   0   TRUE FALSE    ACN
## 6             8             34   1   0  FALSE FALSE    ACN
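Because we used left_join(), any handle missing from ticker_handle would now carry an NA in symbol. A quick check (ours, not part of the original workflow) lists such handles, if any:

# handles without a matching ticker; zero rows means every tweet was mapped to a symbol
tweets %>%
  filter(is.na(symbol)) %>%
  distinct(handle)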

2. We merge the updated tweets and var_daily based on the definition of a tweet-trade.

Recall that a tweet-trade is the first trade that occurred after a tweet and within 60 seconds of the tweet.

This is not an equality join where the rows match if the key from the left-hand table is equal to the key in the right-hand table. Instead, we will use inequality and rolling joins available with dplyr::join_by().

Inequality joins match on an inequality, such as >, >=, <, or <=, and are common in time series analysis. Inequality joins will match a single row in x to a potentially large number of rows in y.

Rolling joins are a variant of inequality joins that limit the results returned from an inequality join condition. They are useful for “rolling” the closest match forward or backwards when there isn’t an exact match.
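Before applying these joins to our data, a small toy example may help; the tables events and trades and their columns are made up for illustration:

library(dplyr)

# left-hand table: two events, each with a 60-second matching window
events <- data.frame(id = c("e1", "e2"), start = c(10, 100))
events$upper <- events$start + 60

# right-hand table: trade times
trades <- data.frame(t = c(5, 20, 40, 150))

# inequality join alone: every trade with start <= t and t <= upper matches
left_join(events, trades, join_by(start <= t, upper >= t))
##   id start upper   t
## 1 e1    10    70  20
## 2 e1    10    70  40
## 3 e2   100   160 150

# adding closest() keeps only the first trade at or after each start
left_join(events, trades, join_by(closest(start <= t), upper >= t))
##   id start upper   t
## 1 e1    10    70  20
## 2 e2   100   160 150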

Returning to our case, we use join_by() to specify an inequality join upper bound >= lower bound and a rolling join closest(starting point <= lower bound) to meet the merge criteria. closest() constructs a rolling join.

by <- join_by(symbol, closest(starting point <= lower bound), upper bound >= lower bound)

First, we bound the returned results within a 60-second window. The window starts at the timestamp of a tweet, tweets$timestamp_dt. tweets is the primary (left-hand) table, and tweets$timestamp_dt holds the values that we use to match cases in the right-hand table var_daily.

# starting point
# note: "EST" is a fixed UTC-5 offset; "America/New_York" would also handle daylight saving time
tweets$timestamp_dt <- as.POSIXct(tweets$timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "EST") 

The upper bound of this 60-second range is the timestamp of the tweet plus 60 seconds.

# upper bound
tweets$upper <- tweets$timestamp_dt + 60

We set the lower bound of this 60-second range to be the time interval in the price variance dataset var_daily, denoted by sec_dt. With the inequality join upper bound >= lower bound alone, there can be several matches.

To limit the results returned from the inequality join upper bound >= lower bound, we “roll” the closest match forward with closest(starting point <= lower bound) to find the first trade occurring immediately after a tweet.

closest() uses the left-hand table x as the primary table, and the right-hand table y as the one to find the closest match in, regardless of how the inequality is specified. With closest(), if we need to perform a join on a computed variable, we need to precompute it and store it in a separate column. That’s why we did tweets$upper <- tweets$timestamp_dt + 60.

After specifying the join conditions with dplyr::join_by(), we use left_join() to link the two datasets and keep only the successfully linked tweet-trades with na.omit().

by <- join_by(symbol, closest(timestamp_dt <= sec_dt), upper >= sec_dt)

tweet_trades <- left_join(tweets, var_daily, by) %>%
  na.omit() %>%                          # keep only successfully linked tweet-trades
  group_by(symbol) %>%
  arrange(symbol, timestamp_dt) %>%
  select(-handle, -upper, -sec_dt, -timestamp)

We’ve just linked a tweet to a corresponding pair of \(\sigma_{s,t}^{2_u}\) (permanent component) and \(\sigma_{s,t}^{2_i}\) (temporary component).

24.3 Addressing omitted variable bias

At this point, we are almost ready to build a model to estimate changes in price variances following a tweet. However, past research has identified several determinants of price impact, which we should take into account to avoid omitted variable bias. These determinants will be the control variables in the final model.

The original study considered seven aspects that could affect price impact: the number of an account’s followers, the natural logarithm of trading volume, the natural logarithm of average trade size, volatility, effective spread, the natural logarithm of a high-frequency trading proxy, and order imbalance.

Here we cover the natural logarithm of trading volume and the natural logarithm of average trade size. The other determinants (volatility, effective spread, the natural logarithm of a high-frequency trading proxy, and order imbalance) can be generated with additional quotes data from the TAQ database on WRDS, using data manipulation methods similar to those introduced below.

The number of followers at each timestamp is not made available by the standard Twitter timeline endpoint, so we use retweet counts and favorite counts as proxies instead.

aggregating volume of trade by time interval

We will use the sample dataset trading_volume to generate the natural logarithm of trading volume and the natural logarithm of average trade size.

load("trading_volume.Rdata")

We created the dataset trading_volume by aggregating the volume of trade at the second level using the raw transactions data from WRDS, the same data we used to compute price variances. The variable SIZE was used to compute the volume of trade (a sketch of this aggregation step follows the preview below).

head(trading_volume)
##   symbol                 sec volume_sec              sec_dt
## 1      A 2022-01-03 09:30:00      34930 2022-01-03 09:30:00
## 2      A 2022-01-03 09:30:02        130 2022-01-03 09:30:02
## 3      A 2022-01-03 09:30:03        428 2022-01-03 09:30:03
## 4      A 2022-01-03 09:30:06        200 2022-01-03 09:30:06
## 5      A 2022-01-03 09:31:05        386 2022-01-03 09:31:05
## 6      A 2022-01-03 09:31:06         68 2022-01-03 09:31:06
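A minimal sketch of that aggregation step, assuming a raw trades table named trades_raw that already holds symbol, the second-level timestamp string sec, and the TAQ trade size column SIZE (the table name and pre-built columns here are our assumptions):

# sum trade sizes within each symbol-second to get the volume per second
trading_volume <- trades_raw %>%
  group_by(symbol, sec) %>%
  summarise(volume_sec = sum(SIZE), .groups = "drop") %>%
  mutate(sec_dt = as.POSIXct(sec, format = "%Y-%m-%d %H:%M:%S", tz = "EST"))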

computing control variables

Natural logarithm of trading volume is denoted by \(lnvolume_{s,t}\). Trading volume is measured as the dollar volume of transactions executed in stock \(s\) prior to a corresponding tweet-trade \(t\).

Natural logarithm of average trade size is denoted by \(lntradesize_{s,t}\). Average trade size is computed as the trading volume prior to tweet-trade \(t\) divided by the number of transactions just prior to a corresponding tweet-trade \(t\) in stock \(s\).

# match each tweet-trade to the closest trading second strictly before the tweet
by <- join_by(symbol, closest(timestamp_dt > sec_dt))

tweet_trades <- left_join(tweet_trades, trading_volume, by) %>%
  group_by(symbol) %>%
  mutate(n_case = 1,
         # log of the cumulative matched per-second volume over a symbol's tweet-trades
         volume_tt = log(cumsum(volume_sec)),
         # log of that cumulative volume divided by the number of tweet-trades so far
         volume_tt_avg = log(cumsum(volume_sec) / cumsum(n_case)))

24.4 Estimating panel least squares model

Finally, to investigate whether tweet valence and subject matter drive the price impact of tweet-trades, we estimate

\[
\begin{aligned}
PriceImpact_{s,t} = {} & \alpha_{s} + \beta_{t} \\
& + \gamma_{1} consumer_{s,t} + \gamma_{2} competitor_{s,t} \\
& + \gamma_{3} consumer*{-ve}_{s,t} + \gamma_{4} competitor*{-ve}_{s,t} \\
& + \gamma_{5} consumer*{+ve}_{s,t} + \gamma_{6} competitor*{+ve}_{s,t} \\
& + \gamma_{7} {-ve}_{s,t} + \gamma_{8} {+ve}_{s,t} \\
& + \sum_{k = 1}^{4}\varphi_{k}C_{k,s,t} + \epsilon_{s,t}
\end{aligned}
\]

\(\alpha_{s}\) and \(\beta_{t}\) are stock and time fixed effects. \(+ve_{s,t}\) refers to positive-valence tweets; \(-ve_{s,t}\) refers to negative-valence tweets. \(*\) indicates interaction effects. For instance, \(consumer*-ve_{s,t}\) refers to negative-valence tweets related to consumers. \(C_{k,s,t}\) reflects a vector of known determinants of price impact.

The original study used two approaches to investigate the temporary and permanent price impacts of tweet valence and subject matter: panel least squares and 2SLS instrumental variable (IV). Here we will estimate the panel least squares model.

We use the package plm to construct the panel least squares models. model_h estimates the temporary price impact, and model_q estimates the permanent price impact.

library(plm)

# convert the data to a panel data frame
pdata <- pdata.frame(tweet_trades, index = c("symbol", "timestamp_dt"))
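
Optionally, plm::pdim() summarizes the panel structure (how many stocks and time periods, and whether the panel is balanced); this check is ours, not part of the original code.

# describe the dimensions of the panel
pdim(pdata)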

# estimate PLS models
# note: if retweet_count or favorite_count can be zero, log() yields -Inf; log1p() is a common alternative
model_h <- plm(var_h_delta ~ cust + compet + 
                 cust*neg + compet*neg + 
                 cust*pos + compet*pos +
                 volume_tt + volume_tt_avg + log(retweet_count) + log(favorite_count),
               data = pdata, 
               index = c("symbol", "timestamp_dt"), 
               model = "within",    # fixed-effects (within) estimator
               effect = "twoways")  # stock and time fixed effects

model_q <- plm(var_q_delta ~ cust + compet + 
                 cust*neg + compet*neg + 
                 cust*pos + compet*pos +
                 volume_tt + volume_tt_avg + log(retweet_count) + log(favorite_count),
               data = pdata, 
               index = c("symbol", "timestamp_dt"), 
               model = "within", 
               effect = "twoways")