24 Investigating the Temporary and Permanent Price Impacts

In this chapter, we link the temporary and permanent stock price components with tweets to generate the two outcome variables, and we estimate the temporary and permanent stock price impacts of firm-generated content.

24.1 Linking tweets to trades data

First, we link tweets to trades data to create “tweet-trades” using inequality and rolling joins.

As defined in the paper, a tweet-trade is the first trade to occur immediately after a tweet within 60 seconds.

  1. Before we link tweets to trades, we first subset tweets to the NYSE core trading session, between 9:30 am and 4:00 pm, on weekdays.

The last daily trade is assumed to occur no later than 4:05 pm, since transactions are commonly reported up to five minutes after the official close at 4:00 pm.

library(tidyverse)

load("tweets.Rdata")

# Filter weekdays
tweets <- tweets %>%
  mutate(date = substr(timestamp, 1, 10),
         day = weekdays(as.Date(date))) %>%
  filter(!day %in% c("Saturday", "Sunday"))

# Keep tweets around the NYSE core trading session (9:30-16:00), with a
# one-minute buffer before the open and five minutes after the close
tweets <- tweets %>%
  mutate(hour = substr(timestamp, 12, 19)) %>%
  filter(hour >= "09:29:00" & hour <= "16:05:00")
  2. We then convert the tweet timestamp in the tweets data to nanotime resolution to ensure compatibility with the timestamp format in the trades data.

Additionally, we add 60 seconds to the tweet timestamp to set up the 60-second time window for creating tweet-trades.

library(nanotime)

tweets <- tweets %>%
  mutate(
    tweet_nanotime = as.nanotime(as.POSIXct(timestamp)), # nanotime resolution
    tweet_nanotime_plus_60s = tweet_nanotime + 60e9  # 60 seconds in nanoseconds
  )
  3. Next, we link each tweet in the dataset tweets to a corresponding pair of price variances in the dataset qh, which we obtained from fitting the state-space model in the last chapter.

The id of the dataset tweets is handle; the id of the price variances dataset qh is SYM_ROOT, which is the stock ticker. To merge the two datasets properly, we use a third dataset, ticker_handle, as a bridge that provides a one-to-one match between ticker and handle.

ticker_handle <- read_csv("ticker_handle.csv")

tweets <- tweets %>% 
  left_join(ticker_handle, by = "handle") %>%
  select(timestamp_tweet = timestamp, full_text, ticker,
         retweet_count, favorite_count, comp_score, pos, neg, compet, cust,
         tweet_nanotime, tweet_nanotime_plus_60s)  %>% 
  arrange(timestamp_tweet)

We merge the dataset tweets with the dataset ticker_handle in order to add the ticker column to tweets for use in the next step.
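
As an optional sanity check (a small snippet added here for illustration, not part of the original pipeline), we can verify that every handle found a matching ticker; tweets from handles missing in ticker_handle would end up with NA in the ticker column.

# Count tweets whose handle did not match any ticker
tweets %>%
  filter(is.na(ticker)) %>%
  count()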

Matching tweets with trades using inequality and rolling joins

Finally, we join the two datasets tweets and qh to create tweet-trades using inequality and rolling joins.

Recall that a tweet-trade is defined as the first trade that occurs after a tweet and within 60 seconds of the tweet.

tweet_trades <- tweets %>%
  full_join(
    qh,
    join_by(
      ticker == SYM_ROOT,                   # match tweets and trades on ticker
      tweet_nanotime < NANOTIME,            # trade occurs after the tweet...
      tweet_nanotime_plus_60s >= NANOTIME,  # ...and within 60 seconds of it
      closest(tweet_nanotime <= NANOTIME)   # keep only the first (earliest) such trade
    )
  )

This is not an equality join, where rows match when the key from the left-hand table equals the key in the right-hand table. Instead, we use inequality and rolling joins.

Inequality joins match on an inequality, such as >, >=, <, or <=, and are common in time series analysis. Inequality joins will match a single row in x to a potentially large number of rows in y.

Rolling joins are a variant of inequality join that limits the results returned by an inequality join condition. They are useful for “rolling” the closest match forward or backward when an exact match is unavailable.

In R, we can use dplyr::join_by() to specify an inequality join and closest() to construct a rolling join.
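
Before applying this to our data, here is a minimal sketch of an inequality plus rolling join on two toy tables (events and obs are made-up names, not part of our datasets): each event is matched to the first observation occurring within 10 seconds after it.

# Toy illustration of an inequality + rolling join
# (dplyr and tibble come from library(tidyverse) loaded above)
events <- tibble(
  id             = c("a", "b"),
  e_time         = c(10, 40),
  e_time_plus_10 = e_time + 10   # precomputed upper bound of the window
)

obs <- tibble(
  o_time = c(5, 12, 18, 55),
  value  = c(1, 2, 3, 4)
)

events %>%
  left_join(
    obs,
    join_by(
      e_time < o_time,             # observation occurs after the event...
      e_time_plus_10 >= o_time,    # ...and within 10 seconds of it
      closest(e_time <= o_time)    # keep only the first (closest) match
    )
  )
# "a" matches o_time = 12 (value 2); "b" has no observation within its window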

For each tweet, we set the boundary of its time window to 60 seconds, using the tweet timestamp tweet_nanotime and the trade timestamp NANOTIME. tweet_nanotime and tweet_nanotime_plus_60s in the left-hand table tweets are the columns we use to match against NANOTIME in the right-hand table qh.

tweet_nanotime < NANOTIME,
tweet_nanotime_plus_60s >= NANOTIME,
closest(tweet_nanotime <= NANOTIME)

To limit the results returned from the inequality join, we “roll” the closest match forward with closest() to find the first trade occurring immediately after a tweet.

closest() uses the left-hand table as the primary table, and the right-hand table as the one to find the closest match in, regardless of how the inequality is specified. With closest(), if we need to perform a join on a computed variable, we need to precompute and store it in a separate column. That’s why we created tweet_nanotime_plus_60s.

Now let’s create a tweet-trade indicator is_tweet_trade.

tweet_trades <- tweet_trades %>%
  group_by(ticker, timestamp_tweet) %>%
  arrange(ticker, timestamp_tweet, NANOTIME) %>%
  mutate(
    # A tweet-trade requires both a matched tweet and a matched trade;
    # flag only the first trade following each tweet
    is_tweet_trade = !is.na(timestamp_tweet) & !is.na(NANOTIME) & 
      row_number(NANOTIME) == 1
  ) %>%
  ungroup()
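
As an optional quick check (not part of the original pipeline), we can count how many tweets were matched to a trade within their 60-second window.

tweet_trades %>%
  summarise(n_tweet_trades = sum(is_tweet_trade))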

24.2 Computing outcome variables

We've just linked each tweet to a corresponding pair of $\sigma^2_{u,s,t}$ (the permanent component) and $\sigma^2_{i,s,t}$ (the temporary component).

Now we're ready to create the outcome variables that measure how a tweet changes the composition of the price with regard to $\sigma^2_{u,s,t}$ (i.e., the permanent component) and $\sigma^2_{i,s,t}$ (i.e., the temporary component).

We will compute 30-second percentage absolute changes for both $\sigma^2_{u,s,t}$ and $\sigma^2_{i,s,t}$, following the definitions given in the paper.

Permanent price impact $\Delta\sigma^2_{u,s,t}$

$$\Delta\sigma^2_{u,s,t} = \left| \frac{\sigma^2_{u,s,t+30s} - \sigma^2_{u,s,t-1}}{\sigma^2_{u,s,t-1}} \right|$$

Temporary price impact $\Delta\sigma^2_{i,s,t}$

$$\Delta\sigma^2_{i,s,t} = \left| \frac{\sigma^2_{i,s,t+30s} - \sigma^2_{i,s,t-1}}{\sigma^2_{i,s,t-1}} \right|$$
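
For instance, with hypothetical values $\sigma^2_{u,s,t-1} = 2.0 \times 10^{-6}$ and $\sigma^2_{u,s,t+30s} = 2.5 \times 10^{-6}$, the permanent price impact would be $\Delta\sigma^2_{u,s,t} = |(2.5 - 2.0)/2.0| = 0.25$, i.e., a 25% absolute change.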

Below we compute the 30-second percentage absolute changes in the permanent and temporary price components for each ticker. SEC is the second-level time interval of the variance series, so a lead of 30 rows corresponds to 30 seconds.

tweet_impact <- tweet_trades %>%
  arrange(ticker, SEC, NANOTIME) %>%  # ensure chronological order within each ticker
  group_by(ticker) %>%
  mutate(
    
    # Get the variance 30 seconds later
    var_q_plus_30s = lead(var_q, 30),
    var_h_plus_30s = lead(var_h, 30),
    
    # Calculate the percentage change (absolute value)
    delta_var_q = abs((var_q_plus_30s - lag(var_q)) / lag(var_q)),
    delta_var_h = abs((var_h_plus_30s - lag(var_h)) / lag(var_h))
  ) %>%
  ungroup() 

$\sigma^2_{u,s,t-1}$ is computed by lag(var_q); $\sigma^2_{i,s,t-1}$ is computed by lag(var_h).

$\sigma^2_{u,s,t+30s}$ is computed by lead(var_q, 30); $\sigma^2_{i,s,t+30s}$ is computed by lead(var_h, 30).
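
As a quick check of this lag/lead logic, the following toy example (a made-up per-second series, not the actual qh data) shows how lag() and lead(, 30) line up on second-level data.

# Toy per-second variance series to illustrate lag()/lead(, 30)
toy <- tibble(
  sec   = 0:59,                            # hypothetical second-level index
  var_q = seq(1e-6, 2e-6, length.out = 60)
) %>%
  mutate(
    var_q_lag_1s   = lag(var_q),           # value one second earlier
    var_q_plus_30s = lead(var_q, 30),      # value 30 seconds later
    delta_var_q    = abs((var_q_plus_30s - var_q_lag_1s) / var_q_lag_1s)
  )

head(toy)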

24.3 Addressing omitted variable bias

At this point, we are almost ready to build a model to estimate changes in the price variances following a tweet. However, past research has identified several known determinants of price impact, which we should take into account to avoid omitted variable bias. These determinants will serve as the control variables in the final model.

The original study considered seven aspects that could affect price impact: the number of an account’s followers, the natural logarithm of trading volume, the natural logarithm of average trade size, volatility, effective spread, the natural logarithm of a high-frequency trading proxy, and order imbalance.

Here we cover the natural logarithm of trading volume and the natural logarithm of average trade size. The natural logarithm of trading volume is denoted by $\ln volume_{s,t}$. Trading volume is measured as the dollar volume of transactions executed in stock s prior to the corresponding tweet-trade t.

The natural logarithm of average trade size is denoted by $\ln tradesize_{s,t}$. Average trade size is computed as the trading volume prior to tweet-trade t divided by the number of transactions prior to tweet-trade t in stock s.

tweet_trades_with_metrics <- tweet_trades %>%
  arrange(ticker, SEC, NANOTIME) %>%
  group_by(ticker) %>%
  mutate(
    
    # Dollar volume = price * size
    dollar_volume_per_trade = exp(LOGPRICE) * SIZE,
    
    # Calculate total dollar volume prior to tweet-trade
    trading_volume = lag(cumsum(dollar_volume_per_trade)),
    
    # Count number of transactions prior to tweet-trade
    # (row_number() - 1 excludes the current trade, consistent with lag(cumsum()) above)
    num_prior_transactions = row_number() - 1,
    
    # Calculate average trade size
    avg_trade_size = trading_volume / num_prior_transactions,
    
    # Calculate natural logarithm of average trade size
    ln_avg_trade_size = log(avg_trade_size)
  ) %>%
  ungroup() %>%
  filter(is_tweet_trade)

The other determinants (volatility, effective spread, the natural logarithm of a high-frequency trading proxy, and order imbalance) can be generated from additional quotes data in the TAQ database on WRDS, using data manipulation methods similar to those introduced above.
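
As one rough illustration, the relative effective spread could be computed along the following lines, assuming a TAQ quotes table has been loaded as quotes with columns SYM_ROOT, QUOTE_NANOTIME, BID, and ASK (these names are assumptions for the sketch; the actual quote file layout may differ).

# A minimal sketch, not the paper's exact procedure:
# match each trade to the most recent prevailing quote with a rolling join,
# then compute the relative effective spread 2 * |price - midpoint| / midpoint
trades_with_quotes <- tweet_trades %>%
  left_join(
    quotes,
    join_by(ticker == SYM_ROOT, closest(NANOTIME >= QUOTE_NANOTIME))
  ) %>%
  mutate(
    midpoint         = (BID + ASK) / 2,
    effective_spread = 2 * abs(exp(LOGPRICE) - midpoint) / midpoint
  )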

As for the number of followers at each timestamp, it is not made available by the standard Twitter timeline endpoint, so we use retweet counts and favorite counts as proxies instead.

24.4 Estimating panel least squares model

Finally, to investigate whether tweet valence and subject matter drive the price impact of tweet-trades, we estimate

$$
\begin{aligned}
PriceImpact_{s,t} = \alpha_s + \beta_t &+ \gamma_1\, consumer_{s,t} + \gamma_2\, competitor_{s,t} \\
&+ \gamma_3\, consumer \times {-ve}_{s,t} + \gamma_4\, competitor \times {-ve}_{s,t} \\
&+ \gamma_5\, consumer \times {+ve}_{s,t} + \gamma_6\, competitor \times {+ve}_{s,t} \\
&+ \gamma_7\, {-ve}_{s,t} + \gamma_8\, {+ve}_{s,t} + \sum_{k=1}^{4} \varphi_k C_{k,s,t} + \epsilon_{s,t}
\end{aligned}
$$

$\alpha_s$ and $\beta_t$ are stock and time fixed effects. ${+ve}_{s,t}$ refers to positive-valence tweets; ${-ve}_{s,t}$ refers to negative-valence tweets. $\times$ indicates interaction effects. For instance, $consumer \times {-ve}_{s,t}$ refers to negative-valence tweets related to consumers. $C_{k,s,t}$ reflects a vector of known determinants of price impact.

The original study used two approaches to investigate the temporary and permanent price impacts of tweet valence and subject matter: panel least squares and 2SLS instrumental variable (IV). Here we will estimate the panel least squares model.

We use the package plm to construct the panel least squares models. model_h estimates the temporary price impact, and model_q estimates the permanent price impact.

library(plm)

# convert the data to a panel data frame, keeping only tweet-trades and
# attaching the control variables computed in tweet_trades_with_metrics
pdata <- tweet_impact %>%
  filter(is_tweet_trade) %>%
  left_join(
    select(tweet_trades_with_metrics, ticker, NANOTIME,
           trading_volume, ln_avg_trade_size),
    by = c("ticker", "NANOTIME")
  ) %>%
  pdata.frame(index = c("ticker", "timestamp_tweet"))

# estimate PLS models
model_h <- plm(delta_var_h ~ cust + compet + 
                 cust*neg + compet*neg + 
                 cust*pos + compet*pos +
                 log(trading_volume) + ln_avg_trade_size,
                 # + log(retweet_count) + log(favorite_count)
               data = pdata, 
               index = c("ticker", "timestamp_tweet"), 
               model = "within", 
               effect = "twoways")

model_q <- plm(delta_var_q ~ cust + compet + 
                 cust*neg + compet*neg + 
                 cust*pos + compet*pos +
                 log(trading_volume) + ln_avg_trade_size,
                 # + log(retweet_count) + log(favorite_count)
               data = pdata, 
               index = c("ticker", "timestamp_tweet"), 
               model = "within", 
               effect = "twoways")
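
Once the models are estimated, summary() reports the coefficient estimates. As an optional follow-up (one common choice, not necessarily the standard-error treatment used in the original paper), cluster-robust standard errors by stock can be obtained with lmtest::coeftest() and plm::vcovHC().

library(lmtest)

summary(model_h)
summary(model_q)

# Standard errors clustered by stock (the "group" dimension of the panel)
coeftest(model_h, vcov = vcovHC(model_h, type = "HC1", cluster = "group"))
coeftest(model_q, vcov = vcovHC(model_q, type = "HC1", cluster = "group"))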