20 Introduction

Welcome to the new phase of our learning journey! In this module, Tweets and Stock Price Impacts, we will put the concepts and skills we have learnt into practice by solving real-world problems. This will entail breaking down seemingly daunting tasks into smaller, manageable components and tackling them step by step.

To see how this process works, in the following sections we will use R to replicate a study on firm-generated content on Twitter and its impact on the stock market:

Lacka, E., Boyd, D. E., Ibikunle, G., & Kannan, P. K. (2022). Measuring the Real-Time Stock Market Impact of Firm-Generated Content. Journal of Marketing, 86(5), 58-78. https://doi.org/10.1177/00222429211042848.

We will walk through the workflow of this quantitative, computational marketing study and skip the paper's methodological and theoretical discussions.

We will use this case to review many tasks that we have encountered in past modules and to introduce several new topics that expand our knowledge and skill sets in data analysis with R. In particular, we’ll expand on analyzing unstructured textual data and discuss the concepts and procedures of text analysis. The topics covered include:

  • Retrieving tweets
  • Representing, preprocessing, and extracting information from unstructured texts
  • Using other programming languages, such as Python, with R
  • Representing and dealing with time series
  • Merging various data sources using inequality and rolling joins
  • Handling repeated tasks

Let’s get started!

20.1 Overview of the study

This paper focuses on the information technology (IT) firms included in the S&P 500 index, and examines the stock price impacts of their marketing activity on Twitter.

The study has implications for real-time marketing in social media, where multiple pieces of firm-generated content (FGC) are disseminated throughout the day.

To measure the instantaneous impact of FGC on firm value, we need the information below, drawn from multiple sources, to construct our dataset.

firm-generated content (FGC)

The first concept to be operationalized is firm-generated content (FGC). It is defined as a firm’s communications disseminated through its own online communication tools, such as Twitter.

In this case, FGC consists of the tweets posted from these firms’ corporate accounts. Each tweet’s timestamp is recorded to the second and will be matched to the corresponding timestamp of the firm’s trading activity, which is recorded at subsecond intervals.
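The matching of tweet and trade timestamps can be sketched in base R as follows. This is only an illustration under assumptions: the timestamps and prices are made up, and `findInterval()` is used here as a simple stand-in for a rolling join, attaching to each tweet the most recent trade at or before it. The data and join method used in later chapters may differ.

```r
# Hypothetical trade ticks for one firm, with subsecond timestamps (UTC)
start  <- as.POSIXct("2020-01-02 14:30:00", tz = "UTC")
trades <- data.frame(
  time  = start + c(0.2, 0.9, 1.4, 3.1, 3.8),
  price = c(100.00, 100.02, 100.01, 100.05, 100.04)
)

# Hypothetical tweets, timestamped to the second
tweets <- data.frame(
  id   = c("a", "b"),
  time = start + c(1, 3)
)

# For each tweet, find the index of the last trade at or before the tweet time.
# findInterval() requires the trade times to be sorted in increasing order.
idx <- findInterval(as.numeric(tweets$time), as.numeric(trades$time))

# An index of 0 would mean no earlier trade exists; map those to NA
idx[idx == 0] <- NA

tweets$last_trade_time  <- trades$time[idx]
tweets$last_trade_price <- trades$price[idx]
tweets
```

Each tweet row now carries the time and price of the latest preceding trade, which is the "roll backward" behaviour of a rolling join.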

tweet valence and content

Two important attributes of FGC are content valence and subject matter. These will be the predictors of our models in the final stage.

Content valence measures whether a tweet is positive or negative. We’ll use text mining techniques to extract this piece of information from the tweets.
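To give a feel for what extracting valence from text involves, here is a toy lexicon-based scorer in base R. It is a deliberately simplified sketch, not the algorithm used later in this project: the word lists, the `score_valence()` helper, and the example tweets are all made up for illustration.

```r
# Hypothetical mini-lexicon; real sentiment lexicons are far larger
positive_words <- c("great", "love", "excited", "win")
negative_words <- c("sorry", "outage", "delay", "problem")

# Score = (number of positive words) - (number of negative words),
# after lower-casing and splitting on anything that is not a letter
score_valence <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(tokens %in% positive_words) - sum(tokens %in% negative_words)
}

tweets <- c(
  "We are excited to share our great new release!",
  "Sorry for the outage, we are working on the problem."
)

scores  <- vapply(tweets, score_valence, integer(1), USE.NAMES = FALSE)
valence <- ifelse(scores > 0, "positive",
                  ifelse(scores < 0, "negative", "neutral"))
data.frame(scores, valence)
```

A plain word count like this ignores negation, intensity, and punctuation, which is why dedicated sentiment tools add further rules on top of a lexicon.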

Subject matter measures whether a tweet has a consumer orientation or a competitor orientation. We’ll use pattern matching techniques, which rely on rules explicitly specified by humans, to extract this piece of information from the tweets.
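A rule-based pattern match of this kind can be sketched with regular expressions in base R. The keyword rules and example tweets below are hypothetical and chosen only for illustration; they are not the rules used in the paper.

```r
# Hypothetical rule sets written as regular expressions
consumer_pattern   <- "\\b(customers?|users?|you|your)\\b"
competitor_pattern <- "\\b(competitors?|rivals?|versus|vs)\\b"

tweets <- c(
  "Thank you to our customers for your feedback",
  "Benchmarks show we outperform every competitor",
  "Our quarterly results are out today"
)

# grepl() returns TRUE when a tweet contains at least one match for the rule
subject <- data.frame(
  tweet      = tweets,
  consumer   = grepl(consumer_pattern, tweets, ignore.case = TRUE, perl = TRUE),
  competitor = grepl(competitor_pattern, tweets, ignore.case = TRUE, perl = TRUE)
)
subject
```

Because the rules are explicit, every classification can be traced back to the keyword that triggered it, at the cost of missing tweets that express the same idea in other words.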

The two attributes are associated with the short- and long-term firm-level financial performance in different ways.

stock price impacts

The outcome variable is stock price impact. It is defined as the impact on the variance of high-frequency changes in stock price.

The price impact is further decomposed into the permanent (i.e., long-term) price impact and temporary (i.e., short-term) price impact through state-space modeling with Kalman filtering.

The study utilizes large volumes of ultra-high-frequency trading data to measure the impact of the S&P 500 IT firms’ FGC dissemination activities. The high-frequency trading data is characterized by unequal time intervals, which can be handled by using state-space modeling with a Kalman filter.
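The idea of splitting an observed price into a permanent and a temporary component with a Kalman filter, while allowing unequal gaps between trades, can be sketched as follows. This is a minimal illustration under strong assumptions, not the paper's specification: it uses a simple local level model, treats the variances `q` and `r` as known rather than estimated, and runs on simulated data. The function name `kalman_local_level()` is hypothetical.

```r
# Local level model (an assumption made for illustration):
#   observed price   y_i = m_i + e_i,      e_i ~ N(0, r)         temporary part
#   efficient price  m_i = m_{i-1} + u_i,  u_i ~ N(0, q * dt_i)  permanent part
# dt_i is the time elapsed since the previous trade, so gaps may be unequal.
kalman_local_level <- function(y, dt, q, r, m0 = y[1], p0 = 1e6) {
  n <- length(y)
  permanent <- numeric(n)
  m <- m0  # current estimate of the efficient price
  p <- p0  # variance of that estimate (a large value acts as a vague prior)
  for (i in seq_len(n)) {
    p <- p + q * dt[i]       # predict: uncertainty grows with elapsed time
    k <- p / (p + r)         # Kalman gain
    m <- m + k * (y[i] - m)  # update with the observed price
    p <- (1 - k) * p
    permanent[i] <- m
  }
  data.frame(price = y, permanent = permanent, temporary = y - permanent)
}

# Simulated trades at irregular times (in seconds); all values are made up
set.seed(1)
times <- cumsum(c(0, rexp(199, rate = 2)))
dt    <- c(0, diff(times))
m     <- 100 + cumsum(rnorm(200, sd = sqrt(0.01 * dt)))
y     <- m + rnorm(200, sd = 0.05)

fit <- kalman_local_level(y, dt, q = 0.01, r = 0.05^2)
head(fit)
```

Scaling the state variance by `dt` is what lets the filter treat a trade that follows a long gap differently from one that follows a few milliseconds later: the longer the gap, the more weight the new observation receives.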

20.2 Roadmap

We’ll follow the approach and steps documented in the paper to replicate the study, focusing on the main analyses and skipping a few auxiliary tests and robustness checks.

We’ll divide this project into four stages.

1. Collecting data from multiple sources to construct the dataset

  • Getting the historical constituent list of the IT firms in the S&P 500 index
  • Collecting tweets sent from the corporate accounts of the S&P 500 IT firms via the Twitter API
  • Acquiring ultra-high-frequency trading data from the databases in Wharton Research Data Services (WRDS)

To collect data for this project, we’ll use several databases and methods that may differ from those used in the original study. This is described in detail in the next chapter.

2. Preparing data for each task

  • Merging data from various sources
  • Cleaning and processing data for downstream tasks

Data cleaning is the “dirty work” of data analysis, a time-consuming but indispensable step before each task can be properly done.

3. Generating features from existing information to create the predictor and outcome variables

  • Determining the content valence of each tweet using the VADER rule-based algorithm, and marking each tweet as positive or negative
  • Determining the subject matter of each tweet, and marking the relevant tweets as consumer-oriented or competitor-oriented
  • Decomposing stock price impacts into long-term and short-term effects using state-space modeling with Kalman filtering

4. Estimating the temporary and permanent price impacts of tweet valence and subject matter with a panel least squares approach
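As a rough preview of the final stage, the sketch below fits a least squares model with firm dummies to a simulated panel using base R's `lm()`. Everything here is an assumption for illustration: the variable names, the simulated effect sizes, and the choice of firm fixed effects are not taken from the paper, whose exact specification is not reproduced here.

```r
# Simulated firm-by-tweet panel; every number here is made up
set.seed(42)
n_firms <- 5
n_obs   <- 40  # tweets per firm

panel <- data.frame(
  firm     = factor(rep(seq_len(n_firms), each = n_obs)),
  positive = rbinom(n_firms * n_obs, 1, 0.5),  # 1 = positive tweet
  consumer = rbinom(n_firms * n_obs, 1, 0.5)   # 1 = consumer-oriented tweet
)

# Each firm gets its own baseline level of price impact
firm_effect  <- rnorm(n_firms)[as.integer(panel$firm)]
panel$impact <- 0.5 * panel$positive - 0.3 * panel$consumer +
  firm_effect + rnorm(nrow(panel), sd = 0.1)

# Least squares with firm dummies, i.e. a firm fixed-effects specification
fit <- lm(impact ~ positive + consumer + firm, data = panel)
coef(fit)[c("positive", "consumer")]
```

The firm dummies absorb differences in baseline impact across firms, so the coefficients on `positive` and `consumer` are identified from variation within each firm.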