20 Introduction
Welcome to the new phase of our learning journey! In this module, Tweets and Stock Price Impacts, we will put the concepts and skills we have learnt into practice by solving real-world problems. This will entail breaking down seemingly daunting tasks into smaller, manageable components and tackling them step by step.
To see how this process works, we will use R to replicate a study on firm-generated content on X (formerly Twitter)[17] and its impact on the stock market:
Lacka, E., Boyd, D. E., Ibikunle, G., & Kannan, P. K. (2022). Measuring the Real-Time Stock Market Impact of Firm-Generated Content. Journal of Marketing, 86(5), 58-78. https://doi.org/10.1177/00222429211042848.
In the following sections, we will walk through this quantitative/computational marketing research without delving into the methodological and theoretical details.
We will use this case to review many tasks that we have encountered in the previous modules, and to expand on the topic of analyzing unstructured textual data, including the concepts and procedure of text analysis. We will also introduce several new topics that extend our knowledge and skill sets in data analysis with R:
- Retrieving tweets
- Representing, preprocessing, and extracting information from unstructured texts
- Using other programming languages, such as Python, in the R environment
- Representing and dealing with time series
- Merging various data sources using inequality and rolling joins
- Handling repeated tasks
Let’s get started!
20.1 Overview of the study
This paper focuses on the information technology (IT) firms included in the S&P 500 index, and examines the stock price impacts of their marketing activity on Twitter.
The study has implications for real-time marketing in social media, where multiple pieces of firm-generated content (FGC) are disseminated throughout the day.
To measure the instantaneous impact of FGC on firm value, we need the following information, drawn from multiple sources, to construct our dataset.
firm-generated content (FGC)
The first concept to be operationalized is firm-generated content (FGC). It is defined as a firm’s communications disseminated through its own online communication tools, such as Twitter.
In this case, FGC consists of the tweets generated by these firms’ corporate accounts. Each tweet’s timestamp is recorded to the second and will be matched to the corresponding timestamp of the firm’s trading activity, which is recorded at subsecond intervals.
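The timestamp matching described above can be sketched in base R. This is only an illustration under stated assumptions, not the study's procedure: the data frames `tweets` and `trades`, their column names, and all values are made up, and the sketch assumes that each tweet is paired with the last trade at or before the tweet's timestamp. This is the same idea that a backward rolling join implements.

```r
# Toy data: all names and values are hypothetical.
tweets <- data.frame(
  firm = "AAA",
  tweet_time = as.POSIXct(c("2020-01-02 10:00:05", "2020-01-02 10:00:12"), tz = "UTC")
)
trades <- data.frame(
  trade_time = as.POSIXct("2020-01-02 10:00:00", tz = "UTC") +
    c(1.2, 4.9, 5.3, 11.8, 12.4),          # subsecond trade timestamps
  price = c(100.0, 100.1, 100.3, 100.2, 100.4)
)

# findInterval() returns, for each tweet, the index of the last trade whose
# timestamp is less than or equal to the tweet's timestamp (0 if there is none).
# trade_time must be sorted in increasing order.
idx <- findInterval(as.numeric(tweets$tweet_time), as.numeric(trades$trade_time))
idx[idx == 0] <- NA  # tweets posted before the first trade get no match

# Attach the matched trade to each tweet.
matched <- cbind(tweets, trades[idx, , drop = FALSE])
matched
```

Pairing each tweet with the first trade after it instead would only require shifting the index by one; which direction is appropriate depends on the research design.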
tweet valence and content
Each piece of FGC is characterized by two attributes: content valence and subject matter. These will be the predictors in our final model.
Content valence measures whether a tweet is positive or negative. We’ll use text mining techniques to extract this piece of information from the tweets.
Subject matter measures whether a tweet has a consumer orientation or a competitor orientation. We’ll use pattern matching techniques, which rely on rules explicitly specified by humans, to extract this piece of information from the tweets.
The two attributes of FGC are associated with short- and long-term firm-level financial performance in different ways. Prior studies have shown that the subject matter of FGC is related to a company’s competitive edge, which is often not public and can be challenging to identify because it is ingrained in the company’s culture. The influence of negative valence seems to be more potent than that of positive valence, and it is therefore more frequently linked with permanent price effects. The valence of FGC may also interact with its subject matter, leading to a permanent impact on the price.
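As a rough illustration of rule-based pattern matching, the sketch below flags tweets with regular expressions in base R. The example tweets and the two keyword patterns are hypothetical and far simpler than any rule set a real study would need; they are not the rules used in the paper.

```r
# Made-up example tweets.
tweets <- c(
  "Thank you to our customers for the great feedback!",
  "Our new chip outperforms the competition in every benchmark.",
  "Join us at the developer conference next week."
)

# Hypothetical keyword rules written by a human; \\b marks a word boundary.
consumer_pattern   <- "\\b(customers?|users?|clients?)\\b"
competitor_pattern <- "\\b(competitors?|competition|rivals?)\\b"

# grepl() returns TRUE for each tweet that matches the pattern.
subject <- data.frame(
  tweet = tweets,
  consumer_oriented   = grepl(consumer_pattern, tweets, ignore.case = TRUE, perl = TRUE),
  competitor_oriented = grepl(competitor_pattern, tweets, ignore.case = TRUE, perl = TRUE)
)
subject
```

Because the rules are explicit, every classification can be traced back to the keyword that triggered it, at the cost of missing tweets that express the same idea in other words.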
stock price impacts
The outcome variable is the stock price impact. It is defined as the impact on the variance of the stock price.
Stock price impact is decomposed into the permanent (i.e., long-term) price impact and temporary (i.e., short-term) price impact using state-space modeling with Kalman filtering.
The trading actions by traders or investors who can accurately interpret the information content of FGC (i.e., informed traders) contribute to the permanent price impact. The permanent price impact is the efficient component of the price, which is information-driven.
On the other hand, those who cannot interpret the information content of FGC (i.e., uninformed traders) cause temporary price impacts because their trading activity is not related to the value of the firm. Temporary price impact is the noise, which is not related to information relevant to the firm.
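To make the decomposition more concrete, here is a minimal sketch of a Kalman filter for a local level model (a random walk plus noise), one of the simplest state-space models. It is an assumption that this simple form is a useful stand-in: the paper's specification may contain additional terms, and the two variances are treated as known here, whereas in practice they would be estimated. The function name `kalman_local_level` and the simulated prices are hypothetical.

```r
# Local level model:
#   price[t] = level[t] + noise[t],      noise ~ N(0, var_obs)    (temporary part)
#   level[t] = level[t - 1] + shock[t],  shock ~ N(0, var_state)  (permanent part)
kalman_local_level <- function(price, var_obs, var_state, m0 = price[1], p0 = 1e6) {
  n <- length(price)
  level <- numeric(n)
  m <- m0  # current estimate of the level
  p <- p0  # variance of that estimate (a large p0 acts as a vague prior)
  for (t in seq_len(n)) {
    p_pred <- p + var_state              # prediction step: uncertainty grows
    gain <- p_pred / (p_pred + var_obs)  # Kalman gain
    m <- m + gain * (price[t] - m)       # update step: move toward the new price
    p <- (1 - gain) * p_pred
    level[t] <- m
  }
  data.frame(price = price, permanent = level, temporary = price - level)
}

# Simulated prices, for illustration only.
set.seed(1)
true_level <- 100 + cumsum(rnorm(200, sd = 0.05))
price <- true_level + rnorm(200, sd = 0.10)

parts <- kalman_local_level(price, var_obs = 0.10^2, var_state = 0.05^2)
head(parts)
```

The `permanent` column is the filtered estimate of the efficient price and `temporary` is what remains, so the two columns always add up to the observed price.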
This study uses large volumes of ultra-high-frequency trading data to measure the price impact of the S&P 500 IT firms’ FGC dissemination activities.
20.2 Roadmap
We’ll follow the approach and steps documented in the paper to replicate the study, focusing on the main analyses and skipping a few auxiliary tests and robustness checks.
For our purposes, we’ll divide the project into four stages.
1. Collecting data from multiple sources to construct the dataset
- Getting the historical constituent list of the IT firms in the S&P 500 index
- Collecting tweets sent from the corporate accounts of the S&P 500 IT firms via the Twitter API
- Acquiring ultra-high-frequency trading data from the databases in the Wharton Research Data Services (WRDS)
Several of the data collection methods and data sources we use may differ from those in the original study. The next chapter describes them in detail.
2. Preparing data for each task
- Merging data from various sources
- Cleaning and processing data for downstream tasks
Data cleaning is the “dirty work” of data analysis. It is time-consuming but indispensable before each task can be properly handled.
3. Generating features from existing information to create the predictors and outcome variables
- Determining the content valence of each tweet using the VADER rule-based algorithm, and marking each tweet as positive or negative
- Determining the subject matter of each tweet, and marking the relevant tweets as consumer-oriented or competitor-oriented
- Decomposing stock price impacts into long-term and short-term effects using state-space modeling with Kalman filtering
4. Estimating the temporary and permanent price impacts of tweet valence and subject matter with a panel least squares approach
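As a preview of the final stage, the sketch below runs least squares on a small simulated panel. It assumes firm fixed effects entered as dummy variables, which is one common way to apply least squares to panel data; the paper's actual specification and control variables are not reproduced here. All variable names, effect sizes, and data are made up.

```r
set.seed(42)
n_firms <- 4
n_obs <- 50  # tweets per firm

# Simulated panel: every name and number here is hypothetical.
panel <- data.frame(
  firm = factor(rep(paste0("firm_", seq_len(n_firms)), each = n_obs)),
  negative = rbinom(n_firms * n_obs, 1, 0.3),  # 1 = negative valence
  consumer = rbinom(n_firms * n_obs, 1, 0.5)   # 1 = consumer-oriented
)
firm_effect <- c(0.2, -0.1, 0.4, 0.0)[as.integer(panel$firm)]
panel$impact <- firm_effect + 0.5 * panel$negative + 0.2 * panel$consumer +
  rnorm(nrow(panel), sd = 0.1)

# Least squares with firm dummies and a valence-by-subject-matter interaction.
fit <- lm(impact ~ negative * consumer + firm, data = panel)
coef(fit)[c("negative", "consumer", "negative:consumer")]
```

In the replication, a model of this kind would be estimated separately for the permanent and the temporary price impact.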
[17] In July 2023, Twitter was rebranded as X. In this tutorial, we will use “Twitter” to refer to the social media platform.