22 Tweet Valence and Subject Matter
To estimate the stock price impacts of intraday tweets, we need to construct our dataset by generating new variables from existing information in several data sources. These new variables include the outcome variables (temporary and permanent price impacts) and the predictors (tweet valence and subject matter).
We will tackle these tasks in this and the following chapters. In this chapter, we’ll compute valence and extract subject matter from the tweets that we have collected. In the next chapter, we will decompose firm stock price impacts into temporary and permanent components using state-space modeling, a powerful tool for analyzing time series data. Tweet valence and subject matter are critical attributes of information that have temporary and permanent price impacts.
The process of generating new features from existing information is known as feature generation.
22.1 Text analysis
Tweets are unstructured data, compared to the structured data stored in data frames that we have been working with in the past modules. To analyze tweets, we will utilize text mining techniques, which involve extracting useful insights from text with various types of statistical algorithms.
Discussions on text analysis in this chapter are based on the literature below.
Aggarwal, C. C. (2018). Machine Learning for Text. Cham: Springer. https://doi.org/10.1007/978-3-030-96623-2. [Ch1, Ch2]
Gentzkow, M., Kelly, B., & Taddy, M. (2019). Text as Data. Journal of Economic Literature, 57(3), 535-574. https://doi.org/10.1257/jel.20181020.
Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press. [Ch5]
Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Third Edition draft. [Ch1]
We limit our discussion to the English language.
steps of text analysis
In general, text analysis can be summarized in the three steps below.
(1) Representing raw text as a numerical array \(C\).
The goal is to reduce the dimensionality of the data to a manageable level prior to any statistical analysis. Numerically representing text is similar to how data practitioners might use a metric like GDP to represent the economic activity of a country.
(2) Mapping the numerical array to predicted values \(\hat{V}\) of unknown outcomes \(V\).
This is where high-dimensional statistical methods are applied. In our case, one unknown variable of interest \(V\) is the true valence of a tweet, and its predicted value \(\hat{V}\) determines whether the tweet is classified as positive or negative.
(3) Using the predicted values \(\hat{V}\) in subsequent descriptive or causal analysis.
We’ll use the predicted valence of tweets in the final stage of modeling.
22.2 Representing texts
document
The first step in constructing a text representation is to divide raw text into individual documents. This reduces raw text to a simpler representation suitable for statistical analysis.
This process is equivalent to choosing the unit of analysis. The choice is governed by the level at which the attributes of interest \(V\) are defined, and it also depends on computational cost and its consequences for model fit.
A document can be a sentence, the paragraphs of a newspaper article, or all the tweets written by a single author in one day. In our case, a document, or the unit of analysis, is a tweet about a stock timestamped at the second level.
two approaches to representing texts
There are two approaches to representing texts that are widely used in text mining applications.
text as a bag of words
The first approach is the bag-of-words model. This approach assumes that the position or order in which a language element occurs does not matter, and parses documents into distinct words or sentences. Under this assumption, a word has the same effect on classification whether it occurs as the 1st, 20th, or last word in the document.
The bag-of-words model converts raw text into a sparse multidimensional representation, where a row is a numerical vector with each element indicating the presence or count of a particular language element (token) in a document.
Using bag-of-words models, as opposed to treating text as a set of sequences, limits the extent to which we encode dependence among elements within documents.
text as a set of sequences
An alternative approach is to move beyond treating documents as counts of language tokens and to instead consider the ordered sequence of transitions between words. In contrast to the bag-of-words approach, the ordering of words matters in this representation.
This area is closely related to language modeling and natural language processing, and is used by applications that require language semantics, reasoning, and understanding.
dimension
Tasks such as classification use the bag-of-words model. Our task also falls into this category.
In the bag-of-words representation, words are treated as dimensions, with values corresponding to word frequencies. A dimension is also referred to as a term or feature.
corpus
In text analysis, a dataset is a collection of documents, referred to as a corpus.
A lexicon is the complete and distinct set of terms used to define the corpus. A sentiment lexicon, for instance, is a list of lexical features (e.g., words) which are labeled according to their semantic orientation as either positive or negative.
22.3 Preprocessing texts
To reduce raw text to a simpler representation suitable for statistical analysis, we typically reduce the number of language elements through text preprocessing.
Text requires a lot of preprocessing because it is often found in highly unstructured environments, embedded within web documents and contaminated with elements such as nonstandard words, HTML tags or other meta-attributes, and misspellings.
These effects can be ameliorated with proper preprocessing. Common tasks of text preprocessing include
- Tokenization
- Reducing complexity
- Case folding
- Removing punctuation
- Removing stop words
- Lemmatization/stemming
- Frequency normalization
- Creating the document-feature matrix
However, each of these steps requires careful decisions. One researcher’s stop words are another’s subject of interest.
tokenization
Most of what we are going to do with language relies on first segmenting words from running text, a task called tokenization.
Tokenization is the task of splitting a document into discrete words. To see how this process works, OpenAI’s Tokenizer tool visualizes how text is tokenized, and how different models may yield different tokenization results.
A token is a sequence of characters from a text that is treated as an indivisible unit for processing. Tokens can be words, chunks of words, or single characters.
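As a quick illustration, here is a minimal sketch of word tokenization, assuming the `tokenizers` package (which we also use later in this chapter for sentence tokenization) and a made-up example string.
library(tokenizers)
# split a short example string into word tokens;
# punctuation is stripped by default, and we keep the original case for now
tokenize_words("A great listen! #BuiltForChange", lowercase = FALSE)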
case folding
It is common to replace all capital letters with lowercase letters when generalization is helpful for a task and when the case of a word is not important to its semantic interpretation. That would be the case for information retrieval or speech recognition.
However, for sentiment analysis and other text classification tasks, information extraction, and machine translation, case can be quite helpful and case folding is generally not done. Consider the example of Rose as a name and rose as a flower, which are different terms in the lexicon.
punctuation
To reduce text complexity, it is also common to remove punctuation such as commas, apostrophes, periods, and the # symbol on Twitter.
stop words
Common words that have little discriminative power for the mining process may be removed to reduce the feature space. These words are important to the grammatical structure of sentences, but they typically convey relatively little meaning on their own.
In general, all articles, prepositions, and conjunctions are stop words, and pronouns are sometimes treated as stop words as well. In addition, there are language-specific stop word lists, which we should consult during text preprocessing.
However, removing all stop words outright may result in information loss. Alternatively, we may set a frequency threshold to identify very frequent words and remove only those.
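As a minimal sketch of stop word removal, assuming the `tokenizers` and `stopwords` packages (neither is required for the rest of this chapter):
library(tokenizers)
library(stopwords)
# tokenize a made-up sentence and drop English stop words
tokenize_words("The customers are happy with the product",
               stopwords = stopwords("en"))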
lemmatization/stemming
Words with common roots are usually consolidated into a single representative. The goal is to simplify the analysis by treating the variants of a word as equivalent in a document-feature matrix.
Two common strategies are lemmatization and stemming.
Lemmatization is the process of mapping words to their lemma to treat all variants of the word the same, despite their surface differences. A lemma is the dictionary form of a set of words that are related by modifications due to case, number, tense, etc.
For instance, the words am, are, and is have the shared lemma be; the words dinner and dinners both have the lemma dinner.
Lemmatization algorithms can be complex. For this reason, we sometimes make use of stemming, a cruder technique that approximates lemmatization: we simply discard affixes from words using a few rules. For instance, economic, economics, and economically are all replaced by the stem economic.
frequency-based normalization
Not all words are equally important in analytic tasks. Low-frequency words are often more discriminative than high-frequency words. However, although very rare words do convey meaning, their added computational cost in expanding the set of features often exceeds their diagnostic value.
High-frequency words often do not give much information about the task at hand. Stop words represent an extreme case of very frequent words.
Therefore, we may want to adjust the weight of certain words based on their corpus-specific frequencies. A common technique used in practice to filter out both very common and very rare words is scoring terms by term frequency–inverse document frequency (tf–idf).
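For reference, a common definition of the tf–idf weight of term \(t\) in document \(d\), over a corpus of \(N\) documents, is

\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)},
\]

where \(\text{tf}(t, d)\) is the count of \(t\) in \(d\) and \(\text{df}(t)\) is the number of documents containing \(t\). Terms that appear in almost every document get a small \(\log(N/\text{df}(t))\), and terms that barely appear get a small \(\text{tf}(t, d)\), so filtering on tf–idf scores tends to drop both extremes.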
creating the document-feature matrix
After we tokenize the corpus and reduce its complexity, we turn text into numerical data by constructing a document-feature matrix.
A document-feature matrix, also known as a term-document matrix or document-term matrix, is a numerical representation of text data. It is a matrix where each row represents a document, and each column represents a feature or term. The values in the matrix indicate the frequency of each term in each document, or some other measure of relevance, such as the tf-idf score.
The document-feature matrix is a fundamental tool for analyzing text data. By representing text data in a structured and quantitative form, it allows us to apply a variety of statistical and machine learning techniques to extract meaningful information from text.
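As a minimal sketch, assuming the `quanteda` package and three made-up documents (neither the package nor the documents appear elsewhere in this chapter), a document-feature matrix can be built as follows.
library(quanteda)
# three toy documents
toy <- c(d1 = "Customers love the new product",
         d2 = "The product faces strong competition",
         d3 = "Competition drives customer value")
# tokenize, case fold, and build the document-feature matrix:
# rows are documents, columns are terms, cells are term counts
toks <- tokens(toy, remove_punct = TRUE)
toks <- tokens_tolower(toks)
dfm(toks)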
22.4 Task-dependent, platform-specific preprocessing
Text preprocessing is task-dependent, and platform-specific.
Our corpus is a collection of tweets, which contain URLs, and platform-specific string patterns like handles and retweets. Handles are denoted by the character “@”. Retweets are denoted by the characters “RT @”.
## [1] "RT @Accenture_US: Can you solve this?\n\nDrop your best guess below 🤔🧠💡 https://t.co/vFkgAhEbuq"
## [2] "@wwd A great listen! 👏"
## [3] "Thrivers know that galvanizing their organization around their collective difference means knowing where to focus, who to rally, and are 67% more likely to provide critical input to growth strategies than their peers. https://t.co/zLA0kBj7mV https://t.co/XXLqC1nf90"
Before we apply the VADER lexicon to our corpus to compute tweet valence, we need to process the text to make the documents ready for text mining. However, we do not necessarily need to utilize all the techniques introduced earlier.
For instance, we will not remove punctuation, because punctuation may be used to signal increased sentiment intensity (e.g., “Good!!!”): VADER treats the exclamation mark as increasing the magnitude of the intensity without modifying the semantic orientation. Likewise, we will not map everything to lowercase, because word shape may be used to signal emphasis (e.g., using ALL CAPS for words or phrases): VADER treats capitalization, specifically ALL-CAPS, as increasing the magnitude of the sentiment intensity without affecting the semantic orientation.
reducing text complexity
To reduce text complexity, we perform the procedure below, with the help of regular expressions, to prepare our corpus for valence extraction.
A regular expression is a language for specifying text search strings. A regular expression search function will search through the corpus, and return all texts that match a specified pattern.
(1) Remove URLs using pattern matching and replacement.
In the regular expression `http\\S+\\s*`, the pattern to be matched is specified by the rules below.

- `\\` means escape in a character string
- `S` means non-space
- `s` means space
- `+` means at least one time
- `*` tells the computer to match the preceding character 0 or more times

Therefore,

- `\\S+` matches one or more non-space characters
- `\\s*` matches any number of whitespace characters
URLs, indicated by strings beginning with “http”, together with the whitespace characters following the URLs, are removed.
## [1] "RT @Accenture_US: Can you solve this?\n\nDrop your best guess below 🤔🧠💡 "
## [2] "@wwd A great listen! 👏"
## [3] "Thrivers know that galvanizing their organization around their collective difference means knowing where to focus, who to rally, and are 67% more likely to provide critical input to growth strategies than their peers. "
(2) Remove the beginning parts of the retweets.
We will remove “RT ”, and the rest in step 3 below.
In the regular expression `\\b+RT\\s*`, the metacharacter `\\b` matches at the beginning or end of a word.
## [1] "@Accenture_US: Can you solve this?\n\nDrop your best guess below 🤔🧠💡 "
## [2] "@wwd A great listen! 👏"
## [3] "Thrivers know that galvanizing their organization around their collective difference means knowing where to focus, who to rally, and are 67% more likely to provide critical input to growth strategies than their peers. "
(3) Remove mentions “@handle”.
Handles are removed.
## [1] "Can you solve this?\n\nDrop your best guess below 🤔🧠💡 "
## [2] "A great listen! 👏"
## [3] "Thrivers know that galvanizing their organization around their collective difference means knowing where to focus, who to rally, and are 67% more likely to provide critical input to growth strategies than their peers. "
Now compare the output from each line of code we ran, where the URL, the RT marker, and the handle are removed step by step, until we get the clean `text`.
## full_text
## 632 RT @Accenture_US: Can you solve this?\n\nDrop your best guess below 🤔🧠💡 https://t.co/vFkgAhEbuq
## no_url
## 632 RT @Accenture_US: Can you solve this?\n\nDrop your best guess below 🤔🧠💡
## no_rt
## 632 @Accenture_US: Can you solve this?\n\nDrop your best guess below 🤔🧠💡
## text
## 632 Can you solve this?\n\nDrop your best guess below 🤔🧠💡
sentence tokenization
Another task in text preprocessing before we apply the VADER lexicon to our corpus is sentence tokenization, required by VADER.
The most useful cues for segmenting a text into sentences are punctuation, such as periods, question marks, and exclamation points. The period character “.” can be ambiguous between a sentence boundary marker and a marker of abbreviations. Therefore, in general, sentence tokenization methods work by first deciding (based on rules or machine learning) whether a period is part of the word or is a sentence-boundary marker. An abbreviation dictionary can help to determine whether the period is part of a commonly used abbreviation.
We will use the package `tokenizers` for sentence tokenization; it can be found on the CRAN Task View: Natural Language Processing. Following its vignette, sentence tokenization can be done using the function `tokenize_sentences()`.
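A minimal sketch, assuming the cleaned tweets are stored in the column `tweets$text` from the previous step:
library(tokenizers)
# split each cleaned tweet into sentences; the result is a list
# with one character vector of sentences per tweet
sent_tokens <- tokenize_sentences(tweets$text)
head(sent_tokens, 5)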
## [[1]]
## [1] "Did you know?"
## [2] "COVID-19 has eroded years of progress towards workplace gender equality."
## [3] "😱 🤔 Tune in to Change Conversations, where our hosts chat with leaders from and about ending the 'Shecession':"
##
## [[2]]
## [1] "In Norway, one innovative waste management company is breaking down plastic using enzymes, sorting papers with sensors—and revolutionizing recycling."
## [2] "The full story is in our podcast #BuiltForChange:"
##
## [[3]]
## [1] "From selling clothes on Depop to building Substack audiences, people are using technology to build financial freedom and expand their agency."
## [2] "Here's why this matters for brands: #FjordTrends"
##
## [[4]]
## [1] "Why are nearly 5,300,000 women out of work?"
## [2] "😱 🤔 Tune in to part 1 of a special 2-part episode of Change Conversations, where our hosts chat with leaders from and to discuss Women, Work, and COVID:"
##
## [[5]]
## [1] "Not Wordle, just Accenture ⬜️⬜️🟪⬜️⬜️⬜️⬜️ ⬜️⬜️⬜️🟪⬜️⬜️⬜️ ⬜️⬜️⬜️⬜️🟪⬜️⬜️ ⬜️⬜️⬜️⬜️⬜️🟪⬜️ ⬜️⬜️⬜️⬜️🟪⬜️⬜️ ⬜️⬜️⬜️🟪⬜️⬜️⬜️ ⬜️⬜️🟪⬜️⬜️⬜️⬜️"
22.5 Computing valence
After text preprocessing, we can then use statistical methods to connect counts to attributes. Two attributes of our documents, tweets, are valence and subject matter.
Mapping a document-term matrix to predictions of an attribute can be roughly divided into four categories:
- Dictionary-based methods
- Text regression methods
- Generative models
- Word embeddings
We will utilize dictionary-based methods. These methods do not involve statistical inference at all: they specify the predicted value \(\hat{V_i}\) of the outcome of interest \(V_i\) as a function \(f(c_i)\) of the bag-of-words representation \(c_i\), based on a prespecified dictionary of terms capturing particular categories of text.
In our case, \(V_i\) is the valence of a tweet, and \(f(c_i)\) is a compound score computed by summing the valence scores of each word in the lexicon, which we discuss below.
valence
Valence is a measure of the emotional content of the text and is often used in sentiment analysis to classify text as positive, negative, or neutral. Sentiment analysis is the task of extracting the positive or negative orientation that a writer expresses in a text.
One common method for measuring valence is through the use of a sentiment lexicon. The most basic lexicons label words or phrases along one dimension of semantic variability with a positive or negative valence score.
We will rely on the VADER rule-based algorithm to determine the valence of the tweets.
VADER
VADER stands for Valence Aware Dictionary and sEntiment Reasoner. VADER is specifically attuned to sentiments expressed in social media.
Hutto, C., and Gilbert, E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 8, No. 1, pp. 216-225). https://doi.org/10.1609/icwsm.v8i1.14550
The VADER lexicon is sensitive to both the polarity (positive/negative) and the intensity of sentiments (on a scale from –4 to +4).
VADER also performs very well with emojis, slang, and acronyms in sentences, which are known to be important for sentiment analysis of social media text.
Moreover, the VADER lexicon has been validated by humans, making it a gold-standard sentiment lexicon for microblog-like contexts. Manual auditing is especially important for dictionary methods: validity hinges on the assumption that a particular function of text features (e.g., counts of positive or negative words) is a valid predictor of the true latent variable \(V\).
computing compound scores
Now, let’s connect counts to attributes through a function of the text representation. In our case, this function is the compound score.
The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive).
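In the reference VADER implementation, the normalization applied to the summed, rule-adjusted score \(x\) is

\[
\text{compound} = \frac{x}{\sqrt{x^2 + \alpha}}, \qquad \alpha = 15 \text{ by default},
\]

which maps any sum into the interval \((-1, 1)\).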
Typical threshold values are:
- positive sentiment: compound score >= 0.05
- neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
- negative sentiment: compound score <= -0.05
vader
We will use the function `get_vader()` from the R package `vader` to help us compute the compound scores.
library(vader)
# create a function get_score
# to sum the valence scores for each sentence in a tweet
# x: each tweet
# y: each sentence
get_score <- function(x) {
  sum(sapply(x, function(y) as.numeric(get_vader(y)["compound"])))
}
# apply the function to all tweets in our corpus
score_sum <- sapply(sent_tokens, get_score)
# get the number of sentences in each tweet
len <- sapply(sent_tokens, length)
# get an average score for each tweet
sent_compound <- score_sum / len
# compound scores
sent_compound
## [1] 0.1403333 0.0130000 0.3920000 0.2010000 0.0000000
The code above goes through the tokenized sentences of each tweet, stored in the list `sent_tokens`, sums the compound scores of the sentences, and divides by the number of sentences to obtain a tweet-level score.
`get_vader()` returns several objects. The `pos`, `neu`, and `neg` scores are ratios for the proportions of text that fall into each category.
## word_scores compound pos
## "{0, 0, 0, 0, 0, 1.8, 0, 0, 0, 0}" "0.421" "0.237"
## neu neg but_count
## "0.763" "0" "0"
After applying `get_vader()` to the entire corpus, we should conduct a quick manual check of the results. It turned out that `get_vader()` failed to process a dozen sentences. However, the sample code found on the Python library’s GitHub repository handled those failed sentences quickly and well.
This is a good time to switch to the Python VADER library for the specific task of computing the valence scores. We can do so by locating the chunk of code that does the job in the sample code, and copying and adapting it to run in the R environment.
22.6 Using Python with R
There are different approaches to using Python with R to solve a problem that R cannot handle well. We can export outputs from Python and import them into R. Alternatively, we can run Python directly in the R environment.
We will take the latter approach and use the package `reticulate` to work with Python in R Markdown.
reticulate
The package `reticulate` provides an R interface to Python modules, classes, and functions.
Before we can use `reticulate`, the first step is to install Python. Next, in an R Markdown file, we will load the package `reticulate` in an R code chunk.
By default, `reticulate` uses the version of Python found on our `PATH`, which can be located with `Sys.which("python")`. See the reticulate documentation for more on Python version configuration.
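A minimal sketch of such a setup chunk (the `use_python()` call is optional, and the path shown is a placeholder):
library(reticulate)
Sys.which("python")              # the Python that reticulate picks up by default
# use_python("/path/to/python")  # optionally point reticulate to a specific installation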
Then, in Python code chunks, we will run the script below to compute valence scores by adapting the sample code.
import numpy as np
import pandas as pd
news = pd.read_csv('sample.csv')
import nltk
from nltk import tokenize
nltk.download('punkt')
news['sentence_list'] = news.apply(lambda x: tokenize.sent_tokenize(x['text']), axis=1)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
# loop over tweets: average the compound score across the sentences in each tweet
result = []
for ind in news.index:
    paragraphSentiments = 0.0
    for sentence in news['sentence_list'][ind]:
        vs = analyzer.polarity_scores(sentence)
        paragraphSentiments += vs["compound"]
    d = paragraphSentiments / len(news['sentence_list'][ind])
    result.append(d)
news["comp_score"] = result
Note that this script takes care of both sentence tokenization and computing tweet valence.
To retrieve the Python outputs and use them in the R environment, we will create a new R code chunk and use `py$` to extract them. The result can be saved to a new R object.
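A minimal sketch, assuming the Python data frame from the chunk above is named `news` and that we save it to the R object `tweets`:
# copy the pandas DataFrame from the Python session into an R data frame
tweets <- py$news
# the tokenized sentences and compound scores computed in Python
head(tweets$sentence_list)
summary(tweets$comp_score)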
Tokenized sentences are stored in the column `tweets$sentence_list`.
## [[1]]
## [1] "Did you know?"
## [2] "COVID-19 has eroded years of progress towards workplace gender equality."
## [3] "😱 🤔 \n\nTune in to Change Conversations, where our hosts chat with leaders from and about ending the 'Shecession':"
##
## [[2]]
## [1] "In Norway, one innovative waste management company is breaking down plastic using enzymes, sorting papers with sensors—and revolutionizing recycling."
## [2] "The full story is in our podcast #BuiltForChange:"
##
## [[3]]
## [1] "From selling clothes on Depop to building Substack audiences, people are using technology to build financial freedom and expand their agency."
## [2] "Here's why this matters for brands: #FjordTrends"
##
## [[4]]
## [1] "Why are nearly 5,300,000 women out of work?"
## [2] "😱 🤔 \n\nTune in to part 1 of a special 2-part episode of Change Conversations, where our hosts chat with leaders from and to discuss Women, Work, and COVID:"
##
## [[5]]
## [1] "Not Wordle, just Accenture\n\n⬜️⬜️🟪⬜️⬜️⬜️⬜️\n⬜️⬜️⬜️🟪⬜️⬜️⬜️\n⬜️⬜️⬜️⬜️🟪⬜️⬜️\n⬜️⬜️⬜️⬜️⬜️🟪⬜️\n⬜️⬜️⬜️⬜️🟪⬜️⬜️\n⬜️⬜️⬜️🟪⬜️⬜️⬜️\n⬜️⬜️🟪⬜️⬜️⬜️⬜️"
##
## [[6]]
## [1] "Our new report reveals 4 key tenets Continuum Competitors share to unleash competitiveness on the #cloud:"
Compound scores are stored in `tweets$comp_score`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7717 0.0000 0.2120 0.2260 0.3818 0.9545
classifying tweets
Following this guide, we classify tweets into two categories based on their compound scores.
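A minimal sketch of this classification, using the thresholds introduced earlier (the dummy variable names `pos` and `neg` are assumptions):
# positive if the compound score is at or above 0.05,
# negative if it is at or below -0.05
tweets$pos <- as.integer(tweets$comp_score >= 0.05)
tweets$neg <- as.integer(tweets$comp_score <= -0.05)
table(tweets$pos)
table(tweets$neg)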
Positive:
##
## 0 1
## 998 2335
Negative:
##
## 0 1
## 3151 182
22.7 Determining subject matter
The remaining task in our text analysis is to determine the subject matter of each tweet.
For this task, we will follow the article below, mentioned in the original study, and the Dictionary of Phrases Used for Content Analysis in its appendix, to decide whether a tweet is customer-oriented or competitor-oriented.
Saboo, A. R., & Grewal, R. (2013). Stock Market Reactions to Customer and Competitor Orientations: The Case of Initial Public Offerings. Marketing Science, 32(1), 70-88. https://doi.org/10.1287/mksc.1120.0749.
Dictionary phrases for customer orientation include: assist customer, customer demand, customer expectations, customer feedback, customer need, customer opinion, customer preferences, customer request, customer requirement, customer satisfaction, customer suggestion, focus customer, help customer, improve customer efficiency, listen customer, listen to customer, maintain customer relation, product development customer, provide value customer, reduce customer cost, serve customer, support customer, work closely customer, work with customer.
Dictionary phrases for competitor orientation include: competit action, competit threat, compet effective, competit, competit strength, compet strateg, barrier entry, compet advantage.
We may think of using fuzzy string matching with `agrepl()` to search for approximate matches to a pattern within each tweet. However, manual auditing of the returned results showed that competitors did not seem to be explicitly mentioned in our corpus in the sense of a firm’s rivals, in contrast to the strong competitor orientation reflected in the sample dictionary. Strictly following the procedure might therefore yield no qualifying tweets.
We will instead use simple pattern matching and mark a tweet as “competitor-oriented” if the substring “compet” appears in it.
We then manually evaluate the outputs from this step. This reduces the process of determining tweet subject matter to simple pattern matching followed by manual validation.
In other words, if a tweet is flagged as `TRUE` by the initial pattern matching, does it match any of the phrases in the “competitor” dictionary? If not, the final decision for that tweet is that it is not competitor-oriented.
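A minimal sketch of the competitor flag and the manual review step (the variable name `comp` is an assumption):
# flag tweets that contain the substring "compet"
tweets$comp <- grepl("compet", tweets$text)
# print each flagged tweet, with its index, for manual evaluation
for (i in which(tweets$comp)) {
  print(i)
  cat(tweets$text[i], "\n")
}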
The same strategy can be applied to determining “customer orientation”.
# flag tweets that mention "customer"
tweets$cust <- grepl("customer", tweets$text)

# print each flagged tweet, with its index, for manual review
for (i in which(tweets$cust)) {
  print(i)
  cat(tweets$text[i], "\n")
}
# after manual review, unflag a tweet that does not match the dictionary,
# e.g. for a reviewed index i:
# tweets$cust[i] <- 0
At this point, we have generated four dummy variables on tweet valence and subject matter, which we will use in the final models.