21 Collecting Data

In this chapter, we discuss how to collect data from various sources to construct our dataset for analysis.

The original study covered the time period between January 2013 and August 2018. The demo tweets discussed here were collected from Twitter in December 2022. For the current project, we will use data from January 2022. The number of transactions that a single trading day can produce is staggering, given the nature and volume of ultra-high-frequency trading data. For the 60+ stocks in our sample, this can range from 4 to 7 million records per day. Consequently, even a month’s worth of data would be more than sufficient for our objective of working with large datasets.

Below we show how to get the historical constituent list of the IT firms in the S&P 500 indices, collect tweets sent from the corporate accounts of these firms for the firm-generated content, and acquire the ultra-high-frequency trading data of these firms.

21.1 Constituent list of the S&P 500 IT firms

To construct our dataset, the first step is getting the historical constituent list of the IT firms in the S&P 500 indices.

We’ll follow the post “Where can I find constituent lists for the S&P indices?” created by the NYU Libraries to retrieve the historical constituent list of the S&P 500 IT firms from the database Global Financial Data.

Here are the steps to take:

  • Choose Constituent Membership from the left-hand menu
  • Category: Major US Indices
  • Index: S&P 500 Information Technology
  • For historical lists, select As of a Specific Date and enter the desired year.

There are 65 records in the downloaded spreadsheet. The two columns that we need are Ticker and Name (company name).

Ticker   Name
MCHP     Microchip Technology Inc
INTU     Intuit, Inc.
IT       Gartner, Inc.
SWKS     Skyworks Solutions, Inc.
TRMB     Trimble Navigation Ltd.
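
A minimal sketch of loading the downloaded spreadsheet into R and keeping these two columns, assuming the file was exported under the hypothetical name sp500_it_constituents.xlsx:

library(readxl)

# hypothetical file name for the spreadsheet downloaded from Global Financial Data
constituents <- read_excel("sp500_it_constituents.xlsx")

# keep the two columns we need: Ticker and Name
constituents <- constituents[, c("Ticker", "Name")]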

We will use this information in two ways. (1) Using the company name to search for each firm’s corporate account on Twitter and record its username, also known as its handle. (2) Using the ticker to retrieve stock transactions from databases on the platform Wharton Research Data Services (WRDS). These steps are described below. Towards the end of this project, we’ll merge the tweets and transaction data for modeling.

21.2 Firm-generated content (FGC) on Twitter

To collect FGC, we’ll first go to Twitter, search for these corporate accounts one by one, and record their handles for later use. A handle is the username that begins with the “@” symbol and is unique to an account.

Ticker   Name                        Handle
MCHP     Microchip Technology Inc.   MicrochipTech
INTU     Intuit, Inc.                Intuit
IT       Gartner, Inc.               Gartner_inc
SWKS     Skyworks Solutions, Inc.    skyworksinc
TRMB     Trimble Navigation Ltd.     TrimbleCorpNews

The search resulted in 64 records. One company was missing: Henry (Jack) & Associates.

We can then use the Twitter API to collect the tweets sent from these corporate accounts using their handles.

Twitter API

API stands for application programming interface.

APIs are mechanisms that enable two software components to communicate with each other using a set of definitions and protocols. “Data APIs” are used to share and manage data between a database and data users.

API endpoints are the digital locations where information is sent and received between communicating systems.

Twitter API endpoints

The Twitter API is a set of programmatic endpoints that can be used to search and retrieve tweets, users, lists, spaces, and media, among other objects.

The API endpoints available to standard use cases include:

  • Recent search endpoint retrieves filtered public tweets posted over the last week by specifying the keywords and operators. More on search tweets.
  • User tweet timeline endpoint retrieves up to 3200 most recent public tweets from a specific user’s timeline. More on timelines.
  • Filtered stream endpoint collects tweets on a topic in real-time by specifying filters for the required data. More on filtered stream.
  • Follows lookup endpoint supports exploring and analyzing relationships between users for use cases around network analysis. More on follows.

We’ll leverage the timeline endpoint to collect FGC. The limit of approximately 3200 tweets is sufficient for collecting historical tweets within a year’s timeframe.

getting access to Twitter APIs

Starting February 2023, Twitter no longer supports free access to the Twitter API. Access tiers such as Standard (v1.1), Essential (v2), and Elevated (v2) have been deprecated since March 2023.

Under the new tiers, the free version only allows posting tweets, not retrieving them. Therefore, unfortunately, we cannot show the tweet retrieval process step by step in this tutorial. Instead, we’ll outline the general steps to follow and use tweets downloaded before February 2023 for demonstration.

In general, there are several prerequisites to successfully connecting to Twitter’s servers.

  1. Sign up for a developer account by following the instructions on its Developer Platform, and wait for approval.

  2. Navigate to the Developer Portal to create a new Project or App, once your request is approved. The developer portal is where we manage our access to Twitter APIs.

Every request sent to the Twitter API has to go through an “App”. Apps and Projects are ways to organize our work with the Twitter Developer Platform by use case. Each Project can contain one or multiple Apps depending on the access level.

  3. Save the authentication credentials in a secure location.

Once we have access and have created a Project and App, we will be able to find or generate the following credentials within our developer App:

  • API Key and Secret: Essentially a user name and password that represent our App when making API requests.
  • Bearer Token: Authenticates requests on behalf of our developer App.
  • Access Token and Secret: User-specific credentials used to authenticate API requests. They specify the Twitter account the request is made on behalf of.
  • Client ID and Client Secret: Credentials used to obtain a user Access Token with OAuth 2.0 authentication.

If we forget to save these credentials, we can always regenerate new values:

  • Log in to our Twitter account on the Developer Platform.
  • Navigate to the Twitter App dashboard and go to the Twitter App for which we would like to generate access tokens.
  • Navigate to the “Keys and tokens” tab.
  • Look for “Consumer Keys”, “Authentication Tokens”, and “OAuth 2.0 Client ID and Client Secret” on this page.
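
To keep these credentials out of our scripts, one common approach (a sketch, not part of the original workflow; the variable names below are hypothetical) is to store them as environment variables, for example in ~/.Renviron, and read them with Sys.getenv():

# in ~/.Renviron (one entry per line, hypothetical variable names):
# TWITTER_BEARER_TOKEN=...
# TWITTER_CLIENT_ID=...
# TWITTER_CLIENT_SECRET=...

# in R, read the values without hard-coding them in the script
bearer_token  <- Sys.getenv("TWITTER_BEARER_TOKEN")
client_id     <- Sys.getenv("TWITTER_CLIENT_ID")
client_secret <- Sys.getenv("TWITTER_CLIENT_SECRET")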

retrieving tweets with rtweet

rtweet is an R wrapper for multiple Twitter API endpoints.

library(rtweet)

setup

Before we can use the package rtweet to connect to Twitter, we first need to complete authentication.

Following the guide, we first need to set up a client to enable OAuth 2.0 in our App’s authentication settings in the developer portal.

The OAuth credentials can be found in the “Keys and tokens” tab of your app.

# fill in the Client ID, Client Secret, and App name from the developer portal
client <- rtweet_client(client_id = "", 
              client_secret = "", 
              app = "", 
              scopes = NULL)

We’ll then authorize it via rtweet_oauth2(), which opens our default browser and asks us to authorize the client.

user_oauth2 <- rtweet_oauth2(client = client, scopes = NULL)
## Waiting for authentication in browser...
## Press Esc/Ctrl + C to abort
auth_as(user_oauth2)
auth_save(user_oauth2, "oauth2_authentication")
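
In a later R session, the saved authentication can be reloaded by the name we gave it above, without repeating the browser step:

# reload the OAuth 2.0 token saved with auth_save() above
auth_as("oauth2_authentication")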

To use app-based authentication, run this code:

auth <- rtweet_app()

We’ll be prompted by a message “Please enter your bearer token:”, where we enter the bearer token of our Twitter app.

After that, we can run some quick tests using small examples to see if the connection has been successfully established.

tweet_search_recent("#rstats", n = 10)

downloading tweets and metadata

I used Twitter’s Standard v1.1 API to retrieve tweets in December 2022, before it was deprecated. The code below for retrieving user timelines and metadata therefore no longer applies and is shown here for documentation purposes. Please refer to the rtweet manual and Twitter developer guides for updates if you decide to proceed with the paid tiers. In addition, the data format of tweets retrieved using v1.1 needs to be mapped to the data format of Twitter API v2.

Let’s first create a vector to store the handles of the S&P 500 IT firms’ corporate accounts.

cusername <- c("Agilent","Accenture","Adobe","ADI_News","ADP","autodesk","Applied4Tech",
               "AMD","AristaNetworks","ANSYS","Broadcom","Broadridge","Ceridian",
               "Cadence","Ciena","salesforce","Cisco","Cognizant","DXCTechnology",
               "Enphase","EPAMSystems","FastenalCompany","FactSet","F5","FISGlobal",
               "Fiserv","FLEETCOR","Fortinet","GenDigitalInc","Corning","GlobalPayInc",
               "intel","Intuit","Gartner_inc","JuniperNetworks","Keysight","KLAcorp",
               "LeidosInc","L3HarrisTech","LamResearch","Mastercard","MicrochipTech",
               "monolithicpower","Microsoft","Moto","MicronTech","ServiceNow",
               "nvidia","NXP","onsemi","Oracle","Paycom","Paychex","PTC","PayPal",
               "Qualcomm","QorvoInc","skyworksinc","Teradyneinc","TrimbleCorpNews",
               "TXInstruments","tylertech","Visa","ZebraTechnology")

For each handle, I downloaded its most recent 3200 tweets using the function get_timeline(). n defines the desired number of results to return; setting n = Inf will download as many results as possible.

for(i in seq_along(cusername)) {
  # download the timeline of one corporate account;
  # below we show how to store each result before it is overwritten
  temp <- 
    get_timeline(
      user = cusername[i],
      n = Inf,
      retryonratelimit = TRUE,
      token = auth
    )
}

There are rate limits on the number of requests that can be made to Twitter API endpoints. Most Twitter rate limits refresh every 15 minutes. Setting retryonratelimit = TRUE above tells get_timeline() to wait and resume automatically whenever a rate limit is reached.


We’ll download the tweets into a new environment. Keeping the downloaded tweets in their own environment, separate from other objects such as temporary variables like the loop iterator i, makes them easier to work with. This idea applies generally, even when we work with other web services to download data.

We’ll create the new environment with base R’s new.env(), and use the package rlang to inspect and manage it.

We’ll store tweets and their metadata from each corporate account separately in individual data frames. We use assign() to name each data frame with the handle of each account (e.g., Adobe, nvidia). assign() assigns a value to a name in an environment.

library(rlang)

# create a new environment
ne <- new.env()

for(i in seq_along(cusername)) {
  temp <- 
    get_timeline(
      user = cusername[i],
      n = Inf,
      retryonratelimit = TRUE,
      token = auth
    )
  assign(cusername[i], temp, envir = ne)
}

Later, we can access and set elements of this environment with the operators $ and [[, in the same way as a list. We can use env_print() and env_names() to get more information about the environment.

More discussions on environments.
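
For example, assuming the loop above has run, we could inspect the environment and pull out one account’s tweets like this:

env_names(ne)   # handles of the accounts stored in ne
env_print(ne)   # a summary of the environment and its bindings

head(ne$Adobe)  # the data frame of tweets downloaded from @Adobe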

Each downloaded tweet and its metadata are stored in 43 columns.

##  [1] "created_at"                    "id"                            "id_str"                       
##  [4] "full_text"                     "truncated"                     "display_text_range"           
##  [7] "entities"                      "source"                        "in_reply_to_status_id"        
## [10] "in_reply_to_status_id_str"     "in_reply_to_user_id"           "in_reply_to_user_id_str"      
## [13] "in_reply_to_screen_name"       "geo"                           "coordinates"                  
## [16] "place"                         "contributors"                  "is_quote_status"              
## [19] "retweet_count"                 "favorite_count"                "favorited"                    
## [22] "retweeted"                     "possibly_sensitive"            "lang"                         
## [25] "quoted_status"                 "text"                          "favorited_by"                 
## [28] "scopes"                        "display_text_width"            "retweeted_status"             
## [31] "quoted_status_id"              "quoted_status_id_str"          "quoted_status_permalink"      
## [34] "quote_count"                   "timestamp_ms"                  "reply_count"                  
## [37] "filter_level"                  "metadata"                      "query"                        
## [40] "withheld_scope"                "withheld_copyright"            "withheld_in_countries"        
## [43] "possibly_sensitive_appealable"

A sample record is shown below.

## # A tibble: 1 × 43
##   created_at               id id_str              full_text       truncated display_text_range entities     source
##   <dttm>                <dbl> <chr>               <chr>           <lgl>                  <dbl> <list>       <chr> 
## 1 2022-03-31 00:57:24 1.51e18 1509213214470455300 “Women need to… FALSE                    206 <named list> "<a h…
## # ℹ 35 more variables: in_reply_to_status_id <lgl>, in_reply_to_status_id_str <lgl>, in_reply_to_user_id <int>,
## #   in_reply_to_user_id_str <chr>, in_reply_to_screen_name <chr>, geo <list>, coordinates <list>, place <list>,
## #   contributors <lgl>, is_quote_status <lgl>, retweet_count <int>, favorite_count <int>, favorited <lgl>,
## #   retweeted <lgl>, possibly_sensitive <lgl>, lang <chr>, quoted_status <list>, text <chr>, favorited_by <lgl>,
## #   scopes <lgl>, display_text_width <lgl>, retweeted_status <lgl>, quoted_status_id <lgl>,
## #   quoted_status_id_str <lgl>, quoted_status_permalink <lgl>, quote_count <lgl>, timestamp_ms <lgl>,
## #   reply_count <lgl>, filter_level <lgl>, metadata <lgl>, query <lgl>, withheld_scope <lgl>, …

cleaning downloaded tweets

We’ll work on a subset of the downloaded data, and keep the variables created_at, full_text, retweet_count, and favorite_count.

created_at records the UTC time when a tweet was created, which will be used to match corresponding ultra-high-frequency stock trading activity data.

## # A tibble: 1 × 1
##   created_at         
##   <dttm>             
## 1 2022-03-31 00:57:24

We’ll create a new variable timestamp that converts created_at to datetime class (POSIXct) in the EST timezone. We’ll then filter the downloaded tweets to the time span of 01/01/2022 - 01/31/2022.

We will use retweet_count and favorite_count as control variables in the final models. In the original study, the number of followers was used as a control variable. However, this information was not provided by the Standard v1.1 user timeline API endpoint. Therefore, we will use retweet_count and favorite_count as proxies.

We’ll keep only the tweets written in English; lang records the language of a tweet.

We’ll use a loop to clean the tweets downloaded from each corporate account, stored in separate data frames in the environment ne. At the end of the loop, we’ll bind the data frames that have been processed individually.

d1 <- as.POSIXct("2022-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "EST")
d2 <- as.POSIXct("2022-01-31 23:59:59", format = "%Y-%m-%d %H:%M:%S", tz = "EST")

nams <- ls(envir = ne)   # handles of the accounts stored in ne
news <- data.frame()     # will hold the cleaned tweets from all accounts

for (i in seq_along(nams)){
  df <- ne[[nams[i]]]
  df$handle <- nams[i]
  # convert created_at to POSIXct and express it in the EST timezone
  df$timestamp <- as.POSIXct(df$created_at, format = "%Y-%m-%d %H:%M:%S")
  attr(df$timestamp, "tzone") <- "EST"
  
  # keep January 2022 tweets written in English, and only the variables we need
  df <- subset(df, 
               subset = timestamp >= d1 & timestamp <= d2 & lang == "en", 
               select = c(timestamp, full_text, handle, retweet_count, favorite_count))
  
  news <- rbind(news, df)
}

ls(envir = ne) lists the objects in the environment ne, that is, the data frames where we store the downloaded tweets from each corporate account. ne[[nams[i]]] extracts each data frame from ne.
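
A quick sanity check on the combined data frame (a sketch):

# number of cleaned January 2022 tweets per corporate account
table(news$handle)

# rows and columns of the combined data frame
dim(news)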

21.3 Intraday transactions data

Now we’ll acquire ultra-high-frequency trading data from Wharton Research Data Services (WRDS). WRDS is a comprehensive business data management system which provides a variety of financial, economic and marketing data sources.

The database we’ll use in WRDS is TAQ - Millisecond Consolidated Trades. This database contains intraday transactions data for all securities listed on the New York Stock Exchange (NYSE), American Stock Exchange (AMEX), the Nasdaq National Market System (NMS), and all other U.S. equity exchanges.

To retrieve the data we need, the query form asks for the information below.

  1. Date range: 2022-01-01 to 2022-01-31

  2. Time range

  • Beginning: 09:30:00
  • Ending: 16:05:00

9:30 a.m. to 4:00 p.m. ET is the NYSE Core Trading Session. We’ll use only NYSE stocks to avoid any possibility of the results being influenced by differences in trading protocols.

The last daily trade is assumed to occur no later than 16:05. Transactions are commonly reported up to five minutes after the official close at 16:00.

  3. Company codes

We’ll paste the tickers into a plain text file (.txt), one ticker per line, and upload the file to WRDS to download the transaction data for the specified companies (see the sketch after this list).

  4. Query variables
  • DATE: Date of trade
  • TIME_M: Time of trade
  • EX: Exchange that issued the trade
  • SYM_ROOT: Security symbol root
  • TR_SCOND: Trade Sale Condition
  • SIZE: Volume of trade
  • PRICE: Price of trade
  • TR_SEQNUM: Trade Sequence Number
  5. Query output
  • Output Format: tab-delimited text (*.txt)
  • Compression Type: zip (*.zip)
  • Date Format: YYYY-MM-DD (e.g., 1984-07-25)
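
A minimal sketch of writing the ticker symbols to such a file from R (only a few tickers are shown; tickers.txt is a hypothetical file name):

# a few tickers from the constituent list; the full vector holds the 60+ symbols
tickers <- c("MCHP", "INTU", "IT", "SWKS", "TRMB")

# one ticker per line, as required for the WRDS company-code upload
writeLines(tickers, "tickers.txt")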

The output format is tab-delimited text compressed into a zip file, which matters for how we will read these files into R. The output file is quite large and requires high-performance computing resources to work with.
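
As a sketch of the reading step (assuming a hypothetical download named taq_jan2022.zip containing taq_jan2022.txt), a tab-delimited file inside a zip archive can be read through a connection created by unz():

# open a connection to the tab-delimited file inside the zip archive
con <- unz("taq_jan2022.zip", "taq_jan2022.txt")

# read the transactions; for files of this size, a faster reader such as
# data.table::fread() on the unzipped file may be preferable
taq <- read.delim(con, stringsAsFactors = FALSE)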