21 Collecting Data

In this chapter, we discuss how to collect data from various sources to construct our dataset for analysis.

The original study covered the time period between January 2013 and August 2018. The demo tweets discussed here were collected from Twitter in December 2022. For the current project, we will use data from January 2022. The number of transactions that a single trading day can produce is staggering, given the nature and volume of ultra-high-frequency trading data. For the 60+ stocks in our sample, this can range from 4 to 7 million records per day. Consequently, even a month’s worth of data would be more than sufficient for our objective of working with large datasets.

Below we show how to get the historical constituent list of the IT firms in the S&P 500 indices, collect tweets sent from the corporate accounts of these firms for the firm-generated content, and acquire the ultra-high-frequency trading data of these firms.

21.1 Constituent list of the S&P 500 IT firms

To construct our dataset, the first step is getting the historical constituent list of the IT firms in the S&P 500 indices.

We’ll follow the post “Where can I find constituent lists for the S&P indices?” created by the NYU Libraries to retrieve the historical constituent list of the S&P 500 IT firms from the database Global Financial Data.

Here are the steps to take:

  • Choose Constituent Membership from the left-hand menu
  • Category: Major US Indices
  • Index: S&P 500 Information Technology
  • For historical lists, select As of a Specific Date and enter the desired year.

There are 65 records in the downloaded spreadsheet. The two columns that we need are Ticker and Name (company name).

Ticker   Name
MCHP     Microchip Technology Inc
INTU     Intuit, Inc.
IT       Gartner, Inc.
SWKS     Skyworks Solutions, Inc.
TRMB     Trimble Navigation Ltd.
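
A minimal sketch of loading the downloaded spreadsheet into R and keeping these two columns, assuming the file was exported under the hypothetical name sp500_it_constituents.xlsx:

library(readxl)

# hypothetical file name for the spreadsheet downloaded from Global Financial Data
constituents <- read_excel("sp500_it_constituents.xlsx")

# keep the two columns we need: Ticker and Name
constituents <- constituents[, c("Ticker", "Name")]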

We will use this information in two ways. (1) Using the company name to search for each firm’s corporate account on Twitter and record its username, also known as its handle. (2) Using the ticker to retrieve stock transactions from databases on the platform Wharton Research Data Services (WRDS). These steps are described below. Towards the end of this project, we’ll merge the tweets and transaction data for modeling.

21.2 Firm-generated content (FGC) on Twitter

To collect FGC, we’ll first go to Twitter, search for these corporate accounts one by one, and record their handles for later use. A handle is the username that begins with the “@” symbol and is unique to an account.

Ticker   Name                        Handle
MCHP     Microchip Technology Inc.   MicrochipTech
INTU     Intuit, Inc.                Intuit
IT       Gartner, Inc.               Gartner_inc
SWKS     Skyworks Solutions, Inc.    skyworksinc
TRMB     Trimble Navigation Ltd.     TrimbleCorpNews

The search resulted in 64 records. One company was missing: Henry (Jack) & Associates.

We can then use the Twitter API to collect the tweets sent from these corporate accounts using their handles.

Twitter API

API stands for application programming interface.

APIs are mechanisms that enable two software components to communicate with each other using a set of definitions and protocols. “Data APIs” are used to share and manage data between a database and data users.

API endpoints are the digital locations where information is sent and received between communicating systems.

Twitter API endpoints

The Twitter API is a set of programmatic endpoints that can be used to search and retrieve tweets, users, lists, spaces, and media, among other objects.

The API endpoints available to standard use cases include:

  • Recent search endpoint retrieves filtered public tweets posted over the last week by specifying the keywords and operators. More on search tweets.
  • User tweet timeline endpoint retrieves up to 3200 most recent public tweets from a specific user’s timeline. More on timelines.
  • Filtered stream endpoint collects tweets on a topic in real-time by specifying filters for the required data. More on filtered stream.
  • Follows lookup endpoint supports exploring and analyzing relationships between users for use cases around network analysis. More on follows.

We’ll leverage the timeline endpoint to collect FGC. The limit of approximately 3200 tweets is sufficient for collecting historical tweets within a year’s timeframe.

getting access to Twitter APIs

Starting February 2023, Twitter no longer supports free access to the Twitter API. Access tiers such as Standard (v1.1), Essential (v2), and Elevated (v2) have been deprecated since March 2023.

Under the new tiers, the free version only allows posting tweets, not retrieving them. Therefore, unfortunately, we cannot show the tweet retrieval process step by step in this tutorial. Instead, we’ll outline the general steps to follow and use tweets downloaded before February 2023 for demonstration.

In general, there are several prerequisites to successfully connecting to Twitter’s servers.

  1. Sign up for a developer account by following the instructions on its Developer Platform, and wait for approval.

  2. Navigate to the Developer Portal to create a new Project or App, once your request is approved. The developer portal is where we manage our access to Twitter APIs.

Every request sent to the Twitter API has to go through an “App”. Apps and Projects are ways to organize our work with the Twitter Developer Platform by use case. Each Project can contain one or multiple Apps depending on the access level.

  3. Save the authentication credentials in a secure location.

Once we have access and have created a Project and App, we will be able to find or generate the following credentials within our developer App:

  • API Key and Secret: Essentially a user name and password that represent our App when making API requests.
  • Bearer Token: Authenticates requests on behalf of our developer App.
  • Access Token and Secret: User-specific credentials used to authenticate API requests. They specify the Twitter account the request is made on behalf of.
  • Client ID and Client Secret: Credentials used to obtain a user Access Token with OAuth 2.0 authentication.

If we forget to save these credentials, we can always regenerate new values:

  • Log in to our Twitter account on the Developer Platform.
  • Navigate to the Twitter App dashboard and go to the Twitter App for which we would like to generate access tokens.
  • Navigate to the “Keys and tokens” tab.
  • Look for “Consumer Keys”, “Authentication Tokens”, and “OAuth 2.0 Client ID and Client Secret” on this page.
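
To keep these credentials out of our scripts, one common approach (a sketch, not part of the original workflow; the variable names below are hypothetical) is to store them as environment variables, for example in ~/.Renviron, and read them with Sys.getenv():

# in ~/.Renviron (one entry per line, hypothetical variable names):
# TWITTER_BEARER_TOKEN=...
# TWITTER_CLIENT_ID=...
# TWITTER_CLIENT_SECRET=...

# in R, read the values without hard-coding them in the script
bearer_token  <- Sys.getenv("TWITTER_BEARER_TOKEN")
client_id     <- Sys.getenv("TWITTER_CLIENT_ID")
client_secret <- Sys.getenv("TWITTER_CLIENT_SECRET")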

retrieving tweets with rtweet

rtweet is an R wrapper for multiple Twitter API endpoints.

library(rtweet)

setup

Before we can use the package rtweet to connect to Twitter, we first need to complete authentication.

Following the guide, we first need to set up a client to enable OAuth 2.0 in our App’s authentication settings in the developer portal.

The OAuth credentials can be found in the “Keys and tokens” tab of your app.

# fill in the Client ID, Client Secret, and App name from the developer portal
client <- rtweet_client(client_id = "", 
              client_secret = "", 
              app = "", 
              scopes = NULL)

We’ll then authorize it via rtweet_oauth2(), which opens our default browser and asks us to authorize the client.

user_oauth2 <- rtweet_oauth2(client = client, scopes = NULL)
## Waiting for authentication in browser...
## Press Esc/Ctrl + C to abort
auth_as(user_oauth2)
auth_save(user_oauth2, "oauth2_authentication")
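
In a later R session, the saved authentication can be reloaded by the name we gave it above, without repeating the browser step:

# reload the OAuth 2.0 token saved with auth_save() above
auth_as("oauth2_authentication")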

To use app-based authentication, run this code:

auth <- rtweet_app()

We’ll be prompted by a message “Please enter your bearer token:”, where we enter the bearer token of our Twitter app.

After that, we can run some quick tests using small examples to see if the connection has been successfully established.

tweet_search_recent("#rstats", n = 10)

downloading tweets and metadata

I used Twitter’s Standard v1.1 API to retrieve tweets in December 2022, before it was deprecated. The code below for retrieving user timelines and metadata therefore no longer applies and is shown here for documentation purposes. Please refer to the rtweet manual and Twitter developer guides for updates if you decide to proceed with the paid tiers. In addition, the data format of tweets retrieved using v1.1 needs to be mapped to the data format of Twitter API v2.

Let’s first create a vector to store the handles of the S&P 500 IT firms’ corporate accounts.

cusername <- c("Agilent","Accenture","Adobe","ADI_News","ADP","autodesk","Applied4Tech",
               "AMD","AristaNetworks","ANSYS","Broadcom","Broadridge","Ceridian",
               "Cadence","Ciena","salesforce","Cisco","Cognizant","DXCTechnology",
               "Enphase","EPAMSystems","FastenalCompany","FactSet","F5","FISGlobal",
               "Fiserv","FLEETCOR","Fortinet","GenDigitalInc","Corning","GlobalPayInc",
               "intel","Intuit","Gartner_inc","JuniperNetworks","Keysight","KLAcorp",
               "LeidosInc","L3HarrisTech","LamResearch","Mastercard","MicrochipTech",
               "monolithicpower","Microsoft","Moto","MicronTech","ServiceNow",
               "nvidia","NXP","onsemi","Oracle","Paycom","Paychex","PTC","PayPal",
               "Qualcomm","QorvoInc","skyworksinc","Teradyneinc","TrimbleCorpNews",
               "TXInstruments","tylertech","Visa","ZebraTechnology")

For each handle, I downloaded its most recent 3200 tweets using the function get_timeline(). n defines the desired number of results to return; setting n = Inf will download as many results as possible.

for(i in seq_along(cusername)) {
  # download the timeline of one corporate account;
  # below we show how to store each result before it is overwritten
  temp <- 
    get_timeline(
      user = cusername[i],
      n = Inf,
      retryonratelimit = TRUE,
      token = auth
    )
}

There are rate limits on the number of requests that can be made to Twitter API endpoints. Most Twitter rate limits refresh every 15 minutes. Setting retryonratelimit = TRUE above tells get_timeline() to wait and resume automatically whenever a rate limit is reached.


We’ll download the tweets into a new environment. Keeping the downloaded tweets in their own environment, separate from other objects such as temporary variables like the loop iterator i, makes them easier to work with. This idea applies generally, even when we work with other web services to download data.

We’ll create the new environment with base R’s new.env(), and use the package rlang to inspect and manage it.

We’ll store tweets and their metadata from each corporate account separately in individual data frames. We use assign() to name each data frame with the handle of each account (e.g., Adobe, nvidia). assign() assigns a value to a name in an environment.

library(rlang)

# create a new environment
ne <- new.env()

for(i in seq_along(cusername)) {
  temp <- 
    get_timeline(
      user = cusername[i],
      n = Inf,
      retryonratelimit = TRUE,
      token = auth
    )
  assign(cusername[i], temp, envir = ne)
}

Later, we can access and set elements of this environment with the operators $ and [[, in the same way as a list. We can use env_print() and env_names() to get more information about the environment.

More discussions on environments.
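
For example, assuming the loop above has run, we could inspect the environment and pull out one account’s tweets like this:

env_names(ne)   # handles of the accounts stored in ne
env_print(ne)   # a summary of the environment and its bindings

head(ne$Adobe)  # the data frame of tweets downloaded from @Adobe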

Each downloaded tweet and its metadata are stored in 43 columns.

##  [1] "created_at"                    "id"                            "id_str"                       
##  [4] "full_text"                     "truncated"                     "display_text_range"           
##  [7] "entities"                      "source"                        "in_reply_to_status_id"        
## [10] "in_reply_to_status_id_str"     "in_reply_to_user_id"           "in_reply_to_user_id_str"      
## [13] "in_reply_to_screen_name"       "geo"                           "coordinates"                  
## [16] "place"                         "contributors"                  "is_quote_status"              
## [19] "retweet_count"                 "favorite_count"                "favorited"                    
## [22] "retweeted"                     "possibly_sensitive"            "lang"                         
## [25] "quoted_status"                 "text"                          "favorited_by"                 
## [28] "scopes"                        "display_text_width"            "retweeted_status"             
## [31] "quoted_status_id"              "quoted_status_id_str"          "quoted_status_permalink"      
## [34] "quote_count"                   "timestamp_ms"                  "reply_count"                  
## [37] "filter_level"                  "metadata"                      "query"                        
## [40] "withheld_scope"                "withheld_copyright"            "withheld_in_countries"        
## [43] "possibly_sensitive_appealable"

A sample record is shown below.

## # A tibble: 1 × 43
##   created_at               id id_str              full_text       truncated display_text_range entities     source
##   <dttm>                <dbl> <chr>               <chr>           <lgl>                  <dbl> <list>       <chr> 
## 1 2022-03-31 00:57:24 1.51e18 1509213214470455300 “Women need to… FALSE                    206 <named list> "<a h…
## # ℹ 35 more variables: in_reply_to_status_id <lgl>, in_reply_to_status_id_str <lgl>, in_reply_to_user_id <int>,
## #   in_reply_to_user_id_str <chr>, in_reply_to_screen_name <chr>, geo <list>, coordinates <list>, place <list>,
## #   contributors <lgl>, is_quote_status <lgl>, retweet_count <int>, favorite_count <int>, favorited <lgl>,
## #   retweeted <lgl>, possibly_sensitive <lgl>, lang <chr>, quoted_status <list>, text <chr>, favorited_by <lgl>,
## #   scopes <lgl>, display_text_width <lgl>, retweeted_status <lgl>, quoted_status_id <lgl>,
## #   quoted_status_id_str <lgl>, quoted_status_permalink <lgl>, quote_count <lgl>, timestamp_ms <lgl>,
## #   reply_count <lgl>, filter_level <lgl>, metadata <lgl>, query <lgl>, withheld_scope <lgl>, …

cleaning downloaded tweets

We’ll work on a subset of the downloaded data, and keep the variables created_at, full_text, retweet_count, and favorite_count.

created_at records the UTC time when a tweet was created, which will be used to match corresponding ultra-high-frequency stock trading activity data.

## # A tibble: 1 × 1
##   created_at         
##   <dttm>             
## 1 2022-03-31 00:57:24

We’ll create a new variable timestamp that converts created_at to datetime class (POSIXct) in the EST timezone. We’ll then filter the downloaded tweets to the time span of 01/01/2022 - 01/31/2022.

We will use retweet_count and favorite_count as control variables in the final models. In the original study, the number of followers was used as a control variable. However, this information was not provided by the Standard v1.1 user timeline API endpoint. Therefore, we will use retweet_count and favorite_count as proxies.

We’ll keep only the tweets written in English; lang records the language of a tweet.

We’ll use a loop to clean the tweets downloaded from each corporate account, stored in separate data frames in the environment ne. At the end of the loop, we’ll bind the data frames that have been processed individually.

d1 <- as.POSIXct("2022-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "EST")
d2 <- as.POSIXct("2022-01-31 23:59:59", format = "%Y-%m-%d %H:%M:%S", tz = "EST")

nams <- ls(envir = ne)   # handles of the accounts stored in ne
news <- data.frame()     # will hold the cleaned tweets from all accounts

for (i in seq_along(nams)){
  df <- ne[[nams[i]]]
  df$handle <- nams[i]
  # convert created_at to POSIXct and express it in the EST timezone
  df$timestamp <- as.POSIXct(df$created_at, format = "%Y-%m-%d %H:%M:%S")
  attr(df$timestamp, "tzone") <- "EST"
  
  # keep January 2022 tweets written in English, and only the variables we need
  df <- subset(df, 
               subset = timestamp >= d1 & timestamp <= d2 & lang == "en", 
               select = c(timestamp, full_text, handle, retweet_count, favorite_count))
  
  news <- rbind(news, df)
}

ls(envir = ne) lists the objects in the environment ne, that is, the data frames where we store the downloaded tweets from each corporate account. ne[[nams[i]]] extracts each data frame from ne.
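
A quick sanity check on the combined data frame (a sketch):

# number of cleaned January 2022 tweets per corporate account
table(news$handle)

# rows and columns of the combined data frame
dim(news)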

21.3 Intraday transactions data

Now we’ll acquire ultra-high-frequency trading data from Wharton Research Data Services (WRDS). WRDS is a comprehensive business data management system which provides a variety of financial, economic and marketing data sources.

The database we’ll use in WRDS is TAQ - Millisecond Consolidated Trades. This database contains intraday transactions data for all securities listed on the New York Stock Exchange (NYSE), American Stock Exchange (AMEX), the Nasdaq National Market System (NMS), and all other U.S. equity exchanges.

To retrieve the data we need, the query form asks for the information below.

  1. Date range: 2022-01-01 to 2022-01-31

  2. Time range

  • Beginning: 09:30:00
  • Ending: 16:05:00

9:30 a.m. to 4:00 p.m. ET is the NYSE Core Trading Session. We’ll use only NYSE stocks to avoid any possibility of the results being influenced by differences in trading protocols.

The last daily trade is assumed to occur no later than 16:05. Transactions are commonly reported up to five minutes after the official close at 16:00.

  3. Company codes

We’ll paste the tickers into a plain text file (.txt), one ticker per line, and upload the file to WRDS to download the transaction data for the specified companies (see the sketch after this list).

  4. Query variables
  • DATE: Date of trade
  • TIME_M: Time of trade
  • EX: Exchange that issued the trade
  • SYM_ROOT: Security symbol root
  • TR_SCOND: Trade Sale Condition
  • SIZE: Volume of trade
  • PRICE: Price of trade
  • TR_SEQNUM: Trade Sequence Number
  5. Query output
  • Output Format: tab-delimited text (*.txt)
  • Compression Type: zip (*.zip)
  • Date Format: YYYY-MM-DD (e.g., 1984-07-25)
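
A minimal sketch of writing the ticker symbols to such a file from R (only a few tickers are shown; tickers.txt is a hypothetical file name):

# a few tickers from the constituent list; the full vector holds the 60+ symbols
tickers <- c("MCHP", "INTU", "IT", "SWKS", "TRMB")

# one ticker per line, as required for the WRDS company-code upload
writeLines(tickers, "tickers.txt")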

The output format is tab-delimited text compressed into a zip file, which matters for how we will read these files into R. The output file is quite large and requires high-performance computing resources to work with.
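
As a sketch of the reading step (assuming a hypothetical download named taq_jan2022.zip containing taq_jan2022.txt), a tab-delimited file inside a zip archive can be read through a connection created by unz():

# open a connection to the tab-delimited file inside the zip archive
con <- unz("taq_jan2022.zip", "taq_jan2022.txt")

# read the transactions; for files of this size, a faster reader such as
# data.table::fread() on the unzipped file may be preferable
taq <- read.delim(con, stringsAsFactors = FALSE)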