21 Collecting Data

In this chapter, we discuss how to collect data from various sources to construct our dataset for analysis.

The original study covered the time period between January 2013 and August 2018. For the current project, we will concentrate on January 2022 and collect data for that month. Two considerations motivate this choice of time period. First, due to the nature and volume of ultra-high-frequency trading data, a single trading day generates a vast number of transactions, roughly 4 to 7 million records for the 60+ stocks in our sample.

Second, there are limits on how far back we can go to collect tweets through Twitter’s standard API access. Specifically, the user tweet timeline endpoint we are going to use returns only the roughly 3200 most recent tweets per user, which amounts to approximately one year’s worth of tweets at an average of 10 tweets per day. While Twitter’s Academic Research access and enterprise APIs offer higher caps, they are either prohibitively expensive or difficult to obtain access to.

Below we show how to get the historical constituent list of the IT firms in the S&P 500 indices, collect tweets sent from the corporate accounts of these firms for the firm-generated content, and acquire the ultra-high-frequency trading data of these firms.

21.1 Constituent list of the S&P 500 IT firms

To construct our dataset, the first step is getting the historical constituent list of the IT firms in the S&P 500 indices.

We’ll follow the post “Where can I find constituent lists for the S&P indices?” created by the NYU Libraries to retrieve the historical constituent list of the S&P 500 IT firms from the database Global Financial Data.

Here are the steps to take:

  • Choose Constituent Membership from the left-hand menu
  • Category: Major US Indices
  • Index: S&P 500 Information Technology
  • For historical lists, select As of a Specific Date and enter the desired year.

There are 65 records in the downloaded spreadsheet. The two columns that we need are Ticker and Name (company name).

Ticker  Name
MCHP    Microchip Technology Inc
INTU    Intuit, Inc.
IT      Gartner, Inc.
SWKS    Skyworks Solutions, Inc.
TRMB    Trimble Navigation Ltd.
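
To work with this list in R, we can read the downloaded file into a data frame; a minimal sketch, assuming the constituent list was exported as a CSV file named sp500_it_constituents.csv (the file name is our own choice):

constituents <- read.csv("sp500_it_constituents.csv", stringsAsFactors = FALSE)

# keep the two columns we need
constituents <- constituents[, c("Ticker", "Name")]
head(constituents)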

We will use this information in two ways: (1) using the company name to search for each firm’s corporate account on Twitter and record its username, also known as its handle; (2) using the ticker to retrieve stock transactions from the databases in Wharton Research Data Services (WRDS). These steps are described below. We’ll merge the tweet and transaction datasets for modeling purposes towards the end of this project.

21.2 Firm-generated content (FGC) on Twitter

To collect FGC, we’ll first go to Twitter, search for these corporate accounts one by one, and record their handles for later use. A handle is the username that begins with the “@” symbol and is unique to an account.

Ticker  Name                       Handle
MCHP    Microchip Technology Inc.  MicrochipTech
INTU    Intuit, Inc.               Intuit
IT      Gartner, Inc.              Gartner_inc
SWKS    Skyworks Solutions, Inc.   skyworksinc
TRMB    Trimble Navigation Ltd.    TrimbleCorpNews

The search resulted in 64 records. One company was not found: Henry (Jack) & Associates.

We will use the Twitter API, specifically the user tweet timeline endpoint described below, to collect the tweets sent from each corporate account via its handle.

Twitter API

API stands for application programming interface.

APIs are mechanisms that enable two software components to communicate with each other using a set of definitions and protocols. “Data APIs” are used to share and manage data between a database and data users.

API endpoints are the specific locations, typically URLs, to which API requests are sent and from which responses are returned.

Twitter API endpoints

The Twitter API is a set of programmatic endpoints that can be used to find and retrieve tweets, users, lists, trends, media, and places, among other objects.

For academic users, Twitter provides various API endpoints to access tweets and their metadata. Among them, the API endpoints available to standard use cases include:

  • Recent search endpoint retrieves tweets from the last 7 days that match specified keywords and operators.
  • User tweet timeline endpoint retrieves up to the 3200 most recent public tweets from a specific user’s timeline.
  • Filtered stream endpoint collects tweets on a topic in real time by specifying filters for the required data.
  • Follows lookup endpoint retrieves an account’s followers and the accounts it follows, supporting use cases such as network analysis.

We’ll leverage the user tweet timeline endpoint to collect FGC. Its cap of approximately 3200 tweets should be sufficient for retrieving historical tweets within a one-year timeframe.

getting access to the Twitter API

There are several prerequisites to successfully connecting to Twitter’s servers.

  1. Sign up for a developer account by following the instructions at Twitter’s Developer Platform.

  2. Navigate to the Developer Portal to create a new Project or standalone App.

The developer portal is where we manage our access to Twitter APIs.

Every request to the Twitter API has to go through an “App”. Apps and Projects are ways to organize our work with the Twitter Developer Platform by use case. Each Project can contain one or multiple Apps depending on the access level.

  3. Save the authentication credentials in a secure location.

Once we have access and have created a Project and App, we will be able to find or generate the following credentials within our developer App:

  • API Key and Secret: Essentially a username and password that represent our App when making API requests.
  • Bearer Token: Authenticates requests on behalf of our developer App.
  • Access Token and Secret: User-specific credentials used to authenticate API requests. They specify the Twitter account the request is made on behalf of.

If we forget to save these credentials, we can always regenerate new values:

  • Log in to our Twitter account on the Developer Platform.
  • Navigate to the Twitter App dashboard and go to the Twitter App for which we would like to generate access tokens.
  • Navigate to the “Keys and tokens” tab.
  • Find the Consumer Keys and Authentication Tokens on this page.

We’ll authenticate with a Bearer Token, which enables application-only authentication. We’ll provide this token to the package rtweet to retrieve tweets, as shown below.

retrieving tweets with rtweet

rtweet is an R wrapper for multiple Twitter API endpoints.

setup

Before we use rtweet to connect to Twitter, the first step is to authenticate using rtweet_app().

library(rtweet)
auth <- rtweet_app()

We’ll be prompted by a message “Please enter your bearer token:”, where we enter the bearer token of our Twitter app.
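
If we’d rather not type the token at the prompt in every session, rtweet_app() also accepts the bearer token directly; a minimal sketch, assuming the token has been saved in an environment variable named TWITTER_BEARER_TOKEN and that we label the saved authentication "corporate-fgc" (both names are our own choices):

# read the bearer token from an environment variable, e.g. set in .Renviron
auth <- rtweet_app(bearer_token = Sys.getenv("TWITTER_BEARER_TOKEN"))

# optionally save this authentication so later sessions can reuse it
auth_save(auth, "corporate-fgc")
# auth_as("corporate-fgc")  # reload it in a new session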

After that, we can run a quick test using small examples to see if the connection has been successfully established.

search_tweets("#rstats", token = auth, n = 10)

rtweet’s vignette Intro to rtweet provides many such examples.

downloading tweets and metadata

Next, we’re ready to get user timelines with the function get_timeline().

Let’s create a vector to store the handles of the S&P 500 IT firms’ corporate accounts.

cusername <- c("Agilent","Accenture","Adobe","ADI_News","ADP","autodesk","Applied4Tech",
               "AMD","AristaNetworks","ANSYS","Broadcom","Broadridge","Ceridian",
               "Cadence","Ciena","salesforce","Cisco","Cognizant","DXCTechnology",
               "Enphase","EPAMSystems","FastenalCompany","FactSet","F5","FISGlobal",
               "Fiserv","FLEETCOR","Fortinet","GenDigitalInc","Corning","GlobalPayInc",
               "intel","Intuit","Gartner_inc","JuniperNetworks","Keysight","KLAcorp",
               "LeidosInc","L3HarrisTech","LamResearch","Mastercard","MicrochipTech",
               "monolithicpower","Microsoft","Moto","MicronTech","ServiceNow",
               "nvidia","NXP","onsemi","Oracle","Paycom","Paychex","PTC","PayPal",
               "Qualcomm","QorvoInc","skyworksinc","Teradyneinc","TrimbleCorpNews",
               "TXInstruments","tylertech","Visa","ZebraTechnology")

For each handle, we’ll download its most recent 3200 tweets. In get_timeline(), n defines the desired number of results to return; setting n = Inf will download as many results as possible.

for(i in seq_along(cusername)) {
  # each iteration downloads one account's timeline into temp;
  # temp is overwritten on the next iteration, so below we store each result
  temp <- 
    get_timeline(
      user = cusername[i],
      n = Inf,
      retryonratelimit = TRUE,
      token = auth
    )
}

Twitter places rate limits on the number of requests that can be made to each API endpoint. Most of these limits refresh every 15 minutes. The easiest way to download more tweets than a single 15-minute window allows is to set retryonratelimit = TRUE, which waits out the limit and retries automatically.
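
To see how much of the quota remains before the next reset, rtweet provides a rate_limit() helper; a quick check might look like the following (we assume it accepts our authentication via the token argument):

# report the request limits, remaining calls, and reset times for our app
rate_limit(token = auth)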


We’ll download the tweets into a new environment. Keeping them in a dedicated environment makes them easier to work with, because the environment contains only the newly downloaded tweets and no other objects, such as temporary variables like the loop iterator i.

We’ll create the new environment with base R’s new.env(), which generates a new environment, and use the package rlang to inspect and manage it.

We’ll store tweets and their metadata from each corporate account separately in individual data frames. We use assign() to name each data frame with the handle of each account (e.g. Adobe, nvidia). assign() assigns a value to a name in an environment.

library(rlang)

# create a new environment
ne <- new.env()

for(i in seq_along(cusername)) {
  temp <- 
    get_timeline(
      user = cusername[i],
      n = Inf,
      retryonratelimit = TRUE,
      token = auth
    )
  assign(cusername[i], temp, envir = ne)
}

Later, we can access and set elements of this environment with the operators $ and [[, in the same way as a list. We can use rlang’s env_print() and env_names() to get more information about the environment.
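
For example, a few quick checks on ne (Adobe is one of the handles stored above):

env_names(ne)        # handles of the accounts whose tweets we downloaded
env_print(ne)        # a summary of the environment and its bindings
nrow(ne[["Adobe"]])  # number of tweets downloaded from the Adobe account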

More discussions on environments.

Each downloaded tweet and its metadata are stored in 43 columns.

##  [1] "created_at"                    "id"                           
##  [3] "id_str"                        "full_text"                    
##  [5] "truncated"                     "display_text_range"           
##  [7] "entities"                      "source"                       
##  [9] "in_reply_to_status_id"         "in_reply_to_status_id_str"    
## [11] "in_reply_to_user_id"           "in_reply_to_user_id_str"      
## [13] "in_reply_to_screen_name"       "geo"                          
## [15] "coordinates"                   "place"                        
## [17] "contributors"                  "is_quote_status"              
## [19] "retweet_count"                 "favorite_count"               
## [21] "favorited"                     "retweeted"                    
## [23] "possibly_sensitive"            "lang"                         
## [25] "quoted_status"                 "text"                         
## [27] "favorited_by"                  "scopes"                       
## [29] "display_text_width"            "retweeted_status"             
## [31] "quoted_status_id"              "quoted_status_id_str"         
## [33] "quoted_status_permalink"       "quote_count"                  
## [35] "timestamp_ms"                  "reply_count"                  
## [37] "filter_level"                  "metadata"                     
## [39] "query"                         "withheld_scope"               
## [41] "withheld_copyright"            "withheld_in_countries"        
## [43] "possibly_sensitive_appealable"

A sample record is shown below.

## # A tibble: 1 × 43
##   created_at               id id_str full_…¹ trunc…² displ…³ entities     source
##   <dttm>                <dbl> <chr>  <chr>   <lgl>     <dbl> <list>       <chr> 
## 1 2022-03-31 00:57:24 1.51e18 15092… “Women… FALSE       206 <named list> "<a h…
## # … with 35 more variables: in_reply_to_status_id <lgl>,
## #   in_reply_to_status_id_str <lgl>, in_reply_to_user_id <int>,
## #   in_reply_to_user_id_str <chr>, in_reply_to_screen_name <chr>, geo <list>,
## #   coordinates <list>, place <list>, contributors <lgl>,
## #   is_quote_status <lgl>, retweet_count <int>, favorite_count <int>,
## #   favorited <lgl>, retweeted <lgl>, possibly_sensitive <lgl>, lang <chr>,
## #   quoted_status <list>, text <chr>, favorited_by <lgl>, scopes <lgl>, …

cleaning downloaded tweets

We’ll work on a subset of the downloaded data and keep the variables created_at, full_text, retweet_count, and favorite_count.

created_at records the UTC time when a tweet was created, which will be used to match corresponding ultra-high-frequency stock trading activity data.

## # A tibble: 1 × 1
##   created_at         
##   <dttm>             
## 1 2022-03-31 00:57:24

We’ll create a new variable timestamp that stores created_at as a datetime (POSIXct) displayed in the EST timezone. We’ll then filter the downloaded tweets to the time span of 01/01/2022 - 01/31/2022.

We will use retweet_count and favorite_count as control variables in the final models. In the original study, the number of followers was used as a control variable. However, this information is not provided by a standard user timeline API endpoint. Therefore, we will use retweet_count and favorite_count as proxies.

We’ll keep the tweets written in English. lang is the language of a tweet.

We’ll use a loop to clean the tweets downloaded from each corporate account, which are stored as separate data frames in the environment ne. Inside the loop, we’ll append each processed data frame to a single combined data frame.

d1 <- as.POSIXct("2022-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "EST")
d2 <- as.POSIXct("2022-01-31 23:59:59", format = "%Y-%m-%d %H:%M:%S", tz = "EST")

nams <- ls(envir = ne)
news <- data.frame()

for (i in seq_along(nams)){
  df <- ne[[nams[i]]]
  df$handle <- nams[i]
  # convert created_at (reported in UTC) to a timestamp displayed in EST
  df$timestamp <- as.POSIXct(df$created_at, format = "%Y-%m-%d %H:%M:%S")
  attr(df$timestamp, "tzone") <- "EST"
  
  # keep English-language tweets from January 2022 and only the columns we need
  df <- subset(df, 
               subset = timestamp >= d1 & timestamp <= d2 & lang == "en", 
               select = c(timestamp, full_text, handle, retweet_count, favorite_count))
  
  news <- rbind(news, df)
}

ls(envir = ne) lists the objects in the environment ne, which are the data frames where we store the downloaded tweets from each corporate account. ne[[nams[i]]] extracts each data frame from ne.
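
A few quick checks on the combined data frame news help confirm that the filtering behaved as expected:

# total number of tweets that survived the January 2022 / English filter
nrow(news)

# number of tweets per corporate account
head(sort(table(news$handle), decreasing = TRUE))

# all timestamps should fall within January 2022 (EST)
range(news$timestamp)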

21.3 Intraday transactions data

Now we’ll acquire ultra-high-frequency trading data from Wharton Research Data Services (WRDS). WRDS is a comprehensive business data management system which provides a variety of financial, economic and marketing data sources.

The database we’ll use in WRDS is TAQ - Millisecond Consolidated Trades. This database contains intraday transactions data for all securities listed on the New York Stock Exchange (NYSE), American Stock Exchange (AMEX), the Nasdaq National Market System (NMS), and all other U.S. equity exchanges.

To retrieve the data we need, the WRDS query form requests the information below.

  1. Date range: 2022-01-01 to 2022-01-31

  2. Time range

  • Beginning: 09:30:00
  • Ending: 16:05:00

9:30 a.m. to 4:00 p.m. ET is the NYSE Core Trading Session. We’ll use only NYSE stocks to avoid any possibility of the results being influenced by differences in trading protocols.

The last daily trade is assumed to occur no later than 16:05. Transactions are commonly reported up to five minutes after the official close at 16:00.

  3. Company codes

We’ll paste the tickers into a plain text file (.txt), one ticker per line, and upload the file to WRDS to download the transaction data of the specified companies (see the sketch after this list).

  4. Query variables
  • DATE: Date of trade
  • TIME_M: Time of trade
  • EX: Exchange that issued the trade
  • SYM_ROOT: Security symbol root
  • TR_SCOND: Trade Sale Condition
  • SIZE: Volume of trade
  • PRICE: Price of trade
  • TR_SEQNUM: Trade Sequence Number
  5. Query output
  • Output Format: tab-delimited text (*.txt)
  • Compression Type: zip (*.zip)
  • Date Format: YYYY-MM-DD. (e.g. 1984-07-25)
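
A minimal sketch of preparing the ticker file mentioned under Company codes, assuming a character vector of tickers (shown truncated here) and a file name of our own choosing:

# tickers from the constituent list; only the first few are shown here
tickers <- c("MCHP", "INTU", "IT", "SWKS", "TRMB")

# write one ticker per line, ready to upload to WRDS
writeLines(tickers, "sp500_it_tickers.txt")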

The output format is tab-delimited text compressed into a zip file, which is relevant to how we will read these files into R. The output file is quite large and requires high-performance computing resources to work with.
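
As a preview of reading these files into R, here is a minimal sketch, assuming the WRDS download was saved as taq_jan2022.zip containing taq_jan2022.txt (both file names are our own choices):

# read the tab-delimited file directly from inside the zip archive
taq <- read.delim(unz("taq_jan2022.zip", "taq_jan2022.txt"),
                  stringsAsFactors = FALSE)

# for a file this large, unzipping first and using data.table::fread()
# is typically much faster and more memory-efficient
# library(data.table)
# taq <- fread("taq_jan2022.txt")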