8.23 Lab 7: Twitter’s streaming API

8.23.1 Authenticating

Before we can start collecting Twitter data, we need to create an OAuth token that will allow us to authenticate our connection and access our personal data.

After the recent API changes, getting a new token requires submitting an application for a developer account, which may take a few days. For teaching purposes only, I will temporarily share one of my tokens with each of you, so that we can use the API without everyone having to go through the application process.

However, if in the future you want to get your own token, follow these steps:

  1. Go to https://developer.twitter.com/en/apps and sign in.
  2. If you don’t have a developer account, you will need to apply for one first. Fill in the application form and wait for a response.
  3. Once it’s approved, click on “Create New App”. You will need to have a phone number associated with your account in order to be able to create a token.
  4. Fill in the name, description, and website (it can be anything, even http://www.google.com). Make sure you leave ‘Callback URL’ empty.
  5. Agree to user conditions.
  6. From the “Keys and Access Tokens” tab, copy the consumer key and consumer secret and paste them below.
  7. Click on “Create my access token”, then copy and paste your access token and access token secret below.
library(ROAuth) # OAuth = Open Authorization; see https://en.wikipedia.org/wiki/OAuth
my_oauth <- list(consumer_key = "ynznhx8u4Xyf6spbX2mxTQPHS",
   consumer_secret = "L6Jylw1iRUR7ExL6jPqRpcLZdHybOBUWORtbLhduQq6pW5HrXn",
   access_token="2714315514-Wsu9IL6AvfXxcTCgiEXIoUPRdsQCytOFBK24BCp",
   access_token_secret = "GhRUFsc87rSveHCDAjI15nDTmv9j27pWfUX9zePDovOqV")
save(my_oauth, file="./www/my_oauth")
load("./www/my_oauth")

What can go wrong here? Make sure all the consumer and token keys are pasted exactly as they are, without any additional space characters. If you don’t see any output in the console after running the code above, that’s a good sign.

Note that I saved the list as a file on my hard drive. That will save us some time later on, but you could also just re-run the code above before connecting to the API in the future.
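As a quick sanity check of the save()/load() mechanics, note that load() restores an object under the name it was saved with. A toy round-trip with a temporary file (the list contents here are just placeholders, not real credentials):

```r
# save() stores an object under its name; load() re-creates it in the workspace
creds <- list(consumer_key = "KEY", consumer_secret = "SECRET")
tmp <- tempfile()
save(creds, file = tmp)   # writes the object to disk under the name "creds"
rm(creds)                 # simulate starting a fresh R session
load(tmp)                 # re-creates "creds" in the workspace
creds$consumer_key        # "KEY"
```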

To check that it worked, try running the line below:

library(tweetscores)
getUsers(screen_names="LSEnews", oauth = my_oauth)[[1]]$screen_name

If this displays LSEnews then we’re good to go!

Some of the functions below will work with more than one token. If you want to save multiple tokens, see the instructions at the end of the file.

8.23.2 Collecting data from Twitter’s Streaming API

8.23.2.1 Collecting using keywords

Collecting tweets filtering by keyword:

library(streamR)
filterStream(file.name="./www/trump-streaming-tweets.json", track="trump", 
    timeout=10, oauth=my_oauth)

# Q: What does the path "./www/trump-streaming-tweets.json" mean?

Note the options:

  • file.name indicates the file on your disk where the tweets will be downloaded.
  • track is the keyword(s) mentioned in the tweets we want to capture.
  • timeout is the number of seconds that the connection will remain open.
  • oauth is the OAuth token we are using.

Once it has finished, we can open the file in R as a data frame with the parseTweets function:

tweets <- parseTweets("./www/trump-streaming-tweets.json")
tweets[1,]

If we want, we could also export it to a CSV file to be opened later with Excel:

write.csv(tweets, file="./www/trump-streaming-tweets.csv", row.names=FALSE)

And this is how we would capture tweets mentioning multiple keywords:

filterStream(file.name="./www/politics-tweets.json", 
    track=c("wildfire", "trump"),
    tweets=10000, oauth=my_oauth)

Note that here I use a different option, tweets, which indicates how many tweets (approximately) the function should capture before closing the connection to the Twitter API.

8.23.2.2 Collecting using geolocation

This second example shows how to collect tweets filtering by location instead. In other words, we can set a geographical box and collect only the tweets that are coming from that area.

For example, imagine we want to collect tweets from the United States. The way to do it is to find two pairs of coordinates (longitude = West/East and latitude = North/South) that indicate the southwest corner AND the northeast corner. Note the reverse order: it’s not (lat, long), but (long, lat).

In the case of the US, it would be approx. (-125, 25) and (-66, 50). How do you find these coordinates? Simply click on a point in Google Maps and copy the coordinates that appear.

filterStream(file.name="./www/tweets_geo.json", locations=c(-125, 25, -66, 50), # (long, lat, long, lat)
    timeout=30, oauth=my_oauth)
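Since the locations filter is just a rectangle, a quick way to double-check your box is to test whether a known point falls inside it. A small helper function (not part of streamR, just for illustration) that takes the box in the same c(sw_long, sw_lat, ne_long, ne_lat) order that filterStream expects:

```r
# Check whether a (long, lat) point falls inside a bounding box
# given as c(sw_long, sw_lat, ne_long, ne_lat)
in_box <- function(long, lat, box) {
  long >= box[1] & long <= box[3] & lat >= box[2] & lat <= box[4]
}

us_box <- c(-125, 25, -66, 50)
in_box(-77.04, 38.91, us_box)  # Washington, DC -> TRUE
in_box(-0.13, 51.51, us_box)   # London -> FALSE
```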

In the case of Europe it would be approx. the coordinates below.

# Europe
filterStream(file.name="./www/tweets_geo_europe.json", locations=c(-9, 35, 27, 60), 
    timeout=30, oauth=my_oauth)

We can do as before and open the tweets in R:

tweets <- parseTweets("./www/tweets_geo.json")
tweets2 <- parseTweets("./www/tweets_geo_europe.json")

And use the maps library to see where most tweets are coming from. Note that there are two types of geographic information on tweets: lat/lon (from geolocated tweets) and place_lat and place_lon (from tweets with place information). We will work with whatever is available.

library(maps)
# replace geolocation with the other one if the first is not available
tweets$lat <- ifelse(is.na(tweets$lat), tweets$place_lat, tweets$lat)
tweets$lon <- ifelse(is.na(tweets$lon), tweets$place_lon, tweets$lon)
tweets <- tweets[!is.na(tweets$lat),]
states <- map.where("state", tweets$lon, tweets$lat)
head(sort(table(states), decreasing=TRUE))
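map.where() returns one state label per tweet, so the table above is just a frequency count. If shares are more useful than raw counts, prop.table() converts one into the other; a toy example with made-up state labels:

```r
# Share of tweets per state, illustrated with made-up labels
state_labels <- c("texas", "california", "california", "new york", "california")
round(100 * prop.table(sort(table(state_labels), decreasing = TRUE)), 1)
# california accounts for 60% of the toy tweets
```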


# Identify country
# replace geolocation with the other one if the first is not available
tweets2$lat <- ifelse(is.na(tweets2$lat), tweets2$place_lat, tweets2$lat)
tweets2$lon <- ifelse(is.na(tweets2$lon), tweets2$place_lon, tweets2$lon)
tweets2 <- tweets2[!is.na(tweets2$lat),]
country <- map.where("world", tweets2$lon, tweets2$lat)
head(sort(table(country), decreasing=TRUE))

We can also prepare a map of the exact locations of the tweets.

library(ggplot2)

## First create a data frame with the map data 
map.data <- map_data("state")

# And we use ggplot2 to draw the map:
# 1) map base
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "grey90", 
    color = "grey50", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) + 
    # 2) limits for x and y axis
    scale_x_continuous(limits=c(-125,-66)) + scale_y_continuous(limits=c(25,50)) + 
    # 3) adding the dot for each tweet 
    geom_point(data = tweets, 
    aes(x = lon, y = lat), size = 1, alpha = 1/5, color = "darkblue") +
    # 4) removing unnecessary graph elements
    theme(axis.line = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank(), 
        axis.title = element_blank(), 
        panel.background = element_blank(), 
        panel.border = element_blank(), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(), 
        plot.background = element_blank()) 

And try to do the same for Europe:

library(ggplot2)

## First create a data frame with the map data 
map.data <- map_data("world")

# And we use ggplot2 to draw the map:
# 1) map base
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "grey90", 
    color = "grey50", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) + 
    # 2) limits for x and y axis
    scale_x_continuous(limits=c(-9,27)) + scale_y_continuous(limits=c(35,60)) +
    # 3) adding the dot for each tweet
    geom_point(data = tweets2, 
    aes(x = lon, y = lat), size = 1, alpha = 1/5, color = "darkblue") +
    # 4) removing unnecessary graph elements
    theme(axis.line = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank(), 
        axis.title = element_blank(), 
        panel.background = element_blank(), 
        panel.border = element_blank(), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(), 
        plot.background = element_blank()) 

8.23.2.3 Extract a network of retweets (skip)

And here’s how to extract the edges of a network of retweets (at least one possible way of doing it):

library(stringr)
tweets <- parseTweets("./www/trump-streaming-tweets.json")
# subset only RTs
rts <- tweets[grep("RT @", tweets$text),]

edges <- data.frame(
  node1 = rts$screen_name,
  # str_match() returns the capture group (the retweeted user's handle);
  # str_extract() would return the whole matched string instead
  node2 = str_match(rts$text, 'RT @([a-zA-Z0-9_]+)')[, 2],
  stringsAsFactors=F
)

library(igraph)
g <- graph_from_data_frame(d=edges, directed=TRUE)
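With the graph object in hand, igraph can answer questions such as which accounts are retweeted most often (highest in-degree). A self-contained toy with made-up screen names:

```r
library(igraph)

# Toy retweet edge list: node1 retweeted node2
edges_toy <- data.frame(
  node1 = c("alice", "bob", "carol", "dave"),
  node2 = c("nytimes", "nytimes", "nytimes", "bbc"),
  stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(d = edges_toy, directed = TRUE)

# In-degree = number of times each account was retweeted
sort(degree(g_toy, mode = "in"), decreasing = TRUE)
# nytimes has the highest in-degree (3)
```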

8.23.2.4 Collecting a random sample

Finally, it’s also possible to collect a random sample of tweets. That’s what the “sampleStream” function does:

sampleStream(file.name="./www/tweets_random.json", timeout=30, oauth=my_oauth)

Here I’m collecting 30 seconds of tweets. And once again, to open the tweets in R…

tweets <- parseTweets("./www/tweets_random.json")

What is the most retweeted tweet?

tweets[which.max(tweets$retweet_count),]

What are the most popular hashtags at the moment? We’ll use regular expressions to extract hashtags.

library(stringr)
ht <- str_extract_all(tweets$text, "#(\\d|\\w)+") # note: \\w already matches digits, so "#\\w+" is equivalent
ht <- unlist(ht)
head(sort(table(ht), decreasing = TRUE))
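To see what the regular expression captures without live data, here it is applied to a couple of made-up tweets:

```r
library(stringr)

sample_tweets <- c("Loving #rstats and #dataviz today",
                   "#rstats is great for text analysis")
ht_sample <- unlist(str_extract_all(sample_tweets, "#(\\d|\\w)+"))
sort(table(ht_sample), decreasing = TRUE)
# "#rstats" appears twice, "#dataviz" once
```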

And who are the most frequently mentioned users?

handles <- str_extract_all(tweets$text, '@[0-9_A-Za-z]+')
handles_vector <- unlist(handles)
head(sort(table(handles_vector), decreasing = TRUE), n=10)

How many tweets mention Justin Bieber?

length(grep("bieber", tweets$text, ignore.case=TRUE))

These are toy examples, but for large files with tweets in JSON format, there might be faster ways to parse the data. For example, the ndjson package offers a robust and fast way to parse JSON data:

library(ndjson)
json <- stream_in("./www/tweets_geo.json")
json
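One thing to keep in mind is that stream_in() returns a flattened data.table rather than the data frame that parseTweets() produces: nested JSON fields become dot-separated column names, and the exact names depend on the structure of the file. A self-contained toy showing the flattening (the field names here are made up to mimic a tweet):

```r
library(ndjson)

# Write a tiny two-line JSON file to see how nested fields are flattened
tmp <- tempfile(fileext = ".json")
writeLines(c('{"text": "hello", "user": {"screen_name": "alice"}}',
             '{"text": "world", "user": {"screen_name": "bob"}}'), tmp)

json_toy <- stream_in(tmp)
colnames(json_toy)        # nested user$screen_name becomes "user.screen_name"
json_toy$user.screen_name
```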