4.12 Lab: Scraping data from APIs
To understand how APIs work, we’ll take the New York Times API as an example. This API lets users search articles by string and date range, and returns counts of articles and a short description of each article (but not the full text). You can find the documentation here. Get a new API token (set up a developer account and create an app here) and paste the key here:
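For the code below to run, the key needs to be stored in an object called `apikey` (the string here is just a placeholder, not a real key):

```r
# Store your personal NYT API key; replace the placeholder with your own key
apikey <- "PASTE-YOUR-KEY-HERE"
```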
The first step is to identify the base URL and the parameters that we can use to query the API. Now we can make a first API call using the httr package. (You can use my API key above… as long as we don’t hit the rate limit!)
You can find a description of how the Article Search API works here: http://developer.nytimes.com/article_search_v2.json#/README
base_url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"
library(httr)
r <- GET(base_url, query=list(q="inequality","api-key"=apikey))
r
From the output of r, we can see that the query was successful (Status: 200), the content is in json format, and its size is 17.3kB.
To extract the text returned by this API call, you can use content. You can write it to a file to take a look at it.
# getwd() # get working directory
# setwd()
http_type(r)
writeLines(content(r, 'text'), con="./www/nyt.json") # a character path opens and closes the file for us
#writeLines(content(r, 'text'), con="nyt.json")
We can save the output into an object in R to learn more about its structure.
json <- content(r, 'parsed') # parse to list
class(json); names(json) # list with 3 elements
json$status # this should be "OK"
names(json$response) # the actual data
json$response$meta # metadata
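Beyond the metadata, the articles themselves live in json$response$docs. As a quick peek (the field names follow the Article Search v2 documentation; run str(json$response$docs[[1]]) to confirm the structure of your own output):

```r
# Inspect the first article returned by the query
first <- json$response$docs[[1]]
first$headline$main # headline
first$pub_date      # publication date
first$web_url       # link to the article on nytimes.com
```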
If we check the documentation, we find that we can subset by date with the begin_date and end_date parameters. Let’s see how this works…
r <- GET(base_url, query=list(q="inequality",
"api-key"=apikey,
"begin_date"=20160101,
"end_date"=20161231))
json <- content(r, 'parsed')
json$response$meta
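The hits field inside the metadata holds the article count directly:

```r
# Number of articles matching the query between the two dates
json$response$meta$hits
```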
Between these two dates, there were X articles in the NYT mentioning “inequality”.
Now imagine we want to look at the evolution of mentions of this word over time. Following the best coding practices we introduced earlier, we want to write a function that will take a word and a set of dates as arguments and return the counts of articles.
Before writing it, here is a quick reminder of the function syntax in R, followed by a first draft of our function:
# A simple function
squared <- function(input){input^2}
squared(5) # try it out
# Function
nyt_count <- function(q, date1, date2){
r <- GET(base_url, query=list(q=q,
"api-key"=apikey,
"begin_date"=date1,
"end_date"=date2))
json <- content(r, "parsed")
return(json$response$meta$hits)
}
# Apply the function
nyt_count(q="inequality", date1=20160101, date2=20160131)
# Apply the function over several search terms
queries <- c("trump", "clinton", "blocher", "merkel", "tennis")
queries.list <- list()
for(i in queries){
queries.list[[i]] <- nyt_count(q=i, date1=20160101, date2=20160131)
Sys.sleep(2)
}
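To inspect the results, the named list can be collapsed into a vector (a small base-R sketch; queries.list is the list built in the loop above):

```r
# Collapse the list of per-query counts into a named numeric vector,
# sorted from most to least mentioned
sort(unlist(queries.list), decreasing = TRUE)
```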
OK, so this seems to work. But we want to run this function multiple times across different years, so let’s write another function to help us do that.
# Function
nyt_years_count <- function(q, yearinit, yearend){
# sequence of years to loop over
years <- seq(yearinit, yearend)
counts <- rep(NA, length(years))
# loop over periods
for (i in 1:length(years)){
# information message to track progress
message(years[i])
# retrieve count
counts[i] <- nyt_count(q=q, date1=paste0(years[i], "0101"),
date2=paste0(years[i], "1231"))
Sys.sleep(1)
}
return(counts)
}
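Running the new function is where things go wrong. The exact call is not shown in the source; a plausible reconstruction of what we ran is:

```r
# Calling the function over several years -- this is where the error appeared
counts <- nyt_years_count(q = "inequality", yearinit = 2014, yearend = 2019)
```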
Oops! What happened? Why the error? Maybe we’re querying the API too fast. Let’s modify the nyt_count function to add a while loop that will wait a couple of seconds in case there’s an error:
nyt_count <- function(q, date1, date2){
r <- GET(base_url, query=list(q=q,
"api-key"=apikey,
"begin_date"=date1,
"end_date"=date2))
json <- content(r, "parsed")
## if there is no response
while (r$status_code!=200){ # If error
Sys.sleep(1) # wait a couple of seconds
# try again:
r <- GET(base_url, query=list(q=q,
"api-key"=apikey,
"begin_date"=date1,
"end_date"=date2))
json <- content(r, "parsed")
}
return(json$response$meta$hits)
}
And let’s see if this does the trick…
counts <- nyt_years_count(q="inequality", yearinit=2014, yearend=2019)
plot(2014:2019, counts, type="l", main="Mentions of inequality on the NYT, by year",
xlab="Year", ylab="Article count")
counts <- nyt_years_count(q="trump", yearinit=2014, yearend=2019)
plot(2014:2019, counts, type="l", main="Mentions of Trump on the NYT, by year",
xlab="Year", ylab="Article count")
Some additional code
We can try to generalize the function even more so that it works with any date interval, not just years:
nyt_dates_count <- function(q, init, end, by){
# sequence of dates to loop over
dates <- seq(from=init, to=end, by=by)
dates <- format(dates, "%Y%m%d") # changing format to match NYT API format
counts <- rep(NA, length(dates)-1)
# loop over periods
for (i in 1:(length(dates)-1)){ ## note the -1 here
# information message to track progress
message(dates[i])
# retrieve count
counts[i] <- nyt_count(q=q, date1=dates[i],
date2=dates[i+1])
}
# improving this as well so that it returns a data frame
df <- data.frame(date = as.Date(dates[-length(dates)], format="%Y%m%d"), count = counts)
return(df)
}
And now we can count articles at the month level…
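For example, monthly counts for 2016 (a sketch: init and end must be Date objects so that seq() can build the sequence, and the by argument follows the seq.Date conventions, e.g. "month"):

```r
# Count monthly mentions of "inequality" during 2016
df <- nyt_dates_count(q = "inequality",
                      init = as.Date("2016-01-01"),
                      end  = as.Date("2017-01-01"),
                      by   = "month")
plot(df$date, df$count, type = "l",
     xlab = "Month", ylab = "Article count")
```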