Chapter 4 Application Programming Interfaces (APIs)
While web scraping (or screen scraping, as you extract the stuff that appears on your screen) is certainly fun, it should be seen as a last resort. More and more web platforms provide so-called Application Programming Interfaces (APIs).
“An application programming interface (API) is a connection between computers or between computer programs.” (Wikipedia)
There are many kinds of APIs, but the most common one is the REST API. REST stands for “REpresentational State Transfer” and describes a set of rules that API designers are supposed to follow when developing their interface. You can make different kinds of requests: GET content, POST a file to a server (PUT is similar), or DELETE a file. We will focus only on GET.
APIs offer you a structured way to communicate with the platform via your machine. In our use case, this means that you can get the data you want in a usually well-structured format and without all the “dirt” that you need to scrape off tediously (enough web scraping metaphors for today). With APIs, you can generally define quite clearly what you want and how you want it. In R, we achieve this using the httr (Wickham 2020) package. Moreover, using an API does not carry the risk of acquiring information you are not supposed to access, and you do not need to worry about the server being unable to handle the load of your requests (usually, rate limits are in place to address this particular issue). However, it’s not all fun and games with APIs: they might give you their data in a special format; both XML and JSON are common. The former is the one rvest uses as well; the latter can be tamed using jsonlite (Ooms, Temple Lang, and Hilaiel 2020), which will be introduced as well. Moreover, you usually have to ask the platform for permission and perhaps pay to get it. Once you have received the keys you need, you can tell R to fill them in automatically, similar to how your browser knows your Amazon password; usethis (Wickham et al. 2021) can help you with such tasks.
The best thing that can happen with APIs: some of them are so popular that people have already written specific R packages for working with them. An overview can be found on the rOpenSci website. One example of this was Twitter and the rtweet package (Kearney 2019).
4.1 Obtaining their data
API requests are performed using URLs. Those start with the base address of the API (e.g., https://api.nytimes.com), followed by the endpoint that you want to use (e.g., /lists). They can also contain so-called query parameters, which are appended after a question mark as key-value pairs and can carry, for instance, authentication tokens or search parameters. A request to the New York Times API to obtain articles for January 2019 would then look like this: https://api.nytimes.com/svc/archive/v1/2019/1.json?api-key=yourkey.
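To make the anatomy of such a URL concrete, the following sketch takes the example request apart with parse_url() from httr. Nothing is sent to the server, and yourkey is the placeholder from the text, not a working credential.

```r
library(httr)

# The example request from the text; "yourkey" is a placeholder.
request_url <- "https://api.nytimes.com/svc/archive/v1/2019/1.json?api-key=yourkey"

# Split the URL into its components without sending anything.
parts <- parse_url(request_url)

parts$hostname # base address of the API
parts$path     # endpoint, here including year and month
parts$query    # named list holding the key-value pairs
```

Looking at parts$query is a quick way to check which key-value pairs a URL actually carries before it is used in a request.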
With most APIs, you will have to register first. As we will play with the New York Times API, register on its developer portal (https://developer.nytimes.com).
4.1.1 Making queries
A basic query is performed using the GET() function. First, however, you need to define the call you want to make. The different keys and the values they can take are listed in the API documentation. Of course, there is also a neater way to deal with the key problem. We will show it later.
library(httr)

# see overview here: https://developer.nytimes.com/docs/timeswire-product/1/overview
key <- "xxx" # placeholder; replace with your own key
# key <- Sys.getenv("nyt_key")

# modify_url() percent-encodes query values, so write the space literally
nyt_book_reviews <- modify_url(
  url = "https://api.nytimes.com/",
  path = "svc/books/v3/reviews.json",
  query = list(author = "Michelle Obama",
               `api-key` = key))

response <- GET(nyt_book_reviews)
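Because modify_url() percent-encodes query values, it can be worth inspecting the finished URL before passing it to GET(). The sketch below only builds the URL and sends nothing; "xxx" stands in for a real key.

```r
library(httr)

# Build the URL without sending it; "xxx" is a placeholder key.
review_url <- modify_url(
  url = "https://api.nytimes.com/",
  path = "svc/books/v3/reviews.json",
  query = list(author = "Michelle Obama", `api-key` = "xxx")
)

# The space in the author name appears as %20 in the result.
review_url
```

A literal + in a query value is encoded as well (as %2B), so a space written as + would not reach the server as a space.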
With the NYT news API, the section is specified not in the query but in the endpoint path itself. Hence, if we wanted to query different sections, we would have to change the path itself, e.g., through str_c().
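library(httr)
library(purrr)
library(stringr)

# one endpoint path per section
paths <- str_c("svc/news/v3/content/nyt/", c("business", "world"), ".json")

# send one GET request per path
map(paths,
    \(x) GET(modify_url(
      url = "https://api.nytimes.com/",
      path = x,
      query = list(`api-key` = key))
    )
)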
## [[1]]
## Response [https://api.nytimes.com/svc/news/v3/content/nyt/business.json?api-key=xxx]
## Date: 2023-06-15 14:10
## Status: 401
## Content-Type: application/json
## Size: 90 B
##
##
## [[2]]
## Response [https://api.nytimes.com/svc/news/v3/content/nyt/world.json?api-key=xxx]
## Date: 2023-06-15 14:10
## Status: 401
## Content-Type: application/json
## Size: 90 B
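A possible variation on the call above separates building the URLs from sending the requests, so the URLs can be checked first. The helpers build_section_url() and get_politely() are hypothetical names introduced for this sketch, and the one-second pause is an arbitrary assumption, not a documented limit.

```r
library(httr)
library(purrr)
library(stringr)

# Hypothetical helper: builds the request URL for one section.
build_section_url <- function(section, key) {
  modify_url(
    url = "https://api.nytimes.com/",
    path = str_c("svc/news/v3/content/nyt/", section, ".json"),
    query = list(`api-key` = key)
  )
}

# Hypothetical helper: sends one request, then pauses before returning
# so that successive calls do not hit the server in rapid succession.
get_politely <- function(url, pause = 1) {
  response <- GET(url)
  Sys.sleep(pause)
  response
}

sections <- c("business", "world")

# A named character vector of URLs; "xxx" is a placeholder key.
section_urls <- map_chr(sections, build_section_url, key = "xxx") |>
  set_names(sections)

section_urls

# Uncomment to actually send the requests:
# responses <- map(section_urls, get_politely)
```

Naming the vector by section means the list of responses carries the section names as well, which makes later steps easier to follow.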
The Status: code you want to see here is 200, which stands for success. If you wrap the request in a function, you might want to stop the function as soon as a query fails. http_error() and http_status() are your friends here.
## [1] TRUE
## $category
## [1] "Client error"
##
## $reason
## [1] "Unauthorized"
##
## $message
## [1] "Client error: (401) Unauthorized"
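One way to use these two functions inside a function of your own is sketched below. The helper names check_response() and get_checked() are made up for this example; only http_error(), http_status(), and GET() come from httr.

```r
library(httr)

# Hypothetical helper: returns the response unchanged if the request
# succeeded and stops with the server's status message otherwise.
check_response <- function(response) {
  if (http_error(response)) {
    stop(http_status(response)$message, call. = FALSE)
  }
  invisible(response)
}

# Hypothetical wrapper combining the request and the check.
get_checked <- function(url) {
  check_response(GET(url))
}
```

Called with a URL that carries an invalid key, get_checked() would stop with the message shown in the output above instead of silently returning an unusable response. httr also ships stop_for_status(), which does a similar job out of the box.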
content() will give you the content of the response, which is what we want. It is in a format that we cannot work with right away, though: it is in JSON.
4.1.2 JSON
The following list, borrowed from a blog entry, summarizes how JSON is structured:
- The data are in name/value pairs
- Commas separate data objects
- Curly brackets {} hold objects
- Square brackets [] hold arrays
- Each data element is enclosed in double quotes "" if it is a character string, and written without quotes if it is a numeric value
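The rules above can be seen at work in a small example. The JSON below is made up for illustration and is not real API output; fromJSON() from jsonlite turns it into R objects.

```r
library(jsonlite)

# Made-up JSON for illustration, not real API output.
json_text <- '{
  "status": "OK",
  "num_results": 2,
  "results": [
    {"title": "First review", "year": 2018},
    {"title": "Second review", "year": 2019}
  ]
}'

parsed <- fromJSON(json_text)

parsed$status      # a character value (quoted in the JSON)
parsed$num_results # a numeric value (unquoted in the JSON)
parsed$results     # an array of objects, simplified to a data frame
```

With its default settings, fromJSON() converts an array of similar objects into a data frame with one row per object, which is usually the shape we want for further analysis.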
jsonlite helps us to bring this output into a data frame.
## $fault
## $fault$faultstring
## [1] "Invalid ApiKey"
##
## $fault$detail
## $fault$detail$errorcode
## [1] "oauth.v2.InvalidApiKey"
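A nested list like the one printed above is what fromJSON() returns for a nested JSON object. The sketch below parses a JSON string reconstructed from that printed output; the exact raw body sent by the server is an assumption. With a live response, the text could be obtained with content(response, as = "text").

```r
library(jsonlite)

# Assumed raw body, reconstructed from the printed list above.
error_body <- '{"fault": {"faultstring": "Invalid ApiKey",
  "detail": {"errorcode": "oauth.v2.InvalidApiKey"}}}'

# Nested JSON objects become nested lists.
error_list <- fromJSON(error_body)

error_list$fault$faultstring
error_list$fault$detail$errorcode
```

Since this body contains no array of records, there is nothing to simplify into a data frame, and the result stays a list.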
4.1.3 Dealing with authentication
Well, as we saw before, we would have to make our official NYT API key publicly visible in this script. This is bad practice and should be avoided, especially if you work on a joint project (where everybody uses their own key) or if you put your scripts in public places (such as GitHub). The usethis package can help you here.
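A minimal sketch of the idea: the key lives in an environment variable rather than in the script. To store it permanently, a line such as nyt_key=... can be added to the .Renviron file, which usethis::edit_r_environ() opens for editing. The helper get_api_key() and the variable demo_key below are hypothetical and serve only as an illustration.

```r
# Hypothetical helper: reads an API key from an environment variable and
# fails early if the variable is not set.
get_api_key <- function(name = "nyt_key") {
  key <- Sys.getenv(name)
  if (!nzchar(key)) {
    stop("Environment variable '", name, "' is not set.", call. = FALSE)
  }
  key
}

# For the current session only; "xxx" is a placeholder, not a real key.
Sys.setenv(demo_key = "xxx")
get_api_key("demo_key")
```

Failing early with a clear message is a design choice: Sys.getenv() returns an empty string for an unset variable, which would otherwise only surface later as an unauthorized request.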
Hence, if we now search for articles (the proper parameters are listed in the API documentation), we provide the key using the Sys.getenv() function. So, if somebody wants to work with your code and their own key, all they need to make sure of is that their API key is stored in an environment variable with the same name.
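library(httr)

# the key is read from the environment variable nyt_key
modify_url(
  url = "http://api.nytimes.com/svc/search/v2/articlesearch.json",
  query = list(q = "Trump",
               pub = "20161101", # possibly meant as begin_date; check the API documentation
               end_date = "20161110",
               `api-key` = Sys.getenv("nyt_key"))
) |>
  GET()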
## Response [http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Trump&pub=20161101&end_date=20161110&api-key=<redacted>]
## Date: 2023-06-15 14:10
## Status: 200
## Content-Type: application/json
## Size: 190 kB