Chapter 2 Best Practices

Chung-hong Chan

When working with 3rd party APIs, please make sure you follow the best practices.

2.1 Read the Developer Agreement, Policy and API Documentation

You must read the Developer Agreement, Policy and API documentation. It is still true even though you are going to use the R packages providing the wrapper functions (e.g. RCrowdTangle, tuber, academictwitteR etc.)

First, the Developer Agreement and Policy provide information on what you can and cannot do with the data obtained through the API. It is important for both Open Science practices (e.g. sharing data publicly) and sharing data between individuals within the research group. Please make sure you understand the data redistribution policy. The API provided by Twitter, for example, forbids the redistribution of Twitter Content to third parties. However, academic researchers are permitted to distribute an unlimited number of Tweet IDs and/or User IDs for peer review purposes. The API provided by CrowdTangle basically forbids any data redistribution. This point is of paramount importance for social scientists because the Cambridge Analytica data scandal is a case of API data abuse by an academic researcher.

Second, the API documentation provides information on what are the expected API responses and rate limits. Knowing the information is important because you know what to expect. Also, you won’t offset the problems related to the API to the R package developers.

2.2 Don’t Hardcode Authentication Information into your R Code

You should not hardcode your authentication information (authentication keys, secrets, tokens) into your R code. But what do I mean by that? For example, the following is an example of hardcoding.

require(tuber)
## fake, taken from tuber's vignette
client_id <- "998136489867-5t3tq1g7hbovoj46dreqd6k5kd35ctjn.apps.googleusercontent.com"
client_secret <- "MbOSt6cQhhFkwETXKur-L9rN"

yt_oauth(app_id = client_id,
         app_secret = client_secret,
         token = '')

This is not a good practice for two reasons. First, your client_id and client_secret are directly visible in your R code. It is super easy to accidentally leak these supposedly secret information while sharing your code. Even supposedly professional programmers do that quite often. Also unlike a typical password system which renders these secret information as asterisks, it enables the so-called “shoulder surfing attack”: a malicious actor can obtain these information by simply looking (or videotaping through a tele lens) at your computer screen over your shoulder.

Second, when you run your code, your client_id and client_secret are burnt into your command history. A malicious person can browse through your command history to obtain these information.

You might wonder, well, they are just two pieces of string. No big deal. But please bare in mind these information is not simply for collecting data from YouTube. It could also be the credential for your Google Cloud access. Frequent access to some API endpoints that require money can incur financial loss. Other APIs such as Twitter v1 APIs allow deletion of data, posting data or reading all your direct messages on your behalf, simply with your API authentication information. So please: PROTECT YOUR AUTHENTICATION INFORMATION LIKE YOUR PASSWORDS!

2.2.1 Alternative: Use Environment Variables

So, what is the alternative? A securer solution is to store your authentication information as environment variables (envvars) instead. Specifically, you should store your authentication information as your user-level envvars.

For a quick experimentation, run this:

usethis::edit_r_environ(scope = "user")

If you are doing this in RStudio, a new file is now opened. That is your user-level .Renviron file. It is a hidden file (indicated by the “.”) in your user directory ¹.

For most people, it should be a blank file. For some, the file might already have something in it. Now, in the file, put this line into it:

MYSECRET=ROMANCE

This line says: I want to create a user-level envvar called “MYSECRET” with the value being “ROMANCE”. Please note that this is not R and you must use the equal sign. As instructed, save the file and then restart R.

In the fresh R session, you can now retrieve the value of the environment variable “MYSECRET”. If you did that correctly, running this should give you the value of “MYSECRET”, i.e. “ROMANCE”.

## You should get "ROMANCE"
Sys.getenv("MYSECRET")

The practice is to setup envvars of your authentication information. For example, in your .Renviron set up the following envvars.

YT_CLIENTID=998136489867-5t3tq1g7hbovoj46dreqd6k5kd35ctjn.apps.googleusercontent.com
YT_CLIENTSECRET=MbOSt6cQhhFkwETXKur-L9rN

And then in your R code, it should be:

require(tuber)
yt_oauth(app_id = Sys.getenv("YT_CLIENTID"),
         app_secret = Sys.getenv("YT_CLIENTSECRET"),
         token = '')

As long as your hidden .Renviron file is not leaked, you are safe. For reproducibility purposes, you should document all these envvars (the definitions, not the values) in the README file ².

This method is not perfect, of course. For example, your .Renviron is a plain text file and it can still be leaked. If you want to know other alternatives, see the “Managing secrets” vignette of httr.

2.3 Memoise your API Calls

Many API documentation will tell you to cache. Caching is to store a local copy of the response from the API. If you submit the exact same API request again, instead of making another API request to the server, the result is retrieved from the local copy. The technique is also called memoisation. This method is more useful for API responses that are not dynamically changed, e.g. Google Natural Language API, Google Places API, or CKAN API. It is less useful for social media APIs because information such as number of likes changes frequently. If you are only interested in retrieving the content, you can also cache those social media APIs.

Caching is good because it reduces unnecessary API requests. It is also helpful to prevent exceeding rate limit.

2.3.1 Implementation of memoisation in R

As a quick experiment, we use an extremely simple API: restful catAPI.

library(httr)
content(GET("https://thatcopy.pw/catapi/rest/"))

If you run the above code many times, you should get a different cat photo every time. However, you can create a memoised version of the httr::GET function called mGET.

library(memoise)
mGET <- memoise(GET)

Similarly, you can create a memoised version of any API function, e.g.

library(googleway)
mgoogle_places <- memoise(google_places)

Back to the restful catAPI example: If you run it the mGET instead of GET many times, you will get the same result over and over again.

content(mGET("https://thatcopy.pw/catapi/rest/"))

It is because the response from the API is cached locally. All subsequent identical requests will not be made online. Instead they get fetched from the local cache. If you don’t need the local cache anymore, delete it using the forget function.

forget(mGET)

For more information about memoisation, please refer to the official website of memoise.

For the Unix users (Mac OSX, Linux, FreeBSD, Solaris, HPUX, etc.) reading this, you can also define envvars in your hidden .*rc file (e.g. .bashrc or .zshrc, depending on your shell). The method is to set that up using export YT_CLIENTSECRET="MbOSt6cQhhFkwETXKur-L9rN". The envvars defined in your .*rc file can also be retrieved by Sys.getenv. The section on Google Natural Language API actually contains an example. If you have an habit of publishing your dotfiles, you should store these envvars in your .localrc instead. Well, if you know what dotfiles are, your Unix Wizardry should be able to tell you how to do that.↩︎
For the people who need to use Github Actions to run or test your code, you can also store those envvars in your R code as Github Encrypted Secrets.↩︎