Chapter 7 GitHub.com API

Markus Konrad (WZB)

You will need to install the following packages for this chapter (run the code):

# install.packages('pacman')
library(pacman)
p_load('jsonlite', 'httr')

7.1 Provided services/data

What data/service is provided by the API?

The GitHub.com API provides a service for interacting with the social coding platform GitHub. A social coding platform is a website that allows users to work on software projects collaboratively. Users can share their work, engage in discussions and track activities of other users and projects. GitHub is currently the largest such platform with more than 50M user accounts (as of January 2022). GitHub’s API allows for retrieving user-generated data from its platform, which is probably of main interest for social scientists. It also provides tools for controlling your account, organizations and projects on the platform in order to automate workflows. This is more of interest for professional software development.

From a social scientist’s perspective, the API may be of interest when studying online communities, working methods, organizational structures, communication and discussions, etc. with a focus on (open-source) software development. Many projects that are hosted on GitHub are open-source projects with a transparent development process and communications. For private projects, which can also be hosted on GitHub, there’s understandably only a few aggregate data available.

When collecting data on GitHub, you should follow GitHub’s terms of service, especially the API terms. GitHub users agree that, when using the service, anything they post publicly may be viewed and used by other users (see ToS section D5). Still, you should be cautious when collecting and analyzing such user-generated data, as the user was not informed about contributing to a research project. If research data should later be published for replication, only aggregated and/or anonymized data should be made public. Depending on the data you collect, an ethics review should also be considered.

Special care should be taken about sampling techniques and replicability when working with this API. The sheer amount of data shared on GitHub requires sampling in most cases, but this is often hard to implement with the API, since results are usually sorted in a way that introduces bias when you can only fetch a certain number of top results (e.g. searching for users gives you the “most popular” users matching your criteria). Cosentino, Izquierdo, and Cabot (2016) show that many studies that use the GitHub API fail to implement proper sampling.

There are several competing platforms to GitHub. Many of them also provide an API which allow you to retrieve data. To list a few:

7.2 Prerequisites

The API can be used for free and you can send up to 60 requests per hour if you’re not authenticated (i.e. if you don’t provide an API key). For serious data collection, this is not much, so it is recommended to sign up on GitHub and generate a personal access token that acts as API key. This token can then be used to authenticate your API requests. Your quota is then 5000 requests per hour.

7.3 Simple API call

What does a simple API call look like?

First of all, it’s important to note that GitHub actually provides two APIs: A REST API and a GraphQL API. They differ mainly in how requests are performed and how the responses are formatted, but also in what can be requested. The REST API is the default interface and provides a way of access akin to most other APIs in this book. The GraphQL API provides a quite different interface which requires you to more specifically describe what kind of information you want to retrieve. Though some very detailed information can only be retrieved via the GraphQL API, we will stick to the REST API for this chapter due to the complexity of the GraphQL API.

We can perform a request using the curl command in a terminal or by simply visiting the URL in a browser. We will start by retrieving public profile data from the GitHub account of software developer and hacker Lilith Wittmann:⁶

curl https://api.github.com/users/LilithWittmann

The result is an HTTP response with JSON formatted data which contains user profile information that can also be found on the respective public profile page:

{
  "login": "LilithWittmann",
  "name": "Lilith Wittmann",
  "location": "Berlin, Germany",
  "bio": "freelance developer & consultant",
  "twitter_username": "LilithWittmann",
  "public_repos": 37,
  "public_gists": 6,
  "followers": 440,
  "following": 52,
  // [ more data ... ]
}

The GitHub API offers extensive search capabilities. You can search for users, repositories, discussions and more by constructing a search query. Here, we search for users that use R and report being located in Berlin:

curl 'https://api.github.com/search/users?q=language:r+location:berlin'

{
  "total_count": 541,
  "incomplete_results": false,
  "items": [
    {
      "login": "IndrajeetPatil",
      "id": 11330453,
      // [...]
    },
    {
      "login": "christophergandrud",
      "id": 1285805,
      // [...]
    },
    {
      "login": "RobertTLange",
      "id": 20374662,
      // [...]
    },
  // [ more data ... ]
}

We could then continue fetching profile information for each account in the result set.

Note that by default, the results are ordered by “best match”. This can be changed with an additional “sort” parameter. The GitHub API doesn’t provide a way of sampling search results – they are always sorted in some way and hence may introduce bias when you only fetch a limited number of results (which is often necessary since the result set is too large). You may consider to narrow the criteria for your search requests, e.g. by sampling geographical locations, so that you can always obtain the whole search results for a place.

Another example would be retrieving data about a project repository. We will use the GitHub repository for the popular R package dplyr as an example:

curl https://api.github.com/repos/tidyverse/dplyr

{
  "id": 6427813,
  "name": "dplyr",
  "full_name": "tidyverse/dplyr",
  "description": "dplyr: A grammar of data manipulation",
  "homepage": "https://dplyr.tidyverse.org",
  "size": 50767,
  "stargazers_count": 3959,
  "watchers_count": 3959,
  "language": "R",
  // [ more data ... ]
}

Finally, let’s fetch data about an issue in the dplyr repository. On GitHub, issues are code or documentation problems as well as development tasks that users and collaborators bring up and discuss. Here, we collect all data on dplyr issue #5958:

curl https://api.github.com/repos/tidyverse/dplyr/issues/5958

{
  "url": "https://api.github.com/repos/tidyverse/dplyr/issues/5958",
  "number": 5958,
  "title": "Inaccurate documentation for `name` argument of `count()`",
  "user": {
    "login": "sfirke",
    "id": 7569808,
    // [ more data ... ]
  },
  "comments": 13,
  "body": "The documentation for `count()` says of the [...]",  // truncated
  // [ more data ... ]
}

As you can see in the example above, a response may contain nested data (the “user” entry shows information about the author of the issue as a nested structure).

7.4 API access in R

How can we access the API from R?

There are packages for many programming languages that provide convenient access for communicating with the GitHub API. Unfortunately, there are no such packages for R at the time of writing.⁷ This means we can only access the API directly, e.g. by using the jsonlite package to fetch the data and convert it to an R list or dataframe.

We start with translating the first API call from the previous section to R:

library(jsonlite)

profile_data <- fromJSON('https://api.github.com/users/LilithWittmann')
# this gives a list
head(profile_data, 3)

## $login
## [1] "LilithWittmann"
## 
## $id
## [1] 891030
## 
## $node_id
## [1] "MDQ6VXNlcjg5MTAzMA=="

The JSON response from the GitHub API was directly converted to an R list object. When a JSON result only consists of arrays, fromJSON() automatically converts the result to a dataframe by default. We can observe that when fetching the repositories of an user:

profile_repos <- fromJSON('https://api.github.com/users/LilithWittmann/repos')
# this gives a dataframe
head(profile_repos[,1:3])   # selecting only the first 3 columns

##          id                          node_id                     name
## 1  42730077 MDEwOlJlcG9zaXRvcnk0MjczMDA3Nw==                  airflow
## 2 481880572                     R_kgDOHLjp_A               ant-design
## 3 410145768                     R_kgDOGHJT6A aries-staticagent-python
## 4 319779724 MDEwOlJlcG9zaXRvcnkzMTk3Nzk3MjQ=  awesome-behoerden-floss
## 5 582106925                     R_kgDOIrI_LQ   credentials_dev_public
## 6 231679431 MDEwOlJlcG9zaXRvcnkyMzE2Nzk0MzE=              deployments

Let’s also repeat the search query from the previous section. We want to obtain accounts that use R and entered “Berlin” in their profile’s location field. This time, the result is a list that reports the number of search results as search_results$total_count and the search results details as a dataframe in search_results$items:

search_results <- fromJSON('https://api.github.com/search/users?q=language:r+location:berlin')
# this gives a list with a dataframe in "items"
str(search_results, list.len = 3, vec.len   = 3)

## List of 3
##  $ total_count       : int 636
##  $ incomplete_results: logi FALSE
##  $ items             :'data.frame':  30 obs. of  19 variables:
##   ..$ login              : chr [1:30] "IndrajeetPatil" "RobertTLange" "christophergandrud" ...
##   ..$ id                 : int [1:30] 11330453 20374662 1285805 9379282 1569647 638225 13415025 17516791 ...
##   ..$ node_id            : chr [1:30] "MDQ6VXNlcjExMzMwNDUz" "MDQ6VXNlcjIwMzc0NjYy" "MDQ6VXNlcjEyODU4MDU=" ...
##   .. [list output truncated]

Let’s suppose we want to collect profile data for each account in the result set. First, let’s investigate the search results in search_results$items:

head(search_results$items[,1:3])

##                login       id              node_id
## 1     IndrajeetPatil 11330453 MDQ6VXNlcjExMzMwNDUz
## 2       RobertTLange 20374662 MDQ6VXNlcjIwMzc0NjYy
## 3 christophergandrud  1285805 MDQ6VXNlcjEyODU4MDU=
## 4  lisacharlotterost  9379282 MDQ6VXNlcjkzNzkyODI=
## 5     dirkschumacher  1569647 MDQ6VXNlcjE1Njk2NDc=
## 6              al2na   638225 MDQ6VXNlcjYzODIyNQ==

To retrieve the profile details of each user, we only need the account name which we can then feed into the https://api.github.com/users/<account> API query. For demonstration purposes, let’s only select the first ten users in the result set:

r_users_berlin <- search_results$items$login[1:10]
# gives a character vector of 10 account names
head(r_users_berlin)

## [1] "IndrajeetPatil"     "RobertTLange"       "christophergandrud" "lisacharlotterost"  "dirkschumacher"     "al2na"

We can now build the API queries to fetch the profile information for each user:

users_query <- paste0('https://api.github.com/users/', r_users_berlin)
# gives a character vector of 10 API queries
head(users_query)

## [1] "https://api.github.com/users/IndrajeetPatil"     "https://api.github.com/users/RobertTLange"       "https://api.github.com/users/christophergandrud" "https://api.github.com/users/lisacharlotterost"  "https://api.github.com/users/dirkschumacher"    
## [6] "https://api.github.com/users/al2na"

The fromJSON() function only accepts a single character string as query, but we have a vector of ten queries for the ten users. Hence, we resort to lapply() which passes each query to fromJSON() and stores the result as a list of ten lists (i.e. a nested structure – a list of lists). Each item in users_details is then a list that contains the profile information for the respective user.

users_details <- lapply(users_query, fromJSON)
# gives a list of 10 lists (one for each user)
str(users_details, list.len = 3)

## List of 10
##  $ :List of 32
##   ..$ login              : chr "IndrajeetPatil"
##   ..$ id                 : int 11330453
##   ..$ node_id            : chr "MDQ6VXNlcjExMzMwNDUz"
##   .. [list output truncated]
##  $ :List of 32
##   ..$ login              : chr "RobertTLange"
##   ..$ id                 : int 20374662
##   ..$ node_id            : chr "MDQ6VXNlcjIwMzc0NjYy"
##   .. [list output truncated]
##  $ :List of 32
##   ..$ login              : chr "christophergandrud"
##   ..$ id                 : int 1285805
##   ..$ node_id            : chr "MDQ6VXNlcjEyODU4MDU="
##   .. [list output truncated]
##   [list output truncated]

A nested list is hard to work with, so we convert the result to a dataframe. On the way, we select only the information that we actually want to work with. For demonstration purposes, we only select the account name, the real name, the location, the number of public repositories and the number of followers of each user.

users_rows <- lapply(users_details, function(d) {
  # select information that we need and generate a dataframe with a single row
  data.frame(login = d$login,
             name = d$name,
             loc = d$location,
             repos = d$public_repos,
             followers = d$followers,
             stringsAsFactors = FALSE)
})

# combine all dataframe rows to form the final dataframe
users_df <- do.call(rbind, users_rows)
# gives a dataframe with login, name, location, number of repos, number of followers
users_df

##                 login                name                       loc repos followers
## 1      IndrajeetPatil     Indrajeet Patil           Berlin, Germany    26      1354
## 2        RobertTLange Robert Tjarko Lange Berlin, Barcelona, London    35       569
## 3  christophergandrud Christopher Gandrud                    Berlin   147       436
## 4   lisacharlotterost Lisa Charlotte Muth           Berlin, Germany    11       357
## 5      dirkschumacher     Dirk Schumacher                    Berlin   155       317
## 6               al2na       Altuna Akalin                    Berlin    68       234
## 7         saraswatmks     Manish Saraswat           Berlin, Germany    62       146
## 8      joachim-gassen      Joachim Gassen           Berlin, Germany    33       139
## 9         chartgerink    Chris Hartgerink                    Berlin    53       130
## 10         vbezgachev    Vitaly Bezgachev           Berlin, Germany    15       128

As we can see, working with the results from the API requires some effort in R, since we have to deal with the often complex structure of the provided data. There’s no “convenience wrapper” package for the GitHub API in R so far. If you’re familiar with other programming languages, you may consider using one of the recommended packages.⁸

7.4.1 Provide API key

Authenticating with the GitHub API via an API key allows you to send much more requests to the API (5000 requests per hour instead of 60 at the time of writing). API access keys for the GitHub API are called personal access tokens (PAT) and the documentation explains how to generate a PAT once you’ve logged into your GitHub account. Please be careful with your PATs and never publish them.

When you want to use authenticated requests, you need to pass along authentication information (your GitHub user account and your PAT) with every request. This is not possible with fromJSON() – you have to send HTTP requests directly, e.g. via GET() from the httr package and then need to handle the HTTP response. The following example shows this. Here, we have the secret access token PAT stored as environment variable and fetch it via Sys.getenv(). Next, we use GET() to make an authenticated HTTP response to the GitHub API. We use the /user API endpoint, which simply reports information about the currently authenticated user (which is your own account). If we were to use this API endpoint in an unauthenticated manner, we’d receive the error message “Requires authentication”.⁹ We then need to get the content of the response as text (content(response, as = 'text')) and pass this to fromJSON() in order to get the result as list object.

library(httr)

PAT <- Sys.getenv("GitHub_token")

response <- GET('https://api.github.com/user', authenticate('<account name>', PAT))
account_details <- fromJSON(httr::content(response, as = 'text'))
account_details

You can use that code the same way as for the other API requests, only that when you use authenticated requests the request quotas are much higher.

References

Chatziasimidis, Fragkiskos, and Ioannis Stamelos. 2015. “Data Collection and Analysis of GitHub Repositories and Users.” In 2015 6th International Conference on Information, Intelligence, Systems and Applications (IISA), 1–6. https://doi.org/10.1109/IISA.2015.7388026.

Cosentino, Valerio, Javier Luis Cánovas Izquierdo, and Jordi Cabot. 2016. “Findings from GitHub: Methods, Datasets and Limitations.” In 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), 137–41.

Hipp, Lena, and Markus Konrad. 2021. “Has Covid-19 Increased Gender Inequalities in Professional Advancement? Cross-Country Evidence on Productivity Differences Between Male and Female Software Developers.” Journal of Family Research, September. https://doi.org/10.20377/jfr-697.

Lima, Antonio, Luca Rossi, and Mirco Musolesi. 2014. “Coding Together at Scale: GitHub as a Collaborative Social Network.” In Eighth International AAAI Conference on Weblogs and Social Media.

The author asked for permission for fetching and displaying this data.↩︎
Please note that the git2r package does not provide access to the GitHub API. It provides access to git repositories which is something completely different to accessing the web API of a social coding platform.↩︎
For example, the PyGithub package for Python provides a comprehensive set of tools for interacting with the GitHub API which usually requires much less programming effort than with R and jsonlite.↩︎
Try that out yourself via curl https://api.github.com/user in the terminal.↩︎