Chapter 22 MediaWiki Action API

Noam Himmelrath, Jacopo Gambato

You will need to install the following packages for this chapter (run the code):

# install.packages('pacman')
library(pacman)
p_load('WikipediR', 'rvest', 'xml2')

22.1 Provided services/data

What data/service is provided by the API?

To access Wikipedia, MediaWiki provides the MediaWiki Action API.

The API can be used for multiple purposes, such as accessing wiki features, interacting with a wiki, and obtaining meta-information about wikis and public users. Additionally, the web service can be used to access the data of Wikipedia pages and to post changes to them.

22.2 Prerequisites

What are the prerequisites to access the API (authentication)?

No pre-registration is required to access the API. However, certain actions, such as very large queries, require registration. Moreover, while there is no hard and fast limit on read requests, the system administrators strongly recommend limiting the request rate to protect the stability of the site. It is also best practice to set a descriptive User-Agent header.
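The two recommendations above (a low request rate and a descriptive User-Agent header) can be sketched with the httr package. This is a minimal sketch, not an official client: the helper name polite_get, the User-Agent string with its contact address, and the one-second pause are placeholder assumptions, and httr is assumed to be installed.

```r
library(httr)

# Hypothetical helper: send one read request with a descriptive User-Agent
# header, then pause so that repeated calls stay at a low request rate.
polite_get <- function(url, query = list(), pause = 1) {
  # Placeholder User-Agent: replace it with your own project name and contact
  ua <- user_agent("my-research-project/0.1 (mailto:you@example.org)")
  resp <- GET(url, ua, query = query)
  Sys.sleep(pause) # wait before the caller can send the next request
  resp
}
```

A call such as polite_get("https://en.wikipedia.org/w/api.php", query = list(action = "query", format = "json", prop = "info", titles = "Albert Einstein")) needs an internet connection and returns an httr response object.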

22.3 Simple API call

What does a simple API call look like?

As mentioned, the API can be used to communicate with Wikipedia for a variety of actions. As social scientists are more likely to extract information than to post changes to a Wikipedia page, we focus here on obtaining the information we need from Wikipedia.

A basic API call that obtains information about the Albert Einstein Wikipedia page looks as follows:

https://en.wikipedia.org/w/api.php?action=query&format=json&prop=info&titles=Albert%20Einstein

Entering this URL into the address bar of a browser returns the basic information on the page.

result:
{"batchcomplete":"","query":
  {"pages":
    {"736":
      {"pageid":736,"ns":0,"title":"Albert Einstein",
      "contentmodel":"wikitext","pagelanguage":"en",
      "pagelanguagehtmlcode":"en","pagelanguagedir":"ltr",
      "touched":"2022-02-06T12:46:49Z","lastrevid":1070093046,"length":184850}
    }
  }
}
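A response like this one can be turned into an R list and queried by name. The following is a minimal sketch that assumes the jsonlite package is installed; to stay offline it parses an abbreviated copy of the response stored in a string (the variable names are illustrative).

```r
library(jsonlite)

# Abbreviated copy of the response shown above, stored as a string
response_json <- '{"batchcomplete":"","query":{"pages":{"736":{"pageid":736,"ns":0,"contentmodel":"wikitext","pagelanguage":"en","touched":"2022-02-06T12:46:49Z","lastrevid":1070093046,"length":184850}}}}'

parsed <- fromJSON(response_json)

# Pages are keyed by their page ID, so we take the first (and only) entry
page <- parsed$query$pages[[1]]

page$pageid  # numeric ID of the page
page$touched # timestamp of the last update of the page record
page$length  # size of the page in bytes
```

Selecting the page by position rather than by the key "736" keeps the code usable for other titles, whose page IDs are not known in advance.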

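Notice that the first part of the URL (https://en.wikipedia.org/w/api.php?) is common to all calls of the API, while the parameters that follow (action=query&format=json&prop=info&titles=Albert%20Einstein) relate to the specific action you are trying to perform.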
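The split between the shared endpoint and the request-specific parameters can be made explicit by assembling the call in base R. This is a small sketch; the variable names are illustrative and not part of the API.

```r
# Endpoint shared by every call to the Action API of the English Wikipedia
endpoint <- "https://en.wikipedia.org/w/api.php"

# Parameters that describe this particular request
params <- c(action = "query", format = "json", prop = "info",
            titles = "Albert Einstein")

# Percent-encode each value and join the name=value pairs with "&"
encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
query_string <- paste0(names(params), "=", encoded, collapse = "&")

api_url <- paste0(endpoint, "?", query_string)
api_url
```

The resulting string is identical to the URL of the call shown earlier; changing only the entries of params yields calls for other actions or pages.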

22.4 API access in R

How can we access the API from R (httr + other packages)?

The most common tool is WikipediR, a wrapper around the Wikipedia API. It allows R to access information and “directions” for the relevant page or pages of Wikipedia and the content or metadata therein. Importantly, the wrapper only gathers information, so it needs to be accompanied by other packages such as rvest for scraping and XML or jsonlite for parsing.

WikipediR lets us retrieve different kinds of information, such as general page info, backlinks, and the categories of a page. In this example we are interested in the titles of the first ten backlinks of the Albert Einstein page.

library(WikipediR)

# page_backlinks() from the WikipediR package returns the pages that link to the given page
all_bls <- page_backlinks("en", "wikipedia", page = "Albert Einstein", limit = 10)

# collect the title of each backlink in a one-column data frame
bls_title <- data.frame(
  backlinks = vapply(all_bls$query$backlinks, function(x) x$title, character(1))
)
bls_title

##                  backlinks
## 1  Talk:Altruism/Archive 1
## 2      Arthur Schopenhauer
## 3          Android (robot)
## 4             Alfred Nobel
## 5                     Atom
## 6                    Axiom
## 7                 April 30
## 8        Albert Schweitzer
## 9           Anatole France
## 10   Augustin-Jean Fresnel
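The other kinds of information mentioned above can be retrieved in the same way. As a hedged sketch, the categories of a page could be collected with WikipediR's categories_in_page() function; the two helper functions below are hypothetical names introduced for illustration, and the code assumes the response is a nested list with one entry per page, each holding a list of categories.

```r
library(WikipediR)

# Hypothetical helper: collect the category titles from the nested list
# returned by categories_in_page() (one entry per page, each of which
# is assumed to hold a list called "categories")
category_titles <- function(resp) {
  titles <- lapply(resp$query$pages, function(p) {
    vapply(p$categories, function(x) x$title, character(1))
  })
  unlist(titles, use.names = FALSE)
}

# Hypothetical wrapper: query the API (needs an internet connection)
# and return the titles of up to `limit` categories of one page
page_category_titles <- function(page, limit = 10) {
  resp <- categories_in_page("en", "wikipedia", pages = page, limit = limit)
  category_titles(resp)
}
```

With an internet connection, page_category_titles("Albert Einstein") would return a character vector of category titles for that page.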

We can also scrape data from Wikipedia simply by using the rvest package, which can extract all kinds of information, such as tables. This is done in the following example.

library(rvest)
library(xml2)

url <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
html <- read_html(url) # reading the html code into memory

# with the functions of the rvest package we are able to scrape the information we need, here with html_table()

tab <- html_table(html, fill = TRUE) # returns a list of all tables on the page; access one with tab[[number]]

# we are interested in the second table in the list, the population of countries and dependencies
data <- tab[[2]]
data <- data[-1, ] # delete the first row
data[1:5, 1:4] # for a better overview we only look at the first 5 rows and the first 4 columns

# A tibble: 5 x 4
  Rank  `Country / Dependency` Population    Population
  <chr> <chr>                  <chr>         <chr>     
1 –     World                  8,035,755,000 100%      
2 1     China                  1,411,750,000 17.6%     
3 2     India                  1,392,329,000 17.3%     
4 3     United States          334,886,000   4.17%     
5 4     Indonesia              277,749,853   3.46%
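As the <chr> column types show, the scraped figures arrive as text. Before they can be used in an analysis, they need to be converted to numbers. The following is a minimal sketch in base R; it re-creates the five rows printed above in a small data frame whose column names (country, population, share) are chosen here for illustration.

```r
# First rows of the scraped table as printed above; all columns are character
pop <- data.frame(
  country    = c("World", "China", "India", "United States", "Indonesia"),
  population = c("8,035,755,000", "1,411,750,000", "1,392,329,000",
                 "334,886,000", "277,749,853"),
  share      = c("100%", "17.6%", "17.3%", "4.17%", "3.46%")
)

# Strip the thousands separators and the percent signs, then convert to numeric
pop$population <- as.numeric(gsub(",", "", pop$population, fixed = TRUE))
pop$share      <- as.numeric(sub("%", "", pop$share, fixed = TRUE))

str(pop)
```

The same two conversions can be applied to the corresponding columns of the full scraped table.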
