5.10 Lab: Scraping web data behind web forms

5.10.1 Using RSelenium

The most difficult scenario for web scraping is when data is hidden behind multiple pages that can only be accessed by entering information into web forms. There are a few approaches that might work in these cases, with varying degrees of difficulty and reliability. Selenium is a suitable method in many of them.

Selenium automates web browsing sessions, and was originally designed for testing purposes. You can simulate clicks, enter information into web forms, add some waiting time between clicks, etc.

To learn how it works, we will scrape the website Monitor Legislativo, which provides information about the candidates in the recent Venezuelan legislative elections.

As you can see, the information we want to scrape is hidden behind these two selectors. Let’s see how we can use Selenium to scrape it.

The first step is to (install and) load the two packages associated with RSelenium. Then we will start a headless browser running in the background.
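A minimal sketch of that setup, assuming the two packages are RSelenium and wdman (wdman manages the driver binaries; the port number is arbitrary):

```r
# install the packages once, if needed
# install.packages(c("RSelenium", "wdman"))
library(RSelenium)
library(wdman)

# start a PhantomJS process in the background that will act
# as the Selenium server
pjs <- wdman::phantomjs(port = 4444L)
```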

Note that you may need to change the server port. Now we can open an instance of PhantomJS and navigate to the URL:
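A sketch using RSelenium's remoteDriver class; here we point the browser at the members-of-parliament page used in the example further below (substitute the Monitor Legislativo URL if you are following that example instead):

```r
# connect to the PhantomJS instance started above
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444L,
                      browserName = "phantomjs")
remDr$open()

# navigate to the page we want to scrape
remDr$navigate("https://www.bundestag.de/abgeordnete/")
```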

Here’s how we would check that it worked:
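For instance, by asking the driver for the current URL and page title:

```r
# both should match the page we just navigated to
remDr$getCurrentUrl()
remDr$getTitle()
```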

Using the code below we can see what the website looks like at any time by taking screenshots. This will become very useful as we start playing with the web form.
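A minimal example:

```r
# take a screenshot of the current browser state and show it
# in the viewer pane
remDr$screenshot(display = TRUE)

# alternatively, save it to a file
remDr$screenshot(file = "screenshot.png")
```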

Let’s assume we want to get a list of members of the parliament. On the website https://www.bundestag.de/abgeordnete/ there is a button on the right that displays such a list. First we inspect the source code of the website to identify how we can access this button. It turns out that the class of the respective element is bt-link-list, which we can use to locate the button. We do this below and click the button.

Having identified the selector for this button, we use RSelenium to click it:
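A sketch of both steps, using the bt-link-list class mentioned above:

```r
# locate the button via its class name and click it
button <- remDr$findElement(using = "class name", value = "bt-link-list")
button$clickElement()
```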

That seemed to work! Now we can retrieve the HTML source code of the page:
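```r
# the full HTML of the page, after the click, as a character string
src <- remDr$getPageSource()[[1]]
```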

We can then extract different types of information from the list:
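A sketch using rvest to parse the source we just retrieved. The CSS selectors below are hypothetical placeholders; inspect the live page to find the actual classes of the list entries:

```r
library(rvest)

html <- read_html(src)

# ".member-name" and ".member-party" are illustrative selectors only
member_names   <- html %>% html_elements(".member-name") %>%
  html_text(trim = TRUE)
member_parties <- html %>% html_elements(".member-party") %>%
  html_text(trim = TRUE)
member_links   <- html %>% html_elements(".member-name a") %>%
  html_attr("href")
```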

Finally, we write the information into a data frame:
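A sketch assuming the vectors extracted in the previous step; the column names are our own choice. We also close the browser and stop the background server once we are done:

```r
# combine the extracted vectors into a data frame
members <- data.frame(name = member_names,
                      party = member_parties,
                      link = member_links,
                      stringsAsFactors = FALSE)
head(members)

# clean up: close the browser session and stop PhantomJS
remDr$close()
pjs$stop()
```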