5.10 Lab: Scraping web data behind web forms

5.10.1 Using RSelenium

The most difficult scenario for web scraping is when data is hidden behind multiple pages that can only be accessed by entering information into web forms. There are a few approaches that might work in these cases, with varying degrees of difficulty and reliability. Selenium is a suitable method in many of them.

Selenium automates web browsing sessions, and was originally designed for testing purposes. You can simulate clicks, enter information into web forms, add some waiting time between clicks, etc.

To learn how it works, we will scrape the website Monitor Legislativo, which provides information about the candidates in the recent Venezuelan legislative elections.

As you can see, the information we want to scrape is hidden behind these two selectors. Let’s see how we can use Selenium to scrape it.

The first step is to (install and) load the two packages associated with RSelenium. Then we will start a headless browser running in the background.
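A minimal sketch of that setup, assuming the two packages are RSelenium and wdman (wdman manages the driver binaries; the port number is arbitrary):

```r
# install the packages once, if needed
# install.packages(c("RSelenium", "wdman"))
library(RSelenium)
library(wdman)

# start a PhantomJS process in the background that will act
# as the Selenium server
pjs <- wdman::phantomjs(port = 4444L)
```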

Note that you may need to change the server port. Now we can open an instance of PhantomJS and navigate to the URL:
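A sketch using RSelenium's remoteDriver class; here we point the browser at the members-of-parliament page used in the example further below (substitute the Monitor Legislativo URL if you are following that example instead):

```r
# connect to the PhantomJS instance started above
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444L,
                      browserName = "phantomjs")
remDr$open()

# navigate to the page we want to scrape
remDr$navigate("https://www.bundestag.de/abgeordnete/")
```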

Here’s how we would check that it worked:
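For instance, by asking the driver for the current URL and page title:

```r
# both should match the page we just navigated to
remDr$getCurrentUrl()
remDr$getTitle()
```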

Using the code below we can see what the website looks like at any time by taking screenshots. This will become very useful as we start playing with the web form.
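A minimal example:

```r
# take a screenshot of the current browser state and show it
# in the viewer pane
remDr$screenshot(display = TRUE)

# alternatively, save it to a file
remDr$screenshot(file = "screenshot.png")
```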

Let’s assume we want to get a list of members of the parliament. On the website https://www.bundestag.de/abgeordnete/ there is a button on the right that displays such a list. First we inspect the source code of the website to identify how we can access this button. It turns out that the class of the respective element is bt-link-list, which we can use to locate the button. We do this below and click the button.

Having identified the selector for this button, we use RSelenium to click it:
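A sketch of both steps, using the bt-link-list class mentioned above:

```r
# locate the button via its class name and click it
button <- remDr$findElement(using = "class name", value = "bt-link-list")
button$clickElement()
```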

That seemed to work! Now we can retrieve the HTML source code of the page:
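```r
# the full HTML of the page, after the click, as a character string
src <- remDr$getPageSource()[[1]]
```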

We can then extract different types of information from the list:
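A sketch using rvest to parse the source we just retrieved. The CSS selectors below are hypothetical placeholders; inspect the live page to find the actual classes of the list entries:

```r
library(rvest)

html <- read_html(src)

# ".member-name" and ".member-party" are illustrative selectors only
member_names   <- html %>% html_elements(".member-name") %>%
  html_text(trim = TRUE)
member_parties <- html %>% html_elements(".member-party") %>%
  html_text(trim = TRUE)
member_links   <- html %>% html_elements(".member-name a") %>%
  html_attr("href")
```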

Finally, we write the information into a data frame:
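A sketch assuming the vectors extracted in the previous step; the column names are our own choice. We also close the browser and stop the background server once we are done:

```r
# combine the extracted vectors into a data frame
members <- data.frame(name = member_names,
                      party = member_parties,
                      link = member_links,
                      stringsAsFactors = FALSE)
head(members)

# clean up: close the browser session and stop PhantomJS
remDr$close()
pjs$stop()
```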