8.12 Lab 4: Scraping web data behind web forms

The most difficult scenario for web scraping is when the data is hidden behind multiple pages that can only be accessed by entering information into web forms. A few approaches can work in these cases, with varying degrees of difficulty and reliability; Selenium is a suitable method in many of them.

Selenium automates web browsing sessions; it was originally designed for testing purposes. With it, you can simulate clicks, enter information into web forms, add waiting time between actions, and so on.

To learn how it works, we will scrape the website Monitor Legislativo, which provides information about the candidates in the recent Venezuelan legislative elections.

url <- 'http://eligetucandidato.org/filtro/'

As you can see, the information we want to scrape is hidden behind two drop-down menus in a web form. Let’s see how we can use Selenium to get to it.

The first step is to (install and) load the two packages we need, RSelenium and wdman. Then, we will start a headless browser running in the background.

library(RSelenium)
library(wdman)

server <- phantomjs(port=5000L)
browser <- remoteDriver(browserName = "phantomjs", port=5000L)
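
Note that PhantomJS is no longer actively maintained. If it fails to start on your system, one alternative (a sketch, assuming Google Chrome is installed; recent drivers may require naming the capability goog:chromeOptions instead) is to drive Chrome in headless mode:

server <- chrome(port=4567L, verbose=FALSE)
browser <- remoteDriver(browserName = "chrome", port=4567L,
    extraCapabilities = list(chromeOptions = list(args = list("--headless"))))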

Note that you may need to change the server port if it is already in use. Now we can open an instance of PhantomJS and navigate to the URL:

browser$open()
browser$navigate(url)

Here’s how we would check that it worked:

src <- browser$getPageSource()[[1]] # getPageSource() returns a list
substr(src, 1, 1000)
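
Instead of eyeballing the raw source, we can also parse it to confirm that the form is there. A quick sanity check (a sketch using rvest; #estado is the id of the states dropdown we interact with below):

library(rvest)
html <- read_html(src)
html_nodes(html, "#estado option") # should list the available states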

We can see what the website looks like at any time by taking screenshots. This will become very useful as we start playing with the web form.

browser$screenshot(display=TRUE)

Let’s assume we want to see the results for the state of Distrito Capital and the GPP party. First, we use SelectorGadget to identify the elements we need to interact with. Then, click on the lists of states and parties to find the exact state and party you want to scrape.

state <- browser$findElement(using = 'id', value="estado")
state$sendKeysToElement(list("Distrito Capital"))
browser$screenshot(display=TRUE)

party <- browser$findElement(using = 'id', value="partido")
party$sendKeysToElement(list("GPP"))
browser$screenshot(display=TRUE)
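
Sending keys works here because the form uses standard <select> elements. If it ever fails to pick the right option, an alternative (a sketch, assuming the option text contains the party label) is to locate the <option> node via XPath and click it directly:

party_option <- browser$findElement(using = 'xpath',
    value = "//select[@id='partido']/option[contains(text(), 'GPP')]")
party_option$clickElement()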

That seemed to work! Finally, let’s find the Send button and click on it.

send <- browser$findElement(using = 'id', value="enviar")
send$clickElement()
browser$screenshot(display=TRUE)
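
The results are rendered with JavaScript, so they may not be in the page source immediately after the click. A simple safeguard (a sketch; the full code at the end of this lab uses a similar polling loop) is to wait until the result links show up:

count <- 0
while (count < 10 && !grepl("portfolio_title", browser$getPageSource()[[1]], fixed=TRUE)){
    Sys.sleep(1) # check again every second, up to ten times
    count <- count + 1
}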

From this page, we want to scrape the URL of each candidate’s profile. Check the website first to see where the links are. Note that there are some duplicates, which we will need to remove. As in the previous cases, we can use SelectorGadget and the browser’s inspector to identify the information we want to extract. Then, for each of the elements, we extract the URL and the name of the candidate.

links <- browser$findElements(using="css selector", value=".portfolio_title a")

urls <- rep(NA, length(links))
names <- rep(NA, length(links))
for (i in seq_along(links)){
    urls[i] <- links[[i]]$getElementAttribute('href')[[1]]
    names[i] <- links[[i]]$getElementText()[[1]]
}

# removing duplicates
urls <- urls[!duplicated(urls)]
urls
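
Note that keeping only the unique URLs breaks the alignment with the names vector we built in the loop. If you need both, one option (a sketch) is to deduplicate them jointly:

candidates <- unique(data.frame(name = names, url = urls,
    stringsAsFactors=FALSE))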

The final step would be to scrape the information from each candidate’s page, doing all of this inside loops: first over states and parties to collect the URLs, and then over candidates. I will not cover the rest of the example in class, but I’m pasting the code below in case you want to use parts of it for your projects.

Before we switch topics, one more thing: it’s important to clean up after we scrape some data by closing the browser and stopping the server we started in the background.

browser$close()
server$stop()

And now, here’s the rest of the code:

############################################################
## first, we scrape the list of districts and parties
############################################################

url <- 'http://eligetucandidato.org/filtro/'
txt <- readLines(url)

# the dropdown options appear on these specific lines of the raw html
# (fragile: these indices will break if the page layout changes)
estados <- txt[288:311]
estados <- gsub(".*>(.*)<.*", "\\1", estados)

partidos <- txt[315:317]
partidos <- gsub(".*>(.*)<.*", "\\1", partidos)
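
These hard-coded line numbers are fragile: they will silently break if the page layout changes. A more robust option (a sketch using rvest, assuming the dropdowns keep their estado and partido ids) is to parse the options directly:

library(rvest)
html <- read_html(url)
estados <- html_text(html_nodes(html, "#estado option"))
partidos <- html_text(html_nodes(html, "#partido option"))
# the first option may be a placeholder you will want to drop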

############################################################
## now, loop over pages to extract the URL for each candidate
############################################################

library(RSelenium)
library(wdman)

server <- phantomjs(port=4446L, verbose=FALSE)
browser <- remoteDriver(browserName = "phantomjs", port=4446L)

browser$open()

candidatos <- c()

for (estado in estados){
    message(estado)
    for (partido in partidos){
        message(partido)

        browser$navigate(url)
        # input: estado
        Sys.sleep(1)
        more <- browser$findElement(using = 'id', value="estado")
        more$sendKeysToElement(list(estado))
        # input: partido
        Sys.sleep(1)
        more <- browser$findElement(using = 'id', value="partido")
        more$sendKeysToElement(list(partido))
        # click on "buscar"
        more <- browser$findElement(using = 'id', value="enviar")
        more$clickElement()
        # extracting URLS
        more <- browser$findElements(using = 'xpath', value="//a[@target='_self']")
        count <- 0
        while (length(more) < 2 && count < 10){ # poll until the results load
            Sys.sleep(1)
            more <- browser$findElements(using = 'xpath', value="//a[@target='_self']")
            count <- count + 1
        }
        urls <- unlist(sapply(more, function(x) x$getElementAttribute('href')))
        urls <- unique(urls)

        candidatos <- c(candidatos, urls)
        message(length(urls), ' new candidates, ', length(candidatos), ' in total')
    }
}

writeLines(candidatos, con="candidatos-urls.txt")

# clean up: close the browser and stop the server
browser$close()
server$stop()
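
Saving the URLs to disk means the next stage can run in a fresh session without repeating the scraping; we can read them back with:

candidatos <- readLines("candidatos-urls.txt")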

############################################################
## download the html code of candidates' URLS
############################################################

# make sure the output folder exists before downloading
dir.create("html", showWarnings=FALSE)

for (url in candidatos){
    id <- gsub('.*id=', '', url)
    filename <- paste0('html/', id, '.html')
    if (file.exists(filename)){ next }
    message(url)
    html <- readLines(url)
    writeLines(html, con=filename)
    Sys.sleep(2)
}
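
If any page fails to download, the loop above stops with an error. A more defensive version (a sketch) would replace the readLines() call with a tryCatch() that skips failures:

html <- tryCatch(readLines(url, warn=FALSE), error = function(e) NULL)
if (is.null(html)){ next } # skip pages that could not be downloaded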

############################################################
## extract information from html
############################################################

fls <- list.files("html", full.names=TRUE)
df <- list()

for (i in seq_along(fls)){

    txt <- readLines(fls[i])

    # id
    id <- gsub('html/(.*)\\.html', '\\1', fls[i])
    # name
    name <- txt[grep("displayCenter vc_single_image", txt)]
    name <- gsub('.*title="(.*)">.*', '\\1', name)
    # partido
    partido <- txt[grep("Partido Político", txt)]
    partido <- gsub('.*</strong> (.*)</span.*', '\\1', partido)
    # edad
    edad <- txt[grep("Edad", txt)]
    edad <- gsub('.*ong>(.*)</span.*', '\\1', edad)
    # circuito
    circuito <- txt[grep("Circuito", txt)]
    circuito <- gsub('.*Circuito (.*)</span.*', '\\1', circuito)[1]
    # estado
    estado <- txt[grep("Circuito", txt)]
    estado <- gsub('.*">(.*) –&nbsp;.*', '\\1', estado)[1]
    # twitter
    twitter <- txt[grep("fa-twitter.*@", txt)]
    twitter <- gsub('.*@(.*)</spa.*', '\\1', twitter)
    if (length(twitter)==0){ twitter <- NA }

    df[[i]] <- data.frame(
        id = id, name = name, partido = partido, edad = edad,
        circuito = circuito, estado = estado, twitter = twitter,
        stringsAsFactors=FALSE)
}

df <- do.call(rbind, df)
df <- df[order(df$estado, df$partido, df$circuito),]

write.csv(df, file="venezuela-monitor-legislativo-data.csv",
    row.names=FALSE)