## 8.12 Lab 4: Scraping web data behind web forms

The most difficult scenario for web scraping is when data is hidden behind multiple pages that can only be accessed entering information into web forms. There are a few approaches that might work in these cases, with varying degree of difficulty and reliability. Selenium is a suitable method in many cases.

Selenium automates web browsing sessions, and was originally designed for testing purposes. You can simulate clicks, enter information into web forms, add some waiting time between clicks, etc.

To learn how it works, we will scrape the website Monitor Legislativo, which provides information about the candidates in the recent Venezuelan legislative elections.

url <- 'http://eligetucandidato.org/filtro/'

As you can see, the information we want to scrape is hidden behind these two selectors. Let’s see how we can use Selenium to scrape it.

The first step is to (install and) load the two packages associated to RSelenium. Then, we will start a headless browser running in the background.

library(RSelenium)
library(wdman)

server <- phantomjs(port=5000L)
browser <- remoteDriver(browserName = "phantomjs", port=5000L)

Note that you may need to change the server port. Now we can open an instance of PhantomJS and navigate to the URL

browser$open() browser$navigate(url)

Here’s how we would check that it worked:

src <- browser$getPageSource() substr(src, 1, 1000) We can see what the website looks like at any time by taking screenshots. This will become very useful as we start playing with the web form. browser$screenshot(display=TRUE)

Let’s assume we want to see the results of the state of Distrito Capital for the GPP party. First, let’s use selectorGadget to identify the elements that we’re trying to scrape. Then, click on the list of states and parties to find the exact state and party you would want to scrape.

state <- browser$findElement(using = 'id', value="estado") # #estado state$sendKeysToElement(list("Distrito Capital"))
browser$screenshot(display=TRUE) party <- browser$findElement(using = 'id', value="partido")
party$sendKeysToElement(list("GPP")) browser$screenshot(display=TRUE)

That seemed to work! Finally, let’s find the information for the Send button and click on it.

send <- browser$findElement(using = 'id', value="enviar") send$clickElement()
browser$screenshot(display=TRUE) From this website, we want to scrape the URLs to each candidate’s page. Check the website and where the links are first. Note that there are some duplicates, which we will need to clean. As in the previous cases, we can use selectorGadget and inspect to help us identify the information we want to extract. Then, for each of the elements, we will extract the URL and the name of the candidate. links <- browser$findElements(using="css selector", value=".portfolio_title a")

urls[i] <- links[[i]]$getElementAttribute('href')[[1]] names[i] <- links[[i]]$getElementText()[[1]]
}

# removing duplicates
urls <- urls[!duplicated(urls)]
urls

The final step would be to scrape the information from each of these candidate websites, and do all of this inside loops over states and parties, and then candidates. I will not cover the rest of the example in class, but I’m pasting the code below in case you want to use parts of it for your projects.

Before we switch topics, one more thing: it’s important to clean up after we scrape some data by closing the browser that we opened in the background.

browser$close() And now, here’s the rest of the code: ############################################################ ## first, we scrape the list of districts and parties ############################################################ url <- 'http://eligetucandidato.org/filtro/' txt <- readLines(url) estados <- txt[288:311] estados <- gsub(".*>(.*)<.*", estados, repl="\\1") partidos <- txt[315:317] partidos <- gsub(".*>(.*)<.*", partidos, repl="\\1") ############################################################ ## now, loop over pages to extract the URL for each candidate ############################################################ library(RSelenium) library(RSelenium) library(wdman) server <- phantomjs(port=4446L, verbose=FALSE) browser <- remoteDriver(browserName = "phantomjs", port=4446L) browser$open()

candidatos <- c()

for (partido in partidos){
message(partido)

browser$navigate(url) # input: estado Sys.sleep(1) more <- browser$findElement(using = 'id', value="estado")
more$sendKeysToElement(list(estado)) # input: partido Sys.sleep(1) more <- browser$findElement(using = 'id', value="partido")
more$sendKeysToElement(list(partido)) # click on "buscar" more <- browser$findElement(using = 'id', value="enviar")
more$clickElement() # extracting URLS more <- browser$findElements(using = 'xpath', value="//a[@target='_self']")
count <- 0
while (length(more)<2 & count < 10){
Sys.sleep(1)
more <- browser$findElements(using = 'xpath', value="//a[@target='_self']") count <- count + 1 } urls <- unlist(sapply(more, function(x) x$getElementAttribute('href')))
urls <- unique(urls)

candidatos <- c(candidatos, urls)
message(length(urls), ' candidatos nuevos, ', length(candidatos), ' en total')

}
}

writeLines(candidatos, con=file("candidatos-urls.txt"))

############################################################
############################################################

for (url in candidatos){
id <- gsub('.*id=', '', url)
filename <- paste0('html/', id, '.html')
if (file.exists(filename)){ next }
message(url)
writeLines(html, con=file(filename))
Sys.sleep(2)
}

############################################################
## extract information from html
############################################################

fls <- list.files("html", full.names=TRUE)
df <- list()

for (i in 1:length(fls)){

# id
id <- gsub('html/(.*).html', fls[i], repl="\\1")
# name
name <- txt[grep("displayCenter vc_single_image", txt)]
name <- gsub(".*title=\"(.*)\">.*", name, repl='\\1')
# partido
partido <- txt[grep("Partido Político", txt)]
partido <- gsub('.*</strong> (.*)</span.*', partido, repl="\\1")
# circuito
circuito <- txt[grep("Circuito", txt)]
circuito <- gsub('.*Circuito (.*)</span.*', circuito, repl="\\1")[1]
df <- df[order(df$estado, df$partido, df\$circuito),]
row.names=FALSE)