8.9 Lab 3: Scraping unstructured data
A common scenario for web scraping is when the data we want is available in plain HTML, but spread across different parts of a website rather than laid out in a table. In this scenario, we need to find a way to extract each element and then assemble the pieces into a data frame manually.
The motivating example here will be the website www.databreaches.net, which contains a database of reports of data breaches. We want to find out how many data breaches there have been in recent months.
url <- 'https://www.databreaches.net/'
We will also be using the rvest package, but in a slightly different way: prior to scraping, we need to identify the CSS selector of each element we want to extract. (Recall: what is CSS?)
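For a quick refresher: CSS (Cascading Style Sheets) is the language used to style HTML documents, and its selectors identify sets of elements on a page, by tag name, by class (.headline), or by id (#lead). Here is a minimal sketch using a made-up HTML snippet (hypothetical markup, not the actual markup of databreaches.net):
library(rvest, warn.conflicts=FALSE)
# a toy HTML document, for illustration only
page <- read_html('<div>
  <h1 class="headline" id="lead">First story</h1>
  <h1 class="headline">Second story</h1>
</div>')
html_text(html_nodes(page, "h1"))        # by tag name: both headlines
html_text(html_nodes(page, ".headline")) # by class: both headlines
html_text(html_nodes(page, "#lead"))     # by id: only the first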
A very useful tool for this purpose is SelectorGadget, an extension for the Google Chrome browser. Go to the following website to install it: http://selectorgadget.com/. Now, go back to the data breaches website and open the extension. Click on an element you want to extract, and then click on any highlighted elements that you do not want. Once only the elements you’re interested in are highlighted, copy the CSS selector and paste it into R.
Now we’re ready to scrape the website:
library(rvest, warn.conflicts=FALSE)
website <- read_html(url) # reading the HTML code
title <- html_nodes(website, ".entry-title") # extract nodes matching the CSS selector
title # inspect the result
## {xml_nodeset (5)}
## [1] <a href="https://www.databreaches.net/cathay-pacific-flags-data-brea ...
## [2] <a href="https://www.databreaches.net/federation-of-sovereign-indige ...
## [3] <a href="https://www.databreaches.net/byram-healthcare-notifies-pati ...
## [4] <a href="https://www.databreaches.net/update-tio-networks-notifies-c ...
## [5] <a href="https://www.databreaches.net/follow-up-mecklenburg-co-not-f ...
html_text(title)
## [1] "Cathay Pacific flags data breach affecting 9.4 million passengers"
## [2] "Federation of Sovereign Indigenous Nations pays hacker $20K in bitcoin after massive data breach, sources say"
## [3] "Byram Healthcare notifies patients about rogue insider incident"
## [4] "Update: TIO Networks notifies consumers of breach going back to 2014 or earlier"
## [5] "Follow-up: Mecklenburg Co. not fined for releasing personal information of health department patients"
Let’s do another one: the year of the breach.
year <- html_nodes(website, ".year")
Let’s also take the day and the month.
day <- html_nodes(website, ".day")
month <- html_nodes(website, ".month")
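Because each field is extracted independently, the rows will only line up correctly if every post contributes exactly one node per selector. A quick, optional sanity check before combining them:
# all four node sets should have the same length; otherwise the
# data.frame() call below would recycle values or throw an error
length(title); length(day); length(month); length(year)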
Now let’s combine that into a dataframe.
df <- data.frame(title = html_text(title),
day = html_text(day),
month = html_text(month),
year = html_text(year))
df
| title | day | month | year |
|---|---|---|---|
| Cathay Pacific flags data breach affecting 9.4 million passengers | 24 | Oct | 2018 |
| Federation of Sovereign Indigenous Nations pays hacker $20K in bitcoin after massive data breach, sources say | 24 | Oct | 2018 |
| Byram Healthcare notifies patients about rogue insider incident | 24 | Oct | 2018 |
| Update: TIO Networks notifies consumers of breach going back to 2014 or earlier | 24 | Oct | 2018 |
| Follow-up: Mecklenburg Co. not fined for releasing personal information of health department patients | 24 | Oct | 2018 |
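A small caveat: on R versions before 4.0 (as used here, judging by the factor columns in the output below), data.frame() converts character vectors to factors by default. If you would rather keep them as character, add stringsAsFactors = FALSE:
# same data frame, but keeping the columns as character vectors
# (unnecessary on R >= 4.0, where this is already the default)
df <- data.frame(title = html_text(title),
                 day = html_text(day),
                 month = html_text(month),
                 year = html_text(year),
                 stringsAsFactors = FALSE)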
This was just for one page. How would we go about getting the data from several pages?
First, we will write a function that takes the URL of each page, scrapes it, and returns the information we want.
scrape_website <- function(url){
  website <- read_html(url)
  # variables that we're interested in
  title <- html_nodes(website, ".entry-title")
  year <- html_nodes(website, ".year")
  day <- html_nodes(website, ".day")
  month <- html_nodes(website, ".month")
  # putting together into a data frame
  df <- data.frame(title = html_text(title),
                   day = html_text(day),
                   month = html_text(month),
                   year = html_text(year))
  return(df)
}
Next, we will start a list of data frames, putting the data frame for the initial page in the first position of that list.
datasets <- list()
datasets[[1]] <- scrape_website(url)
str(datasets)
## List of 1
## $ :'data.frame': 5 obs. of 4 variables:
## ..$ title: Factor w/ 5 levels "Byram Healthcare notifies patients about rogue insider incident",..: 2 3 1 5 4
## ..$ day : Factor w/ 1 level "24": 1 1 1 1 1
## ..$ month: Factor w/ 1 level "Oct": 1 1 1 1 1
## ..$ year : Factor w/ 1 level "2018": 1 1 1 1 1
How should we handle the following pages? Basically, we have to find out what the pattern in the URLs is. Go to https://www.databreaches.net/, page back a few times, and note how the URL changes:
- https://www.databreaches.net/page/2/
- https://www.databreaches.net/page/3/
- https://www.databreaches.net/page/4/
It’s simply a series of numbers. So we will create a base URL and then append these numbers. (Note that for this exercise we will only scrape the first 5 pages, as we don’t want to stress their site.)
base_url <- "https://www.databreaches.net/page/"
pages <- seq(2, 5, by=1)
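To check which URLs this will produce, we can paste the pieces together; paste0() is shorthand for paste() with sep="":
paste0(base_url, pages) # the URLs we are about to scrape
## [1] "https://www.databreaches.net/page/2" "https://www.databreaches.net/page/3"
## [3] "https://www.databreaches.net/page/4" "https://www.databreaches.net/page/5"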
And now we just need to loop over the pages, using the function we created earlier to scrape each one and add it to the list. Note that we pause for a couple of seconds between HTTP requests to avoid overloading the site, and print a message that informs us of the progress of the loop.
for (i in 1:length(pages)){
  # informative message about progress of loop
  message(i, '/', length(pages))
  # prepare URL
  url <- paste(base_url, pages[i], sep="")
  # scrape website; note the index offset, since datasets[[1]]
  # already contains the front page
  datasets[[i+1]] <- scrape_website(url)
  # wait a couple of seconds between URL calls
  Sys.sleep(2)
}
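As an aside, the same logic can be written in a more functional style instead of pre-seeding the list and looping; a minimal sketch, assuming the scrape_website() function and the objects defined above:
# scrape the front page plus pages 2-5, pausing between requests,
# and collect the results directly into a list
all_urls <- c('https://www.databreaches.net/', paste0(base_url, pages))
datasets <- lapply(all_urls, function(u){
  Sys.sleep(2)
  scrape_website(u)
})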
The final step is to convert the list of data frames into a single data frame that we can work with, using the function do.call(rbind, LIST), where LIST is a list of data frames.
data <- do.call(rbind, datasets)
head(data)
| title | day | month | year |
|---|---|---|---|
| Cathay Pacific flags data breach affecting 9.4 million passengers | 24 | Oct | 2018 |
| Federation of Sovereign Indigenous Nations pays hacker $20K in bitcoin after massive data breach, sources say | 24 | Oct | 2018 |
| Byram Healthcare notifies patients about rogue insider incident | 24 | Oct | 2018 |
| Update: TIO Networks notifies consumers of breach going back to 2014 or earlier | 24 | Oct | 2018 |
| Follow-up: Mecklenburg Co. not fined for releasing personal information of health department patients | 24 | Oct | 2018 |
| NZ: IRD privacy breach raises data handling concerns | 24 | Oct | 2018 |
str(data)
## 'data.frame': 25 obs. of 4 variables:
## $ title: Factor w/ 25 levels "$50 million settlement in Yahoo security breach",..: ...
## $ day : Factor w/ 2 levels "23","24": 2 2 2 2 2 2 1 1 1 1 ...
## $ month: Factor w/ 1 level "Oct": 1 1 1 1 1 1 1 1 1 1 ...
## $ year : Factor w/ 1 level "2018": 1 1 1 1 1 1 1 1 1 1 ...
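As an alternative to do.call(rbind, ...), dplyr (which we use below for aggregation) provides bind_rows(), which accepts a list of data frames directly and is more forgiving when columns do not match exactly:
# largely equivalent to do.call(rbind, datasets); note that factor
# columns with differing levels may be coerced to character
data <- dplyr::bind_rows(datasets)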
Let’s get some quick descriptive statistics: for example, how many breaches were there per day? Per month?
table(data$month) # frequency table
| Oct |
|---|
| 25 |
sort(table(data$month), decreasing=TRUE) # sorting the table from most to least common
## Oct
##  25
table(data$day)
| 23 | 24 |
|---|---|
| 19 | 6 |
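If shares are more informative than raw counts, base R's prop.table() converts a frequency table into proportions:
prop.table(table(data$day)) # share of breaches per day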
We can also aggregate the data.
data %>% dplyr::group_by(month) %>% dplyr::summarise(n=n())
| month | n |
|---|---|
| Oct | 25 |
data %>% dplyr::group_by(day) %>% dplyr::summarise(n=n())
| day | n |
|---|---|
| 23 | 19 |
| 24 | 6 |
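Finally, a sketch of how the day, month, and year columns could be combined into a proper Date variable, which makes further time-based analysis easier (this assumes English month abbreviations such as "Oct"; adjust the format string for other locales):
# paste the three columns into strings like "24 Oct 2018" and parse them
data$date <- as.Date(paste(data$day, data$month, data$year),
                     format = "%d %b %Y")
table(data$date) # breaches per calendar day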