5.7 Lab: Scraping unstructured data

A common scenario in web scraping is that the data we want is available as plain HTML, but spread across different parts of a website rather than in a table format. In this case, we need to find a way to extract each element separately and then assemble the pieces into a data frame manually.

The motivating example here is the website www.databreaches.net, which hosts a database of data breach reports. We want to find out how many data breaches have been reported in recent months.

We will again be using the rvest package, but in a slightly different way: before scraping, we need to identify the CSS selector (Q: what was CSS again?) of each element we want to extract.

A very useful tool for this purpose is SelectorGadget, an extension for the Google Chrome browser. Install it from http://selectorgadget.com/. Then go back to the data breaches website and open the extension. Click on an element you want to extract; if other, unwanted elements are highlighted as well, click on those to deselect them. Once only the elements you are interested in are highlighted, copy the CSS selector and paste it into R.

Now we’re ready to scrape the website:
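The original code chunk is not shown here; a minimal sketch of what it might look like, assuming the `.entry-title` selector found with SelectorGadget (it appears in the node output below, but the site may have changed since):

```r
library(rvest)

# read the page and extract the headline nodes
url <- "https://www.databreaches.net/news/"
page <- read_html(url)

titles_html <- html_nodes(page, ".entry-title")  # CSS selector from SelectorGadget
titles_html                                      # the raw <h1> nodes

titles <- html_text(titles_html)                 # just the text inside each node
titles
```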

## {xml_nodeset (5)}
## [1] <h1 class="entry-title">Mean and median ransomware payments up in Q1, but ...
## [2] <h1 class="entry-title">Ca: Privacy commissioner investigating COVID Secr ...
## [3] <h1 class="entry-title">OR: Centennial schools to close for 2 days after  ...
## [4] <h1 class="entry-title">Developing — Babuk claims to have hacked Metropol ...
## [5] <h1 class="entry-title">Jp:  Notice about the occurrence of an online tra ...
## [1] "Mean and median ransomware payments up in Q1, but number of victims paying ransom may be decreasing"
## [2] "Ca: Privacy commissioner investigating COVID Secretariat data breach"                               
## [3] "OR: Centennial schools to close for 2 days after hackers breach school technology systems"          
## [4] "Developing — Babuk claims to have hacked Metropolitan D.C. Police"                                  
## [5] "Jp:  Notice about the occurrence of an online trading system failure due to unauthorized access"

Let’s do another one: the date of the news (or breach incident).
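Again, the chunk is omitted in the source; a sketch, where the `.entry-date` selector is an assumption (use SelectorGadget on the live site to find the actual one):

```r
library(rvest)

page  <- read_html("https://www.databreaches.net/news/")
dates <- html_text(html_nodes(page, ".entry-date"))  # selector is an assumption
dates
```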

Now let’s combine that into a dataframe.
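A sketch of the combining step, with the same caveat about the date selector:

```r
library(rvest)

page   <- read_html("https://www.databreaches.net/news/")
titles <- html_text(html_nodes(page, ".entry-title"))
dates  <- html_text(html_nodes(page, ".entry-date"))  # selector assumed

# one row per news item, one column per scraped element
dat <- data.frame(title = titles, date = dates, stringsAsFactors = FALSE)
dat
```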

##                                                                                                 title
## 1 Mean and median ransomware payments up in Q1, but number of victims paying ransom may be decreasing
## 2                                Ca: Privacy commissioner investigating COVID Secretariat data breach
## 3           OR: Centennial schools to close for 2 days after hackers breach school technology systems
## 4                                   Developing — Babuk claims to have hacked Metropolitan D.C. Police
## 5     Jp:  Notice about the occurrence of an online trading system failure due to unauthorized access
##             date
## 1 April 26, 2021
## 2 April 26, 2021
## 3 April 26, 2021
## 4 April 26, 2021
## 5 April 26, 2021

This was just for one page. How would we go about getting the data from several pages?

First, we will write a function that takes the URL of each page, scrapes it, and returns the information we want.
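A sketch of such a function; the name `scrape_page` and the `.entry-date` selector are assumptions:

```r
library(rvest)

# scrape one page of the news archive and return a data frame
scrape_page <- function(url) {
  page   <- read_html(url)
  titles <- html_text(html_nodes(page, ".entry-title"))
  dates  <- html_text(html_nodes(page, ".entry-date"))  # selector assumed
  data.frame(title = titles, date = dates, stringsAsFactors = FALSE)
}
```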

Then we will start a list of data frames and put the data frame for the initial page in the first position of that list.
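For example, assuming the scraping function from the previous step is called `scrape_page`:

```r
pages <- list()
pages[[1]] <- scrape_page("https://www.databreaches.net/news/")
str(pages)  # a list with one data frame so far
```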

## List of 1
##  $ :'data.frame':    5 obs. of  2 variables:
##   ..$ title: chr [1:5] "Mean and median ransomware payments up in Q1, but number of victims paying ransom may be decreasing" "Ca: Privacy commissioner investigating COVID Secretariat data breach" "OR: Centennial schools to close for 2 days after hackers breach school technology systems" "Developing — Babuk claims to have hacked Metropolitan D.C. Police" ...
##   ..$ date : chr [1:5] "April 26, 2021" "April 26, 2021" "April 26, 2021" "April 26, 2021" ...

How should we handle the following pages? Basically, we have to figure out the pattern in the URLs. Go to https://www.databreaches.net/news/, go back a few pages, and check how the URL changes. We can see that it changes as follows…

It’s simply a series of numbers. So we will create a base URL and then append these numbers. (Note that for this exercise we will only scrape the first 5 pages, as we don’t want to put unnecessary load on their site.)
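A sketch of building the URLs; the `.../page/2/` pattern is an assumption (it is what WordPress sites typically use), so verify it against what you saw on the site:

```r
# hypothetical pagination pattern; confirm on the site itself
base_url <- "https://www.databreaches.net/news/page/"
urls <- paste0(base_url, 2:5, "/")  # pages 2 to 5; page 1 was scraped above
urls
```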

And now we just need to loop over the pages, use the function we created earlier to scrape the information, and add each result to the list. Note that we wait a couple of seconds between HTTP requests to avoid overloading the site, and print a message that informs us of the progress of the loop.
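A sketch of the loop, assuming the `urls` vector, the `pages` list, and the `scrape_page` function from the earlier steps:

```r
for (i in seq_along(urls)) {
  message("Scraping page ", i + 1, "...")   # progress message
  pages[[i + 1]] <- scrape_page(urls[i])    # append to the list
  Sys.sleep(2)                              # pause between requests
}
```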

The final step is to convert the list of data frames into a single data frame that we can work with, using the function do.call(rbind, LIST) (where LIST is a list of data frames).
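In this case (assuming the combined data frame is called `full`):

```r
full <- do.call(rbind, pages)  # stack the list of data frames row-wise
head(full)
str(full)
```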

##                                                                                                          title
## 1                                                      NY: Guilderland Central Schools Hit with Malware Attack
## 2                                   Au: Queensland hospitals and aged care facilities crippled by cyber attack
## 3                                      Milan, the pharmaceutical company Mipharm SPA victim of a hacker attack
## 4 No: Ransomware attack on Nordlo knocked out Vakt og Alarm’s sick signal systems in several care institutions
## 5                                                             De: Grocer Tegut is the target of a cyber attack
## 6                                 It: Union of Comuni Colli del Monferrato, cyber attack: hackers publish data
##             date
## 1 April 26, 2021
## 2 April 26, 2021
## 3 April 26, 2021
## 4 April 26, 2021
## 5 April 26, 2021
## 6 April 25, 2021
## 'data.frame':    20 obs. of  2 variables:
##  $ title: chr  "NY: Guilderland Central Schools Hit with Malware Attack" "Au: Queensland hospitals and aged care facilities crippled by cyber attack" "Milan, the pharmaceutical company Mipharm SPA victim of a hacker attack" "No: Ransomware attack on Nordlo knocked out Vakt og Alarm’s sick signal systems in several care institutions" ...
##  $ date : chr  "April 26, 2021" "April 26, 2021" "April 26, 2021" "April 26, 2021" ...

Let’s compute some quick descriptive statistics, e.g. how many news items were there per day? Per month?

First, we convert the date variable to the Date format. Then we create a month and a day variable.
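A sketch of the conversion, assuming the combined data frame is called `full` and using lubridate for the weekday labels (an assumption; base R `weekdays()` would also work):

```r
library(lubridate)

# parse strings like "April 26, 2021"; %B requires an English locale
full$date2 <- as.Date(full$date, format = "%B %d, %Y")

full$month <- as.numeric(format(full$date2, "%m"))  # month number
full$day   <- as.numeric(format(full$date2, "%d"))  # day of month

table(full$month)
table(full$day)
table(wday(full$date2, label = TRUE))  # weekday counts, Sun through Sat
```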

## 
##  4 
## 20
## 
## 24 25 26 
##  7  8  5
## 
## Sun Mon Tue Wed Thu Fri Sat 
##   8   5   0   0   0   0   7

We can also aggregate the data.
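For example, with dplyr (assuming the `month` and `day` variables created above):

```r
library(dplyr)

full %>% group_by(month) %>% summarise(n = n())  # news items per month
full %>% group_by(day)   %>% summarise(n = n())  # news items per day
</imports>
```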

## # A tibble: 1 x 2
##   month     n
##   <dbl> <int>
## 1     4    20
## # A tibble: 3 x 2
##     day     n
##   <int> <int>
## 1    24     7
## 2    25     8
## 3    26     5