8.9 Lab 3: Scraping unstructured data
A common scenario for web scraping is when the data we want is available in plain HTML, but spread across different parts of a website rather than laid out in a table. In this scenario, we need to find a way to extract each element and then assemble the pieces into a data frame manually.
The motivating example here will be the website www.databreaches.net, which contains a database of reports of data breaches. We want to find out how many data breaches there have been in recent months.
url <- 'https://www.databreaches.net/'
We will also be using the rvest package, but in a slightly different way: prior to scraping, we need to identify the CSS selector of each element we want to extract. (Recall: what is CSS?)
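For a quick refresher: CSS (Cascading Style Sheets) is the language used to style HTML documents, and its selectors identify sets of elements on a page, by tag name, by class (.headline), or by id (#lead). Here is a minimal sketch using a made-up HTML snippet (hypothetical markup, not the actual markup of databreaches.net):
library(rvest, warn.conflicts=FALSE)
# a toy HTML document, for illustration only
page <- read_html('<div>
  <h1 class="headline" id="lead">First story</h1>
  <h1 class="headline">Second story</h1>
</div>')
html_text(html_nodes(page, "h1"))        # by tag name: both headlines
html_text(html_nodes(page, ".headline")) # by class: both headlines
html_text(html_nodes(page, "#lead"))     # by id: only the first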
A very useful tool for this purpose is SelectorGadget, an extension for the Google Chrome browser. Go to the following website to install it: http://selectorgadget.com/. Now, go back to the data breaches website and open the extension. Click on an element you want to extract, and then click on any highlighted elements that you do not want. Once only the elements you’re interested in are highlighted, copy the CSS selector and paste it into R.
Now we’re ready to scrape the website:
library(rvest, warn.conflicts=FALSE)
website <- read_html(url) # reading the HTML code
title <- html_nodes(website, ".entry-title") # extract nodes matching the CSS selector
title # inspect the result
## {xml_nodeset (5)}
## [1] <a href="https://www.databreaches.net/cathay-pacific-flags-data-brea ...
## [2] <a href="https://www.databreaches.net/federation-of-sovereign-indige ...
## [3] <a href="https://www.databreaches.net/byram-healthcare-notifies-pati ...
## [4] <a href="https://www.databreaches.net/update-tio-networks-notifies-c ...
## [5] <a href="https://www.databreaches.net/follow-up-mecklenburg-co-not-f ...
html_text(title)
## [1] "Cathay Pacific flags data breach affecting 9.4 million passengers"
## [2] "Federation of Sovereign Indigenous Nations pays hacker $20K in bitcoin after massive data breach, sources say"
## [3] "Byram Healthcare notifies patients about rogue insider incident"
## [4] "Update: TIO Networks notifies consumers of breach going back to 2014 or earlier"
## [5] "Follow-up: Mecklenburg Co. not fined for releasing personal information of health department patients"
Let’s do another one: the year of the breach.
year <- html_nodes(website, ".year")
Let’s also take the day and the month.
day <- html_nodes(website, ".day")
month <- html_nodes(website, ".month")
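Because each field is extracted independently, the rows will only line up correctly if every post contributes exactly one node per selector. A quick, optional sanity check before combining them:
# all four node sets should have the same length; otherwise the
# data.frame() call below would recycle values or throw an error
length(title); length(day); length(month); length(year)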
Now let’s combine that into a dataframe.
df <- data.frame(title = html_text(title),
day = html_text(day),
month = html_text(month),
year = html_text(year))
df
| title | day | month | year |
|---|---|---|---|
| Cathay Pacific flags data breach affecting 9.4 million passengers | 24 | Oct | 2018 |
| Federation of Sovereign Indigenous Nations pays hacker $20K in bitcoin after massive data breach, sources say | 24 | Oct | 2018 |
| Byram Healthcare notifies patients about rogue insider incident | 24 | Oct | 2018 |
| Update: TIO Networks notifies consumers of breach going back to 2014 or earlier | 24 | Oct | 2018 |
| Follow-up: Mecklenburg Co. not fined for releasing personal information of health department patients | 24 | Oct | 2018 |
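A small caveat: on R versions before 4.0 (as used here, judging by the factor columns in the output below), data.frame() converts character vectors to factors by default. If you would rather keep them as character, add stringsAsFactors = FALSE:
# same data frame, but keeping the columns as character vectors
# (unnecessary on R >= 4.0, where this is already the default)
df <- data.frame(title = html_text(title),
                 day = html_text(day),
                 month = html_text(month),
                 year = html_text(year),
                 stringsAsFactors = FALSE)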
This was just for one page. How would we go about getting the data from several pages?
First, we will write a function that takes the URL of each page, scrapes it, and returns the information we want.
scrape_website <- function(url){
  website <- read_html(url)
  # variables that we're interested in
  title <- html_nodes(website, ".entry-title")
  year <- html_nodes(website, ".year")
  day <- html_nodes(website, ".day")
  month <- html_nodes(website, ".month")
  # putting together into a data frame
  df <- data.frame(title = html_text(title),
                   day = html_text(day),
                   month = html_text(month),
                   year = html_text(year))
  return(df)
}
Next, we will start a list of data frames, putting the data frame for the initial page in the first position of that list.
datasets <- list()
datasets[[1]] <- scrape_website(url)
str(datasets)
## List of 1
## $ :'data.frame': 5 obs. of 4 variables:
## ..$ title: Factor w/ 5 levels "Byram Healthcare notifies patients about rogue insider incident",..: 2 3 1 5 4
## ..$ day : Factor w/ 1 level "24": 1 1 1 1 1
## ..$ month: Factor w/ 1 level "Oct": 1 1 1 1 1
## ..$ year : Factor w/ 1 level "2018": 1 1 1 1 1
How should we handle the following pages? Basically, we have to find out what the pattern in the URLs is. Go to https://www.databreaches.net/, page back a few times, and note how the URL changes:
- https://www.databreaches.net/page/2/
- https://www.databreaches.net/page/3/
- https://www.databreaches.net/page/4/
It’s simply a series of numbers. So we will create a base URL and then append these numbers. (Note that for this exercise we will only scrape the first 5 pages, as we don’t want to stress their site.)
base_url <- "https://www.databreaches.net/page/"
pages <- seq(2, 5, by=1)
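To check which URLs this will produce, we can paste the pieces together; paste0() is shorthand for paste() with sep="":
paste0(base_url, pages) # the URLs we are about to scrape
## [1] "https://www.databreaches.net/page/2" "https://www.databreaches.net/page/3"
## [3] "https://www.databreaches.net/page/4" "https://www.databreaches.net/page/5"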
And now we just need to loop over the pages, using the function we created earlier to scrape each one and add it to the list. Note that we pause for a couple of seconds between HTTP requests to avoid overloading the site, and print a message that informs us of the progress of the loop.
for (i in 1:length(pages)){
  # informative message about progress of loop
  message(i, '/', length(pages))
  # prepare URL
  url <- paste(base_url, pages[i], sep="")
  # scrape website; note the index offset, since datasets[[1]]
  # already contains the front page
  datasets[[i+1]] <- scrape_website(url)
  # wait a couple of seconds between URL calls
  Sys.sleep(2)
}
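As an aside, the same logic can be written in a more functional style instead of pre-seeding the list and looping; a minimal sketch, assuming the scrape_website() function and the objects defined above:
# scrape the front page plus pages 2-5, pausing between requests,
# and collect the results directly into a list
all_urls <- c('https://www.databreaches.net/', paste0(base_url, pages))
datasets <- lapply(all_urls, function(u){
  Sys.sleep(2)
  scrape_website(u)
})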
The final step is to convert the list of data frames into a single data frame that we can work with, using the function do.call(rbind, LIST), where LIST is a list of data frames.
data <- do.call(rbind, datasets)
head(data)
| title | day | month | year |
|---|---|---|---|
| Cathay Pacific flags data breach affecting 9.4 million passengers | 24 | Oct | 2018 |
| Federation of Sovereign Indigenous Nations pays hacker $20K in bitcoin after massive data breach, sources say | 24 | Oct | 2018 |
| Byram Healthcare notifies patients about rogue insider incident | 24 | Oct | 2018 |
| Update: TIO Networks notifies consumers of breach going back to 2014 or earlier | 24 | Oct | 2018 |
| Follow-up: Mecklenburg Co. not fined for releasing personal information of health department patients | 24 | Oct | 2018 |
| NZ: IRD privacy breach raises data handling concerns | 24 | Oct | 2018 |
str(data)
## 'data.frame': 25 obs. of 4 variables:
## $ title: Factor w/ 25 levels "$50 million settlement in Yahoo security breach",..: ...
## $ day : Factor w/ 2 levels "23","24": 2 2 2 2 2 2 1 1 1 1 ...
## $ month: Factor w/ 1 level "Oct": 1 1 1 1 1 1 1 1 1 1 ...
## $ year : Factor w/ 1 level "2018": 1 1 1 1 1 1 1 1 1 1 ...
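As an alternative to do.call(rbind, ...), dplyr (which we use below for aggregation) provides bind_rows(), which accepts a list of data frames directly and is more forgiving when columns do not match exactly:
# largely equivalent to do.call(rbind, datasets); note that factor
# columns with differing levels may be coerced to character
data <- dplyr::bind_rows(datasets)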
Let’s get some quick descriptive statistics: for example, how many breaches were there per day? Per month?
table(data$month) # frequency table
| Oct |
|---|
| 25 |
sort(table(data$month), decreasing=TRUE) # sorting the table from most to least common
## Oct
##  25
table(data$day)
| 23 | 24 |
|---|---|
| 19 | 6 |
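If shares are more informative than raw counts, base R's prop.table() converts a frequency table into proportions:
prop.table(table(data$day)) # share of breaches per day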
We can also aggregate the data.
data %>% dplyr::group_by(month) %>% dplyr::summarise(n=n())
| month | n |
|---|---|
| Oct | 25 |
data %>% dplyr::group_by(day) %>% dplyr::summarise(n=n())
| day | n |
|---|---|
| 23 | 19 |
| 24 | 6 |
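Finally, a sketch of how the day, month, and year columns could be combined into a proper Date variable, which makes further time-based analysis easier (this assumes English month abbreviations such as "Oct"; adjust the format string for other locales):
# paste the three columns into strings like "24 Oct 2018" and parse them
data$date <- as.Date(paste(data$day, data$month, data$year),
                     format = "%d %b %Y")
table(data$date) # breaches per calendar day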