## 8.15 Lab 4: Scraping newspaper websites

Extracting text from newspaper websites is a very frequent task in web scraping. One advantage of these sites is that they tend to offer an RSS feed listing the stories they have recently published, which we can use to scrape them more efficiently.

Parsing RSS feeds requires that we learn a slightly different data format: XML, or eXtensible Markup Language, which predates JSON and serves a similar purpose. Just like HTML, it uses a series of tags and a tree structure. We will use the xml2 and rvest packages to read data in XML format.

Let’s look at an example:

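library(xml2)
library(rvest)
feed <- "http://www.spiegel.de/politik/index.rss"
# download and parse the feed
rss <- read_xml(feed)
# print the first 1,000 characters of the raw XML
substr(as.character(rss), 1, 1000)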
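
To make the tag-and-tree structure concrete without depending on a live feed, here is a minimal sketch that parses a tiny hand-written RSS fragment. The feed content, titles and example.com URLs are invented for illustration; only the xml2 functions are real.

```r
library(xml2)

# A hand-written, hypothetical RSS fragment (not taken from any real feed)
toy_feed <- '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>'

# read_xml() accepts literal XML as well as a path or URL
toy <- read_xml(toy_feed)

# The root node is <rss>; its only child is <channel>
xml_name(toy)
channel <- xml_child(toy, 1)
xml_name(channel)

# <channel> holds feed-level metadata followed by one <item> per story
child_names <- xml_name(xml_children(channel))
child_names
```

Note that the channel has its own title and link in addition to those of each item, which matters when selecting nodes by tag name alone.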

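Just like with HTML, we can extract specific nodes of the XML file using a combination of xml_nodes and xml_text. In an RSS feed each story is an item node, whose title and link children hold the headline and the article URL:

headlines <- xml_nodes(rss, 'item > title')
(headlines <- xml_text(headlines))
urls <- xml_nodes(rss, 'item > link')
(urls <- xml_text(urls))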
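
Selecting titles and links in two separate calls relies on the two vectors lining up. One alternative sketch, assuming a standard RSS 2.0 layout where each story is an item node, first selects the items with xml2's xml_find_all and then uses xml_find_first within each one, so that every headline stays paired with its URL. The toy feed and the `stories` data frame are hypothetical.

```r
library(xml2)

# Hypothetical feed in which the second story has no <link>
toy <- read_xml('<rss version="2.0"><channel>
  <title>Example News</title>
  <link>http://example.com/</link>
  <item><title>First story</title><link>http://example.com/first</link></item>
  <item><title>Second story</title></item>
</channel></rss>')

# One node per story; //item skips the channel-level title and link
items <- xml_find_all(toy, "//item")

# xml_find_first() returns one result per item, so a missing tag
# becomes NA instead of shifting the remaining values
stories <- data.frame(
  headline = xml_text(xml_find_first(items, "./title")),
  url      = xml_text(xml_find_first(items, "./link")),
  stringsAsFactors = FALSE
)
stories
```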

Once we have the article URLs, we could go page by page, inspecting the internal structure of each and then scraping it. However, some packages already bundle a set of extractors that generally work with any type of newspaper website. One of these is boilerpipeR, which uses a combination of machine learning and heuristics to identify the main text of a page. Let’s see how it works in this case:

library(boilerpipeR)
library(rJava)
# read first URL -- note that all text needs to be in a single character vector
text <- readLines(urls[1])
text <- paste(text, collapse="\n")
# now let's try to parse it
main_text <- ArticleExtractor(text)
cat(main_text)

Once we have prototype code, the last step is to generalize using a loop that will iterate over URLs.

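articles <- list()
for (i in seq_along(urls)){
  message(i, " of ", length(urls))
  # download the page and collapse it into a single string
  text <- paste(readLines(urls[i]), collapse="\n")
  main_text <- ArticleExtractor(text)
  articles[[i]] <- data.frame(url=urls[i], text=main_text, stringsAsFactors=FALSE)
}
# combine the one-row data frames into a single data frame
articles <- do.call(rbind, articles)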
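
In practice some requests fail or some pages cannot be parsed, and a single error would abort the whole loop. The following sketch wraps each iteration in tryCatch and pauses between requests. `scrape_all`, `fetch_fun` and the `pause` argument are hypothetical names introduced here; `fetch_fun` stands for whatever function downloads and extracts one article.

```r
# Hypothetical helper: apply fetch_fun to every URL, continuing after errors
scrape_all <- function(urls, fetch_fun, pause = 1) {
  articles <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    message(i, " of ", length(urls))
    # on error, record NA for this article instead of stopping the loop
    text <- tryCatch(fetch_fun(urls[i]), error = function(e) NA_character_)
    articles[[i]] <- data.frame(url = urls[i], text = text,
                                stringsAsFactors = FALSE)
    Sys.sleep(pause)  # wait between requests to avoid overloading the server
  }
  do.call(rbind, articles)
}

# Offline demonstration with a stand-in fetcher that fails on one URL
fake_fetch <- function(u) {
  if (grepl("broken", u)) stop("could not download")
  paste("text of", u)
}
demo <- scrape_all(c("http://example.com/a", "http://example.com/broken"),
                   fake_fetch, pause = 0)
demo
```

With the real pipeline, `fetch_fun` could be something like `function(u) ArticleExtractor(paste(readLines(u), collapse = "\n"))`; rows with a missing text then mark the URLs worth retrying.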