8.15 Lab 4: Scraping newspaper websites

Extracting text from newspaper websites is a very common web scraping task. One advantage of these sites is that they tend to offer an RSS feed listing the stories they have recently published, which we can use to scrape them more efficiently.

Parsing RSS feeds requires that we learn a slightly different data format: XML, or eXtensible Markup Language, which predates (but is similar to) JSON. Just like HTML, it uses a series of tags and a tree structure. We will use the xml2 and rvest packages to read data in XML format.
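To make the tree structure concrete, here is a minimal sketch that parses a small, hand-written RSS-like snippet from a string rather than from a live feed. The feed content and the names `toy_feed`, `toy` and `channel` are invented for illustration only.

```r
library(xml2)

# A tiny hand-written document (hypothetical content, not a real feed)
toy_feed <- '<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
  </channel>
</rss>'

# read_xml() accepts a literal XML string as well as a URL or file path
toy <- read_xml(toy_feed)

xml_name(toy)                    # name of the root tag: "rss"
channel <- xml_child(toy, 1)     # first child of the root: <channel>
xml_name(xml_children(channel))  # tags nested directly inside <channel>
```

Each tag can contain other tags, so walking from the root to its children and grandchildren is how we move through the tree.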

Let’s look at an example:

library(xml2)
library(rvest)
feed <- "http://www.spiegel.de/politik/index.rss"
rss <- read_xml(feed)
# print the first 1,000 characters of the raw XML
substr(as.character(rss), 1, 1000)

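Just like with HTML, we can extract specific nodes of the XML file, here using a combination of xml_find_all (with an XPath expression) and xml_text. Restricting the search to tags inside item skips the title and link that describe the feed itself:

# xml_nodes() is deprecated in recent rvest releases; xml_find_all() from xml2 does the same job
headlines <- xml_find_all(rss, "//item/title")
(headlines <- xml_text(headlines))
urls <- xml_find_all(rss, "//item/link")
(urls <- xml_text(urls))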
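A feed usually carries a title and a link for the feed itself as well as for each story, so a selector that matches every title tag returns slightly more than the list of articles. The sketch below illustrates this on a made-up feed (the content and the name `mini_feed` are assumptions, not taken from the Spiegel feed).

```r
library(xml2)

# Hypothetical two-story feed, parsed from a string
mini_feed <- read_xml('<rss version="2.0"><channel>
  <title>Example News</title>
  <link>http://example.com/</link>
  <item><title>First story</title><link>http://example.com/1</link></item>
  <item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>')

# Every <title> in the document, including the feed's own title
all_titles <- xml_text(xml_find_all(mini_feed, "//title"))

# Only the <title> and <link> tags that sit directly inside an <item>
item_titles <- xml_text(xml_find_all(mini_feed, "//item/title"))
item_links  <- xml_text(xml_find_all(mini_feed, "//item/link"))

length(all_titles)   # 3: the feed title plus two stories
length(item_titles)  # 2: stories only
```

Checking that the vector of headlines and the vector of URLs have the same length is a quick way to confirm that they line up story by story.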

Once we have the article URLs, we could go page by page, inspect each page’s internal structure, and then scrape it. However, some packages already bundle a set of scrapers that generally work with most newspaper websites; one of these is boilerpipeR. It relies on a combination of machine learning and heuristics to separate the main text of a page from the surrounding boilerplate, so its extractors should work on most newspaper websites. Let’s see how it works in this case:

library(boilerpipeR)
# read one article URL; note that the whole page must be collapsed into a single character string
text <- readLines(urls[3])
text <- paste(text, collapse="\n")
# now let's try to extract the main text
main_text <- ArticleExtractor(text)

Once we have prototype code, the last step is to generalize it with a loop that iterates over the URLs.

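articles <- list()
for (i in seq_along(urls)){

    message(i, " of ", length(urls))
    text <- paste(readLines(urls[i]), collapse="\n")
    main_text <- ArticleExtractor(text)
    # one row per article; keep text columns as character, not factor
    articles[[i]] <- data.frame(
        url = urls[i],
        headline = headlines[i],
        text = main_text,
        stringsAsFactors = FALSE)
}
# stack the per-article data frames into a single data frame
articles <- do.call(rbind, articles)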
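When looping over many pages, a single failed download would normally stop the whole loop. One possible way to guard against that, sketched here under stated assumptions, is to wrap the download step in tryCatch. The function `fetch_article` and the vector `demo_urls` are hypothetical stand-ins so the example runs without network access; in practice the readLines and ArticleExtractor calls would take their place.

```r
# Hypothetical stand-in for the download-and-extract step; it fails on
# purpose for one URL so that the error handling can be seen at work.
fetch_article <- function(url) {
  if (grepl("broken", url)) stop("could not download ", url)
  paste("Text of", url)
}

demo_urls <- c("http://example.com/1",
               "http://example.com/broken",
               "http://example.com/3")

results <- list()
for (i in seq_along(demo_urls)) {
  # On error, record NA for this article instead of stopping the loop
  txt <- tryCatch(fetch_article(demo_urls[i]),
                  error = function(e) NA_character_)
  results[[i]] <- data.frame(url = demo_urls[i], text = txt,
                             stringsAsFactors = FALSE)
  Sys.sleep(0.1)  # short pause between requests; adjust for a real site
}
results <- do.call(rbind, results)

# URLs whose text is missing and may need a second attempt
results[is.na(results$text), "url"]
```

Recording NA rather than dropping the row keeps the output aligned with the input URLs, which makes it easy to retry only the pages that failed.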

Of course, this standardized code will sometimes fail on specific websites. In those cases, it is easier to write our own scraper for the site.
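As a rough illustration of what such site-specific code might look like, the sketch below extracts the paragraphs from a made-up article page with rvest. The HTML and the class name `article-body` are assumptions for the example; a real site will use its own tags and class names, which need to be found by inspecting the page.

```r
library(rvest)

# Hypothetical article page; real sites use their own tags and class names
page <- read_html('<html><body>
  <div class="nav">Home | Politics</div>
  <div class="article-body">
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </div>
  <div class="footer"><p>Copyright notice</p></div>
</body></html>')

# Keep only the paragraphs inside the (assumed) article container.
# html_elements() needs rvest 1.0.0 or later; older versions call it html_nodes().
paragraphs <- html_text(html_elements(page, "div.article-body p"))

# Collapse the paragraphs into a single string, one paragraph per line
main_text <- paste(paragraphs, collapse = "\n")
```

The CSS selector does the work here: it skips the navigation bar and the footer, which a generic extractor would otherwise have to guess at.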