5.12 Lab: Scraping newspaper websites
Extracting text from newspaper websites is a very common task in web scraping. One advantage of these sites is that they tend to offer an RSS feed listing the stories they have published (typically the most recent ones), which we can use to scrape them more efficiently.
Parsing RSS feeds requires that we learn a slightly different data format: XML, or eXtensible Markup Language, which predates JSON but serves a similar purpose. Just like HTML, it uses a series of tags and a tree structure. We will use the xml2 and rvest packages to read data in XML format.
Let’s look at an example:
feed <- "http://www.spiegel.de/politik/index.rss"
library(xml2)
library(rvest)
rss <- read_xml(feed)
substr(as.character(rss), 1, 1000)
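To make the tree structure more concrete, here is a minimal sketch that parses a small hand-written RSS-like document. The toy feed and its example.com URLs are invented for illustration and are not the real Spiegel feed; only xml2 functions are used.

```r
library(xml2)

# A toy RSS document, invented for illustration
toy_rss <- read_xml('<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>')

# Name of the root node of the tree
root_name <- xml_name(toy_rss)

# The root has a single child, <channel>, which holds the feed metadata and the items
channel <- xml_child(toy_rss, 1)
child_names <- xml_name(xml_children(channel))

root_name
child_names
```

Each story sits in its own item node, next to the feed-level title and link nodes that describe the feed itself.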
Just like with HTML, we can extract specific nodes of the XML file using a combination of xml_nodes and xml_text:
headlines <- xml_nodes(rss, 'title')
(headlines <- xml_text(headlines))
urls <- xml_nodes(rss, 'link')
(urls <- xml_text(urls))
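Note that a selector such as 'title' matches every title node in the feed, including the one that describes the feed itself, so the first entries may not correspond to a story. One possible alternative, sketched below under the assumption that each story lives in an item node, is to use XPath with xml2's xml_find_all and xml_find_first. This also avoids xml_nodes, which (as far as we know) recent rvest versions flag as deprecated. The toy feed is invented for illustration.

```r
library(xml2)

# Toy feed, invented for illustration
toy_rss <- read_xml('<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item><title>First story</title><link>http://example.com/first</link></item>
    <item><title>Second story</title><link>http://example.com/second</link></item>
  </channel>
</rss>')

# Select only the <item> nodes, then read the title and link inside each one
items <- xml_find_all(toy_rss, ".//item")
toy_headlines <- xml_text(xml_find_first(items, "./title"))
toy_urls <- xml_text(xml_find_first(items, "./link"))

# Headlines and URLs now have one entry per story and stay aligned
toy_feed <- data.frame(headline = toy_headlines, url = toy_urls,
                       stringsAsFactors = FALSE)
toy_feed
```

Because both vectors are built from the same set of item nodes, row i of the data frame always pairs a headline with its own link.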
Once we have the article URLs, we could go page by page, looking at their internal structure, and then scrape each one. However, some packages already bundle general-purpose extractors that are meant to work with most newspaper websites. One of these is boilerpipeR. It uses a combination of machine learning and heuristics to identify the main text of a page. Let’s see how it works in this case:
Note: the code below does not currently work.
library(boilerpipeR)
library(rJava) # You need to have Java installed
# read one article URL; the page's HTML must be collapsed into a single character string
text <- readLines(urls[3])
text <- paste(text, collapse="\n")
# now let's try to extract the main text
main_text <- ArticleExtractor(text)
cat(main_text)
Once we have prototype code, the last step is to generalize using a loop that will iterate over URLs.
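articles <- list()
# seq_along() is safe even if urls is empty, unlike 1:length(urls)
for (i in seq_along(urls)){
  message(i, " of ", length(urls))
  # download the page and collapse it into a single string
  text <- paste(readLines(urls[i]), collapse="\n")
  main_text <- ArticleExtractor(text)
  # store one row per article
  articles[[i]] <- data.frame(
    url = urls[i],
    headline = headlines[i],
    text = main_text,
    stringsAsFactors=FALSE)
}
# combine the per-article rows into a single data frame
articles <- do.call(rbind, articles)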
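When iterating over many pages, a single failed download would stop the whole loop. A minimal sketch of one way to guard against that is shown here: it wraps the extraction step in tryCatch and pauses between requests. The helper scrape_all, its arguments, and the stand-in fake_extractor are hypothetical names introduced only for this illustration, so that the example runs without Java or a network connection.

```r
# Hypothetical helper: apply `extract_fun` to every URL, recording NA on failure
scrape_all <- function(urls, headlines, extract_fun, pause = 1) {
  articles <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    message(i, " of ", length(urls))
    main_text <- tryCatch(
      extract_fun(urls[i]),
      error = function(e) {
        # report the problem and keep going with a missing value
        message("  failed: ", conditionMessage(e))
        NA_character_
      }
    )
    articles[[i]] <- data.frame(
      url = urls[i],
      headline = headlines[i],
      text = main_text,
      stringsAsFactors = FALSE
    )
    Sys.sleep(pause)  # wait between requests to avoid overloading the server
  }
  do.call(rbind, articles)
}

# Stand-in extractor for demonstration; it fails on purpose for one URL
fake_extractor <- function(url) {
  if (grepl("broken", url)) stop("could not read page")
  paste("Text of", url)
}

demo <- scrape_all(
  urls = c("http://example.com/a", "http://example.com/broken"),
  headlines = c("Story A", "Story B"),
  extract_fun = fake_extractor,
  pause = 0
)
demo
```

With the real feed, extract_fun could be something like function(u) ArticleExtractor(paste(readLines(u), collapse="\n")), assuming boilerpipeR is installed and working.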
Of course, sometimes this standardized code will not work with specific websites. In those cases, it’s easier to just develop our own code.
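As a rough idea of what such custom code could look like, the sketch below extracts a headline and the body paragraphs with CSS selectors. The HTML snippet and the selectors (h1.headline, div.article-body p) are invented for illustration; a real site will need its own selectors, found by inspecting the page. It assumes rvest 1.0.0 or later, which provides html_element, html_elements and html_text2.

```r
library(rvest)

# Invented HTML standing in for a downloaded article page
page <- read_html('<html><body>
  <h1 class="headline">Example headline</h1>
  <div class="article-body">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
  <div class="footer"><p>Unrelated footer text.</p></div>
</body></html>')

# The selectors below are assumptions; inspect the real page to find the right ones
headline <- html_text2(html_element(page, "h1.headline"))
paragraphs <- html_text2(html_elements(page, "div.article-body p"))

# Join the body paragraphs into one string, leaving out the footer
main_text <- paste(paragraphs, collapse = "\n")
cat(headline, main_text, sep = "\n")
```

For a live page, read_html can be given the article URL instead of a literal string.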