5.12 Lab: Scraping newspaper website

Extracting media text from newspaper websites is a very common task in web scraping. One advantage of these sites is that they tend to offer an RSS feed listing all the stories they have published, which we can then use to scrape them more efficiently.

Parsing RSS feeds requires that we learn a slightly different data format: XML, or eXtensible Markup Language, which predates (but is similar to) JSON. Just like HTML, it uses a series of tags arranged in a tree structure. We will use the xml2 and rvest packages to read data in XML format.

Let’s look at an example:
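As a hedged sketch (the feed URL below is a hypothetical placeholder; substitute the RSS feed of the newspaper you want to scrape), we can parse a feed with `read_xml()` from the xml2 package and inspect its tree of tags:

```r
library(xml2)

# Hypothetical feed URL -- replace with a real newspaper RSS feed
feed_url <- "https://www.example-newspaper.com/rss"

# read_xml() parses the feed into an XML document, much like
# read_html() does for HTML pages
rss <- read_xml(feed_url)

xml_name(rss)       # name of the root node, e.g. "rss"
xml_structure(rss)  # print the tree of tags to explore the feed
```

RSS feeds typically contain a `<channel>` node with one `<item>` node per story, each holding tags such as `<title>`, `<link>`, and `<pubDate>`.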

Just like with HTML, we can extract specific nodes of the XML file using a combination of xml_nodes and xml_text:
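For example, assuming the standard RSS layout where each story is an `<item>` node (the feed URL is again a placeholder), we can pull out the titles and URLs of all articles:

```r
library(xml2)
library(rvest)

rss <- read_xml("https://www.example-newspaper.com/rss")  # hypothetical feed

# select all <item> nodes, then extract the text of the
# <title> and <link> tags inside each one
items  <- xml_nodes(rss, "item")
titles <- xml_text(xml_nodes(items, "title"))
urls   <- xml_text(xml_nodes(items, "link"))

head(titles)
head(urls)
```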

Once we have the article URLs, we could go page by page, looking at their internal structure, and then scraping each one. However, there are packages that bundle a set of generic scrapers designed to work with almost any newspaper website – one of these is boilerpipeR. It uses a combination of machine learning and heuristics to develop functions that should work for any newspaper website. Let’s see how it works in this case:
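A minimal sketch of this approach, assuming a hypothetical article URL taken from the feed above (boilerpipeR requires rJava and a working Java installation; ArticleExtractor is one of the extractors the package provides):

```r
library(boilerpipeR)  # requires rJava and Java installed
library(httr)

# hypothetical article URL from the RSS feed
url <- "https://www.example-newspaper.com/story-1"

# download the raw HTML, then let the extractor strip the
# boilerplate (menus, ads, related links) and keep the article text
html <- content(GET(url), as = "text")
text <- ArticleExtractor(html)

substr(text, 1, 200)  # preview the first 200 characters
```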


Once we have prototype code, the last step is to generalize using a loop that will iterate over URLs.
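The loop can be sketched as follows (the URL vector is hypothetical; in practice it would come from the RSS feed parsed earlier):

```r
# hypothetical article URLs extracted from the RSS feed
urls <- c("https://www.example-newspaper.com/story-1",
          "https://www.example-newspaper.com/story-2")

articles <- vector("list", length(urls))
for (i in seq_along(urls)) {
  message("Scraping ", urls[i])
  html <- httr::content(httr::GET(urls[i]), as = "text")
  articles[[i]] <- boilerpipeR::ArticleExtractor(html)
  Sys.sleep(1)  # be polite: pause between requests
}
```

Storing the results in a pre-allocated list and pausing between requests keeps the loop efficient and avoids overloading the newspaper's server.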

Of course, sometimes this standardized code will not work with specific websites. In those cases, it’s easier to just develop our own code.
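A custom scraper usually just targets the CSS selector that wraps the article body on that particular site. As a sketch (the URL and the `div.article-body` selector are hypothetical; you would find the real selector by inspecting the page's source):

```r
library(rvest)

# hypothetical article URL and CSS selector
url  <- "https://www.example-newspaper.com/story-1"
page <- read_html(url)

# extract every paragraph inside the article body and
# collapse them into a single string
paragraphs   <- html_text(html_nodes(page, "div.article-body p"))
article_text <- paste(paragraphs, collapse = "\n")
```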