5.14 Lab 4: Scraping newspaper websites

Extracting text from newspaper websites is a very common task in web scraping. One advantage of these sites is that they tend to offer an RSS feed listing the stories they have recently published, which we can use to scrape them more efficiently.

Parsing RSS feeds requires that we learn a slightly different data format: XML, or eXtensible Markup Language, which predates (but is similar to) JSON. Just like HTML, it uses a series of tags arranged in a tree structure. We will use the xml2 and rvest packages to read data in XML format.

Let’s look at an example:

feed <- "http://www.spiegel.de/politik/index.rss"
library(xml2)
library(rvest)
rss <- read_xml(feed)
substr(as.character(rss), 1, 1000)
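To make the tree structure concrete without depending on a live feed, the sketch below parses a tiny hand-written RSS document. The feed content and the object names (toy_feed, toy, channel) are made up for illustration; only the xml2 functions are real.

```r
library(xml2)

# A made-up feed with one channel and two items (not real data)
toy_feed <- '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>'

# read_xml() accepts literal XML as well as a URL or file path
toy <- read_xml(toy_feed)

# The root node is <rss>; its only child is <channel>
root_name <- xml_name(toy)
channel <- xml_child(toy, "channel")

# <channel> holds feed-level metadata followed by one <item> per story
child_names <- xml_name(xml_children(channel))
n_items <- length(xml_find_all(toy, "//item"))

print(root_name)
print(child_names)
print(n_items)
```

Note that the channel has its own title and link in addition to those of each item, which matters when we select nodes by tag name.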

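Just like with HTML, we can extract specific nodes of the XML file, here using a combination of xml_find_all (which takes an XPath expression) and xml_text. Restricting the search to nodes inside item elements skips the title and link of the feed itself, so headlines and URLs stay aligned:

headlines <- xml_find_all(rss, '//item/title')
(headlines <- xml_text(headlines))
urls <- xml_find_all(rss, '//item/link')
(urls <- xml_text(urls))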
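Extracting titles and links in two separate passes assumes that every story has both. One possible alternative, sketched here on a made-up feed, is to select the item nodes first and then look up each field within every item using xml_find_first, which returns a missing value when a field is absent. The feed content and the name feed_df are illustrative assumptions.

```r
library(xml2)

# Made-up feed in which the second item has no <link>
toy <- read_xml('<rss version="2.0"><channel>
  <title>Example News</title>
  <link>https://example.com/</link>
  <item><title>First story</title><link>https://example.com/first</link></item>
  <item><title>Story without a link</title></item>
  <item><title>Third story</title><link>https://example.com/third</link></item>
</channel></rss>')

# One node per story
items <- xml_find_all(toy, "//item")

# xml_find_first() returns one result per item (NA text if the field is missing),
# so the two columns always have the same length and stay aligned
feed_df <- data.frame(
  headline = xml_text(xml_find_first(items, "./title")),
  url = xml_text(xml_find_first(items, "./link")),
  stringsAsFactors = FALSE
)

# Drop stories that cannot be scraped because they have no URL
feed_df <- feed_df[!is.na(feed_df$url), ]
print(feed_df)
```

Keeping headline and URL in the same row from the start avoids pairing a headline with the wrong article later on.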

Once we have the article URLs, we could go page by page, inspecting the internal structure of each one and then scraping it. However, some packages already bundle general-purpose extractors that work reasonably well across many newspaper websites. One of these is boilerpipeR, an R interface to the Java library boilerpipe, which combines machine learning and heuristics to separate the main text of a page from surrounding boilerplate such as menus and footers. Let’s see how it works in this case:

library(boilerpipeR)
library(rJava)
# read one article URL -- note that the HTML must be collapsed into a single character string
text <- readLines(urls[3], warn = FALSE)
text <- paste(text, collapse="\n")
# now let's try to extract the main text
main_text <- ArticleExtractor(text)
cat(main_text)

Once we have prototype code, the last step is to generalize using a loop that will iterate over URLs.

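articles <- list()
# seq_along() is safe even if urls is empty (1:length(urls) would yield 1, 0)
for (i in seq_along(urls)){

    message(i, " of ", length(urls))
    # download the page and collapse it into a single string
    text <- paste(readLines(urls[i], warn = FALSE), collapse="\n")
    main_text <- ArticleExtractor(text)
    articles[[i]] <- data.frame(
        url = urls[i],
        headline = headlines[i],
        text = main_text,
        stringsAsFactors = FALSE)

}
# stack the one-row data frames into a single data frame
articles <- do.call(rbind, articles)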
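The loop above stops at the first URL that fails to download or parse. A minimal sketch of a more forgiving version follows: it wraps each request in tryCatch and pauses between requests. The helper scrape_all, its arguments, and the stand-in fake_fetch are hypothetical names introduced for illustration; in practice fetch could be a function that calls readLines and ArticleExtractor as above.

```r
# Hypothetical helper: apply `fetch` to every URL, skipping failures
scrape_all <- function(urls, fetch, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    row <- tryCatch(
      data.frame(url = urls[i], text = fetch(urls[i]),
                 stringsAsFactors = FALSE),
      error = function(e) {
        # report the problem and move on to the next URL
        message("Skipping ", urls[i], ": ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(row)) results[[i]] <- row
    Sys.sleep(delay)  # pause between requests to avoid hammering the server
  }
  # drop failed entries before stacking the rows
  do.call(rbind, Filter(Negate(is.null), results))
}

# Stand-in for a real download, so the example runs offline
fake_fetch <- function(u) {
  if (grepl("broken", u)) stop("could not open URL")
  paste("Text of", u)
}

demo_urls <- c("https://example.com/a",
               "https://example.com/broken",
               "https://example.com/c")
demo <- scrape_all(demo_urls, fake_fetch, delay = 0)
print(demo)
```

Passing the download step in as a function keeps the error handling separate from the extraction logic, so the same loop can be reused with a different extractor.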

Of course, sometimes this standardized code will not work with specific websites. In those cases, it’s easier to write our own scraper for the site.
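As a rough illustration of what such site-specific code might look like, the sketch below uses rvest (version 1.0.0 or later is assumed, for html_element, html_elements and html_text2) on a made-up article page. The HTML and the CSS selectors h1.headline and div.article-body p are assumptions; a real site needs selectors found by inspecting its pages.

```r
library(rvest)

# Made-up article page; real sites will use different tags and class names
page <- read_html('<html><body>
  <nav>Home | Politics | Sports</nav>
  <article>
    <h1 class="headline">Example headline</h1>
    <div class="article-body">
      <p>First paragraph of the story.</p>
      <p>Second paragraph of the story.</p>
    </div>
  </article>
  <footer>Contact us</footer>
</body></html>')

# Select only the nodes that hold the content we care about
headline <- html_text2(html_element(page, "h1.headline"))
paragraphs <- html_text2(html_elements(page, "div.article-body p"))

# Join the paragraphs into one string, ignoring navigation and footer text
body_text <- paste(paragraphs, collapse = "\n\n")

print(headline)
cat(body_text)
```

Because the selectors target the article body directly, the navigation bar and footer never enter the result, which is the same goal boilerpipeR pursues with general heuristics.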
