Big data and Social Science

8.2 Screen (Web) scraping

8.2.1 Scenarios

Data in table format
- e.g. List of countries by income equality
- Automatic extraction with rvest
Data in unstructured format
- e.g. Data breaches
- e.g. IPW - Team von A-Z
- Element identification with SelectorGadget
- Automatic extraction with rvest
Data hidden behind web forms
- Automation of web browser behavior with Selenium (RSelenium in R)
- e.g. the rOpenSci RSelenium tutorial: https://ropensci.org/tutorials/rselenium_tutorial/
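
The first scenario (data already in table format) can be sketched as follows. This is a minimal illustration, not a recipe for any particular site: the inline HTML stands in for a downloaded page, and the country names and Gini values are invented placeholders.

```r
library(rvest)

# Inline HTML standing in for a real page; all values are made up.
page <- read_html("<html><body>
<table>
  <tr><th>Country</th><th>Gini</th></tr>
  <tr><td>Country A</td><td>30.1</td></tr>
  <tr><td>Country B</td><td>45.5</td></tr>
</table>
</body></html>")

# Select every <table> node, then convert each one to a data frame.
tables <- html_table(html_nodes(page, "table"))

# The page contains a single table, so take the first list element.
income <- tables[[1]]
print(income)
```

On a real page, the string would be replaced by the page URL, and the table of interest would usually be picked out of several by its position in the list.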

8.2.2 HTML: a primer

Hypertext Markup Language (HTML): hidden standard behind every website
- HTML is text with marked-up structure, defined by tags:

<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>

What you see in your browser is an interpretation of the HTML document
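
To see the same document from R's side, the snippet above can be parsed with rvest. A minimal sketch, assuming the HTML is supplied as a string rather than downloaded; the variable names are illustrative only.

```r
library(rvest)

# The example document from above, stored as a character string.
html_code <- "<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"

# Parse the string into a document tree.
doc <- read_html(html_code)

# Pull out the text inside the <h1> and <p> tags.
heading   <- html_text(html_nodes(doc, "h1"))
paragraph <- html_text(html_nodes(doc, "p"))

print(heading)    # "My First Heading"
print(paragraph)  # "My first paragraph."
```

The tags themselves are gone from the result: R keeps the structure in `doc` and returns only the text that the browser would display.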

8.2.3 HTML: a primer (continued)

Some common tags:
- Document elements: <head>, <body>, <footer>…
- Document components: <title>, <h1>, <div>…
- Text style: <b>, <i>, <strong>…
- Hyperlinks: <a>
An example: IPW - Team von A-Z

8.2.4 Beyond HTML

Cascading Style Sheets (CSS): describe the formatting of HTML components (e.g. <h1>, <div>…). Useful for us, because CSS selectors let us target exactly the components we want to extract.
JavaScript: adds functionality to the website (e.g. changing content or structure after the website has been loaded)
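
Why CSS is useful for scraping can be sketched with a few selectors. The page below is a made-up example (the `team` id, the `name` and `role` classes, and the person names are all hypothetical); the selector syntax itself is standard CSS.

```r
library(rvest)

# Hypothetical page: a team box plus one paragraph outside of it.
page <- read_html("<html><body>
<div id='team'>
  <p class='name'>Person One</p>
  <p class='role'>Professor</p>
  <p class='name'>Person Two</p>
  <p class='role'>Researcher</p>
</div>
<p class='name'>Outside the team box</p>
</body></html>")

by_tag   <- html_nodes(page, "p")            # tag name: all <p> elements
by_class <- html_nodes(page, ".name")        # .class: elements with class 'name'
by_id    <- html_nodes(page, "#team")        # #id: the element with id 'team'
nested   <- html_nodes(page, "#team .name")  # descendant: 'name' inside 'team'

length(by_tag)     # 5
length(by_class)   # 3
html_text(nested)  # "Person One" "Person Two"
```

Note that read_html works on the HTML as delivered and does not run JavaScript, so content that a script inserts after loading will not be matched by any of these selectors.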

8.2.5 Parsing HTML code

First step in web scraping: read the HTML code into R and parse it
- Parsing = understanding structure
- How? rvest package in R:
  - read_html: parse HTML code into R
  - html_text: extract text from HTML code
  - html_table: extract tables in HTML code
  - html_nodes: extract components matching a CSS selector (called html_elements since rvest 1.0)
  - html_attrs: extract attributes of nodes
How to identify relevant CSS selectors? SelectorGadget extension for Chrome and Firefox.
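
Putting the functions together for the unstructured case, a minimal sketch might look like this. The markup, the `member` class, and the paths are invented for illustration; in practice the selector would come from SelectorGadget. `html_attr` (singular) is not in the list above but is the rvest companion of `html_attrs` that returns a single named attribute.

```r
library(rvest)

# Hypothetical team list; class name, paths and names are placeholders.
page <- read_html("<html><body>
<ul>
  <li><a class='member' href='/team/one' title='Profile'>Person One</a></li>
  <li><a class='member' href='/team/two' title='Profile'>Person Two</a></li>
</ul>
</body></html>")

# Select the links with a CSS selector (tag plus class).
links <- html_nodes(page, "a.member")

# html_attrs: every attribute of every node, as a list of named vectors.
all_attrs <- html_attrs(links)

# Combine link text and one chosen attribute into a data frame.
team <- data.frame(
  name = html_text(links),
  href = html_attr(links, "href"),
  stringsAsFactors = FALSE
)
print(team)
```

Each row of `team` corresponds to one matched node, which is the usual target shape when turning unstructured markup into data for analysis.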