5.2 Screen (Web) scraping

Data in table format
- e.g. List of countries by income equality
- Automatic extraction with rvest
Data in unstructured format
- e.g. Data breaches
- e.g. IPW - Team von A-Z
- Element identification with SelectorGadget
- Automatic extraction with rvest
Data hidden behind web forms
- Automation of web browser behavior with Selenium
- e.g. the RSelenium tutorial: https://ropensci.org/tutorials/rselenium_tutorial/
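
For the first case above (data in table format), extraction with rvest might look roughly like the sketch below. The HTML snippet and its values are made-up placeholders so the code runs offline; for a real page, one would pass the page's URL to read_html instead.

```r
library(rvest)

# Toy HTML with a single table; column names and values are placeholders,
# not real data.
page <- read_html('<html><body>
<table>
  <tr><th>country</th><th>score</th></tr>
  <tr><td>A</td><td>1.5</td></tr>
  <tr><td>B</td><td>2.5</td></tr>
</table>
</body></html>')

# Select every <table> node; html_table() then returns one data frame per table.
tables <- html_table(html_nodes(page, "table"))

# The first (and here only) table, with the <th> row used as column names.
scores <- tables[[1]]
print(scores)
```

On a page with several tables, html_table returns them all in one list, so the relevant one usually has to be picked by position or by a more specific selector.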

Hypertext Markup Language (HTML): hidden standard behind every website
- HTML is text with marked-up structure, defined by tags:

<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>

Some common tags:
- Document elements: <head>, <body>, <footer>…
- Document components: <title>, <h1>, <div>…
- Text style: <b>, <i>, <strong>…
- Hyperlinks: <a>
An example: IPW - Team von A-Z (inspect the HTML source code)
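
To see that the tags really define a nested structure, the small HTML document shown above can be parsed in R and its tree inspected. This is only a sketch using xml2; the variable names are arbitrary.

```r
library(xml2)

# The example document from the notes, as a character string.
html <- '<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>'

# Parse the text into a document tree.
doc <- read_html(html)

# Locate the <body> element and list the tag names of its child elements.
body <- xml_find_first(doc, "//body")
child_tags <- xml_name(xml_children(body))
print(child_tags)
```

The result lists the heading and the paragraph as children of <body>, which is the same nesting the tags express in the source text.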

Cascading Style Sheets (CSS): describes the formatting of HTML components (e.g. <h1>, <div>…), which is useful for us!
- We can use CSS selectors to pick out the HTML components we want to extract (see example)
JavaScript: adds functionality to the website (e.g. changes content/structure after the page has loaded)
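
A few common kinds of CSS selector (by tag, by class, by id) can be tried out with rvest as in the sketch below. The HTML, the class names and the links are invented for illustration and do not come from any real page.

```r
library(rvest)

# Invented page fragment: two "person" entries and one unrelated footer link.
page <- read_html('<html><body>
<h1 id="title">Team</h1>
<div class="person"><a href="/staff/a">Person A</a></div>
<div class="person"><a href="/staff/b">Person B</a></div>
<div class="footer"><a href="/imprint">Imprint</a></div>
</body></html>')

# Tag selector: every <h1> element.
headings <- html_nodes(page, "h1")

# Class + descendant selector: <a> elements inside <div class="person">.
person_links <- html_nodes(page, "div.person a")

# Id selector: the element with id="title".
title_node <- html_nodes(page, "#title")

# All <a> elements, including the footer link we do not want.
all_links <- html_nodes(page, "a")

print(length(person_links))
print(length(all_links))
```

Comparing the two counts shows why a specific selector matters: the plain tag selector also picks up the footer link, while the class-based one keeps only the person entries.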

First step in web scraping: read the HTML code into R and parse it
Parsing = understanding structure
How?
- xml2 package
  - read_html: read HTML code into R and parse it
- rvest package
  - html_text: extract text from HTML code
  - html_table: extract tables in HTML code
  - html_nodes: extract components with CSS selector (named html_elements since rvest 1.0.0; html_nodes still works)
  - html_attrs: extract attributes of nodes
How to identify relevant CSS selectors? Use the SelectorGadget extension for Chrome and Firefox.
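
Putting the functions listed above together, a minimal read, select, extract workflow could look like the following sketch. The HTML string, the selector "p.intro" and the example.org link are assumptions chosen so that the code runs without a network connection; on a real site the selector would come from SelectorGadget.

```r
library(rvest)

# Step 1: read and parse the HTML (here from a string; a URL works the same way).
page <- read_html('<html><body>
<p class="intro">Hello <b>world</b></p>
<a href="https://example.org/a" title="First">Link A</a>
</body></html>')

# Step 2: select components with CSS selectors.
intro_node <- html_nodes(page, "p.intro")
link_nodes <- html_nodes(page, "a")

# Step 3a: extract the text, including text inside nested tags such as <b>.
intro_text <- html_text(intro_node)

# Step 3b: extract all attributes of each selected node
# (a list with one named character vector per node).
link_attrs <- html_attrs(link_nodes)

print(intro_text)
print(link_attrs[[1]][["href"]])
```

html_attrs returns every attribute of a node; when only one attribute is needed (typically href), html_attr(link_nodes, "href") gives a plain character vector instead.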