8.2 Screen (Web) scraping

8.2.1 Scenarios

  1. Data in table format
  2. Data in unstructured format
  3. Data hidden behind web forms

8.2.2 HTML: a primer

  • Hypertext Markup Language (HTML): the standard behind every website, usually hidden from the user
    • HTML is text with marked-up structure, defined by tags:

<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>

  • What you see in your browser is an interpretation of the HTML document
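To see that an HTML document is plain text whose tags define a structure, the example above can be parsed in R. This is a minimal sketch, assuming the rvest package (covered in 8.2.5) is installed; the variable names are illustrative only.

```r
library(rvest)

# The example document from above, stored as an ordinary character string
doc_text <- "<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"

# Parsing turns the text into a tree of nodes
doc <- read_html(doc_text)

# Each tag becomes a node that can be addressed by its name
heading   <- html_text(html_nodes(doc, "h1"))
paragraph <- html_text(html_nodes(doc, "p"))

print(heading)
print(paragraph)
```

The browser does the same parsing step, then renders the nodes visually instead of returning them as text.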

8.2.3 HTML: a primer (continued)

  • Some common tags:
    • Document elements: <head>, <body>, <footer>
    • Document components: <title>, <h1>, <div>
    • Text style: <b>, <i>, <strong>
    • Hyperlinks: <a>
  • An example: IPW - Team von A-Z
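The tags listed above can be picked out individually once a page is parsed. The sketch below uses a small invented page fragment (it is not the real IPW page, and the names and paths are placeholders) to show how the title, the link texts and the link targets might be extracted with rvest.

```r
library(rvest)

# Hypothetical page fragment, made up for illustration
page <- read_html('
<html>
<head><title>Team A-Z</title></head>
<body>
  <h1>Team</h1>
  <div>
    <a href="/team/person-a">Person A</a>
    <a href="/team/person-b"><b>Person B</b></a>
  </div>
</body>
</html>')

# <title> lives in the document head
page_title <- html_text(html_nodes(page, "title"))

# Every hyperlink is an <a> node; the target sits in its href attribute
links     <- html_nodes(page, "a")
link_text <- html_text(links)          # visible text, style tags such as <b> are dropped
link_href <- html_attr(links, "href")  # one attribute per node

print(page_title)
print(data.frame(text = link_text, href = link_href))
```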

8.2.4 Beyond HTML

  • Cascading Style Sheets (CSS): describes the formatting of HTML components (e.g. <h1>, <div>…); useful for us because CSS selectors identify the components we want to extract
  • JavaScript: adds functionality to the website (e.g. changes content or structure after the page has loaded)
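A short sketch of why CSS matters for scraping: the same selectors that a stylesheet uses to format components can be used to pick them out. The page fragment, class names and id below are invented for illustration, and rvest is assumed to be installed.

```r
library(rvest)

# Hypothetical fragment with an id and two classes
page <- read_html('
<html><body>
  <h1 id="main-title">Results</h1>
  <div class="entry">First entry</div>
  <div class="entry">Second entry</div>
  <div class="footer-note">Not an entry</div>
</body></html>')

# Select by tag name: all three <div> components
by_tag <- html_text(html_nodes(page, "div"))

# Select by class (leading dot): only components with class="entry"
by_class <- html_text(html_nodes(page, ".entry"))

# Select by id (leading hash): the single component with id="main-title"
by_id <- html_text(html_nodes(page, "#main-title"))

print(by_tag)
print(by_class)
print(by_id)
```

Note that read_html does not execute JavaScript, so content that a script adds after loading will not appear in the parsed document and cannot be selected this way.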

8.2.5 Parsing HTML code

  • First step in web scraping: read the HTML code into R and parse it
    • Parsing = understanding structure
    • How? rvest package in R:
      • read_html: read HTML code into R and parse it
      • html_text: extract the text from HTML nodes
      • html_table: extract HTML tables as data frames
      • html_nodes: extract components matching a CSS selector (named html_elements from rvest 1.0.0 onwards)
      • html_attrs: extract all attributes of nodes
  • How to identify relevant CSS selectors? SelectorGadget extension for Chrome and Firefox.
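The functions above can be combined as follows. This is a minimal sketch on an invented page fragment; the table contents, the class name "source" and the example.org URL are placeholders. On a real page, the selector would typically come from SelectorGadget.

```r
library(rvest)

# Hypothetical page with one table and one link
page <- read_html('
<html><body>
  <table>
    <tr><th>country</th><th>seats</th></tr>
    <tr><td>A</td><td>10</td></tr>
    <tr><td>B</td><td>25</td></tr>
  </table>
  <a class="source" href="https://example.org/data">Source</a>
</body></html>')

# html_table on a whole document returns a list with one entry per <table>
tables <- html_table(page)
seats  <- tables[[1]]

# html_nodes with a CSS selector: <a> components with class "source"
link <- html_nodes(page, "a.source")

# html_text gives the visible text, html_attrs all attributes of each node
link_label <- html_text(link)
link_attrs <- html_attrs(link)   # list with one named character vector per node

print(seats)
print(link_label)
print(link_attrs[[1]][["href"]])
```

For a live website, the string passed to read_html would be replaced by the page URL; the remaining steps stay the same.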