8.2 Screen (Web) scraping
8.2.1 Scenarios
- Data in table format
  - e.g. List of countries by income equality
  - Automatic extraction with rvest
- Data in unstructured format
  - e.g. Data breaches
  - e.g. IPW - Team von A-Z
  - Element identification with selectorGadget
  - Automatic extraction with rvest
- Data hidden behind web forms
  - Automation of web browser behavior with selenium
  - e.g. https://ropensci.org/tutorials/rselenium_tutorial/
8.2.2 HTML: a primer
- Hypertext Markup Language (HTML): hidden standard behind every website
- HTML is text with marked-up structure, defined by tags:
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
- What you see in your browser is an interpretation of the HTML document
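To make the idea of "text with marked-up structure" concrete, the following minimal sketch reads the example document above into R as a string and lists the tags found inside <body>. It assumes the rvest package is installed; the object names (page, body, tags, texts) are chosen here for illustration only.

```r
library(rvest)

# The example document from above, passed to read_html() as a literal string
page <- read_html("<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>")

# Select the <body> element and look at its direct children
body  <- html_node(page, "body")
tags  <- html_name(html_children(body))   # tag names: "h1" "p"
texts <- html_text(html_children(body))   # text enclosed by each tag

print(tags)
print(texts)
```

The browser renders the same two elements as a heading and a paragraph; R only sees the tags and the text they enclose.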
8.2.3 HTML: a primer
- Some common tags:
  - Document elements: <head>, <body>, <footer>…
  - Document components: <title>, <h1>, <div>…
  - Text style: <b>, <i>, <strong>…
  - Hyperlinks: <a>
- An example: IPW - Team von A-Z
8.2.4 Beyond HTML
- Cascading Style Sheets (CSS): describe the formatting of HTML components (e.g. <h1>, <div>…); useful for us because CSS selectors let us pick out specific components
- JavaScript: adds functionality to the website (e.g. changes content/structure after the website has been loaded)
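As a rough illustration of why CSS matters for scraping, the sketch below selects components by tag, by class, and by id using CSS selectors. The HTML fragment, including the class name person, the id title, and the names in it, is made up for this example; it assumes the rvest package is installed.

```r
library(rvest)

# Hypothetical page fragment; class and id names are invented for illustration
page <- read_html('
<html><body>
  <h1 id="title">Team</h1>
  <div class="person"><b>Ada</b> <a href="/ada">profile</a></div>
  <div class="person"><b>Bob</b> <a href="/bob">profile</a></div>
  <div class="footer">Contact</div>
</body></html>')

by_tag   <- html_text(html_nodes(page, "h1"))            # all <h1> elements
by_class <- html_text(html_nodes(page, "div.person b"))  # <b> inside <div class="person">
by_id    <- html_text(html_nodes(page, "#title"))        # the element with id="title"

print(by_class)
```

Note that read_html only sees the HTML as delivered by the server. Content that JavaScript adds after the page has loaded is not part of it, which is one reason browser automation is sometimes needed.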
8.2.5 Parsing HTML code
- First step in web scraping: read the HTML code into R and parse it
- Parsing = understanding structure
- How?
  - rvest package in R:
    - read_html: parse HTML code into R
    - html_text: extract text from HTML code
    - html_table: extract tables in HTML code
    - html_nodes: extract components with CSS selector
    - html_attrs: extract attributes of nodes
- How to identify relevant CSS selectors?
  - selectorGadget extension for Chrome and Firefox
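The functions listed above can be combined as in the following minimal sketch. The HTML fragment, the table values, the class name more, and the example.org link are all placeholders invented for illustration; a real page would be read by passing its URL to read_html instead. It assumes the rvest package is installed.

```r
library(rvest)

# Hypothetical page with one table and one link (all values are placeholders)
page <- read_html('
<html><body>
  <table>
    <tr><th>country</th><th>gini</th></tr>
    <tr><td>A</td><td>25</td></tr>
    <tr><td>B</td><td>40</td></tr>
  </table>
  <a class="more" href="https://example.org/data">More data</a>
</body></html>')

# html_table on a whole page returns a list with one table per <table> element
tables <- html_table(page)
income <- tables[[1]]

# html_nodes picks components with a CSS selector (here: <a class="more">)
link       <- html_nodes(page, "a.more")
link_text  <- html_text(link)    # text between <a> and </a>
link_attrs <- html_attrs(link)   # one named character vector per selected node

print(income)
print(link_attrs[[1]][["href"]])
```

In recent rvest versions html_nodes still works but has been superseded by html_elements, which takes the same CSS selector argument.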