8.2 Screen (Web) scraping
8.2.1 Scenarios
- Data in table format
  - e.g. List of countries by income equality
  - Automatic extraction with rvest
- Data in unstructured format
  - e.g. Data breaches
  - e.g. IPW - Team von A-Z
  - Element identification with selectorGadget
  - Automatic extraction with rvest
- Data hidden behind web forms
  - Automation of web browser behavior with Selenium (RSelenium in R)
  - e.g. the RSelenium tutorial: https://ropensci.org/tutorials/rselenium_tutorial/
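The first scenario (data in table format) can be sketched in a few lines. This is a minimal illustration, assuming the rvest package is installed. The table is an inline stand-in with made-up country names and values, not the real list of countries by income equality.

```r
library(rvest)

# Hypothetical stand-in for a web page containing one HTML table.
# The country names and numbers are invented for illustration only.
page <- read_html('
<html><body>
<table>
  <tr><th>country</th><th>gini</th></tr>
  <tr><td>Country A</td><td>25.1</td></tr>
  <tr><td>Country B</td><td>40.3</td></tr>
</table>
</body></html>')

# Find every <table> element and convert each one to a data frame;
# the result is a list with one entry per table on the page
tables <- html_table(html_nodes(page, "table"))

# This page has only one table, so take the first list entry
inequality <- tables[[1]]
print(inequality)
```

For a real page, the inline string would be replaced by the URL of the page, and the index in `tables[[1]]` would depend on how many tables the page contains.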
8.2.2 HTML: a primer
- Hypertext Markup Language (HTML): hidden standard behind every website
- HTML is text with marked-up structure, defined by tags:
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
- What you see in your browser is an interpretation of the HTML document
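To make the idea of "text with marked-up structure" concrete, the example page above can be parsed directly in R. The sketch below assumes the rvest package is installed and passes the HTML as a character string instead of a URL.

```r
library(rvest)

# The example page from above, stored as a plain character string
page_source <- "<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"

# read_html() accepts literal HTML as well as a URL or a file path
page <- read_html(page_source)

# Pull out the text enclosed by the <h1> and <p> tags
heading   <- html_text(html_nodes(page, "h1"))
paragraph <- html_text(html_nodes(page, "p"))

print(heading)
print(paragraph)
```

The tags themselves are gone from the result: only the text they enclose is returned, which is what a scraper is usually after.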
8.2.3 HTML: a primer
- Some common tags:
  - Document elements: <head>, <body>, <footer>, …
  - Document components: <title>, <h1>, <div>, …
  - Text style: <b>, <i>, <strong>, …
  - Hyperlinks: <a>
- An example: IPW - Team von A-Z
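One way to get a feel for which tags a page uses is to list them. The sketch below assumes rvest is installed; the page content is a hypothetical example, not the IPW page.

```r
library(rvest)

# Hypothetical page that uses several of the tags listed above
page <- read_html('
<html>
<head><title>Demo page</title></head>
<body>
  <h1>Heading</h1>
  <div><p>Some <b>bold</b> and <i>italic</i> text with a
  <a href="https://example.org">link</a>.</p></div>
</body>
</html>')

# "*" is the CSS selector that matches every element in the document
all_nodes <- html_nodes(page, "*")

# html_name() returns the tag name of each matched element
tag_names <- html_name(all_nodes)
print(table(tag_names))

# The <title> lives in <head>, so it is not shown in the page body
title_text <- html_text(html_nodes(page, "title"))
print(title_text)
```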
8.2.4 Beyond HTML
- Cascading Style Sheets (CSS): describes formatting of HTML components (e.g. <h1>, <div>, …); useful for us, since CSS selectors let us target specific components
- JavaScript: adds functionality to the website (e.g. changes content/structure after the website has been loaded)
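CSS is useful for scraping because the same selectors that attach styles to components can be used to pick those components out of a page. A minimal sketch, assuming rvest is installed; the class names `person` and `name` and the id `contact` are invented for this example.

```r
library(rvest)

# Hypothetical page; class and id values are made up for illustration
page <- read_html('
<html><body>
  <div class="person"><h1 class="name">Ada</h1><p>Research</p></div>
  <div class="person"><h1 class="name">Ben</h1><p>Teaching</p></div>
  <div id="contact"><p>Room 101</p></div>
</body></html>')

# Tag selector: every <p> element on the page
all_paragraphs <- html_text(html_nodes(page, "p"))

# Class selector (leading dot): every element with class "name"
names_only <- html_text(html_nodes(page, ".name"))

# Id selector (leading hash) plus descendant: <p> inside the element with id "contact"
contact <- html_text(html_nodes(page, "#contact p"))

print(all_paragraphs)
print(names_only)
print(contact)
```

Content that JavaScript adds after the page has loaded is not part of the HTML that `read_html()` receives, so selectors like these will not find it.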
8.2.5 Parsing HTML code
- First step in web scraping: read HTML code into R and parse it
  - Parsing = understanding structure
- How? rvest package in R:
  - read_html: parse HTML code into R
  - html_text: extract text from HTML code
  - html_table: extract tables in HTML code
  - html_nodes: extract components with CSS selector
  - html_attrs: extract attributes of nodes
- How to identify relevant CSS selectors? selectorGadget extension for Chrome and Firefox.
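The functions listed above can be combined as in the following sketch. It assumes rvest is installed; the page, the class `member`, and the link targets are hypothetical. `html_attr()` (singular) is an additional rvest function not listed above, used here to pull out a single attribute. In recent rvest versions `html_nodes()` is also available under the name `html_elements()`.

```r
library(rvest)

# Hypothetical team page; names, class and hrefs are invented for illustration
page <- read_html('
<html><body>
  <h1>Team</h1>
  <a class="member" href="/team/a" title="Person A">Person A</a>
  <a class="member" href="/team/b" title="Person B">Person B</a>
</body></html>')

# Step 1: select the relevant components with a CSS selector
members <- html_nodes(page, "a.member")

# Step 2: extract the visible text of each component
member_names <- html_text(members)

# Step 3: extract all attributes (a list with one named character vector per node)
member_attrs <- html_attrs(members)

# Or extract just one attribute, here the link target
member_links <- html_attr(members, "href")

print(member_names)
print(member_links)
```

In practice the selector in step 1 (here `a.member`) is the part that has to be found for each website, for example by clicking on the elements of interest with selectorGadget.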