8.1 Web scraping: Basics

  • An increasing amount of data is available on websites:
    • Speeches, sentences, biographical information…
    • Social media data, newspaper articles, press releases…
    • Geographic information, conflict data…
  • Often the data come in unstructured form (i.e. not as a data frame with rows and columns)
  • Web scraping = process of extracting this information automatically and transforming it into a structured dataset
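The notes later use R's rvest for this task; as a self-contained sketch of the same idea, here is a Python standard-library example that turns an unstructured HTML fragment into a structured list of records (the HTML, class names, and fields are made up for illustration):

```python
from html.parser import HTMLParser

# Hypothetical unstructured input: an HTML list of speeches.
HTML = """
<ul>
  <li class="speech"><span class="speaker">A. Smith</span> <span class="date">2020-01-15</span></li>
  <li class="speech"><span class="speaker">B. Jones</span> <span class="date">2020-02-03</span></li>
</ul>
"""

class SpeechParser(HTMLParser):
    """Collect one row (dict) per <li class="speech"> element."""
    def __init__(self):
        super().__init__()
        self.rows, self._field = [], None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "speech":
            self.rows.append({})                 # start a new row
        elif tag == "span" and attrs.get("class") in ("speaker", "date"):
            self._field = attrs["class"]         # remember which column comes next

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()
            self._field = None

parser = SpeechParser()
parser.feed(HTML)
print(parser.rows)
# → [{'speaker': 'A. Smith', 'date': '2020-01-15'},
#    {'speaker': 'B. Jones', 'date': '2020-02-03'}]
```

The result is exactly the "structured dataset" the definition describes: each row can be written straight into a data frame.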

8.1.1 Scraping data from websites: Why?

  • Copy & paste is time-consuming, boring, prone to errors, and impractical for large datasets
  • In contrast, automated web scraping:
    1. Scales well for large datasets
    2. Is reproducible
    3. Involves adaptable techniques
    4. Facilitates detecting and fixing errors
  • When to scrape?
    1. Trade-off: invest more time now to build the scraper, spend less time later
    2. Computer time is cheap; human time is expensive
  • Example of time saved: the legislatoR package vs. manual data collection

8.1.2 Scraping the web: two approaches

  • Two different approaches:
    1. Screen scraping: extract data from the website’s source code, using an HTML parser and/or regular expressions
      • rvest package in R
    2. Web APIs (application programming interfaces): a set of structured HTTP requests that return data as JSON or XML

8.1.3 The rules of the game

  1. Respect the hosting site’s wishes:
    • Check if an API exists or if data are available for download
    • Keep in mind where data comes from and give credit (and respect copyright if you want to republish the data!)
    • Some websites disallow scrapers in their robots.txt file
  2. Limit your bandwidth use:
    • Wait one or two seconds after each hit
    • Scrape only what you need, and only once (e.g. store the HTML file on disk, then parse it locally)
    • …otherwise you’ll get a visit from the IT people! (e.g. when scraping many articles)
  3. When using APIs, read the documentation:
    • Is there a batch download option?
    • Are there any rate limits?
    • Can you share the data?
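The etiquette above can be sketched in code: check robots.txt before fetching, wait between hits, and cache each page on disk so it is downloaded only once. This is a Python illustration (the notes use R); the robots.txt content is a made-up example, and `fetch_page` is a hypothetical download function, not a real library call.

```python
import time
from pathlib import Path
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the site being scraped.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def polite_get(url, cache_dir=Path("cache"), delay=2.0):
    """Fetch a page politely: respect robots.txt, rate-limit, cache on disk."""
    if not rp.can_fetch("*", url):
        raise PermissionError(f"robots.txt disallows {url}")
    cache_dir.mkdir(exist_ok=True)
    cached = cache_dir / url.replace("/", "_")
    if cached.exists():               # scrape just once: reuse the stored HTML
        return cached.read_text()
    time.sleep(delay)                 # limit bandwidth: wait between hits
    html = fetch_page(url)            # hypothetical download function
    cached.write_text(html)
    return html

print(rp.can_fetch("*", "https://example.org/private/page"))  # → False
print(rp.can_fetch("*", "https://example.org/public/page"))   # → True
```

Separating the "fetch and store" step from the "parse" step also means you can re-run your parser as often as you like without hitting the site again.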

8.1.4 The art of web scraping

  • Workflow:
    1. Learn about structure of website
    2. Choose your strategy
    3. Build prototype code: extract, prepare, validate
    4. Generalize: functions, loops, debugging
    5. Data cleaning
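Steps 3 and 4 of the workflow can be sketched as follows (a Python illustration; the notes use R in practice). The `pages` dict stands in for HTML files already stored on disk, and the regex and field names are made up: first a prototype that extracts, prepares, and validates one page, then a loop that generalizes it.

```python
import re

# Hypothetical HTML pages, as if already saved to disk during scraping.
pages = {
    "page1.html": "<html><h1>Speech by A. Smith</h1></html>",
    "page2.html": "<html><h1>Speech by B. Jones</h1></html>",
}

# Step 3: prototype on a single page -- extract, prepare, validate.
def extract_title(html):
    m = re.search(r"<h1>(.*?)</h1>", html)
    title = m.group(1).strip() if m else None
    assert title, "validation failed: no <h1> title found"   # fail early on bad pages
    return title

# Step 4: generalize -- wrap the prototype in a loop over all pages.
records = [{"file": name, "title": extract_title(html)}
           for name, html in pages.items()]
print(records)
```

Validating inside the prototype (here with a bare `assert`) is what makes the generalized loop debuggable: a malformed page fails loudly at extraction time instead of silently producing a broken dataset that only surfaces during data cleaning.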