Web scraping: Basics
- An increasing amount of data is available on websites:
- Speeches, sentences, biographical information…
- Social media data, newspaper articles, press releases…
- Geographic information, conflict data…
- Data often comes in unstructured form (e.g. not as a data frame with rows and columns)
- Web scraping = process of extracting this information automatically and transforming it into a structured dataset
Scraping data from websites: Why?
- Copy & paste is time-consuming, boring, prone to errors, and impractical for large datasets
- In contrast, automated web scraping:
- Scales well for large datasets
- Is reproducible
- Involves adaptable techniques
- Facilitates detecting and fixing errors
- When to scrape?
- Trade-off: more time now to build, less time later
- Computer time is cheap; human time is expensive
- Example of time saved: legislatoR package vs. manual collection
Scraping the web: two approaches
- Two different approaches:
- Screen scraping: extract data from the source code of a website, with an HTML parser and/or regular expressions
- Web APIs (application programming interfaces): a set of structured HTTP requests that return JSON or XML data
- In this course, “web scraping” mostly refers to screen scraping
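To make the contrast concrete, here is a minimal sketch using only the Python standard library. The HTML snippet, the JSON payload, and the helper names (`TableParser`, `scrape_table`, `read_api`) are hypothetical stand-ins for what a real site or API would return. Both routes end in the same structured records, but screen scraping has to rebuild the structure from markup, while the API already delivers it.

```python
import json
from html.parser import HTMLParser

# Hypothetical page source, standing in for the HTML a website would return.
HTML_SOURCE = """
<table>
  <tr><th>Name</th><th>Party</th></tr>
  <tr><td>Ada Example</td><td>Green</td></tr>
  <tr><td>Ben Sample</td><td>Liberal</td></tr>
</table>
"""

# Hypothetical API response carrying the same records as JSON.
JSON_RESPONSE = (
    '[{"name": "Ada Example", "party": "Green"},'
    ' {"name": "Ben Sample", "party": "Liberal"}]'
)


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # text may arrive in several chunks


def scrape_table(html):
    """Screen scraping: turn an HTML table into a list of dictionaries."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    keys = [cell.strip().lower() for cell in header]
    return [dict(zip(keys, (cell.strip() for cell in row))) for row in body]


def read_api(payload):
    """Web API: the response is already structured, so decoding is enough."""
    return json.loads(payload)


screen_records = scrape_table(HTML_SOURCE)
api_records = read_api(JSON_RESPONSE)
```

In practice a dedicated HTML parsing library is usually more convenient than a hand-written `HTMLParser` subclass; the standard-library version is used here only to keep the sketch self-contained.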
The rules of the game
- Respect the hosting site’s wishes:
- Check if an API exists or if data are available for download
- Keep in mind where data comes from and give credit (and respect copyright if you want to republish the data!)
- Some websites disallow scrapers in their robots.txt file
- Limit your bandwidth use:
- Wait one or two seconds after each hit
- Scrape only what you need, and just once (e.g. store the HTML file on disk, then parse it)
- …otherwise you’ll get a visit from the IT guys! (e.g. scraping articles)
- When using APIs, read the documentation
- Is there a batch download option?
- Are there any rate limits?
- Can you share the data?
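Several of these rules can be built into the scraper itself. The sketch below uses only the Python standard library and is an illustration, not a complete scraper: the robots.txt content, the URLs, the user-agent string, and the `download` callable are all hypothetical (a real scraper would pass in a function that performs the HTTP request). It checks robots.txt, stores each page on disk so it is fetched only once, and pauses after every real download.

```python
import tempfile
import time
from pathlib import Path
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the site.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_fetch(url, rules, cache_dir, download, delay=1.0, agent="course-scraper"):
    """Return the page source for url, or None if robots.txt disallows it."""
    if not rules.can_fetch(agent, url):
        return None
    # Simplistic file naming, good enough for a sketch.
    name = url.replace("://", "_").replace("/", "_") + ".html"
    cache_file = Path(cache_dir) / name
    if cache_file.exists():
        # Already scraped once: parse the stored copy instead of hitting the site.
        return cache_file.read_text(encoding="utf-8")
    html = download(url)
    cache_file.write_text(html, encoding="utf-8")
    time.sleep(delay)  # wait after each real hit to limit bandwidth use
    return html


# Usage with a fake downloader, so the example runs without network access.
calls = []


def fake_download(url):
    calls.append(url)
    return "<html><body>speech</body></html>"


rules = build_robot_rules(ROBOTS_TXT)
with tempfile.TemporaryDirectory() as cache_dir:
    allowed_url = "https://example.org/speeches/1.html"
    first = polite_fetch(allowed_url, rules, cache_dir, fake_download, delay=0.0)
    second = polite_fetch(allowed_url, rules, cache_dir, fake_download, delay=0.0)
    blocked = polite_fetch(
        "https://example.org/private/data.html", rules, cache_dir, fake_download, delay=0.0
    )
```

The second call is served from the cached file, so the fake downloader runs only once, and the URL under `/private/` is never requested. The delay is set to zero here only so the demonstration finishes quickly.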
The art of web scraping
- Workflow:
- Learn about the structure of the website
- Choose your strategy
- Build prototype code: extract, prepare, validate
- Generalize: functions, loops, debugging
- Data cleaning
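The workflow can be sketched end to end in a few lines of Python. Everything here is a hypothetical illustration: the page sources in `PAGES`, the tag layout they assume, and the function names. The prototype is split into extract, prepare, and validate steps, then generalized with a loop that records failures instead of stopping at the first bad page.

```python
import re

# Hypothetical page sources, keyed by a page identifier.
PAGES = {
    "page1": "<h1> Speech on Trade </h1><span class='date'>2019-03-04</span>",
    "page2": "<h1>Budget  Debate</h1><span class='date'>2019-03-11</span>",
    "page3": "<h1>Broken page</h1>",  # no date: should fail validation
}

# Patterns written after inspecting the (assumed) structure of the pages.
TITLE = re.compile(r"<h1>(.*?)</h1>", re.S)
DATE = re.compile(r"<span class='date'>(\d{4}-\d{2}-\d{2})</span>")


def extract(html):
    """Extract: pull the raw fields out of one page."""
    title = TITLE.search(html)
    date = DATE.search(html)
    return {
        "title": title.group(1) if title else None,
        "date": date.group(1) if date else None,
    }


def prepare(record):
    """Prepare: clean the raw fields (here, collapse stray whitespace)."""
    title = record["title"]
    if title is not None:
        title = " ".join(title.split())
    return {**record, "title": title}


def validate(record):
    """Validate: accept a record only if every field is present and non-empty."""
    return all(value for value in record.values())


def scrape_all(pages):
    """Generalize: apply the prototype to every page and log the failures."""
    rows, failures = [], []
    for page_id, html in pages.items():
        record = prepare(extract(html))
        record["id"] = page_id
        if validate(record):
            rows.append(record)
        else:
            failures.append(page_id)
    return rows, failures


rows, failures = scrape_all(PAGES)
```

Keeping the failed page identifiers in a separate list makes debugging easier: the pages that break the assumed structure can be inspected one by one and the patterns adjusted.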