Web scraping: Basics
- An increasing amount of data is available on websites:
- Speeches, sentences, biographical information…
- Social media data, newspaper articles, press releases…
- Geographic information, conflict data…
- Often the data come in unstructured form (i.e. no ready-made data frame with rows and columns)
- Web scraping = the process of extracting this information automatically and transforming it into a structured dataset (as in the sketch below)
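A minimal sketch of that transformation, assuming the rvest package and an illustrative Wikipedia URL (any page containing an HTML table would do):

```r
library(rvest)

url  <- "https://en.wikipedia.org/wiki/List_of_sovereign_states"  # illustrative page with a table
page <- read_html(url)      # download and parse the raw HTML source
tabs <- html_table(page)    # convert every <table> element into a data frame
str(tabs[[1]])              # the first table: unstructured HTML is now rows/columns
```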
Scraping data from websites: Why?
- Copy & paste is time-consuming, boring, prone to errors, and impractical for large datasets
- In contrast, automated web scraping:
- Scales well for large datasets
- Is reproducible
- Involves adaptable techniques
- Facilitates detecting and fixing errors
- When to scrape?
- Trade-off: more time now to build, less time later
- Computer time is cheap; human time is expensive
- Example of time saving: the legislatoR package vs. manual collection
Scraping the web: two approaches
- Two different approaches (both sketched below):
- Screen scraping: extract data from the source code of a website, using an HTML parser and/or regular expressions
- Web APIs (application programming interfaces): structured HTTP requests that return data in JSON or XML format
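A rough sketch of both approaches, assuming the rvest, httr, and jsonlite packages; the first URL is a placeholder and the GitHub endpoint is only an illustration of a JSON-returning API:

```r
library(rvest)     # screen scraping: parse HTML source code
library(httr)      # web APIs: send structured HTTP requests
library(jsonlite)  # parse JSON responses

## Approach 1: screen scraping -- pull text out of a page's source code
page  <- read_html("https://example.com")            # placeholder URL
heads <- html_text(html_elements(page, "h1"))        # CSS selector instead of raw regex

## Approach 2: web API -- a structured request that returns JSON
resp <- GET("https://api.github.com/users/hadley")   # example public API endpoint
user <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
user$public_repos
```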
The rules of the game
- Respect the hosting site’s wishes:
- Check if an API exists or if data are available for download
- Keep in mind where data comes from and give credit (and respect copyright if you want to republish the data!)
- Some websites disallow scrapers in their robots.txt file
- Limit your bandwidth use:
- Wait one or two seconds after each hit
- Scrape only what you need, and just once (e.g. store the HTML file on disk and then parse the local copy; see the sketch below)
- …otherwise you may get blocked, or a visit from the IT department! (e.g. when scraping many articles at once)
- When using APIs, read documentation
- Is there a batch download option?
- Are there any rate limits?
- Can you share the data?
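A minimal sketch of a "polite" scraping pattern that follows the rules above, assuming the robotstxt and rvest packages; the URL and filename are placeholders:

```r
library(robotstxt)  # check what the site's robots.txt allows
library(rvest)

url <- "https://example.com/articles/page1.html"   # placeholder URL

## Respect robots.txt: only proceed if scrapers are allowed on this path
if (paths_allowed(url)) {
  ## Scrape only once: store the raw HTML on disk ...
  download.file(url, destfile = "page1.html", quiet = TRUE)
  ## ... and parse the local copy as often as needed
  page <- read_html("page1.html")

  ## Limit bandwidth: wait a couple of seconds before the next request
  Sys.sleep(2)
}
```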
The art of web scraping
- Workflow:
- Learn about structure of website
- Choose your strategy
- Build prototype code: extract, prepare, validate
- Generalize: functions, loops, debugging (see the sketch below)
- Data cleaning
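A sketch of the prototype-then-generalize step, assuming rvest; the URLs and CSS selectors are hypothetical and would come from inspecting the target site's structure:

```r
library(rvest)

## Prototype: extract and prepare data from a single page
scrape_page <- function(url) {
  page <- read_html(url)
  data.frame(
    title = html_text(html_elements(page, "h2.title"), trim = TRUE),  # hypothetical selector
    date  = html_text(html_elements(page, "span.date"), trim = TRUE), # hypothetical selector
    stringsAsFactors = FALSE
  )
}

## Generalize: loop over many pages, waiting between hits and catching errors
urls    <- paste0("https://example.com/news?page=", 1:5)  # placeholder URLs
results <- lapply(urls, function(u) {
  Sys.sleep(2)                                            # be polite between hits
  tryCatch(scrape_page(u), error = function(e) NULL)      # debugging: skip failed pages
})
dataset <- do.call(rbind, results)                        # one data frame, ready for cleaning
```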