Chapter 4 Scraping the Web I: Basics

library(xml2)
library(tidyverse)
library(glue)

Figure 4.1: This could be you.

The next two lessons will focus on scraping the web, a seldom-taught but very in-demand skill. Web scraping is the process of extracting data from websites, often sites that don’t want their data to be downloaded.

This involves three steps:

  1. Acquisition: Getting lists of URLs to scrape. These are the links that you want to download.
  2. Scraping: Downloading the links and storing them in a systematic way.
  3. Parsing: Extracting data from the downloaded content, for use in text analysis.

4.1 Acquisition


Figure 4.2: To scrape a website, you need to understand a website.

Usually, the most difficult part of scraping a website is knowing where to find the data itself.

However, there are a few tricks that can help you find the data you want.

4.1.1 robots.txt

Sometimes, websites will actually want you to scrape their data, as this is how search engines like Google find their content. To guide crawlers, they publish a file called robots.txt that spells out which parts of the site may be crawled and which should be left alone, and it often points to a sitemap listing the URLs they do want found. This is a good starting point, as in the best-case scenario it gives you basic instructions on how to scrape the site.

Let’s look at some examples:

4.1.1.1 archive.org

Archive.org is a website that archives the internet, preserving it for future generations, and is an excellent resource.

If anyone wants to go into journalism, a ton of dirt can be dug up by looking up old versions of company websites on archive.org.

Let’s look at their robots.txt file by simply going to https://archive.org/robots.txt:

Sitemap: https://archive.org/sitemap/sitemap.xml

##############################################
#
# Welcome to the Archive!
#
##############################################
# Please crawl our files.
# We appreciate if you can crawl responsibly.
# Stay open!
##############################################


User-agent: *
Disallow: /control/
Disallow: /report/

This is a very open robots.txt file, as it tells us that we can crawl any page on the site, except for the /control/ and /report/ pages.
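You can read a robots.txt file from inside R, too; base R’s readLines() will fetch it straight from the URL. A minimal sketch:

# Fetch archive.org's robots.txt without leaving R
robots <- readLines("https://archive.org/robots.txt")
head(robots, 10)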

4.1.1.2 New York Times

The New York Times is a newspaper that has been around for a long time, and it has a lot of data that could be useful for text analysis. However, as we can see from its robots.txt file at https://www.nytimes.com/robots.txt, it has some restrictions:

User-agent: *
Disallow: /ads/
Disallow: /adx/bin/
Disallow: /puzzles/leaderboards/invite/*
Disallow: /svc
Allow: /svc/crosswords
Allow: /svc/games
Allow: /svc/letter-boxed
Allow: /svc/spelling-bee
Allow: /svc/vertex
Allow: /svc/wordle
Disallow: /video/embedded/*
Disallow: /search
Disallow: /multiproduct/
Disallow: /hd/
Disallow: /inyt/
Disallow: /*?*query=
Disallow: /*.pdf$
Disallow: /*?*login=
Disallow: /*?*searchResultPosition=
Disallow: /*?*campaignId=
Disallow: /*?*mcubz=
Disallow: /*?*smprod=
Disallow: /*?*ProfileID=
Disallow: /*?*ListingID=
Disallow: /wirecutter/wp-admin/
Disallow: /wirecutter/*.zip$
Disallow: /wirecutter/*.csv$
Disallow: /wirecutter/deals/beta
Disallow: /wirecutter/data-requests
Disallow: /wirecutter/search
Disallow: /wirecutter/*?s=
Disallow: /wirecutter/*&xid=
Disallow: /wirecutter/*?q=
Disallow: /wirecutter/*?l=
Disallow: /search
Disallow: /*?*smid=
Disallow: /*?*partner=
Disallow: /*?*utm_source=
Allow: /wirecutter/*?*utm_source=
Allow: /ads/public/
Allow: /svc/news/v3/all/pshb.rss

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-Agent: omgili
Disallow: /

User-Agent: omgilibot
Disallow: /

User-agent: Twitterbot
Allow: /*?*smid=

Sitemap: https://www.nytimes.com/sitemaps/new/news.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/sitemap.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/collections.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/video.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/cooking.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/recipe-collects.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/regions.xml
Sitemap: https://www.nytimes.com/sitemaps/new/best-sellers.xml
Sitemap: https://www.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz
Sitemap: https://www.nytimes.com/elections/2018/sitemap
Sitemap: https://www.nytimes.com/wirecutter/sitemapindex.xml

Anything that follows Allow: may be crawled, and anything that follows Disallow: may not. This doesn’t mean that you can’t scrape those pages, but it does mean you should be careful: the site is telling you it doesn’t want you to.
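If you’d rather not eyeball these rules yourself, the robotstxt package (not one of the libraries loaded at the top of this chapter, so treat this as an optional extra) can check specific paths for you with paths_allowed(). A sketch:

# install.packages("robotstxt")  # optional helper package
library(robotstxt)

# Is a generic crawler allowed to fetch these NYT paths?
paths_allowed(
  paths  = c("/section/world", "/search"),
  domain = "www.nytimes.com"
)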

4.1.2 Sitemaps

If a website wants its content scraped, it will often provide a sitemap, which is a list of all the URLs on the site. If one exists, it’s usually the best place to start: it hands you the full list of URLs to scrape, and the acquisition step is done.

Let’s look at some examples:

4.1.2.1 The Associated Press

When we go to https://apnews.com/robots.txt, we see that they have a sitemap:

User-Agent: *
Disallow:
Disallow: *_ptid=*
Disallow: *?prx_t=*
Disallow: /press-release/*
Sitemap: https://apnews.com/ap-sitemap.xml
Sitemap: https://apnews.com/news-sitemap-content.xml
Sitemap: https://apnews.com/video-sitemap.xml

We see at the bottom that they have a sitemap at https://apnews.com/ap-sitemap.xml, which we can go take a look at.

This is a very long file. It’s in a (very weird) format called XML, but we can clearly see that it’s a list of the URLs on the site.
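If you’d rather peek at it from R than in the browser, xml2 can read the sitemap straight from the URL. A small sketch (the local-name() XPath trick sidesteps the XML namespaces; later in the chapter we’ll avoid XPath entirely and unnest instead):

# Peek at the first few URLs listed in the AP sitemap
sitemap <- read_xml("https://apnews.com/ap-sitemap.xml")

sitemap |>
  xml_find_all(".//*[local-name()='loc']") |>
  xml_text() |>
  head()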

4.1.2.2 la Repubblica

When we go to https://www.repubblica.it/robots.txt, we see that they have several sitemaps:

...
...
...
Sitemap:        https://www.repubblica.it/sitemap-n.xml 
# 
Sitemap:        https://video.repubblica.it/sitemap-v-day.xml 
Sitemap:        https://video.repubblica.it/sitemap-v.xml 
# 
Sitemap:        https://www.repubblica.it/sitemap-moda-e-beauty-n.xml 
Sitemap:        https://www.repubblica.it/sitemap-italiantech-n.xml 
Sitemap:        https://www.repubblica.it/sitemap-il-gusto-n.xml 
Sitemap:        https://www.repubblica.it/sitemap-salute-n.xml 
Sitemap:        https://www.repubblica.it/sitemap-green-and-blue-n.xml 
...
...
...

We can click around these and see that each one corresponds to a different section of the website, and we can use them to scrape the site.
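When a robots.txt lists this many sitemaps, you can also collect them programmatically rather than copying them by hand. A sketch using readLines() and stringr (loaded as part of the tidyverse):

# Pull every Sitemap: line out of la Repubblica's robots.txt
robots <- readLines("https://www.repubblica.it/robots.txt")

robots |>
  str_subset("^Sitemap:") |>
  str_remove("^Sitemap:\\s*") |>
  str_trim()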

4.1.2.3 Xinhua

Xinhua News Agency is the official Chinese news agency, and is one of the harder ones to scrape. When we go to https://www.xinhuanet.com/robots.txt, we can see that they allow all sorts of scraping, but don’t have a sitemap. We’ll cover some tips for dealing with this later.

# robots.txt for http://www.xinhuanet.com/

User-Agent: *
Allow: /

4.2 Classwork: Reading a sitemap

In groups of 3, pick some news websites and find their robots.txt files and sitemaps.

When you find a promising one, read the sitemap and answer the following:

  1. Is it possible to scrape the entire site from the sitemap?
  2. How far back in the past could you scrape if you wanted to?
  3. Are some parts of the website not allowed to be scraped?

4.3 Downloading and reading XML


Figure 4.3: This can be hard to get your brain around.

Downloading your file is pretty easy: just use the same download.file() function we’ve been using, and be sure to save it with an .xml extension.

download.file("https://apnews.com/news-sitemap-content.xml", "ap_sitemap.xml")
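It’s worth a quick sanity check that the file actually landed on disk and isn’t empty:

# Confirm the sitemap was saved and has some content in it
file.exists("ap_sitemap.xml")
file.size("ap_sitemap.xml")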

Now we have to learn one last data format, left over from last class: XML.

Assuming you’ve loaded xml2 with library(xml2), you can read the file in with read_xml().

urls <- read_xml("ap_sitemap.xml")
urls
## {xml_document}
## <urlset schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
##  [1] <url>\n  <loc>https://apnews.com/article/texans-falcons-d2cc274bd193e8f2 ...
##  [2] <url>\n  <loc>https://apnews.com/article/f1-qatar-grand-prix-verstappen- ...
##  [3] <url>\n  <loc>https://apnews.com/article/inter-bologna-milan-juventus-se ...
##  [4] <url>\n  <loc>https://apnews.com/world-news/general-news-1d1e7ef745f2f1e ...
##  [5] <url>\n  <loc>https://apnews.com/article/citadel-furman-f4f00b18df716396 ...
##  [6] <url>\n  <loc>https://apnews.com/world-news/general-news-992058df4545396 ...
##  [7] <url>\n  <loc>https://apnews.com/article/bengals-cardinals-28a9e6bdf3087 ...
##  [8] <url>\n  <loc>https://apnews.com/sports/deportes-57febdd23ba16f433773977 ...
##  [9] <url>\n  <loc>https://apnews.com/article/villanova-north-carolina-at-001 ...
## [10] <url>\n  <loc>https://apnews.com/article/ut-martin-eastern-illinois-c213 ...
## [11] <url>\n  <loc>https://apnews.com/article/timbers-montreal-9ee51f8cf80240 ...
## [12] <url>\n  <loc>https://apnews.com/article/sacred-heart-long-island-univer ...
## [13] <url>\n  <loc>https://apnews.com/sports/golf-formula-one-racing-tennis-s ...
## [14] <url>\n  <loc>https://apnews.com/article/lafc-austin-8c8185a68d6d5dde1e3 ...
## [15] <url>\n  <loc>https://apnews.com/article/sankey-red-river-big-12-state-f ...
## [16] <url>\n  <loc>https://apnews.com/article/tcu-iowa-state-big-12-football- ...
## [17] <url>\n  <loc>https://apnews.com/article/f1-qatar-max-verstappen-9ed49e3 ...
## [18] <url>\n  <loc>https://apnews.com/article/california-newsom-laws-signing- ...
## [19] <url>\n  <loc>https://apnews.com/sports/baseball-houston-astros-yordan-a ...
## [20] <url>\n  <loc>https://apnews.com/article/israel-uefa-european-championsh ...
## ...

4.4 Unnest


Figure 4.4: This could be you.

However, the format isn’t particularly useful.

This is because XML is nested data; one thing is inside another thing. What we want is tabular data, where everything is in a grid.

Nested data might look like a report outline:

  1. Introduction
  2. Literature Review
    1. Old dead guy
    2. Respected academic
    3. Why they’re wrong
  3. Research Questions
    1. RQ1
    2. RQ2
    3. RQ3

Tabular data is what we’ve been working with, with columns and rows:

names age
Curly 45
Larry 20
Moe 35
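In R terms, the outline is what a nested list() looks like, and the table is a tibble(). A tiny sketch of the same examples (the object names report and stooges are just made up for illustration):

# Nested: things inside other things (like the outline above)
report <- list(
  introduction = "Introduction",
  literature_review = list("Old dead guy", "Respected academic", "Why they're wrong")
)

# Tabular: columns and rows (like the table above)
stooges <- tibble(
  names = c("Curly", "Larry", "Moe"),
  age   = c(45, 20, 35)
)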

To get from nested to tabular, we convert the XML to a list, then to a tibble, which is a type of data frame.

urls <- read_xml("ap_sitemap.xml") |>
  as_list() |>
  as_tibble()
urls
## # A tibble: 507 × 1
##    urlset          
##    <named list>    
##  1 <named list [2]>
##  2 <named list [2]>
##  3 <named list [2]>
##  4 <named list [2]>
##  5 <named list [2]>
##  6 <named list [2]>
##  7 <named list [2]>
##  8 <named list [2]>
##  9 <named list [2]>
## 10 <named list [2]>
## # ℹ 497 more rows

Still not useful! We now have a tibble with 507 rows and one column, and each cell still contains nested data.

We could take a look at one of the rows by selecting it directly:

urls$urlset[1]
## $url
## $url$loc
## $url$loc[[1]]
## [1] "https://apnews.com/article/texans-falcons-d2cc274bd193e8f248883c6ff51d1cfa"
## 
## 
## $url$news
## $url$news$publication
## $url$news$publication$name
## $url$news$publication$name[[1]]
## [1] "Associated Press"
## 
## 
## $url$news$publication$language
## $url$news$publication$language[[1]]
## [1] "eng"
## 
## 
## 
## $url$news$publication_date
## $url$news$publication_date[[1]]
## [1] "2023-10-08T16:20:26-04:00"
## 
## 
## $url$news$title
## $url$news$title[[1]]
## [1] "Desmond Ridder answers critics, Younghoe Koo kicks last-second field goal, Falcons edge Texans 21-19"

Examining it, we can see that the nested data looks like this:

  • url
    • loc
    • news
      • publication
        • name
        • language
      • publication_date
      • title

To fix this problem, we can use a tidyr function called unnest_wider(), which splits nested data into separate columns. We pass it the name of the column we want to unnest:

read_xml("ap_sitemap.xml") |>
  as_list() |>
  as_tibble() |>
  unnest_wider(urlset)
## # A tibble: 507 × 2
##    loc        news            
##    <list>     <list>          
##  1 <list [1]> <named list [3]>
##  2 <list [1]> <named list [3]>
##  3 <list [1]> <named list [3]>
##  4 <list [1]> <named list [3]>
##  5 <list [1]> <named list [3]>
##  6 <list [1]> <named list [3]>
##  7 <list [1]> <named list [3]>
##  8 <list [1]> <named list [3]>
##  9 <list [1]> <named list [3]>
## 10 <list [1]> <named list [3]>
## # ℹ 497 more rows

This is better, as now loc and news are separate columns. However, each one is still a list. The nested structure of each row should now look like this:

  • loc
  • news
    • publication
      • name
      • language
    • publication_date
    • title

Let’s add another unnest_wider() to turn the loc column from a list into the actual value. Because the elements inside it don’t have names, we also have to supply names_sep = "" so that unnest_wider() can build a column name (here, loc1). Now we should have the actual data.

read_xml("ap_sitemap.xml") |>
  as_list() |>
  as_tibble() |>
  unnest_wider(urlset) |>
  unnest_wider(loc, names_sep = "")

Let’s continue this process with the “news” column, for completeness’ sake:

read_xml("ap_sitemap.xml") |>
  as_list() |>
  as_tibble() |>
  unnest_wider(urlset) |>
  unnest_wider(loc, names_sep = "") |>
  unnest_wider(news)
## # A tibble: 507 × 4
##    loc1                                     publication  publication_date title 
##    <chr>                                    <list>       <list>           <list>
##  1 https://apnews.com/article/texans-falco… <named list> <list [1]>       <list>
##  2 https://apnews.com/article/f1-qatar-gra… <named list> <list [1]>       <list>
##  3 https://apnews.com/article/inter-bologn… <named list> <list [1]>       <list>
##  4 https://apnews.com/world-news/general-n… <named list> <list [1]>       <list>
##  5 https://apnews.com/article/citadel-furm… <named list> <list [1]>       <list>
##  6 https://apnews.com/world-news/general-n… <named list> <list [1]>       <list>
##  7 https://apnews.com/article/bengals-card… <named list> <list [1]>       <list>
##  8 https://apnews.com/sports/deportes-57fe… <named list> <list [1]>       <list>
##  9 https://apnews.com/article/villanova-no… <named list> <list [1]>       <list>
## 10 https://apnews.com/article/ut-martin-ea… <named list> <list [1]>       <list>
## # ℹ 497 more rows
read_xml("ap_sitemap.xml") |>
  as_list() |>
  as_tibble() |>
  unnest_wider(urlset) |>
  unnest_wider(loc, names_sep = "") |>
  unnest_wider(news) |>
  unnest_wider(publication) |>
  unnest_wider(name, names_sep = "")
## # A tibble: 507 × 5
##    loc1                                   name1 language publication_date title 
##    <chr>                                  <chr> <list>   <list>           <list>
##  1 https://apnews.com/article/texans-fal… Asso… <list>   <list [1]>       <list>
##  2 https://apnews.com/article/f1-qatar-g… Asso… <list>   <list [1]>       <list>
##  3 https://apnews.com/article/inter-bolo… Asso… <list>   <list [1]>       <list>
##  4 https://apnews.com/world-news/general… Asso… <list>   <list [1]>       <list>
##  5 https://apnews.com/article/citadel-fu… Asso… <list>   <list [1]>       <list>
##  6 https://apnews.com/world-news/general… Asso… <list>   <list [1]>       <list>
##  7 https://apnews.com/article/bengals-ca… Asso… <list>   <list [1]>       <list>
##  8 https://apnews.com/sports/deportes-57… Asso… <list>   <list [1]>       <list>
##  9 https://apnews.com/article/villanova-… Asso… <list>   <list [1]>       <list>
## 10 https://apnews.com/article/ut-martin-… Asso… <list>   <list [1]>       <list>
## # ℹ 497 more rows

Keep doing this until you have all the data that you need.

read_xml("ap_sitemap.xml") |>
  as_list() |>
  as_tibble() |>
  unnest_wider(urlset) |>
  unnest_wider(loc, names_sep = "") |>
  unnest_wider(news) |>
  unnest_wider(publication) |>
  unnest_wider(name, names_sep = "") |>
  unnest_wider(language, names_sep = "") |>
  unnest_wider(publication_date, names_sep = "") |>
  unnest_wider(title, names_sep = "")
## # A tibble: 507 × 5
##    loc1                                 name1 language1 publication_date1 title1
##    <chr>                                <chr> <chr>     <chr>             <chr> 
##  1 https://apnews.com/article/texans-f… Asso… eng       2023-10-08T16:20… Desmo…
##  2 https://apnews.com/article/f1-qatar… Asso… eng       2023-10-08T09:58… No re…
##  3 https://apnews.com/article/inter-bo… Asso… eng       2023-10-07T11:06… Pulis…
##  4 https://apnews.com/world-news/gener… Asso… spa       2023-10-07T20:00… Tras …
##  5 https://apnews.com/article/citadel-… Asso… eng       2023-10-07T17:44… Huff …
##  6 https://apnews.com/world-news/gener… Asso… spa       2023-10-07T11:44… Ataqu…
##  7 https://apnews.com/article/bengals-… Asso… eng       2023-10-08T20:44… Cardi…
##  8 https://apnews.com/sports/deportes-… Asso… spa       2023-10-07T12:35… Pulis…
##  9 https://apnews.com/article/villanov… Asso… eng       2023-10-07T20:11… Watki…
## 10 https://apnews.com/article/ut-marti… Asso… eng       2023-10-07T20:00… Dent …
## # ℹ 497 more rows

There you go! You now have a list of articles and metadata. When you’re happy with the way it looks, assign it to a variable to save it.

url_list <- read_xml("ap_sitemap.xml") |> # This is the only line I changed.
  as_list() |>
  as_tibble() |>
  unnest_wider(urlset) |>
  unnest_wider(loc, names_sep = "") |>
  unnest_wider(news) |>
  unnest_wider(publication) |>
  unnest_wider(name, names_sep = "") |>
  unnest_wider(language, names_sep = "") |>
  unnest_wider(publication_date, names_sep = "") |>
  unnest_wider(title, names_sep = "")
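For reference, there is also a more compact route that skips the unnesting entirely and pulls fields out with XPath. This is just a sketch: the local-name() trick ignores XML namespaces, and it assumes every entry has exactly one <loc> and one <title>.

# Alternative: extract fields directly with XPath instead of unnesting
doc <- read_xml("ap_sitemap.xml")

tibble(
  loc   = doc |> xml_find_all(".//*[local-name()='loc']") |> xml_text(),
  title = doc |> xml_find_all(".//*[local-name()='title']") |> xml_text()
)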

4.5 Classwork: Processing XML

With the website you found earlier, download the XML file and get a data frame with a list of URLs. If you still have time, un-nest the other metadata as well.

4.6 Scraping

Now that we have our list of URLs, we need to actually get the content behind each link. For this, we’re going to go back to the basics, with our old friend the for loop.

As a refresher, remember that given a vector, a for loop will do something for each element in the vector.

Using our list of URLs, we could print every URL. That might scroll by too fast to read, so we can use Sys.sleep() to pause briefly between each one.

for (url in url_list$loc1) {
  print(url)
  Sys.sleep(0.1)
}
[1] "https://apnews.com/article/texans-falcons-d2cc274bd193e8f248883c6ff51d1cfa"
[1] "https://apnews.com/article/f1-qatar-grand-prix-verstappen-tires-d65ed543a90589a580a7357023b80fa7"
[1] "https://apnews.com/article/inter-bologna-milan-juventus-serie-a-d8569cc950036788dea4cbc5444767c0"
[1] "https://apnews.com/world-news/general-news-1d1e7ef745f2f1e773e8be03d23f070c"
[1] "https://apnews.com/article/citadel-furman-f4f00b18df71639616f1ba34b54a0ed8"
[1] "https://apnews.com/world-news/general-news-992058df45453966961b4e54b2073d0c"
[1] "https://apnews.com/article/bengals-cardinals-28a9e6bdf3087e35f8a9495f4973822e"
[1] "https://apnews.com/sports/deportes-57febdd23ba16f433773977af234e2b0"
[1] "https://apnews.com/article/villanova-north-carolina-at-0018092487249a74496f88c494efc404"
[1] "https://apnews.com/article/ut-martin-eastern-illinois-c213f1751d186412db91fa40b65951fe"
[1] "https://apnews.com/article/timbers-montreal-9ee51f8cf802407a10728c6cffb4c5d3"
...
...
...

However, instead of printing it, we want to use download.file() to save it somewhere. We’ll want to:

  1. Make a new folder for our HTML files.
  2. In our data frame, give each row a file name to save its page under.
  3. Loop through the data frame and download all the links.

4.6.1 Making a folder

You can make a folder right in R with dir.create(). Just pass it the folder name.

dir.create("ap_files")
## Warning in dir.create("ap_files"): 'ap_files' already exists
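If you want to avoid that warning when re-running your script, you can wrap the call in a check with dir.exists():

# Only create the folder if it doesn't exist yet
if (!dir.exists("ap_files")) {
  dir.create("ap_files")
}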

4.6.2 Adding file names

We now need to assign a file name to each of the links we want to download. You can do this however you want, but the easiest way is to take the row number in the data frame with row_number() and paste it onto a file extension.

url_list <- url_list |>
  mutate(file_name = glue("{row_number()}.html"))
url_list
## # A tibble: 507 × 6
##    loc1                       name1 language1 publication_date1 title1 file_name
##    <chr>                      <chr> <chr>     <chr>             <chr>  <glue>   
##  1 https://apnews.com/articl… Asso… eng       2023-10-08T16:20… Desmo… 1.html   
##  2 https://apnews.com/articl… Asso… eng       2023-10-08T09:58… No re… 2.html   
##  3 https://apnews.com/articl… Asso… eng       2023-10-07T11:06… Pulis… 3.html   
##  4 https://apnews.com/world-… Asso… spa       2023-10-07T20:00… Tras … 4.html   
##  5 https://apnews.com/articl… Asso… eng       2023-10-07T17:44… Huff … 5.html   
##  6 https://apnews.com/world-… Asso… spa       2023-10-07T11:44… Ataqu… 6.html   
##  7 https://apnews.com/articl… Asso… eng       2023-10-08T20:44… Cardi… 7.html   
##  8 https://apnews.com/sports… Asso… spa       2023-10-07T12:35… Pulis… 8.html   
##  9 https://apnews.com/articl… Asso… eng       2023-10-07T20:11… Watki… 9.html   
## 10 https://apnews.com/articl… Asso… eng       2023-10-07T20:00… Dent … 10.html  
## # ℹ 497 more rows
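One alternative, sketched below, is to build the file name from the URL itself, which keeps the files human-readable (basename() just grabs everything after the last slash; this isn’t the scheme we use in the rest of the chapter):

# Alternative: name each file after the last chunk of its URL
url_list |>
  mutate(file_name = glue("{basename(loc1)}.html"))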

4.6.2.1 Discussion: What might be some better ways to do this?

4.6.3 Saving the file

Whatever you do, be sure to save the file!

url_list |>
  write_csv("ap_url_list.csv")

4.6.4 Downloading our files

Our last step is to download the files. We can do this with a for loop, so let’s build it from the inside out. To start, we need to work with one row of our data frame at a time. We can pull out a single row using the [square brackets].

row <- url_list[1, ]
row
## # A tibble: 1 × 6
##   loc1                        name1 language1 publication_date1 title1 file_name
##   <chr>                       <chr> <chr>     <chr>             <chr>  <glue>   
## 1 https://apnews.com/article… Asso… eng       2023-10-08T16:20… Desmo… 1.html

Within the row, we can select the different columns using the [dollar$sign].

output_location <- row$file_name
output_location
## 1.html

We want to put it in the folder we made earlier, so we can use glue() to build the full path:

output_location <- glue("ap_files/{row$file_name}")
output_location
## ap_files/1.html

Finally, we can select the URL and download it using download.file():

download.file(row$loc1, output_location)

If this works, let’s wrap it up in a for loop and test it on the first 5 rows.

I also added two things:

  1. Sys.sleep(2) adds a 2-second pause between requests.
  2. A nice little message printed after each request.
for (i in 1:5) {
  row <- url_list[i, ]
  output_location <- glue("ap_files/{row$file_name}")
  download.file(row$loc1, output_location)
  Sys.sleep(2)
  print(glue("{i} files have been downloaded"))
}
trying URL 'https://apnews.com/article/texans-falcons-d2cc274bd193e8f248883c6ff51d1cfa'
downloaded 573 KB

1 files have been downloaded
trying URL 'https://apnews.com/article/f1-qatar-grand-prix-verstappen-tires-d65ed543a90589a580a7357023b80fa7'
downloaded 398 KB

2 files have been downloaded
trying URL 'https://apnews.com/article/inter-bologna-milan-juventus-serie-a-d8569cc950036788dea4cbc5444767c0'
downloaded 466 KB

3 files have been downloaded
trying URL 'https://apnews.com/world-news/general-news-1d1e7ef745f2f1e773e8be03d23f070c'
downloaded 194 KB

4 files have been downloaded
trying URL 'https://apnews.com/article/citadel-furman-f4f00b18df71639616f1ba34b54a0ed8'
downloaded 162 KB

Now, we want to make sure we don’t download the same thing twice, as repeated requests could get us banned from the website. Let’s add an if/else statement to make sure we didn’t already download the file. First, we need a list of the files already in our folder, using list.files().

already_downloaded <- list.files("ap_files")
already_downloaded
[1] "1.html" "2.html" "3.html" "4.html" "5.html"

We can now put our main code inside an if/else statement and test it with the first 10 rows:

for (i in 1:10) {
  row <- url_list[i, ]
  if (row$file_name %in% already_downloaded) {
    print(glue("Skipping {row$file_name}"))
  } else {
    output_location <- glue("ap_files/{row$file_name}")
    download.file(row$loc1, output_location)
    Sys.sleep(2)
    print(glue("{i} files have been downloaded"))
  }
}
Skipping 1.html
Skipping 2.html
Skipping 3.html
Skipping 4.html
Skipping 5.html
trying URL 'https://apnews.com/world-news/general-news-992058df45453966961b4e54b2073d0c'
downloaded 160 KB

6 files have been downloaded
trying URL 'https://apnews.com/article/bengals-cardinals-28a9e6bdf3087e35f8a9495f4973822e'
downloaded 268 KB

7 files have been downloaded
trying URL 'https://apnews.com/sports/deportes-57febdd23ba16f433773977af234e2b0'
downloaded 225 KB

8 files have been downloaded
trying URL 'https://apnews.com/article/villanova-north-carolina-at-0018092487249a74496f88c494efc404'
downloaded 162 KB

9 files have been downloaded
trying URL 'https://apnews.com/article/ut-martin-eastern-illinois-c213f1751d186412db91fa40b65951fe'
downloaded 162 KB

10 files have been downloaded

It should now skip the first 5 rows, because they’re in your download folder.

Congratulations! You’ve just built a web scraper.

4.7 Not Getting your IP banned

Web scraping is very much in a legal grey area, and the rules vary from place to place. Please consult lawyers if you’re going to do anything legally questionable.

However, here are some basic rules:

  1. Scrape slowly. When you’re downloading web pages, make sure to only do a page every few seconds. Sys.sleep() is your go-to for this.

  2. Pretend to be human. Download pages in a random order, add a slightly random amount of time between downloads, don’t do everything in one batch. For example, you could generate a random number with runif(1), then sleep with Sys.sleep(runif(1) + 2).

  3. Only scrape once. Write your scrapers so that you’re only downloading the data one time, even if the scraper crashes in the middle. Your if/else statement will largely solve this issue (a sketch combining rules 1-3 appears after this list).

  4. Don’t publish your data. At most, provide a list of URLs that you scraped. This keeps you mostly in the clear in terms of copyright.
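Putting the first three rules together, here’s a minimal sketch of a more defensive version of the download loop from earlier. The tryCatch() wrapper is the only new piece: it logs a failed download instead of stopping the whole run.

# A sketch combining rules 1-3: slow, slightly random, skip-if-done, and
# wrapped in tryCatch() so one bad URL doesn't kill the whole run
already_downloaded <- list.files("ap_files")

for (i in sample(nrow(url_list))) {           # random order (rule 2)
  row <- url_list[i, ]
  if (row$file_name %in% already_downloaded) {
    next                                      # only scrape once (rule 3)
  }
  output_location <- glue("ap_files/{row$file_name}")
  tryCatch(
    download.file(row$loc1, output_location),
    error = function(e) print(glue("Failed on {row$loc1}"))
  )
  Sys.sleep(runif(1) + 2)                     # slow and slightly random (rules 1-2)
}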


Figure 4.5: The main idea is to not get caught.