Chapter 4 Scraping the Web I: Basics
The next two lessons will focus on scraping the web, a seldom-taught but very in-demand skill. Web scraping is the process of extracting data from websites, often sites that don’t want their data to be downloaded.
This involves three steps:
- Acquisition: Getting lists of URLs to scrape. These are the links that you want to download.
- Scraping: Downloading the links and storing them in a systematic way.
- Parsing: Getting data from the downloaded content, for use in text analysis.
4.1 Acquisition
Usually, the most difficult part of scraping a website is knowing where to find the data itself.
However, there are a few tricks that can help.
4.1.1 robots.txt
Sometimes, websites will actually want you to scrape their data, as this is how search engines like Google find their content.
To communicate this, they create a file called robots.txt that spells out which parts of the site crawlers are welcome to visit and which parts are off-limits.
This is often a good starting point, as in the best-case scenario, it will give you basic instructions on how to crawl the site.
Let’s look at some examples:
4.1.1.1 archive.org
Archive.org is a website that archives the internet, preserving it for future generations, and it is an excellent resource.
If you want to go into journalism, a ton of dirt can be dug up by looking at old versions of company websites on archive.org.
Let's look at their robots.txt file by simply going to https://archive.org/robots.txt:
Sitemap: https://archive.org/sitemap/sitemap.xml
##############################################
#
# Welcome to the Archive!
#
##############################################
# Please crawl our files.
# We appreciate if you can crawl responsibly.
# Stay open!
##############################################
User-agent: *
Disallow: /control/
Disallow: /report/
This is a very open robots.txt file: it tells us that we can crawl any page on the site except the /control/ and /report/ pages.
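If you'd rather check a robots.txt file without leaving R, here is a quick sketch using base R's readLines(); any robots.txt URL works in place of the archive.org one:
# Fetch and print archive.org's robots.txt without leaving R
robots <- readLines("https://archive.org/robots.txt")
cat(robots, sep = "\n")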
4.1.1.2 New York Times
The New York Times is a newspaper that has been around for a long time and has a lot of data that could be useful for text analysis. However, as we can see from its robots.txt file at https://www.nytimes.com/robots.txt, it has some restrictions:
User-agent: *
Disallow: /ads/
Disallow: /adx/bin/
Disallow: /puzzles/leaderboards/invite/*
Disallow: /svc
Allow: /svc/crosswords
Allow: /svc/games
Allow: /svc/letter-boxed
Allow: /svc/spelling-bee
Allow: /svc/vertex
Allow: /svc/wordle
Disallow: /video/embedded/*
Disallow: /search
Disallow: /multiproduct/
Disallow: /hd/
Disallow: /inyt/
Disallow: /*?*query=
Disallow: /*.pdf$
Disallow: /*?*login=
Disallow: /*?*searchResultPosition=
Disallow: /*?*campaignId=
Disallow: /*?*mcubz=
Disallow: /*?*smprod=
Disallow: /*?*ProfileID=
Disallow: /*?*ListingID=
Disallow: /wirecutter/wp-admin/
Disallow: /wirecutter/*.zip$
Disallow: /wirecutter/*.csv$
Disallow: /wirecutter/deals/beta
Disallow: /wirecutter/data-requests
Disallow: /wirecutter/search
Disallow: /wirecutter/*?s=
Disallow: /wirecutter/*&xid=
Disallow: /wirecutter/*?q=
Disallow: /wirecutter/*?l=
Disallow: /search
Disallow: /*?*smid=
Disallow: /*?*partner=
Disallow: /*?*utm_source=
Allow: /wirecutter/*?*utm_source=
Allow: /ads/public/
Allow: /svc/news/v3/all/pshb.rss
User-agent: CCBot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: ia_archiver
Disallow: /
User-Agent: omgili
Disallow: /
User-Agent: omgilibot
Disallow: /
User-agent: Twitterbot
Allow: /*?*smid=
Sitemap: https://www.nytimes.com/sitemaps/new/news.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/sitemap.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/collections.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/video.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/cooking.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/recipe-collects.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/regions.xml
Sitemap: https://www.nytimes.com/sitemaps/new/best-sellers.xml
Sitemap: https://www.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz
Sitemap: https://www.nytimes.com/elections/2018/sitemap
Sitemap: https://www.nytimes.com/wirecutter/sitemapindex.xml
Anything that follows Allow: is allowed, and anything that follows Disallow: is not. This doesn't mean that you can't scrape the site, but it does mean that you should be careful, as they don't want you to.
4.1.2 Sitemaps
If a website wants its content to be crawled, it will often provide a sitemap, which is a list of all the URLs on the site. If one exists, it's usually the best place to start: it gives you the full list of URLs to scrape, and the acquisition step is done.
Let’s look at some examples:
4.1.2.1 The Associated Press
When we go to https://apnews.com/robots.txt, we see that they have a sitemap:
User-Agent: *
Disallow:
Disallow: *_ptid=*
Disallow: *?prx_t=*
Disallow: /press-release/*
Sitemap: https://apnews.com/ap-sitemap.xml
Sitemap: https://apnews.com/news-sitemap-content.xml
Sitemap: https://apnews.com/video-sitemap.xml
We see at the bottom that they have a sitemap at https://apnews.com/ap-sitemap.xml, which we can go take a look at.
It's a very long file in a (very weird) format called XML, but we can clearly see that it contains a list of all the URLs on the site.
4.1.2.2 la Repubblica
When we go to https://www.repubblica.it/robots.txt, we see that they have several sitemaps:
...
...
...
Sitemap: https://www.repubblica.it/sitemap-n.xml
#
Sitemap: https://video.repubblica.it/sitemap-v-day.xml
Sitemap: https://video.repubblica.it/sitemap-v.xml
#
Sitemap: https://www.repubblica.it/sitemap-moda-e-beauty-n.xml
Sitemap: https://www.repubblica.it/sitemap-italiantech-n.xml
Sitemap: https://www.repubblica.it/sitemap-il-gusto-n.xml
Sitemap: https://www.repubblica.it/sitemap-salute-n.xml
Sitemap: https://www.repubblica.it/sitemap-green-and-blue-n.xml
...
...
...
We can click around on these and see that each one corresponds to a different section of the website, and we can use them to scrape the site.
4.1.2.3 Xinhua
Xinhua News Agency is the official state news agency of China, and it is one of the harder outlets to scrape. When we go to https://www.xinhuanet.com/robots.txt, we can see that they allow crawling everywhere, but they don't provide a sitemap. We'll cover some tips for dealing with this later.
# robots.txt for http://www.xinhuanet.com/
User-Agent: *
Allow: /
4.2 Classwork: Reading a sitemap
In groups of 3, pick a few news websites and find their robots.txt files and sitemaps.
When you find a promising one, read the sitemap and answer the following:
- Is it possible to scrape the entire site from the sitemap?
- How far back in the past could you scrape if you wanted to?
- Are some parts of the website not allowed to be scraped?
4.3 Downloading and reading XML
Downloading your file is pretty easy: just use the same download.file() function we've been using, and be sure to save it with an .xml extension.
Now we have to learn one last data format, left over from last class: XML. Assuming you've loaded the xml2 package with library(xml2), you can read the file in with read_xml().
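A minimal sketch of those two steps, using the AP sitemap URL from earlier and saving it as ap_sitemap.xml (the file name used in the rest of this chapter):
library(xml2)

# Download the sitemap and save it with an .xml extension
download.file("https://apnews.com/ap-sitemap.xml", "ap_sitemap.xml")

# Read the XML file into R
read_xml("ap_sitemap.xml")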
## {xml_document}
## <urlset schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
## [1] <url>\n <loc>https://apnews.com/article/texans-falcons-d2cc274bd193e8f2 ...
## [2] <url>\n <loc>https://apnews.com/article/f1-qatar-grand-prix-verstappen- ...
## [3] <url>\n <loc>https://apnews.com/article/inter-bologna-milan-juventus-se ...
## [4] <url>\n <loc>https://apnews.com/world-news/general-news-1d1e7ef745f2f1e ...
## [5] <url>\n <loc>https://apnews.com/article/citadel-furman-f4f00b18df716396 ...
## [6] <url>\n <loc>https://apnews.com/world-news/general-news-992058df4545396 ...
## [7] <url>\n <loc>https://apnews.com/article/bengals-cardinals-28a9e6bdf3087 ...
## [8] <url>\n <loc>https://apnews.com/sports/deportes-57febdd23ba16f433773977 ...
## [9] <url>\n <loc>https://apnews.com/article/villanova-north-carolina-at-001 ...
## [10] <url>\n <loc>https://apnews.com/article/ut-martin-eastern-illinois-c213 ...
## [11] <url>\n <loc>https://apnews.com/article/timbers-montreal-9ee51f8cf80240 ...
## [12] <url>\n <loc>https://apnews.com/article/sacred-heart-long-island-univer ...
## [13] <url>\n <loc>https://apnews.com/sports/golf-formula-one-racing-tennis-s ...
## [14] <url>\n <loc>https://apnews.com/article/lafc-austin-8c8185a68d6d5dde1e3 ...
## [15] <url>\n <loc>https://apnews.com/article/sankey-red-river-big-12-state-f ...
## [16] <url>\n <loc>https://apnews.com/article/tcu-iowa-state-big-12-football- ...
## [17] <url>\n <loc>https://apnews.com/article/f1-qatar-max-verstappen-9ed49e3 ...
## [18] <url>\n <loc>https://apnews.com/article/california-newsom-laws-signing- ...
## [19] <url>\n <loc>https://apnews.com/sports/baseball-houston-astros-yordan-a ...
## [20] <url>\n <loc>https://apnews.com/article/israel-uefa-european-championsh ...
## ...
4.4 Unnest
However, the format isn’t particularly useful.
This is because XML is nested data; one thing is inside another thing. What we want is tabular data, where everything is in a grid.
Nested data might look like a report outline:
- Introduction
- Literature Review
  - Old dead guy
  - Respected academic
  - Why they're wrong
- Research Questions
  - RQ1
  - RQ2
  - RQ3
Tabular data is what we've been working with, with columns and rows:

names | age
------|----
Curly | 45
Larry | 20
Moe   | 35
To do this, we need to convert it to a list, then to a tibble, which is a type of data frame.
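The pipeline looks like this (as_list() comes from xml2; as_tibble() assumes tibble, or the wider tidyverse, is loaded):
read_xml("ap_sitemap.xml") |>
  as_list() |>   # XML document -> nested list
  as_tibble()    # nested list -> tibble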
## # A tibble: 507 × 1
## urlset
## <named list>
## 1 <named list [2]>
## 2 <named list [2]>
## 3 <named list [2]>
## 4 <named list [2]>
## 5 <named list [2]>
## 6 <named list [2]>
## 7 <named list [2]>
## 8 <named list [2]>
## 9 <named list [2]>
## 10 <named list [2]>
## # ℹ 497 more rows
Still not useful! We can see that we now have a tibble of 507 rows and one column, and each row still contains nested data.
We could take a look at one of the rows by selecting it directly:
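One way to do this is to save the tibble to a variable and then pick out the first element of the urlset column; the name sitemap_tibble below is just for illustration:
sitemap_tibble <- read_xml("ap_sitemap.xml") |>
  as_list() |>
  as_tibble()

# Look at the first element of the urlset column
sitemap_tibble$urlset[1]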
## $url
## $url$loc
## $url$loc[[1]]
## [1] "https://apnews.com/article/texans-falcons-d2cc274bd193e8f248883c6ff51d1cfa"
##
##
## $url$news
## $url$news$publication
## $url$news$publication$name
## $url$news$publication$name[[1]]
## [1] "Associated Press"
##
##
## $url$news$publication$language
## $url$news$publication$language[[1]]
## [1] "eng"
##
##
##
## $url$news$publication_date
## $url$news$publication_date[[1]]
## [1] "2023-10-08T16:20:26-04:00"
##
##
## $url$news$title
## $url$news$title[[1]]
## [1] "Desmond Ridder answers critics, Younghoe Koo kicks last-second field goal, Falcons edge Texans 21-19"
Examining it, we can see that the nested data looks like this:
- url
  - loc
  - news
    - publication
      - name
      - language
    - publication_date
    - title
To fix this problem, we can use a tidyr function called unnest_wider(), which splits nested data into separate columns. We pass it the name of the column we want to unnest; in our case, that's urlset.
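Adding that step to the pipeline we've been building (unnest_wider() needs tidyr loaded):
read_xml("ap_sitemap.xml") |>
  as_list() |>
  as_tibble() |>
  unnest_wider(urlset)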
## # A tibble: 507 × 2
## loc news
## <list> <list>
## 1 <list [1]> <named list [3]>
## 2 <list [1]> <named list [3]>
## 3 <list [1]> <named list [3]>
## 4 <list [1]> <named list [3]>
## 5 <list [1]> <named list [3]>
## 6 <list [1]> <named list [3]>
## 7 <list [1]> <named list [3]>
## 8 <list [1]> <named list [3]>
## 9 <list [1]> <named list [3]>
## 10 <list [1]> <named list [3]>
## # ℹ 497 more rows
This is better, as now loc and news are different columns. However, each one is still a list.
The nested structure of each row should now look like this:
- loc
- news
  - publication
    - name
    - language
  - publication_date
  - title
Let's add another unnest_wider() to turn the loc column from a one-element list into the actual value. Because the elements inside loc are unnamed, we have to add names_sep = "" so that unnest_wider() can build the new column names (which is why the column comes out as loc1). Now we should have the actual data.
read_xml("ap_sitemap.xml") |>
as_list() |>
as_tibble() |>
unnest_wider(urlset) |>
unnest_wider(loc, names_sep = "")
Let's continue this process with the news column, for completeness' sake:
read_xml("ap_sitemap.xml") |>
as_list() |>
as_tibble() |>
unnest_wider(urlset) |>
unnest_wider(loc, names_sep = "") |>
unnest_wider(news)
## # A tibble: 507 × 4
## loc1 publication publication_date title
## <chr> <list> <list> <list>
## 1 https://apnews.com/article/texans-falco… <named list> <list [1]> <list>
## 2 https://apnews.com/article/f1-qatar-gra… <named list> <list [1]> <list>
## 3 https://apnews.com/article/inter-bologn… <named list> <list [1]> <list>
## 4 https://apnews.com/world-news/general-n… <named list> <list [1]> <list>
## 5 https://apnews.com/article/citadel-furm… <named list> <list [1]> <list>
## 6 https://apnews.com/world-news/general-n… <named list> <list [1]> <list>
## 7 https://apnews.com/article/bengals-card… <named list> <list [1]> <list>
## 8 https://apnews.com/sports/deportes-57fe… <named list> <list [1]> <list>
## 9 https://apnews.com/article/villanova-no… <named list> <list [1]> <list>
## 10 https://apnews.com/article/ut-martin-ea… <named list> <list [1]> <list>
## # ℹ 497 more rows
read_xml("ap_sitemap.xml") |>
as_list() |>
as_tibble() |>
unnest_wider(urlset) |>
unnest_wider(loc, names_sep = "") |>
unnest_wider(news) |>
unnest_wider(publication) |>
unnest_wider(name, names_sep = "")
## # A tibble: 507 × 5
## loc1 name1 language publication_date title
## <chr> <chr> <list> <list> <list>
## 1 https://apnews.com/article/texans-fal… Asso… <list> <list [1]> <list>
## 2 https://apnews.com/article/f1-qatar-g… Asso… <list> <list [1]> <list>
## 3 https://apnews.com/article/inter-bolo… Asso… <list> <list [1]> <list>
## 4 https://apnews.com/world-news/general… Asso… <list> <list [1]> <list>
## 5 https://apnews.com/article/citadel-fu… Asso… <list> <list [1]> <list>
## 6 https://apnews.com/world-news/general… Asso… <list> <list [1]> <list>
## 7 https://apnews.com/article/bengals-ca… Asso… <list> <list [1]> <list>
## 8 https://apnews.com/sports/deportes-57… Asso… <list> <list [1]> <list>
## 9 https://apnews.com/article/villanova-… Asso… <list> <list [1]> <list>
## 10 https://apnews.com/article/ut-martin-… Asso… <list> <list [1]> <list>
## # ℹ 497 more rows
Keep doing this until you have all the data that you need.
read_xml("ap_sitemap.xml") |>
as_list() |>
as_tibble() |>
unnest_wider(urlset) |>
unnest_wider(loc, names_sep = "") |>
unnest_wider(news) |>
unnest_wider(publication) |>
unnest_wider(name, names_sep = "") |>
unnest_wider(language, names_sep = "") |>
unnest_wider(publication_date, names_sep = "") |>
unnest_wider(title, names_sep = "")
## # A tibble: 507 × 5
## loc1 name1 language1 publication_date1 title1
## <chr> <chr> <chr> <chr> <chr>
## 1 https://apnews.com/article/texans-f… Asso… eng 2023-10-08T16:20… Desmo…
## 2 https://apnews.com/article/f1-qatar… Asso… eng 2023-10-08T09:58… No re…
## 3 https://apnews.com/article/inter-bo… Asso… eng 2023-10-07T11:06… Pulis…
## 4 https://apnews.com/world-news/gener… Asso… spa 2023-10-07T20:00… Tras …
## 5 https://apnews.com/article/citadel-… Asso… eng 2023-10-07T17:44… Huff …
## 6 https://apnews.com/world-news/gener… Asso… spa 2023-10-07T11:44… Ataqu…
## 7 https://apnews.com/article/bengals-… Asso… eng 2023-10-08T20:44… Cardi…
## 8 https://apnews.com/sports/deportes-… Asso… spa 2023-10-07T12:35… Pulis…
## 9 https://apnews.com/article/villanov… Asso… eng 2023-10-07T20:11… Watki…
## 10 https://apnews.com/article/ut-marti… Asso… eng 2023-10-07T20:00… Dent …
## # ℹ 497 more rows
There you go! You now have a list of articles and metadata. When you’re happy with the way it looks, assign it to a variable to save it.
url_list <- read_xml("ap_sitemap.xml") |> # This is the only line I changed.
as_list() |>
as_tibble() |>
unnest_wider(urlset) |>
unnest_wider(loc, names_sep = "") |>
unnest_wider(news) |>
unnest_wider(publication) |>
unnest_wider(name, names_sep = "") |>
unnest_wider(language, names_sep = "") |>
unnest_wider(publication_date, names_sep = "") |>
unnest_wider(title, names_sep = "")
4.5 Classwork: Processing XML
With the website you found earlier, download the XML file and get a data frame with a list of URLs. If you still have time, un-nest the other metadata as well.
4.6 Scraping
Now that we have our list of URLs, we need to actually get the data behind each link. For this, we're going to go back to basics with our old friend, the for loop.
As a refresher, remember that given a vector, a for loop will do something for each element in the vector.
Using our list of URLs, we could print every URL. That might go by too fast for us to read, so we can use Sys.sleep() to pause for a second between each one.
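A sketch of that loop, using the url_list data frame we saved above and its loc1 column:
for (url in url_list$loc1) {
  print(url)
  Sys.sleep(1)  # wait one second before the next URL
}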
[1] "https://apnews.com/article/texans-falcons-d2cc274bd193e8f248883c6ff51d1cfa"
[1] "https://apnews.com/article/f1-qatar-grand-prix-verstappen-tires-d65ed543a90589a580a7357023b80fa7"
[1] "https://apnews.com/article/inter-bologna-milan-juventus-serie-a-d8569cc950036788dea4cbc5444767c0"
[1] "https://apnews.com/world-news/general-news-1d1e7ef745f2f1e773e8be03d23f070c"
[1] "https://apnews.com/article/citadel-furman-f4f00b18df71639616f1ba34b54a0ed8"
[1] "https://apnews.com/world-news/general-news-992058df45453966961b4e54b2073d0c"
[1] "https://apnews.com/article/bengals-cardinals-28a9e6bdf3087e35f8a9495f4973822e"
[1] "https://apnews.com/sports/deportes-57febdd23ba16f433773977af234e2b0"
[1] "https://apnews.com/article/villanova-north-carolina-at-0018092487249a74496f88c494efc404"
[1] "https://apnews.com/article/ut-martin-eastern-illinois-c213f1751d186412db91fa40b65951fe"
[1] "https://apnews.com/article/timbers-montreal-9ee51f8cf802407a10728c6cffb4c5d3"
...
...
...
However, instead of printing each URL, we want to use download.file() to download the page and save it somewhere.
We’ll want to:
- Make a new folder for our html files.
- In our data frame, give each row a code, which will be a good name for its file.
- Loop through the data frame and download all the links.
4.6.1 Making a folder
You can make a folder right in R with dir.create(). Just pass it the folder name.
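For our AP pages, that's a single call; if the folder already exists, you'll just get a warning like the one below:
# Create a folder to hold the downloaded HTML files
dir.create("ap_files")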
## Warning in dir.create("ap_files"): 'ap_files' already exists
4.6.2 Adding file names
We now need to assign a file name to each of the links we want to download. You can do this however you want, but the easiest way might be to take the row number in the data frame with row_number() and paste it onto a file extension.
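A sketch of that step, assuming dplyr and glue are loaded; the mutate() call below is one way to do it, not the only one:
url_list <- url_list |>
  mutate(file_name = glue("{row_number()}.html"))

url_list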
## # A tibble: 507 × 6
## loc1 name1 language1 publication_date1 title1 file_name
## <chr> <chr> <chr> <chr> <chr> <glue>
## 1 https://apnews.com/articl… Asso… eng 2023-10-08T16:20… Desmo… 1.html
## 2 https://apnews.com/articl… Asso… eng 2023-10-08T09:58… No re… 2.html
## 3 https://apnews.com/articl… Asso… eng 2023-10-07T11:06… Pulis… 3.html
## 4 https://apnews.com/world-… Asso… spa 2023-10-07T20:00… Tras … 4.html
## 5 https://apnews.com/articl… Asso… eng 2023-10-07T17:44… Huff … 5.html
## 6 https://apnews.com/world-… Asso… spa 2023-10-07T11:44… Ataqu… 6.html
## 7 https://apnews.com/articl… Asso… eng 2023-10-08T20:44… Cardi… 7.html
## 8 https://apnews.com/sports… Asso… spa 2023-10-07T12:35… Pulis… 8.html
## 9 https://apnews.com/articl… Asso… eng 2023-10-07T20:11… Watki… 9.html
## 10 https://apnews.com/articl… Asso… eng 2023-10-07T20:00… Dent … 10.html
## # ℹ 497 more rows
4.6.3 Downloading our files
Our last step is to download the files. We can do this with a for loop, so let's build it from the inside out. To start, we need to be able to grab a single row of our data frame, which we can do with square brackets.
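For example, the first row of url_list:
# Select row 1, keeping all columns
url_list[1, ]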
## # A tibble: 1 × 6
## loc1 name1 language1 publication_date1 title1 file_name
## <chr> <chr> <chr> <chr> <chr> <glue>
## 1 https://apnews.com/article… Asso… eng 2023-10-08T16:20… Desmo… 1.html
Within the row, we can select individual columns using the dollar sign.
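For example, pulling the file name out of that first row; we store the row in a variable called row, the same name the loop below uses:
row <- url_list[1, ]
row$file_name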
## 1.html
We want to put the file in the folder we made earlier, so we can use glue() to build the full path:
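With glue() loaded, that looks like this:
# Build the full path inside the ap_files folder
glue("ap_files/{row$file_name}")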
## ap_files/1.html
Finally, we can select the URL (the loc1 column) and download it using download.file().
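Putting those pieces together for a single row, it might look like this:
output_location <- glue("ap_files/{row$file_name}")
download.file(row$loc1, output_location)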
If this works, let’s wrap it up in a for-loop, and test the first 5 rows.
I also added two things:
- Sys.sleep(2) adds a 2-second pause between requests.
- A nice little message is printed after each request.
for (i in 1:5) {
row <- url_list[i, ]
output_location <- glue("ap_files/{row$file_name}")
download.file(row$loc1, output_location)
Sys.sleep(2)
print(glue("{i} files have been downloaded"))
}
trying URL 'https://apnews.com/article/texans-falcons-d2cc274bd193e8f248883c6ff51d1cfa'
downloaded 573 KB
1 files have been downloaded
trying URL 'https://apnews.com/article/f1-qatar-grand-prix-verstappen-tires-d65ed543a90589a580a7357023b80fa7'
downloaded 398 KB
2 files have been downloaded
trying URL 'https://apnews.com/article/inter-bologna-milan-juventus-serie-a-d8569cc950036788dea4cbc5444767c0'
downloaded 466 KB
3 files have been downloaded
trying URL 'https://apnews.com/world-news/general-news-1d1e7ef745f2f1e773e8be03d23f070c'
downloaded 194 KB
4 files have been downloaded
trying URL 'https://apnews.com/article/citadel-furman-f4f00b18df71639616f1ba34b54a0ed8'
downloaded 162 KB
Now, we want to make sure we don't download the same thing twice, as this could get us banned from the website.
Let's add an if/else statement to make sure we didn't already download the file. First, we need a list of the files in our folder, using list.files().
[1] "1.html" "2.html" "3.html" "4.html" "5.html"
We can now put our main code inside an if/else statement and test it with the first 10 rows:
for (i in 1:10) {
row <- url_list[i, ]
if (row$file_name %in% already_downloaded) {
print(glue("Skipping {row$file_name}"))
} else {
output_location <- glue("ap_files/{row$file_name}")
download.file(row$loc1, output_location)
Sys.sleep(2)
print(glue("{i} files have been downloaded"))
}
}
Skipping 1.html
Skipping 2.html
Skipping 3.html
Skipping 4.html
Skipping 5.html
trying URL 'https://apnews.com/world-news/general-news-992058df45453966961b4e54b2073d0c'
downloaded 160 KB
6 files have been downloaded
trying URL 'https://apnews.com/article/bengals-cardinals-28a9e6bdf3087e35f8a9495f4973822e'
downloaded 268 KB
7 files have been downloaded
trying URL 'https://apnews.com/sports/deportes-57febdd23ba16f433773977af234e2b0'
downloaded 225 KB
8 files have been downloaded
trying URL 'https://apnews.com/article/villanova-north-carolina-at-0018092487249a74496f88c494efc404'
downloaded 162 KB
9 files have been downloaded
trying URL 'https://apnews.com/article/ut-martin-eastern-illinois-c213f1751d186412db91fa40b65951fe'
downloaded 162 KB
10 files have been downloaded
It should now skip the first 5 rows, because those files are already in your download folder.
Congratulations! You've just built a web scraper.
4.7 Not Getting your IP banned
Web scraping is very much in a legal grey area, and the rules vary from place to place. Please consult lawyers if you’re going to do anything legally questionable.
However, here are some basic rules:
- Scrape slowly. When you're downloading web pages, make sure to only do a page every few seconds. Sys.sleep() is your go-to for this.
- Pretend to be human. Download pages in a random order, add a slightly random amount of time between downloads, and don't do everything in one batch. For example, you could generate a random number with runif(1), then sleep with Sys.sleep(runif(1) + 2).
- Only scrape once. Write your scrapers so that you're only downloading the data one time, even if the scraper crashes in the middle. Your if/else statement will largely solve this issue.
- Don't publish your data. At most, provide a list of URLs that you scraped. This keeps you mostly in the clear in terms of copyright.