6 Tutorial 6: Intro to Scraping
After working through Tutorial 6, you’ll…
- know how to define web scraping
- know about the opportunities, but also the legal and ethical limitations, of web scraping
- know the basics of HTML & CSS
6.1 What is web scraping?
I understand web scraping as the automated collection and retrieval of relevant data from website code. Generally, this includes at least three steps (see the footnote at the end of this tutorial):
1. Identify the URL of the website you want to scrape
2. Download its content
3. From the downloaded data, separate “junk” from relevant data
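The three steps above can be sketched roughly as follows. This is only an illustration in Python using the standard library (the next tutorial does the actual scraping in R). The placeholder URL, the user agent string, the names `download`, `TitleParser`, and `extract_title`, and the sample page are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Step 1: identify the URL (a placeholder address, not a real research target).
URL = "https://example.org/"


def download(url, user_agent="tutorial-example-bot"):
    """Step 2: download the page and return its source code as text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Step 3: keep the text inside <title> and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(source_code):
    parser = TitleParser()
    parser.feed(source_code)
    parser.close()
    return parser.title.strip()


# No request is sent here: step 3 is shown on a small hard-coded page.
# With network access, extract_title(download(URL)) would run all three steps.
sample_page = "<html><head><title>My page</title></head><body><p>Junk</p></body></html>"
title = extract_title(sample_page)
print(title)
```

Keeping the download in its own function makes it easy to add pauses between requests later, which matters for the limitations discussed in section 6.3.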
6.2 Why use web scraping?
In previous tutorials, we have mainly worked with existing data - for example, Excel files already (and conveniently) including the data we needed.
However, in most cases such data is not readily available, and you may have to acquire data on your own. Since a lot of social media platforms have closed automated access to their data via so-called Application Programming Interfaces (APIs), see more here, researchers increasingly have to collect data via other methods.
While “doing your own data collection” is relatively normal for surveys or experiments, it has led to new modes of data collection for content analyses. “Newer” methods include tracking, data donation - or scraping (for an overview of these methods, see here).
Web scraping has several advantages:
- It allows us to collect data that may otherwise not be available for research
- It allows us to collect new (meta) data (e.g., timestamps of content, multi-modal data)
- It allows us to rely on the structuredness of websites to scale-up data collection
6.3 Ethical, legal, and technical limitations
While web scraping has some advantages, these go hand in hand with several ethical, legal, and technical limitations. For excellent and far more detailed overviews on all of these points, see this overview article by Luscombe et al., 2022 (in English) or recent discussions by M. Haim (in German).
Ethically, researchers should ask: Should I scrape this website? For example, is this data public or private? Does it contain information about individuals and/or could this information be used to harm specific individuals? Also, by repeatedly scraping content from a website, could I unintentionally disrupt the service of the website (e.g., in what may be considered a DoS attack)?
Legally, researchers should ask: Am I allowed to scrape this website? Whether and to what degree scraping is legal depends, among other contexts, on what and how much content is scraped, who is scraping this content in which country, etc. In Germany, current legislation allows scraping for scientific research - but only under certain conditions. A good way of addressing both ethical and legal aspects is to follow “rules of good scraping behavior”. For example, website hosts often define which elements of their website can be scraped, by whom, and with what speed in their robots.txt, something we should respect. For the example of Wikipedia, see here.
Technically, researchers should ask: Can I scrape this website? Not all elements of websites can or should be accessed via scraping. For example, dynamic content is harder to scrape; the same applies to content behind paywalls or user logins.
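The robots.txt check mentioned above can be sketched as follows. This is a minimal illustration in Python using the standard library module `urllib.robotparser`. The robots.txt content, the bot name `my-research-bot`, and the example.org URLs are made up for this example. Note that `Crawl-delay` is a widely used but non-standard directive, so not every website sets it.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt file. Real files are found at <website>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
# parse() reads rules from lines we already have, so no request is sent here.
parser.parse(ROBOTS_TXT.splitlines())

# May our (hypothetical) bot fetch these pages?
allowed = parser.can_fetch("my-research-bot", "https://example.org/articles/1")
blocked = parser.can_fetch("my-research-bot", "https://example.org/private/data")

# How many seconds should we wait between two requests?
delay = parser.crawl_delay("my-research-bot")

print(allowed, blocked, delay)
```

For a real website, one would typically call `set_url()` with the address of its robots.txt and then `read()` instead of `parse()`, and pause for at least the requested delay between downloads.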
6.4 Source code
To understand web scraping, you have to understand what it relies on: source code.
Generally, websites are text documents that are interpreted and designed based on their source code. Let’s take the Wikipedia page on “Communication studies” as an example.
This is what the website looks like in my browser:
Image: Wikipedia page
Now, let us look at the underlying code (right-click on the page, then choose “View Page Source” or “View Source” depending on your computer). For different options to do so across browsers and operating systems, see here.
Image: Wikipedia page Source Code
You can see: Websites are simple text documents (ok, the code does not look simple - but we can easily understand parts of it!). When you visit websites, your browser reads the underlying source code (e.g., HTML, CSS, JavaScript) to correctly display the website.
- HTML (Hypertext Markup Language): is a “markup language” that defines the structure of websites. “Markup” means that the code includes additional info besides just the content you want to display. For example, you can use HTML to define the title of your website: you may not only include information on the content of the title, but also where and how it should be displayed.
- CSS (Cascading Style Sheets): is a “style sheet” language we use to change the design of websites. For example, you can use CSS to define the color of the title of your website.
- JavaScript: is a language we also use to change the design/behavior of websites (mostly for dynamic and interactive elements). For example, you can use JavaScript to automatically play a video when you hover over the title of your website.
In the following, we will focus on understanding the basics of HTML and CSS to learn web scraping.
6.5 HTML
You can use HTML (Hypertext Markup Language) to structure websites. For our seminar, you mainly need to know three things about HTML:
1. Websites consist of nested elements. For example, this is an element: `<body>Some content</body>`. This gives HTML files a “tree-like” structure where elements are nested in elements nested in elements etc. For example, the element `<body>Some content</body>` is usually nested in `html`: `<html><body>Some content</body></html>`
2. Most elements (e.g., `<body>Some content</body>`) consist of a tag which marks the beginning (`<>`). Similarly, a tag marks the end of the element (`</>`) (this differs for some elements, but we can ignore this for now). In between these tags is some content, e.g. text. For example: the element `<h1>My heading</h1>` consists of a start tag (`<h1>`), content (“My heading”), and an end tag (`</h1>`). What type of tag is used tells us a bit about what content we may expect (e.g., in `h1` we may expect a title, in `img` an image - which is useful information for automatically scraping such code!).
3. You can also include attributes inside the start tag. Attributes provide additional information. For example, we may want to add a link to our heading “My heading” from above. We could do that via the `a` tag (an “anchor” tag used to embed links) and an additional `href` attribute that specifies where the link should guide readers.
A typical HTML text may look like this:
<html>
<body>
<h1>My heading</h1>
<p>Some text I wrote.</p>
</body>
</html>
What does this mean?
- the `<html>` element is the root element of an HTML website
- the `<body>` element defines the document’s body (where text, images etc. are included)
- the `<h1>` element defines a large heading
- the `<p>` element defines a paragraph
On a website, the HTML snippet above is rendered to only the following two sentences:
Output of HTML snippet
My heading
Some text I wrote.
If you want to try this yourself, copy-in the HTML code here and try to play around with it a bit.
As you can see, while the HTML file contains a lot of information, only some of it is displayed here. Most of it is, instead, used to structure text “in the background”.
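To make the difference between structure and displayed text more tangible, here is a small sketch in Python using the standard library module `html.parser`. It reads the HTML snippet from above and records (a) the nesting of elements and (b) the text a browser would display. The class name `StructureParser` is an assumption for this example, and the depth counting is a simplification that only works for elements with both a start and an end tag.

```python
from html.parser import HTMLParser

HTML_SNIPPET = """
<html>
<body>
<h1>My heading</h1>
<p>Some text I wrote.</p>
</body>
</html>
"""


class StructureParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0          # how deeply nested the current element is
        self.outline = []       # one indented entry per start tag
        self.visible_text = []  # text that would be displayed on the website

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip the line breaks between tags
            self.visible_text.append(text)


parser = StructureParser()
parser.feed(HTML_SNIPPET)
parser.close()

print("\n".join(parser.outline))
print(parser.visible_text)
```

The outline shows the “tree-like” structure (`h1` and `p` sit inside `body`, which sits inside `html`), while the list of visible text contains only the two sentences.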
Next, let’s try to include a link for the text “Some text I wrote”. We can use the `a` anchor and the `href` attribute (including the link, here to Google).
<html>
<body>
<h1>My heading</h1>
<p><a href="https://www.google.de/">Some text I wrote</a></p>
</body>
</html>
On a website, the result looks like this (try clicking the link!):
Output of HTML snippet
My heading
Some text I wrote (displayed as a link to https://www.google.de/)
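Attributes such as `href` are often exactly what we want to collect when scraping. As a rough sketch, the following Python code (standard library only) reads the snippet above and returns every link together with its text. The class name `LinkParser` is an assumption for this example.

```python
from html.parser import HTMLParser

HTML_SNIPPET = """
<html>
<body>
<h1>My heading</h1>
<p><a href="https://www.google.de/">Some text I wrote</a></p>
</body>
</html>
"""


class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []    # (href, link text) pairs
        self._href = None  # href of the <a> element we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs, e.g. [("href", "...")]
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


parser = LinkParser()
parser.feed(HTML_SNIPPET)
parser.close()
print(parser.links)
```

Collecting links in this way is also the basic idea behind crawling, where the extracted URLs are visited in a next step.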
You may ask yourself: Great - so why exactly should you know about HTML?
Knowing about the structure of HTML files (and what type of content different elements contain) is important to systematically parse relevant data from websites.
For example, articles in news websites will often be included in `body`, article titles in `h1`, etc. So we could look for these tags/elements when scraping news websites to only extract the title and text of a news article.
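As a sketch of this idea, the following Python code (standard library only) keeps the text of `h1` and `p` elements and ignores everything else. The news page, its content, and the class name `ArticleParser` are invented for illustration. Real news websites are structured in more complicated ways, so which tags hold the relevant data has to be checked for each website.

```python
from html.parser import HTMLParser

# A made-up news page with some "junk" around the article.
NEWS_PAGE = """
<html>
<body>
<div>Menu | Login | Newsletter</div>
<h1>City opens new library</h1>
<p>The library opened on Monday.</p>
<p>It is located in the city center.</p>
<div>Advertisement</div>
</body>
</html>
"""


class ArticleParser(HTMLParser):
    WANTED = {"h1", "p"}  # tags whose text we want to keep

    def __init__(self):
        super().__init__()
        self.current = None  # wanted tag we are currently inside, if any
        self.found = {"h1": [], "p": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self.current = tag
            self.found[tag].append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.found[self.current][-1] += data


parser = ArticleParser()
parser.feed(NEWS_PAGE)
parser.close()

title = parser.found["h1"][0]
text = " ".join(parser.found["p"])
print(title)
print(text)
```

The menu and the advertisement sit in `div` elements and are therefore dropped, which is the “separate junk from relevant data” step from section 6.1.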
6.6 CSS
While you could fine-tune the appearance of your website via HTML elements (e.g., `<b>`, `<i>`), developers came up with CSS (Cascading Style Sheets) to more neatly format the appearance of HTML pages.
For our seminar, you mainly need to know four things about CSS:
1. Rules are the building blocks of CSS. They describe how different sections of websites should be formatted. Rules consist of selectors and declaration blocks. For example, a rule could define that all headings of a website should be red.
2. Selectors define which HTML element you want to style. In the example above, the selector could be the element `h1`, which stands for a first-level heading.
3. Declaration blocks describe how the element should be styled by including information on properties (e.g., please set my `color`) and values (e.g., please set it to `red`).
4. There are different ways in which we can include CSS in HTML. Here, we will discuss inline CSS (defining rules for every single element) and internal CSS (defining rules for types of elements, so-called classes).
6.6.1 Inline CSS
Let’s try to change the color of my `h1` heading “My heading”. My new rule should be that this `h1` heading should be displayed in red.
How do I do this?
- I want to change the content included in the heading `h1` element via inline CSS. To do so, I change the `style` attribute within the `h1` element.
- I want to change the color of this element, so I include the property `color` within the `style` attribute.
- I want to change the color of this element to red, so I include the value `red` for the property `color` within the `style` attribute.
<html>
<body>
<h1 style="color:red;">My heading</h1>
<p>Some text I wrote</p>
</body>
</html>
On a website, this changes the color of the `h1` element:
Output of HTML snippet
My heading
Some text I wrote
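When scraping, inline CSS is simply another attribute we can read. The following sketch in Python (standard library only) reads the `style` attribute from the snippet above and splits it into properties and values. The names `parse_style` and `InlineStyleParser` are assumptions for this example, and splitting on `;` and `:` is a simplification that is sufficient for simple declarations like the one used here.

```python
from html.parser import HTMLParser

HTML_SNIPPET = """
<html>
<body>
<h1 style="color:red;">My heading</h1>
<p>Some text I wrote</p>
</body>
</html>
"""


def parse_style(style):
    """Split a style attribute such as "color:red;" into {property: value}."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            declarations[prop.strip()] = value.strip()
    return declarations


class InlineStyleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.styles = []  # (tag, {property: value}) for every styled element

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style")
        if style:
            self.styles.append((tag, parse_style(style)))


parser = InlineStyleParser()
parser.feed(HTML_SNIPPET)
parser.close()
print(parser.styles)
```

Only the `h1` element shows up in the result, since the `p` element has no `style` attribute.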
6.6.2 Internal CSS
The inline version above is a bit inefficient, since you would have to include information on the `color` for every single heading (making your code very long and more prone to errors).
Instead of inline code, we could also use internal CSS: Here, we define rules not for every single element (e.g., every `h1`) but for types of elements, so-called classes. We define a CSS style for all elements of a certain `class` and assign this `class` to elements we want to be displayed in a certain way via attributes.
Let’s say, for example, that we only want parts of “Some text I wrote” to be depicted in pink:
- I create a new `class` `.text-pink` for text that should be pink within the `<style>` element. Notice how `<style>` is now defined at the beginning of the document, so not related to a specific element.
- For `.text-pink`, I want to define a color, so I include the property `color`.
- For `.text-pink`, I want to set the color to `pink`, so I include the value `pink` for the property `color`.
- To mark which words of “Some text I wrote” should be depicted in pink, I can use the `<span>` element. `<span>` is used to mark specific parts of text. I only want the word “text” to be pink, so I include “text” between `<span>` and `</span>`.
<html>
<style>
.text-pink {
color:pink;
}
</style>
<body>
<h1 class="heading-new">My heading</h1>
<p>Some <span class="text-pink">text</span> I wrote</p>
</body>
</html>
On a website, this changes the color of the word “text”:
Output of HTML snippet
My heading
Some text I wrote
Again, you may ask yourself: Great - so why exactly should you know about CSS?
Again, knowing about the structure of CSS syntax is important to systematically parse relevant data from websites.
For example, articles in news websites may be formatted according to a specific style, e.g., the class `style-article`. We could look for `style-article` when scraping news websites to only extract the text of the article (and ignore all “junk code” around it).
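The idea of selecting content by class can be sketched as follows in Python (standard library only). The page, the classes `navigation` and `footer`, and the names `ClassTextParser` and `text_by_class` are made up for this example. The class `style-article` is the hypothetical class from the paragraph above.

```python
from html.parser import HTMLParser

PAGE = """
<html>
<body>
<div class="navigation">Home | Politics | Sports</div>
<div class="style-article">Some <span class="text-pink">text</span> I wrote</div>
<div class="footer">Imprint</div>
</body>
</html>
"""


class ClassTextParser(HTMLParser):
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.tag = None   # tag name of the matching element we are inside
        self.depth = 0    # greater than 0 while inside a matching element
        self.texts = []   # one entry per matching element

    def handle_starttag(self, tag, attrs):
        # An element can have several classes, separated by spaces.
        classes = (dict(attrs).get("class") or "").split()
        if self.depth == 0 and self.wanted_class in classes:
            self.tag = tag
            self.depth = 1
            self.texts.append("")
        elif self.depth > 0 and tag == self.tag:
            self.depth += 1  # same tag nested inside the matching element

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data


def text_by_class(source_code, wanted_class):
    parser = ClassTextParser(wanted_class)
    parser.feed(source_code)
    parser.close()
    return parser.texts


article = text_by_class(PAGE, "style-article")
pink = text_by_class(PAGE, "text-pink")
print(article)
print(pink)
```

Looking for `style-article` returns the full article text without the navigation and footer, while looking for `text-pink` returns only the single word marked by `<span>`.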
💡 Take Aways
HTML: a “markup language” that defines the structure of websites
- consists of nested elements (e.g., `body`, `title`) marked by tags (`<>`, `</>`) (for overview lists of elements, see here)
- elements may include attributes including additional information (e.g., a link via `href`) (for overview lists of attributes, see here)
As an overview, see the most important types of elements and attributes in HTML (for a full list, see here):
Element | Meaning |
---|---|
`<head>` | structure - defines meta data of a document |
`<title>` | structure - defines the title of a document |
`<body>` | structure - defines the document’s body |
`<h1>` | structure - defines headings (h1, h2, h3, etc.) |
`<p>` | structure - defines a paragraph |
`<a>` | structure - defines a link |
`<img>` | structure - defines an image |
`<div>` | structure - defines a container in which elements can be styled via CSS/JavaScript |
`<b>` | formatting - makes text bold |
`<i>` | formatting - makes text italic |
CSS: a “style sheet” language that defines the design of websites
- consists of rules (e.g., this element should be green) specified by selectors and related declaration blocks
- rules can be specified for classes of elements
- selectors describe which element should be formatted (e.g., the heading `h1`)
- declaration blocks define what property of the selector should be formatted (e.g., its color `color`) and what value should be used (e.g., `red`)
📚 More tutorials on this
You still have questions? The following tutorials & papers can help you with that:
- HTML Tutorial by W3Schools as well as CSS Tutorial by W3Schools
- Algorithmic thinking in the public interest by Luscombe et al. (2022)
- Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining by S. Munzert, C. Rubba, P. Meißner, & D. Nyhuis
- Fremde Daten sammeln by M. Haim, chapter in Computational Communication Science
Enough of the talk and the introduction: Let’s try Web Scraping in R in the next tutorial.
Footnote to section 6.1: The first step may extend to crawling, i.e., automatically collecting all relevant links you then want to scrape separately. We will deal with this later.