Chapter 2 Data Sets


Throughout the semester, we will use various data sets to teach important data science techniques.

2.1 Penguins
(Computer Labs 1B+)

In many weeks, we will use the penguins data set from the palmerpenguins R package (Horst, Hill, and Gorman 2020). This data set describes the characteristics of three species of penguin living on Dream, Biscoe, and Torgersen islands in the Palmer Archipelago, off the coast of Antarctica.
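As a minimal sketch of how the data can be loaded, the code below assumes the palmerpenguins package has already been installed from CRAN (the commented-out line shows one way to do that). The package provides a data frame called penguins.

```r
# Install once if needed (assumes access to CRAN):
# install.packages("palmerpenguins")

# Load the package so that the `penguins` data frame is available
library(palmerpenguins)

# Show the first six rows of the data set
head(penguins)

# Number of rows (penguins) and columns (variables)
dim(penguins)
```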

The three species of penguin are Gentoo, Chinstrap, and Adelie penguins.

The penguins data set contains measurements of several characteristics of these penguins. Table 2.1 below shows a sample of six rows.

Table 2.1: A glimpse of the penguins data set from the palmerpenguins package.
species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie     Torgersen  39.1            18.7           181                3750         male    2007
Adelie     Dream      42.3            21.2           191                4150         male    2007
Gentoo     Biscoe     50.5            15.9           225                5400         male    2008
Gentoo     Biscoe     46.9            14.6           222                4875         female  2009
Chinstrap  Dream      50.6            19.4           193                3800         male    2007
Chinstrap  Dream      50.7            19.7           203                4050         male    2009

Namely, for each penguin, we have data on their species, the island on which they live, their bill length, bill depth and flipper length (all measured in mm), their body mass (measured in grams), their sex, and the year in which the recordings were made.
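The variables described above can be inspected directly in R. The following is a short sketch, assuming the palmerpenguins package is installed. It uses only base R functions to show the column types, cross-tabulate species against island, and count missing values in each column.

```r
library(palmerpenguins)

# Column names and types: species, island and sex are stored as factors;
# the measurements and year are stored as numbers
str(penguins)

# How many penguins of each species were recorded on each island?
table(penguins$species, penguins$island)

# Number of missing values (NA) in each column
colSums(is.na(penguins))
```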

Over the first four weeks, we will use this data to explore data visualisation methods that help us quickly identify differences between these species.
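As a small preview of this idea, the sketch below draws a boxplot of flipper length for each species using base R graphics. The choice of variable and of a boxplot is only an illustrative assumption, not necessarily the method used in the computer labs.

```r
library(palmerpenguins)

# One box per species; rows with missing flipper lengths are dropped
boxplot(
  flipper_length_mm ~ species,
  data = penguins,
  xlab = "Species",
  ylab = "Flipper length (mm)",
  main = "Flipper length by penguin species"
)
```

Comparing the position and spread of the three boxes gives a quick visual impression of how the species differ on this one measurement.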


