Chapter 2 Data Sets

[Image: Antarctica; see footnote 1]

Throughout the semester, we will use various data sets to teach important data science techniques.

2.1 Penguins

One such data set, which we will use extensively, is the penguins data set from the palmerpenguins R package (Horst, Hill, and Gorman 2020). It records characteristics of three species of penguin living on the Dream, Biscoe, and Torgersen islands in the Palmer Archipelago, off the coast of Antarctica.

The three species of penguin are:

  • Gentoo Penguins [image; see footnote 2]

  • Chinstrap Penguins [image; see footnote 3]

  • Adelie Penguins [image; see footnote 4]

The penguins data set contains measurements of several characteristics of 333 penguins (see footnote 5); Table 2.1 below shows a sample.

Table 2.1: A glimpse of the penguins data set from the palmerpenguins package.
species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie     Torgersen  39.1            18.7           181                3750         male    2007
Adelie     Dream      42.3            21.2           191                4150         male    2007
Gentoo     Biscoe     50.5            15.9           225                5400         male    2008
Gentoo     Biscoe     46.9            14.6           222                4875         female  2009
Chinstrap  Dream      50.6            19.4           193                3800         male    2007
Chinstrap  Dream      50.7            19.7           203                4050         male    2009


For each penguin, the data set records its species, the island on which it lives, its bill length, bill depth, and flipper length (all measured in mm), its body mass (measured in grams), its sex, and the year in which the measurements were recorded.
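As a minimal sketch of how these records might be represented and summarised in code, the Python snippet below re-enters the six rows of Table 2.1 by hand, converts each column to a suitable type, and averages two measurements per species. It uses only the standard library and does not load the palmerpenguins package. The names SAMPLE_CSV, load_sample, and mean_by_species are illustrative choices made here, not part of any package.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# The six sample rows from Table 2.1, typed in by hand as CSV text.
# This is NOT the full data set, only the glimpse shown in the table.
SAMPLE_CSV = """\
species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Dream,42.3,21.2,191,4150,male,2007
Gentoo,Biscoe,50.5,15.9,225,5400,male,2008
Gentoo,Biscoe,46.9,14.6,222,4875,female,2009
Chinstrap,Dream,50.6,19.4,193,3800,male,2007
Chinstrap,Dream,50.7,19.7,203,4050,male,2009
"""

# Columns holding decimal measurements (mm) and whole-number values.
FLOAT_COLUMNS = ("bill_length_mm", "bill_depth_mm")
INT_COLUMNS = ("flipper_length_mm", "body_mass_g", "year")


def load_sample(text=SAMPLE_CSV):
    """Parse the CSV text into a list of dicts with numeric columns converted."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = dict(raw)
        for col in FLOAT_COLUMNS:
            row[col] = float(raw[col])
        for col in INT_COLUMNS:
            row[col] = int(raw[col])
        rows.append(row)
    return rows


def mean_by_species(rows, column):
    """Return the mean of `column` for each species present in `rows`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["species"]].append(row[column])
    return {species: mean(values) for species, values in groups.items()}


penguins_sample = load_sample()
flipper_means = mean_by_species(penguins_sample, "flipper_length_mm")
mass_means = mean_by_species(penguins_sample, "body_mass_g")

for species in flipper_means:
    print(species, flipper_means[species], mass_means[species])
```

With only two penguins per species, these averages are far too small a sample to support any conclusion. The sketch only illustrates the shape of the data: one record per penguin, with categorical columns (species, island, sex) alongside numeric measurements.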

In Computer Labs 2B, 3B and 4B we will use this data to explore data visualisation methods that make the differences between these species easy to see.

References

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.

  1. “Antarctica 2013: Journey to the Crystal Desert” by Christopher.Michel is licensed under CC BY 2.0.

  2. “Gentoo Penguins” by D-Stanley is licensed under CC BY 2.0.

  3. “Chinstrap Penguins” by D-Stanley is licensed under CC BY 2.0.

  4. “Adelie Penguin (Pygoscelis adeliae)” by Gregory ‘Slobirdr’ Smith is licensed under CC BY-SA 2.0.

  5. The full data set is slightly larger, at 344 penguins, but some penguins have missing values; the figure of 333 counts only the penguins with complete records.