Chapter 2 Data Sets
Throughout the semester, we will use various data sets to teach important data science techniques.
2.1 Penguins
One such data set, which we will use extensively, is the penguins data set from the palmerpenguins R package (Horst, Hill, and Gorman 2020). This is an interesting data set on the characteristics of three species of penguin living on the Dream, Biscoe, and Torgersen islands in the Palmer Archipelago, off the coast of Antarctica.
The three species of penguin are Adelie, Chinstrap, and Gentoo.
The penguins data set contains measurements for different characteristics of 333 penguins[5]; take a look at Table 2.1 below.
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Dream | 42.3 | 21.2 | 191 | 4150 | male | 2007 |
Gentoo | Biscoe | 50.5 | 15.9 | 225 | 5400 | male | 2008 |
Gentoo | Biscoe | 46.9 | 14.6 | 222 | 4875 | female | 2009 |
Chinstrap | Dream | 50.6 | 19.4 | 193 | 3800 | male | 2007 |
Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male | 2009 |
Namely, for each penguin, we have data on their species, the island on which they live, their bill length, bill depth and flipper length (all measured in mm), their body mass (measured in grams), their sex, and the year in which the recordings were made.
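The variables described above can be inspected directly in R. The following is a minimal sketch, assuming the palmerpenguins package is already installed; the object name penguins_complete is our own choice for illustration and does not come from the package.

```r
# Assumption: palmerpenguins is installed,
# e.g. via install.packages("palmerpenguins").
library(palmerpenguins)

# The full data set: one row per penguin, one column per variable.
dim(penguins)

# Drop every penguin that has at least one missing value.
# penguins_complete is a hypothetical name used only in this sketch.
penguins_complete <- na.omit(penguins)
nrow(penguins_complete)

# Inspect the variable names and types.
str(penguins_complete)

# Count the complete records for each species.
table(penguins_complete$species)
```

If everything is set up as expected, dim(penguins) should report the 344 penguins of the full data set, and nrow(penguins_complete) should report the 333 penguins with no missing values.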
In Computer Labs 2B, 3B and 4B we will use this data to explore various data visualisation methods that help us quickly identify the differences between these species.
References
1. “Antarctica 2013: Journey to the Crystal Desert” by Christopher.Michel is licensed under CC BY 2.0
2. “Gentoo Penguins” by D-Stanley is licensed under CC BY 2.0
3. “Chinstrap Penguins” by D-Stanley is licensed under CC BY 2.0
4. “Adelie Penguin (Pygoscelis adeliae)” by Gregory ‘Slobirdr’ Smith is licensed under CC BY-SA 2.0
5. The full data set is slightly larger at 344 penguins, but some penguins have missing values.