Load Libraries
library(palmerpenguins)
library(ggthemes)
library(tidyverse)
library(knitr)
library(tinytex)
Business Problem
- Do penguins with longer flippers weigh more or less than penguins with shorter flippers?
- What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear?
- Does the relationship vary by the species of the penguin?
- How about by the island where the penguin lives?
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
view(penguins)
Creating our plot
ggplot(data = penguins)
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
<!--#
When aesthetic mappings are defined in ggplot(), at the global level, they’re passed down to each of the subsequent geom layers of the plot. However, each geom function in ggplot2 can also take a mapping argument, which allows for aesthetic mappings at the local level that are added to those inherited from the global level. Since we want points to be colored based on species but don’t want the lines to be separated out for them, we should specify color = species for geom_point() only.
-->
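The global versus local distinction described above can be checked by inspecting the plot object itself. This is a minimal sketch: the variable name p is illustrative, and reading the mapping and layers fields relies on how ggplot2 stores plots internally, which is an assumption rather than something the notes above depend on.

```r
library(ggplot2)
library(palmerpenguins)

# Build the plot without printing it, so nothing is drawn.
p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")

# Aesthetics set globally in ggplot(); every layer inherits these.
names(p$mapping)

# Aesthetics set locally in the first layer, geom_point().
# ggplot2 standardizes the spelling "color" to "colour".
names(p$layers[[1]]$mapping)

# The second layer, geom_smooth(), has no local mapping of its own,
# so it only uses the inherited x and y and fits a single line.
names(p$layers[[2]]$mapping)
```

Because species is mapped only inside geom_point(), it never reaches geom_smooth(), which is why the points are colored but only one regression line is drawn.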
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  )
`geom_smooth()` using formula = 'y ~ x'
Exercises
1) How many rows are in penguins? How many columns?
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
- There are 344 rows and 8 columns in the penguins data set.
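The same answer can be read off programmatically instead of from the printed tibble header. A small sketch using base R only:

```r
library(palmerpenguins)

# dim() returns the number of rows followed by the number of columns.
dim(penguins)

# nrow() and ncol() return each count on its own.
nrow(penguins)
ncol(penguins)
```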
2) What does the bill_depth_mm variable in the penguins data frame describe?
?penguins
starting httpd help server ... done
- bill_depth_mm describes a number denoting bill depth (millimeters).
3) Make a scatterplot of bill_depth_mm versus bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.
ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point()
There does not appear to be a relationship between the two variables. However, introducing an additional variable may reveal one. See the plot below.
ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm, color = species, shape = species)
) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
With the introduction of the species variable, we now see that bill_length_mm and bill_depth_mm have a positive linear relationship within each species.
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = species,
                     color = species,
                     shape = species)
) +
  geom_point()
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = species,
                     color = species,
                     shape = species)
) +
  geom_boxplot()
- The box plot is more informative than the scatter plot because we can now see quartiles clearly represented. This helps us understand the distribution more effectively than the scatter plot.
5) Why does the following give an error and how would you fix it?
# ggplot(data = penguins) +
#   geom_point()
The error message shows that geom_point() requires the x and y aesthetics, which are missing from the call. This can be fixed by supplying them:
ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = species)) +
  geom_point()
6) What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm)
) +
  geom_point(na.rm = TRUE)
- If na.rm is FALSE (the default), missing values are removed with a warning. If TRUE, missing values are silently removed. In other words, by default we receive a warning message, whereas setting na.rm = TRUE suppresses it; the missing values are removed either way.
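To see how many points are affected, the rows that geom_point() has to drop can be counted directly. This is a minimal sketch in base R; the variable name missing_bill is illustrative and not part of the exercise.

```r
library(palmerpenguins)

# A point cannot be drawn if either coordinate is missing,
# so flag rows where bill length or bill depth is NA.
missing_bill <- is.na(penguins$bill_length_mm) | is.na(penguins$bill_depth_mm)

# Number of rows that will be dropped from the scatterplot.
sum(missing_bill)

# Row positions of the dropped observations.
which(missing_bill)
```

The count is the same whether na.rm is TRUE or FALSE; the argument only controls whether ggplot2 reports the removal.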
7) Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point() +
  labs(caption = "Data come from the palmerpenguins package.")
8) Re-create the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm,
                     y = body_mass_g,
                     color = bill_depth_mm)
) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
9) Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
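- Prediction: similar to the plot above; however, the island variable is new, and the line of best fit will be solid blue without the error range. Actual: because color = island is mapped at the global level, geom_smooth() inherits it and draws a separate smooth curve for each island, colored by island; se = FALSE removes the confidence bands.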
10) Will these two graphs look different? Why/why not?
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Prediction: I think the plots will look different; however, I’m not sure why. I say this because the mapping is specified in a different location. If I had to guess, the second plot will have redundant mapping of the x and y labels. Actual: the plots are identical, because both layers receive the same data and aesthetic mappings whether these are set globally in ggplot() or repeated locally in each geom.
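One way to confirm the result without comparing pictures by eye is to compare the data ggplot2 computes for a layer in each version. The sketch below is an assumption-laden check rather than part of the exercise: the names p_global and p_local are illustrative, only the point layer is compared, and layer_data() is used to extract the computed layer data.

```r
library(ggplot2)
library(palmerpenguins)

# Version 1: data and mapping set globally in ggplot().
p_global <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(na.rm = TRUE)

# Version 2: the same data and mapping repeated inside the geom.
p_local <- ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g),
    na.rm = TRUE
  )

# Extract the computed data for the first (and only) layer of each plot.
points_global <- layer_data(p_global, 1)
points_local <- layer_data(p_local, 1)

# If the two specifications are equivalent, the computed x and y
# positions should agree.
same_points <- isTRUE(all.equal(
  points_global[, c("x", "y")],
  points_local[, c("x", "y")]
))
same_points
```

The local form is only needed when layers should use different data or mappings; otherwise the global form avoids the repetition.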