Chapter 1. Data Visualization

R for Data Science Practice

This document is the start of a larger collection of my favorite resources, examples, and personal analysis.
Author

Mac Walker

Published

27-09-2023


---

Load Libraries

library(palmerpenguins)
library(ggthemes)
library(tidyverse)
library(knitr)
library(tinytex)

Business Problem

  • Do penguins with longer flippers weigh more or less than penguins with shorter flippers?
  • What does the relationship between flipper length and body mass look like? - Is it positive? Negative? Linear? Nonlinear?
  • Does the relationship vary by the species of the penguin?
  • How about by the island where the penguin lives?

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
view(penguins)

Creating our plot

Our ultimate goal in this chapter is to re-create the following visualization displaying the relationship between flipper lengths and body masses of these penguins, taking into consideration the species of the penguin.

ggplot(data = penguins)

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

<!--#

When aesthetic mappings are defined in ggplot(), at the global level, they’re passed down to each of the subsequent geom layers of the plot. However, each geom function in ggplot2 can also take a mapping argument, which allows for aesthetic mappings at the local level that are added to those inherited from the global level. Since we want points to be colored based on species but don’t want the lines to be separated out for them, we should specify color = species for geom_point() only.

-->
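
As a quick check of the inheritance described above, the sketch below builds the plot and inspects the mappings stored on the plot object and on each layer. Reading `p$mapping` and `p$layers` relies on how ggplot2 happens to store plots, so treat it as an exploratory aid rather than a stable interface; the name `p` is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# x and y are mapped globally; color is mapped only in geom_point()
p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")

# Global aesthetics, inherited by every layer
names(p$mapping)

# Local aesthetics of the point layer (aes() stores "color" as "colour")
names(p$layers[[1]]$mapping)

# The smooth layer has no local aesthetics, so it is not split by species
length(p$layers[[2]]$mapping)
```

Because the smooth layer only sees the inherited x and y, it fits one line through all the penguins.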

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  )
`geom_smooth()` using formula = 'y ~ x'

Exercises

1) How many rows are in penguins? How many columns?

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
  • There are 344 rows and 8 columns in the penguins data set.
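
The same answer can be read off programmatically instead of from the tibble header. This is a minimal sketch using base R; the names `n_rows` and `n_cols` are only for illustration.

```r
library(palmerpenguins)

n_rows <- nrow(penguins)  # number of observations (one per penguin)
n_cols <- ncol(penguins)  # number of variables
dim(penguins)             # both at once: rows first, then columns
```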

2) What does the bill_depth_mm variable in the penguins data frame describe?

?penguins
starting httpd help server ... done
  • bill_depth_mm describes a number denoting bill depth (millimeters)

3) Make a scatterplot of bill_depth_mm versus bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.

ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point()

  • There does not appear to be a clear relationship between the two variables overall. However, adding another variable may reveal one; see the plot below.

    ggplot(
      data = penguins,
      mapping = aes(x = bill_length_mm, y = bill_depth_mm, color = species, shape = species)
    ) +
      geom_point() +
      geom_smooth(method = "lm")
    `geom_smooth()` using formula = 'y ~ x'

  • With species mapped to color and shape, we can now see that bill_length_mm and bill_depth_mm have a positive linear relationship within each species.

4) What happens if you make a scatterplot of species versus bill_length_mm? What might be a better choice of geom?

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = species, 
                     color = species, 
                     shape = species)
       ) +
  geom_point()

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = species, 
                     color = species, 
                     shape = species)
       ) +
  geom_boxplot()

  • The box plot is more informative than the scatter plot because it shows the median and quartiles of each species directly. In the scatter plot, the points pile up along a single row per species, which makes the distribution hard to read.
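
The quartiles that the box plot draws can also be computed directly, which is a handy cross-check. The sketch below assumes dplyr (loaded as part of the tidyverse) and R 4.1 or later for the native pipe; `bill_summary` and its column names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# First quartile, median, and third quartile of bill length per species
bill_summary <- penguins |>
  group_by(species) |>
  summarize(
    q1     = quantile(bill_length_mm, 0.25, na.rm = TRUE),
    median = median(bill_length_mm, na.rm = TRUE),
    q3     = quantile(bill_length_mm, 0.75, na.rm = TRUE)
  )

bill_summary
```

These values should correspond to the edges and middle line of each box.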

5) Why does the following give an error and how would you fix it?

# ggplot(data = penguins) +
#   geom_point()

  • The error message tells us that geom_point() requires the x and y aesthetics, which are missing because no mapping was supplied. This can be fixed by adding one:

    ggplot(data = penguins,
           mapping = aes(x = bill_depth_mm, y = species)
           ) +
      geom_point()

6) What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm)
       ) +
  geom_point(na.rm = TRUE)

  • na.rm controls whether a warning is issued when missing values are dropped. If na.rm = FALSE (the default), missing values are removed with a warning. If na.rm = TRUE, missing values are removed silently. The missing values are removed either way; only the warning changes.
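
To see which rows are affected, the sketch below counts the penguins whose bill measurements are missing and drops them explicitly before plotting. This is an alternative to na.rm = TRUE for illustration, not a replacement for it; `missing_xy` and `penguins_complete` are made-up names.

```r
library(ggplot2)
library(palmerpenguins)

# Rows geom_point() cannot draw: either coordinate is NA
missing_xy <- is.na(penguins$bill_length_mm) | is.na(penguins$bill_depth_mm)
sum(missing_xy)

# Removing those rows up front leaves nothing for geom_point() to drop
penguins_complete <- penguins[!missing_xy, ]
nrow(penguins_complete)

ggplot(
  data = penguins_complete,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point()
```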

7) Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)
       ) +
  geom_point() +
  labs(caption = "Data come from the palmerpenguins package.")

8) Re-create the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm,
                     y = body_mass_g,
                     color = bill_depth_mm)
       ) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
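
On the second half of the question: since the color gradient is meant for the points only, bill_depth_mm can instead be mapped inside geom_point(). The sketch below is an alternative to the global mapping used above, not necessarily the book's reference solution. Recent ggplot2 versions may also warn that the colour aesthetic is dropped by geom_smooth() when a continuous variable is mapped to it globally. The name `p_depth` is illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# x and y are shared by both layers; color applies to the points only
p_depth <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = bill_depth_mm)) +
  geom_smooth()

p_depth
```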

9) Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

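  • Prediction: similar to the plot above, except that the island variable is new and the line of best fit will be solid blue without the error range. Actual: because color = island is mapped globally, geom_smooth() inherits it and draws a separate curve for each island in that island's color, with no confidence band since se = FALSE.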
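
A prediction like this can also be checked without relying on the eye. The sketch below uses ggplot2's layer_data() to pull out the data computed for the smooth layer and counts how many separate curves it contains; `p_island`, `smooth_data`, and `n_curves` are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

p_island <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)

# Layer 2 is geom_smooth(); each group is drawn as its own curve
smooth_data <- layer_data(p_island, 2)

n_curves <- length(unique(smooth_data$group))
n_curves

# Colors assigned to the curves
unique(smooth_data$colour)
```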

10) Will these two graphs look different? Why/why not?

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Prediction: I think the plots will look different, though I’m not sure why. My reasoning is that the mapping is specified in a different place; if I had to guess, the second plot will have redundant x and y labels. Actual: the plots are identical. Both geoms receive the same data and aesthetic mappings in each version; the first inherits them from ggplot(), while the second repeats them in every layer.
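
One way to confirm this beyond comparing the pictures is to compare the data each layer computes. The sketch below assumes layer_data() returns equivalent data frames when the inputs are the same; `p_inherit` and `p_explicit` are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

# Data and mapping set once in ggplot() and inherited by both layers
p_inherit <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

# Data and mapping repeated in each layer
p_explicit <- ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )

# Compare the computed data layer by layer (1 = points, 2 = smooth)
same_points <- isTRUE(all.equal(layer_data(p_inherit, 1), layer_data(p_explicit, 1)))
same_smooth <- isTRUE(all.equal(layer_data(p_inherit, 2), layer_data(p_explicit, 2)))
c(points = same_points, smooth = same_smooth)
```

The first form is usually preferable because the shared data and mapping are written only once.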