A.3 Solutions (03)

ds4psy: Solutions 3

Here are the solutions of the exercises on essential dplyr commands of Chapter 3 (Section 3.5).

A.3.1 Exercise 1

Reshaping vs. reducing data

Discuss the main dplyr functions in terms of reshaping and reducing data (as introduced in Section 3.1.1).

Solution

Using the overview of essential dplyr functions (in Section 3.2), we can say:

  1. arrange() reshapes data by sorting cases (rows);
  2. filter() and slice() reduce data by selecting cases (rows) by logical conditions or number;
  3. select() reduces data by selecting or reshapes data by reordering variables (columns);
  4. mutate() reduces and enhances data by computing new variables (columns) and adding them to the existing ones;
  5. summarise() reduces data by aggregating/collapsing multiple values of a variable (rows of a column) to a single one;
  6. group_by() and ungroup() reshape data (in a subtle way by) changing the unit of aggregation (in combination with mutate() and summarise()). This is usually done as a preparatory step for reducing or enhancing data.

A.3.2 Exercise 2

Star and R wars

We start tackling the tidyverse by uncovering even more facts about the dplyr::starwars universe. Answer the following questions by using pipes of basic dplyr commands (i.e., by arranging, filtering, selecting, grouping, counting, summarizing).

  • Save the tibble dplyr::starwars as sw and report its dimensions.
## Load data: ----- 
sw <- dplyr::starwars
# ?dplyr::starwars  # codebook of variables

## Basic data properties: ---- 
# sw     # print tibble
dim(sw)  # 87 rows (denoting individuals) x 13 columns (variables) 
#> [1] 87 14

Known unknowns

  • How many missing (NA) values does sw contain?

  • Which variable (column) has the most missing values?

  • Which individuals come from an unknown (missing) homeworld but have a known birth_year or known mass?

Solution

## Missing data: ----- 

# How many missing data points?
sum(is.na(sw))  # => 101 missing values.
#> [1] 105

# Which individuals come from an unknown (missing) homeworld 
# but have a known birth_year or mass? 
sw %>% 
  filter(is.na(homeworld), !is.na(mass) | !is.na(birth_year))
#> # A tibble: 3 × 14
#>   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#> 1 Yoda             66    17 white   green   brown       896 male  mascu… <NA>   
#> 2 IG-88           200   140 none    metal   red          15 none  mascu… <NA>   
#> 3 Qui-Gon Jinn    193    89 brown   fair    blue         92 male  mascu… <NA>   
#> # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
#> #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
#> #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

# Which variable (column) has the most missing values?
colSums(is.na(sw))  # => birth_year has 44 missing values
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>          0          6         28          5          0          0         44 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>          4          4         10          4          0          0          0
colMeans(is.na(sw)) #    (amounting to 50.1% of all cases). 
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000 0.50574713 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#> 0.04597701 0.04597701 0.11494253 0.04597701 0.00000000 0.00000000 0.00000000

## (x) Replace all missing values of `hair_color` (in the variable `sw$hair_color`) by "bald": 
# sw$hair_color[is.na(sw$hair_color)] <- "bald"

Gender issues

  • How many humans are contained in sw overall and by gender?

  • How many and which individuals in sw are neither male nor female?

  • Of which species in sw exist at least 2 different gender values?

Solution

## (c) Gender issues: ----- 

# (+) How many humans are there of each gender?
sw %>% 
  filter(species == "Human") %>%
  group_by(gender) %>%
  count()
#> # A tibble: 2 × 2
#> # Groups:   gender [2]
#>   gender        n
#>   <chr>     <int>
#> 1 feminine      9
#> 2 masculine    26

## Answer: 35 Humans in total: 9 females, 26 male.

# (+) How many and which individuals are neither male nor female?
sw %>% 
  filter(gender != "male", gender != "female")
#> # A tibble: 83 × 14
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
#>  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
#>  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
#>  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
#>  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
#>  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
#>  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
#>  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
#>  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
#> 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
#> # … with 73 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

# (+) Of which species are there at least 2 different gender values?
sw %>%
  group_by(species, gender) %>%
  count() %>%  # table shows species by gender: 
  group_by(species) %>%  # Which species appear more than once in this table? 
  count() %>%
  filter(n > 1)
#> # A tibble: 4 × 2
#> # Groups:   species [4]
#>   species      n
#>   <chr>    <int>
#> 1 Droid        2
#> 2 Human        2
#> 3 Kaminoan     2
#> 4 Twi'lek      2

# alternative (and shorter) solution:
sw %>%
  group_by(species) %>%
  summarise(n_gender_vals = n_distinct(gender)) %>%
  filter(n_gender_vals >= 2)
#> # A tibble: 4 × 2
#>   species  n_gender_vals
#>   <chr>            <int>
#> 1 Droid                2
#> 2 Human                2
#> 3 Kaminoan             2
#> 4 Twi'lek              2
  • Bonus task: R typically provides many ways to obtain a solution. Let’s gain an overview of the gender distribution in our sw dataset in three different ways:

    • Use a dplyr pipe to compute a summary table tb that counts the frequency of each gender in sw.
    • Use ggplot2 on the raw data of sw to create a bar chart (A) that shows the same gender distribution.
    • Use ggplot2 on the summary table tb to create a bar chart (B) that shows the same gender distribution.

Solution

  • Computing the summary table tb is straightforward:
# Summary table:
tb <- sw %>%
  group_by(gender) %>%
  count()

tb
#> # A tibble: 3 × 2
#> # Groups:   gender [3]
#>   gender        n
#>   <chr>     <int>
#> 1 feminine     17
#> 2 masculine    66
#> 3 <NA>          4
  • Creating the same gender distribution as two plots:

    • A: from the raw data sw
    • B: from the summary table tb
# Define some colors:
my_cols <- unikn::usecol(c(Pinky, Seegruen, Seeblau, Grau))

# Plot A: From raw data
ggplot(sw, aes(x = gender)) +
  geom_bar(aes(fill = gender)) + 
  # Set labels, colors, and theme: 
  labs(title = "Gender distribution of the sw Universe", tag = "A",  
       x = "Gender", y = "Frequency", caption = "Using raw data sw.") + 
  scale_fill_manual(name = "Gender:", values = my_cols, na.value = "black") + 
  ds4psy::theme_ds4psy()


# Plot B: From summary table:
ggplot(tb, aes(x = gender, y = n)) +
  geom_bar(aes(fill = gender), stat = "identity") + 
  # Set labels, colors, and theme: 
  labs(title = "Gender distribution of the sw Universe", tag = "B", 
       x = "Gender", y = "Frequency", caption = "Using summary data tb.") + 
  scale_fill_manual(name = "Gender:", values = my_cols, na.value = "black") + 
  ds4psy::theme_ds4psy()

Note: Take a moment to reflect on the differences between both plots:

Both plots may look the same, but are internally quite different:

  • Plot A uses the raw data sw as its input and computes the frequency counts on the y-axis (as the default setting of geom_bar is stat = "count" and the resulting counts are automatically mapped to the y-axis).

  • Plot B uses the summary table tb as its input and maps the existing values of the column n on the y-axis. This requires setting the aesthetics to y = n and using stat = "identity" for geom_bar.)

Which of these ways we decide to use to plot summary results is a matter of preference, but has consequences:

  • On the one hand, using both a dplyr pipe and a ggplot2 command that re-computes the same counts (in Plot A) could help us to double-check our results. For instance, a common error is omitting the na.value argument when choosing your own colors with scale_fill_manual. This would drop the NA values from the plot — which we may never notice, if we had not computed the summary table tb before.

  • On the other hand, if we already have a summary table tb, creating Plot B ensures that we are plotting exactly the contents of this table. Anything that changes the table will also change the plot.

Solution

## (d) Homeworld issues: ----- 

# (+) Popular homes: From which homeworld do the most indidividuals (rows) come from? 
sw %>%
  group_by(homeworld) %>%
  count() %>%
  arrange(desc(n))
#> # A tibble: 49 × 2
#> # Groups:   homeworld [49]
#>    homeworld     n
#>    <chr>     <int>
#>  1 Naboo        11
#>  2 Tatooine     10
#>  3 <NA>         10
#>  4 Alderaan      3
#>  5 Coruscant     3
#>  6 Kamino        3
#>  7 Corellia      2
#>  8 Kashyyyk      2
#>  9 Mirial        2
#> 10 Ryloth        2
#> # … with 39 more rows
# => Naboo (with 11 individuals)

# (+) What is the mean height of all individuals with orange eyes from the most popular homeworld? 
sw %>% 
  filter(homeworld == "Naboo", eye_color == "orange") %>%
  summarise(n = n(),
            mn_height = mean(height))
#> # A tibble: 1 × 2
#>       n mn_height
#>   <int>     <dbl>
#> 1     3      209.

## Note: 
sw %>% 
  filter(eye_color == "orange") # => 8 individuals
#> # A tibble: 8 × 14
#>   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#> 1 Jabba Desil…    175  1358 <NA>    green-… orange      600 herm… mascu… Nal Hu…
#> 2 Ackbar          180    83 none    brown … orange       41 male  mascu… Mon Ca…
#> 3 Jar Jar Bin…    196    66 none    orange  orange       52 male  mascu… Naboo  
#> 4 Roos Tarpals    224    82 none    grey    orange       NA male  mascu… Naboo  
#> 5 Rugor Nass      206    NA none    green   orange       NA male  mascu… Naboo  
#> 6 Sebulba         112    40 none    grey, … orange       NA male  mascu… Malast…
#> 7 Ben Quadina…    163    65 none    grey, … orange       NA male  mascu… Tund   
#> 8 Saesee Tiin     188    NA none    pale    orange       NA male  mascu… Iktotch
#> # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
#> #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
#> #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

# (+) What is the mass and homeworld of the smallest droid?
sw %>% 
  filter(species == "Droid") %>%
  arrange(height)
#> # A tibble: 6 × 14
#>   name   height  mass hair_color skin_color eye_c…¹ birth…² sex   gender homew…³
#>   <chr>   <int> <dbl> <chr>      <chr>      <chr>     <dbl> <chr> <chr>  <chr>  
#> 1 R2-D2      96    32 <NA>       white, bl… red          33 none  mascu… Naboo  
#> 2 R4-P17     96    NA none       silver, r… red, b…      NA none  femin… <NA>   
#> 3 R5-D4      97    32 <NA>       white, red red          NA none  mascu… Tatooi…
#> 4 C-3PO     167    75 <NA>       gold       yellow      112 none  mascu… Tatooi…
#> 5 IG-88     200   140 none       metal      red          15 none  mascu… <NA>   
#> 6 BB8        NA    NA none       none       black        NA none  mascu… <NA>   
#> # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
#> #   starships <list>, and abbreviated variable names ¹​eye_color, ²​birth_year,
#> #   ³​homeworld

Size and mass issues

  • Compute the median, mean, and standard deviation of height for all droids.

  • Compute the average height and mass by species and save the result as h_m.

  • Sort h_m to list the three species with the smallest individuals (in terms of mean height).

  • Sort h_m to list the three species with the heaviest individuals (in terms of median mass).

Solution

## Size and mass issues (group summaries): ----- 

# (+) Compute the median, mean, and standard deviation of `height` for all droids.
sw %>%
  filter(species == "Droid") %>%
  summarise(n = n(),
            not_NA_h = sum(!is.na(height)),
            md_height = median(height, na.rm = TRUE),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 × 5
#>       n not_NA_h md_height mn_height sd_height
#>   <int>    <int>     <int>     <dbl>     <dbl>
#> 1     6        5        97      131.      49.1

# (+) Compute the average height and mass by species and save the result as `h_m`:
h_m <- sw %>%
  group_by(species) %>%
  summarise(n = n(),
            not_NA_h = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            not_NA_m = sum(!is.na(mass)),
            md_mass = median(mass, na.rm = TRUE)
            )
h_m
#> # A tibble: 38 × 6
#>    species       n not_NA_h mn_height not_NA_m md_mass
#>    <chr>     <int>    <int>     <dbl>    <int>   <dbl>
#>  1 Aleena        1        1       79         1    15  
#>  2 Besalisk      1        1      198         1   102  
#>  3 Cerean        1        1      198         1    82  
#>  4 Chagrian      1        1      196         0    NA  
#>  5 Clawdite      1        1      168         1    55  
#>  6 Droid         6        5      131.        4    53.5
#>  7 Dug           1        1      112         1    40  
#>  8 Ewok          1        1       88         1    20  
#>  9 Geonosian     1        1      183         1    80  
#> 10 Gungan        3        3      209.        2    74  
#> # … with 28 more rows

# (+) Use `h_m` to list the 3 species with the smallest individuals (in terms of mean height)?
h_m %>% arrange(mn_height) %>% slice(1:3)
#> # A tibble: 3 × 6
#>   species            n not_NA_h mn_height not_NA_m md_mass
#>   <chr>          <int>    <int>     <dbl>    <int>   <dbl>
#> 1 Yoda's species     1        1        66        1      17
#> 2 Aleena             1        1        79        1      15
#> 3 Ewok               1        1        88        1      20

# (+) Use `h_m` to list the 3 species with the heaviest individuals (in terms of median mass)?
h_m %>% arrange(desc(md_mass)) %>%  slice(1:3)
#> # A tibble: 3 × 6
#>   species     n not_NA_h mn_height not_NA_m md_mass
#>   <chr>   <int>    <int>     <dbl>    <int>   <dbl>
#> 1 Hutt        1        1       175        1    1358
#> 2 Kaleesh     1        1       216        1     159
#> 3 Wookiee     2        2       231        2     124

Bonus tasks

The following bonus tasks are more difficult, but can be solved with a single dplyr pipe:

  • How many individuals come from the three most frequent (known) species?

  • Which individuals are more than 20% lighter (in terms of mass) than the average mass of individuals of their own homeworld?

Solution

# How many individuals come from the 3 most frequent (known) species?
sw %>%
  group_by(species) %>%
  count %>%
  arrange(desc(n)) %>%
  filter(n > 1)
#> # A tibble: 9 × 2
#> # Groups:   species [9]
#>   species      n
#>   <chr>    <int>
#> 1 Human       35
#> 2 Droid        6
#> 3 <NA>         4
#> 4 Gungan       3
#> 5 Kaminoan     2
#> 6 Mirialan     2
#> 7 Twi'lek      2
#> 8 Wookiee      2
#> 9 Zabrak       2

# Which individuals are more than 20% lighter (in terms of mass) 
# than the average mass of individuals of their own homeworld?
sw %>%
  select(name, homeworld, mass) %>%
  group_by(homeworld) %>%
  mutate(n_notNA_mass = sum(!is.na(mass)),  
         mn_mass = mean(mass, na.rm = TRUE),
         lighter = mass < (mn_mass - (.20 * mn_mass))
         ) %>%
  filter(lighter == TRUE)
#> # A tibble: 5 × 6
#> # Groups:   homeworld [4]
#>   name          homeworld  mass n_notNA_mass mn_mass lighter
#>   <chr>         <chr>     <dbl>        <int>   <dbl> <lgl>  
#> 1 R2-D2         Naboo        32            6    64.2 TRUE   
#> 2 Leia Organa   Alderaan     49            2    64   TRUE   
#> 3 R5-D4         Tatooine     32            8    85.4 TRUE   
#> 4 Yoda          <NA>         17            3    82   TRUE   
#> 5 Padmé Amidala Naboo        45            6    64.2 TRUE

A.3.3 Exercise 3

Sleeping mammals

The dataset ggplot2::msleep contains a mammals sleep dataset (see ?msleep for details and the definition of variables).

  • Save the data as sp and check the dimensions, variable types, and number of missing values in the dataset.

Solution

## Data: 
# ?msleep  # check variables     
sp <- ggplot2::msleep
# Check: 
dim(sp)        # 83 x 11 variables
#> [1] 83 11
glimpse(sp)    # 5 <chr> and 6 <dbl>
#> Rows: 83
#> Columns: 11
#> $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
#> $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
#> $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
#> $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
#> $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
#> $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
#> $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
#> $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
#> $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
#> $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…
sum(is.na(sp)) # 136 missing values
#> [1] 136

Arranging and filtering data

Use the dplyr-verbs arrange(), group_by(), and filter() to answer the following questions by creating ordered subsets of the data:

  • Arrange the rows (alphabetically) by vore, order, and name, and report the genus of the top three mammals.

  • What is the most common type of vore in the data? How many omnivores are there?

  • What is the most common order in the dataset? Are there more exemplars of the order “Carnivora” or “Primates”?

  • Which two mammals of the order “Primates” have the longest and shortest sleep_total times?

Solution

# Arranging rows:
sp1 <- sp %>%
  arrange(vore, order, name)
sp1$genus[1:3]  # => top 3: "Vulpes"   "Phoca"    "Acinonyx"
#> [1] "Vulpes"   "Phoca"    "Acinonyx"

# Counting common vores:
# (a) short solution:
sp %>%
  count(vore) %>%
  arrange(desc(n))  
#> # A tibble: 5 × 2
#>   vore        n
#>   <chr>   <int>
#> 1 herbi      32
#> 2 omni       20
#> 3 carni      19
#> 4 <NA>        7
#> 5 insecti     5

# (a) is the same as (b): 
sp %>% 
  group_by(vore) %>%
  tally() %>%
  arrange(desc(n))
#> # A tibble: 5 × 2
#>   vore        n
#>   <chr>   <int>
#> 1 herbi      32
#> 2 omni       20
#> 3 carni      19
#> 4 <NA>        7
#> 5 insecti     5

# (b) is the same as (c):  
sp %>% 
  group_by(vore) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
#> # A tibble: 5 × 2
#>   vore        n
#>   <chr>   <int>
#> 1 herbi      32
#> 2 omni       20
#> 3 carni      19
#> 4 <NA>        7
#> 5 insecti     5

# => 32 herbivores, 20 omnivores

# Counting common orders: 
sp %>% 
  count(order) %>%
  arrange(desc(n)) 
#> # A tibble: 19 × 2
#>    order               n
#>    <chr>           <int>
#>  1 Rodentia           22
#>  2 Carnivora          12
#>  3 Primates           12
#>  4 Artiodactyla        6
#>  5 Soricomorpha        5
#>  6 Cetacea             3
#>  7 Hyracoidea          3
#>  8 Perissodactyla      3
#>  9 Chiroptera          2
#> 10 Cingulata           2
#> 11 Didelphimorphia     2
#> 12 Diprotodontia       2
#> 13 Erinaceomorpha      2
#> 14 Proboscidea         2
#> 15 Afrosoricida        1
#> 16 Lagomorpha          1
#> 17 Monotremata         1
#> 18 Pilosa              1
#> 19 Scandentia          1

# OR: 
sp %>% 
  group_by(order) %>%
  count() %>%
  arrange(desc(n)) 
#> # A tibble: 19 × 2
#> # Groups:   order [19]
#>    order               n
#>    <chr>           <int>
#>  1 Rodentia           22
#>  2 Carnivora          12
#>  3 Primates           12
#>  4 Artiodactyla        6
#>  5 Soricomorpha        5
#>  6 Cetacea             3
#>  7 Hyracoidea          3
#>  8 Perissodactyla      3
#>  9 Chiroptera          2
#> 10 Cingulata           2
#> 11 Didelphimorphia     2
#> 12 Diprotodontia       2
#> 13 Erinaceomorpha      2
#> 14 Proboscidea         2
#> 15 Afrosoricida        1
#> 16 Lagomorpha          1
#> 17 Monotremata         1
#> 18 Pilosa              1
#> 19 Scandentia          1
# => 22 Rodentia, 12 Carnivora = 12 Primates

# Primates with longest and shortest sleep_total times:
sp %>% 
  filter(order == "Primates") %>%
  arrange(desc(sleep_total)) 
#> # A tibble: 12 × 11
#>    name   genus vore  order conse…¹ sleep…² sleep…³ sleep…⁴ awake brainwt bodywt
#>    <chr>  <chr> <chr> <chr> <chr>     <dbl>   <dbl>   <dbl> <dbl>   <dbl>  <dbl>
#>  1 Owl m… Aotus omni  Prim… <NA>       17       1.8  NA       7    0.0155  0.48 
#>  2 Slow … Nyct… carni Prim… <NA>       11      NA    NA      13    0.0125  1.4  
#>  3 Potto  Pero… omni  Prim… lc         11      NA    NA      13   NA       1.1  
#>  4 Patas… Eryt… omni  Prim… lc         10.9     1.1  NA      13.1  0.115  10    
#>  5 Macaq… Maca… omni  Prim… <NA>       10.1     1.2   0.75   13.9  0.179   6.8  
#>  6 Grivet Cerc… omni  Prim… lc         10       0.7  NA      14   NA       4.75 
#>  7 Galago Gala… omni  Prim… <NA>        9.8     1.1   0.55   14.2  0.005   0.2  
#>  8 Chimp… Pan   omni  Prim… <NA>        9.7     1.4   1.42   14.3  0.44   52.2  
#>  9 Squir… Saim… omni  Prim… <NA>        9.6     1.4  NA      14.4  0.02    0.743
#> 10 Mongo… Lemur herbi Prim… vu          9.5     0.9  NA      14.5 NA       1.67 
#> 11 Baboon Papio omni  Prim… <NA>        9.4     1     0.667  14.6  0.18   25.2  
#> 12 Human  Homo  omni  Prim… <NA>        8       1.9   1.5    16    1.32   62    
#> # … with abbreviated variable names ¹​conservation, ²​sleep_total, ³​sleep_rem,
#> #   ⁴​sleep_cycle
# => max: Owl monkey: 17 hours, min: Human 8 hours.

Computing new variables

Solve the following tasks by combining the dplyr commands mutate, group_by, and summarise:

  • Compute a variable sleep_awake_sum that adds the sleep_total time and the awake time of each mammal. What result do you expect and get?

  • Which animals have the smallest and largest brain to body ratio (in terms of weight)? How many mammals have a larger ratio than humans?

  • What is the minimum, average (mean), and maximum sleep cycle length for each vore? (Hint: First group the data by group_by, then use summarise on the sleep_cycle variable, but also count the number of NA values for each vore. When computing grouped summaries, NA values can be removed by na.rm = TRUE.)

  • Replace your summarise verb in the previous task by mutate. What do you get as a result? (Hint: The last two tasks illustrate the difference between mutate and grouped mutate commands.)

Solution

# Computing `sleep_awake_sum`: 
sp2 <- sp %>%
  mutate(sleep_awake_sum = sleep_total + awake)
sp2$sleep_awake_sum  # => all 24 hours 
#>  [1] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [13] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [25] 24.00 24.00 24.00 24.00 24.00 24.00 24.05 24.00 24.00 24.00 24.00 24.00
#> [37] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [49] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.05
#> [61] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [73] 24.00 24.00 24.00
#>  [ reached getOption("max.print") -- omitted 8 entries ]

# Computing brain to body ratios:
sp3 <- sp %>%
  mutate(brain_to_body_wt = brainwt / bodywt) %>%
  arrange(brain_to_body_wt)

sp3$name[1] # => smallest ratio: Cow
#> [1] "Cow"

sp3 %>% 
  arrange(desc(brain_to_body_wt))
#> # A tibble: 83 × 12
#>    name   genus vore  order conse…¹ sleep…² sleep…³ sleep…⁴ awake brainwt bodywt
#>    <chr>  <chr> <chr> <chr> <chr>     <dbl>   <dbl>   <dbl> <dbl>   <dbl>  <dbl>
#>  1 Thirt… Sper… herbi Rode… lc         13.8     3.4   0.217  10.2 0.004    0.101
#>  2 Owl m… Aotus omni  Prim… <NA>       17       1.8  NA       7   0.0155   0.48 
#>  3 Lesse… Cryp… omni  Sori… lc          9.1     1.4   0.15   14.9 0.00014  0.005
#>  4 Squir… Saim… omni  Prim… <NA>        9.6     1.4  NA      14.4 0.02     0.743
#>  5 Macaq… Maca… omni  Prim… <NA>       10.1     1.2   0.75   13.9 0.179    6.8  
#>  6 Littl… Myot… inse… Chir… <NA>       19.9     2     0.2     4.1 0.00025  0.01 
#>  7 Galago Gala… omni  Prim… <NA>        9.8     1.1   0.55   14.2 0.005    0.2  
#>  8 Mole … Spal… <NA>  Rode… <NA>       10.6     2.4  NA      13.4 0.003    0.122
#>  9 Tree … Tupa… omni  Scan… <NA>        8.9     2.6   0.233  15.1 0.0025   0.104
#> 10 Human  Homo  omni  Prim… <NA>        8       1.9   1.5    16   1.32    62    
#> # … with 73 more rows, 1 more variable: brain_to_body_wt <dbl>, and abbreviated
#> #   variable names ¹​conservation, ²​sleep_total, ³​sleep_rem, ⁴​sleep_cycle
# => largest ratio: Thirteen-lined ground squirrel, 
#    9 mammals have larger brain_to_body_wt than Human.

# Computing sleep cycle length for each `vore`: 
sp %>%
  group_by(vore) %>%
  summarise(n = n(),
            n_NA = sum(is.na(sleep_cycle)), 
            non_NA = sum(!is.na(sleep_cycle)),
            min_cyc = min(sleep_cycle, na.rm = TRUE),
            mn_cyc = mean(sleep_cycle, na.rm = TRUE),
            max_cyc = max(sleep_cycle, na.rm = TRUE)
            )
#> # A tibble: 5 × 7
#>   vore        n  n_NA non_NA min_cyc mn_cyc max_cyc
#>   <chr>   <int> <int>  <int>   <dbl>  <dbl>   <dbl>
#> 1 carni      19    14      5   0.333  0.373   0.417
#> 2 herbi      32    20     12   0.117  0.418   1    
#> 3 insecti     5     2      3   0.117  0.161   0.2  
#> 4 omni       20     9     11   0.133  0.592   1.5  
#> 5 <NA>        7     6      1   0.183  0.183   0.183

# Replacing summarise by mutate: 
sp4 <- sp %>%
  group_by(vore) %>%
  mutate(n = n(),
         n_NA = sum(is.na(sleep_cycle)), 
         non_NA = sum(!is.na(sleep_cycle)),
         min_cyc = min(sleep_cycle, na.rm = TRUE),
         mn_cyc = mean(sleep_cycle, na.rm = TRUE),
         max_cyc = max(sleep_cycle, na.rm = TRUE)
  )

# => A tibble that contains grouped summaries as 6 new variables: 
sp4 %>% 
  select(vore, name, sleep_cycle, n:max_cyc) %>%
  arrange(vore)
#> # A tibble: 83 × 9
#> # Groups:   vore [5]
#>    vore  name                  sleep…¹     n  n_NA non_NA min_cyc mn_cyc max_cyc
#>    <chr> <chr>                   <dbl> <int> <int>  <int>   <dbl>  <dbl>   <dbl>
#>  1 carni Cheetah                NA        19    14      5   0.333  0.373   0.417
#>  2 carni Northern fur seal       0.383    19    14      5   0.333  0.373   0.417
#>  3 carni Dog                     0.333    19    14      5   0.333  0.373   0.417
#>  4 carni Long-nosed armadillo    0.383    19    14      5   0.333  0.373   0.417
#>  5 carni Domestic cat            0.417    19    14      5   0.333  0.373   0.417
#>  6 carni Pilot whale            NA        19    14      5   0.333  0.373   0.417
#>  7 carni Gray seal              NA        19    14      5   0.333  0.373   0.417
#>  8 carni Thick-tailed opposum   NA        19    14      5   0.333  0.373   0.417
#>  9 carni Slow loris             NA        19    14      5   0.333  0.373   0.417
#> 10 carni Northern grasshopper…  NA        19    14      5   0.333  0.373   0.417
#> # … with 73 more rows, and abbreviated variable name ¹​sleep_cycle

A.3.4 Exercise 4

Outliers

This exercise examines different possibilities for defining outliers and uses the outliers dataset of the ds4psy package (also available as out.csv at http://rpository.com/ds4psy/data/out.csv) to illustate and compare them. With respect to your insights into dplyr, this exercise helps disentangling mutate from grouped mutate commands.

Data on outliers

Use the outliers data (from the ds4psy package) or use the following read_csv() command to load the data into an R object entitled outliers:

# Data: 
outliers <- ds4psy::outliers  # from ds4psy package

## Alternatively, load csv data from online source (as comma-separated file): 
# outliers_2 <- readr::read_csv("http://rpository.com/ds4psy/data/out.csv")  # from online source

## Verify equality: 
# all.equal(ds4psy::outliers, outliers_2)

## Alternatively from a local data file: 
# outliers <- read_csv("out.csv")  # from current directory

Not all outliers are alike

An outlier can be defined as an individual whose value in some variable deviates by more than a given criterion (e.g., two standard deviations) from the mean of the variable. However, this definition is incomplete unless it also specifies the reference group over which the means and deviations are computed. In the following, we explore the implications of different reference groups.

Basic tasks

  • Save the data into a tibble outliers and report its number of observations and variables, and their types.

  • How many missing data values are there in outliers?

  • What is the gender (or sex) distribution in this sample?

  • Create a plot that shows the distribution of height values for each gender.

Solution

# Load and inspect data:
# outliers <- read_csv("out.csv") # read in csv-file
# outliers <- as_tibble(data)   # if data is not already a tibble
dim(outliers)  # => 1000 observations (rows) x 3 variables (columns)
#> [1] 1000    3

# Missing data points: 
sum(is.na(outliers))  # => 18 missing values
#> [1] 18

# Gender distribution: 
outliers %>% 
  group_by(sex) %>% 
  count()
#> # A tibble: 2 × 2
#> # Groups:   sex [2]
#>   sex        n
#>   <chr>  <int>
#> 1 female   507
#> 2 male     493
# => 50.7% females, 49.3% males.

# Distributions of `height` as density plot: 
ggplot(outliers, aes(x = height)) +
  geom_density(fill = "gold", alpha = 2/3) +
  geom_density(aes(fill = sex), alpha = 1/3) +
  labs(title = "Distribution of heights overall and by gender", 
       fill = "Gender") + 
  # scale_fill_brewer(palette = "Set1") +  # using Brewer palette
  scale_fill_manual(name = "Gender:", values = unikn::usecol(c(Pinky, Seeblau))) +  # unikn colors
  theme_bw()


# Note: To avoid the warning about removing 18 cases with NA-values, 
#       we could first filter out those cases:
# non_NA_data <- filter(outliers, !is.na(height))

# Alternative solution as 2 histograms: 
ggplot(outliers) +
  facet_wrap(~sex) + 
  geom_histogram(aes(x = height, fill = sex), binwidth = 5, color = "grey10") +
  labs(title = "Distribution of heights by gender",
       x = "Height", y = "Frequency") +
  # scale_fill_brewer(name = "Gender:", palette = "Set1") +  # using Brewer palette
  scale_fill_manual(name = "Gender:", values = unikn::usecol(c(Pinky, Seeblau))) +  # unikn colors
  theme_bw()

Defining different outliers

Compute 2 new variables that signal and distinguish between 2 types of outliers in terms of height:

  1. outliers relative to the height of the overall sample (i.e., individuals with height values deviating more than 2 SD from the overall mean of height);

  2. outliers relative to the height of some subgroup’s mean and SD. Here, a suitable subgroup to consider is every person’s gender (i.e., individuals with height values deviating more than 2 SD from the mean height of their own gender).

Hints: As both variable signal whether or not someone is an outlier they should be defined as logicals (being either TRUE or FALSE) and added as new columns to data (via appropriate mutate commands). While the 1st variable can be computed based on the mean and SD of the overall sample, the 2nd variable can be computed after grouping outliers by gender and then computing and using the corresponding mean and SD values. The absolute difference between 2 numeric values x and y is provided by abs(x - y).

Relative outliers

Now use the 2 new outlier variables to define (or filter) 2 subsets of the data that contain 2 subgroups of people:

  1. out_1: Individuals (females and males) with height values that are outliers relative to both the entire sample and the sample of their own gender. How many such individuals are in outliers?

  2. out_2: Individuals (females and males) with height values that are not outliers relative to the entire population, but are outliers relative to their own gender. How many such individuals are in outliers?

Solution

## Defining different outliers: ----- 

# Included in data_out (below), but also possible to do separately:  
# Compute the number, means and SD of height values in 2 ways: 

# 1. overall: 
outliers %>%
  summarise(n = n(),
            n_not_NA = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 × 4
#>       n n_not_NA mn_height sd_height
#>   <int>    <int>     <dbl>     <dbl>
#> 1  1000      982      175.      11.3

# 2. by gender:
outliers %>%
  group_by(sex) %>%
  summarise(n = n(),
            n_not_NA = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 2 × 5
#>   sex        n n_not_NA mn_height sd_height
#>   <chr>  <int>    <int>     <dbl>     <dbl>
#> 1 female   507      501      169.      8.19
#> 2 male     493      481      181.     11.0


# Detecting and marking outliers (by logical variables): 
# Compute the means, SDs, and corresponding outliers in 2 ways:
crit <- 2  # criterion value for detecting outliers (in SD units)

data_out <- outliers %>%      
  # 1. Compute means, SD, and outliers for overall sample: 
  mutate(mn_height  = mean(height, na.rm = TRUE),  
         sd_height  = sd(height, na.rm = TRUE),
         out_height = abs(height - mn_height) > (crit * sd_height)) %>%
  group_by(sex) %>%       
  # 2. Compute same metrics for subgroups (by sex):
  mutate(mn_sex_height  = mean(height, na.rm = TRUE), 
         sd_sex_height  = sd(height, na.rm = TRUE),
         out_sex_height = abs(height - mn_sex_height) > (crit * sd_sex_height))

knitr::kable(head(data_out))
id sex height mn_height sd_height out_height mn_sex_height sd_sex_height out_sex_height
nr.1 female 164 174.7006 11.26015 FALSE 169.0679 8.193619 FALSE
nr.2 male 178 174.7006 11.26015 FALSE 180.5676 11.026677 FALSE
nr.3 female 177 174.7006 11.26015 FALSE 169.0679 8.193619 FALSE
nr.4 male 188 174.7006 11.26015 FALSE 180.5676 11.026677 FALSE
nr.5 male 193 174.7006 11.26015 FALSE 180.5676 11.026677 FALSE
nr.6 female 168 174.7006 11.26015 FALSE 169.0679 8.193619 FALSE

## Relative outliers: ----- 

# Filter specific combinations of outliers:

# 1. Outliers relative to the entire population AND to their own gender: 
out_1 <- data_out %>%
  filter(out_height & out_sex_height) %>%
  arrange(sex, height)

nrow(out_1) # => 21 individuals. 
#> [1] 21

# 2. Outliers relative to their own gender, but NOT relative to the entire population:
out_2 <- data_out %>%
  filter(!out_height & out_sex_height) %>%
  arrange(sex, height)  

nrow(out_2) # => 24 individuals.
#> [1] 24

Bonus plots

  • Visualize the raw values and distributions of height for both types of outliers (out_1 and out_2) in 2 separate plots.

  • Interpret both plots by describing the height and sex combination of the individuals shown in each plot.

Solution

# Visualization and interpretation of both types of outliers: 

# 1. Showing out_1: 
ggplot(out_1, aes(x = sex, y = height)) +
  geom_violin(aes(fill = sex)) + 
  geom_jitter(size = 4, alpha = 2/3) + 
  # scale_fill_manual(values = c("firebrick", "steelblue3")) +
  scale_fill_manual(name = "Gender:", values = unikn::usecol(c(Pinky, Seeblau))) +  # unikn colors
  labs(title = "Outliers relative to both overall sample and gender", 
       x = "Gender", y = "Height (in cm)", 
       fill = "Gender:") +
  theme_bw()


# Interpretation: 
# `out_1` contains mostly short women (except for 1 tall woman) 
#  and mostly tall men (except for 2 short men). 

# 2. Showing out_2: 
ggplot(out_2, aes(x = sex, y = height)) +
  geom_violin(aes(fill = sex)) + 
  geom_jitter(size = 4, alpha = 2/3) + 
  # scale_fill_manual(values = c("firebrick", "steelblue3")) +
  scale_fill_manual(name = "Gender:", values = unikn::usecol(c(Pinky, Seeblau))) +  # unikn colors
  labs(title = "Outliers relative to gender but not overall sample", 
       x = "Gender", y = "Height (in cm)", 
       fill = "Gender:") +
  theme_bw()


# Interpretation: 
# `out_2` contains individuals which are either tall women or short men.

A.3.5 Exercise 5

Revisiting positive psychology

In previous exercises, we used the p_info data — available as posPsy_p_info in the ds4psy package or as http://rpository.com/ds4psy/data/posPsy_participants.csv — from a study on the effectiveness of web-based positive psychology interventions (Woodworth et al., 2018). More specifically, we used this data in Exercise 6 of Chapter 1 and Exercise 5 of Chapter 2 to explore the participant information and create some corresponding plots. (See Section B.1 of Appendix B for background information on this data.)

Answer the same questions as in those exercises by verifying your earlier base R results and ggplot2 graphs by pipes of dplyr commands. Do your graphs and quantitative results support the same conclusions?

Data

# Data: 
p_info <- ds4psy::posPsy_p_info  # from ds4psy package

## Load data (from online source):
# p_info_2 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")

## Verify equality:
# all.equal(p_info, p_info_2)

# p_info
dim(p_info)  # 295 rows, 6 columns
#> [1] 295   6

From Exercise 6 of Chapter 1

Questions from Exercise 6 of Chapter 1:

Examine the participant information in p_info by describing each of its variables:

  1. How many individuals are contained in the dataset?

  2. What percentage of them is female (i.e., has a sex value of 1)?

  3. How many participants were in one of the 3 treatment groups (i.e., have an intervention value of 1, 2, or 3)?

  4. What is the participants’ mean education level? What percentage has a university degree (i.e., an educ value of at least 4)?

  5. What is the age range (min to max) of participants? What is the average (mean and median) age?

  6. Describe the range of income levels present in this sample of participants. What percentage of participants self-identifies as a below-average income (i.e., an income value of 1)?

Solution

# 1. How many individuals are contained in the dataset? 
N <- nrow(p_info) # OR 
N  # 295
#> [1] 295

# Note: 
p_info %>% count()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1   295
# would yield a tibble (with only 1 element: 295).

# 2. What percentage of them is female (i.e., has a `sex` value of 1)?
p_info %>%
  group_by(sex) %>% 
  summarise(n_sex = n(),            # number 
            pc_sex = n_sex/N * 100  # percentage
            )
#> # A tibble: 2 × 3
#>     sex n_sex pc_sex
#>   <dbl> <int>  <dbl>
#> 1     1   251   85.1
#> 2     2    44   14.9

# 3. How many participants were in one of the 3 treatment groups (i.e., have an `intervention` value of 1, 2, or 3)? 
p_info %>%
  filter(intervention < 4) %>%
  count()  
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1   222
# => 222 individuals

# OR: 
t_iv <- p_info %>%
  group_by(intervention) %>% 
  summarise(n_iv = n(),           # number 
            pc_iv = n_iv/N * 100  # percentage
            ) 
t_iv
#> # A tibble: 4 × 3
#>   intervention  n_iv pc_iv
#>          <dbl> <int> <dbl>
#> 1            1    72  24.4
#> 2            2    76  25.8
#> 3            3    74  25.1
#> 4            4    73  24.7
sum(t_iv$n_iv[1:3]) 
#> [1] 222
# => 222 in elements 1:3 of vector n_iv 


# 4. What is the participants' mean education level? 
p_info %>%
  summarise(mn_edu = mean(educ))
#> # A tibble: 1 × 1
#>   mn_edu
#>    <dbl>
#> 1   3.98

# What percentage has a university degree (i.e., an `educ` value of at least 4)?
p_info %>%
  group_by(educ) %>% 
  summarise(n_edu = n(),            # number 
            pc_edu = n_edu/N * 100  # percentage
            ) 
#> # A tibble: 5 × 3
#>    educ n_edu pc_edu
#>   <dbl> <int>  <dbl>
#> 1     1    14   4.75
#> 2     2    21   7.12
#> 3     3    39  13.2 
#> 4     4   104  35.3 
#> 5     5   117  39.7

# 5. What is the age range (`min` to `max`) of participants? 
#    What is the average (mean and median) age?
p_info %>%
  summarise(n = n(),
            min_age = min(age),
            mn_age = mean(age),
            max_age = max(age)) 
#> # A tibble: 1 × 4
#>       n min_age mn_age max_age
#>   <int>   <dbl>  <dbl>   <dbl>
#> 1   295      18   43.8      83

# 6. Describe the range of `income` levels present in this sample of participants. 
#    What percentage of participants self-identifies as a below-average income 
#    (i.e., an `income` value of 1)?
p_info %>%
  group_by(income) %>%
  summarise(n_income = n(),
            pc_income = n_income/N * 100) 
#> # A tibble: 3 × 3
#>   income n_income pc_income
#>    <dbl>    <int>     <dbl>
#> 1      1       73      24.7
#> 2      2      136      46.1
#> 3      3       86      29.2

From Exercise 5 of Chapter 2

Questions from Exercise 5 of Chapter 2:

Use the p_info data to create some plots that descripte the sample of participants:

  • A histogram that shows the distribution of participant age in 3 ways:
    • overall,
    • separately for each sex, and
    • separately for each intervention.

Solution

When using dplyr instead of ggplot2, we can replace the information contained in the histogram by a table of descriptives:

# Age distribution (as a table): 
age_overall <- p_info %>%
  summarise(n = n(),              # number 
            min_age = min(age),   # minimum
            mn_age = mean(age),   # mean
            md_age = median(age), # median           
            sd_age = sd(age),     # standard deviation
            max_age = max(age)    # maximum
  )
age_overall
#> # A tibble: 1 × 6
#>       n min_age mn_age md_age sd_age max_age
#>   <int>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#> 1   295      18   43.8     44   12.4      83

# Age distribution by sex (as a table): 
age_by_sex <- p_info %>%
  group_by(sex) %>% 
  summarise(n = n(),              # number 
            min_age = min(age),   # minimum
            mn_age = mean(age),   # mean
            md_age = median(age), # median           
            sd_age = sd(age),     # standard deviation
            max_age = max(age)    # maximum
  )
age_by_sex
#> # A tibble: 2 × 7
#>     sex     n min_age mn_age md_age sd_age max_age
#>   <dbl> <int>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#> 1     1   251      19   43.9     44   12.1      83
#> 2     2    44      18   43.0     44   14.1      71

# Age distribution by intervention (as a table): 
age_by_iv <- p_info %>%
  group_by(intervention) %>% 
  summarise(n = n(),              # number 
            min_age = min(age),   # minimum
            mn_age = mean(age),   # mean
            md_age = median(age), # median           
            sd_age = sd(age),     # standard deviation
            max_age = max(age)    # maximum
  )
age_by_iv
#> # A tibble: 4 × 7
#>   intervention     n min_age mn_age md_age sd_age max_age
#>          <dbl> <int>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#> 1            1    72      22   44.6   45     12.1      71
#> 2            2    76      18   45.4   45.5   12.5      83
#> 3            3    74      19   43.3   44     12.2      71
#> 4            4    73      19   41.7   40     12.8      75
  • A bar plot that
    • shows how many participants took part in each intervention; or
    • shows how many participants of each sex took part in each intervention.

Solution

Using dplyr instead of ggplot2:

# Number of participants per intervention:
# (was contained in age_by_iv above): 
age_by_iv %>%
  select(intervention, n)
#> # A tibble: 4 × 2
#>   intervention     n
#>          <dbl> <int>
#> 1            1    72
#> 2            2    76
#> 3            3    74
#> 4            4    73

# N and percentage by intervention and sex:
p_info %>%
  group_by(intervention, sex) %>% 
  summarise(n_iv_sex  = n(),              # number 
            pc_iv_sex = n_iv_sex/N * 100  # percentage 
            ) 
#> # A tibble: 8 × 4
#> # Groups:   intervention [4]
#>   intervention   sex n_iv_sex pc_iv_sex
#>          <dbl> <dbl>    <int>     <dbl>
#> 1            1     1       62     21.0 
#> 2            1     2       10      3.39
#> 3            2     1       66     22.4 
#> 4            2     2       10      3.39
#> 5            3     1       62     21.0 
#> 6            3     2       12      4.07
#> 7            4     1       61     20.7 
#> 8            4     2       12      4.07

A.3.6 Exercise 6

Surviving the Titanic

The Titanic data in datasets contains basic information on the Age, Class, Sex, and Survival status for the people on board of the fatal maiden voyage of the Titanic. This data is saved as a 4-dimensional array resulting from cross-tabulating 2201 observations on four variables, but can easily be transformed into a tibble titanic by calling titanic <- as_tibble(datasets::Titanic).

# ?datasets::Titanic  # see documentation  
# dim(Titanic)        # see also dim(FFTrees::titanic)

titanic <- tibble::as_tibble(datasets::Titanic)
knitr::kable(head(titanic), caption = "The head of the Titanic dataset (as a tibble).")
Table A.1: The head of the Titanic dataset (as a tibble).
Class Sex Age Survived n
1st Male Child No 0
2nd Male Child No 0
3rd Male Child No 35
Crew Male Child No 0
1st Female Child No 0
2nd Female Child No 0

Use dplyr pipes to answer each of the following questions by a summary table that counts the sum of particular groups of survivors.

  1. Determine the number of survivors by Sex: Were female passengers more likely to survive than male passengers?
# Survivors by Sex:
# (a) simple solution:
titanic %>%
  group_by(Sex, Survived) %>%
  summarise(Number = sum(n))
#> # A tibble: 4 × 3
#> # Groups:   Sex [2]
#>   Sex    Survived Number
#>   <chr>  <chr>     <dbl>
#> 1 Female No          126
#> 2 Female Yes         344
#> 3 Male   No         1364
#> 4 Male   Yes         367

# (b) complex solution (computing percentages overall and per subgroup):
t_51 <- titanic %>% 
  group_by(Sex, Survived) %>%
  summarise(nr_all = sum(n),
            pc_all = round(nr_all/sum(titanic$n) * 100, 1)) %>%
  group_by(Sex) %>%
  mutate(nr_sex = sum(nr_all),
         pc_sex  = round(nr_all/nr_sex * 100, 1))
knitr::kable(t_51, caption = "Number of Titanic survivors by `Sex`.")
Table A.2: Number of Titanic survivors by Sex.
Sex Survived nr_all pc_all nr_sex pc_sex
Female No 126 5.7 470 26.8
Female Yes 344 15.6 470 73.2
Male No 1364 62.0 1731 78.8
Male Yes 367 16.7 1731 21.2

# Corresponding Chi-square test:
tab_2x2 <- xtabs(n ~ Survived + Sex, titanic)
tab_2x2
#>         Sex
#> Survived Female Male
#>      No     126 1364
#>      Yes    344  367
csq_tst <- chisq.test(tab_2x2, correct = FALSE)
csq_tst
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  tab_2x2
#> X-squared = 456.87, df = 1, p-value < 2.2e-16
csq_tst$expected
#>         Sex
#> Survived   Female      Male
#>      No  318.1736 1171.8264
#>      Yes 151.8264  559.1736
csq_tst$observed
#>         Sex
#> Survived Female Male
#>      No     126 1364
#>      Yes    344  367

Answer: Yes, the variables Sex and Survived are related. Whereas females were about three times as likely to survive than to die, males were about four times as likely to die than to survive.

  1. Determine the number of survivors by Age: Were children more likely to survive than adults?
# Survivors by Age:
# (a) simple solution:
titanic %>%
  group_by(Age, Survived) %>%
  summarise(Number = sum(n))
#> # A tibble: 4 × 3
#> # Groups:   Age [2]
#>   Age   Survived Number
#>   <chr> <chr>     <dbl>
#> 1 Adult No         1438
#> 2 Adult Yes         654
#> 3 Child No           52
#> 4 Child Yes          57

# (b) complex solution (computing percentages overall and per subgroup):
t_52 <- titanic %>% 
  group_by(Age, Survived) %>%
  summarise(nr_age_sur = sum(n),
            pc_age_sur = round(nr_age_sur/sum(titanic$n) * 100, 1)) %>%
  group_by(Age) %>%
  mutate(nr_age = sum(nr_age_sur),
         pc_age  = round(nr_age_sur/nr_age * 100, 1))
knitr::kable(t_52, caption = "Number of Titanic survivors by `Age`.")
Table A.3: Number of Titanic survivors by Age.
Age Survived nr_age_sur pc_age_sur nr_age pc_age
Adult No 1438 65.3 2092 68.7
Adult Yes 654 29.7 2092 31.3
Child No 52 2.4 109 47.7
Child Yes 57 2.6 109 52.3

# Corresponding Chi-square test:
tab_2x2 <- xtabs(n ~ Survived + Age, titanic)
tab_2x2
#>         Age
#> Survived Adult Child
#>      No   1438    52
#>      Yes   654    57
csq_tst <- chisq.test(tab_2x2, correct = FALSE)
csq_tst
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  tab_2x2
#> X-squared = 20.956, df = 1, p-value = 4.701e-06
csq_tst$expected
#>         Age
#> Survived     Adult    Child
#>      No  1416.2108 73.78919
#>      Yes  675.7892 35.21081
csq_tst$observed
#>         Age
#> Survived Adult Child
#>      No   1438    52
#>      Yes   654    57

Answer: Yes, the variables Age and Survived are related. Whereas adults were more than twice as likely to die than to survive, the chances of survival for children were almost 50%.

  1. Consider the number of survivors as a function of both Sex and Age. Does the pattern observed in 1. hold equally for children and adults?
# Survivors by Age and Sex:
t_53 <- titanic %>% 
  group_by(Age, Sex, Survived) %>%
  summarise(nr_age_sex_sur = sum(n),
            pc_age_sex_sur = round(nr_age_sex_sur/sum(titanic$n) * 100, 1)) %>%
  group_by(Age, Sex) %>%
  mutate(nr_age_sex = sum(nr_age_sex_sur),
         pc_age_sex = round(nr_age_sex_sur/nr_age_sex * 100, 1))

knitr::kable(t_53, caption = "Number of Titanic survivors by `Age` and `Sex`.")
Table A.4: Number of Titanic survivors by Age and Sex.
Age Sex Survived nr_age_sex_sur pc_age_sex_sur nr_age_sex pc_age_sex
Adult Female No 109 5.0 425 25.6
Adult Female Yes 316 14.4 425 74.4
Adult Male No 1329 60.4 1667 79.7
Adult Male Yes 338 15.4 1667 20.3
Child Female No 17 0.8 45 37.8
Child Female Yes 28 1.3 45 62.2
Child Male No 35 1.6 64 54.7
Child Male Yes 29 1.3 64 45.3

Answer: Female adults have a higher chance of survival than male adults (74.4% vs. 20.3%) and female children have a higher chance of survival than male children (62.2% vs. 45.3%, respectively). However, the relative advantage for being female rather than male is much higher for adults (as the survival chances of female adults are 54.1% percentage points higher than for male adults) than for children (as the survival chance of female children are only 16.9% percentage points higher than for male children). Thus, being female was more of an advantage for adults than for children (and the survival rate of female adults was higher than the survival rate for children of either sex).

Statistically, this suggests an interaction between the factors Age and Sex on the chances of survival, though note that there were only 109 children overall (i.e., 4.95% of the passenger population).

  1. The documentation of the Titanic data suggests that the policy women and children first was “not entirely successful in saving the women and children in the third class”. Verify this by creating corresponding contingency tables (i.e., counts of survivors).
# Survivors by Sex for each class:
t_54a <- titanic %>% 
  # filter(Class == "3rd") %>%
  group_by(Class, Sex, Survived) %>%
  summarise(count = sum(n), 
            pc = round(count/sum(titanic$n) * 100, 2))
knitr::kable(t_54a, caption = "Number of Titanic survivors by `Class` and `Sex`.")
Table A.5: Number of Titanic survivors by Class and Sex.
Class Sex Survived count pc
1st Female No 4 0.18
1st Female Yes 141 6.41
1st Male No 118 5.36
1st Male Yes 62 2.82
2nd Female No 13 0.59
2nd Female Yes 93 4.23
2nd Male No 154 7.00
2nd Male Yes 25 1.14
3rd Female No 106 4.82
3rd Female Yes 90 4.09
3rd Male No 422 19.17
3rd Male Yes 88 4.00
Crew Female No 3 0.14
Crew Female Yes 20 0.91
Crew Male No 670 30.44
Crew Male Yes 192 8.72

# Survivors by Age for each class:
t_54b <- titanic %>% 
  # filter(Class == "3rd") %>%
  group_by(Class, Age, Survived) %>%
  summarise(count = sum(n), 
            pc = round(count/sum(titanic$n) * 100, 2))
knitr::kable(t_54b, caption = "Number of Titanic survivors by `Class` and `Age`.")
Table A.5: Number of Titanic survivors by Class and Age.
Class Age Survived count pc
1st Adult No 122 5.54
1st Adult Yes 197 8.95
1st Child No 0 0.00
1st Child Yes 6 0.27
2nd Adult No 167 7.59
2nd Adult Yes 94 4.27
2nd Child No 0 0.00
2nd Child Yes 24 1.09
3rd Adult No 476 21.63
3rd Adult Yes 151 6.86
3rd Child No 52 2.36
3rd Child Yes 27 1.23
Crew Adult No 673 30.58
Crew Adult Yes 212 9.63
Crew Child No 0 0.00
Crew Child Yes 0 0.00

Answer: Both regularities (i.e., much higher survival chance of females and for children) are lower for 3rd class passengers (despite still showing the same trends).

This concludes our exercises on dplyr — but the topic of data transformation will stay relevant throughout this book.

References

Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R., & Schüz, B. (2018). Data from “Web-based positive psychology interventions: A reexamination of effectiveness”. Journal of Open Psychology Data, 6(1). https://doi.org/10.5334/jopd.35