3.5 Exercises

The following exercises practice the essential dplyr commands and aim to show that simple pipes of them can solve quite intriguing puzzles about data.

3.5.1 Exercise 1

Reshaping vs. reducing data

Discuss the main dplyr functions in terms of reshaping and reducing data (as introduced in Section 3.1.1).

3.5.2 Exercise 2

Star and R wars

We start tackling the tidyverse by uncovering even more facts about the dplyr::starwars universe. Answer the following questions by using pipes of basic dplyr commands (i.e., by arranging, filtering, selecting, grouping, counting, summarizing).

  • Save the tibble dplyr::starwars as sw and report its dimensions.
## Load data: ----- 
sw <- dplyr::starwars
# ?dplyr::starwars  # codebook of variables

Known unknowns

  • How many missing (NA) values does sw contain?

  • Which variable (column) has the most missing values?

  • Which individuals come from an unknown (missing) homeworld but have a known birth_year or known mass?

Gender issues

  • How many humans are contained in sw overall and by gender?

  • How many and which individuals in sw are neither male nor female?

  • For which species in sw do at least two different gender values exist?

  • Bonus task: R typically provides many ways to obtain a solution. Let’s gain an overview of the gender distribution in our sw dataset in three different ways:

    • Use a dplyr pipe to compute a summary table tb that counts the frequency of each gender in sw.
    • Use ggplot2 on the raw data of sw to create a bar chart (A) that shows the same gender distribution.
    • Use ggplot2 on the summary table tb to create a bar chart (B) that shows the same gender distribution.
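The difference between plotting raw data and plotting a summary table can be sketched on hypothetical toy data (the tibble toy and its variable color are made up for illustration and are not part of sw). As one possible approach, geom_bar() counts the rows of the raw data itself, whereas geom_col() expects a column of counts that were computed beforehand:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (not sw): one categorical variable
toy <- tibble::tibble(color = c("red", "blue", "red", "green", "red", "blue"))

# Summary table: one row per category, frequency in column n
toy_tb <- toy %>% count(color)

# (A) From raw data: geom_bar() counts the rows per category
plot_A <- ggplot(toy, aes(x = color)) +
  geom_bar()

# (B) From the summary table: geom_col() plots the counts in n as given
plot_B <- ggplot(toy_tb, aes(x = color, y = n)) +
  geom_col()
```

Both plots should show the same bar heights, as they contain the same information at different levels of aggregation.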

Size and mass issues

  • Compute the median, mean, and standard deviation of height for all droids.

  • Compute the average height and mass by species and save the result as h_m.

  • Sort h_m to list the three species with the smallest individuals (in terms of mean height).

  • Sort h_m to list the three species with the heaviest individuals (in terms of median mass).

Bonus tasks

The following bonus tasks are more difficult, but can be solved with a single dplyr pipe:

  • How many individuals come from the three most frequent (known) species?

  • Which individuals are more than 20% lighter (in terms of mass) than the average mass of individuals of their own homeworld?

3.5.3 Exercise 3

Sleeping mammals

The dataset ggplot2::msleep contains a mammals sleep dataset (see ?msleep for details and the definition of variables).

  • Save the data as sp and check the dimensions, variable types, and number of missing values in the dataset.
## Data: 
# ?msleep  # check variables     
sp <- ggplot2::msleep

Arranging and filtering data

Use the dplyr verbs arrange(), group_by(), and filter() to answer the following questions by creating ordered subsets of the data:

  • Arrange the rows (alphabetically) by vore, order, and name, and report the genus of the top three mammals.

  • What is the most common type of vore in the data? How many omnivores are there?

  • What is the most common order in the dataset? Are there more exemplars of the order “Carnivora” or “Primates”?

  • Which two mammals of the order “Primates” have the longest and shortest sleep_total times?

Computing new variables

Solve the following tasks by combining the dplyr commands mutate(), group_by(), and summarise():

  • Compute a variable sleep_awake_sum that adds the sleep_total time and the awake time of each mammal. What result do you expect and get?

  • Which animals have the smallest and largest brain to body ratio (in terms of weight)? How many mammals have a larger ratio than humans?

  • What is the minimum, average (mean), and maximum sleep cycle length for each vore? (Hint: First group the data by group_by, then use summarise on the sleep_cycle variable, but also count the number of NA values for each vore. When computing grouped summaries, NA values can be removed by na.rm = TRUE.)

  • Replace your summarise() verb in the previous task by mutate(). What do you get as a result? (Hint: The last two tasks illustrate the difference between grouped summarise() and grouped mutate() commands.)
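The contrast between grouped summarise() and grouped mutate() can be sketched on hypothetical toy data (the tibble toy and its variables grp and score are made up for illustration and are not part of msleep). This is a minimal sketch, not a solution to the tasks above:

```r
library(dplyr)

# Hypothetical toy data (not msleep): three groups, one missing score
toy <- tibble::tibble(
  grp   = c("a", "a", "a", "b", "b", "c"),
  score = c(1, 3, NA, 10, 20, 7)
)

# Grouped summarise(): collapses the data to one row per group
toy_sum <- toy %>%
  group_by(grp) %>%
  summarise(n    = n(),                       # rows per group
            n_NA = sum(is.na(score)),         # missing values per group
            mn   = mean(score, na.rm = TRUE)) # mean, ignoring NA values

# Grouped mutate(): keeps all rows and repeats the group value on each row
toy_mut <- toy %>%
  group_by(grp) %>%
  mutate(mn = mean(score, na.rm = TRUE)) %>%
  ungroup()

dim(toy_sum)  # 3 rows (one per group)
dim(toy_mut)  # 6 rows (as in toy)
```

Without na.rm = TRUE, the mean of group "a" would be NA, as one of its scores is missing.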

3.5.4 Exercise 4

Outliers

This exercise examines different possibilities for defining outliers and uses the outliers dataset of the ds4psy package (also available as out.csv at http://rpository.com/ds4psy/data/out.csv) to illustrate and compare them. With respect to your insights into dplyr, this exercise helps to disentangle mutate from grouped mutate commands.

Data on outliers

Use the outliers data (from the ds4psy package) or use the following read_csv() command to load the data into an R object entitled outliers:

# From the ds4psy package:
outliers <- ds4psy::outliers

# Alternatively, load csv data from online source (as comma-separated file): 
# outliers_2 <- readr::read_csv("http://rpository.com/ds4psy/data/out.csv")  # from online source

# Verify equality: 
# all.equal(ds4psy::outliers, outliers_2)

# Alternatively, from a local data file: 
# outliers <- readr::read_csv("out.csv")  # from current directory

Not all outliers are alike

An outlier can be defined as an individual whose value in some variable deviates by more than a given criterion (e.g., two standard deviations) from the mean of the variable. However, this definition is incomplete unless it also specifies the reference group over which the means and deviations are computed. In the following, we explore the implications of different reference groups.

Basic tasks

  • Save the data into a tibble outliers and report its number of observations and variables, and their types.

  • How many missing data values are there in outliers?

  • What is the gender (or sex) distribution in this sample?

  • Create a plot that shows the distribution of height values for each gender.

Defining different outliers

Compute 2 new variables that signal and distinguish between 2 types of outliers in terms of height:

  1. outliers relative to the height of the overall sample (i.e., individuals with height values deviating more than 2 SD from the overall mean of height);

  2. outliers relative to the height of some subgroup’s mean and SD. Here, a suitable subgroup to consider is every person’s gender (i.e., individuals with height values deviating more than 2 SD from the mean height of their own gender).

Hints: As both variables signal whether or not someone is an outlier, they should be defined as logicals (being either TRUE or FALSE) and added as new columns to the data (via appropriate mutate commands). While the 1st variable can be computed based on the mean and SD of the overall sample, the 2nd variable can be computed after grouping outliers by gender and then computing and using the corresponding mean and SD values. The absolute difference between 2 numeric values x and y is provided by abs(x - y).
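The logic of these hints can be sketched on hypothetical toy data (the tibble toy, its variables grp and height, and the column names out_all and out_grp are made up for illustration and do not stem from the outliers data). Under these assumptions, the same mutate() expression yields different results depending on whether it is evaluated before or after group_by():

```r
library(dplyr)

# Hypothetical toy data (not ds4psy::outliers): two groups of 10 heights
toy <- tibble::tibble(
  grp    = rep(c("f", "m"), each = 10),
  height = c(rep(c(160, 161, 159), times = 3), 175,
             rep(c(180, 181, 179), times = 3), 165)
)

crit <- 2  # outlier criterion (in SD units)

toy_out <- toy %>%
  # ungrouped mutate: mean and SD of the overall sample
  mutate(out_all = abs(height - mean(height)) > crit * sd(height)) %>%
  # grouped mutate: mean and SD of each individual's own group
  group_by(grp) %>%
  mutate(out_grp = abs(height - mean(height)) > crit * sd(height)) %>%
  ungroup()

sum(toy_out$out_all)  # 0 outliers relative to the overall sample
sum(toy_out$out_grp)  # 2 outliers relative to their own group

toy_out %>% filter(out_grp)  # heights of 175 (in "f") and 165 (in "m")
```

In this toy sample, neither 175 nor 165 is unusual overall (the overall mean is 170 with an SD of about 9.9), but both values deviate by more than 2 SD from the means of their own groups.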

Relative outliers

Now use the 2 new outlier variables to define (or filter) 2 subsets of the data that contain 2 subgroups of people:

  1. out_1: Individuals (females and males) with height values that are outliers relative to both the entire sample and the sample of their own gender. How many such individuals are in outliers?

  2. out_2: Individuals (females and males) with height values that are not outliers relative to the entire population, but are outliers relative to their own gender. How many such individuals are in outliers?

Bonus plots

  • Visualize the raw values and distributions of height for both types of outliers (out_1 and out_2) in 2 separate plots.

  • Interpret both plots by describing the height and sex combination of the individuals shown in each plot.

3.5.5 Exercise 5

Revisiting positive psychology

In previous exercises, we used the p_info data — available as posPsy_p_info in the ds4psy package or as http://rpository.com/ds4psy/data/posPsy_participants.csv — from a study on the effectiveness of web-based positive psychology interventions (Woodworth et al., 2018). More specifically, we used this data in Exercise 6 of Chapter 1 and Exercise 5 of Chapter 2 to explore the participant information and create some corresponding plots. (See Section B.1 of Appendix B for background information on this data.)

Answer the same questions as in those exercises, but now use pipes of dplyr commands to verify your earlier base R results and ggplot2 graphs. Do your graphs and quantitative results support the same conclusions?

Data

# From ds4psy package:
p_info <- ds4psy::posPsy_p_info

# Alternatively, load data from online source:
# p_info_2 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")

# Verify equality:
# all.equal(p_info, p_info_2)

# p_info
dim(p_info)  # 295 rows, 6 columns
#> [1] 295   6

From Exercise 6 of Chapter 1

Questions from Exercise 6 of Chapter 1:

Examine the participant information in p_info by describing each of its variables:

  1. How many individuals are contained in the dataset?
  2. What percentage of them is female (i.e., has a sex value of 1)?
  3. How many participants were in one of the 3 treatment groups (i.e., have an intervention value of 1, 2, or 3)?
  4. What is the participants’ mean education level? What percentage has a university degree (i.e., an educ value of at least 4)?
  5. What is the age range (min to max) of participants? What is the average (mean and median) age?
  6. Describe the range of income levels present in this sample of participants. What percentage of participants self-identifies as having a below-average income (i.e., an income value of 1)?

From Exercise 5 of Chapter 2

Questions from Exercise 5 of Chapter 2:

Use the p_info data to create some plots that describe the sample of participants:

  • A histogram that shows the distribution of participant age in 3 ways:
    • overall,
    • separately for each sex, and
    • separately for each intervention.
  • A bar plot that
    • shows how many participants took part in each intervention; or
    • shows how many participants of each sex took part in each intervention.

3.5.6 Exercise 6

Surviving the Titanic

The Titanic data in datasets contains basic information on the Age, Class, Sex, and Survival status of the people on board during the fatal maiden voyage of the Titanic. This data is saved as a 4-dimensional array resulting from cross-tabulating 2201 observations on four variables, but can easily be transformed into a tibble titanic by evaluating titanic <- tibble::as_tibble(datasets::Titanic).

Table 3.1: The head of the Titanic dataset (as a tibble).

Class  Sex     Age    Survived   n
1st    Male    Child  No         0
2nd    Male    Child  No         0
3rd    Male    Child  No        35
Crew   Male    Child  No         0
1st    Female  Child  No         0
2nd    Female  Child  No         0
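As each row of the tibble describes a cell of the cross-tabulation (rather than an individual person), group sizes are obtained by summing the counts in n, not by counting rows. The following minimal sketch illustrates this for Class (the object name by_class and the column name people are chosen for illustration only):

```r
library(dplyr)

# Transform the 4-dimensional array into a tibble:
titanic <- tibble::as_tibble(datasets::Titanic)

dim(titanic)    # 32 rows (4 x 2 x 2 x 2 cells), 5 columns
sum(titanic$n)  # 2201 people overall

# Sum the counts in n for each group (here: by Class):
by_class <- titanic %>%
  group_by(Class) %>%
  summarise(people = sum(n))

by_class  # 1st: 325, 2nd: 285, 3rd: 706, Crew: 885
```

Using n() instead of sum(n) here would return 8 for every Class, as each Class occupies 8 of the 32 rows.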

Use dplyr pipes to answer each of the following questions by a summary table that sums the counts of particular groups of survivors.

  1. Determine the number of survivors by Sex: Were female passengers more likely to survive than male passengers?

  2. Determine the number of survivors by Age: Were children more likely to survive than adults?

  3. Consider the number of survivors as a function of both Sex and Age. Does the pattern observed in 1. hold equally for children and adults?

  4. The documentation of the Titanic data suggests that the women and children first policy was “not entirely successful in saving the women and children in the third class”. Verify this by creating corresponding contingency tables (i.e., counts of survivors).

This concludes our exercises on dplyr — but the topic of data transformation will stay relevant throughout this book.

References

Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R., & Schüz, B. (2018). Data from “Web-based positive psychology interventions: A reexamination of effectiveness”. Journal of Open Psychology Data, 6(1). https://doi.org/10.5334/jopd.35