## 3.5 Exercises

The following exercises practice the essential **dplyr** commands and aim to show that simple pipes of them can solve quite intriguing puzzles about data.

### 3.5.1 Exercise 1

#### Star and R wars

We start tackling the **tidyverse** by uncovering even more facts about the `dplyr::starwars` universe. Answer the following questions by using pipes of basic **dplyr** commands (i.e., by arranging, filtering, selecting, grouping, counting, summarizing).

- Save the tibble `dplyr::starwars` as `sw` and report its dimensions.

```
## Load data: -----
sw <- dplyr::starwars
# ?dplyr::starwars # codebook of variables
```

#### Known unknowns

- How many missing (`NA`) values does `sw` contain?
- Which variable (column) has the most missing values?
- Which individuals come from an unknown (missing) `homeworld` but have a known `birth_year` or known `mass`?
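
The kind of missing-value bookkeeping asked for here can be sketched on a small made-up tibble. The object `toy` and its columns are purely hypothetical stand-ins (not the `starwars` data), and this is only one of several possible approaches:

```r
library(dplyr)

# Hypothetical toy data (NOT dplyr::starwars):
toy <- tibble::tibble(
  name   = c("A", "B", "C", "D"),
  height = c(172, NA, 96, 202),
  world  = c("X", NA, NA, "Y")
)

# Total number of missing values (3 in this toy example):
na_total <- sum(is.na(toy))

# Missing values per column (base R):
na_by_col <- colSums(is.na(toy))

# The same per-column count as a dplyr pipe:
na_by_col_tbl <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Rows with a missing `world` but a known `height`:
known_height <- toy %>%
  filter(is.na(world), !is.na(height))
```

Sorting `na_by_col` (e.g., with `sort(na_by_col, decreasing = TRUE)`) would then show the column with the most missing values first.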

#### Gender issues

- How many humans are contained in `sw` overall and by gender?
- How many and which individuals in `sw` are neither male nor female?
- For which species in `sw` do at least 2 different gender values exist?

#### Popular homes and heights

- From which `homeworld` do the most individuals (rows) come?
- What is the mean `height` of all individuals with orange eyes from the most popular homeworld?

#### Size and mass issues

- Compute the median, mean, and standard deviation of `height` for all droids.
- Compute the average height and mass by species and save the result as `h_m`.
- Sort `h_m` to list the 3 species with the smallest individuals (in terms of mean `height`).
- Sort `h_m` to list the 3 species with the heaviest individuals (in terms of median `mass`).

#### Bonus tasks

How many individuals come from the 3 most frequent (known) species?

Which individuals are more than 20% lighter (in terms of mass) than the average mass of individuals of their own homeworld?

### 3.5.2 Exercise 2

#### Sleeping mammals

The dataset `ggplot2::msleep` contains data on the sleep of mammals (see `?msleep` for details and the definition of variables).

- Save the data as `sp` and check the dimensions, variable types, and number of missing values in the dataset.

```
## Data:
# ?msleep # check variables
sp <- ggplot2::msleep
```

#### Arranging and filtering data

Use the **dplyr** verbs `arrange`, `group_by`, and `filter` to answer the following questions by creating ordered subsets of the data:

- Arrange the rows (alphabetically) by `vore`, `order`, and `name`, and report the `genus` of the top 3 mammals.
- What is the most common type of `vore` in the data? How many omnivores are there?
- What is the most common `order` in the dataset? Are there more exemplars of the `order` “Carnivora” or “Primates”?
- Which 2 mammals of the order “Primates” have the longest and shortest `sleep_total` times?

#### Computing new variables

Solve the following tasks by combining the **dplyr** commands `mutate`, `group_by`, and `summarise`:

- Compute a variable `sleep_awake_sum` that adds the `sleep_total` time and the `awake` time of each mammal. What result do you expect and get?
- Which animals have the smallest and largest brain to body ratio (in terms of weight)? How many mammals have a larger ratio than humans?
- What is the minimum, average (mean), and maximum sleep cycle length for each `vore`? (Hint: First group the data by `group_by`, then use `summarise` on the `sleep_cycle` variable, but also count the number of `NA` values for each `vore`. When computing grouped summaries, `NA` values can be removed by `na.rm = TRUE`.)
- Replace your `summarise` verb in the previous task by `mutate`. What do you get as a result? (Hint: The last two tasks illustrate the difference between `mutate` and *grouped* `mutate` commands.)
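
The contrast that these hints point at can be previewed on a small made-up example. The tibble `toy` and its columns `group` and `value` are hypothetical and chosen only for illustration (they are not the `msleep` variables), so treat this as a sketch of the general pattern rather than a solution:

```r
library(dplyr)

# Hypothetical toy data (NOT ggplot2::msleep):
toy <- tibble::tibble(
  group = c("a", "a", "a", "b", "b"),
  value = c(1, 3, NA, 10, 20)
)

# summarise() collapses each group into a single row:
by_group <- toy %>%
  group_by(group) %>%
  summarise(n_na       = sum(is.na(value)),          # missing values per group
            mean_value = mean(value, na.rm = TRUE))  # mean, ignoring NA values

# A grouped mutate() keeps all rows and repeats the group's
# summary value on every row of that group:
with_group_mean <- toy %>%
  group_by(group) %>%
  mutate(mean_value = mean(value, na.rm = TRUE)) %>%
  ungroup()

nrow(by_group)         # 2 rows (one per group)
nrow(with_group_mean)  # 5 rows (one per original row)
```

Without `na.rm = TRUE`, the mean of group `a` would be `NA`, as one of its values is missing.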

### 3.5.3 Exercise 3

#### Outliers

This exercise examines different possibilities for defining *outliers* and uses the `outliers` dataset of the **ds4psy** package (also available as `out.csv` at http://rpository.com/ds4psy/data/out.csv) to illustrate and compare them. With respect to your insights into **dplyr**, this exercise helps to disentangle `mutate` from *grouped* `mutate` commands.

#### Data on `outliers`

Use the `outliers` data (from the **ds4psy** package) or use the following `read_csv()` command to load the data into an R object entitled `outliers`:

```
# From the ds4psy package:
outliers <- ds4psy::outliers
# Alternatively, load csv data from online source (as comma-separated file):
outliers_2 <- readr::read_csv("http://rpository.com/ds4psy/data/out.csv") # from online source
# Verify equality:
all.equal(ds4psy::outliers, outliers_2)
```

`#> [1] TRUE`

```
# Alternatively, from a local data file:
# outliers <- read_csv("out.csv") # from current directory
```

#### Not all outliers are alike

An *outlier* can be defined as an individual whose value in some variable deviates by more than a given criterion (e.g., 2 standard deviations) from the mean of the variable. However, this definition is incomplete unless it also specifies the *reference group* over which the means and deviations are computed. In the following, we explore the implications of different reference groups.

#### Basic tasks

- Save the data into a tibble `outliers` and report its number of observations and variables, and their types.
- How many missing data values are there in `outliers`?
- What is the gender (or `sex`) distribution in this sample?
- Create a plot that shows the distribution of `height` values for each gender.

#### Defining different outliers

Compute 2 new variables that signal and distinguish between 2 types of outliers in terms of `height`:

- outliers relative to the `height` of the *overall sample* (i.e., individuals with `height` values deviating more than 2 SD from the overall mean of `height`);
- outliers relative to the `height` of *some subgroup*’s mean and SD. Here, a suitable subgroup to consider is every person’s gender (i.e., individuals with `height` values deviating more than 2 SD from the mean `height` of their own gender).

**Hints:** As both variables signal whether or not someone is an outlier, they should be defined as logicals (being either `TRUE` or `FALSE`) and added as new columns to `outliers` (via appropriate `mutate` commands). While the 1st variable can be computed based on the mean and SD of the overall sample, the 2nd variable can be computed after grouping `outliers` by gender and then computing and using the corresponding mean and SD values. The absolute difference between 2 numeric values `x` and `y` is provided by `abs(x - y)`.
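
The role of the reference group can be sketched on made-up numbers. Everything in the following block (the tibble `toy`, its columns `grp` and `x`, the criterion `crit`, and the flag names `out_all` and `out_grp`) is hypothetical and not taken from the `outliers` data. It merely shows how the same `abs(x - mean(x))` comparison yields different results with and without grouping:

```r
library(dplyr)

# Hypothetical toy data (NOT ds4psy::outliers):
toy <- tibble::tibble(
  grp = rep(c("f", "m"), each = 10),
  x   = c(158, 159, 160, 160, 160, 160, 160, 161, 162, 172,
          176, 178, 179, 180, 180, 180, 180, 181, 182, 184)
)

crit <- 2  # criterion: number of SDs (assumed here)

flagged <- toy %>%
  # (a) reference group = entire sample:
  mutate(out_all = abs(x - mean(x)) > crit * sd(x)) %>%
  # (b) reference group = own subgroup:
  group_by(grp) %>%
  mutate(out_grp = abs(x - mean(x)) > crit * sd(x)) %>%
  ungroup()

# Rows flagged by at least one of the two definitions:
flagged %>% filter(out_all | out_grp)
```

In this toy example, the value of 172 in group `f` is unremarkable relative to the entire sample, but deviates by more than 2 SD from the mean of its own group. It is thus flagged only by the grouped definition.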

#### Relative outliers

Now use the 2 new outlier variables to define (or `filter`) 2 subsets of the data that contain 2 subgroups of people:

- `out_1`: Individuals (females and males) with `height` values that are outliers relative to *both* the entire sample *and* the sample of their own gender. How many such individuals are in `outliers`?
- `out_2`: Individuals (females and males) with `height` values that are *not* outliers relative to the entire population, but *are* outliers relative to their own gender. How many such individuals are in `outliers`?

#### Bonus plots

- Visualize the raw values and distributions of `height` for both types of outliers (`out_1` and `out_2`) in 2 separate plots.
- Interpret both plots by describing the `height` and `sex` combination of the individuals shown in each plot.

### 3.5.4 Exercise 4

#### Revisiting positive psychology

In Exercise 6 of Chapter 1 and Exercise 5 of Chapter 2, we used the `p_info` data (available as `posPsy_p_info` in the **ds4psy** package or as http://rpository.com/ds4psy/data/posPsy_participants.csv) from a study on the effectiveness of web-based positive psychology interventions (Woodworth et al., 2018) to explore the participant information and create some corresponding plots. (See Section B.1 of Appendix B for background information on this data.)

Answer the same questions as in those exercises, but now use pipes of **dplyr** commands to verify your earlier base R results and **ggplot2** graphs. Do your graphs and quantitative results support the same conclusions?

#### Data

```
# From ds4psy package:
p_info <- ds4psy::posPsy_p_info
# Load data (from online source):
p_info_2 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")
# Verify equality:
all.equal(p_info, p_info_2)
```

`#> [1] TRUE`

```
# p_info
dim(p_info) # 295 rows, 6 columns
```

`#> [1] 295 6`

#### From Exercise 6 of Chapter 1

Questions from Exercise 6 of Chapter 1:

Examine the participant information in `p_info` by describing each of its variables:

- How many individuals are contained in the dataset?
- What percentage of them is female (i.e., has a `sex` value of 1)?
- How many participants were in one of the 3 treatment groups (i.e., have an `intervention` value of 1, 2, or 3)?
- What is the participants’ mean education level? What percentage has a university degree (i.e., an `educ` value of at least 4)?
- What is the age range (`min` to `max`) of participants? What is the average (mean and median) age?
- Describe the range of `income` levels present in this sample of participants. What percentage of participants reports a below-average income (i.e., an `income` value of 1)?
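
Many of these questions boil down to counting cases and turning logical tests into percentages. A minimal sketch of this pattern on made-up data (the tibble `toy` with its columns `code` and `age` is hypothetical and unrelated to `p_info`) could look as follows:

```r
library(dplyr)

# Hypothetical toy data (NOT the p_info data):
toy <- tibble::tibble(
  code = c(1, 2, 1, 1, 2, 3, 1, 2),
  age  = c(20, 35, 41, 28, 52, 33, 47, 24)
)

desc <- toy %>%
  summarise(n             = n(),                     # number of cases
            pc_code_1     = mean(code == 1) * 100,   # percentage with code 1
            n_code_1_or_2 = sum(code %in% c(1, 2)),  # cases with code 1 or 2
            age_min       = min(age),
            age_max       = max(age),
            age_mean      = mean(age),
            age_median    = median(age))

desc
```

The key idea is that the mean of a logical vector is the proportion of its `TRUE` values, so `mean(code == 1) * 100` yields a percentage (here: 50, as 4 of the 8 toy cases have a `code` of 1).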

#### From Exercise 5 of Chapter 2

Questions from Exercise 5 of Chapter 2:

Use the `p_info` data to create some plots that describe the sample of participants:

- A *histogram* that shows the distribution of participant `age` in 3 ways:
  - overall,
  - separately for each `sex`, and
  - separately for each `intervention`.

- A *bar plot* that
  - shows how many participants took part in each `intervention`; or
  - shows how many participants of each `sex` took part in each `intervention`.


### 3.5.5 Exercise 5

#### Surviving the Titanic

The `Titanic` data in **datasets** contains basic information on the `Age`, `Class`, `Sex`, and survival status (`Survived`) of the people on board the fatal maiden voyage of the Titanic. This data is saved as a 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables, but can easily be transformed into a tibble `titanic` by calling `titanic <- as_tibble(datasets::Titanic)`.

```
# ?datasets::Titanic # see documentation
# dim(Titanic) # see also dim(FFTrees::titanic)
# titanic <- tibble::as_tibble(as.data.frame(datasets::Titanic))
titanic <- tibble::as_tibble(datasets::Titanic)
knitr::kable(head(titanic), caption = "The head of the Titanic dataset (as a tibble).")
```

Class | Sex | Age | Survived | n |
---|---|---|---|---|
1st | Male | Child | No | 0 |
2nd | Male | Child | No | 0 |
3rd | Male | Child | No | 35 |
Crew | Male | Child | No | 0 |
1st | Female | Child | No | 0 |
2nd | Female | Child | No | 0 |

Use **dplyr** pipes to answer each of the following questions with a summary table that sums the counts (`n`) of particular groups of survivors.
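
As the data is stored as counts per cell (rather than as one row per person), group sizes must be obtained by summing a count column, not by counting rows. A minimal sketch of this pattern on made-up numbers follows. The tibble `toy` and its columns `group`, `sub`, and `outcome` are hypothetical and unrelated to the actual `titanic` counts:

```r
library(dplyr)

# Hypothetical toy data in the same "one row per cell" format
# (NOT the Titanic data):
toy <- tibble::tibble(
  group   = rep(c("a", "b"), each = 4),
  sub     = rep(c("x", "x", "y", "y"), times = 2),
  outcome = rep(c("No", "Yes"), times = 4),
  n       = c(20, 4, 10, 6, 3, 5, 2, 10)
)

# Sum the cell counts per group and compute the share of "Yes" outcomes:
rates <- toy %>%
  group_by(group) %>%
  summarise(n_total  = sum(n),
            n_yes    = sum(n[outcome == "Yes"]),
            rate_yes = n_yes / n_total)

rates
```

Comparing rates (rather than raw counts) matters whenever groups differ in size: Here, group `a` contains 40 cases with a "Yes" rate of 0.25, whereas group `b` contains only 20 cases with a "Yes" rate of 0.75.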

1. Determine the number of survivors by `Sex`: Were female passengers more likely to survive than male passengers?
2. Determine the number of survivors by `Age`: Were children more likely to survive than adults?
3. Consider the number of survivors as a function of both `Sex` and `Age`. Does the pattern observed in 1. hold equally for children and adults?
4. The documentation of the `Titanic` data suggests that the policy of *women and children first* was “not entirely successful in saving the women and children in the third class”. Verify this by creating corresponding contingency tables.

This concludes our exercises on **dplyr** — but the topic of data transformation will stay relevant throughout this book.

### References

Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R., & Schüz, B. (2018). Data from “Web-based positive psychology interventions: A reexamination of effectiveness”. *Journal of Open Psychology Data*, *6*(1). https://doi.org/10.5334/jopd.35