3.5 Exercises
The following exercises practice the essential dplyr commands and aim to show that simple pipes of them can solve quite intriguing puzzles about data.
3.5.1 Exercise 1
Reshaping vs. reducing data
Discuss the main dplyr functions in terms of reshaping and reducing data (as introduced in Section 3.1.1).
3.5.2 Exercise 2
Star and R wars
We start tackling the tidyverse by uncovering even more facts about the dplyr::starwars universe.
Answer the following questions by using pipes of basic dplyr commands (i.e., by arranging, filtering, selecting, grouping, counting, summarizing).
- Save the tibble dplyr::starwars as sw and report its dimensions.
## Load data: -----
sw <- dplyr::starwars
# ?dplyr::starwars  # codebook of variables
Known unknowns
- How many missing (NA) values does sw contain?
- Which variable (column) has the most missing values?
- Which individuals come from an unknown (missing) homeworld but have a known birth_year or known mass?
Gender issues
- How many humans are contained in sw overall and by gender?
- How many and which individuals in sw are neither male nor female?
- For which species in sw do at least two different gender values exist?
- Bonus task: R typically provides many ways to obtain a solution. Let’s gain an overview of the gender distribution in our sw dataset in three different ways:
  - Use a dplyr pipe to compute a summary table tb that counts the frequency of each gender in sw.
  - Use ggplot2 on the raw data of sw to create a bar chart (A) that shows the same gender distribution.
  - Use ggplot2 on the summary table tb to create a bar chart (B) that shows the same gender distribution.
Popular homes and heights
- From which homeworld do the most individuals (rows) come?
- What is the mean height of all individuals with orange eyes from the most popular homeworld?
Size and mass issues
- Compute the median, mean, and standard deviation of height for all droids.
- Compute the average height and mass by species and save the result as h_m.
- Sort h_m to list the three species with the smallest individuals (in terms of mean height).
- Sort h_m to list the three species with the heaviest individuals (in terms of median mass).
Bonus tasks
The following bonus tasks are more difficult, but can be solved with a single dplyr pipe:
How many individuals come from the three most frequent (known) species?
Which individuals are more than 20% lighter (in terms of mass) than the average mass of individuals of their own homeworld?
3.5.3 Exercise 3
Sleeping mammals
The dataset ggplot2::msleep contains data on the sleep of mammals (see ?msleep for details and the definition of variables).
- Save the data as sp and check the dimensions, variable types, and number of missing values in the dataset.
## Data:
# ?msleep  # check variables
sp <- ggplot2::msleep
Arranging and filtering data
Use the dplyr verbs arrange(), group_by(), and filter() to answer the following questions by creating ordered subsets of the data:
- Arrange the rows (alphabetically) by vore, order, and name, and report the genus of the top three mammals.
- What is the most common type of vore in the data? How many omnivores are there?
- What is the most common order in the dataset? Are there more exemplars of the order “Carnivora” or “Primates”?
- Which two mammals of the order “Primates” have the longest and shortest sleep_total times?
Computing new variables
Solve the following tasks by combining the dplyr commands mutate(), group_by(), and summarise():
- Compute a variable sleep_awake_sum that adds the sleep_total time and the awake time of each mammal. What result do you expect and get?
- Which animals have the smallest and largest brain to body ratio (in terms of weight)? How many mammals have a larger ratio than humans?
- What is the minimum, average (mean), and maximum sleep cycle length for each vore? (Hint: First group the data by group_by, then use summarise on the sleep_cycle variable, but also count the number of NA values for each vore. When computing grouped summaries, NA values can be removed by na.rm = TRUE.)
- Replace your summarise() verb in the previous task by mutate(). What do you get as a result? (Hint: The last two tasks illustrate the difference between mutate() and grouped mutate() commands.)
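How a grouped summarise() differs from a grouped mutate() can be previewed on a small example before turning to sp. The following is only a minimal sketch: the tibble toy and its columns grp and x are hypothetical and not part of the msleep data.

```r
library(dplyr)

# Hypothetical toy data (assumption, not part of msleep):
toy <- tibble::tibble(grp = c("a", "a", "b", "b", "b"),
                      x   = c(1, 3, 10, NA, 20))

# Grouped summarise(): collapses the data to one row per group
toy_sum <- toy %>%
  group_by(grp) %>%
  summarise(mn_x = mean(x, na.rm = TRUE),  # group mean, ignoring NA values
            n_NA = sum(is.na(x)))          # number of NA values per group

# Grouped mutate(): keeps all rows and repeats the group value in each row
toy_mut <- toy %>%
  group_by(grp) %>%
  mutate(mn_x = mean(x, na.rm = TRUE)) %>%
  ungroup()

toy_sum
toy_mut
```

Comparing the number of rows of toy_sum and toy_mut shows the key difference: summarising reduces the data, whereas mutating only adds a column.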
3.5.4 Exercise 4
Outliers
This exercise examines different possibilities for defining outliers and uses the outliers dataset of the ds4psy package (also available as out.csv at http://rpository.com/ds4psy/data/out.csv) to illustrate and compare them.
With respect to your insights into dplyr, this exercise helps to disentangle mutate from grouped mutate commands.
Data on outliers
Use the outliers data (from the ds4psy package) or use the following read_csv() command to load the data into an R object entitled outliers:
# From the ds4psy package:
outliers <- ds4psy::outliers
# Alternatively, load csv data from online source (as comma-separated file):
# outliers_2 <- readr::read_csv("http://rpository.com/ds4psy/data/out.csv") # from online source
# Verify equality:
# all.equal(ds4psy::outliers, outliers_2)
# Alternatively, from a local data file:
# outliers <- readr::read_csv("out.csv")  # from current directory
Not all outliers are alike
An outlier can be defined as an individual whose value in some variable deviates by more than a given criterion (e.g., two standard deviations) from the mean of the variable. However, this definition is incomplete unless it also specifies the reference group over which the means and deviations are computed. In the following, we explore the implications of different reference groups.
Basic tasks
- Save the data into a tibble outliers and report its number of observations and variables, and their types.
- How many missing data values are there in outliers?
- What is the gender (or sex) distribution in this sample?
- Create a plot that shows the distribution of height values for each gender.
Defining different outliers
Compute 2 new variables that signal and distinguish between 2 types of outliers in terms of height:
- outliers relative to the height of the overall sample (i.e., individuals with height values deviating more than 2 SD from the overall mean of height);
- outliers relative to the height of some subgroup’s mean and SD. Here, a suitable subgroup to consider is every person’s gender (i.e., individuals with height values deviating more than 2 SD from the mean height of their own gender).
Hints: As both variables signal whether or not someone is an outlier, they should be defined as logicals (being either TRUE or FALSE) and added as new columns to the data (via appropriate mutate commands). While the 1st variable can be computed based on the mean and SD of the overall sample, the 2nd variable can be computed after grouping outliers by gender and then computing and using the corresponding mean and SD values. The absolute difference between 2 numeric values x and y is provided by abs(x - y).
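The logic of these hints can be sketched on made-up data before applying it to outliers. In the following example, the tibble toy, its 20 height values, and the column names out_all and out_grp are assumptions chosen for illustration; they are not taken from the ds4psy data.

```r
library(dplyr)

# Hypothetical toy data (assumption): 10 females and 10 males
toy <- tibble::tibble(
  sex    = rep(c("female", "male"), each = 10),
  height = c(158, 159, 160, 160, 160, 160, 160, 161, 162, 180,
             176, 178, 179, 180, 180, 180, 180, 181, 182, 184))

toy_out <- toy %>%
  # 1. Ungrouped mutate(): mean and SD of the overall sample
  mutate(out_all = abs(height - mean(height)) > 2 * sd(height)) %>%
  # 2. Grouped mutate(): mean and SD are computed within each sex
  group_by(sex) %>%
  mutate(out_grp = abs(height - mean(height)) > 2 * sd(height)) %>%
  ungroup()

# Individuals flagged only relative to their own group:
toy_out %>% filter(!out_all, out_grp)
```

In this toy data, nobody deviates more than 2 SD from the overall mean, but the female with a height of 180 deviates more than 2 SD from the mean of the female group. The same expression inside mutate() thus yields different results depending on whether the data is grouped.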
Relative outliers
Now use the 2 new outlier variables to define (or filter) 2 subsets of the data that contain 2 subgroups of people:
- out_1: Individuals (females and males) with height values that are outliers relative to both the entire sample and the sample of their own gender. How many such individuals are in outliers?
- out_2: Individuals (females and males) with height values that are not outliers relative to the entire sample, but are outliers relative to their own gender. How many such individuals are in outliers?
Bonus plots
- Visualize the raw values and distributions of height for both types of outliers (out_1 and out_2) in 2 separate plots.
- Interpret both plots by describing the height and sex combination of the individuals shown in each plot.
3.5.5 Exercise 5
Revisiting positive psychology
In Exercise 6 of Chapter 1 and Exercise 5 of Chapter 2, we used the p_info data (available as posPsy_p_info in the ds4psy package or as http://rpository.com/ds4psy/data/posPsy_participants.csv) from a study on the effectiveness of web-based positive psychology interventions (Woodworth, O’Brien-Malone, Diamond, & Schüz, 2018) to explore the participant information and create some corresponding plots.
(See Section B.1 of Appendix B for background information on this data.)
Answer the same questions as in those exercises, but now use pipes of dplyr commands to verify your earlier base R results and ggplot2 graphs. Do your graphs and quantitative results support the same conclusions?
Data
# From ds4psy package:
p_info <- ds4psy::posPsy_p_info
# Alternatively, load data from online source:
# p_info_2 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")
# Verify equality:
# all.equal(p_info, p_info_2)
# p_info
dim(p_info) # 295 rows, 6 columns
#> [1] 295 6
From Exercise 6 of Chapter 1
Questions from Exercise 6 of Chapter 1:
Examine the participant information in p_info by describing each of its variables:
- How many individuals are contained in the dataset?
- What percentage of them is female (i.e., has a sex value of 1)?
- How many participants were in one of the 3 treatment groups (i.e., have an intervention value of 1, 2, or 3)?
- What is the participants’ mean education level? What percentage has a university degree (i.e., an educ value of at least 4)?
- What is the age range (min to max) of participants? What is the average (mean and median) age?
- Describe the range of income levels present in this sample of participants. What percentage of participants self-identifies as having a below-average income (i.e., an income value of 1)?
From Exercise 5 of Chapter 2
Questions from Exercise 5 of Chapter 2:
Use the p_info data to create some plots that describe the sample of participants:
- A histogram that shows the distribution of participant age in 3 ways:
  - overall,
  - separately for each sex, and
  - separately for each intervention.
- A bar plot that
  - shows how many participants took part in each intervention; or
  - shows how many participants of each sex took part in each intervention.
3.5.6 Exercise 6
Surviving the Titanic
The Titanic data in datasets contains basic information on the Age, Class, Sex, and survival status (Survived) of the people on board the fatal maiden voyage of the Titanic. This data is saved as a 4-dimensional array resulting from cross-tabulating 2201 observations on four variables, but can easily be transformed into a tibble titanic by evaluating titanic <- tibble::as_tibble(datasets::Titanic). The first rows of titanic are:
Class | Sex | Age | Survived | n |
---|---|---|---|---|
1st | Male | Child | No | 0 |
2nd | Male | Child | No | 0 |
3rd | Male | Child | No | 35 |
Crew | Male | Child | No | 0 |
1st | Female | Child | No | 0 |
2nd | Female | Child | No | 0 |
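Because each row of titanic describes a combination of factor levels together with its frequency n, group sizes must be obtained by summing n rather than by counting rows. The following sketch illustrates this point; grouping by Class and the object name by_class are chosen only for illustration and are not part of the tasks below.

```r
library(dplyr)

# Convert the 4-dimensional array into a tibble (as described above):
titanic <- tibble::as_tibble(datasets::Titanic)

dim(titanic)    # rows are combinations of factor levels, not people
sum(titanic$n)  # total number of people counted

# Sum the frequency column n within each group
# (counting rows would only count factor level combinations):
by_class <- titanic %>%
  group_by(Class) %>%
  summarise(n = sum(n))

by_class
```

The sum of by_class$n should agree with the 2201 observations mentioned above, which provides a quick sanity check for any summary table derived from titanic.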
Use dplyr pipes to answer each of the following questions by a summary table that counts the sum of particular groups of survivors.
1. Determine the number of survivors by Sex: Were female passengers more likely to survive than male passengers?
2. Determine the number of survivors by Age: Were children more likely to survive than adults?
3. Consider the number of survivors as a function of both Sex and Age. Does the pattern observed in 1. hold equally for children and adults?
4. The documentation of the Titanic data suggests that the women and children first policy was “not entirely successful in saving the women and children in the third class.” Verify this by creating corresponding contingency tables (i.e., counts of survivors).
This concludes our exercises on dplyr — but the topic of data transformation will stay relevant throughout this book.