7.4 Exercises
The following exercises help you to detect and create tidy data by practicing the essential tidyr functions.
7.4.1 Exercise 1
Four messes and one tidy table
This exercise asks you to inspect some tables and use tidyr commands for turning some messy ones into tidy data.
The four tables t_1 to t_4 are available in the ds4psy package (Neth, 2023).
Alternatively, you can load csv-versions of these files from the following links:
For each of these files:
- Describe the data (i.e., its dimensions, observations, variables, DVs and IVs).
- Transform any non-tidy table into a tidy one.
- Verify the equality of the resulting tidy tables.
7.4.2 Exercise 2
Moving stocks (from wide to long to wide)
Let’s enter and transform some stock-related data.[49]
The following table shows the start and end price of three stocks on three days (d1, d2, d3):
stock | d1_start | d1_end | d2_start | d2_end | d3_start | d3_end |
---|---|---|---|---|---|---|
Amada | 2.5 | 3.6 | 3.5 | 4.2 | 4.4 | 2.8 |
Betix | 3.3 | 2.9 | 3.0 | 2.1 | 2.3 | 2.5 |
Cevis | 4.2 | 4.8 | 4.6 | 3.1 | 3.2 | 3.7 |
- Create a tibble st that contains this data in this (wide) format.
- Transform st into a longer table st_long that contains 18 rows and only one numeric variable for all stock prices. Adjust this table so that the day and time appear as two separate columns.
- Create a (line) graph that shows the three stocks’ end prices (on the y-axis) over the three days (on the x-axis).
- Spread or pivot st_long into a wider table that contains start and end prices as two distinct variables (columns) for each stock and day.
7.4.3 Exercise 3
In this exercise, we use tidyr to solve a problem that prevented us from creating a plot in the tibble chapter (Chapter 5).
A posPsy tibble reloaded
In Exercise 3 of Chapter 5, we created a tibble of mean depression scores (my_tbl and my_tbl_2) in 2 different ways (by entering the data directly into R with tibble() and by using dplyr to compute a summary table from the posPsy_wide data).
intervention | mn_cesd_0 | mn_cesd_1 | mn_cesd_2 | mn_cesd_3 | mn_cesd_4 | mn_cesd_5 |
---|---|---|---|---|---|---|
1 | 15.1 | 15.3 | 13.6 | 12.0 | 11.2 | 13.5 |
2 | 16.2 | 14.6 | 11.4 | 12.5 | 13.4 | 14.6 |
3 | 16.1 | 12.3 | 14.8 | 13.9 | 14.9 | 13.0 |
4 | 12.8 | 9.9 | 9.5 | 9.1 | 7.7 | 10.2 |
See Section B.1 of Appendix B for details on the data.
When trying to create a plot that shows the trends of mean depression scores (over different occasions by intervention), we noted that it is impossible to directly plot the values of my_tbl_2. For plotting the mean depression scores with ggplot, we would need these scores as 1 dependent variable, rather than as six different variables.
Earlier, we solved this problem by creating an alternative tibble my_tbl_3, which expressed mean_cesd as a function of occasion and intervention (in long format), from the raw data in posPsy_long. Given our new skills in tidyr, we now are in a position to transform my_tbl_2 (or my_tbl) into the required format of my_tbl_3. Thus, your task is:
- Re-create one of the original tibbles (either my_tbl or my_tbl_2) and use tidyr to transform it into the long format of my_tbl_3.
- Now do the reverse: Use the long version my_tbl_3 to (re-)create a wider version my_tbl_4 that is equal to my_tbl_2.
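A minimal sketch of this round trip, assuming that my_tbl is re-created from the rounded values in the table above and that the occasion is taken from the suffix of each column name. Whether occasion should be stored as a character or a numeric variable depends on how my_tbl_3 was defined in Chapter 5. It is left as a character variable here.

```r
library(tibble)
library(tidyr)

# Re-create the wide table of mean depression scores (values from the table above)
my_tbl <- tribble(
  ~intervention, ~mn_cesd_0, ~mn_cesd_1, ~mn_cesd_2, ~mn_cesd_3, ~mn_cesd_4, ~mn_cesd_5,
  1, 15.1, 15.3, 13.6, 12.0, 11.2, 13.5,
  2, 16.2, 14.6, 11.4, 12.5, 13.4, 14.6,
  3, 16.1, 12.3, 14.8, 13.9, 14.9, 13.0,
  4, 12.8,  9.9,  9.5,  9.1,  7.7, 10.2
)

# Wide to long: strip the common prefix so that occasion holds "0" to "5"
my_tbl_3 <- pivot_longer(my_tbl, cols = starts_with("mn_cesd_"),
                         names_to = "occasion", names_prefix = "mn_cesd_",
                         values_to = "mean_cesd")

# Long to wide: restore the prefix to recover the original column names
my_tbl_4 <- pivot_wider(my_tbl_3, names_from = occasion,
                        names_prefix = "mn_cesd_", values_from = mean_cesd)

# Check that the round trip returns the original table
all.equal(my_tbl, my_tbl_4)
```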
7.4.4 Exercise 4
Wide and long psychology
In previous chapters, we have seen 2 sets of data for the positive psychology experiment:
- ds4psy::posPsy_long with 990 x 50 variables (aka posPsy_AHI_CESD_corrected.csv, available online at http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv)
- ds4psy::posPsy_wide with 295 x 294 variables (aka posPsy_data_wide.csv, available online at http://rpository.com/ds4psy/data/posPsy_data_wide.csv)
(See Section B.1 of Appendix B for details on the data.)
Both of these datasets contain the same information, but one is in long format and one in wide format. With tidyr, we are able to transform the long format into wide format (and vice versa) on our own.
1. From long to wide
- Load the first file posPsy_AHI_CESD_corrected.csv into a tibble posPsy_long. To make things simpler, drop all columns except id, occasion, intervention, and ahiTotal.
- Transform the resulting table from long to wide format (spreading ahiTotal values over different occasions).
2. From wide to long
- Load the second file posPsy_data_wide.csv into a tibble posPsy_wide and drop all variables that contain values of individual happiness or depression items (i.e., all score variables not containing “Total” in their names).
- Then transform this wide format tibble into long format. Your result table should contain all demographic information (in separate columns), the type of scale (ahiTotal vs. cesdTotal), number of occasion (0 to 5), and the scale value (as dependent variable).
Hint: First gather all Total variables into a single value variable, then separate the key column into two variables scale and occasion.
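The two steps of this hint can be illustrated on a small toy table. Everything about this table is hypothetical: the values are made up, and the column names (e.g., ahiTotal.0, with a dot between scale and occasion) are an assumption that should be checked against the actual names in posPsy_wide.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: values and column names are assumed for illustration
toy_wide <- tibble(
  id = 1:2,
  age = c(35, 59),
  ahiTotal.0 = c(63, 73),
  ahiTotal.1 = c(65, 89),
  cesdTotal.0 = c(14, 6),
  cesdTotal.1 = c(10, 7)
)

# Step 1: gather all Total variables into one key and one value column
toy_key <- gather(toy_wide, key = "key", value = "value", contains("Total"))

# Step 2: separate the key column at the dot into scale and occasion
toy_long <- separate(toy_key, col = key, into = c("scale", "occasion"), sep = "\\.")
```

The demographic columns (here id and age) are not gathered and are therefore repeated for every combination of scale and occasion.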
7.4.5 Exercise 5
Plotting relatives
This exercise relies on the main dataset for false positive psychology (see Section B.2 of Appendix B for details on the data and corresponding information):
- Load the data file falsePosPsy_all (78 x 19 variables):
Parents’ age?
Let’s see whether we can detect some relationship between the parents’ age values.
- Plot the relationship between the age values of each participant’s dad and mom (e.g., as a scatterplot).
- Plot the age distributions of moms vs. dads (i.e., the distributions of age values among moms vs. dads) using histograms or geom_bar. (Hint: As both ages are in two separate variables, you need to use gather to collect the ages of both parents in one variable.)
- Can you think of a way of plotting the relationship between (or difference between) the age of both parents for each participant? (Hint: Again, you need to use gather or a pivot-function to collect the ages of both parents into one variable.)
7.4.6 Exercise 6
Experiment with wider data
This is a bonus task — for the ambitious or curious — which requires transforming a dataset with multiple dependent variables into tidy data.
This task extends beyond the scope of the current chapter, but can be solved with our current base R and tidyverse commands.
(See Section 7.2.6 for additional information.)
Data
The data table exp_wide contains data from \(n = 10\) participants.
Each participant completed 2 tasks (and the task position p was randomized).
For each task, we measured 2 dependent variables: The correctness c of the response, and the time t (in msec) to complete the task:
subj | p_1 | p_2 | c_1 | c_2 | t_1 | t_2 |
---|---|---|---|---|---|---|
1 | 1 | 2 | FALSE | FALSE | 4873.7 | 9230.0 |
2 | 1 | 2 | FALSE | FALSE | 3963.9 | 2948.8 |
3 | 2 | 1 | FALSE | TRUE | 2868.4 | 8348.3 |
4 | 2 | 1 | FALSE | FALSE | 2561.3 | 1290.2 |
5 | 1 | 2 | TRUE | FALSE | 5762.2 | 9330.7 |
6 | 1 | 2 | FALSE | FALSE | 4873.7 | 9230.0 |
7 | 2 | 1 | FALSE | TRUE | 3963.9 | 2948.8 |
8 | 1 | 2 | TRUE | TRUE | 2868.4 | 8348.3 |
9 | 2 | 1 | TRUE | TRUE | 2561.3 | 1290.2 |
10 | 2 | 1 | TRUE | FALSE | 5762.2 | 9330.7 |
Tasks
- Import the data from ds4psy::exp_wide (or http://rpository.com/ds4psy/data/exp_wide.csv) into a tibble exp_wide.
- Use 2 different ways to transform (or reshape) exp_wide into a table of tidy data exp_tidy.
- Verify the equality of both solutions.
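One way to approach these tasks is sketched below. It should be read as an assumption rather than as the intended solution. For brevity, only the first three rows of the table above are entered by hand. The names exp_tidy_1, exp_tidy_2, exp_long, and task are introduced for this sketch. The first way uses the special ".value" name of pivot_longer(); the second makes the table fully long and then widens it again, which temporarily turns the logical c values into numbers.

```r
library(tibble)
library(tidyr)

# First three rows of the table above (entered by hand for this sketch)
exp_wide <- tribble(
  ~subj, ~p_1, ~p_2, ~c_1,  ~c_2,  ~t_1,   ~t_2,
  1,     1,    2,    FALSE, FALSE, 4873.7, 9230.0,
  2,     1,    2,    FALSE, FALSE, 3963.9, 2948.8,
  3,     2,    1,    FALSE, TRUE,  2868.4, 8348.3
)

# Way 1: ".value" keeps p, c, and t as columns and moves the suffix into task
exp_tidy_1 <- pivot_longer(exp_wide, cols = -subj,
                           names_to = c(".value", "task"), names_sep = "_")

# Way 2: make everything long first (logical values are coerced to numbers) ...
exp_long <- pivot_longer(exp_wide, cols = -subj,
                         names_to = c("var", "task"), names_sep = "_",
                         values_to = "val")

# ... then spread the variable names into columns and restore the type of c
exp_tidy_2 <- pivot_wider(exp_long, names_from = var, values_from = val)
exp_tidy_2$c <- as.logical(exp_tidy_2$c)

# Verify that both ways yield the same tidy table
all.equal(exp_tidy_1, exp_tidy_2)
```

Both results contain one row per participant and task, with p, c, and t as separate variables.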
This concludes our exercises on tidying data with tidyr.
[49] In case you find financial data not psychological enough: Imagine that the data describe two daily mood measurements of three people owning stocks…