The following exercises help you to detect and create tidy data by practicing the essential tidyr commands.
7.4.1 Exercise 1
This exercise asks you to inspect some tables and turn some messy ones into tidy data.
7.4.2 Exercise 2
Let’s enter and transform some stock-related data.[31]
Moving stocks from wide to long to wide
The following table shows the start and end price of 3 stocks on 3 days (d1, d2, d3):
- Create a tibble `st` that contains this data in this (wide) format.
- Transform `st` into a longer table `st_long` that contains 18 rows and only 1 numeric variable for all stock prices. Adjust this table so that the `day` and `time` appear as 2 separate columns.
- Create a (line) graph that shows the 3 stocks’ `end` prices (on the y-axis) over the 3 days (on the x-axis).
- Transform `st_long` into a wider table that contains `start` and `end` prices as 2 distinct variables (columns) for each stock and day.
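The wide-long-wide round trip of this exercise can be sketched as follows. Since the price table is not reproduced above, the stock names and all numbers below are invented, and the column naming scheme (e.g., `d1_start`) is only an assumption about how the wide table might be laid out. Treat this as one possible approach, not as the reference solution.

```r
library(tidyverse)

# Hypothetical data: stock names and prices are invented for illustration.
st <- tibble(
  stock    = c("A", "B", "C"),
  d1_start = c(2.5, 3.3, 4.2),
  d1_end   = c(3.6, 2.9, 4.8),
  d2_start = c(3.5, 3.0, 4.6),
  d2_end   = c(4.2, 2.1, 3.1),
  d3_start = c(4.4, 2.3, 3.2),
  d3_end   = c(2.8, 2.5, 3.7)
)

# Wide to long: collect all 18 prices in 1 numeric variable,
# then split the key (e.g., "d1_start") into day and time.
st_long <- st %>%
  gather(key = "key", value = "price", d1_start:d3_end) %>%
  separate(key, into = c("day", "time"), sep = "_")

dim(st_long)  # 18 rows, 4 columns (stock, day, time, price)

# Line graph of the end prices over days (1 line per stock):
p_end <- st_long %>%
  filter(time == "end") %>%
  ggplot(aes(x = day, y = price, group = stock, color = stock)) +
  geom_line() +
  geom_point()
p_end

# Long to wide: start and end as 2 separate columns per stock and day.
st_wide <- st_long %>%
  spread(key = time, value = price)
```

The `group = stock` aesthetic is needed here because `day` is a character variable; without it, `geom_line()` would not know which points to connect.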
7.4.3 Exercise 3
In this exercise, we use tidyr to solve a problem that prevented us from creating a plot in the tibble chapter (Chapter 5).
A posPsy tibble reloaded
In Exercise 3 of Chapter 5, we created a tibble of mean depression scores (`my_tbl` or `my_tbl_2`) in 2 different ways (by entering the data directly into R with `tibble()` and by using dplyr to compute a summary table from the `posPsy_long` data).
When trying to create a plot that shows the trends of mean depression scores (over different occasions by intervention), we noted that it is impossible to directly plot the values of `my_tbl` or `my_tbl_2`. For plotting the mean depression scores with `ggplot`, we would need these scores as 1 dependent variable, rather than as 6 different variables.
Earlier, we solved this problem by creating an alternative tibble `my_tbl_3`, which expressed `mean_cesd` as a function of `occasion` and `intervention` (in long format), from the raw data in `posPsy_long`. Given our new skills in tidyr, we are now in a position to transform `my_tbl_2` (or `my_tbl`) into the required format of `my_tbl_3`. Thus, your task is:
- Re-create one of the original tibbles (either `my_tbl` or `my_tbl_2`) and use tidyr to transform it into the long format of `my_tbl_3`.
- Now do the reverse: Use `my_tbl_3` to re-create a wider version `my_tbl_4` that is equal to the original tibble.
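A minimal sketch of this round trip, assuming a summary table with 1 row per intervention and 6 occasion columns. The column names `occ_0` to `occ_5` and all values are hypothetical stand-ins, not the actual means from Chapter 5.

```r
library(tidyverse)

# Hypothetical stand-in for my_tbl_2: column names and values are invented.
my_tbl_2 <- tibble(
  intervention = 1:4,
  occ_0 = c(15.1, 15.3, 15.0, 15.2),
  occ_1 = c(14.0, 13.2, 12.1, 13.5),
  occ_2 = c(13.5, 12.6, 12.9, 13.8),
  occ_3 = c(13.1, 12.0, 11.2, 12.7),
  occ_4 = c(12.7, 11.9, 12.5, 11.8),
  occ_5 = c(12.3, 11.1, 12.0, 12.9)
)

# Wide to long: 1 row per intervention and occasion,
# with all means in 1 dependent variable (mean_cesd).
my_tbl_3 <- my_tbl_2 %>%
  gather(key = "occasion", value = "mean_cesd", occ_0:occ_5) %>%
  mutate(occasion = parse_number(occasion)) %>%
  arrange(intervention, occasion)

# The long format can be plotted directly:
p_cesd <- ggplot(my_tbl_3,
                 aes(x = occasion, y = mean_cesd,
                     color = factor(intervention))) +
  geom_line() +
  geom_point()
p_cesd

# Long to wide: restore the original column names, then spread.
my_tbl_4 <- my_tbl_3 %>%
  mutate(occasion = paste0("occ_", occasion)) %>%
  spread(key = occasion, value = mean_cesd)

# Check whether the round trip recovered the starting table:
all.equal(my_tbl_2, my_tbl_4)
```

Converting `occasion` to a number (via `parse_number()`) is a design choice that makes the x-axis continuous; the prefix is added back before spreading so that the column names match the starting table.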
7.4.4 Exercise 4
In previous chapters, we have seen 2 sets of data for the positive psychology experiment:
- `ds4psy::posPsy_long` with 990 x 50 variables (aka `posPsy_AHI_CESD_corrected.csv`, available online at http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv)
- `ds4psy::posPsy_wide` with 295 x 294 variables (aka `posPsy_data_wide.csv`, available online at http://rpository.com/ds4psy/data/posPsy_data_wide.csv)
Both of these datasets contain the same information, but one is in long format and one in wide format. With tidyr, we are able to transform the long format into wide format (and vice versa) on our own.
1. From long to wide
- Load the first file `posPsy_AHI_CESD_corrected.csv` into a tibble `posPsy_long`. To make things simpler, drop all columns except
- Transform the resulting table from long to wide format (spreading `ahiTotal` values over different occasions).
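The long-to-wide step can be illustrated on a small made-up excerpt. The tibble below is hypothetical (invented ids and scores), and keeping only `id`, `occasion`, and `ahiTotal` is an assumption about which columns are needed. Note what happens when a participant misses an occasion.

```r
library(tidyverse)

# Hypothetical excerpt in long format: ids and scores are invented.
ahi_long <- tibble(
  id       = c(1, 1, 1, 2, 2, 3),
  occasion = c(0, 1, 2, 0, 2, 0),
  ahiTotal = c(63, 73, 73, 71, 79, 65)
)

# Long to wide: 1 row per id, 1 column per occasion.
# sep = "_" prefixes the new column names with the key name
# (occasion_0, occasion_1, ...), which avoids purely numeric names.
ahi_wide <- ahi_long %>%
  spread(key = occasion, value = ahiTotal, sep = "_")

ahi_wide  # missing id-occasion combinations appear as NA
```

In the wide table, implicit gaps in the long data (e.g., participant 2 at occasion 1) become explicit `NA` values.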
2. From wide to long
- Load the second file `posPsy_data_wide.csv` into a tibble `posPsy_wide` and drop all variables that contain values of individual happiness or depression items (i.e., all score variables not containing “Total” in their names).
- Then transform this wide format tibble into long format. Your result table should contain all demographic information (in separate columns), the type of scale (`ahiTotal` vs. `cesdTotal`), the number of `occasion` (0 to 5), and the scale `value` (as dependent variable).
  Hint: First gather all `Total` variables into a single `value` variable, then separate the key column into 2 variables.
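The hint can be sketched on a toy table. Everything below is hypothetical: the values are invented, and the naming scheme `<scale>.<occasion>` (e.g., `ahiTotal.0`) as well as the demographic columns `sex` and `age` are assumptions about the layout of the wide file.

```r
library(tidyverse)

# Hypothetical excerpt in wide format: names and values are invented.
wide_toy <- tibble(
  id          = 1:2,
  sex         = c(1, 2),
  age         = c(35, 59),
  ahi01.0     = c(2, 3),    # individual item (to be dropped)
  cesd01.0    = c(1, 0),    # individual item (to be dropped)
  ahiTotal.0  = c(63, 73),
  cesdTotal.0 = c(14, 6),
  ahiTotal.1  = c(73, 89),
  cesdTotal.1 = c(6, 7)
)

long_toy <- wide_toy %>%
  # Keep demographics and only those score variables containing "Total":
  select(id, sex, age, contains("Total")) %>%
  # Gather all Total variables into 1 value variable:
  gather(key = "key", value = "value", contains("Total")) %>%
  # Split the key (e.g., "ahiTotal.0") at the dot; convert = TRUE
  # turns the occasion part into an integer:
  separate(key, into = c("scale", "occasion"), sep = "\\.", convert = TRUE) %>%
  arrange(id, scale, occasion)

long_toy
```

The dot must be escaped (`"\\."`) because `sep` is interpreted as a regular expression, in which an unescaped dot matches any character.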
7.4.5 Exercise 5
- Load the data file `falsePosPsy_all` (78 x 19 variables) from http://rpository.com/ds4psy/data/falsePosPsy_all.csv:

    # Import the dataset:
    library(tidyverse)  # provides read_csv()
    falsePosPsy_all <- read_csv("http://rpository.com/ds4psy/data/falsePosPsy_all.csv")  # online

    # Check:
    # dim(falsePosPsy_all)  # 78 x 19
    # str(falsePosPsy_all)
Let’s see whether we can detect some relationship between the parents’ age values.
- Plot the relationship between the age of each participant’s `dad` and `mom` (e.g., as a scatterplot).
- Plot how many moms and dads have which age (i.e., the distributions of age values among moms vs. dads) using
  (Hint: As both ages are in 2 separate variables, you need to use `gather` to collect the ages of both parents in 1 variable.)
- Can you think of a way of plotting the relationship between (or difference between) the age of both parents for each participant?
  (Hint: Again, you need to use `gather` to collect the ages of both parents into 1 variable.)
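One way to approach these plots is sketched below on a small invented table, so that the example runs without downloading the file. The column names `ID`, `dad`, and `mom` are assumptions about the dataset, all ages are made up, and the choice of a histogram for the distributions is just one option.

```r
library(tidyverse)

# Hypothetical ages: all values are invented for illustration.
ages <- tibble(
  ID  = 1:6,
  dad = c(49, 53, 61, 47, 58, 55),
  mom = c(45, 52, 59, 50, 51, 55)
)

# Scatterplot of both ages (still in wide format);
# the dashed diagonal marks couples of equal age.
p_scatter <- ggplot(ages, aes(x = dad, y = mom)) +
  geom_point() +
  geom_abline(linetype = 2)

# Collect both ages in 1 variable (long format):
ages_long <- ages %>%
  gather(key = "parent", value = "age", dad, mom)

# Distribution of ages, separately for moms and dads:
p_dist <- ggplot(ages_long, aes(x = age, fill = parent)) +
  geom_histogram(binwidth = 2, color = "black") +
  facet_wrap(~parent)

# Age difference within each participant: 1 line per participant
# connects the ages of both parents.
p_pairs <- ggplot(ages_long, aes(x = parent, y = age, group = ID)) +
  geom_line() +
  geom_point()

p_scatter
p_dist
p_pairs
```

In the last plot, the slope of each line reflects the age difference between a participant’s parents.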
7.4.6 Exercise 6
This is a bonus task (for the ambitious or curious) that requires transforming a dataset with multiple dependent variables into tidy data. This task extends beyond the scope of the current chapter, but can be solved with our current base R and `tidyverse` commands. (See Section 7.2.6 for additional information.)
The data table `exp_wide` contains data from \(n = 10\) participants. Each participant completed 2 tasks (and the task position `p` was randomized). For each task, we measured 2 dependent variables: the correctness `c` of the response and the time `t` (in msec) to complete the task:
- Import the data from `ds4psy::exp_wide` (or http://rpository.com/ds4psy/data/exp_wide.csv) into a tibble `exp_wide`.
- Use 2 different ways to transform (or reshape) `exp_wide` into a table of tidy data.
- Verify the equality of both solutions.
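The following sketch shows 2 possible routes on a miniature stand-in for `exp_wide`. The toy table is hypothetical: it has 3 participants instead of 10, all values are invented, and the column naming scheme `<variable>_<task>` (e.g., `t_1`) is an assumption. The second route uses `pivot_longer()` with the special `".value"` name, which goes beyond the `gather`/`spread` commands used in the hints above.

```r
library(tidyverse)

# Hypothetical miniature version of exp_wide: values are invented.
exp_toy <- tibble(
  id  = 1:3,
  p_1 = c(1, 2, 1),
  c_1 = c(1, 0, 1),
  t_1 = c(512, 640, 498),
  p_2 = c(2, 1, 2),
  c_2 = c(0, 1, 1),
  t_2 = c(701, 455, 530)
)

col_order <- c("id", "task", "p", "c", "t")

# Way 1: gather everything, split the key, then spread the
# variable names back into separate columns.
tidy_1 <- exp_toy %>%
  gather(key = "key", value = "val", -id) %>%
  separate(key, into = c("var", "task"), sep = "_") %>%
  spread(key = var, value = val) %>%
  select(all_of(col_order)) %>%
  arrange(id, task)

# Way 2: pivot_longer() with ".value" keeps p, c, and t as
# separate columns and only moves the task number into rows.
tidy_2 <- exp_toy %>%
  pivot_longer(cols = -id,
               names_to = c(".value", "task"),
               names_sep = "_") %>%
  select(all_of(col_order)) %>%
  arrange(id, task)

# Verify the equality of both solutions:
all.equal(tidy_1, tidy_2)
```

Way 1 only works smoothly here because all gathered columns are numeric; gathering variables of different types into 1 column would coerce them to a common type.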
This concludes our exercises on tidying data with tidyr.
[31] In case you find financial data not psychological enough: Imagine that the data describe 2 daily mood measurements of 3 people owning stocks…