7.4 Exercises

ds4psy: Exercises 7

The following exercises help you to detect and create tidy data by practicing the essential tidyr commands.

7.4.1 Exercise 1

This exercise asks you to inspect some tables and turn some messy ones into tidy data.

Tidying messy tables

Here are the paths to data files used in this exercise:

For each of these files:

  • Describe the data (i.e., its dimensions, observations, variables, DVs and IVs).

  • Transform any non-tidy table into a tidy one.
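As a rough orientation for both steps, the following sketch inspects and tidies a small made-up table. The tibble `messy` and its columns are hypothetical and are not one of the exercise files; `gather()` is used on the assumption that it is among the tidyr commands of this chapter.

```r
library(tibble)
library(tidyr)

# Hypothetical messy table: one row per person, one column per condition
messy <- tribble(
  ~name, ~cond_A, ~cond_B,
  "Ann",      10,      12,
  "Bob",       8,      11
)

# Describe the data:
dim(messy)    # number of rows and columns
names(messy)  # variable names
str(messy)    # variable types

# Tidy it: each row becomes 1 observation (a person in a condition)
tidy <- gather(messy, key = "condition", value = "score", cond_A:cond_B)
dim(tidy)
```

Here `name` and `condition` would act as independent variables, and `score` as the single dependent variable.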

7.4.2 Exercise 2

Let’s enter and transform some stock-related data (see the footnote at the end of this section).

Moving stocks from wide to long to wide

The following table shows the start and end price of 3 stocks on 3 days (d1, d2, d3):

Table 7.13: Start and end prices of 3 shares on 3 days.
stock d1_start d1_end d2_start d2_end d3_start d3_end
Amada 2.5 3.6 3.5 4.2 4.4 2.8
Betix 3.3 2.9 3.0 2.1 2.3 2.5
Cevis 4.2 4.8 4.6 3.1 3.2 3.7

  1. Create a tibble st that contains this data in this (wide) format.

  2. Transform st into a longer table st_long that contains 18 rows and only 1 numeric variable for all stock prices. Adjust this table so that the day and time appear as 2 separate columns.

  3. Create a (line) graph that shows the 3 stocks’ end prices (on the y-axis) over the 3 days (on the x-axis).

  4. Spread st_long into a wider table that contains start and end prices as 2 distinct variables (columns) for each stock and day.
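One possible route through these 4 steps is sketched below. It is not the only solution: entering the data with `tribble()`, the helper names `st_end`, `st_plot` and `st_wide`, and the use of `gather()`, `separate()` and `spread()` are choices made for this illustration.

```r
library(tibble)
library(tidyr)
library(ggplot2)

# 1. Enter the data in wide format:
st <- tribble(
  ~stock,  ~d1_start, ~d1_end, ~d2_start, ~d2_end, ~d3_start, ~d3_end,
  "Amada", 2.5, 3.6, 3.5, 4.2, 4.4, 2.8,
  "Betix", 3.3, 2.9, 3.0, 2.1, 2.3, 2.5,
  "Cevis", 4.2, 4.8, 4.6, 3.1, 3.2, 3.7
)

# 2. Gather all prices into 1 variable, then split the key into day and time:
st_long <- gather(st, key = "key", value = "price", d1_start:d3_end)
st_long <- separate(st_long, key, into = c("day", "time"), sep = "_")
dim(st_long)  # 18 x 4

# 3. Line graph of end prices by day (1 line per stock):
st_end  <- st_long[st_long$time == "end", ]
st_plot <- ggplot(st_end, aes(x = day, y = price, group = stock, color = stock)) +
  geom_line() +
  geom_point()

# 4. Spread time into 2 price columns (start and end) per stock and day:
st_wide <- spread(st_long, key = time, value = price)
```

Printing `st_plot` draws the graph; `st_wide` contains 1 row for each combination of stock and day.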

7.4.3 Exercise 3

In this exercise, we use tidyr to solve a problem that prevented us from creating a plot in the tibble chapter (Chapter 5).

A posPsy tibble reloaded

In Exercise 3 of Chapter 5, we created tibbles of mean depression scores (my_tbl and my_tbl_2) in 2 different ways: by entering the data directly into R with tibble(), and by using dplyr to compute a summary table from the posPsy_wide data.

Table 7.14: Mean depression scores by intervention and occasion.
intervention mn_cesd_0 mn_cesd_1 mn_cesd_2 mn_cesd_3 mn_cesd_4 mn_cesd_5
1 15.1 15.3 13.6 12.0 11.2 13.5
2 16.2 14.6 11.4 12.5 13.4 14.6
3 16.1 12.3 14.8 13.9 14.9 13.0
4 12.8 9.9 9.5 9.1 7.7 10.2

See Section B.1 of Appendix B for details on the data.

When trying to create a plot that shows the trends of mean depression scores (over different occasions by intervention) we noted that it is impossible to directly plot the values of my_tbl_2. For plotting the mean depression scores with ggplot we would need these scores as 1 dependent variable, rather than as 6 different variables.

Earlier, we solved this problem by creating an alternative tibble my_tbl_3 — which expressed mean_cesd as a function of occasion and intervention (in long format) — from the raw data in posPsy_long. Given our new skills in tidyr, we now are in a position to transform my_tbl_2 (or my_tbl) into the required format of my_tbl_3. Thus, your task is:

  1. Re-create one of the original tibbles (either my_tbl or my_tbl_2) and use tidyr to transform it into the long format of my_tbl_3.

  2. Now do the reverse: Use my_tbl_3 to re-create a wider version my_tbl_4 that is equal to my_tbl_2.
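A minimal sketch of both directions follows, using the values of Table 7.14. The helper names `key` and `tmp` are illustrative, and the column order of `my_tbl_3` is an assumption about the format created in Chapter 5.

```r
library(tibble)
library(tidyr)

# Re-create the wide table (values from Table 7.14):
my_tbl <- tribble(
  ~intervention, ~mn_cesd_0, ~mn_cesd_1, ~mn_cesd_2, ~mn_cesd_3, ~mn_cesd_4, ~mn_cesd_5,
  1, 15.1, 15.3, 13.6, 12.0, 11.2, 13.5,
  2, 16.2, 14.6, 11.4, 12.5, 13.4, 14.6,
  3, 16.1, 12.3, 14.8, 13.9, 14.9, 13.0,
  4, 12.8,  9.9,  9.5,  9.1,  7.7, 10.2
)

# 1. Wide to long: gather the 6 score columns into 1 variable
my_tbl_3 <- gather(my_tbl, key = "key", value = "mean_cesd", mn_cesd_0:mn_cesd_5)
my_tbl_3$occasion <- as.integer(sub("mn_cesd_", "", my_tbl_3$key))  # extract 0 to 5
my_tbl_3 <- my_tbl_3[, c("intervention", "occasion", "mean_cesd")]

# 2. Long to wide: rebuild the column names, then spread
tmp <- my_tbl_3
tmp$key <- paste0("mn_cesd_", tmp$occasion)
tmp$occasion <- NULL
my_tbl_4 <- spread(tmp, key = key, value = mean_cesd)

# Compare the round trip with the original table:
all.equal(as.data.frame(my_tbl), as.data.frame(my_tbl_4))
```

With `mean_cesd` as a single dependent variable, `my_tbl_3` can be passed to ggplot with occasion on the x-axis and intervention as a grouping variable.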

7.4.4 Exercise 4

In previous chapters, we have seen 2 sets of data for the positive psychology experiment: posPsy_AHI_CESD_corrected.csv (in long format) and posPsy_data_wide.csv (in wide format).

(See Section B.1 of Appendix B for details on the data.)

Both of these datasets contain the same information, but one is in long format and one in wide format. With tidyr, we are able to transform the long format into wide format (and vice versa) on our own.

1. From long to wide

  • Load the first file posPsy_AHI_CESD_corrected.csv into a tibble posPsy_long. To make things simpler, drop all columns except id, occasion, intervention, and ahiTotal.

  • Transform the resulting table from long to wide format (spreading ahiTotal values over different occasions).

2. From wide to long

  • Load the second file posPsy_data_wide.csv into a tibble posPsy_wide and drop all variables that contain values of individual happiness or depression items (i.e., all score variables not containing “Total” in their names).

  • Then transform this wide-format tibble into long format. Your resulting table should contain all demographic information (in separate columns), the type of scale (ahiTotal vs. cesdTotal), the occasion number (0 to 5), and the scale value (as the dependent variable).

Hint: First gather all Total variables into a single value variable, then separate the key column into 2 variables scale and occasion.
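The hint can be tried out on a toy table before applying it to the real data. In the following sketch, `toy_wide` is a made-up stand-in for posPsy_wide: its values are invented, and the column names (a scale name and an occasion number, separated by a dot) are an assumption that should be checked against `names(posPsy_wide)`.

```r
library(tibble)
library(tidyr)

# Made-up stand-in for posPsy_wide (column names are assumptions):
toy_wide <- tribble(
  ~id, ~intervention, ~ahiTotal.0, ~ahiTotal.1, ~cesdTotal.0, ~cesdTotal.1,
  1, 4, 63, 73, 14,  6,
  2, 1, 73, 89,  6, 10
)

# Wide to long: gather all Total variables, then split the key at the dot
toy_long <- gather(toy_wide, key = "key", value = "value", ahiTotal.0:cesdTotal.1)
toy_long <- separate(toy_long, key, into = c("scale", "occasion"),
                     sep = "\\.", convert = TRUE)

# Long to wide: spread the values of 1 scale over occasions
ahi_long <- toy_long[toy_long$scale == "ahiTotal",
                     c("id", "intervention", "occasion", "value")]
ahi_wide <- spread(ahi_long, key = occasion, value = value)
```

Setting `convert = TRUE` turns the occasion numbers into integers, and the escaped pattern `"\\."` is needed because `sep` is interpreted as a regular expression.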

7.4.5 Exercise 5

This exercise relies on the main false positive psychology dataset (see Section B.2 of Appendix B for details on the data and corresponding information):

# Import the dataset
library(tidyverse)  # provides read_csv() (from readr)
falsePosPsy_all <- read_csv("http://rpository.com/ds4psy/data/falsePosPsy_all.csv")  # online

# Check:
# dim(falsePosPsy_all)  # 78 x 19
# str(falsePosPsy_all)

Parents’ age?

Let’s see whether we can detect some relationship between the parents’ age values.

  1. Plot the relationship between the age of each participant’s dad and mom (e.g., as a scatterplot).

  2. Plot how many moms and dads have each age (i.e., the distributions of age values among moms vs. dads) using geom_bar.

(Hint: As both ages are in 2 separate variables, you need to use gather to collect the ages of both parents in 1 variable.)

  3. Can you think of a way of plotting the relationship between (or difference between) the age of both parents for each participant?

(Hint: Again, you need to use gather to collect the ages of both parents into 1 variable.)
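The gathering step from both hints might look as follows. This is only a sketch on made-up data: the tibble `toy` and its column names `ID`, `dad` and `mom` are assumptions, so check `names(falsePosPsy_all)` for the actual variable names.

```r
library(tibble)
library(tidyr)
library(ggplot2)

# Made-up stand-in for the relevant columns (names are assumptions):
toy <- tribble(
  ~ID, ~dad, ~mom,
  1, 49, 45,
  2, 53, 51,
  3, 49, 49
)

# Collect both ages in 1 variable, with a key for the type of parent:
toy_long <- gather(toy, key = "parent", value = "age", dad, mom)

# Distributions of age values, side by side for moms and dads:
p_bar <- ggplot(toy_long, aes(x = age, fill = parent)) +
  geom_bar(position = "dodge")

# 1 line per participant, connecting the ages of both parents:
p_line <- ggplot(toy_long, aes(x = parent, y = age, group = ID)) +
  geom_line() +
  geom_point()
```

In the second plot, the slope of each line reflects the age difference between a participant's parents.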

7.4.6 Exercise 6

This is a bonus task — for the ambitious or curious — which requires transforming a dataset with multiple dependent variables into tidy data. This task extends beyond the scope of the current chapter, but can be solved with our current base R and tidyverse commands. (See Section 7.2.6 for additional information.)

Data

The data table exp_wide contains data from \(n = 10\) participants. Each participant completed 2 tasks (and the task position p was randomized). For each task, we measured 2 dependent variables: The correctness c of the response, and the time t (in msec) to complete the task:

Table 7.15: Example data from an experiment containing 2 tasks (and 2 DVs).
subj p_1 p_2 c_1 c_2 t_1 t_2
1 1 2 FALSE FALSE 4873.7 9230.0
2 1 2 FALSE FALSE 3963.9 2948.8
3 2 1 FALSE TRUE 2868.4 8348.3
4 2 1 FALSE FALSE 2561.3 1290.2
5 1 2 TRUE FALSE 5762.2 9330.7
6 1 2 FALSE FALSE 4873.7 9230.0
7 2 1 FALSE TRUE 3963.9 2948.8
8 1 2 TRUE TRUE 2868.4 8348.3
9 2 1 TRUE TRUE 2561.3 1290.2
10 2 1 TRUE FALSE 5762.2 9330.7

Tasks

  1. Import the data from ds4psy::exp_wide (or http://rpository.com/ds4psy/data/exp_wide.csv) into a tibble exp_wide.

  2. Use 2 different ways to transform (or reshape) exp_wide into a table of tidy data exp_tidy.

  3. Verify the equality of both solutions.
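To illustrate what 2 different ways and their comparison could look like, here is a sketch that uses only the first 3 rows of Table 7.15. Both approaches are suggestions rather than the intended solutions: the first relies on the `".value"` option of `pivot_longer()` (which requires tidyr 1.0.0 or later), the second assembles the rows for each task with base R. The names `exp_toy`, `task`, `tidy_1` and `tidy_2` are made up for this example.

```r
library(tibble)
library(tidyr)

# First 3 rows of Table 7.15:
exp_toy <- tribble(
  ~subj, ~p_1, ~p_2, ~c_1,  ~c_2,  ~t_1,   ~t_2,
  1,     1,    2,    FALSE, FALSE, 4873.7, 9230.0,
  2,     1,    2,    FALSE, FALSE, 3963.9, 2948.8,
  3,     2,    1,    FALSE, TRUE,  2868.4, 8348.3
)

# Way 1: split each name into a variable part (p, c, t) and a task number
tidy_1 <- pivot_longer(exp_toy, cols = -subj,
                       names_to = c(".value", "task"), names_sep = "_")

# Way 2: build 1 table per task, then stack and sort them
task_1 <- tibble(subj = exp_toy$subj, task = "1",
                 p = exp_toy$p_1, c = exp_toy$c_1, t = exp_toy$t_1)
task_2 <- tibble(subj = exp_toy$subj, task = "2",
                 p = exp_toy$p_2, c = exp_toy$c_2, t = exp_toy$t_2)
tidy_2 <- rbind(task_1, task_2)
tidy_2 <- tidy_2[order(tidy_2$subj, tidy_2$task), ]

# Verify that both solutions agree:
all.equal(as.data.frame(tidy_1), as.data.frame(tidy_2))
```

Both results contain 1 row per participant and task, so that p, c and t each appear as a single variable of a consistent type.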

This concludes our exercises on tidying data with tidyr.


Footnote to Exercise 2: In case you find financial data not psychological enough, imagine that the data describe 2 daily mood measurements of 3 people owning stocks…