8.4 Exercises

ds4psy: Exercises 8

The following exercises practice the essential dplyr commands on joining two tables.

8.4.1 Exercise 1

Given the definitions of data_t1 and data_t2 (also available in the ds4psy package), and its variants t1 to t4 above, evaluate and explain the result of:

all.equal(left_join(data_t1, data_t2), right_join(data_t2, data_t1))
all.equal(full_join(data_t1, data_t2), inner_join(data_t2, data_t1))
all.equal(data_t1, semi_join(data_t1, data_t2))
all.equal(t1, full_join(semi_join(t1, t3), anti_join(t1, t3)))

Hint: If the base R function all.equal() complains about differences between tables that you expect to be equal, you may have to arrange their rows and columns before comparing them. Alternatively, use the dplyr function all_equal() to check the equality of data frames or tibbles.

Note

We load data_t1 and data_t2 (from the ds4psy package or the CSV files online) and arrange them to re-create t1 and t2:

# Load data:
data_t1 <- ds4psy::data_t1  # from ds4psy package
data_t2 <- ds4psy::data_t2  # from ds4psy package
# data_t1 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t1.csv")  # online
# data_t2 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t2.csv")  # online

# Copy data:
t1 <- data_t1
t2 <- data_t2

# Arrange tables:
t1 <- t1 %>% arrange(name, gender)
t2 <- t2 %>% arrange(name, gender)

# Inspect data:
# t1
# t2

Alternatively, we can load t3.csv and t4.csv from the ds4psy package (or the CSV files online):

# Re-load tables t3 and t4 (from online source):
t3 <- ds4psy::t3  # from ds4psy package
t4 <- ds4psy::t4  # from ds4psy package
# t3 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/t3.csv")  # online
# t4 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/t4.csv")  # online

# Inspect data:
# t3
# t4

We now can evaluate the statements of this exercise and explain the results returned…

8.4.2 Exercise 2

In the previous Chapter 7 on Tidying data we encountered the tidyr::table4a and tidyr::table4b.

Join these tables into one table that contains the same information as tidyr::table1.
Use tidyr and dplyr commands to transform the resulting join into tidyr::table1.

8.4.3 Exercise 3

This exercise uses the following data files:

ds4psy::data_t3 (also available as CSV file from http://rpository.com/ds4psy/data/data_t3.csv)
ds4psy::data_t4 (also available as CSV file from http://rpository.com/ds4psy/data/data_t4.csv)

Preparation

Import both tables of data (as data_t3 and data_t4) and check out each table individually, before comparing them to each other. Do both tables contain the same cases and variables?

Joining tables

Join both tables by using only the name variable of data_t3 and initials variable of data_t4 as keys. Why is this difficult or impossible?

Alternative join

Use a different command to join both tables into a 20 x 6 table that contains all data in one table.

8.4.4 Exercise 4

This exercise uses the datasets on web-based interventions in positive psychology that we have used before. (See Section B.1 of Appendix B for details on the data.)

So far, we always distinguished between three main data files:

ds4psy::posPsy_p_info (also available as posPsy_participants.csv: Original participant data, 295 x 6 variables:
http://rpository.com/ds4psy/data/posPsy_participants.csv).
ds4psy::posPsy_long (also available as posPsy_AHI_CESD_corrected.csv: Corrected version of dependent measures in long format, 990 x 50 variables:
http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv).
ds4psy::posPsy_wide (also available as posPsy_data_wide.csv: Corrected version of all data joined in wide format, 295 x 294 variables:
http://rpository.com/ds4psy/data/posPsy_data_wide.csv).
Different measurement occasions are suffixed by .0, .1, …, .5.

Given your knowledge of dplyr and tidyr, we can now transform ds4psy::posPsy_long (posPsy_AHI_CESD_corrected.csv) and combine its parts with ds4psy::posPsy_p_info (posPsy_participants.csv) to obtain ds4psy::posPsy_wide (posPsy_data_wide.csv). Do this in the following steps:

Preparation

Import the data files posPsy_participants.csv as p_info, posPsy_AHI_CESD_corrected.csv as AHI_CESD, and posPsy_data_wide.csv as data_wide, and inspect the dimensions of these tables.

# Load data:
p_info    <- ds4psy::posPsy_p_info    # from ds4psy package 
AHI_CESD  <- ds4psy::posPsy_long
data_wide <- ds4psy::posPsy_wide

# p_info    <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_participants.csv")  # online 
# AHI_CESD  <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv") 
# data_wide <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_data_wide.csv") 

# Check: 
dim(p_info)     # 295 x   6 
#> [1] 295   6
dim(AHI_CESD)   # 990 x  50
#> [1] 990  50
dim(data_wide)  # 295 x 294
#> [1] 295 294

Extracting occasions

Create six individual tibbles that contains the data of each participant (one row per id) seperately for each occasion (0 to 5).

Joining tables

Combine the variables of the 6 tibbles into one wide block (in wide format) and compare your resulting table with data_wide.

Hint: As you cannot combine all six blocks in one command, they have to be combined in a stepwise fashion.

Cleaning up

As you might have noticed the number of columns of our newly created data set and the data in wide format (data_wide) are not the same. That is because:

information on each participants’ intervention group is included six times in our created data;
data_wide also includes the information on the participants (contained in p_info).

Exclude any redundant variables on the intervention group (keeping this variable only once) and include the participant information from p_info to create a new table complete_data. Check if and ensure that complete_data is equal to data_wide.

Alternative join

How would you change the steps to join the six blocks (in 3.) if you only wanted to keep the participants who took part on all six possible occasions? How many of them exist?

Filter join

Use a dplyr summary table on the data in AHI_CESD to determine the id of all participants that took part in all six occasions. Then use a filter join on p_info and on AHI_CESD to keep only the (rows of) data of these participants.

8.4.5 Exercise 5

This exercise relies on the raw data for the false positive psychology (see Section B.2 of Appendix B for details on the data and corresponding information). The original data can be obtained at https://openpsychologydata.metajnl.com/articles/10.5334/jopd.aa/ or https://zenodo.org/record/7664.

Preparation

Download the two raw data files and load them into R as study_1 and study_2 (fixing any obvious errors).

Extracting occasions

As both data files contain different participants, but the same variables about each participant, we want to save all data in one table. Combine both datasets, but mark each dataset by a study variable (that has a value of either 1 or 2).

Cleaning up

Identify individuals and arrange the dataset: Add a numeric ID variable identifying each individual participant and sort the dataset in an intuitive way.

Splitting and joining

The following code splits the data of falsePosPsy into two parts (pt_1 and pt_2):

# Data:
# falsePosPsy <- readr::read_csv("http://rpository.com/ds4psy/data/falsePosPsy_all.csv")  # online
falsePosPsy <- ds4psy::falsePosPsy_all  # ds4psy pkg
dim(falsePosPsy)
#> [1] 78 19

# Create 2 different datasets: 
pt_1 <- falsePosPsy %>%
  select(study, ID, aged, aged365, female, dad, mom)

pt_2 <- falsePosPsy %>%
  select(ID, potato, when64, kalimba, cond, root, bird, political, 
         quarterback, olddays, feelold, computer, diner)

Re-create the full data set (from pt_1 and pt_2) in at least three different ways (and verify that each of them yields the same data as falsePosPsy).

This concludes our exercises on using dplyr commands for joining two tables.