The following exercises practice the essential dplyr commands on joining two tables.
8.4.1 Exercise 1
all.equal(left_join(data_t1, data_t2), right_join(data_t2, data_t1))
all.equal(full_join(data_t1, data_t2), inner_join(data_t2, data_t1))
all.equal(data_t1, semi_join(data_t1, data_t2))
all.equal(t1, full_join(semi_join(t1, t3), anti_join(t1, t3)))
Hint: If the base R function
all.equal() complains about differences between tables that you expect to be equal, you may have to arrange their rows and columns before comparing them. Alternatively, use the dplyr function
all_equal() to check the equality of data frames or tibbles.
- We load data_t1 and data_t2 (from the ds4psy package or the CSV files online)
and arrange them to re-create
# Load data: <- ds4psy::data_t1 # from ds4psy package data_t1 <- ds4psy::data_t2 # from ds4psy package data_t2 # data_t1 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t1.csv") # online # data_t2 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t2.csv") # online # Copy data: <- data_t1 t1 <- data_t2 t2 # Arrange tables: <- t1 %>% arrange(name, gender) t1 <- t2 %>% arrange(name, gender) t2 # Inspect data: # t1 # t2
# Re-load tables t3 and t4 (from online source): <- ds4psy::t3 # from ds4psy package t3 <- ds4psy::t4 # from ds4psy package t4 # t3 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/t3.csv") # online # t4 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/t4.csv") # online # Inspect data: # t3 # t4
We now can evaluate the statements of this exercise and explain the results returned…
8.4.2 Exercise 2
Join these tables into one table that contains the same information as
Use tidyr and dplyr commands to transform the resulting join into
8.4.3 Exercise 3
This exercise uses the following data files:
ds4psy::data_t3(also available as CSV file from http://rpository.com/ds4psy/data/data_t3.csv)
ds4psy::data_t4(also available as CSV file from http://rpository.com/ds4psy/data/data_t4.csv)
- Import both tables of data (as
data_t4) and check out each table individually, before comparing them to each other. Do both tables contain the same cases and variables?
- Join both tables by using only the
data_t4as keys. Why is this difficult or impossible?
- Use a different command to join both tables into a 20 x 6 table that contains all data in one table.
8.4.4 Exercise 4
So far, we always distinguished between three main data files:
ds4psy::posPsy_p_info(also available as
posPsy_participants.csv: Original participant data, 295 x 6 variables:
ds4psy::posPsy_long(also available as
posPsy_AHI_CESD_corrected.csv: Corrected version of dependent measures in long format, 990 x 50 variables:
ds4psy::posPsy_wide(also available as
posPsy_data_wide.csv: Corrected version of all data joined in wide format, 295 x 294 variables:
Different measurement occasions are suffixed by .0, .1, …, .5.
Given your knowledge of dplyr and tidyr, we can now transform
posPsy_AHI_CESD_corrected.csv) and combine its parts with
posPsy_participants.csv) to obtain
Do this in the following steps:
- Import the data files
data_wide, and inspect the dimensions of these tables.
# Load data: <- ds4psy::posPsy_p_info # from ds4psy package p_info <- ds4psy::posPsy_long AHI_CESD <- ds4psy::posPsy_wide data_wide # p_info <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_participants.csv") # online # AHI_CESD <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv") # data_wide <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_data_wide.csv") # Check: dim(p_info) # 295 x 6 #>  295 6 dim(AHI_CESD) # 990 x 50 #>  990 50 dim(data_wide) # 295 x 294 #>  295 294
- Create six individual tibbles that contains the data of each participant (one row per
id) seperately for each
occasion(0 to 5).
- Combine the variables of the 6 tibbles into one wide block (in wide format) and compare your resulting table with data_wide.
Hint: As you cannot combine all six blocks in one command, they have to be combined in a stepwise fashion.
- As you might have noticed the number of columns of our newly created data set and the data in wide format (
data_wide) are not the same. That is because:
information on each participants’ intervention group is included six times in our created data;
data_widealso includes the information on the participants (contained in
Exclude any redundant variables on the
intervention group (keeping this variable only once) and include the participant information from
p_info to create a new table
complete_data. Check if and ensure that
complete_data is equal to
- How would you change the steps to join the six blocks (in 3.) if you only wanted to keep the participants who took part on all six possible occasions? How many of them exist?
- Use a dplyr summary table on the data in
AHI_CESDto determine the
idof all participants that took part in all six occasions. Then use a filter join on
AHI_CESDto keep only the (rows of) data of these participants.
8.4.5 Exercise 5
This exercise relies on the raw data for the false positive psychology (see Section B.2 of Appendix B for details on the data and corresponding information). The original data can be obtained at https://openpsychologydata.metajnl.com/articles/10.5334/jopd.aa/ or https://zenodo.org/record/7664.
- Download the two raw data files and load them into R as
study_2(fixing any obvious errors).
- As both data files contain different participants, but the same variables about each participant, we want to save all data in one table.
Combine both datasets, but mark each dataset by a
studyvariable (that has a value of either 1 or 2).
- Identify individuals and arrange the dataset: Add a numeric
IDvariable identifying each individual participant and sort the dataset in an intuitive way.
Splitting and joining
- The following code splits the data of
falsePosPsyinto two parts (
# Data: # falsePosPsy <- readr::read_csv("http://rpository.com/ds4psy/data/falsePosPsy_all.csv") # online <- ds4psy::falsePosPsy_all # ds4psy pkg falsePosPsy dim(falsePosPsy) #>  78 19 # Create 2 different datasets: <- falsePosPsy %>% pt_1 select(study, ID, aged, aged365, female, dad, mom) <- falsePosPsy %>% pt_2 select(ID, potato, when64, kalimba, cond, root, bird, political, quarterback, olddays, feelold, computer, diner)
- Re-create the full data set (from
pt_2) in at least three different ways (and verify that each of them yields the same data as
This concludes our exercises on using dplyr commands for joining two tables.