8.4 Exercises
The following exercises practice the essential dplyr commands on joining two tables.
8.4.1 Exercise 1
Given the definitions of data_t1 and data_t2 (also available in the ds4psy package), and its variants t1
to t4
above, evaluate and explain the result of:
all.equal(left_join(data_t1, data_t2), right_join(data_t2, data_t1))
all.equal(full_join(data_t1, data_t2), inner_join(data_t2, data_t1))
all.equal(data_t1, semi_join(data_t1, data_t2))
all.equal(t1, full_join(semi_join(t1, t3), anti_join(t1, t3)))
Hint: If the base R function all.equal()
complains about differences between tables that you expect to be equal, you may have to arrange their rows and columns before comparing them. Alternatively, use the dplyr function all_equal()
to check the equality of data frames or tibbles.
Note
- We load data_t1 and data_t2 (from the ds4psy package or the CSV files online)
and arrange them to re-create
t1
andt2
:
# Load data:
<- ds4psy::data_t1 # from ds4psy package
data_t1 <- ds4psy::data_t2 # from ds4psy package
data_t2 # data_t1 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t1.csv") # online
# data_t2 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t2.csv") # online
# Copy data:
<- data_t1
t1 <- data_t2
t2
# Arrange tables:
<- t1 %>% arrange(name, gender)
t1 <- t2 %>% arrange(name, gender)
t2
# Inspect data:
# t1
# t2
# Re-load tables t3 and t4 (from online source):
<- ds4psy::t3 # from ds4psy package
t3 <- ds4psy::t4 # from ds4psy package
t4 # t3 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/t3.csv") # online
# t4 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/t4.csv") # online
# Inspect data:
# t3
# t4
We now can evaluate the statements of this exercise and explain the results returned…
8.4.2 Exercise 2
In the previous Chapter 7 on Tidying data we encountered the tidyr::table4a
and tidyr::table4b
.
Join these tables into one table that contains the same information as
tidyr::table1
.Use tidyr and dplyr commands to transform the resulting join into
tidyr::table1
.
8.4.3 Exercise 3
This exercise uses the following data files:
ds4psy::data_t3
(also available as CSV file from http://rpository.com/ds4psy/data/data_t3.csv)
ds4psy::data_t4
(also available as CSV file from http://rpository.com/ds4psy/data/data_t4.csv)
Preparation
- Import both tables of data (as
data_t3
anddata_t4
) and check out each table individually, before comparing them to each other. Do both tables contain the same cases and variables?
Joining tables
- Join both tables by using only the
name
variable ofdata_t3
andinitials
variable ofdata_t4
as keys. Why is this difficult or impossible?
Alternative join
- Use a different command to join both tables into a 20 x 6 table that contains all data in one table.
8.4.4 Exercise 4
This exercise uses the datasets on web-based interventions in positive psychology that we have used before. (See Section B.1 of Appendix B for details on the data.)
So far, we always distinguished between three main data files:
ds4psy::posPsy_p_info
(also available asposPsy_participants.csv
: Original participant data, 295 x 6 variables:
http://rpository.com/ds4psy/data/posPsy_participants.csv).ds4psy::posPsy_long
(also available asposPsy_AHI_CESD_corrected.csv
: Corrected version of dependent measures in long format, 990 x 50 variables:
http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv).ds4psy::posPsy_wide
(also available asposPsy_data_wide.csv
: Corrected version of all data joined in wide format, 295 x 294 variables:
http://rpository.com/ds4psy/data/posPsy_data_wide.csv).
Different measurement occasions are suffixed by .0, .1, …, .5.
Given your knowledge of dplyr and tidyr, we can now transform ds4psy::posPsy_long
(posPsy_AHI_CESD_corrected.csv
) and combine its parts with ds4psy::posPsy_p_info
(posPsy_participants.csv
) to obtain ds4psy::posPsy_wide
(posPsy_data_wide.csv
).
Do this in the following steps:
Preparation
- Import the data files
posPsy_participants.csv
asp_info
,posPsy_AHI_CESD_corrected.csv
asAHI_CESD
, andposPsy_data_wide.csv
asdata_wide
, and inspect the dimensions of these tables.
# Load data:
<- ds4psy::posPsy_p_info # from ds4psy package
p_info <- ds4psy::posPsy_long
AHI_CESD <- ds4psy::posPsy_wide
data_wide
# p_info <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_participants.csv") # online
# AHI_CESD <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv")
# data_wide <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_data_wide.csv")
# Check:
dim(p_info) # 295 x 6
#> [1] 295 6
dim(AHI_CESD) # 990 x 50
#> [1] 990 50
dim(data_wide) # 295 x 294
#> [1] 295 294
Extracting occasions
- Create six individual tibbles that contains the data of each participant (one row per
id
) seperately for eachoccasion
(0 to 5).
Joining tables
- Combine the variables of the 6 tibbles into one wide block (in wide format) and compare your resulting table with data_wide.
Hint: As you cannot combine all six blocks in one command, they have to be combined in a stepwise fashion.
Cleaning up
- As you might have noticed the number of columns of our newly created data set and the data in wide format (
data_wide
) are not the same. That is because:
information on each participants’ intervention group is included six times in our created data;
data_wide
also includes the information on the participants (contained inp_info
).
Exclude any redundant variables on the intervention
group (keeping this variable only once) and include the participant information from p_info
to create a new table complete_data
. Check if and ensure that complete_data
is equal to data_wide
.
Alternative join
- How would you change the steps to join the six blocks (in 3.) if you only wanted to keep the participants who took part on all six possible occasions? How many of them exist?
Filter join
- Use a dplyr summary table on the data in
AHI_CESD
to determine theid
of all participants that took part in all six occasions. Then use a filter join onp_info
and onAHI_CESD
to keep only the (rows of) data of these participants.
8.4.5 Exercise 5
This exercise relies on the raw data for the false positive psychology (see Section B.2 of Appendix B for details on the data and corresponding information). The original data can be obtained at https://openpsychologydata.metajnl.com/articles/10.5334/jopd.aa/ or https://zenodo.org/record/7664.
Preparation
- Download the two raw data files and load them into R as
study_1
andstudy_2
(fixing any obvious errors).
Extracting occasions
- As both data files contain different participants, but the same variables about each participant, we want to save all data in one table.
Combine both datasets, but mark each dataset by a
study
variable (that has a value of either 1 or 2).
Cleaning up
- Identify individuals and arrange the dataset: Add a numeric
ID
variable identifying each individual participant and sort the dataset in an intuitive way.
Splitting and joining
- The following code splits the data of
falsePosPsy
into two parts (pt_1
andpt_2
):
# Data:
# falsePosPsy <- readr::read_csv("http://rpository.com/ds4psy/data/falsePosPsy_all.csv") # online
<- ds4psy::falsePosPsy_all # ds4psy pkg
falsePosPsy dim(falsePosPsy)
#> [1] 78 19
# Create 2 different datasets:
<- falsePosPsy %>%
pt_1 select(study, ID, aged, aged365, female, dad, mom)
<- falsePosPsy %>%
pt_2 select(ID, potato, when64, kalimba, cond, root, bird, political,
quarterback, olddays, feelold, computer, diner)
- Re-create the full data set (from
pt_1
andpt_2
) in at least three different ways (and verify that each of them yields the same data asfalsePosPsy
).
This concludes our exercises on using dplyr commands for joining two tables.