A.8 Solutions (08)
Here are the solutions to the exercises on using essential dplyr commands for joining two tables (from Section 8.4 of Chapter 8).
A.8.1 Exercise 1
Given the definitions of data_t1 and data_t2,
and t1
to t4
above, evaluate and explain the result of:
all.equal(left_join(data_t1, data_t2), right_join(data_t2, data_t1))
all.equal(full_join(data_t1, data_t2), inner_join(data_t2, data_t1))
all.equal(data_t1, semi_join(data_t1, data_t2))
all.equal(t1, full_join(semi_join(t1, t3), anti_join(t1, t3)))
Hint: If the base R function all.equal()
complains about differences between tables that you expect to be equal, you may have to arrange their rows and columns before comparing them. Alternatively, use the dplyr function all_equal()
to check the equality of data frames or tibbles.
Note
To do this exercise, we first re-load and re-create various data tables from the chapter (see Section 8.2.1 above):
- We load data_t1 and data_t2 (from the ds4psy package or the CSV files online) and arrange them to re-create
t1
andt2
:
# Load data:
<- ds4psy::data_t1 # from ds4psy package
data_t1 <- ds4psy::data_t2 # from ds4psy package
data_t2 # data_t1 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t1.csv") # online
# data_t2 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t2.csv") # online
# Copy data:
<- data_t1
t1 <- data_t2
t2
# Arrange tables:
<- t1 %>% arrange(name, gender)
t1 <- t2 %>% arrange(name, gender)
t2
# Inspect data:
# t1
# t2
- To obtain
t3
andt4
, we can either re-create them (with the code used above):
# Re-create t3 and t4: ------
<- data_t1 # copy
t3 <- data_t2 # copy
t4
# Modify both tables: ------
# (a) Draw n random rows (samples):
set.seed(12) # for replicability
<- 8
n <- sample(1:nrow(t3), size = n, replace = FALSE) # n random values from 1 to nrow(t3)
s1 <- sample(1:nrow(t4), size = n, replace = FALSE) # n random values from 1 to nrow(t4)
s2
# Reduce rows and arrange both tables:
<- t3[s1, ]
t3 <- t4[s2, ]
t4
# (b) Duplicate m random rows:
set.seed(23) # for replicability
<- 2
m <- sample(1:nrow(t3), size = m, replace = FALSE) # n random values from 1 to nrow(t3)
d1 <- sample(1:nrow(t4), size = m, replace = FALSE) # n random values from 1 to nrow(t4)
d2
# Duplicate sampled rows:
<- rbind(t3, t3[d1, ])
t3 <- rbind(t4, t4[d2, ])
t4
# Arrange tables:
<- t3 %>% arrange(name, gender)
t3 <- t4 %>% arrange(name, gender) t4
- Alternatively, we can load two CSV files (i.e., t3.csv and t4.csv) from an online server or simply get the files from the ds4psy package:
# Re-load tables t3 and t4 (from online source):
# t3 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/t3.csv") # online
# t4 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/t4.csv") # online
<- ds4psy::t3 # from ds4psy package
t3 <- ds4psy::t4 # from ds4psy package t4
- As a summary, here are the current definitions of
data_t1
anddata_t2
, and their four variantst1
,t2
,t3
andt4
:
# Show current data tables:
# data_t1 and data_t2:
data_t1
data_t2
# Sorted versions t1 and t2:
t1
t2
# Variants t3 and t4:
t3 t4
Solutions
We now can evaluate the statements of this exercise and explain the results returned:
# 1. left_join and right_join:
all_equal(left_join(data_t1, data_t2), right_join(data_t2, data_t1)) # => TRUE,
#> [1] TRUE
# as left_join(x, y) is the same as right_join(y, x) irrespective of x and y.
# Note: In case you get strange inequalities about mismatches in
# "Attributes:", you may have to sort the rows and columns
# of both tibbles in the same way prior to comparing them.
<- left_join(data_t1, data_t2) %>%
j1 arrange(name, gender) %>%
select(name, gender, bnt_1, bnt_2, like_1, like_2)
<- right_join(data_t2, data_t1) %>%
j2 arrange(name, gender) %>%
select(name, gender, bnt_1, bnt_2, like_1, like_2)
all.equal(j1, j2) # should be TRUE.
#> [1] TRUE
# 2. full_join and inner_join:
all_equal(full_join(data_t1, data_t2), inner_join(data_t2, data_t1)) # => TRUE,
#> [1] TRUE
# as data_t1 and data_t2 contain the same cases (despite different orders).
# Proof: See comparison of arranged tables (from above):
all.equal(t1$name, t2$name)
#> [1] TRUE
all.equal(t1$gender, t2$gender)
#> [1] TRUE
# 3. semi_join:
all_equal(data_t1, semi_join(data_t1, data_t2)) # => TRUE,
#> [1] TRUE
# as data_t1 and data_t2 contain the same cases (despite different orders).
# Hence, semi_join does not drop any cases of data_t1 and remains the same as data_t1.
# 4. semi_join and anti_join:
all_equal(t1, full_join(semi_join(t1, t3), anti_join(t1, t3))) # => TRUE,
#> [1] TRUE
# Explanation:
# - semi_join(t1, t3) # contains 8 cases of t1 with a match in t3
# - anti_join(t1, t3) # contains 12 cases of t1 with NO match in t3
# - full_join() [or rbind()] combine both parts (rows) into the cases of t1 again.
A.8.2 Exercise 2
In the previous Chapter 7 on Tidying data we encountered the tidyr::table4a
and tidyr::table4b
.
- Join these tables into one table that contains the same information as
tidyr::table1
.
Solution
# Data:
<- tidyr::table4a
t4a <- tidyr::table4b
t4b
# Inspection:
t4a#> # A tibble: 3 × 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
t4b#> # A tibble: 3 × 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
# Analysis: Both 3 x 3 tables contain the same 3 instances of "country".
# When we join them by "country", all 4 mutating joins will yield the same result.
# However, we must make sure that only "country" is used as key,
# as the other 2 matching variable names (`1999` and `2000`) denote different variables
# in different tables.
# 1. Alternative solutions for joining both tables: ------
<- left_join(t4a, t4b, by = "country", suffix = c("_cases", "_population"))
j1 <- right_join(t4a, t4b, by = "country", suffix = c("_cases", "_population"))
j2 <- full_join(t4a, t4b, by = "country", suffix = c("_cases", "_population"))
j3 <- inner_join(t4a, t4b, by = "country", suffix = c("_cases", "_population"))
j4
# Show equality:
all_equal(j1, j2)
#> [1] TRUE
all_equal(j1, j3)
#> [1] TRUE
all_equal(j1, j4)
#> [1] TRUE
# Print result:
::kable(j1, caption = "Joined table of table4a and table4b.") knitr
country | 1999_cases | 2000_cases | 1999_population | 2000_population |
---|---|---|---|---|
Afghanistan | 745 | 2666 | 19987071 | 20595360 |
Brazil | 37737 | 80488 | 172006362 | 174504898 |
China | 212258 | 213766 | 1272915272 | 1280428583 |
- Use tidyr and dplyr commands to transform the resulting join into
tidyr::table1
.
Solution
# 2. Tidying/transforming result table: ------
<- j1 %>%
j1_trans ::gather(key = "key", value = "count", `1999_cases`:`2000_population`) %>%
tidyr::separate(col = key, into = c("year", "type")) %>%
tidyr::arrange(country, year) %>%
dplyr::spread(key = type, value = count)
tidyr
j1_trans#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <chr> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# Change variable type of year:
$year <- as.integer(j1_trans$year) # from character to integer
j1_trans
# Show equality:
all_equal(tidyr::table1, j1_trans) # TRUE
#> [1] TRUE
A.8.3 Exercise 3
This exercise uses the following data files:
ds4psy::data_t3
(also available as CSV file from http://rpository.com/ds4psy/data/data_t3.csv)
ds4psy::data_t4
(also available as CSV file from http://rpository.com/ds4psy/data/data_t4.csv)
Preparation
- Import both tables of data (as
data_t3
anddata_t4
) and check out each table individually, before comparing them to each other. Do both tables contain the same cases and variables?
Solution
# Load data:
<- ds4psy::data_t3 # from ds4psy package
data_t3 <- ds4psy::data_t4 # from ds4psy package
data_t4 # data_t3 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t3.csv") # online
# data_t4 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t4.csv") # online
# Inspect tables:
# 20 x 4
data_t3 #> # A tibble: 20 × 4
#> name gender like_1 bnt_1
#> <chr> <chr> <dbl> <dbl>
#> 1 R.S. female 2 NA
#> 2 M.Y. male 2 4
#> 3 K.R. male 4 1
#> 4 A.V. female 3 2
#> 5 Z.Y. female NA NA
#> 6 X.D. female 4 4
#> 7 A.F. female 5 1
#> 8 X.Y. female 5 4
#> 9 A.V. male 3 3
#> 10 M.O. male 4 1
#> 11 T.V. male 6 4
#> 12 X.P. female 4 1
#> 13 Z.D. male 5 4
#> 14 T.D. male 5 2
#> 15 C.N. female 4 3
#> 16 H.V. female 1 4
#> 17 Q.N. female 6 1
#> 18 Y.G. male 4 4
#> 19 L.S. male 4 4
#> 20 L.V. female 4 2
# 20 x 4
data_t4 #> # A tibble: 20 × 4
#> initials sex like_2 bnt_2
#> <chr> <chr> <dbl> <dbl>
#> 1 X.D. female 5 1
#> 2 T.D. male 3 NA
#> 3 Z.D. male 7 4
#> 4 L.S. male 7 4
#> 5 H.V. female 5 2
#> 6 Q.N. female 6 2
#> 7 X.P. female 4 1
#> 8 K.R. male 6 1
#> 9 R.S. female 7 3
#> 10 C.N. female 2 2
#> 11 Z.Y. female 7 1
#> 12 T.V. male 4 4
#> 13 Y.G. male 4 4
#> 14 M.O. male NA NA
#> 15 M.Y. male 6 2
#> 16 A.F. female 4 1
#> 17 A.V. male 5 4
#> 18 L.V. female 2 4
#> 19 X.Y. female 5 2
#> 20 A.V. female 5 3
# Both tables seem similar, but do not share variable names.
# Are names and initials identical?
# Arrange both tables by first 2 variables:
<- data_t3 %>% arrange(name, gender)
data_t3 <- data_t4 %>% arrange(initials, sex)
data_t4
all.equal(data_t3$name, data_t4$initials) # TRUE
#> [1] TRUE
all.equal(data_t3$gender, data_t4$sex) # TRUE
#> [1] TRUE
# => Both tables seem to contain the same people as cases!
Joining tables
- Join both tables by using only the
name
variable ofdata_t3
andinitials
variable ofdata_t4
as keys. Why is this difficult or impossible?
Solution
# Copy data: Make sure that cases are arranged in the same way:
<- data_t3 %>% arrange(name, gender)
t3 <- data_t4 %>% arrange(initials, sex)
t4
# Note 1:
# left_join(t3, t4) # would yield an error:
# `by` required, because the data sources have no common variables
# Let's specify both primary and foreign key as:
# by = c("name" = "initials")
# Note 2: Both of the following
left_join(t3, t4, by = c("name" = "initials"))
#> # A tibble: 22 × 7
#> name gender like_1 bnt_1 sex like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 A.F. female 5 1 female 4 1
#> 2 A.V. female 3 2 female 5 3
#> 3 A.V. female 3 2 male 5 4
#> 4 A.V. male 3 3 female 5 3
#> 5 A.V. male 3 3 male 5 4
#> 6 C.N. female 4 3 female 2 2
#> 7 H.V. female 1 4 female 5 2
#> 8 K.R. male 4 1 male 6 1
#> 9 L.S. male 4 4 male 7 4
#> 10 L.V. female 4 2 female 2 4
#> # … with 12 more rows
full_join(t3, t4, by = c("name" = "initials"))
#> # A tibble: 22 × 7
#> name gender like_1 bnt_1 sex like_2 bnt_2
#> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 A.F. female 5 1 female 4 1
#> 2 A.V. female 3 2 female 5 3
#> 3 A.V. female 3 2 male 5 4
#> 4 A.V. male 3 3 female 5 3
#> 5 A.V. male 3 3 male 5 4
#> 6 C.N. female 4 3 female 2 2
#> 7 H.V. female 1 4 female 5 2
#> 8 K.R. male 4 1 male 6 1
#> 9 L.S. male 4 4 male 7 4
#> 10 L.V. female 4 2 female 2 4
#> # … with 12 more rows
# yield 22 x 7 tables.
# Problem: The name/initials "A.V." appears TWICE data_t3 and data_t4:
length(unique(data_t3$name)) # => only 19 unique names
#> [1] 19
$name[duplicated(data_t3$name)] # => "A.V." appears twice.
data_t3#> [1] "A.V."
# ==> The combination lists 2x2 instances of "A.V.".
Alternative join
- Use a different command to join both tables into a 20 x 6 table that contains all data in one table.
Solution
The following solution first re-names two variables in one table (to have two matching key variables):
# Copy data: Make sure that cases are arranged in the same way:
<- data_t3 %>% arrange(name, gender)
t3 <- data_t4 %>% arrange(initials, sex)
t4
# Solution 1:
# (a) Rename both key variables in one table:
names(t4)[1:2] <- names(t3)[c(1, 2)]
# (b) Join both tables using both key variables:
<- left_join(t3, t4, by = c("name", "gender")) # => 20 x 6
s1 <- right_join(t3, t4, by = c("name", "gender")) # => 20 x 6
s2 <- full_join(t3, t4, by = c("name", "gender")) # => 20 x 6
s3 <- inner_join(t3, t4, by = c("name", "gender")) # => 20 x 6
s4
# Show equality of solutions:
all_equal(s1, s2)
#> [1] TRUE
all_equal(s1, s3)
#> [1] TRUE
all_equal(s1, s4)
#> [1] TRUE
## Print solution:
# s1
An alternative solution pre-supposes that both tables contain the same cases in the same order and introduces a temporary unique key variable (e.g., the row number) as key to link both tables:
# Copy data: Make sure that cases are arranged in the same way:
<- data_t3 %>% arrange(name, gender)
t3 <- data_t4 %>% arrange(initials, sex)
t4
# Solution 2: When we know that
# - both tables contain the same cases AND
# - both tables have all cases arranged in the same way:
all.equal(t3$name, t4$initials) # TRUE
#> [1] TRUE
# Add a new variable (with unique values) to both tables:
$nr <- 1:nrow(t3) # add unique row number to each row
t3$nr <- 1:nrow(t4) # add unique row number to each row
t4
# Use this new and unambiguous variable as key:
<- left_join(t3, t4, by = c("nr")) %>% # use nr as key
s5 select(name:bnt_1, like_2, bnt_2) # and drop some variables => 20 x 6
<- full_join(t3, t4, by = c("nr")) %>% # use nr as key
s6 select(name:bnt_1, like_2, bnt_2) # and drop some variables => 20 x 6
# Show equality of solutions:
all_equal(s1, s5)
#> [1] TRUE
all_equal(s1, s6)
#> [1] TRUE
A.8.4 Exercise 4
This exercise uses the datasets on web-based interventions in positive psychology that we have used before. (See Section B.1 of Appendix B for details on the data.)
So far, we always distinguished between three main data files:
ds4psy::posPsy_p_info
(also available asposPsy_participants.csv
: Original participant data, 295 x 6 variables:
http://rpository.com/ds4psy/data/posPsy_participants.csv).ds4psy::posPsy_long
(also available asposPsy_AHI_CESD_corrected.csv
: Corrected version of dependent measures in long format, 990 x 50 variables:
http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv).ds4psy::posPsy_wide
(also available asposPsy_data_wide.csv
: Corrected version of all data joined in wide format, 295 x 294 variables:
http://rpository.com/ds4psy/data/posPsy_data_wide.csv).
Different measurement occasions are suffixed by .0, .1, …, .5.
Given your knowledge of dplyr and tidyr, we can now transform ds4psy::posPsy_long
(posPsy_AHI_CESD_corrected.csv
) and combine its parts with ds4psy::posPsy_p_info
(posPsy_participants.csv
) to obtain ds4psy::posPsy_wide
(posPsy_data_wide.csv
).
Do this in the following steps:
Preparation
- Import the data files
posPsy_participants.csv
asp_info
,posPsy_AHI_CESD_corrected.csv
asAHI_CESD
, andposPsy_data_wide.csv
asdata_wide
, and inspect the dimensions of these tables.
# Data:
<- ds4psy::posPsy_p_info # from ds4psy package
p_info <- ds4psy::posPsy_long
AHI_CESD <- ds4psy::posPsy_wide
data_wide
# p_info <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_participants.csv") # online
# AHI_CESD <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv")
# data_wide <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_data_wide.csv")
# Check:
dim(p_info) # 295 x 6
#> [1] 295 6
dim(AHI_CESD) # 990 x 50
#> [1] 990 50
dim(data_wide) # 295 x 294
#> [1] 295 294
Extracting occasions
- Create six individual tibbles that contains the data of each participant (one row per
id
) seperately for eachoccasion
(0 to 5).
Solution
# Create 6 blocks
# (for occasion 0 to 5):
<- AHI_CESD %>%
occ_0 group_by(id, occasion) %>%
filter(occasion == 0)
dim(occ_0) # 295 x 50
#> [1] 295 50
<- AHI_CESD %>%
occ_1 group_by(id, occasion) %>%
filter(occasion == 1)
dim(occ_1) # 148 x 50
#> [1] 148 50
<- AHI_CESD %>%
occ_2 group_by(id, occasion) %>%
filter(occasion == 2)
dim(occ_2) # 156 x 50
#> [1] 156 50
<- AHI_CESD %>%
occ_3 group_by(id, occasion) %>%
filter(occasion == 3)
dim(occ_3) # 138 x 50
#> [1] 138 50
<- AHI_CESD %>%
occ_4 group_by(id, occasion) %>%
filter(occasion == 4)
dim(occ_4) # 133 x 50
#> [1] 133 50
<- AHI_CESD %>%
occ_5 group_by(id, occasion) %>%
filter(occasion == 5)
dim(occ_5) # 120 x 50
#> [1] 120 50
# Do all blocks contain the same variable names?
all.equal(names(occ_0), names(occ_5)) # TRUE
#> [1] TRUE
Note that only the first block (occ_0
) contains the data of all 295 participants.
The other five blocks of data (occ_1
to occ_5
) contain the same 50 variables, but were each reduced by drop-out cases.
Joining tables
- Combine the variables of the six tibbles into one wide block (in wide format) and compare your resulting table with data_wide.
Hint: As you cannot combine all six blocks in one command, they have to be combined in a stepwise fashion.
Solution
# Join parts:
<- full_join(occ_0, occ_1, by = "id", suffix = c(".0", ".1")) # add occ_1 to occ_0
occ_0_1 <- full_join(occ_0_1, occ_2, by = "id", suffix = c(".1", ".2")) # add occ_2 to this
occ_0_2 <- full_join(occ_0_2, occ_3, by = "id", suffix = c(".2", ".3")) # add occ_3 to this
occ_0_3 <- full_join(occ_0_3, occ_4, by = "id", suffix = c(".3", ".4")) # add occ_4 to this
occ_0_4 <- full_join(occ_0_4, occ_5, by = "id", suffix = c(".4", ".5")) # add occ_5 to this
occ_0_5
<- occ_0_5 # copy
occ_all
## Check:
dim(occ_all) # 295 x 295
#> [1] 295 295
# names(occ_all)
Cleaning up
- As you might have noticed the number of columns of our newly created data set and the data in wide format (
data_wide
) are not the same. That is because:
information on each participants’ intervention group is included six times in our created data;
data_wide
also includes the information on the participants (contained inp_info
).
Exclude any redundant variables on the intervention
group (keeping this variable only once) and include the participant information from p_info
to create a new table complete_data
. Check if and ensure that complete_data
is equal to data_wide
.
Solution
# Delete redundant variables on intervention groups:
<- occ_all %>%
occ_all_clean select(-intervention.1,
-intervention.2,
-intervention.3,
-intervention.4,
-intervention.5) %>%
rename(intervention = intervention.0)
# Check:
dim(occ_all_clean) # 295 x 290
#> [1] 295 290
# Join with participant information:
dim(p_info) # 295 x 6
#> [1] 295 6
<- full_join(p_info, occ_all_clean) # same as:
complete_data <- full_join(p_info, occ_all_clean, by = c("id", "intervention"))
complete_data
## Check:
# complete_data
dim(complete_data) # 295 x 294
#> [1] 295 294
## Compare with data_wide:
dim(data_wide) # 295 x 294
#> [1] 295 294
## In case of any differences:
## Which variables occur in complete_data but not in data_wide?
# names(complete_data)[!(names(complete_data) %in% names(data_wide))] # => intervention.x, intervention.y
#
## Are these variables all equal?
# all.equal(complete_data$intervention.x, complete_data$intervention.y) # TRUE
#
## Delete intervention.y and rename intervention.x:
# my_data_wide <- complete_data %>%
# select(-intervention.y) %>%
# rename(intervention = intervention.x)
## Check:
all.equal(complete_data, data_wide) # TRUE (qed).
#> [1] "Attributes: < Component \"spec\": Component \"cols\": Length mismatch: comparison on first 6 components >"
Alternative join
- How would you change the steps to join the six blocks (in 3.) if you only wanted to keep the participants who took part on all six possible occasions? How many of them exist?
Solution
# We would use multiple inner_join() commands instead of full_join():
# Join parts:
<- inner_join(occ_0, occ_1, by = "id", suffix = c(".0", ".1")) # add occ_1 to occ_0
in_occ_0_1 <- inner_join(in_occ_0_1, occ_2, by = "id", suffix = c(".1", ".2")) # add occ_2 to this
in_occ_0_2 <- inner_join(in_occ_0_2, occ_3, by = "id", suffix = c(".2", ".3")) # add occ_3 to this
in_occ_0_3 <- inner_join(in_occ_0_3, occ_4, by = "id", suffix = c(".3", ".4")) # add occ_4 to this
in_occ_0_4 <- inner_join(in_occ_0_4, occ_5, by = "id", suffix = c(".4", ".5")) # add occ_5 to this
in_occ_0_5
dim(in_occ_0_5) # 72 participants x 295 variables
#> [1] 72 295
sum(is.na(in_occ_0_5)) # 0 missing values
#> [1] 0
Filter join
- Use a dplyr summary table on the data in
AHI_CESD
to determine theid
of all participants that took part in all six occasions. Then use a filter join onp_info
and onAHI_CESD
to keep only the (rows of) data of these participants.
Solution
# Summary table of id values with 6 occasions:
<- AHI_CESD %>%
id_6_occ group_by(id) %>%
summarise(n = n()) %>%
filter(n == 6)
id_6_occ#> # A tibble: 72 × 2
#> id n
#> <dbl> <int>
#> 1 2 6
#> 2 11 6
#> 3 12 6
#> 4 14 6
#> 5 24 6
#> 6 27 6
#> 7 28 6
#> 8 37 6
#> 9 38 6
#> 10 40 6
#> # … with 62 more rows
# Use id_6_occ$id to filter cases of p_info and AHI_CESD:
semi_join(p_info, id_6_occ, by = "id") # 72 x 6
#> # A tibble: 72 × 6
#> id intervention sex age educ income
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 1 1 59 1 1
#> 2 11 2 1 56 5 1
#> 3 12 2 1 46 4 3
#> 4 14 4 1 41 1 2
#> 5 24 3 1 51 4 3
#> 6 27 4 1 34 4 3
#> 7 28 1 1 62 4 1
#> 8 37 2 1 47 5 2
#> 9 38 3 1 50 4 3
#> 10 40 2 1 53 4 3
#> # … with 62 more rows
semi_join(AHI_CESD, id_6_occ, by = "id") # 432 x 50
#> # A tibble: 432 × 50
#> id occasion elapsed.days intervention ahi01 ahi02 ahi03 ahi04 ahi05 ahi06
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 0 0 1 3 4 3 4 2 3
#> 2 2 1 8.02 1 3 4 4 4 3 3
#> 3 2 2 14.3 1 3 4 4 4 3 3
#> 4 2 3 32.0 1 3 4 4 4 4 4
#> 5 2 4 92.2 1 3 3 2 3 3 3
#> 6 2 5 182. 1 3 3 3 4 2 3
#> 7 11 0 0 2 2 2 3 4 2 3
#> 8 11 1 8.90 2 2 2 3 4 2 3
#> 9 11 2 14.8 2 3 2 3 3 2 2
#> 10 11 3 34.0 2 2 2 3 4 2 2
#> # … with 422 more rows, and 40 more variables: ahi07 <dbl>, ahi08 <dbl>,
#> # ahi09 <dbl>, ahi10 <dbl>, ahi11 <dbl>, ahi12 <dbl>, ahi13 <dbl>,
#> # ahi14 <dbl>, ahi15 <dbl>, ahi16 <dbl>, ahi17 <dbl>, ahi18 <dbl>,
#> # ahi19 <dbl>, ahi20 <dbl>, ahi21 <dbl>, ahi22 <dbl>, ahi23 <dbl>,
#> # ahi24 <dbl>, cesd01 <dbl>, cesd02 <dbl>, cesd03 <dbl>, cesd04 <dbl>,
#> # cesd05 <dbl>, cesd06 <dbl>, cesd07 <dbl>, cesd08 <dbl>, cesd09 <dbl>,
#> # cesd10 <dbl>, cesd11 <dbl>, cesd12 <dbl>, cesd13 <dbl>, cesd14 <dbl>, …
A.8.5 Exercise 5
This exercise relies on the raw data for the false positive psychology (see Section B.2 of Appendix B for details on the data and corresponding information). The original data can be obtained at https://openpsychologydata.metajnl.com/articles/10.5334/jopd.aa/ or https://zenodo.org/record/7664.
Preparation
- Download the two raw data files and load them into R as
study_1
andstudy_2
(fixing any obvious errors).
Solution
# Downloaded the archive `FalsePositive_Data_in_.txt_format.zip` and unzipped its contents
# into a folder "/data/_falsePosPsy/raw_data/Study 1 .txt".
# Specify paths to 2 txt files:
# (a) absolute paths:
# wd <- "/Users/hneth/Desktop/stuff/Dropbox/_code/R/_teachR/ds4psy"
# raw_1 <- paste0(wd, "/data/_falsePosPsy/raw_data/Study 1 .txt") # Note weird spacing in file name!
# raw_2 <- paste0(wd, "/data/_falsePosPsy/raw_data/Study 2.txt")
# (b) using here:
# library(here)
<- here::here("data", "_falsePosPsy", "raw_data", "Study 1 .txt") # Note weird spacing in file name!
raw_1 <- here::here("data", "_falsePosPsy", "raw_data", "Study 2.txt")
raw_2
# Load data:
<- readr::read_tsv(raw_1)
study_1 <- readr::read_tsv(raw_2)
study_2
# Inspect data:
dim(study_1) # 44 x 17
#> [1] 44 17
dim(study_2) # 34 x 17
#> [1] 34 17
## Are all variable names equal?
names(study_1) == names(study_2) # => mostly TRUE, but 1 FALSE value
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
#> [13] TRUE TRUE TRUE TRUE TRUE
names(study_1)[names(study_1) != names(study_2)] # => "kalimba"
#> [1] "kalimba"
names(study_2)[names(study_1) != names(study_2)] # => "kaimba" !!!
#> [1] "kaimba"
# Correct name in study_2 to that of study_1:
names(study_2)[names(study_1) != names(study_2)] <- names(study_1)[names(study_1) != names(study_2)]
all.equal(names(study_1), names(study_2)) # TRUE
#> [1] TRUE
Extracting occasions
- As both data files contain different participants, but the same variables about each participant, we want to save all data in one table.
Combine both datasets, but mark each dataset by a
study
variable (that has a value of either 1 or 2).
Solution
# Mark each data set by a variable:
$study <- 1
study_1$study <- 2
study_2
dim(study_1) # 44 x 18
#> [1] 44 18
dim(study_2) # 34 x 18
#> [1] 34 18
# (1) Use bind_rows():
<- bind_rows(study_1, study_2)
fpositive_all dim(fpositive_all) # 78 x 18 variables
#> [1] 78 18
names(fpositive_all)
#> [1] "aged" "dad" "mom" "female" "root"
#> [6] "bird" "political" "quarterback" "olddays" "potato"
#> [11] "when64" "kalimba" "feelold" "computer" "diner"
#> [16] "cond" "aged365" "study"
# (2) Use full_join() on all variables:
<- full_join(study_1, study_2)
fpositive_all_2 dim(fpositive_all_2) # 78 x 18 variables
#> [1] 78 18
names(fpositive_all)
#> [1] "aged" "dad" "mom" "female" "root"
#> [6] "bird" "political" "quarterback" "olddays" "potato"
#> [11] "when64" "kalimba" "feelold" "computer" "diner"
#> [16] "cond" "aged365" "study"
# Are both solutions equal?
all.equal(fpositive_all, fpositive_all_2) # TRUE
#> [1] TRUE
Cleaning up
- Identify individuals and arrange the dataset: Add a numeric
ID
variable identifying each individual participant and sort the dataset in an intuitive way.
Solution
## Data:
dim(fpositive_all) # 78 x 18
#> [1] 78 18
sum(is.na(fpositive_all)) # No missing values.
#> [1] 0
# Add a factor variable ID to the dataset,
# which runs from 1 to the total number of participants in the dataset:
<- nrow(fpositive_all)
n_rows
<- fpositive_all %>%
fpositive_all mutate(ID = 1:n_rows)
# Order columns for having ID and participant info first, followd by condition and the other items:
<- fpositive_all %>%
fpositive_all select(study, ID, aged, aged365, female, dad, mom, cond, potato, when64, kalimba, everything()) %>%
arrange(study, ID)
# Check re-ordered tibble:
head(fpositive_all)
#> # A tibble: 6 × 19
#> study ID aged aged365 female dad mom cond potato when64 kalimba root
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 6765 18.5 0 49 45 cont… 0 0 1 1
#> 2 1 2 7715 21.1 1 63 62 64 0 1 0 1
#> 3 1 3 7630 20.9 0 61 59 64 0 1 0 1
#> 4 1 4 7543 20.7 0 54 51 cont… 0 0 1 1
#> 5 1 5 7849 21.5 0 47 43 64 0 1 0 1
#> 6 1 6 7581 20.8 1 49 50 64 0 1 0 1
#> # … with 7 more variables: bird <dbl>, political <dbl>, quarterback <dbl>,
#> # olddays <dbl>, feelold <dbl>, computer <dbl>, diner <dbl>
tail(fpositive_all)
#> # A tibble: 6 × 19
#> study ID aged aged365 female dad mom cond potato when64 kalimba root
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2 73 7559 20.7 0 52 50 cont… 0 0 1 1
#> 2 2 74 7864 21.5 1 47 65 64 0 1 0 1
#> 3 2 75 7214 19.8 0 55 50 64 0 1 0 1
#> 4 2 76 7173 19.7 1 56 55 64 0 1 0 1
#> 5 2 77 6984 19.1 1 52 43 pota… 1 0 0 1
#> 6 2 78 7082 19.4 0 63 51 pota… 1 0 0 1
#> # … with 7 more variables: bird <dbl>, political <dbl>, quarterback <dbl>,
#> # olddays <dbl>, feelold <dbl>, computer <dbl>, diner <dbl>
Splitting and joining
- The following code splits the data of
falsePosPsy
into two parts (pt_1
andpt_2
):
# Data:
# falsePosPsy <- readr::read_csv(file = "http://rpository.com/ds4psy/data/falsePosPsy_all.csv") # online
<- ds4psy::falsePosPsy_all # from ds4psy package
falsePosPsy
falsePosPsy#> # A tibble: 78 × 19
#> study ID aged aged365 female dad mom potato when64 kalimba cond
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 1 1 6765 18.5 0 49 45 0 0 1 control
#> 2 1 2 7715 21.1 1 63 62 0 1 0 64
#> 3 1 3 7630 20.9 0 61 59 0 1 0 64
#> 4 1 4 7543 20.7 0 54 51 0 0 1 control
#> 5 1 5 7849 21.5 0 47 43 0 1 0 64
#> 6 1 6 7581 20.8 1 49 50 0 1 0 64
#> 7 1 7 7534 20.6 1 56 55 0 0 1 control
#> 8 1 8 6678 18.3 1 45 45 0 1 0 64
#> 9 1 9 6970 19.1 0 53 51 1 0 0 potato
#> 10 1 10 7681 21.0 0 53 51 0 1 0 64
#> # … with 68 more rows, and 8 more variables: root <dbl>, bird <dbl>,
#> # political <dbl>, quarterback <dbl>, olddays <dbl>, feelold <dbl>,
#> # computer <dbl>, diner <dbl>
dim(falsePosPsy)
#> [1] 78 19
# Create 2 different datasets:
<- falsePosPsy %>%
pt_1 select(study, ID, aged, aged365, female, dad, mom)
<- falsePosPsy %>%
pt_2 select(ID, potato, when64, kalimba, cond, root, bird, political, quarterback, olddays, feelold, computer, diner)
- Re-create the full data set (from
pt_1
andpt_2
) in at least three different ways (and verify that each of them yields the same data asfalsePosPsy
).
Solution
# Note that
all.equal(pt_1$ID, pt_2$ID) # TRUE
#> [1] TRUE
# => pt_1 and pt_2 contain the same cases (observations).
# Thus, ALL 4 mutating joins yield the same result:
<- left_join(pt_1, pt_2, by = "ID")
m_1 <- right_join(pt_1, pt_2, by = "ID")
m_2 <- full_join(pt_1, pt_2, by = "ID")
m_3 <- inner_join(pt_1, pt_2, by = "ID")
m_4
# Also, different merges would yield the same result:
<- merge(pt_1, pt_2, by = "ID") # as data.frame
m_5 <- as_tibble(m_5) # convert to tibble
m_5 <- m_5 %>% select(study, ID, everything()) # sort columns
m_5 # m_5
# bind_cols():
<- bind_cols(pt_1, pt_2)
m_6 # Note that some column/variable names have changed:
m_6 #> # A tibble: 78 × 20
#> study ID...2 aged aged365 female dad mom ID...8 potato when64 kalimba
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 6765 18.5 0 49 45 1 0 0 1
#> 2 1 2 7715 21.1 1 63 62 2 0 1 0
#> 3 1 3 7630 20.9 0 61 59 3 0 1 0
#> 4 1 4 7543 20.7 0 54 51 4 0 0 1
#> 5 1 5 7849 21.5 0 47 43 5 0 1 0
#> 6 1 6 7581 20.8 1 49 50 6 0 1 0
#> 7 1 7 7534 20.6 1 56 55 7 0 0 1
#> 8 1 8 6678 18.3 1 45 45 8 0 1 0
#> 9 1 9 6970 19.1 0 53 51 9 1 0 0
#> 10 1 10 7681 21.0 0 53 51 10 0 1 0
#> # … with 68 more rows, and 9 more variables: cond <chr>, root <dbl>,
#> # bird <dbl>, political <dbl>, quarterback <dbl>, olddays <dbl>,
#> # feelold <dbl>, computer <dbl>, diner <dbl>
# Reset column/variable names:
<- m_6 %>% rename(ID = ID...2,
m_6 ID1 = ID...8)
<- m_6 %>% select(-ID1) # remove duplicate column ID1
m_6 # m_6
# Verify equality of all 6 solutions:
all.equal(m_1, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_2, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_3, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_4, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_5, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_6, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
This concludes our exercises on using dplyr commands for joining two tables.