A.8 Solutions (08)

ds4psy: Solutions 8

Here are the solutions to the exercises on using essential dplyr commands for joining two tables (from Section 8.4 of Chapter 8).

A.8.1 Exercise 1

Given the definitions of data_t1 and data_t2, and t1 to t4 above, evaluate and explain the result of:

all.equal(left_join(data_t1, data_t2), right_join(data_t2, data_t1))
all.equal(full_join(data_t1, data_t2), inner_join(data_t2, data_t1))
all.equal(data_t1, semi_join(data_t1, data_t2))
all.equal(t1, full_join(semi_join(t1, t3), anti_join(t1, t3)))

Hint: If the base R function all.equal() complains about differences between tables that you expect to be equal, you may have to arrange their rows and columns before comparing them. Alternatively, use the dplyr function all_equal() to check the equality of data frames or tibbles.

Note

To do this exercise, we first re-load and re-create various data tables from the chapter (see Section 8.2.1 above):

We load data_t1 and data_t2 (from the ds4psy package or the CSV files online) and arrange them to re-create t1 and t2:

# Load data:
data_t1 <- ds4psy::data_t1  # from ds4psy package
data_t2 <- ds4psy::data_t2  # from ds4psy package
# data_t1 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t1.csv")  # online
# data_t2 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t2.csv")  # online

# Copy data:
t1 <- data_t1
t2 <- data_t2

# Arrange tables:
t1 <- t1 %>% arrange(name, gender)
t2 <- t2 %>% arrange(name, gender)

# Inspect data:
# t1
# t2

To obtain t3 and t4, we can either re-create them (with the code used above):

# Re-create t3 and t4: ------ 
t3 <- data_t1  # copy
t4 <- data_t2  # copy 

# Modify both tables: ------ 
# (a) Draw n random rows (samples):
set.seed(12)  # for replicability
n <- 8
s1 <- sample(1:nrow(t3), size = n, replace = FALSE)  # n random values from 1 to nrow(t3)
s2 <- sample(1:nrow(t4), size = n, replace = FALSE)  # n random values from 1 to nrow(t4)

# Reduce rows and arrange both tables:
t3 <- t3[s1, ]
t4 <- t4[s2, ]

# (b) Duplicate m random rows:
set.seed(23)  # for replicability
m <- 2
d1 <- sample(1:nrow(t3), size = m, replace = FALSE)  # n random values from 1 to nrow(t3)
d2 <- sample(1:nrow(t4), size = m, replace = FALSE)  # n random values from 1 to nrow(t4)

# Duplicate sampled rows:
t3 <- rbind(t3, t3[d1, ])
t4 <- rbind(t4, t4[d2, ])

# Arrange tables:
t3 <- t3 %>% arrange(name, gender)
t4 <- t4 %>% arrange(name, gender)

Alternatively, we can load two CSV files (i.e., t3.csv and t4.csv) from an online server or simply get the files from the ds4psy package:

# Re-load tables t3 and t4 (from online source):
# t3 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/t3.csv")  # online
# t4 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/t4.csv")  # online

t3 <- ds4psy::t3  # from ds4psy package
t4 <- ds4psy::t4  # from ds4psy package

As a summary, here are the current definitions of data_t1 and data_t2, and their four variants t1, t2, t3 and t4:

# Show current data tables:

# data_t1 and data_t2:
data_t1
data_t2

# Sorted versions t1 and t2:
t1
t2

# Variants t3 and t4:
t3
t4

Solutions

We now can evaluate the statements of this exercise and explain the results returned:

# 1. left_join and right_join: 
all_equal(left_join(data_t1, data_t2), right_join(data_t2, data_t1))  # => TRUE, 
#> [1] TRUE
# as left_join(x, y) is the same as right_join(y, x) irrespective of x and y. 

# Note: In case you get strange inequalities about mismatches in 
#       "Attributes:", you may have to sort the rows and columns
#       of both tibbles in the same way prior to comparing them.

j1 <- left_join(data_t1, data_t2) %>%
  arrange(name, gender) %>%
  select(name, gender, bnt_1, bnt_2, like_1, like_2)

j2 <- right_join(data_t2, data_t1) %>%
  arrange(name, gender) %>%
  select(name, gender, bnt_1, bnt_2, like_1, like_2)

all.equal(j1, j2)  # should be TRUE.
#> [1] TRUE


# 2. full_join and inner_join:
all_equal(full_join(data_t1, data_t2), inner_join(data_t2, data_t1))  # => TRUE, 
#> [1] TRUE
# as data_t1 and data_t2 contain the same cases (despite different orders).
# Proof: See comparison of arranged tables (from above):
all.equal(t1$name, t2$name)
#> [1] TRUE
all.equal(t1$gender, t2$gender)
#> [1] TRUE

# 3. semi_join: 
all_equal(data_t1, semi_join(data_t1, data_t2))  # => TRUE, 
#> [1] TRUE
# as data_t1 and data_t2 contain the same cases (despite different orders).
# Hence, semi_join does not drop any cases of data_t1 and remains the same as data_t1.

# 4. semi_join and anti_join: 
all_equal(t1, full_join(semi_join(t1, t3), anti_join(t1, t3)))  # => TRUE, 
#> [1] TRUE
# Explanation:
# - semi_join(t1, t3) # contains  8 cases of t1 with a match in t3
# - anti_join(t1, t3) # contains 12 cases of t1 with NO match in t3
# - full_join() [or rbind()] combine both parts (rows) into the cases of t1 again.

A.8.2 Exercise 2

In the previous Chapter 7 on Tidying data we encountered the tidyr::table4a and tidyr::table4b.

Join these tables into one table that contains the same information as tidyr::table1.

Solution

# Data: 
t4a <- tidyr::table4a
t4b <- tidyr::table4b

# Inspection: 
t4a
#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
t4b
#> # A tibble: 3 × 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

# Analysis:  Both 3 x 3 tables contain the same 3 instances of "country".
# When we join them by "country", all 4 mutating joins will yield the same result. 
# However, we must make sure that only "country" is used as key, 
# as the other 2 matching variable names (`1999` and `2000`) denote different variables
# in different tables. 

# 1. Alternative solutions for joining both tables: ------  
j1 <- left_join(t4a, t4b,  by = "country", suffix = c("_cases", "_population"))
j2 <- right_join(t4a, t4b, by = "country", suffix = c("_cases", "_population"))
j3 <- full_join(t4a, t4b,  by = "country", suffix = c("_cases", "_population"))
j4 <- inner_join(t4a, t4b, by = "country", suffix = c("_cases", "_population"))

# Show equality:
all_equal(j1, j2)
#> [1] TRUE
all_equal(j1, j3)
#> [1] TRUE
all_equal(j1, j4)
#> [1] TRUE

# Print result:
knitr::kable(j1, caption = "Joined table of table4a and table4b.")

Table A.26: Joined table of table4a and table4b.
country	1999_cases	2000_cases	1999_population	2000_population
Afghanistan	745	2666	19987071	20595360
Brazil	37737	80488	172006362	174504898
China	212258	213766	1272915272	1280428583

Use tidyr and dplyr commands to transform the resulting join into tidyr::table1.

Solution

# 2. Tidying/transforming result table: ------ 
j1_trans <- j1 %>%
  tidyr::gather(key = "key", value = "count", `1999_cases`:`2000_population`) %>%
  tidyr::separate(col = key, into = c("year", "type")) %>%
  dplyr::arrange(country, year) %>%
  tidyr::spread(key = type, value = count)
j1_trans
#> # A tibble: 6 × 4
#>   country     year   cases population
#>   <chr>       <chr>  <int>      <int>
#> 1 Afghanistan 1999     745   19987071
#> 2 Afghanistan 2000    2666   20595360
#> 3 Brazil      1999   37737  172006362
#> 4 Brazil      2000   80488  174504898
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583

# Change variable type of year:
j1_trans$year <- as.integer(j1_trans$year)  # from character to integer

# Show equality: 
all_equal(tidyr::table1, j1_trans)  # TRUE
#> [1] TRUE

A.8.3 Exercise 3

This exercise uses the following data files:

ds4psy::data_t3 (also available as CSV file from http://rpository.com/ds4psy/data/data_t3.csv)
ds4psy::data_t4 (also available as CSV file from http://rpository.com/ds4psy/data/data_t4.csv)

Preparation

Import both tables of data (as data_t3 and data_t4) and check out each table individually, before comparing them to each other. Do both tables contain the same cases and variables?

Solution

# Load data: 
data_t3 <- ds4psy::data_t3  # from ds4psy package
data_t4 <- ds4psy::data_t4  # from ds4psy package
# data_t3 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t3.csv")  # online
# data_t4 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t4.csv")  # online

# Inspect tables:
data_t3  # 20 x 4
#> # A tibble: 20 × 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 R.S.  female      2    NA
#>  2 M.Y.  male        2     4
#>  3 K.R.  male        4     1
#>  4 A.V.  female      3     2
#>  5 Z.Y.  female     NA    NA
#>  6 X.D.  female      4     4
#>  7 A.F.  female      5     1
#>  8 X.Y.  female      5     4
#>  9 A.V.  male        3     3
#> 10 M.O.  male        4     1
#> 11 T.V.  male        6     4
#> 12 X.P.  female      4     1
#> 13 Z.D.  male        5     4
#> 14 T.D.  male        5     2
#> 15 C.N.  female      4     3
#> 16 H.V.  female      1     4
#> 17 Q.N.  female      6     1
#> 18 Y.G.  male        4     4
#> 19 L.S.  male        4     4
#> 20 L.V.  female      4     2
data_t4  # 20 x 4
#> # A tibble: 20 × 4
#>    initials sex    like_2 bnt_2
#>    <chr>    <chr>   <dbl> <dbl>
#>  1 X.D.     female      5     1
#>  2 T.D.     male        3    NA
#>  3 Z.D.     male        7     4
#>  4 L.S.     male        7     4
#>  5 H.V.     female      5     2
#>  6 Q.N.     female      6     2
#>  7 X.P.     female      4     1
#>  8 K.R.     male        6     1
#>  9 R.S.     female      7     3
#> 10 C.N.     female      2     2
#> 11 Z.Y.     female      7     1
#> 12 T.V.     male        4     4
#> 13 Y.G.     male        4     4
#> 14 M.O.     male       NA    NA
#> 15 M.Y.     male        6     2
#> 16 A.F.     female      4     1
#> 17 A.V.     male        5     4
#> 18 L.V.     female      2     4
#> 19 X.Y.     female      5     2
#> 20 A.V.     female      5     3

# Both tables seem similar, but do not share variable names.
# Are names and initials identical?

# Arrange both tables by first 2 variables:
data_t3 <- data_t3 %>% arrange(name, gender)
data_t4 <- data_t4 %>% arrange(initials, sex)

all.equal(data_t3$name, data_t4$initials)  # TRUE
#> [1] TRUE
all.equal(data_t3$gender, data_t4$sex)     # TRUE
#> [1] TRUE
# => Both tables seem to contain the same people as cases!

Joining tables

Join both tables by using only the name variable of data_t3 and initials variable of data_t4 as keys. Why is this difficult or impossible?

Solution

# Copy data: Make sure that cases are arranged in the same way:
t3 <- data_t3 %>% arrange(name, gender)
t4 <- data_t4 %>% arrange(initials, sex)

# Note 1:
# left_join(t3, t4)  # would yield an error:
# `by` required, because the data sources have no common variables

# Let's specify both primary and foreign key as:  
# by = c("name" = "initials")

# Note 2: Both of the following 
left_join(t3, t4, by = c("name" = "initials"))
#> # A tibble: 22 × 7
#>    name  gender like_1 bnt_1 sex    like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl> <chr>   <dbl> <dbl>
#>  1 A.F.  female      5     1 female      4     1
#>  2 A.V.  female      3     2 female      5     3
#>  3 A.V.  female      3     2 male        5     4
#>  4 A.V.  male        3     3 female      5     3
#>  5 A.V.  male        3     3 male        5     4
#>  6 C.N.  female      4     3 female      2     2
#>  7 H.V.  female      1     4 female      5     2
#>  8 K.R.  male        4     1 male        6     1
#>  9 L.S.  male        4     4 male        7     4
#> 10 L.V.  female      4     2 female      2     4
#> # … with 12 more rows
full_join(t3, t4, by = c("name" = "initials"))
#> # A tibble: 22 × 7
#>    name  gender like_1 bnt_1 sex    like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl> <chr>   <dbl> <dbl>
#>  1 A.F.  female      5     1 female      4     1
#>  2 A.V.  female      3     2 female      5     3
#>  3 A.V.  female      3     2 male        5     4
#>  4 A.V.  male        3     3 female      5     3
#>  5 A.V.  male        3     3 male        5     4
#>  6 C.N.  female      4     3 female      2     2
#>  7 H.V.  female      1     4 female      5     2
#>  8 K.R.  male        4     1 male        6     1
#>  9 L.S.  male        4     4 male        7     4
#> 10 L.V.  female      4     2 female      2     4
#> # … with 12 more rows
# yield 22 x 7 tables.

# Problem: The name/initials "A.V." appears TWICE data_t3 and data_t4:
length(unique(data_t3$name))            # => only 19 unique names
#> [1] 19
data_t3$name[duplicated(data_t3$name)]  # => "A.V." appears twice.
#> [1] "A.V."
# ==> The combination lists 2x2 instances of "A.V.".

Alternative join

Use a different command to join both tables into a 20 x 6 table that contains all data in one table.

Solution

The following solution first re-names two variables in one table (to have two matching key variables):

# Copy data: Make sure that cases are arranged in the same way:
t3 <- data_t3 %>% arrange(name, gender)
t4 <- data_t4 %>% arrange(initials, sex)

# Solution 1:
# (a) Rename both key variables in one table:
names(t4)[1:2] <- names(t3)[c(1, 2)] 

# (b) Join both tables using both key variables:
s1 <- left_join(t3,  t4, by = c("name", "gender"))  # => 20 x 6
s2 <- right_join(t3, t4, by = c("name", "gender"))  # => 20 x 6
s3 <- full_join(t3,  t4, by = c("name", "gender"))  # => 20 x 6
s4 <- inner_join(t3, t4, by = c("name", "gender"))  # => 20 x 6

# Show equality of solutions:
all_equal(s1, s2)
#> [1] TRUE
all_equal(s1, s3)
#> [1] TRUE
all_equal(s1, s4)
#> [1] TRUE

## Print solution:
# s1

An alternative solution pre-supposes that both tables contain the same cases in the same order and introduces a temporary unique key variable (e.g., the row number) as key to link both tables:

# Copy data: Make sure that cases are arranged in the same way:
t3 <- data_t3 %>% arrange(name, gender)
t4 <- data_t4 %>% arrange(initials, sex)

# Solution 2: When we know that 
# - both tables contain the same cases AND 
# - both tables have all cases arranged in the same way:
all.equal(t3$name, t4$initials)  # TRUE
#> [1] TRUE

# Add a new variable (with unique values) to both tables:
t3$nr <- 1:nrow(t3)  # add unique row number to each row
t4$nr <- 1:nrow(t4)  # add unique row number to each row

# Use this new and unambiguous variable as key:
s5 <- left_join(t3, t4, by = c("nr")) %>%  # use nr as key
  select(name:bnt_1, like_2, bnt_2)  # and drop some variables => 20 x 6 
s6 <- full_join(t3, t4, by = c("nr")) %>%  # use nr as key
  select(name:bnt_1, like_2, bnt_2)  # and drop some variables => 20 x 6 

# Show equality of solutions:
all_equal(s1, s5)
#> [1] TRUE
all_equal(s1, s6)
#> [1] TRUE

A.8.4 Exercise 4

This exercise uses the datasets on web-based interventions in positive psychology that we have used before. (See Section B.1 of Appendix B for details on the data.)

So far, we always distinguished between three main data files:

ds4psy::posPsy_p_info (also available as posPsy_participants.csv: Original participant data, 295 x 6 variables:
http://rpository.com/ds4psy/data/posPsy_participants.csv).
ds4psy::posPsy_long (also available as posPsy_AHI_CESD_corrected.csv: Corrected version of dependent measures in long format, 990 x 50 variables:
http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv).
ds4psy::posPsy_wide (also available as posPsy_data_wide.csv: Corrected version of all data joined in wide format, 295 x 294 variables:
http://rpository.com/ds4psy/data/posPsy_data_wide.csv).
Different measurement occasions are suffixed by .0, .1, …, .5.

Given your knowledge of dplyr and tidyr, we can now transform ds4psy::posPsy_long (posPsy_AHI_CESD_corrected.csv) and combine its parts with ds4psy::posPsy_p_info (posPsy_participants.csv) to obtain ds4psy::posPsy_wide (posPsy_data_wide.csv). Do this in the following steps:

Preparation

Import the data files posPsy_participants.csv as p_info, posPsy_AHI_CESD_corrected.csv as AHI_CESD, and posPsy_data_wide.csv as data_wide, and inspect the dimensions of these tables.

# Data:
p_info    <- ds4psy::posPsy_p_info    # from ds4psy package 
AHI_CESD  <- ds4psy::posPsy_long
data_wide <- ds4psy::posPsy_wide

# p_info    <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_participants.csv")  # online 
# AHI_CESD  <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv") 
# data_wide <- readr::read_csv("http://rpository.com/ds4psy/data/posPsy_data_wide.csv") 

# Check: 
dim(p_info)     # 295 x   6 
#> [1] 295   6
dim(AHI_CESD)   # 990 x  50
#> [1] 990  50
dim(data_wide)  # 295 x 294
#> [1] 295 294

Extracting occasions

Create six individual tibbles that contains the data of each participant (one row per id) seperately for each occasion (0 to 5).

Solution

# Create 6 blocks 
# (for occasion 0 to 5):

occ_0 <- AHI_CESD %>%
  group_by(id, occasion) %>%
  filter(occasion == 0)
dim(occ_0)  # 295 x 50
#> [1] 295  50

occ_1 <- AHI_CESD %>%
  group_by(id, occasion) %>%
  filter(occasion == 1)
dim(occ_1)  # 148 x 50
#> [1] 148  50

occ_2 <- AHI_CESD %>%
  group_by(id, occasion) %>%
  filter(occasion == 2)
dim(occ_2)  # 156 x 50
#> [1] 156  50

occ_3 <- AHI_CESD %>%
  group_by(id, occasion) %>%
  filter(occasion == 3)
dim(occ_3)  # 138 x 50
#> [1] 138  50

occ_4 <- AHI_CESD %>%
  group_by(id, occasion) %>%
  filter(occasion == 4)
dim(occ_4)  # 133 x 50
#> [1] 133  50

occ_5 <- AHI_CESD %>%
  group_by(id, occasion) %>%
  filter(occasion == 5)
dim(occ_5)  # 120 x 50
#> [1] 120  50

# Do all blocks contain the same variable names? 
all.equal(names(occ_0), names(occ_5))  # TRUE
#> [1] TRUE

Note that only the first block (occ_0) contains the data of all 295 participants. The other five blocks of data (occ_1 to occ_5) contain the same 50 variables, but were each reduced by drop-out cases.

Joining tables

Combine the variables of the six tibbles into one wide block (in wide format) and compare your resulting table with data_wide.

Hint: As you cannot combine all six blocks in one command, they have to be combined in a stepwise fashion.

Solution

# Join parts: 
occ_0_1 <- full_join(occ_0,   occ_1, by = "id", suffix = c(".0", ".1"))  # add occ_1 to occ_0
occ_0_2 <- full_join(occ_0_1, occ_2, by = "id", suffix = c(".1", ".2"))  # add occ_2 to this
occ_0_3 <- full_join(occ_0_2, occ_3, by = "id", suffix = c(".2", ".3"))  # add occ_3 to this
occ_0_4 <- full_join(occ_0_3, occ_4, by = "id", suffix = c(".3", ".4"))  # add occ_4 to this
occ_0_5 <- full_join(occ_0_4, occ_5, by = "id", suffix = c(".4", ".5"))  # add occ_5 to this

occ_all <- occ_0_5 # copy 

## Check: 
dim(occ_all)  # 295 x 295
#> [1] 295 295
# names(occ_all)

Cleaning up

As you might have noticed the number of columns of our newly created data set and the data in wide format (data_wide) are not the same. That is because:

information on each participants’ intervention group is included six times in our created data;
data_wide also includes the information on the participants (contained in p_info).

Exclude any redundant variables on the intervention group (keeping this variable only once) and include the participant information from p_info to create a new table complete_data. Check if and ensure that complete_data is equal to data_wide.

Solution

# Delete redundant variables on intervention groups:
occ_all_clean <- occ_all %>%
  select(-intervention.1, 
         -intervention.2, 
         -intervention.3, 
         -intervention.4, 
         -intervention.5) %>%
  rename(intervention = intervention.0)

# Check: 
dim(occ_all_clean)  # 295 x 290
#> [1] 295 290

# Join with participant information:
dim(p_info)  # 295 x 6
#> [1] 295   6
complete_data <- full_join(p_info, occ_all_clean)  # same as: 
complete_data <- full_join(p_info, occ_all_clean,  by = c("id", "intervention"))

## Check:
# complete_data
dim(complete_data)  # 295 x 294  
#> [1] 295 294

## Compare with data_wide:
dim(data_wide) # 295 x 294
#> [1] 295 294

## In case of any differences:
## Which variables occur in complete_data but not in data_wide?
# names(complete_data)[!(names(complete_data) %in% names(data_wide))] # => intervention.x, intervention.y
#
## Are these variables all equal?
# all.equal(complete_data$intervention.x, complete_data$intervention.y)  # TRUE
#
## Delete intervention.y and rename intervention.x:
# my_data_wide <- complete_data %>%
#   select(-intervention.y) %>%
#   rename(intervention = intervention.x)

## Check:
all.equal(complete_data, data_wide)  # TRUE (qed).
#> [1] "Attributes: < Component \"spec\": Component \"cols\": Length mismatch: comparison on first 6 components >"

Alternative join

How would you change the steps to join the six blocks (in 3.) if you only wanted to keep the participants who took part on all six possible occasions? How many of them exist?

Solution

# We would use multiple inner_join() commands instead of full_join():
# Join parts: 
in_occ_0_1 <- inner_join(occ_0,      occ_1, by = "id", suffix = c(".0", ".1"))  # add occ_1 to occ_0
in_occ_0_2 <- inner_join(in_occ_0_1, occ_2, by = "id", suffix = c(".1", ".2"))  # add occ_2 to this
in_occ_0_3 <- inner_join(in_occ_0_2, occ_3, by = "id", suffix = c(".2", ".3"))  # add occ_3 to this
in_occ_0_4 <- inner_join(in_occ_0_3, occ_4, by = "id", suffix = c(".3", ".4"))  # add occ_4 to this
in_occ_0_5 <- inner_join(in_occ_0_4, occ_5, by = "id", suffix = c(".4", ".5"))  # add occ_5 to this

dim(in_occ_0_5)         # 72 participants x 295 variables
#> [1]  72 295
sum(is.na(in_occ_0_5))  #  0 missing values
#> [1] 0

Filter join

Use a dplyr summary table on the data in AHI_CESD to determine the id of all participants that took part in all six occasions. Then use a filter join on p_info and on AHI_CESD to keep only the (rows of) data of these participants.

Solution

# Summary table of id values with 6 occasions:
id_6_occ <- AHI_CESD %>%
  group_by(id) %>%
  summarise(n = n()) %>%
  filter(n == 6)
id_6_occ
#> # A tibble: 72 × 2
#>       id     n
#>    <dbl> <int>
#>  1     2     6
#>  2    11     6
#>  3    12     6
#>  4    14     6
#>  5    24     6
#>  6    27     6
#>  7    28     6
#>  8    37     6
#>  9    38     6
#> 10    40     6
#> # … with 62 more rows

# Use id_6_occ$id to filter cases of p_info and AHI_CESD: 
semi_join(p_info,   id_6_occ, by = "id")  #  72 x  6
#> # A tibble: 72 × 6
#>       id intervention   sex   age  educ income
#>    <dbl>        <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     2            1     1    59     1      1
#>  2    11            2     1    56     5      1
#>  3    12            2     1    46     4      3
#>  4    14            4     1    41     1      2
#>  5    24            3     1    51     4      3
#>  6    27            4     1    34     4      3
#>  7    28            1     1    62     4      1
#>  8    37            2     1    47     5      2
#>  9    38            3     1    50     4      3
#> 10    40            2     1    53     4      3
#> # … with 62 more rows
semi_join(AHI_CESD, id_6_occ, by = "id")  # 432 x 50
#> # A tibble: 432 × 50
#>       id occasion elapsed.days intervention ahi01 ahi02 ahi03 ahi04 ahi05 ahi06
#>    <dbl>    <dbl>        <dbl>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     2        0         0               1     3     4     3     4     2     3
#>  2     2        1         8.02            1     3     4     4     4     3     3
#>  3     2        2        14.3             1     3     4     4     4     3     3
#>  4     2        3        32.0             1     3     4     4     4     4     4
#>  5     2        4        92.2             1     3     3     2     3     3     3
#>  6     2        5       182.              1     3     3     3     4     2     3
#>  7    11        0         0               2     2     2     3     4     2     3
#>  8    11        1         8.90            2     2     2     3     4     2     3
#>  9    11        2        14.8             2     3     2     3     3     2     2
#> 10    11        3        34.0             2     2     2     3     4     2     2
#> # … with 422 more rows, and 40 more variables: ahi07 <dbl>, ahi08 <dbl>,
#> #   ahi09 <dbl>, ahi10 <dbl>, ahi11 <dbl>, ahi12 <dbl>, ahi13 <dbl>,
#> #   ahi14 <dbl>, ahi15 <dbl>, ahi16 <dbl>, ahi17 <dbl>, ahi18 <dbl>,
#> #   ahi19 <dbl>, ahi20 <dbl>, ahi21 <dbl>, ahi22 <dbl>, ahi23 <dbl>,
#> #   ahi24 <dbl>, cesd01 <dbl>, cesd02 <dbl>, cesd03 <dbl>, cesd04 <dbl>,
#> #   cesd05 <dbl>, cesd06 <dbl>, cesd07 <dbl>, cesd08 <dbl>, cesd09 <dbl>,
#> #   cesd10 <dbl>, cesd11 <dbl>, cesd12 <dbl>, cesd13 <dbl>, cesd14 <dbl>, …

A.8.5 Exercise 5

This exercise relies on the raw data for the false positive psychology (see Section B.2 of Appendix B for details on the data and corresponding information). The original data can be obtained at https://openpsychologydata.metajnl.com/articles/10.5334/jopd.aa/ or https://zenodo.org/record/7664.

Preparation

Download the two raw data files and load them into R as study_1 and study_2 (fixing any obvious errors).

Solution

# Downloaded the archive `FalsePositive_Data_in_.txt_format.zip` and unzipped its contents  
# into a folder "/data/_falsePosPsy/raw_data/Study 1 .txt".

# Specify paths to 2 txt files:
# (a) absolute paths:
# wd <- "/Users/hneth/Desktop/stuff/Dropbox/_code/R/_teachR/ds4psy"
# raw_1 <- paste0(wd, "/data/_falsePosPsy/raw_data/Study 1 .txt")  # Note weird spacing in file name!
# raw_2 <- paste0(wd, "/data/_falsePosPsy/raw_data/Study 2.txt")

# (b) using here: 
# library(here)
raw_1 <- here::here("data", "_falsePosPsy", "raw_data", "Study 1 .txt") # Note weird spacing in file name!
raw_2 <- here::here("data", "_falsePosPsy", "raw_data", "Study 2.txt")

# Load data: 
study_1 <- readr::read_tsv(raw_1)
study_2 <- readr::read_tsv(raw_2)

# Inspect data:
dim(study_1)  # 44 x 17
#> [1] 44 17
dim(study_2)  # 34 x 17
#> [1] 34 17

## Are all variable names equal?
names(study_1) == names(study_2)  # => mostly TRUE, but 1 FALSE value
#>  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
#> [13]  TRUE  TRUE  TRUE  TRUE  TRUE

names(study_1)[names(study_1) != names(study_2)] # => "kalimba"
#> [1] "kalimba"
names(study_2)[names(study_1) != names(study_2)] # => "kaimba" !!!
#> [1] "kaimba"
# Correct name in study_2 to that of study_1:
names(study_2)[names(study_1) != names(study_2)] <- names(study_1)[names(study_1) != names(study_2)]

all.equal(names(study_1), names(study_2))  # TRUE
#> [1] TRUE

Extracting occasions

As both data files contain different participants, but the same variables about each participant, we want to save all data in one table. Combine both datasets, but mark each dataset by a study variable (that has a value of either 1 or 2).

Solution

# Mark each data set by a variable:
study_1$study <- 1
study_2$study <- 2

dim(study_1)  # 44 x 18
#> [1] 44 18
dim(study_2)  # 34 x 18
#> [1] 34 18

# (1) Use bind_rows(): 
fpositive_all <- bind_rows(study_1, study_2)
dim(fpositive_all)  # 78 x 18 variables
#> [1] 78 18
names(fpositive_all)
#>  [1] "aged"        "dad"         "mom"         "female"      "root"       
#>  [6] "bird"        "political"   "quarterback" "olddays"     "potato"     
#> [11] "when64"      "kalimba"     "feelold"     "computer"    "diner"      
#> [16] "cond"        "aged365"     "study"

# (2) Use full_join() on all variables:
fpositive_all_2 <- full_join(study_1, study_2)
dim(fpositive_all_2)  # 78 x 18 variables
#> [1] 78 18
names(fpositive_all)
#>  [1] "aged"        "dad"         "mom"         "female"      "root"       
#>  [6] "bird"        "political"   "quarterback" "olddays"     "potato"     
#> [11] "when64"      "kalimba"     "feelold"     "computer"    "diner"      
#> [16] "cond"        "aged365"     "study"

# Are both solutions equal? 
all.equal(fpositive_all, fpositive_all_2)  # TRUE
#> [1] TRUE

Cleaning up

Identify individuals and arrange the dataset: Add a numeric ID variable identifying each individual participant and sort the dataset in an intuitive way.

Solution

## Data: 
dim(fpositive_all)         # 78 x 18
#> [1] 78 18
sum(is.na(fpositive_all))  # No missing values.
#> [1] 0

# Add a factor variable ID to the dataset, 
# which runs from 1 to the total number of participants in the dataset: 
n_rows <- nrow(fpositive_all)

fpositive_all <- fpositive_all %>%
  mutate(ID = 1:n_rows) 

# Order columns for having ID and participant info first, followd by condition and the other items: 
fpositive_all <- fpositive_all %>%
  select(study, ID, aged, aged365, female, dad, mom, cond, potato, when64, kalimba, everything()) %>%
  arrange(study, ID)

# Check re-ordered tibble:
head(fpositive_all)
#> # A tibble: 6 × 19
#>   study    ID  aged aged365 female   dad   mom cond  potato when64 kalimba  root
#>   <dbl> <int> <dbl>   <dbl>  <dbl> <dbl> <dbl> <chr>  <dbl>  <dbl>   <dbl> <dbl>
#> 1     1     1  6765    18.5      0    49    45 cont…      0      0       1     1
#> 2     1     2  7715    21.1      1    63    62 64         0      1       0     1
#> 3     1     3  7630    20.9      0    61    59 64         0      1       0     1
#> 4     1     4  7543    20.7      0    54    51 cont…      0      0       1     1
#> 5     1     5  7849    21.5      0    47    43 64         0      1       0     1
#> 6     1     6  7581    20.8      1    49    50 64         0      1       0     1
#> # … with 7 more variables: bird <dbl>, political <dbl>, quarterback <dbl>,
#> #   olddays <dbl>, feelold <dbl>, computer <dbl>, diner <dbl>
tail(fpositive_all)
#> # A tibble: 6 × 19
#>   study    ID  aged aged365 female   dad   mom cond  potato when64 kalimba  root
#>   <dbl> <int> <dbl>   <dbl>  <dbl> <dbl> <dbl> <chr>  <dbl>  <dbl>   <dbl> <dbl>
#> 1     2    73  7559    20.7      0    52    50 cont…      0      0       1     1
#> 2     2    74  7864    21.5      1    47    65 64         0      1       0     1
#> 3     2    75  7214    19.8      0    55    50 64         0      1       0     1
#> 4     2    76  7173    19.7      1    56    55 64         0      1       0     1
#> 5     2    77  6984    19.1      1    52    43 pota…      1      0       0     1
#> 6     2    78  7082    19.4      0    63    51 pota…      1      0       0     1
#> # … with 7 more variables: bird <dbl>, political <dbl>, quarterback <dbl>,
#> #   olddays <dbl>, feelold <dbl>, computer <dbl>, diner <dbl>

Splitting and joining

The following code splits the data of falsePosPsy into two parts (pt_1 and pt_2):

# Data: 
# falsePosPsy <- readr::read_csv(file = "http://rpository.com/ds4psy/data/falsePosPsy_all.csv")  # online
falsePosPsy <- ds4psy::falsePosPsy_all  # from ds4psy package
falsePosPsy
#> # A tibble: 78 × 19
#>    study    ID  aged aged365 female   dad   mom potato when64 kalimba cond   
#>    <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>   <dbl> <chr>  
#>  1     1     1  6765    18.5      0    49    45      0      0       1 control
#>  2     1     2  7715    21.1      1    63    62      0      1       0 64     
#>  3     1     3  7630    20.9      0    61    59      0      1       0 64     
#>  4     1     4  7543    20.7      0    54    51      0      0       1 control
#>  5     1     5  7849    21.5      0    47    43      0      1       0 64     
#>  6     1     6  7581    20.8      1    49    50      0      1       0 64     
#>  7     1     7  7534    20.6      1    56    55      0      0       1 control
#>  8     1     8  6678    18.3      1    45    45      0      1       0 64     
#>  9     1     9  6970    19.1      0    53    51      1      0       0 potato 
#> 10     1    10  7681    21.0      0    53    51      0      1       0 64     
#> # … with 68 more rows, and 8 more variables: root <dbl>, bird <dbl>,
#> #   political <dbl>, quarterback <dbl>, olddays <dbl>, feelold <dbl>,
#> #   computer <dbl>, diner <dbl>
dim(falsePosPsy)
#> [1] 78 19

# Create 2 different datasets: 
pt_1 <- falsePosPsy %>%
  select(study, ID, aged, aged365, female, dad, mom)

pt_2 <- falsePosPsy %>%
  select(ID, potato, when64, kalimba, cond, root, bird, political, quarterback, olddays, feelold, computer, diner)

Re-create the full data set (from pt_1 and pt_2) in at least three different ways (and verify that each of them yields the same data as falsePosPsy).

Solution

# Note that
all.equal(pt_1$ID, pt_2$ID)  # TRUE
#> [1] TRUE
# => pt_1 and pt_2 contain the same cases (observations).

# Thus, ALL 4 mutating joins yield the same result:
m_1 <-  left_join(pt_1, pt_2, by = "ID")
m_2 <- right_join(pt_1, pt_2, by = "ID")
m_3 <-  full_join(pt_1, pt_2, by = "ID")
m_4 <- inner_join(pt_1, pt_2, by = "ID")

# Also, different merges would yield the same result:
m_5 <- merge(pt_1, pt_2, by = "ID")  # as data.frame
m_5 <- as_tibble(m_5)                # convert to tibble
m_5 <- m_5 %>% select(study, ID, everything())  # sort columns
# m_5

# bind_cols(): 
m_6 <- bind_cols(pt_1, pt_2)
m_6  # Note that some column/variable names have changed: 
#> # A tibble: 78 × 20
#>    study ID...2  aged aged365 female   dad   mom ID...8 potato when64 kalimba
#>    <dbl>  <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#>  1     1      1  6765    18.5      0    49    45      1      0      0       1
#>  2     1      2  7715    21.1      1    63    62      2      0      1       0
#>  3     1      3  7630    20.9      0    61    59      3      0      1       0
#>  4     1      4  7543    20.7      0    54    51      4      0      0       1
#>  5     1      5  7849    21.5      0    47    43      5      0      1       0
#>  6     1      6  7581    20.8      1    49    50      6      0      1       0
#>  7     1      7  7534    20.6      1    56    55      7      0      0       1
#>  8     1      8  6678    18.3      1    45    45      8      0      1       0
#>  9     1      9  6970    19.1      0    53    51      9      1      0       0
#> 10     1     10  7681    21.0      0    53    51     10      0      1       0
#> # … with 68 more rows, and 9 more variables: cond <chr>, root <dbl>,
#> #   bird <dbl>, political <dbl>, quarterback <dbl>, olddays <dbl>,
#> #   feelold <dbl>, computer <dbl>, diner <dbl>

# Reset column/variable names:
m_6 <- m_6 %>% rename(ID = ID...2,
                      ID1 = ID...8)
m_6 <- m_6 %>% select(-ID1)  # remove duplicate column ID1
# m_6

# Verify equality of all 6 solutions:
all.equal(m_1, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_2, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_3, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_4, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_5, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_6, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE

This concludes our exercises on using dplyr commands for joining two tables.