A.8 Solutions (08)

ds4psy: Solutions 8

Here are the solutions to the exercises on using essential dplyr commands for joining 2 tables (from Section 8.4 of Chapter 8).

A.8.1 Exercise 1

Given the definitions of data_t1 and data_t2, and t1 to t4 above, evaluate and explain the result of:

  1. all.equal(left_join(data_t1, data_t2), right_join(data_t2, data_t1))

  2. all.equal(full_join(data_t1, data_t2), inner_join(data_t2, data_t1))

  3. all.equal(data_t1, semi_join(data_t1, data_t2))

  4. all.equal(t1, full_join(semi_join(t1, t3), anti_join(t1, t3)))

Hint: If the base R function all.equal() complains about differences between tables that you expect to be equal, you may have to arrange their rows and columns before comparing them. Alternatively, use the dplyr function all_equal() to check the equality of data frames or tibbles.

Note

To do this exercise, we first re-load and re-create various data tables from the chapter (see Section 8.2.1 above):

  • We load data_t1 and data_t2 (from the ds4psy package or the CSV files online) and arrange them to re-create t1 and t2:
  • To obtain t3 and t4, we can either re-create them (with the code used above):
  • Alternatively, we can load two CSV files (i.e., t3.csv and t4.csv) from an online server or simply get the files from the ds4psy package:
  • As a summary, here are the current definitions of data_t1 and data_t2, and their 4 variants t1, t2, t3 and t4:

Solutions

We now can evaluate the statements of this exercise and explain the results returned:

# 1. left_join and right_join: 
all_equal(left_join(data_t1, data_t2), right_join(data_t2, data_t1))  # => TRUE, 
#> [1] TRUE
# as left_join(x, y) is the same as right_join(y, x) irrespective of x and y. 

# Note: In case you get strange inequalities about mismatches in 
#       "Attributes:", you may have to sort the rows and columns
#       of both tibbles in the same way prior to comparing them.

j1 <- left_join(data_t1, data_t2) %>%
  arrange(name, gender) %>%
  select(name, gender, bnt_1, bnt_2, like_1, like_2)

j2 <- right_join(data_t2, data_t1) %>%
  arrange(name, gender) %>%
  select(name, gender, bnt_1, bnt_2, like_1, like_2)

all.equal(j1, j2)  # should be TRUE.
#> [1] TRUE


# 2. full_join and inner_join:
all_equal(full_join(data_t1, data_t2), inner_join(data_t2, data_t1))  # => TRUE, 
#> [1] TRUE
# as data_t1 and data_t2 contain the same cases (despite different orders).
# Proof: See comparison of arranged tables (from above):
all.equal(t1$name, t2$name)
#> [1] TRUE
all.equal(t1$gender, t2$gender)
#> [1] TRUE

# 3. semi_join: 
all_equal(data_t1, semi_join(data_t1, data_t2))  # => TRUE, 
#> [1] TRUE
# as data_t1 and data_t2 contain the same cases (despite different orders).
# Hence, semi_join does not drop any cases of data_t1 and remains the same as data_t1.

# 4. semi_join and anti_join: 
all_equal(t1, full_join(semi_join(t1, t3), anti_join(t1, t3)))  # => TRUE, 
#> [1] TRUE
# Explanation:
# - semi_join(t1, t3) # contains  8 cases of t1 with a match in t3
# - anti_join(t1, t3) # contains 12 cases of t1 with NO match in t3
# - full_join() [or rbind()] combine both parts (rows) into the cases of t1 again. 

A.8.2 Exercise 2

In the previous Chapter 7 on Tidying data we encountered the tidyr::table4a and tidyr::table4b.

  1. Join these tables into one table that contains the same information as tidyr::table1.
Solution
Table A.26: Joined table of table4a and table4b.
country 1999_cases 2000_cases 1999_population 2000_population
Afghanistan 745 2666 19987071 20595360
Brazil 37737 80488 172006362 174504898
China 212258 213766 1272915272 1280428583
  1. Use tidyr and dplyr commands to transform the resulting join into tidyr::table1.

A.8.3 Exercise 3

This exercise uses the following data files:

Preparation

  1. Import both tables of data (as data_t3 and data_t4) and check out each table individually, before comparing them to each other. Do both tables contain the same cases and variables?
Solution
# Load data: 
data_t3 <- ds4psy::data_t3  # from ds4psy package
data_t4 <- ds4psy::data_t4  # from ds4psy package
# data_t3 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t3.csv")  # online
# data_t4 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t4.csv")  # online

# Inspect tables:
data_t3  # 20 x 4
#> # A tibble: 20 x 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 R.S.  female      2    NA
#>  2 M.Y.  male        2     4
#>  3 K.R.  male        4     1
#>  4 A.V.  female      3     2
#>  5 Z.Y.  female     NA    NA
#>  6 X.D.  female      4     4
#>  7 A.F.  female      5     1
#>  8 X.Y.  female      5     4
#>  9 A.V.  male        3     3
#> 10 M.O.  male        4     1
#> 11 T.V.  male        6     4
#> 12 X.P.  female      4     1
#> 13 Z.D.  male        5     4
#> 14 T.D.  male        5     2
#> 15 C.N.  female      4     3
#> 16 H.V.  female      1     4
#> 17 Q.N.  female      6     1
#> 18 Y.G.  male        4     4
#> 19 L.S.  male        4     4
#> 20 L.V.  female      4     2
data_t4  # 20 x 4
#> # A tibble: 20 x 4
#>    initials sex    like_2 bnt_2
#>    <chr>    <chr>   <dbl> <dbl>
#>  1 X.D.     female      5     1
#>  2 T.D.     male        3    NA
#>  3 Z.D.     male        7     4
#>  4 L.S.     male        7     4
#>  5 H.V.     female      5     2
#>  6 Q.N.     female      6     2
#>  7 X.P.     female      4     1
#>  8 K.R.     male        6     1
#>  9 R.S.     female      7     3
#> 10 C.N.     female      2     2
#> 11 Z.Y.     female      7     1
#> 12 T.V.     male        4     4
#> 13 Y.G.     male        4     4
#> 14 M.O.     male       NA    NA
#> 15 M.Y.     male        6     2
#> 16 A.F.     female      4     1
#> 17 A.V.     male        5     4
#> 18 L.V.     female      2     4
#> 19 X.Y.     female      5     2
#> 20 A.V.     female      5     3

# Both tables seem similar, but do not share variable names.
# Are names and initials identical?

# Arrange both tables by first 2 variables:
data_t3 <- data_t3 %>% arrange(name, gender)
data_t4 <- data_t4 %>% arrange(initials, sex)

all.equal(data_t3$name, data_t4$initials)  # TRUE
#> [1] TRUE
all.equal(data_t3$gender, data_t4$sex)     # TRUE
#> [1] TRUE
# => Both tables seem to contain the same people as cases!

Joining tables

  1. Join both tables by using only the name variable of data_t3 and initials variable of data_t4 as keys. Why is this difficult or impossible?
Solution
# Copy data: Make sure that cases are arranged in the same way:
t3 <- data_t3 %>% arrange(name, gender)
t4 <- data_t4 %>% arrange(initials, sex)

# Note 1:
# left_join(t3, t4)  # would yield an error:
# `by` required, because the data sources have no common variables

# Let's specify both primary and foreign key as:  
# by = c("name" = "initials")

# Note 2: Both of the following 
left_join(t3, t4, by = c("name" = "initials"))
#> # A tibble: 22 x 7
#>    name  gender like_1 bnt_1 sex    like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl> <chr>   <dbl> <dbl>
#>  1 A.F.  female      5     1 female      4     1
#>  2 A.V.  female      3     2 female      5     3
#>  3 A.V.  female      3     2 male        5     4
#>  4 A.V.  male        3     3 female      5     3
#>  5 A.V.  male        3     3 male        5     4
#>  6 C.N.  female      4     3 female      2     2
#>  7 H.V.  female      1     4 female      5     2
#>  8 K.R.  male        4     1 male        6     1
#>  9 L.S.  male        4     4 male        7     4
#> 10 L.V.  female      4     2 female      2     4
#> # … with 12 more rows
full_join(t3, t4, by = c("name" = "initials"))
#> # A tibble: 22 x 7
#>    name  gender like_1 bnt_1 sex    like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl> <chr>   <dbl> <dbl>
#>  1 A.F.  female      5     1 female      4     1
#>  2 A.V.  female      3     2 female      5     3
#>  3 A.V.  female      3     2 male        5     4
#>  4 A.V.  male        3     3 female      5     3
#>  5 A.V.  male        3     3 male        5     4
#>  6 C.N.  female      4     3 female      2     2
#>  7 H.V.  female      1     4 female      5     2
#>  8 K.R.  male        4     1 male        6     1
#>  9 L.S.  male        4     4 male        7     4
#> 10 L.V.  female      4     2 female      2     4
#> # … with 12 more rows
# yield 22 x 7 tables.

# Problem: The name/initials "A.V." appears TWICE data_t3 and data_t4:
length(unique(data_t3$name))            # => only 19 unique names
#> [1] 19
data_t3$name[duplicated(data_t3$name)]  # => "A.V." appears twice.
#> [1] "A.V."
# ==> The combination lists 2x2 instances of "A.V.". 

Alternative join

  1. Use a different command to join both tables into a 20 x 6 table that contains all data in 1 table.
Solution

The following solution first re-names 2 variables in one table (to have 2 matching key variables):

An alternative solution pre-supposes that both tables contain the same cases in the same order and introduces a temporary unique key variable (e.g., the row number) as key to link both tables:

A.8.4 Exercise 4

This exercise uses the datasets on web-based interventions in positive psychology that we have used before. (See Section B.1 of Appendix B for details on the data.)

So far, we always distinguished between 3 main data files:

Given your knowledge of dplyr and tidyr, we can now transform ds4psy::posPsy_long (posPsy_AHI_CESD_corrected.csv) and combine its parts with ds4psy::posPsy_p_info (posPsy_participants.csv) to obtain ds4psy::posPsy_wide (posPsy_data_wide.csv). Do this in the following steps:

Cleaning up

  1. As you might have noticed the number of columns of our newly created data set and the data in wide format (data_wide) are not the same. That is because:
  • information on each participants’ intervention group is included 6 times in our created data;

  • data_wide also includes the information on the participants (contained in p_info).

Exclude any redundant variables on the intervention group (keeping this variable only once) and include the participant information from p_info to create a new table complete_data. Check if and ensure that complete_data is equal to data_wide.

Filter join

  1. Use a dplyr summary table on the data in AHI_CESD to determine the id of all participants that took part in all 6 occasions. Then use a filter join on p_info and on AHI_CESD to keep only the (rows of) data of these participants.
Solution
# Summary table of id values with 6 occasions:
id_6_occ <- AHI_CESD %>%
  group_by(id) %>%
  summarise(n = n()) %>%
  filter(n == 6)
id_6_occ
#> # A tibble: 72 x 2
#>       id     n
#>    <dbl> <int>
#>  1     2     6
#>  2    11     6
#>  3    12     6
#>  4    14     6
#>  5    24     6
#>  6    27     6
#>  7    28     6
#>  8    37     6
#>  9    38     6
#> 10    40     6
#> # … with 62 more rows

# Use id_6_occ$id to filter cases of p_info and AHI_CESD: 
semi_join(p_info,   id_6_occ, by = "id")  #  72 x  6
#> # A tibble: 72 x 6
#>       id intervention   sex   age  educ income
#>    <dbl>        <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     2            1     1    59     1      1
#>  2    11            2     1    56     5      1
#>  3    12            2     1    46     4      3
#>  4    14            4     1    41     1      2
#>  5    24            3     1    51     4      3
#>  6    27            4     1    34     4      3
#>  7    28            1     1    62     4      1
#>  8    37            2     1    47     5      2
#>  9    38            3     1    50     4      3
#> 10    40            2     1    53     4      3
#> # … with 62 more rows
semi_join(AHI_CESD, id_6_occ, by = "id")  # 432 x 50
#> # A tibble: 432 x 50
#>       id occasion elapsed.days intervention ahi01 ahi02 ahi03 ahi04 ahi05 ahi06
#>    <dbl>    <dbl>        <dbl>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     2        0         0               1     3     4     3     4     2     3
#>  2     2        1         8.02            1     3     4     4     4     3     3
#>  3     2        2        14.3             1     3     4     4     4     3     3
#>  4     2        3        32.0             1     3     4     4     4     4     4
#>  5     2        4        92.2             1     3     3     2     3     3     3
#>  6     2        5       182.              1     3     3     3     4     2     3
#>  7    11        0         0               2     2     2     3     4     2     3
#>  8    11        1         8.90            2     2     2     3     4     2     3
#>  9    11        2        14.8             2     3     2     3     3     2     2
#> 10    11        3        34.0             2     2     2     3     4     2     2
#> # … with 422 more rows, and 40 more variables: ahi07 <dbl>, ahi08 <dbl>,
#> #   ahi09 <dbl>, ahi10 <dbl>, ahi11 <dbl>, ahi12 <dbl>, ahi13 <dbl>,
#> #   ahi14 <dbl>, ahi15 <dbl>, ahi16 <dbl>, ahi17 <dbl>, ahi18 <dbl>,
#> #   ahi19 <dbl>, ahi20 <dbl>, ahi21 <dbl>, ahi22 <dbl>, ahi23 <dbl>,
#> #   ahi24 <dbl>, cesd01 <dbl>, cesd02 <dbl>, cesd03 <dbl>, cesd04 <dbl>,
#> #   cesd05 <dbl>, cesd06 <dbl>, cesd07 <dbl>, cesd08 <dbl>, cesd09 <dbl>,
#> #   cesd10 <dbl>, cesd11 <dbl>, cesd12 <dbl>, cesd13 <dbl>, cesd14 <dbl>,
#> #   cesd15 <dbl>, cesd16 <dbl>, cesd17 <dbl>, cesd18 <dbl>, cesd19 <dbl>,
#> #   cesd20 <dbl>, ahiTotal <dbl>, cesdTotal <dbl>

A.8.5 Exercise 5

This exercise relies on the raw data for the false positive psychology (see Section B.2 of Appendix B for details on the data and corresponding information). The original data can be obtained at https://openpsychologydata.metajnl.com/articles/10.5334/jopd.aa/ or https://zenodo.org/record/7664.

Preparation

  1. Download the 2 raw data files and load them into R as study_1 and study_2 (fixing any obvious errors).

Cleaning up

  1. Identify individuals and arrange the dataset: Add a numeric ID variable identifying each individual participant and sort the dataset in an intuitive way.
Solution
## Data: 
dim(fpositive_all)         # 78 x 18
#> [1] 78 18
sum(is.na(fpositive_all))  # No missing values.
#> [1] 0

# Add a factor variable ID to the dataset, 
# which runs from 1 to the total number of participants in the dataset: 
n_rows <- nrow(fpositive_all)

fpositive_all <- fpositive_all %>%
  mutate(ID = 1:n_rows) 

# Order columns for having ID and participant info first, followd by condition and the other items: 
fpositive_all <- fpositive_all %>%
  select(study, ID, aged, aged365, female, dad, mom, cond, potato, when64, kalimba, everything()) %>%
  arrange(study, ID)

# Check re-ordered tibble:
head(fpositive_all)
#> # A tibble: 6 x 19
#>   study    ID  aged aged365 female   dad   mom cond  potato when64 kalimba  root
#>   <dbl> <int> <dbl>   <dbl>  <dbl> <dbl> <dbl> <chr>  <dbl>  <dbl>   <dbl> <dbl>
#> 1     1     1  6765    18.5      0    49    45 cont…      0      0       1     1
#> 2     1     2  7715    21.1      1    63    62 64         0      1       0     1
#> 3     1     3  7630    20.9      0    61    59 64         0      1       0     1
#> 4     1     4  7543    20.7      0    54    51 cont…      0      0       1     1
#> 5     1     5  7849    21.5      0    47    43 64         0      1       0     1
#> 6     1     6  7581    20.8      1    49    50 64         0      1       0     1
#> # … with 7 more variables: bird <dbl>, political <dbl>, quarterback <dbl>,
#> #   olddays <dbl>, feelold <dbl>, computer <dbl>, diner <dbl>
tail(fpositive_all)
#> # A tibble: 6 x 19
#>   study    ID  aged aged365 female   dad   mom cond  potato when64 kalimba  root
#>   <dbl> <int> <dbl>   <dbl>  <dbl> <dbl> <dbl> <chr>  <dbl>  <dbl>   <dbl> <dbl>
#> 1     2    73  7559    20.7      0    52    50 cont…      0      0       1     1
#> 2     2    74  7864    21.5      1    47    65 64         0      1       0     1
#> 3     2    75  7214    19.8      0    55    50 64         0      1       0     1
#> 4     2    76  7173    19.7      1    56    55 64         0      1       0     1
#> 5     2    77  6984    19.1      1    52    43 pota…      1      0       0     1
#> 6     2    78  7082    19.4      0    63    51 pota…      1      0       0     1
#> # … with 7 more variables: bird <dbl>, political <dbl>, quarterback <dbl>,
#> #   olddays <dbl>, feelold <dbl>, computer <dbl>, diner <dbl>

Splitting and joining

  1. The following code splits the data of falsePosPsy into 2 parts (pt_1 and pt_2):
  • Re-create the full data set (from pt_1 and pt_2) in at least 3 different ways (and verify that each of them yields the same data as falsePosPsy).
Solution
# Note that
all.equal(pt_1$ID, pt_2$ID)  # TRUE
#> [1] TRUE
# => pt_1 and pt_2 contain the same cases (observations).

# Thus, ALL 4 mutating joins yield the same result:
m_1 <-  left_join(pt_1, pt_2, by = "ID")
m_2 <- right_join(pt_1, pt_2, by = "ID")
m_3 <-  full_join(pt_1, pt_2, by = "ID")
m_4 <- inner_join(pt_1, pt_2, by = "ID")

# Also, different merges would yield the same result:
m_5 <- merge(pt_1, pt_2, by = "ID")  # as data.frame
m_5 <- as_tibble(m_5)                # convert to tibble
m_5 <- m_5 %>% select(study, ID, everything())  # sort columns
# m_5

# bind_cols(): 
m_6 <- bind_cols(pt_1, pt_2)
m_6  # Note that some column/variable names have changed: 
#> # A tibble: 78 x 20
#>    study ID...2  aged aged365 female   dad   mom ID...8 potato when64 kalimba
#>    <dbl>  <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#>  1     1      1  6765    18.5      0    49    45      1      0      0       1
#>  2     1      2  7715    21.1      1    63    62      2      0      1       0
#>  3     1      3  7630    20.9      0    61    59      3      0      1       0
#>  4     1      4  7543    20.7      0    54    51      4      0      0       1
#>  5     1      5  7849    21.5      0    47    43      5      0      1       0
#>  6     1      6  7581    20.8      1    49    50      6      0      1       0
#>  7     1      7  7534    20.6      1    56    55      7      0      0       1
#>  8     1      8  6678    18.3      1    45    45      8      0      1       0
#>  9     1      9  6970    19.1      0    53    51      9      1      0       0
#> 10     1     10  7681    21.0      0    53    51     10      0      1       0
#> # … with 68 more rows, and 9 more variables: cond <chr>, root <dbl>,
#> #   bird <dbl>, political <dbl>, quarterback <dbl>, olddays <dbl>,
#> #   feelold <dbl>, computer <dbl>, diner <dbl>

# Reset column/variable names:
m_6 <- m_6 %>% rename(ID = ID...2,
                      ID1 = ID...8)
m_6 <- m_6 %>% select(-ID1)  # remove duplicate column ID1
# m_6

# Verify equality of all 6 solutions:
all.equal(m_1, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_2, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_3, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_4, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_5, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE
all.equal(m_6, falsePosPsy, check.attributes = FALSE)
#> [1] TRUE

This concludes our exercises on using dplyr commands for joining two tables.