A.8 Solutions (08)

ds4psy: Solutions 8

Here are the solutions to the exercises on using essential dplyr commands for joining 2 tables (from Section 8.4 of Chapter 8).

A.8.1 Exercise 1

Given the definitions of data_t1 and data_t2, and t1 to t4 above, evaluate and explain the result of:

  1. all.equal(left_join(data_t1, data_t2), right_join(data_t2, data_t1))

  2. all.equal(full_join(data_t1, data_t2), inner_join(data_t2, data_t1))

  3. all.equal(data_t1, semi_join(data_t1, data_t2))

  4. all.equal(t1, full_join(semi_join(t1, t3), anti_join(t1, t3)))

Note

To do this exercise, we first re-load and re-create various data tables from the chapter (see Section 8.2.1 above):

  • We load data_t1 and data_t2 (from the ds4psy package or the CSV files online) and arrange them to re-create t1 and t2:
  • To obtain t3 and t4, we can either re-create them (with the code used above):
  • Alternatively, we can load t3.csv and t4.csv from the ds4psy package (or the CSV files online):
  • As a summary, here are the current definitions of data_t1 and data_t2, and their 4 variants t1, t2, t3 and t4:
# Show current data tables:

# data_t1 and data_t2:
data_t1
#> # A tibble: 20 x 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 R.S.  female      2    NA
#>  2 M.Y.  male        2     4
#>  3 K.R.  male        4     1
#>  4 A.V.  female      3     2
#>  5 Z.Y.  female     NA    NA
#>  6 X.D.  female      4     4
#>  7 A.F.  female      5     1
#>  8 X.Y.  female      5     4
#>  9 K.X.  male        3     3
#> 10 M.O.  male        4     1
#> 11 T.V.  male        6     4
#> 12 X.P.  female      4     1
#> 13 Z.D.  male        5     4
#> 14 T.D.  male        5     2
#> 15 C.N.  female      4     3
#> 16 H.V.  female      1     4
#> 17 Q.N.  female      6     1
#> 18 Y.G.  male        4     4
#> 19 L.S.  male        4     4
#> 20 L.V.  female      4     2
data_t2
#> # A tibble: 20 x 4
#>    name  gender like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>
#>  1 X.D.  female      5     1
#>  2 T.D.  male        3    NA
#>  3 Z.D.  male        7     4
#>  4 L.S.  male        7     4
#>  5 H.V.  female      5     2
#>  6 Q.N.  female      6     2
#>  7 X.P.  female      4     1
#>  8 K.R.  male        6     1
#>  9 R.S.  female      7     3
#> 10 C.N.  female      2     2
#> 11 Z.Y.  female      7     1
#> 12 T.V.  male        4     4
#> 13 Y.G.  male        4     4
#> 14 M.O.  male       NA    NA
#> 15 M.Y.  male        6     2
#> 16 A.F.  female      4     1
#> 17 K.X.  male        5     4
#> 18 L.V.  female      2     4
#> 19 X.Y.  female      5     2
#> 20 A.V.  female      5     3

# Sorted versions t1 and t2:
t1
#> # A tibble: 20 x 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 A.F.  female      5     1
#>  2 A.V.  female      3     2
#>  3 C.N.  female      4     3
#>  4 H.V.  female      1     4
#>  5 K.R.  male        4     1
#>  6 K.X.  male        3     3
#>  7 L.S.  male        4     4
#>  8 L.V.  female      4     2
#>  9 M.O.  male        4     1
#> 10 M.Y.  male        2     4
#> 11 Q.N.  female      6     1
#> 12 R.S.  female      2    NA
#> 13 T.D.  male        5     2
#> 14 T.V.  male        6     4
#> 15 X.D.  female      4     4
#> 16 X.P.  female      4     1
#> 17 X.Y.  female      5     4
#> 18 Y.G.  male        4     4
#> 19 Z.D.  male        5     4
#> 20 Z.Y.  female     NA    NA
t2
#> # A tibble: 20 x 4
#>    name  gender like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>
#>  1 A.F.  female      4     1
#>  2 A.V.  female      5     3
#>  3 C.N.  female      2     2
#>  4 H.V.  female      5     2
#>  5 K.R.  male        6     1
#>  6 K.X.  male        5     4
#>  7 L.S.  male        7     4
#>  8 L.V.  female      2     4
#>  9 M.O.  male       NA    NA
#> 10 M.Y.  male        6     2
#> 11 Q.N.  female      6     2
#> 12 R.S.  female      7     3
#> 13 T.D.  male        3    NA
#> 14 T.V.  male        4     4
#> 15 X.D.  female      5     1
#> 16 X.P.  female      4     1
#> 17 X.Y.  female      5     2
#> 18 Y.G.  male        4     4
#> 19 Z.D.  male        7     4
#> 20 Z.Y.  female      7     1

# Variants t3 and t4:
t3
#> # A tibble: 10 x 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 H.V.  female      1     4
#>  2 H.V.  female      1     4
#>  3 K.R.  male        4     1
#>  4 K.R.  male        4     1
#>  5 K.X.  male        3     3
#>  6 L.S.  male        4     4
#>  7 M.Y.  male        2     4
#>  8 Q.N.  female      6     1
#>  9 R.S.  female      2    NA
#> 10 Z.Y.  female     NA    NA
t4
#> # A tibble: 10 x 4
#>    name  gender like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl>
#>  1 A.V.  female      5     3
#>  2 K.R.  male        6     1
#>  3 K.R.  male        6     1
#>  4 L.S.  male        7     4
#>  5 M.O.  male       NA    NA
#>  6 M.Y.  male        6     2
#>  7 Q.N.  female      6     2
#>  8 X.D.  female      5     1
#>  9 X.P.  female      4     1
#> 10 X.P.  female      4     1

A.8.2 Exercise 2

In the previous Chapter 7 on Tidying data we encountered the tidyr::table4a and tidyr::table4b.

  1. Join these tables into one table that contains the same information as tidyr::table1.
Solution
Table A.26: Joined table of table4a and table4b.
country 1999_cases 2000_cases 1999_population 2000_population
Afghanistan 745 2666 19987071 20595360
Brazil 37737 80488 172006362 174504898
China 212258 213766 1272915272 1280428583
  1. Use tidyr and dplyr commands to transform the resulting join into tidyr::table1.

A.8.3 Exercise 3

This exercise uses the following data files:

Preparation

  1. Import both tables of data (as data_t3 and data_t4) and check out each table individually, before comparing them to each other. Do both tables contain the same cases and variables?
Solution
# Load data: 
data_t3 <- ds4psy::data_t3  # from ds4psy package
data_t4 <- ds4psy::data_t4  # from ds4psy package
# data_t3 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t3.csv")  # online
# data_t4 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/data_t4.csv")  # online

# Inspect tables:
data_t3  # 20 x 4
#> # A tibble: 20 x 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 R.S.  female      2    NA
#>  2 M.Y.  male        2     4
#>  3 K.R.  male        4     1
#>  4 A.V.  female      3     2
#>  5 Z.Y.  female     NA    NA
#>  6 X.D.  female      4     4
#>  7 A.F.  female      5     1
#>  8 X.Y.  female      5     4
#>  9 A.V.  male        3     3
#> 10 M.O.  male        4     1
#> 11 T.V.  male        6     4
#> 12 X.P.  female      4     1
#> 13 Z.D.  male        5     4
#> 14 T.D.  male        5     2
#> 15 C.N.  female      4     3
#> 16 H.V.  female      1     4
#> 17 Q.N.  female      6     1
#> 18 Y.G.  male        4     4
#> 19 L.S.  male        4     4
#> 20 L.V.  female      4     2
data_t4  # 20 x 4
#> # A tibble: 20 x 4
#>    initials sex    like_2 bnt_2
#>    <chr>    <chr>   <dbl> <dbl>
#>  1 X.D.     female      5     1
#>  2 T.D.     male        3    NA
#>  3 Z.D.     male        7     4
#>  4 L.S.     male        7     4
#>  5 H.V.     female      5     2
#>  6 Q.N.     female      6     2
#>  7 X.P.     female      4     1
#>  8 K.R.     male        6     1
#>  9 R.S.     female      7     3
#> 10 C.N.     female      2     2
#> 11 Z.Y.     female      7     1
#> 12 T.V.     male        4     4
#> 13 Y.G.     male        4     4
#> 14 M.O.     male       NA    NA
#> 15 M.Y.     male        6     2
#> 16 A.F.     female      4     1
#> 17 A.V.     male        5     4
#> 18 L.V.     female      2     4
#> 19 X.Y.     female      5     2
#> 20 A.V.     female      5     3

# Both tables seem similar, but do not share variable names.
# Are names and initials identical?

# Arrange both tables by first 2 variables:
data_t3 <- data_t3 %>% arrange(name, gender)
data_t4 <- data_t4 %>% arrange(initials, sex)

all.equal(data_t3$name, data_t4$initials)  # TRUE
#> [1] TRUE
all.equal(data_t3$gender, data_t4$sex)     # TRUE
#> [1] TRUE
# => Both tables seem to contain the same people as cases!

Joining tables

  1. Join both tables by using only the name variable of data_t3 and initials variable of data_t4 as keys. Why is this difficult or impossible?
Solution
# Copy data: Make sure that cases are arranged in the same way:
t3 <- data_t3 %>% arrange(name, gender)
t4 <- data_t4 %>% arrange(initials, sex)

# Note 1:
# left_join(t3, t4)  # would yield an error:
# `by` required, because the data sources have no common variables

# Let's specify both primary and foreign key as:  
# by = c("name" = "initials")

# Note 2: Both of the following 
left_join(t3, t4, by = c("name" = "initials"))
#> # A tibble: 22 x 7
#>    name  gender like_1 bnt_1 sex    like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl> <chr>   <dbl> <dbl>
#>  1 A.F.  female      5     1 female      4     1
#>  2 A.V.  female      3     2 female      5     3
#>  3 A.V.  female      3     2 male        5     4
#>  4 A.V.  male        3     3 female      5     3
#>  5 A.V.  male        3     3 male        5     4
#>  6 C.N.  female      4     3 female      2     2
#>  7 H.V.  female      1     4 female      5     2
#>  8 K.R.  male        4     1 male        6     1
#>  9 L.S.  male        4     4 male        7     4
#> 10 L.V.  female      4     2 female      2     4
#> # … with 12 more rows
full_join(t3, t4, by = c("name" = "initials"))
#> # A tibble: 22 x 7
#>    name  gender like_1 bnt_1 sex    like_2 bnt_2
#>    <chr> <chr>   <dbl> <dbl> <chr>   <dbl> <dbl>
#>  1 A.F.  female      5     1 female      4     1
#>  2 A.V.  female      3     2 female      5     3
#>  3 A.V.  female      3     2 male        5     4
#>  4 A.V.  male        3     3 female      5     3
#>  5 A.V.  male        3     3 male        5     4
#>  6 C.N.  female      4     3 female      2     2
#>  7 H.V.  female      1     4 female      5     2
#>  8 K.R.  male        4     1 male        6     1
#>  9 L.S.  male        4     4 male        7     4
#> 10 L.V.  female      4     2 female      2     4
#> # … with 12 more rows
# yield 22 x 7 tables.

# Problem: The name/initials "A.V." appears TWICE data_t3 and data_t4:
length(unique(data_t3$name))            # => only 19 unique names
#> [1] 19
data_t3$name[duplicated(data_t3$name)]  # => "A.V." appears twice.
#> [1] "A.V."
# ==> The combination lists 2x2 instances of "A.V.". 

Alternative join

  1. Use a different command to join both tables into a 20 x 6 table that contains all data in 1 table.
Solution

The following solution first re-names 2 variables in one table (to have 2 matching key variables):

An alternative solution pre-supposes that both tables contain the same cases in the same order and introduces a temporary unique key variable (e.g., the row number) as key to link both tables:

A.8.4 Exercise 4

This exercise uses the datasets on web-based interventions in positive psychology that we have used before. (See Section B.1 of Appendix B for details on the data.)

So far, we always distinguished between 3 main data files:

Given your knowledge of dplyr and tidyr, we can now transform ds4psy::posPsy_long (posPsy_AHI_CESD_corrected.csv) and combine its parts with ds4psy::posPsy_p_info (posPsy_participants.csv) to obtain ds4psy::posPsy_wide (posPsy_data_wide.csv). Do this in the following steps:

Cleaning up

  1. As you might have noticed the number of columns of our newly created data set and the data in wide format (data_wide) are not the same. That is because:
  • information on each participants’ intervention group is included 6 times in our created data;

  • data_wide also includes the information on the participants (contained in p_info).

Exclude any redundant variables on the intervention group (keeping this variable only once) and include the participant information from p_info to create a new table complete_data. Check if and ensure that complete_data is equal to data_wide.

Filter join

  1. Use a dplyr summary table on the data in AHI_CESD to determine the id of all participants that took part in all 6 occasions. Then use a filter join on p_info and on AHI_CESD to keep only the (rows of) data of these participants.
Solution
# Summary table of id values with 6 occasions:
id_6_occ <- AHI_CESD %>%
  group_by(id) %>%
  summarise(n = n()) %>%
  filter(n == 6)
id_6_occ
#> # A tibble: 72 x 2
#>       id     n
#>    <dbl> <int>
#>  1     2     6
#>  2    11     6
#>  3    12     6
#>  4    14     6
#>  5    24     6
#>  6    27     6
#>  7    28     6
#>  8    37     6
#>  9    38     6
#> 10    40     6
#> # … with 62 more rows

# Use id_6_occ$id to filter cases of p_info and AHI_CESD: 
semi_join(p_info,   id_6_occ, by = "id")  #  72 x  6
#> # A tibble: 72 x 6
#>       id intervention   sex   age  educ income
#>    <dbl>        <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     2            1     1    59     1      1
#>  2    11            2     1    56     5      1
#>  3    12            2     1    46     4      3
#>  4    14            4     1    41     1      2
#>  5    24            3     1    51     4      3
#>  6    27            4     1    34     4      3
#>  7    28            1     1    62     4      1
#>  8    37            2     1    47     5      2
#>  9    38            3     1    50     4      3
#> 10    40            2     1    53     4      3
#> # … with 62 more rows
semi_join(AHI_CESD, id_6_occ, by = "id")  # 432 x 50
#> # A tibble: 432 x 50
#>       id occasion elapsed.days intervention ahi01 ahi02 ahi03 ahi04 ahi05 ahi06
#>    <dbl>    <dbl>        <dbl>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     2        0         0               1     3     4     3     4     2     3
#>  2     2        1         8.02            1     3     4     4     4     3     3
#>  3     2        2        14.3             1     3     4     4     4     3     3
#>  4     2        3        32.0             1     3     4     4     4     4     4
#>  5     2        4        92.2             1     3     3     2     3     3     3
#>  6     2        5       182.              1     3     3     3     4     2     3
#>  7    11        0         0               2     2     2     3     4     2     3
#>  8    11        1         8.90            2     2     2     3     4     2     3
#>  9    11        2        14.8             2     3     2     3     3     2     2
#> 10    11        3        34.0             2     2     2     3     4     2     2
#> # … with 422 more rows, and 40 more variables: ahi07 <dbl>, ahi08 <dbl>,
#> #   ahi09 <dbl>, ahi10 <dbl>, ahi11 <dbl>, ahi12 <dbl>, ahi13 <dbl>,
#> #   ahi14 <dbl>, ahi15 <dbl>, ahi16 <dbl>, ahi17 <dbl>, ahi18 <dbl>,
#> #   ahi19 <dbl>, ahi20 <dbl>, ahi21 <dbl>, ahi22 <dbl>, ahi23 <dbl>,
#> #   ahi24 <dbl>, cesd01 <dbl>, cesd02 <dbl>, cesd03 <dbl>, cesd04 <dbl>,
#> #   cesd05 <dbl>, cesd06 <dbl>, cesd07 <dbl>, cesd08 <dbl>, cesd09 <dbl>,
#> #   cesd10 <dbl>, cesd11 <dbl>, cesd12 <dbl>, cesd13 <dbl>, cesd14 <dbl>,
#> #   cesd15 <dbl>, cesd16 <dbl>, cesd17 <dbl>, cesd18 <dbl>, cesd19 <dbl>,
#> #   cesd20 <dbl>, ahiTotal <dbl>, cesdTotal <dbl>

A.8.5 Exercise 5

This exercise relies on the raw data for the false positive psychology (see Section B.2 of Appendix B for details on the data and corresponding information). The original data can be obtained at https://openpsychologydata.metajnl.com/articles/10.5334/jopd.aa/ or https://zenodo.org/record/7664.

Preparation

  1. Download the 2 raw data files and load them into R as study_1 and study_2 (fixing any obvious errors).

Cleaning up

  1. Identify individuals and arrange the dataset: Add a numeric ID variable identifying each individual participant and sort the dataset in an intuitive way.
Solution
## Data: 
dim(fpositive_all)         # 78 x 18
#> [1] 78 18
sum(is.na(fpositive_all))  # No missing values.
#> [1] 0

# Add a factor variable ID to the dataset, 
# which runs from 1 to the total number of participants in the dataset: 
n_rows <- nrow(fpositive_all)

fpositive_all <- fpositive_all %>%
  mutate(ID = 1:n_rows) 

# Order columns for having ID and participant info first, followd by condition and the other items: 
fpositive_all <- fpositive_all %>%
  select(study, ID, aged, aged365, female, dad, mom, cond, potato, when64, kalimba, everything()) %>%
  arrange(study, ID)

# Check re-ordered tibble:
head(fpositive_all)
#> # A tibble: 6 x 19
#>   study    ID  aged aged365 female   dad   mom cond  potato when64 kalimba  root
#>   <dbl> <int> <dbl>   <dbl>  <dbl> <dbl> <dbl> <chr>  <dbl>  <dbl>   <dbl> <dbl>
#> 1     1     1  6765    18.5      0    49    45 cont…      0      0       1     1
#> 2     1     2  7715    21.1      1    63    62 64         0      1       0     1
#> 3     1     3  7630    20.9      0    61    59 64         0      1       0     1
#> 4     1     4  7543    20.7      0    54    51 cont…      0      0       1     1
#> 5     1     5  7849    21.5      0    47    43 64         0      1       0     1
#> 6     1     6  7581    20.8      1    49    50 64         0      1       0     1
#> # … with 7 more variables: bird <dbl>, political <dbl>, quarterback <dbl>,
#> #   olddays <dbl>, feelold <dbl>, computer <dbl>, diner <dbl>
tail(fpositive_all)
#> # A tibble: 6 x 19
#>   study    ID  aged aged365 female   dad   mom cond  potato when64 kalimba  root
#>   <dbl> <int> <dbl>   <dbl>  <dbl> <dbl> <dbl> <chr>  <dbl>  <dbl>   <dbl> <dbl>
#> 1     2    73  7559    20.7      0    52    50 cont…      0      0       1     1
#> 2     2    74  7864    21.5      1    47    65 64         0      1       0     1
#> 3     2    75  7214    19.8      0    55    50 64         0      1       0     1
#> 4     2    76  7173    19.7      1    56    55 64         0      1       0     1
#> 5     2    77  6984    19.1      1    52    43 pota…      1      0       0     1
#> 6     2    78  7082    19.4      0    63    51 pota…      1      0       0     1
#> # … with 7 more variables: bird <dbl>, political <dbl>, quarterback <dbl>,
#> #   olddays <dbl>, feelold <dbl>, computer <dbl>, diner <dbl>

Splitting and joining

  1. The following code splits the data of falsePosPsy into 2 parts (pt_1 and pt_2):
  • Re-create the full data set (from pt_1 and pt_2) in at least 3 different ways (and verify that each of them yields the same data as falsePosPsy).