## A.6 Solutions (06)

Here are the solutions to the exercises of Chapter 6 (Section A.6) on navigating local directories and using essential readr commands for importing and writing data.

#### Introduction

The following exercises require the essential readr commands and repeat many commands from earlier chapters (involving dplyr, ggplot2, and tibble).

### A.6.1 Exercise 1

1. Find out your current working directory and list all files and folders contained in it.

2. Change your working directory to a different directory (e.g., a parallel directory data that is located on the same level as your current working directory) and list all the files and folders in the other directory.

3. Return to your original working directory, but list all files in the other (data) directory.

Please note: If you are doing this exercise in an R Markdown file (.Rmd), it is possible that compiling chunks that contain local paths may yield error messages (in case R runs from a different location). If this happens, simply execute your commands in the Console and set the chunk option to eval = FALSE to stop compiling the files in R Markdown (see Section F.3.3 of Appendix F on Using R Markdown for details).

#### Solution

my_wd <- getwd()   # store current working directory in my_wd
my_wd              # show (absolute) path to my_wd

list.files(my_wd)  # list files in this directory

2. Change your working directory to a different directory (e.g., a parallel directory data that is located on the same level as your current working directory) and list all the files and folders in the other directory.

#### Solution

other_dir <- "./../data"  # (relative) path to the other directory
setwd(other_dir)
getwd()       # verify new location
list.files()  # list all files here

3. Return to your original working directory, but list all files in the other (data) directory.

#### Solution

setwd(my_wd)  # return to previous my_wd
getwd()       # verify new location

list.files(my_wd)      # list files in my_wd
list.files(other_dir)  # list files in other dir (using relative path)
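
The pattern of steps 2 and 3 (change the directory, do something there, change back) can also be wrapped in a small function, so that the original working directory is restored even if an error occurs in between. The following is only a minimal sketch in base R: `list_files_in()` is a hypothetical helper name, and the temporary directory and its file are assumptions that merely make the example self-contained.

```r
# Hypothetical helper (not part of the original solution):
# list the files of dir and restore the previous working directory afterwards.
list_files_in <- function(dir) {
  old_wd <- setwd(dir)      # setwd() invisibly returns the previous directory
  on.exit(setwd(old_wd))    # restore it when the function exits (even on error)
  list.files()
}

# Demo directory (an assumption, for illustration only):
demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "a.txt"))

wd_before  <- getwd()
demo_files <- list_files_in(demo_dir)
demo_files                      # files found in demo_dir
identical(getwd(), wd_before)   # the working directory is unchanged
```

Compared to calling setwd() twice by hand, this keeps the change of directory local to one function call.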

### A.6.2 Exercise 2

#### Parsing dates and numbers

Look at your ID card and type your birthday as a string as it’s written on the card (including any spaces or punctuation symbols). For instance, if you were Erika Mustermann (see https://de.wikipedia.org/wiki/Personalausweis_(Deutschland)) you would write the character string “12.08.1964”.

1. Use an appropriate parse_ command to read this character string into R.

2. Now read out the date in German (i.e., “12. August 1964”) and use another command to parse this string into R.

3. Use Google Translate to translate this character string into French, Italian, and Spanish and use appropriate R commands to parse these strings into R.

Hint: Consult vignette("locales") for specifying languages.

#### Solution

parse_date("12.08.1964", "%d.%m.%Y")
#> [1] "1964-08-12"

parse_date("12. August 1964", "%d. %B %Y", locale = locale("de"))
#> [1] "1964-08-12"

parse_date("12. août 1964", "%d. %B %Y", locale = locale("fr"))
#> [1] "1964-08-12"
parse_date("12. Agosto 1964", "%d. %B %Y", locale = locale("it"))
#> [1] "1964-08-12"
parse_date("12 de agosto de 1964.", "%d de %B de %Y.", locale = locale("es"))
#> [1] "1964-08-12"
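
If a date in another language fails to parse, it can help to look up the month names that readr expects for that locale. The following sketch uses readr's date_names_lang() to build a date string from each locale's own name for the eighth month and then parses it back. The loop and the variable names are illustrative assumptions, not part of the original solution.

```r
library(readr)

langs  <- c("de", "fr", "it", "es")
parsed <- character(length(langs))

for (i in seq_along(langs)) {
  month_name  <- date_names_lang(langs[i])$mon[8]   # full name of month 8
  date_string <- paste("12", month_name, "1964")    # e.g., "12 August 1964"
  date_value  <- parse_date(date_string, "%d %B %Y",
                            locale = locale(langs[i]))
  parsed[i]   <- format(date_value)
  cat(langs[i], ":", date_string, "->", parsed[i], "\n")
}
```

Printing date_names_lang("fr") on its own shows all month and day names of a locale, which is useful for checking spelling and accents.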

4. Use a parse_ command (with an appropriate locale) to parse the following character strings into the desired data format:

• "US$1,099.95" as a number;
• "EUR1.099,95" as a number.

#### Solution

parse_number("US$1,099.95")
#> [1] 1099.95
parse_number("EUR1.099,95", locale = locale(grouping_mark = "."))
#> [1] 1099.95
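
In locale(), the decimal mark and the grouping mark are coupled: if only one of them is specified, readr sets the other one so that the two differ. As a small sketch (the variable names x1 to x3 are illustrative), the following three calls should therefore return the same number:

```r
library(readr)

# Only the grouping mark is set; the decimal mark then becomes ",":
x1 <- parse_number("EUR1.099,95", locale = locale(grouping_mark = "."))

# Only the decimal mark is set; the grouping mark then becomes ".":
x2 <- parse_number("EUR1.099,95", locale = locale(decimal_mark = ","))

# Both marks are set explicitly:
x3 <- parse_number("EUR1.099,95",
                   locale = locale(decimal_mark = ",", grouping_mark = "."))

c(x1, x2, x3)
```

Setting both marks explicitly is more verbose, but documents the intended number format for later readers of the code.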

### A.6.3 Exercise 3

1. Read in the data in file http://rpository.com/ds4psy/data/data_2.dat into an R object data_2, but by using the command read_delim() rather than by using read_fwf() (as above).

Hint: The variable names should be the same as above, but inspect the file to see its delimiter.

#### Solution

# Path to file:
my_file <- "./data/data_2.dat"                            # from local directory
# my_file <- "http://rpository.com/ds4psy/data/data_2.dat"  # from online source

data_2 <- readr::read_delim(my_file, delim = "$",
                            col_names = c("initials", "age", "tel", "pwd"))

dim(data_2)  # 100 observations, 4 variables
#> [1] 100   4

tibble::glimpse(data_2)
#> Rows: 100
#> Columns: 4
#> $ initials <chr> "EU", "KI", "PP", "DH", "PQ", "NN", "NO", "WV", "CS", "XH", "…
#> $ age      <dbl> 63, 71, 39, 49, 71, 42, 63, 60, 70, 20, 63, 48, 54, 31, 20, 7…
#> $ tel      <chr> "0397", "6685", "8950", "5619", "0896", "2282", "8598", "9975…
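
The read_delim() call can be tried out without the original file. The following sketch writes three made-up records in the same layout (fields separated by "$") to a temporary file and reads them back. The records, the temporary file, and the explicit col_types are assumptions for illustration; the solution above relies on readr guessing the column types.

```r
library(readr)

# Made-up records in the layout initials$age$tel$pwd:
demo_lines <- c("AB$30$0123$secret1",
                "CD$45$4567$secret2",
                "EF$61$0089$secret3")
demo_file <- tempfile(fileext = ".dat")
writeLines(demo_lines, demo_file)

demo <- read_delim(demo_file, delim = "$",
                   col_names = c("initials", "age", "tel", "pwd"),
                   col_types = cols(initials = col_character(),
                                    age      = col_double(),
                                    tel      = col_character(),  # keeps leading zeros
                                    pwd      = col_character()))
demo
```

Declaring tel as a character column ensures that values like "0123" keep their leading zero, independently of type guessing.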
#> [1] 20.81279
mean(d_2$aged365)  # works: 20.8 years
#> [1] 20.81279

#### Notes

• Despite being a character variable in ex4.dat, the variable aged365 is numeric in the data read in. This illustrates that both read.csv and read_csv aim to convert data into its appropriate type. This often works well, but can also lead to unexpected or undesired results. Here, the female variable cannot be converted into a numeric column (as it contains the word “female” in ex4.dat). This leads to its interpretation as a factor in the data.frame (when using read.csv with stringsAsFactors = TRUE, which was the default before R 4.0.0) and as a character variable in the tibble (when using read_csv).

• To further control the class of variables when reading in the data, we can use the following variants of both commands:

# with read.csv:
e_1 <- read.csv(ex4, stringsAsFactors = FALSE)
as_tibble(e_1)  # => female is now a character variable.
#> # A tibble: 78 × 19
#>    study    ID  aged aged365 female   dad   mom potato when64 kalimba cond
#>    <int> <int> <int>   <dbl> <chr>  <int> <int>  <int>  <int>   <int> <chr>
#>  1     1     1  6765    18.5 female    49    45      0      0       1 control
#>  2     1     2  7715    21.1 1         63    62      0      1       0 64
#>  3     1     3  7630    20.9 female    61    59      0      1       0 64
#>  4     1     4  7543    20.7 female    54    51      0      0       1 control
#>  5     1     5  7849    21.5 female    47    43      0      1       0 64
#>  6     1     6  7581    20.8 1         49    50      0      1       0 64
#>  7     1     7  7534    20.6 1         56    55      0      0       1 control
#>  8     1     8  6678    18.3 1         45    45      0      1       0 64
#>  9     1     9  6970    19.1 female    53    51      1      0       0 potato
#> 10     1    10  7681    21.0 female    53    51      0      1       0 64
#> # … with 68 more rows, and 8 more variables: root <int>, bird <int>,
#> #   political <int>, quarterback <int>, olddays <int>, feelold <int>,
#> #   computer <int>, diner <int>

# with read_csv:
e_2 <- read_csv(ex4,
                col_types = cols(
                  study = col_integer(),
                  ID = col_integer(),
                  aged = col_integer(),
                  aged365 = col_double(),  # integer not possible
                  female = col_character(),
                  dad = col_integer(),
                  mom = col_integer(),
                  potato = col_integer(),
                  when64 = col_integer(),
                  kalimba = col_integer(),
                  cond = col_character(),
                  root = col_integer(),
                  bird = col_integer(),
                  political = col_integer(),
                  quarterback = col_integer(),
                  olddays = col_integer(),
                  feelold = col_integer(),
                  computer = col_integer(),
                  diner = col_integer()
                ))

all.equal(as_tibble(e_1), e_2)  # same data; only attributes differ
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"

### A.6.5 Exercise 5

#### Writing data

In Exercise 4 of the previous chapter on tibbles (see Section 5.4.4 of Chapter 5), we created the following summary tibble in different ways (either directly entering it by using tibble commands, or by using dplyr commands to obtain a summary table from the raw data):

knitr::kable(summary,
             caption = "Age-related data from Simmons et al. (2011). [See Exercise 4 of the chapter on tibbles.]")

Table A.18: Age-related data from Simmons et al. (2011). [See Exercise 4 of the chapter on tibbles.]

| cond    |  n | mean_age | youngest | oldest | fl_vyoung | fl_young | fl_neither | fl_old | fl_vold |
|:--------|---:|---------:|---------:|-------:|----------:|---------:|-----------:|-------:|--------:|
| 64      | 25 |    21.09 |    18.30 |  38.24 |         0 |       13 |         10 |      2 |       0 |
| control | 22 |    20.80 |    18.53 |  27.23 |         3 |       15 |          3 |      1 |       0 |
| potato  | 31 |    20.60 |    18.18 |  27.37 |         1 |       17 |         11 |      2 |       0 |

(See Section B.2 of Appendix B for details on the data and corresponding articles.)

Imagine that you are trying to send this file to a friend who (due to excessive demand for our course) was unable to secure a spot in this course and ended up in a course on the “History of data science,” whose members are encouraged to experiment with software products like MS Excel and SPSS.

1. Assuming that your friend is currently located in Troy, NY (i.e., in the USA), export the summary as a file that your friend can read with her software.
#### Notes

• Since our friend is supposed to be in the US, we use a csv file that adheres to the convention that variables are separated by commas (,) and the decimal mark is a point (.), which is the default setting of readr::write_csv.

• We write out the csv file to a separate data directory and specify its path and name relative to our current working directory.

#### Solution

# (a) write_csv:
readr::write_csv(x = summary, file = "./data/summary_a.csv")

# (b) write.csv:
write.csv(x = summary, file = "./data/summary_b.csv", row.names = FALSE)

# (c) write.table:
write.table(x = summary, file = "./data/summary_c.csv", row.names = FALSE, sep = ",", dec = ".")

2. Read back your file and verify that it contains the same information as your original summary.

#### Solution

summary_2a <- readr::read_csv(file = "./data/summary_a.csv")
all.equal(summary, summary_2a)  # same data; only attributes differ
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
#> [6] "Attributes: < Component 2: Lengths: 3, 1 >"
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"

summary_2b <- read.csv(file = "./data/summary_b.csv", stringsAsFactors = FALSE)
all.equal(summary, summary_2b)  # same data; only the class attribute differs
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"

summary_2c <- read.csv(file = "./data/summary_c.csv", stringsAsFactors = FALSE)
all.equal(summary, summary_2c)  # same data; only the class attribute differs
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"

3. Now repeat both steps (i.e., writing and re-reading the summary data) under the assumption that your friend is located in Berlin, Germany.

#### Notes

• Using a comma-separated value (csv) file is still the best choice. However, since our friend is supposed to be in Germany, we use a csv2 file that adheres to the European convention that variables are separated by semi-colons (;) and the decimal mark is a comma (,), which is the default setting of readr::write_csv2.

#### Solution

To signal the different file format to our friend, we’ll use the file extension .csv2.

## Writing data in csv2 format: ----

# (a) write_csv2:
readr::write_csv2(x = summary, file = "./data/summary_a.csv2")

# (b) write.csv2:
write.csv2(x = summary, file = "./data/summary_b.csv2", row.names = FALSE)

# (c) write.table:
write.table(x = summary, file = "./data/summary_c.csv2", row.names = FALSE, sep = ";", dec = ",")

## Reading data in csv2 format: ----

summary_3a <- readr::read_csv2(file = "./data/summary_a.csv2")
all.equal(summary, summary_3a)  # same data; only attributes differ
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
#> [6] "Attributes: < Component 2: Lengths: 3, 1 >"
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"

summary_3b <- read.csv2(file = "./data/summary_b.csv2", stringsAsFactors = FALSE)
all.equal(summary, summary_3b)  # same data; only the class attribute differs
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"
all.equal(summary_2b, summary_3b)  # TRUE
#> [1] TRUE

summary_3c <- read.csv2(file = "./data/summary_c.csv2", stringsAsFactors = FALSE)
all.equal(summary, summary_3c)  # same data; only the class attribute differs
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"
all.equal(summary_2c, summary_3c)  # TRUE
#> [1] TRUE

#### Notes

• As it is simple and functional, the readr function write_csv provides good and safe options for exchanging files.

• For more flexibility and potential complexity, it may be advisable to use write_delim or write.table (or one of many other functions).

• For the base R functions write.csv and write.table, the default is row.names = TRUE. As we only want to use column names here, we specified row.names = FALSE. If we left this out, we would get an extra column that counts the rows from 1 to nrow (which normally does not hurt either).

• Even though our friend can now read our data without problems, she should try to enroll in a real data science course at some point.

### A.6.6 Exercise 6

#### Variants of p_info

In this exercise, we re-visit the participant data on positive psychology interventions that we have analyzed before and try to parse some variants of this data. (See Section B.1 of Appendix B for details on the data.)

1. Load the data at http://rpository.com/ds4psy/data/posPsy_participants.csv into an R object p_info and compute participants’ mean age by intervention, by sex, and by level of education (educ).
#### Solution

# Load data (from online source):
my_URL <- "http://rpository.com/ds4psy/data/posPsy_participants.csv"
p_info <- readr::read_csv(file = my_URL)

p_info  # inspect tibble
#> # A tibble: 295 × 6
#>       id intervention   sex   age  educ income
#>    <dbl>        <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     1            4     2    35     5      3
#>  2     2            1     1    59     1      1
#>  3     3            4     1    51     4      3
#>  4     4            3     1    50     5      2
#>  5     5            2     2    58     5      2
#>  6     6            1     1    31     5      1
#>  7     7            3     1    44     5      2
#>  8     8            2     1    57     4      2
#>  9     9            1     1    36     4      3
#> 10    10            2     1    45     4      3
#> # … with 285 more rows

Note that all variables (columns) seem to be encoded as numeric variables.

Use dplyr pipes for creating some descriptive tables:

# Mean age by intervention:
p_info %>%
  group_by(intervention) %>%
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 4 × 3
#>   intervention     n mn_age
#>          <dbl> <int>  <dbl>
#> 1            1    72   44.6
#> 2            2    76   45.4
#> 3            3    74   43.3
#> 4            4    73   41.7

# Mean age by sex:
p_info %>%
  group_by(sex) %>%
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 2 × 3
#>     sex     n mn_age
#>   <dbl> <int>  <dbl>
#> 1     1   251   43.9
#> 2     2    44   43.0

# Mean age by educ:
p_info %>%
  group_by(educ) %>%
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 5 × 3
#>    educ     n mn_age
#>   <dbl> <int>  <dbl>
#> 1     1    14   45.5
#> 2     2    21   42.2
#> 3     3    39   44.2
#> 4     4   104   42.7
#> 5     5   117   44.6

2. Download the file p_info_2.dat (located at http://rpository.com/ds4psy/data/p_info_2.dat) into a local directory (called data) and import it from there into an R object p_info_2. (Hint: Inspect the file prior to loading it: What is different in this file?)
#### Solution

# Downloaded http://rpository.com/ds4psy/data/p_info_2.dat into
# a local directory ./data/:
my_file <- "./data/p_info_2.dat"

# Inspecting the file shows that "|" is used as delimiter and "," as decimal mark:
p_info_2 <- readr::read_delim(file = my_file, delim = "|",
                              locale = locale(decimal_mark = ","))

p_info_2  # inspect tibble
#> # A tibble: 295 × 6
#>       id intervention             sex      age educ           income
#>    <dbl> <chr>                    <chr>  <dbl> <chr>           <dbl>
#>  1     1 early memories (control) male      35 MSc/PhD degree      3
#>  2     2 signature strengths      female    59 below year 12       1
#>  3     3 early memories (control) female    51 BSc degree          3
#>  4     4 gratitude visit          female    50 MSc/PhD degree      2
#>  5     5 3 good things            male      58 MSc/PhD degree      2
#>  6     6 signature strengths      female    31 MSc/PhD degree      1
#>  7     7 gratitude visit          female    44 MSc/PhD degree      2
#>  8     8 3 good things            female    57 BSc degree          2
#>  9     9 signature strengths      female    36 BSc degree          3
#> 10    10 3 good things            female    45 BSc degree          3
#> # … with 285 more rows

Exploring the file shows an odd problem with its id variable:

mean(p_info_2$id)  # => 147.999 [i.e., almost (295 + 1)/2]
#> [1] 147.9997
p_info_2$id  # contains a value of 99.9 at position 100:
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#> [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
#>  [ reached getOption("max.print") -- omitted 220 entries ]
p_info_2$id[100]
#> [1] 99.9
sum(ds4psy::is_wholenumber(p_info_2$id) == FALSE)
#> [1] 1

# Note: If we omitted the locale = locale(decimal_mark = ","):
p_info_2b <- readr::read_delim(file = my_file, delim = "|")
mean(p_info_2b$id)  # => 151.0475, due to:
#> [1] 151.0475
p_info_2b$id[100]   # => 999 (rather than 100)
#> [1] 999

• The variables intervention, sex and educ were encoded as numbers (of type integer or double) before, but are now encoded as character variables.

• The variable id is encoded as a double, due to the value 100 being erroneously saved as 99,9 (with a comma as the decimal mark). This should be fixed prior to any further analysis.

• In both data files, some variables should probably be converted into factors.
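
The two repairs suggested in the notes above (fixing the faulty id and converting some character variables into factors) could be sketched as follows. The toy rows are made up, rounding all id values is a simplifying assumption (in the real data one may prefer to correct only the known faulty entry), and the order of education levels is an assumption inferred from matching the group sizes of the numeric and the labelled summary tables in this section.

```r
library(dplyr)

# Toy rows in the style of p_info_2 (values are made up for illustration):
p_toy <- tibble::tibble(
  id   = c(99, 99.9, 101),
  sex  = c("female", "male", "female"),
  educ = c("BSc degree", "below year 12", "year 12")
)

# Assumed order of education levels (corresponding to the codes 1 to 5):
educ_levels <- c("below year 12", "year 12", "vocational training",
                 "BSc degree", "MSc/PhD degree")

p_fixed <- p_toy %>%
  mutate(id   = as.integer(round(id)),   # 99.9 becomes 100
         sex  = factor(sex),
         educ = factor(educ, levels = educ_levels, ordered = TRUE))

p_fixed
```

Specifying the levels explicitly keeps the education categories in a meaningful order, rather than the alphabetical order that factor() would use by default.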

3. Recompute the mean age by intervention, by sex, and by level of education (educ). Are they the same as before?

#### Solution

# Mean age by intervention:
p_info_2 %>%
  group_by(intervention) %>%
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 4 × 3
#>   intervention                 n mn_age
#>   <chr>                    <int>  <dbl>
#> 1 3 good things               76   45.4
#> 2 early memories (control)    73   41.7
#> 3 gratitude visit             74   43.3
#> 4 signature strengths         72   44.6

# Mean age by sex:
p_info_2 %>%
  group_by(sex) %>%
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 2 × 3
#>   sex        n mn_age
#>   <chr>  <int>  <dbl>
#> 1 female   251   43.9
#> 2 male      44   43.0

# Mean age by educ:
p_info_2 %>%
  group_by(educ) %>%
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 5 × 3
#>   educ                    n mn_age
#>   <chr>               <int>  <dbl>
#> 1 below year 12          14   45.5
#> 2 BSc degree            104   42.7
#> 3 MSc/PhD degree        117   44.6
#> 4 vocational training    39   44.2
#> 5 year 12                21   42.2
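
Because the groups are now labelled by character strings and sorted alphabetically, the rows of these tables appear in a different order than in the tables based on p_info. One way to check whether the numbers agree is to bring both summaries into a common order before comparing them. The following is a minimal sketch with made-up toy data; the mapping 1 = female and 2 = male is an assumption inferred from the matching group sizes in the tables of this section.

```r
library(dplyr)

# Toy data with numeric codes (made up for illustration):
toy_num <- tibble::tibble(sex = c(1, 1, 2, 2, 1),
                          age = c(35, 59, 51, 50, 58))

# The same data with character labels (assumed mapping: 1 = female, 2 = male):
sex_labels <- c("female", "male")
toy_chr <- toy_num %>% mutate(sex = sex_labels[sex])

# Summaries by group, sorted by mean age to obtain a common row order:
by_num <- toy_num %>%
  group_by(sex) %>%
  summarise(n = n(), mn_age = mean(age)) %>%
  arrange(mn_age)
by_chr <- toy_chr %>%
  group_by(sex) %>%
  summarise(n = n(), mn_age = mean(age)) %>%
  arrange(mn_age)

# Compare group sizes and means, ignoring the group labels:
same_n    <- all.equal(by_num$n, by_chr$n)
same_mean <- all.equal(by_num$mn_age, by_chr$mn_age)
c(same_n, same_mean)
```

Applied to the summaries of p_info and p_info_2, the same comparison would show whether both files yield identical group sizes and mean ages.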