A.6 Solutions (06)

ds4psy: Solutions 6

Here are the solutions to the exercises on navigating local directories and using essential readr commands for importing and writing data of Chapter 6 (Section A.6).

Introduction

The following exercises require the essential readr commands and repeat many commands from earlier chapters (involving dplyr, ggplot2, and tibble).

A.6.1 Exercise 1

  1. Find out your current working directory and list all files and folders contained in it.

  2. Change your working directory to a different directory (e.g., a parallel directory data that is located on the same level as your current working directory) and list all the files and folders in the other directory.

  3. Return to your original working directory, but list all files in the other (data) directory.

Please note: If you are doing this exercise in an R Markdown file (.Rmd), compiling chunks that contain local paths may yield error messages (in case R runs from a different location). If this happens, simply execute your commands in the Console and set the chunk option eval = FALSE, so that the chunk is not evaluated when compiling the R Markdown file (see Section F.3.3 of Appendix F on Using R Markdown for details).

Solution

my_wd <- getwd()   # store current working directory in my_wd
my_wd              # show (absolute) path to my_wd

list.files(my_wd)  # list files in this directory

  2. Change your working directory to a different directory (e.g., a parallel directory data that is located on the same level as your current working directory) and list all the files and folders in the other directory.

Solution

other_dir <- "./../data"  # (relative) path to the other directory
setwd(other_dir)
getwd()       # verify new location
list.files()  # list all files here

  3. Return to your original working directory, but list all files in the other (data) directory.

Solution

setwd(my_wd)  # return to previous my_wd
getwd()       # verify new location

list.files(my_wd)      # list files in my_wd  
list.files(other_dir)  # list files in other dir (using relative path)
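
The three steps can also be tried out without touching any real project folders. The following is only a minimal sketch in base R, assuming a throwaway directory layout created under tempdir(); the folder names wd_demo, project, and data and the file example.csv are made up for illustration:

```r
# Remember where we started:
old_wd <- getwd()

# Create a hypothetical layout: wd_demo/project and wd_demo/data
base_dir <- file.path(tempdir(), "wd_demo")
dir.create(file.path(base_dir, "project"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(base_dir, "data"), showWarnings = FALSE)
file.create(file.path(base_dir, "data", "example.csv"))

# Work from the project folder and list the parallel data folder
# by its relative path (without changing into it):
setwd(file.path(base_dir, "project"))
files_in_data <- list.files(file.path("..", "data"))
files_in_data

# Always return to the original working directory:
setwd(old_wd)
```

Using file.path() rather than pasting "/" by hand keeps the relative path portable across operating systems.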

A.6.2 Exercise 2

Parsing dates and numbers

Look at your ID card and type your birthday as a string as it’s written on the card (including any spaces or punctuation symbols). For instance, if you were Erika Mustermann (see https://de.wikipedia.org/wiki/Personalausweis_(Deutschland)) you would write the character string “12.08.1964”.

  1. Use an appropriate parse_ command to read this character string into R.

  2. Now read out the date in German (i.e., “12. August 1964”) and use another command to parse this string into R.

  3. Use Google Translate to translate this character string into French, Italian, and Spanish and use appropriate R commands to parse these strings into R.

Hint: Consult vignette("locales") for specifying languages.

Solution

parse_date("12.08.1964", "%d.%m.%Y")
#> [1] "1964-08-12"

parse_date("12. August 1964", "%d. %B %Y", locale = locale("de"))
#> [1] "1964-08-12"

parse_date("12. août 1964", "%d. %B %Y", locale = locale("fr"))
#> [1] "1964-08-12"
parse_date("12. Agosto 1964", "%d. %B %Y", locale = locale("it"))
#> [1] "1964-08-12"
parse_date("12 de agosto de 1964.", "%d de %B de %Y.", locale = locale("es"))
#> [1] "1964-08-12"
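
To see why the locale matters for the %B format, it can help to look at the month names that readr stores for a language. A small sketch, assuming that readr is installed (the object names de_months, d_num, and d_de are only illustrative):

```r
library(readr)

# Full month names that readr knows for German:
de_months <- date_names_lang("de")$mon
de_months[8]

# The same date, once in purely numeric form and once with a German month name:
d_num <- parse_date("12.08.1964", format = "%d.%m.%Y")
d_de  <- parse_date("12. August 1964", format = "%d. %B %Y", locale = locale("de"))

# Both strings should describe the same date:
d_num == d_de
```

The numeric format needs no locale, as it contains no language-specific elements.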

  4. Use a parse_ command (with an appropriate locale) to parse the following character strings into the desired data format:
  • "US$1,099.95" as a number;
  • "EUR1.099,95" as a number.

Solution

parse_number("US$1,099.95") 
#> [1] 1099.95
parse_number("EUR1.099,95", locale = locale(grouping_mark = "."))
#> [1] 1099.95

A.6.3 Exercise 3

A read-write-read cycle

  1. Read in the data in file http://rpository.com/ds4psy/data/data_2.dat into an R object data_2, but by using the command read_delim() rather than by using read_fwf() (as above).

Hint: The variable names should be the same as above, but inspect the file to see its delimiter.

Solution

# Path to file:
my_file <- "./data/data_2.dat"                            # from local directory
# my_file <- "http://rpository.com/ds4psy/data/data_2.dat"  # from online source

# read_delim: 
data_2 <- readr::read_delim(my_file, delim = "$", 
                            col_names = c("initials", "age", "tel", "pwd")
)

dim(data_2)  # 100 observations, 4 variables
#> [1] 100   4
tibble::glimpse(data_2)
#> Rows: 100
#> Columns: 4
#> $ initials <chr> "EU", "KI", "PP", "DH", "PQ", "NN", "NO", "WV", "CS", "XH", "…
#> $ age      <dbl> 63, 71, 39, 49, 71, 42, 63, 60, 70, 20, 63, 48, 54, 31, 20, 7…
#> $ tel      <chr> "0397", "6685", "8950", "5619", "0896", "2282", "8598", "9975…
#> $ pwd      <chr> "aZAIGM", "IHEMCK", "baWzHb", "IdOCIm", "bYheST", "ZWpRIi", "…
sum(is.na(data_2))  # 0 NA values
#> [1] 0

  2. Store the data file as data_2.csv (a csv file that includes variable names) into a directory that is not your current working directory.

Solution

readr::write_csv(x = data_2, file = "./data/data_2.csv")  # 'path' is deprecated since readr 1.4.0

  3. Now use a command to re-read the file data_2.csv back into an object data_2b and use the all.equal() function to verify that data_2 and data_2b are equal.

Solution

data_2b <- readr::read_csv("./data/data_2.csv")

# Verify equality: 
all.equal(data_2, data_2b)  # same data; only the "spec" attribute (delimiter) differs
#> [1] "Attributes: < Component \"spec\": Component \"delim\": 1 string mismatch >"
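
As the output shows, all.equal() also compares attributes, so two tibbles with identical data can still yield a mismatch message. One possible way to focus on the data is the check.attributes argument of the base R all.equal() methods. The following sketch uses made-up toy data and a temporary file (instead of ./data/) and assumes that readr (version 1.4.0 or later) is installed:

```r
library(readr)

# Toy data (made up for illustration):
toy <- data.frame(initials = c("AB", "CD", "EF"),
                  age = c(31, 42, 53),
                  stringsAsFactors = FALSE)

# Write to and re-read from a temporary file:
tmp <- tempfile(fileext = ".csv")
write_csv(toy, file = tmp)
toy_b <- read_csv(tmp, col_types = cols())  # col_types = cols() only silences the message

# Compare the data, ignoring attributes (class, spec, etc.):
same_data <- all.equal(toy, toy_b, check.attributes = FALSE)
same_data
```

Ignoring attributes is a deliberate simplification here: it also ignores differences in class, so it should only be used when the data values are what matters.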

A.6.4 Exercise 4

Reading odd data

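The following data files are variants of the data at http://rpository.com/ds4psy/data/falsePosPsy_all.csv:

  • ex1.dat (http://rpository.com/ds4psy/data/ex1.dat)
  • ex2.dat (http://rpository.com/ds4psy/data/ex2.dat)
  • ex3.dat (http://rpository.com/ds4psy/data/ex3.dat)
  • ex4.dat (http://rpository.com/ds4psy/data/ex4.dat)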

(See Section B.2 of Appendix B for details on the data and corresponding articles.)

Hint: Defining the file paths as R objects saves you from typing them repeatedly later:

# Relative paths to my local copies of the 4 data files:
ex1 <- "./data/falsePosPsy/ex1.dat"
ex2 <- "./data/falsePosPsy/ex2.dat"
ex3 <- "./data/falsePosPsy/ex3.dat"
ex4 <- "./data/falsePosPsy/ex4.dat"

# Online paths to all 4 data files:
ex1 <- "http://rpository.com/ds4psy/data/ex1.dat"
ex2 <- "http://rpository.com/ds4psy/data/ex2.dat"
ex3 <- "http://rpository.com/ds4psy/data/ex3.dat"
ex4 <- "http://rpository.com/ds4psy/data/ex4.dat"
  1. Inspect file ex1.dat and read it in two ways (by using either the generic read.csv or the appropriate variant of read_csv). How do the data read differ from each other?

Inspection

The file ex1.dat uses a comma , as variable separator and point . as a decimal mark. Thus, it uses the data format common in North America (e.g., U.S.A.).

Solution

We can use readr::read_csv or read.csv to import this file:

# a: ex1 is comma-separated: 
a_1 <- utils::read.csv(ex1)
a_2 <- readr::read_csv(ex1)

## Check:
# head(a_1)
# a_2

class(a_1)  # is a data frame
#> [1] "data.frame"
class(a_2)  # is a tibble
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

all.equal(as_tibble(a_1), a_2) # data is equal, 
#> [1] "Attributes: < Names: 1 string mismatch >"                                              
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"                     
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"                              
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"                              
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"                                           
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
# but variable types differ (integer vs. numeric; factor vs. character)

Notes

  • read.csv is available without loading any extra packages (base R) and imports comma-separated data into a data frame.

  • read_csv belongs to the package readr which is part of the tidyverse and imports comma-separated data into a tibble.

  • The imported data frame a_1 and tibble a_2 contain the same data, but the data types of many variables differ. The tibble is simpler and easier to work with.

  2. Inspect and import the dataset ex2.dat using appropriate command(s).

Inspection

The file ex2.dat uses a semi-colon ; as variable separator and a comma , as decimal mark. Thus, it uses the data format common in many European countries (e.g., Germany).

Solution

# b: ex2 is separated by semi-colons (csv2):
b_1 <- read.csv(ex2, sep = ";", dec = ",")
b_2 <- utils::read.csv2(ex2)  # using the command variant
b_3 <- readr::read_csv2(ex2)


# Check:
class(b_1)  # is a data frame
#> [1] "data.frame"
class(b_2) 
#> [1] "data.frame"
all.equal(b_1, b_3)
#> [1] "Attributes: < Names: 1 string mismatch >"                                              
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"                     
#> [3] "Attributes: < Component \"class\": Lengths (1, 4) differ (string compare on first 1) >"
#> [4] "Attributes: < Component \"class\": 1 string mismatch >"                                
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"                              
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"                                           
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"

class(b_3)  # is a tibble
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

# Verify equality: 
all.equal(as_tibble(b_1), b_3) # data is equal, 
#> [1] "Attributes: < Names: 1 string mismatch >"                                              
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"                     
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"                              
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"                              
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"                                           
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
# but variable types differ (integer vs. numeric; factor vs. character)

all.equal(a_1, b_1)  # data.frames are equal
#> [1] TRUE
all.equal(a_2, b_3)  # tibbles are equal
#> [1] "Attributes: < Component \"spec\": Component \"delim\": 1 string mismatch >"

Notes

  • read.csv assumes by default that the symbol used to separate variables is a comma (,) and the symbol used as a decimal mark is a point/period/full stop (.). If different symbols (e.g., a semi-colon ; and a comma ,) are used as variable separator and decimal mark, respectively, this must be specified by providing the sep and dec arguments.

  • The command variant utils::read.csv2 uses these settings (i.e., sep = ";" and dec = ",") as its defaults, rendering it suitable for reading in the ex2 data.

  • We cannot use the readr command read_csv here, as it always expects a comma as the variable separator (like the default of read.csv) and does not allow re-defining this delimiter. However, its variant read_csv2 is appropriate, as it assumes the symbols used here (i.e., ; between variables and , as decimal mark) as its defaults.

  • Warnings when loading files, strange variable types, and errors in arithmetical operations often indicate that files have been read erroneously.

  3. Inspect and import the dataset ex3.dat using appropriate command(s).

Inspection

The file ex3.dat uses | as variable delimiter/separator and , as decimal mark/separator.

Solution

# With read.csv
c_1 <- utils::read.csv(ex3, sep = "|", dec = ",")

# With readr: 
c_2 <- readr::read_delim(ex3, delim = "|", locale = locale(decimal_mark = ","))

## Check:
# head(c_1)
# c_2

class(c_1)  # is a data frame
#> [1] "data.frame"
class(c_2)  # is a tibble
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

all.equal(as_tibble(c_1), c_2) # data is equal, 
#> [1] "Attributes: < Names: 1 string mismatch >"                                              
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"                     
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"                              
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"                              
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"                                           
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
# but variable types differ (integer vs. numeric; factor vs. character) 

all.equal(a_1, c_1)  # data.frames are equal
#> [1] TRUE
all.equal(a_2, c_2)  # tibbles are equal
#> [1] "Attributes: < Component \"spec\": Component \"delim\": 1 string mismatch >"

Notes

  • The base R command read.csv — by virtue of providing explicit sep and dec arguments — is quite flexible to read in other character-delimited files. (However, the decimal mark must be either a comma or a point.)

  • By contrast, the readr commands read_csv and read_csv2 expect specific defaults. When these are not met, read_delim can be used to read a wider range of character-delimited files. Specific decimal marks and grouping marks can be defined in the locale argument.
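
The two approaches from these notes can be compared on a tiny file. This is only a sketch, assuming that readr is installed; the file contents (two made-up rows with a | delimiter and decimal commas) are written to a temporary file first:

```r
library(readr)

# Create a small pipe-delimited file with decimal commas (made-up data):
tmp <- tempfile(fileext = ".dat")
writeLines(c("id|score", "1|1,5", "2|2,25"), tmp)

# Base R: delimiter and decimal mark via sep and dec:
c_base <- utils::read.csv(tmp, sep = "|", dec = ",")

# readr: delimiter via delim, decimal mark via the locale:
c_readr <- read_delim(tmp, delim = "|",
                      locale = locale(decimal_mark = ","),
                      col_types = cols())  # cols() lets readr guess the types quietly

c_base$score
c_readr$score
```

In both cases, the score column should be read as numbers rather than as character strings.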

  4. Inspect and import the dataset ex4.dat using appropriate command(s). Specifically, note the encoding of the age variable (aged365) and check whether you can compute participants’ average age (in years) after importing the data.

Inspection

  • The file ex4.dat uses a comma , as variable separator and point . as a decimal mark (i.e., U.S. style).

  • However, the columns aged365 and female in ex4.dat are encoded as character variables (as indicated by the quotation marks).

Solution

# With read.csv:
d_1 <- read.csv(ex4)

# With read_csv:
d_2 <- read_csv(ex4)

## Check:
head(d_1)
#>   study ID aged  aged365 female dad mom potato when64 kalimba    cond root bird
#> 1     1  1 6765 18.53425 female  49  45      0      0       1 control    1    7
#> 2     1  2 7715 21.13699      1  63  62      0      1       0      64    1    3
#> 3     1  3 7630 20.90411 female  61  59      0      1       0      64    1    7
#>   political quarterback olddays feelold computer diner
#> 1         3           2      13       2        4     7
#> 2         2           1      12       4        4     8
#> 3         1           2      12       2        4     6
#>  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
d_2
#> # A tibble: 78 × 19
#>    study    ID  aged aged365 female   dad   mom potato when64 kalimba cond   
#>    <dbl> <dbl> <dbl>   <dbl> <chr>  <dbl> <dbl>  <dbl>  <dbl>   <dbl> <chr>  
#>  1     1     1  6765    18.5 female    49    45      0      0       1 control
#>  2     1     2  7715    21.1 1         63    62      0      1       0 64     
#>  3     1     3  7630    20.9 female    61    59      0      1       0 64     
#>  4     1     4  7543    20.7 female    54    51      0      0       1 control
#>  5     1     5  7849    21.5 female    47    43      0      1       0 64     
#>  6     1     6  7581    20.8 1         49    50      0      1       0 64     
#>  7     1     7  7534    20.6 1         56    55      0      0       1 control
#>  8     1     8  6678    18.3 1         45    45      0      1       0 64     
#>  9     1     9  6970    19.1 female    53    51      1      0       0 potato 
#> 10     1    10  7681    21.0 female    53    51      0      1       0 64     
#> # … with 68 more rows, and 8 more variables: root <dbl>, bird <dbl>,
#> #   political <dbl>, quarterback <dbl>, olddays <dbl>, feelold <dbl>,
#> #   computer <dbl>, diner <dbl>

class(d_1)  # is a data frame
#> [1] "data.frame"
class(d_2)  # is a tibble
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

all.equal(as_tibble(d_1), d_2) # data is equal, 
#> [1] "Attributes: < Names: 1 string mismatch >"                                              
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"                     
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"                              
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"                              
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"                                           
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
# but variable types differ (integer vs. numeric; factor vs. character) 

all.equal(a_1, d_1)  # data.frames differ in "female" variable: numeric vs. character variable
#> [1] "Component \"female\": Modes: numeric, character"              
#> [2] "Component \"female\": target is numeric, current is character"
all.equal(a_2, d_2)  # tibbles differ in "female" variable: numeric vs. character variable
#> [1] "Attributes: < Component \"spec\": Component \"cols\": Component \"female\": Attributes: < Component \"class\": 1 string mismatch > >"
#> [2] "Component \"female\": Modes: numeric, character"                                                                                     
#> [3] "Component \"female\": target is numeric, current is character"

# Compute the mean age of all participants
mean(d_1$aged365)  # works: 20.8 years
#> [1] 20.81279
mean(d_2$aged365)  # works: 20.8 years
#> [1] 20.81279

Notes

  • Despite being a character variable in ex4.dat, the variable aged365 is numeric in the data read in. This illustrates that both read.csv and read_csv aim to convert data into its appropriate type. This often works well, but can also lead to unexpected or undesired results. Here, the female variable cannot be converted into a numeric column (as it contains the word “female” in ex4.dat). With read.csv, it is therefore read as a character variable in the data frame (or as a factor in R versions before 4.0.0, in which stringsAsFactors = TRUE was the default). With read_csv, it becomes a character variable in the tibble.

  • To further control the class of variables when reading in the data, we can use the following variants of both commands:

# with read.csv:
e_1 <- read.csv(ex4, stringsAsFactors = FALSE)
as_tibble(e_1)  # => female is now a character variable.
#> # A tibble: 78 × 19
#>    study    ID  aged aged365 female   dad   mom potato when64 kalimba cond   
#>    <int> <int> <int>   <dbl> <chr>  <int> <int>  <int>  <int>   <int> <chr>  
#>  1     1     1  6765    18.5 female    49    45      0      0       1 control
#>  2     1     2  7715    21.1 1         63    62      0      1       0 64     
#>  3     1     3  7630    20.9 female    61    59      0      1       0 64     
#>  4     1     4  7543    20.7 female    54    51      0      0       1 control
#>  5     1     5  7849    21.5 female    47    43      0      1       0 64     
#>  6     1     6  7581    20.8 1         49    50      0      1       0 64     
#>  7     1     7  7534    20.6 1         56    55      0      0       1 control
#>  8     1     8  6678    18.3 1         45    45      0      1       0 64     
#>  9     1     9  6970    19.1 female    53    51      1      0       0 potato 
#> 10     1    10  7681    21.0 female    53    51      0      1       0 64     
#> # … with 68 more rows, and 8 more variables: root <int>, bird <int>,
#> #   political <int>, quarterback <int>, olddays <int>, feelold <int>,
#> #   computer <int>, diner <int>

# with read_csv: 
e_2 <- read_csv(ex4,
                col_types = cols(
                  study = col_integer(),
                  ID = col_integer(),
                  aged = col_integer(),
                  aged365 = col_double(),   # integer not possible
                  female = col_character(),
                  dad = col_integer(),
                  mom = col_integer(),
                  potato = col_integer(),
                  when64 = col_integer(),
                  kalimba = col_integer(),
                  cond = col_character(),
                  root = col_integer(),
                  bird = col_integer(),
                  political = col_integer(),
                  quarterback = col_integer(),
                  olddays = col_integer(),
                  feelold = col_integer(),
                  computer = col_integer(),
                  diner = col_integer()
                ))

all.equal(as_tibble(e_1), e_2)  # same data; only attributes differ
#> [1] "Attributes: < Names: 1 string mismatch >"                                              
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"                     
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"                              
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"                              
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"                                           
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
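
Listing all 19 columns works, but is verbose. As most columns share one type, a more compact alternative is the .default argument of cols(), which sets the type of all columns that are not named explicitly. A minimal sketch on a made-up four-column file (not the actual ex4.dat), assuming that readr is installed:

```r
library(readr)

# A small stand-in file with some of the column names of ex4.dat (made-up rows):
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,aged365,female,dad",
             "1,18.5,female,49",
             "2,21.1,1,63"), tmp)

# All columns are integers, unless specified otherwise:
toy <- read_csv(tmp,
                col_types = cols(
                  .default = col_integer(),
                  aged365  = col_double(),
                  female   = col_character()
                ))

col_classes <- sapply(toy, class)
col_classes
```

Only the exceptions need to be spelled out, which makes the specification easier to read and to maintain.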

A.6.5 Exercise 5

Writing data

In Exercise 4 of the previous chapter on tibbles (see Section 5.4.4 of Chapter 5), we created the following summary tibble in different ways (either directly entering it by using tibble commands, or by using dplyr commands to obtain a summary table from the raw data):

knitr::kable(summary, caption = "Age-related data from Simmons et al. (2011). [See Exercise 4 of the chapter on tibbles.]")

Table A.18: Age-related data from Simmons et al. (2011). [See Exercise 4 of the chapter on tibbles.]

cond     n   mean_age  youngest  oldest  fl_vyoung  fl_young  fl_neither  fl_old  fl_vold
64       25  21.09     18.30     38.24   0          13        10          2       0
control  22  20.80     18.53     27.23   3          15        3           1       0
potato   31  20.60     18.18     27.37   1          17        11          2       0

(See Section B.2 of Appendix B for details on the data and corresponding articles.)

Imagine that you are trying to send this file to a friend who — due to excessive demand for our course — was unable to secure a spot in this course and ended up in a course on the “History of data science”, whose members are encouraged to experiment with software products like MS Excel and SPSS.

  1. Assuming that your friend is currently located in Troy, NY (i.e., in the USA), export the summary as a file that your friend can read with her software.

Notes

  • Since our friend is supposed to be in the US, we use a csv file that adheres to the convention that variables are separated by commas (,) and the decimal mark is a point (.) – which is the default setting of readr::write_csv.

  • We write out the csv file to a separate data directory and specify its path and name relative to our current working directory.

Solution

# (a) write_csv:
readr::write_csv(x = summary, file = "./data/summary_a.csv")  # 'path' is deprecated since readr 1.4.0

# (b) write.csv:
write.csv(x = summary, file = "./data/summary_b.csv", row.names = FALSE)

# (c) write.table:
write.table(x = summary, file = "./data/summary_c.csv", row.names = FALSE, sep = ",", dec = ".")

  2. Read back your file and verify that it contains the same information as your original summary.

Solution

summary_2a <- readr::read_csv(file = "./data/summary_a.csv")
all.equal(summary, summary_2a)  # same data; only attributes differ
#> [1] "Attributes: < Names: 1 string mismatch >"                                              
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"                     
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"                              
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"                              
#> [6] "Attributes: < Component 2: Lengths: 3, 1 >"                                            
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"

summary_2b <- read.csv(file = "./data/summary_b.csv", stringsAsFactors = FALSE)
all.equal(summary, summary_2b)  # same data; only the class attribute differs
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"

summary_2c <- read.csv(file = "./data/summary_c.csv", stringsAsFactors = FALSE)
all.equal(summary, summary_2c)  # same data; only the class attribute differs
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"

  3. Now repeat both steps (i.e., writing and re-reading the summary data) under the assumption that your friend is located in Berlin, Germany.

Notes

  • Using a comma-separated value (csv) file is still the best choice. However, since our friend is supposed to be in Germany, we use a csv2 file that adheres to the European convention that variables are separated by semi-colons (;) and the decimal mark is a comma (,) – which is the default setting of readr::write_csv2.

Solution

To signal the different file format to our friend, we’ll use the file extension .csv2.

## Writing data in csv2 format: ---- 

# (a) write_csv2:
# (write_delim with delim = ";" would keep "." as the decimal mark)
readr::write_csv2(x = summary, file = "./data/summary_a.csv2")

# (b) write.csv2:
write.csv2(x = summary, file = "./data/summary_b.csv2", row.names = FALSE)

# (c) write.table:
write.table(x = summary, file = "./data/summary_c.csv2", row.names = FALSE, sep = ";", dec = ",")


## Reading data in csv2 format: ----

summary_3a <- readr::read_csv2(file = "./data/summary_a.csv2")
all.equal(summary, summary_3a)     # same data; only attributes differ
#> [1] "Attributes: < Names: 1 string mismatch >"                                              
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"                     
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"                              
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"                              
#> [6] "Attributes: < Component 2: Lengths: 3, 1 >"                                            
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"

summary_3b <- read.csv2(file = "./data/summary_b.csv2", stringsAsFactors = FALSE)
all.equal(summary, summary_3b)     # same data; only the class attribute differs
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"
all.equal(summary_2b, summary_3b)  # TRUE
#> [1] TRUE

summary_3c <- read.csv2(file = "./data/summary_c.csv2", stringsAsFactors = FALSE)
all.equal(summary, summary_3c)     # same data; only the class attribute differs
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"
all.equal(summary_2c, summary_3c)  # TRUE
#> [1] TRUE
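
When writing csv2 files with readr, it matters which function is used: changing only the delimiter does not change the decimal mark. The following sketch illustrates this on made-up toy data and temporary files, assuming readr version 1.4.0 or later (for the file argument):

```r
library(readr)

# Toy data (made up for illustration):
toy <- data.frame(cond = c("control", "potato"),
                  mean_age = c(20.25, 21.5),
                  stringsAsFactors = FALSE)

f_delim <- tempfile(fileext = ".csv2")
f_csv2  <- tempfile(fileext = ".csv2")

write_delim(toy, file = f_delim, delim = ";")  # changes only the delimiter
write_csv2(toy, file = f_csv2)                 # also writes decimal commas

lines_delim <- readLines(f_delim)
lines_csv2  <- readLines(f_csv2)
lines_delim
lines_csv2

# Only the second file matches the conventions assumed by read_csv2:
back <- read_csv2(f_csv2, col_types = cols())
back$mean_age
```

Reading the first file with read_csv2 would misinterpret the points as grouping marks, which is why inspecting the raw lines before re-reading is a useful habit.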

Notes

  • As it is simple and functional, the readr function write_csv provides good and safe options for exchanging files.

  • For more flexibility (at the cost of more complexity), write_delim or write.table (or one of many other functions) may be more appropriate.

  • For the base R functions write.csv and write.table, the argument row.names defaults to TRUE. As we only want to use column names here, we specified row.names = FALSE. If we left this out, we would get an extra column that counts the rows from 1 to nrow (which normally does not hurt either).

  • Even though our friend can now read our data without problems, she should try to enroll into a real data science course at some point.
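
The effect of the row.names argument mentioned above can be checked directly. A short base R sketch with made-up toy data and temporary files:

```r
# Toy data (made up for illustration):
toy <- data.frame(cond = c("control", "potato"), n = c(22L, 31L))

f_with    <- tempfile(fileext = ".csv")
f_without <- tempfile(fileext = ".csv")

write.csv(toy, file = f_with)                        # default: row.names = TRUE
write.csv(toy, file = f_without, row.names = FALSE)  # column names only

# Re-reading shows the extra (unnamed) first column, which read.csv calls "X":
names_with    <- names(read.csv(f_with))
names_without <- names(read.csv(f_without))
names_with
names_without
```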

A.6.6 Exercise 6

Variants of p_info

In this exercise, we re-visit the participant data on positive psychology interventions that we have analyzed before and try to parse some variants of this data. (See Section B.1 of Appendix B for details on the data.)

  1. Load the data at http://rpository.com/ds4psy/data/posPsy_participants.csv into an R object p_info and compute participants’ mean age by intervention, by sex, and by level of education (educ).

Solution

# Load data (from online source):
my_URL <- "http://rpository.com/ds4psy/data/posPsy_participants.csv"
p_info <- readr::read_csv(file = my_URL)
p_info  # inspect tibble
#> # A tibble: 295 × 6
#>       id intervention   sex   age  educ income
#>    <dbl>        <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     1            4     2    35     5      3
#>  2     2            1     1    59     1      1
#>  3     3            4     1    51     4      3
#>  4     4            3     1    50     5      2
#>  5     5            2     2    58     5      2
#>  6     6            1     1    31     5      1
#>  7     7            3     1    44     5      2
#>  8     8            2     1    57     4      2
#>  9     9            1     1    36     4      3
#> 10    10            2     1    45     4      3
#> # … with 285 more rows

Note that all variables (columns) seem to be encoded as numeric variables.

Use dplyr pipes for creating some descriptive tables:

# Mean age by intervention:
p_info %>% 
  group_by(intervention) %>% 
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 4 × 3
#>   intervention     n mn_age
#>          <dbl> <int>  <dbl>
#> 1            1    72   44.6
#> 2            2    76   45.4
#> 3            3    74   43.3
#> 4            4    73   41.7

# Mean age by sex:
p_info %>% 
  group_by(sex) %>% 
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 2 × 3
#>     sex     n mn_age
#>   <dbl> <int>  <dbl>
#> 1     1   251   43.9
#> 2     2    44   43.0

# Mean age by educ:
p_info %>% 
  group_by(educ) %>% 
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 5 × 3
#>    educ     n mn_age
#>   <dbl> <int>  <dbl>
#> 1     1    14   45.5
#> 2     2    21   42.2
#> 3     3    39   44.2
#> 4     4   104   42.7
#> 5     5   117   44.6

  2. Download the file p_info_2.dat (located at http://rpository.com/ds4psy/data/p_info_2.dat) into a local directory (called data) and import it from there into an R object p_info_2. (Hint: Inspect the file prior to loading it: What is different in this file?)

Solution

# Downloaded http://rpository.com/ds4psy/data/p_info_2.dat into 
# a local directory ./data/: 
my_file <- "./data/p_info_2.dat"

# Inspecting the file shows that "|" is used as delimiter and "," as decimal mark:
p_info_2 <- readr::read_delim(file = my_file, delim = "|", locale = locale(decimal_mark = ","))
p_info_2  # inspect tibble
#> # A tibble: 295 × 6
#>       id intervention             sex      age educ           income
#>    <dbl> <chr>                    <chr>  <dbl> <chr>           <dbl>
#>  1     1 early memories (control) male      35 MSc/PhD degree      3
#>  2     2 signature strengths      female    59 below year 12       1
#>  3     3 early memories (control) female    51 BSc degree          3
#>  4     4 gratitude visit          female    50 MSc/PhD degree      2
#>  5     5 3 good things            male      58 MSc/PhD degree      2
#>  6     6 signature strengths      female    31 MSc/PhD degree      1
#>  7     7 gratitude visit          female    44 MSc/PhD degree      2
#>  8     8 3 good things            female    57 BSc degree          2
#>  9     9 signature strengths      female    36 BSc degree          3
#> 10    10 3 good things            female    45 BSc degree          3
#> # … with 285 more rows

Exploring the file shows an odd problem with its id variable:

mean(p_info_2$id)  # => 147.999 [i.e., almost (295 + 1)/2]
#> [1] 147.9997
p_info_2$id  # contains a value of 99.9 at position 100: 
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#> [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
#>  [ reached getOption("max.print") -- omitted 220 entries ]
p_info_2$id[100]
#> [1] 99.9
sum(ds4psy::is_wholenumber(p_info_2$id) == FALSE)
#> [1] 1

# Note: If we omitted the locale = locale(decimal_mark = ","): 
p_info_2b <- readr::read_delim(file = my_file, delim = "|")
mean(p_info_2b$id)  # => 151.0475, due to id[100] being read as 999:
#> [1] 151.0475
p_info_2b$id[100]   # => 999 (rather than 100)
#> [1] 999

Answers

  • The variables intervention, sex and educ were encoded as numbers (of type integer or double) before, but are now encoded as character variables.

  • The variable id is encoded as a double, due to the value 100 being erroneously saved as 99,9 (with a comma as the decimal mark). This should be fixed prior to any further analysis.

  • In both data files, some variables should probably be converted into factors.
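
A check for non-integer values like the one above does not require an extra package. As a rough sketch in base R, using a made-up id vector with one corrupted entry (not the actual data):

```r
# Made-up ids with one value that is not a whole number:
ids <- c(97, 98, 99, 99.9, 101)

# Positions whose value has a non-zero fractional part:
not_whole <- which(ids %% 1 != 0)
not_whole
ids[not_whole]
```

This exact comparison is sufficient for values that were typed or read in as decimals. For numbers that result from arithmetic, a check with some numeric tolerance would be more robust.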

  3. Recompute the mean age by intervention, by sex, and by level of education (educ). Are they the same as before?

Solution

# Mean age by intervention:
p_info_2 %>% 
  group_by(intervention) %>% 
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 4 × 3
#>   intervention                 n mn_age
#>   <chr>                    <int>  <dbl>
#> 1 3 good things               76   45.4
#> 2 early memories (control)    73   41.7
#> 3 gratitude visit             74   43.3
#> 4 signature strengths         72   44.6

# Mean age by sex:
p_info_2 %>% 
  group_by(sex) %>% 
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 2 × 3
#>   sex        n mn_age
#>   <chr>  <int>  <dbl>
#> 1 female   251   43.9
#> 2 male      44   43.0

# Mean age by educ:
p_info_2 %>% 
  group_by(educ) %>% 
  summarise(n = n(),
            mn_age = mean(age))
#> # A tibble: 5 × 3
#>   educ                    n mn_age
#>   <chr>               <int>  <dbl>
#> 1 below year 12          14   45.5
#> 2 BSc degree            104   42.7
#> 3 MSc/PhD degree        117   44.6
#> 4 vocational training    39   44.2
#> 5 year 12                21   42.2

Answer

The means are the same as before, but their order (in the resulting tibbles) is different, as character variables are listed in alphabetical order (unless they are defined as factors).
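
To control the order of rows in such summaries, a character variable can be turned into a factor with explicitly ordered levels. The following sketch assumes that dplyr is installed and uses a few made-up rows (not the actual data); the level labels are taken from the educ values shown above, in the order of their original numeric codes:

```r
library(dplyr)

# Made-up rows using some of the educ labels from above:
toy <- tibble(educ = c("year 12", "BSc degree", "below year 12", "BSc degree"),
              age  = c(30, 40, 50, 60))

# Desired order of levels (from lowest to highest level of education):
lvls <- c("below year 12", "year 12", "vocational training",
          "BSc degree", "MSc/PhD degree")

by_fct <- toy %>%
  mutate(educ = factor(educ, levels = lvls)) %>%
  group_by(educ) %>%
  summarise(n = n(),
            mn_age = mean(age))
by_fct
```

The rows of by_fct follow the order of the factor levels, whereas the order for a character variable can depend on the dplyr version and locale. Levels that do not occur in the data are dropped from the summary by default.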

This concludes our set of exercises on importing data.