A.6 Solutions (06)
Here are the solutions to the Chapter 6 exercises on navigating local directories and using essential readr commands for importing and writing data (Section A.6).
Introduction
The following exercises require the essential readr commands and repeat many commands from earlier chapters (involving dplyr, ggplot2, and tibble).
A.6.1 Exercise 1
Find out your current working directory and list all files and folders contained in it.
Change your working directory to a different directory (e.g., a parallel directory data that is located on the same level as your current working directory) and list all the files and folders in the other directory.
Return to your original working directory, but list all files in the other (data) directory.
Please note: If you are doing this exercise in an R Markdown file (.Rmd), it is possible that compiling chunks that contain local paths may yield error messages (in case R runs from a different location). If this happens, simply execute your commands in the Console and set the chunk option to eval = FALSE to stop compiling the files in R Markdown (see Section F.3.3 of Appendix F on Using R Markdown for details).
Solution
my_wd <- getwd() # store current working directory in my_wd
my_wd # show (absolute) path to my_wd
list.files(my_wd) # list files in this directory
- Change your working directory to a different directory (e.g., a parallel directory data that is located on the same level as your current working directory) and list all the files and folders in the other directory.
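A minimal sketch of this step, using a throwaway folder structure under tempdir() so that it runs anywhere. The folder names wd_demo, project, and data, and the file example.csv, are assumptions for illustration; in a real project you would use your own paths.

```r
# Assumed layout (created here only for illustration):
#   <tmp>/wd_demo/project   (our "current" working directory)
#   <tmp>/wd_demo/data      (a parallel directory on the same level)
base_dir <- file.path(tempdir(), "wd_demo")
dir.create(file.path(base_dir, "project"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(base_dir, "data"), showWarnings = FALSE)
invisible(file.create(file.path(base_dir, "data", "example.csv")))

old_wd <- setwd(file.path(base_dir, "project"))  # setwd() returns the previous directory

setwd("../data")                                 # move to the parallel directory
files_in_data <- list.files()                    # list files and folders there

setwd("../project")                              # return to the original directory
files_from_outside <- list.files("../data")      # list the other directory without moving

setwd(old_wd)                                    # restore the directory we started from
```

Storing the value returned by setwd() makes it easy to return to the starting point, even if the script moves through several directories.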
A.6.2 Exercise 2
Parsing dates and numbers
Look at your ID card and type your birthday as a string as it’s written on the card (including any spaces or punctuation symbols). For instance, if you were Erika Mustermann (see https://de.wikipedia.org/wiki/Personalausweis_(Deutschland)) you would write the character string “12.08.1964”.
Use an appropriate parse_ command to read this character string into R.
Now read out the date in German (i.e., “12. August 1964”) and use another command to parse this string into R.
Use Google Translate to translate this character string into French, Italian, and Spanish and use appropriate R commands to parse these strings into R.
Hint: Consult vignette("locales") for specifying languages.
Solution
parse_date("12.08.1964", "%d.%m.%Y")
#> [1] "1964-08-12"
parse_date("12. August 1964", "%d. %B %Y", locale = locale("de"))
#> [1] "1964-08-12"
parse_date("12. août 1964", "%d. %B %Y", locale = locale("fr"))
#> [1] "1964-08-12"
parse_date("12. Agosto 1964", "%d. %B %Y", locale = locale("it"))
#> [1] "1964-08-12"
parse_date("12 de agosto de 1964.", "%d de %B de %Y.", locale = locale("es"))
#> [1] "1964-08-12"
- Use a parse_ command (with an appropriate locale) to parse the following character strings into the desired data format:
"US$1,099.95" as a number;
"EUR1.099,95" as a number.
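One possible solution is sketched here with readr::parse_number(), which skips non-numeric characters before and after the number. The object names us and eu are chosen for illustration only.

```r
library(readr)

# U.S. convention: "," groups digits and "." marks decimals (readr's default locale)
us <- parse_number("US$1,099.95")

# European convention: "." groups digits and "," marks decimals
eu <- parse_number("EUR1.099,95",
                   locale = locale(decimal_mark = ",", grouping_mark = "."))

us
eu
```

Both calls should yield the same numeric value, as only the notation of the two strings differs.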
A.6.3 Exercise 3
A read-write-read cycle
- Read in the data in file http://rpository.com/ds4psy/data/data_2.dat into an R object data_2, but by using the command read_delim() rather than by using read_fwf() (as above).
Hint: The variable names should be the same as above, but inspect the file to see its delimiter.
Solution
# Path to file:
my_file <- "./data/data_2.dat" # from local directory
# my_file <- "http://rpository.com/ds4psy/data/data_2.dat" # from online source
# read_delim:
data_2 <- readr::read_delim(my_file, delim = "$",
col_names = c("initials", "age", "tel", "pwd")
)
dim(data_2) # 100 observations, 4 variables
#> [1] 100 4
tibble::glimpse(data_2)
#> Rows: 100
#> Columns: 4
#> $ initials <chr> "EU", "KI", "PP", "DH", "PQ", "NN", "NO", "WV", "CS", "XH", "…
#> $ age <dbl> 63, 71, 39, 49, 71, 42, 63, 60, 70, 20, 63, 48, 54, 31, 20, 7…
#> $ tel <chr> "0397", "6685", "8950", "5619", "0896", "2282", "8598", "9975…
#> $ pwd <chr> "aZAIGM", "IHEMCK", "baWzHb", "IdOCIm", "bYheST", "ZWpRIi", "…
sum(is.na(data_2)) # 0 NA values
#> [1] 0
- Store the data file as data_2.csv (a csv file that includes variable names) into a directory that is not your current working directory.
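A hedged sketch of the write and re-read steps. To keep it self-contained, it uses a two-row stand-in (demo, copied from the first two rows shown above) instead of the full data_2, and a temporary directory (data_out, an assumed name) as the directory that is not the working directory.

```r
library(readr)

# Stand-in for data_2 (first two rows only; the real object has 100 rows)
demo <- tibble::tibble(initials = c("EU", "KI"),
                       age      = c(63, 71),
                       tel      = c("0397", "6685"),
                       pwd      = c("aZAIGM", "IHEMCK"))

# A directory other than the working directory (here: a temporary one)
out_dir <- file.path(tempdir(), "data_out")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "data_2.csv")

write_csv(demo, out_file)   # column names are written by default

# Re-read the file; keep `tel` as text so that leading zeros survive
demo_2 <- read_csv(out_file, col_types = cols(tel = col_character()))
demo_2
```

Specifying col_character() for tel matters on re-reading: if the column type were guessed, values like "0397" could be read as numbers and lose their leading zero.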
A.6.4 Exercise 4
Reading odd data
The following data files are variants of the data at http://rpository.com/ds4psy/data/falsePosPsy_all.csv:
- http://rpository.com/ds4psy/data/ex1.dat
- http://rpository.com/ds4psy/data/ex2.dat
- http://rpository.com/ds4psy/data/ex3.dat
- http://rpository.com/ds4psy/data/ex4.dat
(See Section B.2 of Appendix B for details on the data and corresponding articles.)
Hint: Defining the file paths as R objects saves you from typing them repeatedly later:
# Relative paths to my local copies of the 4 data files:
ex1 <- "./data/falsePosPsy/ex1.dat"
ex2 <- "./data/falsePosPsy/ex2.dat"
ex3 <- "./data/falsePosPsy/ex3.dat"
ex4 <- "./data/falsePosPsy/ex4.dat"
# Online paths to all 4 data files:
ex1 <- "http://rpository.com/ds4psy/data/ex1.dat"
ex2 <- "http://rpository.com/ds4psy/data/ex2.dat"
ex3 <- "http://rpository.com/ds4psy/data/ex3.dat"
ex4 <- "http://rpository.com/ds4psy/data/ex4.dat"
- Inspect file ex1.dat and read it in two ways (by using either the generic read.csv or the appropriate variant of read_csv). How do the data read differ from each other?
Inspection
The file ex1.dat uses a comma , as variable separator and point . as a decimal mark. Thus, it uses the data format common in North America (e.g., U.S.A.).
Solution
We can use readr::read_csv or read.csv to import this file:
# a: ex1 is comma-separated:
a_1 <- utils::read.csv(ex1)
a_2 <- readr::read_csv(ex1)
## Check:
# head(a_1)
# a_2
class(a_1) # is a data frame
#> [1] "data.frame"
class(a_2) # is a tibble
#> [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
all.equal(as_tibble(a_1), a_2) # data is equal,
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
# but variable types differ (integer vs. numeric; factor vs. character)
Notes
read.csv is available without loading any extra packages (base R) and imports comma-separated data into a data frame.
read_csv belongs to the package readr, which is part of the tidyverse, and imports comma-separated data into a tibble.
The imported data frame a_1 and tibble a_2 contain the same data, but the data types of many variables differ. The tibble prints more compactly and is often easier to work with.
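The type differences can be illustrated on a tiny stand-in file. The data below are hypothetical (not ex1.dat), and the names tmp, df, and tb are chosen for illustration.

```r
# Tiny stand-in file (hypothetical data, not ex1.dat)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score,group", "1,2.5,a", "2,3.0,b"), tmp)

df <- utils::read.csv(tmp)                     # base R: data frame
tb <- suppressMessages(readr::read_csv(tmp))   # readr: tibble

sapply(df, class)   # whole numbers are read as integer
sapply(tb, class)   # all numbers are read as double; text stays character
```

Comparing column classes in this way is often more informative than all.equal(), which also reports differences in attributes that do not affect the data.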
- Inspect and import the dataset ex2.dat using appropriate command(s).
Inspection
The file ex2.dat uses a semi-colon ; as variable separator and a comma , as decimal mark. Thus, it uses the data format common in many European countries (e.g., Germany).
Solution
# b: ex2 is separated by semi-colons (csv2):
b_1 <- read.csv(ex2, sep = ";", dec = ",")
b_2 <- utils::read.csv2(ex2) # using the command variant
b_3 <- readr::read_csv2(ex2)
# Check:
class(b_1) # is a data frame
#> [1] "data.frame"
class(b_2)
#> [1] "data.frame"
all.equal(b_1, b_3)
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (1, 4) differ (string compare on first 1) >"
#> [4] "Attributes: < Component \"class\": 1 string mismatch >"
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
class(b_3) # is a tibble
#> [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
# Verify equality:
all.equal(as_tibble(b_1), b_3) # data is equal,
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
# but variable types differ (integer vs. numeric; factor vs. character)
all.equal(a_1, b_1) # data.frames are equal
#> [1] TRUE
all.equal(a_2, b_3) # tibbles are equal
#> [1] "Attributes: < Component \"spec\": Component \"delim\": 1 string mismatch >"
Notes
read.csv assumes by default that the symbol used to separate variables is a comma (,) and the symbol used as a decimal mark is a point/period/full stop (.). If different symbols (e.g., a semi-colon ; and a comma ,) are used as variable separator and decimal mark, respectively, this must be specified by providing the sep and dec arguments.
The command variant utils::read.csv2 uses these settings (sep = ";" and dec = ",") as its defaults, rendering it suitable for reading in the ex2 data.
We cannot use the readr command read_csv here, as it assumes a comma as the variable separator (like read.csv) and does not allow re-defining this delimiter. However, its variant read_csv2 is appropriate, as it assumes the symbols used here (i.e., ; between variables and , as decimal mark) as its defaults.
Warnings when loading files, strange variable types, and errors in arithmetical operations often indicate that files have been read erroneously.
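The relation between these commands can be checked on a small stand-in file in the semi-colon/comma format. The data are hypothetical (not ex2.dat), and the object names are chosen for illustration.

```r
# Tiny stand-in file in "European" format (hypothetical data, not ex2.dat)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;score", "1;2,5", "2;3,75"), tmp)

e_base_1 <- utils::read.csv(tmp, sep = ";", dec = ",")   # explicit arguments
e_base_2 <- utils::read.csv2(tmp)                        # same settings as defaults
e_readr  <- suppressMessages(readr::read_csv2(tmp))      # readr counterpart

identical(e_base_1, e_base_2)   # both base R calls yield the same data frame
e_readr$score                   # decimal commas are parsed as numbers
```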
- Inspect and import the dataset ex3.dat using appropriate command(s).
Solution
# With read.csv
c_1 <- utils::read.csv(ex3, sep = "|", dec = ",")
# With readr:
c_2 <- readr::read_delim(ex3, delim = "|", locale = locale(decimal_mark = ","))
## Check:
# head(c_1)
# c_2
class(c_1) # is a data frame
#> [1] "data.frame"
class(c_2) # is a tibble
#> [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
all.equal(as_tibble(c_1), c_2) # data is equal,
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
# but variable types differ (integer vs. numeric; factor vs. character)
all.equal(a_1, c_1) # data.frames are equal
#> [1] TRUE
all.equal(a_2, c_2) # tibbles are equal
#> [1] "Attributes: < Component \"spec\": Component \"delim\": 1 string mismatch >"
Notes
The base R command read.csv is quite flexible for reading other character-delimited files, as it accepts explicit sep and dec arguments. (However, the decimal mark must be either a comma or a point.)
By contrast, the readr commands read_csv and read_csv2 expect specific defaults. When these are not met, read_delim can be used to read a wider range of character-delimited files. Specific decimal marks and grouping marks can be defined in the locale argument.
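As a sketch of how the locale argument handles both decimal and grouping marks, consider a small pipe-delimited stand-in file. The data and the names tmp, eu_locale, and p_readr are hypothetical; the column types are set explicitly here rather than relying on type guessing.

```r
# Tiny pipe-delimited stand-in (hypothetical data, not ex3.dat)
tmp <- tempfile(fileext = ".dat")
writeLines(c("id|amount", "1|1.234,5", "2|99,9"), tmp)

# "," as decimal mark and "." as grouping mark
eu_locale <- readr::locale(decimal_mark = ",", grouping_mark = ".")

p_readr <- readr::read_delim(
  tmp, delim = "|", locale = eu_locale,
  col_types = readr::cols(id     = readr::col_integer(),
                          amount = readr::col_number())  # tolerates grouping marks
)
p_readr$amount
```

Using col_number() is a design choice for columns that may contain grouping marks; for plain decimals, col_double() with the same locale would suffice.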
- Inspect and import the dataset ex4.dat using appropriate command(s). Specifically, note the encoding of the age variable (aged365) and check whether you can compute participants’ average age (in years) after importing the data.
Inspection
The file ex4.dat uses a comma , as variable separator and point . as a decimal mark (i.e., U.S. style).
However, the columns aged365 and female in ex4.dat are encoded as character variables (as indicated by the quotation marks).
Solution
# With read.csv:
d_1 <- read.csv(ex4)
# With read_csv:
d_2 <- read_csv(ex4)
## Check:
head(d_1)
#> study ID aged aged365 female dad mom potato when64 kalimba cond root bird
#> 1 1 1 6765 18.53425 female 49 45 0 0 1 control 1 7
#> 2 1 2 7715 21.13699 1 63 62 0 1 0 64 1 3
#> 3 1 3 7630 20.90411 female 61 59 0 1 0 64 1 7
#> political quarterback olddays feelold computer diner
#> 1 3 2 13 2 4 7
#> 2 2 1 12 4 4 8
#> 3 1 2 12 2 4 6
#> [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
d_2
#> # A tibble: 78 × 19
#> study ID aged aged365 female dad mom potato when64 kalimba cond
#> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 1 1 6765 18.5 female 49 45 0 0 1 control
#> 2 1 2 7715 21.1 1 63 62 0 1 0 64
#> 3 1 3 7630 20.9 female 61 59 0 1 0 64
#> 4 1 4 7543 20.7 female 54 51 0 0 1 control
#> 5 1 5 7849 21.5 female 47 43 0 1 0 64
#> 6 1 6 7581 20.8 1 49 50 0 1 0 64
#> 7 1 7 7534 20.6 1 56 55 0 0 1 control
#> 8 1 8 6678 18.3 1 45 45 0 1 0 64
#> 9 1 9 6970 19.1 female 53 51 1 0 0 potato
#> 10 1 10 7681 21.0 female 53 51 0 1 0 64
#> # … with 68 more rows, and 8 more variables: root <dbl>, bird <dbl>,
#> # political <dbl>, quarterback <dbl>, olddays <dbl>, feelold <dbl>,
#> # computer <dbl>, diner <dbl>
class(d_1) # is a data frame
#> [1] "data.frame"
class(d_2) # is a tibble
#> [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
all.equal(as_tibble(d_1), d_2) # data is equal,
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
# but variable types differ (integer vs. numeric; factor vs. character)
all.equal(a_1, d_1) # data.frames differ in "female" variable: numeric vs. character (a factor before R 4.0.0)
#> [1] "Component \"female\": Modes: numeric, character"
#> [2] "Component \"female\": target is numeric, current is character"
all.equal(a_2, d_2) # tibbles differ in "female" variable: numeric vs. character variable
#> [1] "Attributes: < Component \"spec\": Component \"cols\": Component \"female\": Attributes: < Component \"class\": 1 string mismatch > >"
#> [2] "Component \"female\": Modes: numeric, character"
#> [3] "Component \"female\": target is numeric, current is character"
# Compute the mean age of all participants
mean(d_1$aged365) # works: 20.8 years
#> [1] 20.81279
mean(d_2$aged365) # works: 20.8 years
#> [1] 20.81279
Notes
Despite being a character variable in ex4.dat, the variable aged365 is numeric in the data read in. This illustrates that both read.csv and read_csv aim to convert data into its appropriate type. This often works well, but can also lead to unexpected or undesired results. Here, the female variable cannot be converted into a numeric column (as it contains the word “female” in ex4.dat). This leads to its interpretation as a character variable in the tibble (when using read_csv). When using read.csv, it becomes a character variable in R 4.0.0 or later, but a factor in earlier versions of R (in which stringsAsFactors = TRUE was the default).
To further control the class of variables when reading in the data, we can use the following variants of both commands:
# with read.csv:
e_1 <- read.csv(ex4, stringsAsFactors = FALSE)
as_tibble(e_1) # => female is now a character variable.
#> # A tibble: 78 × 19
#> study ID aged aged365 female dad mom potato when64 kalimba cond
#> <int> <int> <int> <dbl> <chr> <int> <int> <int> <int> <int> <chr>
#> 1 1 1 6765 18.5 female 49 45 0 0 1 control
#> 2 1 2 7715 21.1 1 63 62 0 1 0 64
#> 3 1 3 7630 20.9 female 61 59 0 1 0 64
#> 4 1 4 7543 20.7 female 54 51 0 0 1 control
#> 5 1 5 7849 21.5 female 47 43 0 1 0 64
#> 6 1 6 7581 20.8 1 49 50 0 1 0 64
#> 7 1 7 7534 20.6 1 56 55 0 0 1 control
#> 8 1 8 6678 18.3 1 45 45 0 1 0 64
#> 9 1 9 6970 19.1 female 53 51 1 0 0 potato
#> 10 1 10 7681 21.0 female 53 51 0 1 0 64
#> # … with 68 more rows, and 8 more variables: root <int>, bird <int>,
#> # political <int>, quarterback <int>, olddays <int>, feelold <int>,
#> # computer <int>, diner <int>
# with read_csv:
e_2 <- read_csv(ex4,
col_types = cols(
study = col_integer(),
ID = col_integer(),
aged = col_integer(),
aged365 = col_double(), # integer not possible
female = col_character(),
dad = col_integer(),
mom = col_integer(),
potato = col_integer(),
when64 = col_integer(),
kalimba = col_integer(),
cond = col_character(),
root = col_integer(),
bird = col_integer(),
political = col_integer(),
quarterback = col_integer(),
olddays = col_integer(),
feelold = col_integer(),
computer = col_integer(),
diner = col_integer()
))
all.equal(as_tibble(e_1), e_2) # same data; only attributes (e.g., class and column specification) differ
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
#> [6] "Attributes: < Component 2: Lengths: 78, 1 >"
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
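When most columns share one type, the column specification can be shortened with the .default argument of cols(). The following sketch uses a small stand-in file with quoted values (a few columns and rows modeled on the output above, not the real ex4.dat); the names tmp, guessed, and typed are chosen for illustration.

```r
# Stand-in file with quoted values (modeled on, but not identical to, ex4.dat)
tmp <- tempfile(fileext = ".csv")
writeLines(c('ID,aged365,female,cond',
             '1,"18.53425","female",control',
             '2,"21.13699","1",64'), tmp)

# Type guessing: quoted numbers still become numeric, mixed columns become character
guessed <- suppressMessages(readr::read_csv(tmp))

# Explicit types: default to integer and list only the exceptions
typed <- readr::read_csv(tmp,
  col_types = readr::cols(
    .default = readr::col_integer(),
    aged365  = readr::col_double(),      # integer not possible
    female   = readr::col_character(),
    cond     = readr::col_character()
  ))

sapply(guessed, class)
sapply(typed, class)
```

Listing only the exceptions keeps the specification short and makes it more obvious which columns deviate from the default.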
A.6.5 Exercise 5
Writing data
In Exercise 4 of the previous chapter on tibbles (see Section 5.4.4 of Chapter 5), we created the following summary tibble in different ways (either by directly entering it using tibble commands, or by using dplyr commands to obtain a summary table from the raw data):
knitr::kable(summary, caption = "Age-related data from Simmons et al. (2011). [See Exercise 4 of the chapter on tibbles.]")
cond | n | mean_age | youngest | oldest | fl_vyoung | fl_young | fl_neither | fl_old | fl_vold |
---|---|---|---|---|---|---|---|---|---|
64 | 25 | 21.09 | 18.30 | 38.24 | 0 | 13 | 10 | 2 | 0 |
control | 22 | 20.80 | 18.53 | 27.23 | 3 | 15 | 3 | 1 | 0 |
potato | 31 | 20.60 | 18.18 | 27.37 | 1 | 17 | 11 | 2 | 0 |
(See Section B.2 of Appendix B for details on the data and corresponding articles.)
Imagine that you are trying to send this file to a friend who — due to excessive demand for our course — was unable to secure a spot in this course and ended up in a course on the “History of data science”, whose members are encouraged to experiment with software products like MS Excel and SPSS.
- Assuming that your friend is currently located in Troy, NY (i.e., in the USA), export the summary as a file that your friend can read with her software.
Notes
Since our friend is supposed to be in the US, we use a csv file that adheres to the convention that variables are separated by commas (,) and the decimal mark is a point (.), which is the default setting of readr::write_csv.
We write out the csv file to a separate data directory and specify its path and name relative to our current working directory.
Solution
# (a) write_csv:
readr::write_csv(x = summary, file = "./data/summary_a.csv") # readr >= 1.4.0 (older versions: path = ...)
# (b) write.csv:
write.csv(x = summary, file = "./data/summary_b.csv", row.names = FALSE)
# (c) write.table:
write.table(x = summary, file = "./data/summary_c.csv", row.names = FALSE, sep = ",", dec = ".")
- Read back your file and verify that it contains the same information as your original summary.
Solution
summary_2a <- readr::read_csv(file = "./data/summary_a.csv")
all.equal(summary, summary_2a) # TRUE (except for numeric data types)
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
#> [6] "Attributes: < Component 2: Lengths: 3, 1 >"
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
summary_2b <- read.csv(file = "./data/summary_b.csv", stringsAsFactors = FALSE)
all.equal(summary, summary_2b) # TRUE (except for numeric data types)
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"
summary_2c <- read.csv(file = "./data/summary_c.csv", stringsAsFactors = FALSE)
all.equal(summary, summary_2c) # TRUE (except for numeric data types)
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"
- Now repeat both steps (i.e., writing and re-reading the summary data) under the assumption that your friend is located in Berlin, Germany.
Notes
- Using a comma-separated value (csv) file is still the best choice. However, since our friend is supposed to be in Germany, we use a csv2 file that adheres to the European convention that variables are separated by semi-colons (;) and the decimal mark is a comma (,), which is the default setting of readr::write_csv2.
Solution
To signal the different file format to our friend, we’ll use the file extension .csv2.
## Writing data in csv2 format: ----
# (a) write_csv2 (writes ";" as delimiter and "," as decimal mark):
readr::write_csv2(x = summary, file = "./data/summary_a.csv2")
# (b) write.csv2:
write.csv2(x = summary, file = "./data/summary_b.csv2", row.names = FALSE)
# (c) write.table:
write.table(x = summary, file = "./data/summary_c.csv2", row.names = FALSE, sep = ";", dec = ",")
## Reading data in csv2 format: ----
summary_3a <- readr::read_csv2(file = "./data/summary_a.csv2")
all.equal(summary, summary_3a) # TRUE (except numeric data types)
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Attributes: < Component \"class\": 3 string mismatches >"
#> [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
#> [6] "Attributes: < Component 2: Lengths: 3, 1 >"
#> [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
summary_3b <- read.csv2(file = "./data/summary_b.csv2", stringsAsFactors = FALSE)
all.equal(summary, summary_3b) # TRUE (except numeric data types)
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"
all.equal(summary_2b, summary_3b) # TRUE
#> [1] TRUE
summary_3c <- read.csv2(file = "./data/summary_c.csv2", stringsAsFactors = FALSE)
all.equal(summary, summary_3c) # TRUE (except numeric data types)
#> [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"
all.equal(summary_2c, summary_3c) # TRUE
#> [1] TRUE
Notes
As it is simple and functional, the readr function write_csv provides good and safe options for exchanging files.
For more flexibility and potential complexity, it may be advisable to use write_delim or write.table (or one of many other functions).
For the base R functions write.csv and write.table, the default of the argument row.names is TRUE. As we only want to use column names here, we specified row.names = FALSE. If we left this out, we would get an extra column that counts the rows from 1 to nrow (which normally does not hurt either).
Even though our friend can now read our data without problems, she should try to enroll in a real data science course at some point.
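The effect of the row.names argument can be seen by writing a small data frame twice and comparing the column names after re-reading. The stand-in demo contains only the first two columns of the summary table above; the file names are temporary and chosen for illustration.

```r
# Stand-in for the first two columns of `summary` (values from the table above)
demo <- data.frame(cond = c("64", "control", "potato"),
                   n    = c(25, 22, 31))

f_default <- tempfile(fileext = ".csv")
f_clean   <- tempfile(fileext = ".csv")

write.csv(demo, f_default)                    # row.names = TRUE by default
write.csv(demo, f_clean, row.names = FALSE)   # column names only

names(read.csv(f_default))   # extra first column "X" holding the row numbers
names(read.csv(f_clean))     # only the original columns
```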
A.6.6 Exercise 6
Variants of p_info
In this exercise, we re-visit the participant data on positive psychology interventions that we have analyzed before and try to parse some variants of this data. (See Section B.1 of Appendix B for details on the data.)
- Load the data at http://rpository.com/ds4psy/data/posPsy_participants.csv into an R object p_info and compute participants’ mean age by intervention, by sex, and by level of education (educ).
Solution
# Load data (from online source):
my_URL <- "http://rpository.com/ds4psy/data/posPsy_participants.csv"
p_info <- readr::read_csv(file = my_URL)
p_info # inspect tibble
#> # A tibble: 295 × 6
#> id intervention sex age educ income
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 4 2 35 5 3
#> 2 2 1 1 59 1 1
#> 3 3 4 1 51 4 3
#> 4 4 3 1 50 5 2
#> 5 5 2 2 58 5 2
#> 6 6 1 1 31 5 1
#> 7 7 3 1 44 5 2
#> 8 8 2 1 57 4 2
#> 9 9 1 1 36 4 3
#> 10 10 2 1 45 4 3
#> # … with 285 more rows
Note that all variables (columns) seem to be encoded as numeric variables.
Use dplyr pipes for creating some descriptive tables:
# Mean age by intervention:
p_info %>%
group_by(intervention) %>%
summarise(n = n(),
mn_age = mean(age))
#> # A tibble: 4 × 3
#> intervention n mn_age
#> <dbl> <int> <dbl>
#> 1 1 72 44.6
#> 2 2 76 45.4
#> 3 3 74 43.3
#> 4 4 73 41.7
# Mean age by sex:
p_info %>%
group_by(sex) %>%
summarise(n = n(),
mn_age = mean(age))
#> # A tibble: 2 × 3
#> sex n mn_age
#> <dbl> <int> <dbl>
#> 1 1 251 43.9
#> 2 2 44 43.0
# Mean age by educ:
p_info %>%
group_by(educ) %>%
summarise(n = n(),
mn_age = mean(age))
#> # A tibble: 5 × 3
#> educ n mn_age
#> <dbl> <int> <dbl>
#> 1 1 14 45.5
#> 2 2 21 42.2
#> 3 3 39 44.2
#> 4 4 104 42.7
#> 5 5 117 44.6
- Download the file p_info_2.dat (located at http://rpository.com/ds4psy/data/p_info_2.dat) into a local directory (called data) and import it from there into an R object p_info_2. (Hint: Inspect the file prior to loading it: What is different in this file?)
Solution
# Downloaded http://rpository.com/ds4psy/data/p_info_2.dat into
# a local directory ./data/:
my_file <- "./data/p_info_2.dat"
# Inspecting the file shows that "|" is used as delimiter and "," as decimal mark:
p_info_2 <- readr::read_delim(file = my_file, delim = "|", locale = locale(decimal_mark = ","))
p_info_2 # inspect tibble
#> # A tibble: 295 × 6
#> id intervention sex age educ income
#> <dbl> <chr> <chr> <dbl> <chr> <dbl>
#> 1 1 early memories (control) male 35 MSc/PhD degree 3
#> 2 2 signature strengths female 59 below year 12 1
#> 3 3 early memories (control) female 51 BSc degree 3
#> 4 4 gratitude visit female 50 MSc/PhD degree 2
#> 5 5 3 good things male 58 MSc/PhD degree 2
#> 6 6 signature strengths female 31 MSc/PhD degree 1
#> 7 7 gratitude visit female 44 MSc/PhD degree 2
#> 8 8 3 good things female 57 BSc degree 2
#> 9 9 signature strengths female 36 BSc degree 3
#> 10 10 3 good things female 45 BSc degree 3
#> # … with 285 more rows
Exploring the file shows an odd problem with its id variable:
mean(p_info_2$id) # => 147.999 [i.e., almost (295 + 1)/2]
#> [1] 147.9997
p_info_2$id # contains a value of 99.9 at position 100:
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#> [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
#> [ reached getOption("max.print") -- omitted 220 entries ]
p_info_2$id[100]
#> [1] 99.9
sum(ds4psy::is_wholenumber(p_info_2$id) == FALSE)
#> [1] 1
# Note: If we omitted the locale = locale(decimal_mark = ","):
p_info_2b <- readr::read_delim(file = my_file, delim = "|")
mean(p_info_2b$id) # => 151.0475, as the value 99,9 is now read as 999:
#> [1] 151.0475
p_info_2b$id[100] # => 999 (rather than 100)
#> [1] 999
Answers
The variables intervention, sex, and educ were encoded as numbers (of type integer or double) before, but are now encoded as character variables.
The variable id is encoded as a double, due to the value 100 being erroneously saved as 99,9 (with a comma as the decimal mark). This should be fixed prior to any further analysis.
In both data files, some variables should probably be converted into factors.
- Recompute the mean age by intervention, by sex, and by level of education (educ). Are they the same as before?
Solution
# Mean age by intervention:
p_info_2 %>%
group_by(intervention) %>%
summarise(n = n(),
mn_age = mean(age))
#> # A tibble: 4 × 3
#> intervention n mn_age
#> <chr> <int> <dbl>
#> 1 3 good things 76 45.4
#> 2 early memories (control) 73 41.7
#> 3 gratitude visit 74 43.3
#> 4 signature strengths 72 44.6
# Mean age by sex:
p_info_2 %>%
group_by(sex) %>%
summarise(n = n(),
mn_age = mean(age))
#> # A tibble: 2 × 3
#> sex n mn_age
#> <chr> <int> <dbl>
#> 1 female 251 43.9
#> 2 male 44 43.0
# Mean age by educ:
p_info_2 %>%
group_by(educ) %>%
summarise(n = n(),
mn_age = mean(age))
#> # A tibble: 5 × 3
#> educ n mn_age
#> <chr> <int> <dbl>
#> 1 below year 12 14 45.5
#> 2 BSc degree 104 42.7
#> 3 MSc/PhD degree 117 44.6
#> 4 vocational training 39 44.2
#> 5 year 12 21 42.2
Answer
The means are the same as before, but their order (in the resulting tibbles) is different, as character variables are listed in alphabetical order (unless they are defined as factors).
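The role of factors for the order of groups can be sketched as follows. The ages in demo are hypothetical; the education labels and their order (edu_levels) follow the numeric coding in p_info, as inferred from the matching group sizes in both sets of tables.

```r
library(dplyr)

# Stand-in data (hypothetical ages; education labels as in p_info_2)
demo <- tibble::tibble(
  educ = c("year 12", "below year 12", "BSc degree", "year 12"),
  age  = c(40, 50, 30, 20))

# Character grouping variable: groups are sorted alphabetically
# (the exact order can depend on the locale)
by_chr <- demo %>%
  group_by(educ) %>%
  summarise(mn_age = mean(age))

# Factor with explicit levels: groups appear in the order of the levels
edu_levels <- c("below year 12", "year 12", "vocational training",
                "BSc degree", "MSc/PhD degree")
by_fct <- demo %>%
  mutate(educ = factor(educ, levels = edu_levels)) %>%
  group_by(educ) %>%
  summarise(mn_age = mean(age))

by_chr
by_fct
```

Defining the levels explicitly restores the order of the original numeric coding, while keeping the more informative labels.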
This concludes our set of exercises on importing data.