6.2 Essential readr commands

The process of reading some data structure is known as parsing. We usually parse entire files of data. However, as data files typically consist of columns of variables, we also need to consider how individual vectors are parsed.45

6.2.1 Parsing vectors

The parse_() family of functions provided by readr take a character vector as their main input argument and return a more specialised vector (e.g., of logical, numeric, or date/time values) as output. The expected data type is specified by the part after the underscore (e.g., parse_logical() for logical data, parse_integer() for integers, etc.).
Some examples of the corresponding parse_() commands are:

# Parse logical values:
v1 <- parse_logical(c("TRUE", "FALSE", "NA", "TRUE"))
v1
#> [1]  TRUE FALSE    NA  TRUE
str(v1)
#>  logi [1:4] TRUE FALSE NA TRUE

# Parse numeric integers:
v2 <- parse_integer(c("1", "2", "3", "NA", "5"))
v2
#> [1]  1  2  3 NA  5
str(v2)
#>  int [1:5] 1 2 3 NA 5

# Parse numeric doubles:
v3 <- parse_double(c("1.1", "2.2", "3.14", "NA"))
v3
#> [1] 1.10 2.20 3.14   NA
str(v3)
#>  num [1:4] 1.1 2.2 3.14 NA

# Parse dates: 
v4 <- parse_date(c("2019-06-03", "2019-12-24", "NA", "2019-12-31"))
v4
#> [1] "2019-06-03" "2019-12-24" NA           "2019-12-31"
str(v4)
#>  Date[1:4], format: "2019-06-03" "2019-12-24" NA "2019-12-31"

A uniform na argument allows specifying which strings should be treated as missing data in all parse_() functions:

parse_integer(c("12", "34", "?", "78"), na = "?")
#> [1] 12 34 NA 78
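
The na argument also accepts a character vector, so several different strings can be marked as missing values at once (the values "?" and "-99" here are only illustrative):

```r
# Treat both "?" and "-99" as missing values:
parse_integer(c("12", "?", "-99", "78"), na = c("?", "-99"))
#> [1] 12 NA NA 78
```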

If the process of parsing fails for an element, a warning is issued and the problematic element is missing (NA) in the output vector:

v5 <- parse_integer(c("12", "34", "xy", "78"))
v5
#> [1] 12 34 NA 78
#> attr(,"problems")
#> # A tibble: 1 × 4
#>     row   col expected               actual
#>   <int> <int> <chr>                  <chr> 
#> 1     3    NA no trailing characters xy

We can use problems() on the parsed vector to see which problems occurred during parsing:

problems(v5)
#> # A tibble: 1 × 4
#>     row   col expected               actual
#>   <int> <int> <chr>                  <chr> 
#> 1     3    NA no trailing characters xy

The range of available parsers matches the range of data types that we typically deal with. Important parsers include the following:

  1. parse_logical() parses logical values (TRUE vs. FALSE);

  2. parse_integer() parses integer numbers (1, 2, 3, etc.);

  3. parse_double() is a strict numeric parser, which enforces numeric output:

parse_double(as.character(1:3))
#> [1] 1 2 3
parse_double(c("1.23", "3.14"))
#> [1] 1.23 3.14

When the data use a decimal mark other than “.” (the default, used in the U.S.), we need to specify this by providing a locale argument:

parse_double(c("1,23", "3,14"))  # issues a warning
#> [1] NA NA
#> attr(,"problems")
#> # A tibble: 2 × 4
#>     row   col expected               actual
#>   <int> <int> <chr>                  <chr> 
#> 1     1    NA no trailing characters 1,23  
#> 2     2    NA no trailing characters 3,14

parse_double(c("1,23", "3,14"), locale = locale(decimal_mark = ","))  # works
#> [1] 1.23 3.14
  4. parse_number() is a more flexible numeric parser that ignores prefixes and suffixes, but only reads the first number of each character string:

parse_number(c("1%", "is less than 2%", "and 3% is less than 4%"))
#> [1] 1 2 3

In addition, parse_number() allows specifying both a decimal_mark and a grouping_mark in its locale argument for reading country-specific number formats:

# Used in the US: 
parse_number("$1,000.99")
#> [1] 1000.99

# Used in Germany (and the EU): 
parse_number("EUR1.000,99", locale = locale(decimal_mark = ",", grouping_mark = "."))
#> [1] 1000.99

# Used in Switzerland:
parse_number("CHF1'000,99", locale = locale(decimal_mark = ",", grouping_mark = "'"))
#> [1] 1000.99
  5. parse_character() may seem trivial (as it parses a character vector into a character vector), but allows dealing with different character encodings:

parse_character(c("très difficile", "mon chérie"), 
                locale = locale(encoding = "UTF-8"))
#> [1] "très difficile" "mon chérie"
parse_character("\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd", 
                locale = locale(encoding = "Shift-JIS"))
#> [1] "こんにちは"

french <- "très difficile"
parse_character(french, locale = locale(encoding = "UTF-8"))
#> [1] "très difficile"
guess_encoding(charToRaw("très difficile"))
#> # A tibble: 4 × 2
#>   encoding   confidence
#>   <chr>           <dbl>
#> 1 UTF-8            0.8 
#> 2 ISO-8859-1       0.37
#> 3 ISO-8859-2       0.37
#> 4 ISO-8859-9       0.37

The issue of character encodings is — like spelling, typography, and punctuation — a topic that is important, yet most people seem almost aggressively allergic to it. Fortunately, there are solid standards for defining characters and encodings. If you never want to worry about encodings again, just remember that Unicode and UTF-8 are our friends (and see the Wikipedia articles on Unicode, UTF-8, and especially the List of Unicode characters for details, and read this article at Kunststube.net for further information).

We will re-visit the issue of character encodings in Chapter 9 on Strings of text (see Section 9.2.2).

  6. parse_factor() creates factors, a data structure often used to represent categorical variables with fixed and known values (e.g., the different conditions of an experiment, or the treatment levels in a study):

treatments <- c("therapy", "medication", "placebo")  # define factor levels

parse_factor(c("medication", "placebo", "therapy", "medication"), levels = treatments)
#> [1] medication placebo    therapy    medication
#> Levels: therapy medication placebo
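
If a value does not match any of the supplied levels, parse_factor() issues a warning and the offending element becomes NA (the level "surgery" here is only illustrative):

```r
treatments <- c("therapy", "medication", "placebo")

# "surgery" is not among the defined levels (issues a warning):
x <- parse_factor(c("medication", "surgery", "placebo"), levels = treatments)
x[2]         # the unmatched element is NA
problems(x)  # shows the offending value and its position
```
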
  7. parse_date(), parse_datetime(), and parse_time() allow parsing various date and time specifications:

# Current date and time:
Sys.Date()  # returns a "Date" object
#> [1] "2022-09-10"
Sys.time()  # returns a "POSIXct" calendar time object
#> [1] "2022-09-10 14:12:37 CEST"

parse_date(c("2020-02-29", "2020-12-24", "2020-12-31"))
#> [1] "2020-02-29" "2020-12-24" "2020-12-31"
parse_datetime("2020-02-29 07:38:01", locale = locale("de"))
#> [1] "2020-02-29 07:38:01 UTC"
parse_time(c("07:38:01", "11:12:13"))
#> 07:38:01
#> 11:12:13

Reading dates and times often requires specifying the details of the current date-time format (see the documentation to ?parse_datetime for details).

In R, dates are typically represented by a 4-digit year (%Y), a 2-digit month (%m), and a 2-digit day (%d), all separated by either “-” or “/”:

parse_date("2008/12/10")
#> [1] "2008-12-10"
parse_date("2008-12-10")
#> [1] "2008-12-10"

Note the ambiguity of a date string like "2018-12-10" (in which month and day could be confused) or "08/10/12" (in which even the position of the year is unclear):

# When was "2018-12-10"?
parse_date("2018-12-10", format = "%Y-%m-%d")
#> [1] "2018-12-10"
parse_date("2018-12-10", format = "%Y-%d-%m")
#> [1] "2018-10-12"

# When was "08/10/12"?
parse_date(c("08/10/12"), "%d/%m/%y")
#> [1] "2012-10-08"
parse_date(c("08/10/12"), "%m/%d/%y")
#> [1] "2012-08-10"
parse_date(c("08/10/12"), "%y/%m/%d")
#> [1] "2008-10-12"
parse_date(c("08/10/12"), "%y/%d/%m")
#> [1] "2008-12-10"

The measurement and representation of dates and times is complicated by many historical and local idiosyncrasies, which creates a corresponding need for standards. See Chapter 10 on Dates and times for additional details on handling dates and times in R.

6.2.2 Parsing files

The ease or difficulty of loading a data file depends largely on the sensibility of whoever saved it. If the originator of the file used a format that is universal and easily exchanged (e.g., a csv file in a common encoding), reading the data into R — or any other data analysis software — is easy and straightforward. Unfortunately, many people still insist on saving and exchanging data in specialized or proprietary file formats (e.g., xls or sav files). In the following, we outline some basic methods of reading different kinds of files that should cover 95% of all cases, and provide some pointers to resources for dealing with the remaining 5%.

read_csv()

A common and sensible file format for data separates variables (columns) by commas, which is why it is called a comma separated value (or csv) file. The best way to read such files within the tidyverse is by using the readr function read_csv(). We have used this command in this book before (e.g., to read in http://rpository.com/ds4psy/data/data_t1.csv above). Depending on the location of the data file and our preferred way of expressing the path to it, here are five possible ways of reading this file:

# Read a csv file: 
# A. From a different directory: -----

# (1) provide absolute/full path:
# dt <- readr::read_csv("/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data/data_t1.csv")

# (2) provide relative path of the data file:
dt <- readr::read_csv("./data/data_t1.csv")

# (3) relative to (platform dependent) home directory:
# dt <- readr::read_csv("~/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data/data_t1.csv")

# (4) using here():
# library(here)
dt <- readr::read_csv(here::here("data", "data_t1.csv"))

# B. From an online source: ----- 

# (5) provide path to an online source of the data file:
dt <- readr::read_csv("http://rpository.com/ds4psy/data/data_t1.csv")

Actually, base R (or rather the R default package utils) also provides a function read.csv() to achieve the same thing:

df <- utils::read.csv("./data/data_t1.csv")

However, note that the objects resulting from our calls of read_csv() and read.csv() (i.e., dt and df, respectively) are similar, but not identical:

# Differences: 
class(dt)  # read_csv yields a tibble 
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
class(df)  # read.csv yields a data.frame 
#> [1] "data.frame"

all.equal(dt, df)  # shows various differences:
#> [1] "Attributes: < Names: 1 string mismatch >"                                              
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"                     
#> [3] "Attributes: < Component \"class\": Lengths (4, 1) differ (string compare on first 1) >"
#> [4] "Attributes: < Component \"class\": 1 string mismatch >"                                
#> [5] "Attributes: < Component 2: target is externalptr, current is numeric >"

# => In R versions before 4.0.0, read.csv() converted 
#    characters to factors by default; to prevent this: 
df <- read.csv("./data/data_t1.csv", stringsAsFactors = FALSE)

all.equal(dt, df)  # still shows the same attribute differences: 
#> [1] "Attributes: < Names: 1 string mismatch >"                                              
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"                     
#> [3] "Attributes: < Component \"class\": Lengths (4, 1) differ (string compare on first 1) >"
#> [4] "Attributes: < Component \"class\": 1 string mismatch >"                                
#> [5] "Attributes: < Component 2: target is externalptr, current is numeric >"

Unless we really need a data frame, working with tibbles — and the read_csv() command from readr — is typically safer than using read.csv(), as it reduces the likelihood of strange things happening when reading in data files.
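
If some function does require a plain data frame, a tibble can always be converted into one (and vice versa) — a quick sketch using the dt and df objects from above:

```r
# Convert between tibbles and data frames: 
df_2 <- as.data.frame(dt)      # tibble -> data.frame
dt_2 <- tibble::as_tibble(df)  # data.frame -> tibble
class(df_2)
#> [1] "data.frame"
```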

Variants (read_csv2() and read_tsv())

In many countries (e.g., European countries like Germany), commas (,) are used as the decimal mark (e.g., “pi is close to 3,1416.”) and hence are not a good separator for variables (columns). In these locations, variables are typically separated by semi-colons (;). For such files, both readr and base R provide variants of the above commands that follow this convention. For instance, to read in the file http://rpository.com/ds4psy/data/data_t1_de.csv we can use:

readr::read_csv2("http://rpository.com/ds4psy/data/data_t1_de.csv")
#> # A tibble: 20 × 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 R.S.  female      2    NA
#>  2 M.Y.  male        2     4
#>  3 K.R.  male        4     1
#>  4 A.V.  female      3     2
#>  5 Z.Y.  female     NA    NA
#>  6 X.D.  female      4     4
#>  7 A.F.  female      5     1
#>  8 X.Y.  female      5     4
#>  9 K.X.  male        3     3
#> 10 M.O.  male        4     1
#> 11 T.V.  male        6     4
#> 12 X.P.  female      4     1
#> 13 Z.D.  male        5     4
#> 14 T.D.  male        5     2
#> 15 C.N.  female      4     3
#> 16 H.V.  female      1     4
#> 17 Q.N.  female      6     1
#> 18 Y.G.  male        4     4
#> 19 L.S.  male        4     4
#> 20 L.V.  female      4     2

Sometimes, the variables (columns) of a data file are separated by tabs (\t), which prompts the use of read_tsv(). To read in the tab-separated data file http://rpository.com/ds4psy/data/data_t1_tab.csv we use:

readr::read_tsv("http://rpository.com/ds4psy/data/data_t1_tab.csv")
#> # A tibble: 20 × 4
#>    name  gender like_1 bnt_1
#>    <chr> <chr>   <dbl> <dbl>
#>  1 R.S.  female      2    NA
#>  2 M.Y.  male        2     4
#>  3 K.R.  male        4     1
#>  4 A.V.  female      3     2
#>  5 Z.Y.  female     NA    NA
#>  6 X.D.  female      4     4
#>  7 A.F.  female      5     1
#>  8 X.Y.  female      5     4
#>  9 K.X.  male        3     3
#> 10 M.O.  male        4     1
#> 11 T.V.  male        6     4
#> 12 X.P.  female      4     1
#> 13 Z.D.  male        5     4
#> 14 T.D.  male        5     2
#> 15 C.N.  female      4     3
#> 16 H.V.  female      1     4
#> 17 Q.N.  female      6     1
#> 18 Y.G.  male        4     4
#> 19 L.S.  male        4     4
#> 20 L.V.  female      4     2

read_delim()

All read() commands encountered so far are special instances of the more general read_delim() function, which reads delimited files into a tibble and allows specifying various additional arguments.

Suppose we want to read a file like http://rpository.com/ds4psy/data/data_1.dat, which is known to include four variables:

  • the initials of participants;
  • their age;
  • their telephone number;
  • their pwd to some account.

Inspecting the file shows the following:

  1. The data is delimited, but by a full stop (a.k.a. point or period, i.e., .) rather than by a comma or semi-colon. As a consequence, we need to use read_delim() and explicitly specify the delimiting symbol as ".".

  2. As the file does not include variable names (in the first row), we provide them with the col_names argument (which accepts a vector of characters).

  3. Some lines are shorter than most others. Close inspection reveals that they include either -99 or -77. As these symbols are frequently used to express missing values, we supply them to the na argument (again as a vector of characters).

# Path to file:
my_file <- "./data/data_1.dat"                            # from local directory
my_file <- "http://rpository.com/ds4psy/data/data_1.dat"  # from online source

# read_delim: 
data_1 <- readr::read_delim(my_file, delim = ".", 
                            col_names = c("initials", "age", "tel", "pwd"), 
                            na = c("-77", "-99"))

dim(data_1)         # 100 observations/rows, 4 variables/columns
#> [1] 100   4
tibble::glimpse(data_1)
#> Rows: 100
#> Columns: 4
#> $ initials <chr> "ES", "OE", "YY", "DV", "MO", "OG", "TE", "KW", "ZF", "LB", "…
#> $ age      <dbl> 78, 63, 60, 86, 44, 75, 34, 32, 32, 35, 87, 55, 76, 65, 86, 2…
#> $ tel      <dbl> 84487016, 80047160, 64716120, 32620018, 93001588, 88705369, 1…
#> $ pwd      <chr> "mVZwyO", "SnjYqW", "xDNOul", "fWreff", "mWSZDV", "YStupT", "…
sum(is.na(data_1))  # 15 NA values
#> [1] 15

read_fwf()

When the variables in a data file are not delimited by a special character, we can sometimes still read the data if we know the positions of the variables (or their values) in the file. This assumes that the positions of all variable values (i.e., their start and end positions within a row) are constant (or “fixed”) across the rows of the data file. If this is the case, we can use the command read_fwf() (with “fwf” standing for fixed-width files) and specify the positions of the variable values. For instance, the file at http://rpository.com/ds4psy/data/data_2.dat contains the same variables as the previous file (data_1), but does not include missing values, and can therefore be read in the following way:

# Path to file:
my_file_path <- "./data/data_2.dat"                            # from local directory
my_file_path <- "http://rpository.com/ds4psy/data/data_2.dat"  # from online source

# read_fwf: 
data_2 <- readr::read_fwf(my_file_path, 
                          readr::fwf_cols(initials = c(1, 2), 
                                          age = c(4, 5), 
                                          tel = c(7, 10), 
                                          pwd = c(12, 17)))

dim(data_2)         # 100 x 4 (as before)
#> [1] 100   4
tibble::glimpse(data_2)
#> Rows: 100
#> Columns: 4
#> $ initials <chr> "EU", "KI", "PP", "DH", "PQ", "NN", "NO", "WV", "CS", "XH", "…
#> $ age      <dbl> 63, 71, 39, 49, 71, 42, 63, 60, 70, 20, 63, 48, 54, 31, 20, 7…
#> $ tel      <chr> "0397", "6685", "8950", "5619", "0896", "2282", "8598", "9975…
#> $ pwd      <chr> "aZAIGM", "IHEMCK", "baWzHb", "IdOCIm", "bYheST", "ZWpRIi", "…
sum(is.na(data_2))  # but no NA values!
#> [1] 0
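
Besides fwf_cols(), readr offers further helpers for specifying column positions: fwf_widths() takes a vector of column widths, and fwf_empty() tries to guess the positions from whitespace columns in the file. A sketch using fwf_empty() (assuming the same file and variable names as above):

```r
# Let readr guess the column positions from empty (whitespace) columns:
data_2b <- readr::read_fwf(my_file_path, 
                           readr::fwf_empty(my_file_path, 
                                            col_names = c("initials", "age", "tel", "pwd")))
```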

6.2.3 Writing files

To write a data file to a local directory, use one of the following write_() functions provided by readr (ordered from more specific to more general):

readr::write_csv(data_1, file = "./data/data_1a.csv")
readr::write_tsv(data_1, file = "./data/data_1a.tsv")
readr::write_delim(data_1, file = "./data/data_1a.txt", delim = " ", na = "NA", col_names = TRUE)

Alternatively, base R (or the utils package) provides the following options (note that readr versions before 1.4.0 called the file argument path):

utils::write.csv(data_1, file = "./data/data_1b.csv")
utils::write.csv2(data_1, file = "./data/data_1b.csv2")
utils::write.table(data_1, file = "./data/data_1b.txt", sep = "<>", na = "-777", col.names = TRUE) 

R also provides save() and load() functions for storing an external representation of R objects to an (.RData) file. Although this is a convenient way of saving some current state within R, such files can only be opened from an R environment. Hence, to make your future life (and the lives of your colleagues) easier, we recommend sticking to write_csv() (i.e., csv files), unless there are good reasons for using another file format.
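
For completeness, a minimal sketch of this workflow (an .RData file can store several objects, which load() later restores under their original names):

```r
save(data_1, data_2, file = "./data/my_data.RData")  # store both objects
# ... later (or in another R session):
load("./data/my_data.RData")                         # restores data_1 and data_2
```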

6.2.4 Other data formats and programs

Sometimes we have to import files stored in the proprietary format of other software packages. Fortunately, there exist dedicated packages for most cases.

For rectangular data

  • The readxl package allows reading data from MS Excel files (both .xls and .xlsx).

  • The haven package allows reading data from SPSS, Stata, and SAS files.

  • The DBI package provides an interface to obtain R data frames from relational databases, but requires a specific database backend (e.g., RMySQL, RSQLite, RPostgreSQL).
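
As a minimal sketch of the DBI workflow (assuming that the RSQLite backend is installed; the table and query here are purely illustrative):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # connect to an in-memory database
dbWriteTable(con, "mtcars", mtcars)              # copy a data frame into a table
dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")  # returns a data frame
dbDisconnect(con)
```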

For hierarchical data

  • Use jsonlite for json, and xml2 for XML files.
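
For instance, jsonlite::fromJSON() maps JSON objects and arrays onto R lists and vectors (a minimal sketch; the JSON string here is made up):

```r
jsonlite::fromJSON('{"name": "A.F.", "scores": [5, 1]}')
#> $name
#> [1] "A.F."
#> 
#> $scores
#> [1] 5 1
```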

For web log data

  • The readr command read_log() imports Apache style log files. (For additional functionality, check out the webreadr package, which builds upon read_log().)

For additional information on importing other types of data, see the official guide on R Data Import/Export.

6.2.5 Conclusion

The more experience you accumulate with reading and sharing data, the more relevant the following insight becomes: Avoid using exotic or proprietary file formats — in R, but also in other software programs. Saving some data as an RData, sav, or xls file makes you (and anyone else who may ever use your data) dependent on particular software products. Whenever saving something in a plain text format (e.g., a csv file with UTF-8 encoding) is an option, it is more flexible and an easy way of reducing future frustration.

References

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz

  45. We only cover the most common parse commands here (see Section 11.3 Parsing a vector of r4ds (Wickham & Grolemund, 2017) for details).↩︎