2  Reading in data with readr (et al.)

In general, importing data with readr (Hester et al. 2018) is pretty hassle-free: the hardest part is calling the right function. readr usually takes care of the rest itself, parsing columns properly, etc. Sometimes, however, you need to specify additional arguments.

The following unordered list shows the most common read_*() functions. Usually, you can simply provide them a file path and they load the data and return a Tibble. If your data is in a compressed file with the extension .gz, .bz2, .xz, or .zip, readr will automatically decompress it. If the file is stored online, you can provide a URL starting with http://, https://, ftp://, or ftps://; readr will automatically take care of the download process.

  • read_csv() reads comma-separated files
  • read_csv2() reads semicolon-separated files, which are common in countries where the comma serves as decimal mark
  • read_tsv() reads tab-separated files
  • read_delim() reads files with any delimiter, which you specify via delim =
  • read_fwf() reads fixed-width files
  • read_table() reads files whose columns are separated by whitespace

2.1 …and their additional arguments

All these functions also share certain arguments, which you simply include in the call. In the following, I will enumerate the most useful ones; a combined example follows the list.

  • If your file does not have a header (usually the column names), provide col_names = FALSE. The resulting Tibble will have X1 … Xn as column names.
  • If your file does not have column names, but you want the resulting Tibble to have some, you can specify them with col_names = c("a", "b", "c"). It takes a character vector.
  • If there are lines at the top of the file you do not want to read, use skip =. For instance, read_csv("file.csv", skip = 6) skips the first six lines entirely; the next line is then treated as the header (unless you also set col_names = FALSE).
  • Sometimes the original creator of your data set may have coded missing values differently than you would. na = can be used to specify which values shall be considered missing. If your missings are coded as 99 and 999, for instance, you can handle that during the read-in process already by using read_csv("file.csv", na = c("99", "999")). Please note that it takes a character vector as argument.
  • In some data sets, the first rows consist of comments that start with particular signs or special characters. Using comment = allows you to skip these lines. For instance, read_csv("file.csv", comment = "#") drops all rows that begin with a hash.
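
To see how these arguments play together, here is a minimal sketch; the file name, column names, and missing-value codes are made up for illustration:

raw_data <- read_csv("file.csv",
                     col_names = c("id", "age", "income"), # the file has no header row
                     skip = 3,                             # drop the first three lines
                     na = c("99", "999"),                  # treat 99 and 999 as missing
                     comment = "#")                        # ignore lines starting with a hash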

2.1.1 Column types

As you have already learned in the script before, a Tibble consists of multiple vectors of the same length; the vectors can be of different types. When you read in data using readr, it prints out the column types it has guessed. To keep the data around, assign the result to an object in your environment. The following code reads in a .csv file with data on the 100 most-played songs on Spotify in 2018 and stores it in the object spotify_top100_2018.

library(tidyverse)
# library(readr) --> no need to load readr, it's part of the core tidyverse
spotify_top100_2018 <- read_csv("https://www.dropbox.com/s/z8d6irpjdohdktf/spotify2018.csv?dl=1")

If your data is well-behaved, R will guess the vector types correctly and everything will run smoothly. However, sooner or later you will stumble across a data set which is not well-behaved. This is where knowing how to fine-tune your parsing process up-front will eventually save you a lot of head scratching.

But what does parsing actually look like? Well, readr’s parsing functions take a character vector and return a more specialized vector.

parse_double(c("1", "2", "3"))
[1] 1 2 3

So far so good. What readr does when it reads in your data set is take the first 1,000 values of every column and try to guess the correct data type. This can be emulated using guess_parser() and parse_guess(). Both functions take a character vector as input. The former returns the guessed type, the latter returns a vector parsed to the guessed type.

guess_parser("2009-04-23")
[1] "date"
str(parse_guess("2009-04-23"))
 Date[1:1], format: "2009-04-23"

The heuristic it uses is fairly simple yet robust. However, there are common cases in which you might run into problems. In the following, I will show you the two most common ones: the first concerns numeric data, the second dates and times.

2.1.1.1 Numbers

Parsing numbers should be straightforward, right? So what could possibly go wrong?

Well…

  • Decimal marks (points vs. commas)
  • Special characters ($, %, §, €)
  • So-called grouping characters such as 1,000,000 (USA) or 1.000.000 (Germany) or 1’000’000 (Switzerland)

The problem with decimal points (and commas) can be addressed by specifying a locale. Compare:

parse_double("1,3")
Warning: 1 parsing failure.
row col               expected actual
  1  -- no trailing characters    1,3
[1] NA
attr(,"problems")
# A tibble: 1 × 4
    row   col expected               actual
  <int> <int> <chr>                  <chr> 
1     1    NA no trailing characters 1,3   
parse_double("1,3", locale = locale(decimal_mark = ","))
[1] 1.3

The special-character problem can be addressed using parse_number() instead of parse_double(): it ignores non-numeric characters before and after the number.

parse_number("1.5€")
[1] 1.5

The final problem, grouping marks, can be addressed using another locale setting:

parse_number("1.300.000", locale = locale(grouping_mark = "."))
[1] 1300000

2.1.1.2 Date and time

Date vectors in R are numeric vectors indicating how many days have passed since 1970-01-01. Date-time vectors indicate the seconds that have passed since 1970-01-01 00:00:00 UTC. Time vectors indicate the number of seconds that have passed since midnight.
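
You can make this underlying representation visible by converting such vectors to plain numbers:

as.numeric(as.Date("1970-01-02")) # one day after 1970-01-01
[1] 1
as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC")) # 60 seconds after the epoch
[1] 60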

The parse_*() functions expect the vectors to be in a certain format:

  • parse_datetime() expects the input to follow the ISO 8601 standard. The time components must be ordered from biggest to smallest: year, month, day, hour, minute, second.
parse_datetime("2000-02-29T2000")
[1] "2000-02-29 20:00:00 UTC"
  • parse_date() wants a four-digit year, a two-digit month, and a two-digit day. They can be separated by either “-” or “/”.
parse_date("2000-02-29")
[1] "2000-02-29"
parse_date("2000/02/29")
[1] "2000-02-29"

Do you wonder why I chose 2000-02-29? It’s R’s birthday…

  • parse_time() needs at least hours and minutes; seconds are optional. They need to be separated by colons. There is no proper built-in class for time data in base R, hence I will use the hms package here.
library(hms)
parse_time("20:15:00")
20:15:00
parse_time("20:15") # both works
20:15:00

When it comes to dates, you can also build your own format. Just mash together the following pieces (a short example follows the list):

  • Year: %Y – year in 4 digits; %y – year in two digits following this rule: 00–69 = 2000–2069, 70–99 = 1970–1999
  • Month: %m – two digits; %b – abbreviated name (e.g., “Nov”); %B – full name (e.g., “November”)
  • Day: %d – two digits
  • Time: %H – hour, 0–23; %I – hour, 1–12, must be used together with %p – a.m./p.m. indicator; %M – minutes; %S – integer seconds; %Z – time zone name, e.g., America/Chicago
  • Non-digits: %. skips one non-digit character; %* skips any number of non-digits
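
For instance, two hand-built formats (the second one relies on the default English month names):

parse_date("02/29/00", "%m/%d/%y")
[1] "2000-02-29"
parse_date("29 February 2000", "%d %B %Y")
[1] "2000-02-29"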

You can probably see how this might cause problems. You might, for example, have something like this:

example_date <- "29. Februar 2000"

So how can you parse this date with a German month name? Again, you can use locale =.

date_names_langs() # what could be the proper abbreviation?
  [1] "af"  "agq" "ak"  "am"  "ar"  "as"  "asa" "az"  "bas" "be"  "bem" "bez"
 [13] "bg"  "bm"  "bn"  "bo"  "br"  "brx" "bs"  "ca"  "cgg" "chr" "cs"  "cy" 
 [25] "da"  "dav" "de"  "dje" "dsb" "dua" "dyo" "dz"  "ebu" "ee"  "el"  "en" 
 [37] "eo"  "es"  "et"  "eu"  "ewo" "fa"  "ff"  "fi"  "fil" "fo"  "fr"  "fur"
 [49] "fy"  "ga"  "gd"  "gl"  "gsw" "gu"  "guz" "gv"  "ha"  "haw" "he"  "hi" 
 [61] "hr"  "hsb" "hu"  "hy"  "id"  "ig"  "ii"  "is"  "it"  "ja"  "jgo" "jmc"
 [73] "ka"  "kab" "kam" "kde" "kea" "khq" "ki"  "kk"  "kkj" "kl"  "kln" "km" 
 [85] "kn"  "ko"  "kok" "ks"  "ksb" "ksf" "ksh" "kw"  "ky"  "lag" "lb"  "lg" 
 [97] "lkt" "ln"  "lo"  "lt"  "lu"  "luo" "luy" "lv"  "mas" "mer" "mfe" "mg" 
[109] "mgh" "mgo" "mk"  "ml"  "mn"  "mr"  "ms"  "mt"  "mua" "my"  "naq" "nb" 
[121] "nd"  "ne"  "nl"  "nmg" "nn"  "nnh" "nus" "nyn" "om"  "or"  "os"  "pa" 
[133] "pl"  "ps"  "pt"  "qu"  "rm"  "rn"  "ro"  "rof" "ru"  "rw"  "rwk" "sah"
[145] "saq" "sbp" "se"  "seh" "ses" "sg"  "shi" "si"  "sk"  "sl"  "smn" "sn" 
[157] "so"  "sq"  "sr"  "sv"  "sw"  "ta"  "te"  "teo" "th"  "ti"  "to"  "tr" 
[169] "twq" "tzm" "ug"  "uk"  "ur"  "uz"  "vai" "vi"  "vun" "wae" "xog" "yav"
[181] "yi"  "yo"  "zgh" "zh"  "zu" 
parse_date(example_date, format = "%d%. %B %Y", locale = locale(date_names = "de"))
[1] "2000-02-29"

Now you know how to parse number and date vectors yourself. This is nice, but normally you do not want to read in data as character vectors and parse them to the right format afterwards. You want to read in a data set and get a Tibble whose columns have already been parsed to the right type.

2.1.1.3 Parsing entire files

As mentioned earlier, the read_*() functions take the first 1000 rows and guess each column’s format from them. I emulated this using the guess_parser() function.

If readr finds values in a column that do not match the column type it has guessed in the first place, or if it entirely fails to parse a column (e.g., because it only consists of NAs), it returns parsing failures. They can be obtained using problems().

challenge <- read_csv(readr_example("challenge.csv"))
Rows: 2000 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): x
date (1): y

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(challenge)
# A tibble: 6 × 2
      x y     
  <dbl> <date>
1   404 NA    
2  4172 NA    
3  3004 NA    
4   787 NA    
5    37 NA    
6  2332 NA    
problems(challenge)
# A tibble: 0 × 5
# … with 5 variables: row <int>, col <int>, expected <chr>, actual <chr>,
#   file <chr>

Looking at the output above, something is worth noting: problems() comes back empty, and y was guessed correctly as a date. Recent versions of readr (2.0 and later, built on vroom) guess column types from rows spread throughout the file, not just the first 1000. With the first edition of readr, y (whose first 1000 values are all NA) was guessed as logical, and the dates further down the file produced parsing failures. In either case, you can specify the column types yourself using col_types =:

challenge_w_date <- read_csv("/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv",
                             col_types = cols(
                               x = col_number(),
                               y = col_date()
                             ))

In general, every parse_* function has its col_* counterpart.

If you want to read in data and change the column specifications, there is a little shortcut:

First, read in your data set:

challenge <- read_csv(readr_example("challenge.csv"))
Rows: 2000 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): x
date (1): y

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(challenge)
cols(
  x = col_double(),
  y = col_date(format = "")
)

Second, copy the column specification from the spec() output to your clipboard:

#This part: 
#cols(
#  x = col_double(),
#  y = col_date(format = "")
#)

Third, provide it to your read_csv() call as the col_types = argument (by simply pasting it):

#challenge <- read_csv(readr_example("challenge.csv"),
#                      col_types = cols(
#                        x = col_double(),
#                        y = col_date(format = "")
#                      ))

Fourth, modify the arguments where the guessed type does not suit you:

#challenge_w_date <- read_csv(readr_example("challenge.csv"),
#                             col_types = cols(
#  modified -->                 x = col_number(),
#                               y = col_date()
#                             ))

Fifth, read it in:

challenge_w_date <- read_csv(readr_example("challenge.csv"),
                      col_types = cols(
                        x = col_number(),
                        y = col_date()
                      ))

2.2 .rds and .RData files

.rds files can be used to store single R-specific objects (such as lists); .RData files can store multiple R-specific objects. The former can be read in using read_rds("file.rds"), the latter with load("file.RData"), which restores the objects under their original names. More on read_rds() here and on .RData here.

2.3 Write data

2.3.1 write_csv()

Writing data is fairly straightforward. Most of the time, you will work with plain Tibbles whose columns are ordinary vectors (i.e., no list columns). If you want to store them, I recommend simply using write_csv(tibble, file = "file.csv"). If you plan on working on the .csv file in Excel, use write_excel_csv(tibble, file = "file.csv").
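
For instance, to store the Spotify data from above in your working directory:

write_csv(spotify_top100_2018, file = "spotify_top100_2018.csv")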

2.3.2 write_rds()

Sometimes, however, it might be impossible to create a .csv file from your data, e.g., if you want to store a list. This is what write_rds(r_specific_object, file = "file.rds") is for.
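
A minimal sketch with a made-up list:

example_list <- list(numbers = 1:3, letters = c("a", "b", "c"))
write_rds(example_list, file = "example_list.rds")
example_list_restored <- read_rds("example_list.rds") # identical to example_list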

2.3.3 save()

Akin to .rds files are .RData files. They can contain multiple objects and be written using save(r_specific_object_1, r_specific_object_2, r_specific_object_n, file = "file.RData"). You can save your entire work space as well by calling save.image(file = "file.RData").
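
A minimal round trip (object and file names are arbitrary):

a <- 1:3
b <- letters
save(a, b, file = "ab.RData")
rm(a, b)
load("ab.RData") # a and b are restored under their original names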

2.4 Alternative ways to read in and write data

Other packages exist for different data types as well. I will briefly explain the ones that might be of particular use for you, together with their main functions.

2.4.1 haven

You can use haven (Wickham and Miller 2020) for reading and writing SAS (suffixes .sas7bdat, .sas7bcat, and .xpt), SPSS (suffixes .sav and .por), and Stata (suffix .dta) files.

The functions then are:

  • read_sas("file.sas7bdate" and write_sas(tibble, "file.sas7bdat") for both .sas7bdat and .sas7bcat files. read_xpt("file.xpt") reads .xpt files
  • read_sav("file.sav") and read_por("file.por") for .sav and .por files. write_sav(tibble, "file.sav" writes a the Tibble tibble to the file file.sav
  • read_dta("file.dta") and write_dta(tibble, "file.dta") read and write .dta files

The additional arguments can be found in the vignette.
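
One thing worth knowing: haven reads SPSS and Stata value labels into so-called labelled vectors, which you can turn into regular factors with as_factor(). A sketch, assuming a hypothetical file.sav:

library(haven)
spss_data <- read_sav("file.sav")
spss_data_factors <- as_factor(spss_data) # converts all labelled columns to factors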

2.4.2 readxl

readxl (Wickham et al. 2019) can be used to read Excel files. read_excel("file.xls") works for both .xls and .xlsx files; it guesses the format from the suffix. Excel files often consist of multiple sheets. excel_sheets("file.xlsx") returns the names of the individual sheets. When dealing with an Excel file that contains multiple sheets, you need to specify the sheet you are after in the read_excel() call: read_excel("file.xlsx", sheet = "sheet_1"). Please note that it only takes one sheet at a time.
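
Since read_excel() only returns one sheet at a time, you can map over the sheet names to read all sheets into a named list. A sketch, assuming a hypothetical file.xlsx (map() and set_names() come from purrr, part of the core tidyverse):

library(readxl)
sheet_names <- excel_sheets("file.xlsx")
all_sheets <- map(set_names(sheet_names),
                  ~ read_excel("file.xlsx", sheet = .x)) # one Tibble per sheet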

More on the readxl package can be found here.

2.4.3 vroom

vroom (Hester et al. 2020) has been introduced recently. It claims to be able to read in delimited files at up to 1.4 GB/s. Regarding its arguments, vroom works in the same way as the read_*() functions from the readr package. I would recommend switching to vroom as soon as your data set’s size exceeds ~100 MB.
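
A sketch, assuming a hypothetical big_file.csv; the arguments you know from readr carry over:

library(vroom)
big_data <- vroom("big_file.csv", na = c("99", "999")) # the delimiter is guessed automatically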

More on vroom here and here.

2.5 Further readings