library(tidyverse)
# library(readr) --> no need to load readr, it's part of the core tidyverse
<- read_csv("https://www.dropbox.com/s/z8d6irpjdohdktf/spotify2018.csv?dl=1") spotify_top100_2018
2 Reading in data with readr
(et al.)
In general, importing data with readr
(Hester et al. 2018) is pretty hassle-free: the hardest thing about it is calling the right function. It usually takes care of the rest itself, parsing columns properly, etc. However, sometimes you need to specify additional arguments.
The following unordered list shows the most common read_*()
functions. Usually, you can simply provide them a file path and they load in the data and return a Tibble. If your data is in a compressed file with the extension .gz
, .bz2
, .xz
, or .zip
, readr
will automatically uncompress it. If the file is stored online, you can provide a URL starting with http://
, https://
, ftp://
, or ftps://
. readr
will automatically take care of the download process.
read_csv("file.csv")
reads comma delimited files
read_csv2("file.csv")
reads semi-colon delimited files and treats commas as decimal separator
read_delim("file.txt", delim = "|")
reads files which are delimited by whatever delimiter you specify (|
in this case)
read_fwf("file.fwf", col_positions = c(1, 3, 5))
reads fixed width files. Here, some sort of data on the columns must be provided, e.g., their positions in the file
- If the values are separated by white space, you can also use
read_tsv("file.tsv")
orread_table("file.tsv")
2.1 …and their additional arguments
Also, all these functions share certain arguments which just need to be included in the call. In the following, I will enumerate the most useful ones.
- If your file does not have a header (most of the time, column names), provide
col_names = FALSE
. The resulting Tibble will haveX1 … Xn
as column names
- If your file does not have column names, but you want the resulting Tibble to have some, you can specify them with
col_names = c("a", "b", "c")
. It takes a character vector.
- If there are rows you do not want to be considered, you can use
skip =
. For instance,read_csv("file.csv", skip = 6)
reads everything but the first six data rows (the very first row is not taken into consideration as well)
- Sometimes the original creator of your data set might go across missing values differently than you would want it to.
na =
can be used to specify which values shall be considered missing. If your missings are coded as 99 and 999, for instance, you can address that in the read-in process already by usingread_csv("file.csv", na = c("99", "999"))
. Please note that it takes a character vector as argument
- In some data sets, the first rows consists of comments that start with particular signs or special characters. Using
comment =
allows you to skip these lines. For instance,read_csv("file.csv", comment = "#")
drops all the rows that begin with a hash.
2.1.1 Column types
As you have already learned in the script before, a Tibble consists of multiple vectors of the same length. The vectors can be of different types. When you read in data using readr
, it will print out the column types it has guessed. When you read in data, you must ascribe it to an object in your environment. The following code reads in a .csv
file with data on the 100 most-played songs on Spotify in 2018 and stores it in the object spotify_top100_2018
.
If your data is well-behaved, R will guess the vector types correctly and everything will run smoothly. However, sooner or later you will stumble across a data set which is not well-behaved. This is where knowing how to fine-tune your parsing process up-front will eventually save you a lot of head scratching.
But how does parsing actually look like. Well, readr
’s parsing functions take a character vector and return a more specialized vector.
parse_double(c("1", "2", "3"))
[1] 1 2 3
So far so good. What readr
does when it reads in your data sets is that it takes the first 1,000 values of every column and tries to guess the correct data type. This can be emulated using guess_parser()
and parse_guess()
. Both functions take a character vector as input. The former one returns the guessed type, the latter returns a vector which is parsed to the type it has guessed.
guess_parser("2009-04-23")
[1] "date"
str(parse_guess("2009-04-23"))
Date[1:1], format: "2009-04-23"
The heuristic it uses is fairly simple yet robust. However, there are common cases when you might run into problems with different data types. In the following, I will show you the two most common ones. The first one regards numeric data, the second one data on date and time.
2.1.1.1 Numbers
Parsing numbers should be straight-forward, right, so what could possibly go wrong?
Well…
- Decimal points
- Special characters ($, %, §, €)
- So-called grouping characters such as 1,000,000 (USA) or 1.000.000 (Germany) or 1’000’000 (Switzerland)
The problem with decimal points (– and commas) can be addresses by specifying a locale
. Compare:
parse_double("1,3")
Warning: 1 parsing failure.
row col expected actual
1 -- no trailing characters 1,3
[1] NA
attr(,"problems")
# A tibble: 1 × 4
row col expected actual
<int> <int> <chr> <chr>
1 1 NA no trailing characters 1,3
parse_double("1,3", locale = locale(decimal_mark = ","))
[1] 1.3
The special character problem can be addressed using parse_number
instead of parse_double
: it will ignore the special characters.
parse_number("1.5€")
[1] 1.5
The final problem, grouping characters, can be addressed using another locale
.
parse_number("1.300.000", locale = locale(grouping_mark = "."))
[1] 1300000
2.1.1.2 Date and time
Date vectors in R are numeric vectors indicating how many days have passed since 1970. Date-Time vectors indicate the seconds that have passed since 1970-01-01 00:00:00. Time vectors indicate the number of seconds that have passed since midnight.
The parse_*()
functions expect the vectors to be in a certain format:
parse_datetime()
expects the input to follow the ISO8601 standard. The times components must be ordered from biggest to smallest: year, month, day, hour, minute, second.
parse_datetime("2000-02-29T2000")
[1] "2000-02-29 20:00:00 UTC"
parse_date()
wants a four digit year, two digit month, and two digit day. They can be separated by either “-” or “/”.
parse_date("2000-02-29")
[1] "2000-02-29"
parse_date("2000/02/29")
[1] "2000-02-29"
Do you wonder why I chose 2000-02-29? It’s R’s birthday…
parse_time()
needs at least hours and minutes, seconds are optional. They need to be separated by colons. There is no proper built-in class for time data in Base R. Hence, I will use thehms
package here.
library(hms)
parse_time("20:15:00")
20:15:00
parse_time("20:15") # both works
20:15:00
When it comes to dates, you can also build your own format. Just mash together the following pieces:
- Year:
%Y
– year in 4 digits;%y
– year in two digits following this rule: 00–69 = 2000–2069, 70–99 = 1970–1999
- Month:
%m
– two digits;%b
– abbreviated name (e.g., “Nov”);%B
– full name (e.g., “November”)
- Day:
%d
– two digits
- Time:
%H
– hour, 0–23;%h
– hour, 1–12, must come together with%p
– a.m./p.m. indicator;%M
– minutes;%S
– integer seconds;%Z
time zone –America/Chicago
for instance
- Non-digits:
%.
skips one non-digit character;%*
skips any number of non-digits
You might see that there can emerge problems with this. You might, for example, have something like this:
<- "29. Februar 2000" example_date
So how can you parse this date with a German month name? Again, you can use locale =
.
date_names_langs() # what could be the proper abbreviation?
[1] "af" "agq" "ak" "am" "ar" "as" "asa" "az" "bas" "be" "bem" "bez"
[13] "bg" "bm" "bn" "bo" "br" "brx" "bs" "ca" "cgg" "chr" "cs" "cy"
[25] "da" "dav" "de" "dje" "dsb" "dua" "dyo" "dz" "ebu" "ee" "el" "en"
[37] "eo" "es" "et" "eu" "ewo" "fa" "ff" "fi" "fil" "fo" "fr" "fur"
[49] "fy" "ga" "gd" "gl" "gsw" "gu" "guz" "gv" "ha" "haw" "he" "hi"
[61] "hr" "hsb" "hu" "hy" "id" "ig" "ii" "is" "it" "ja" "jgo" "jmc"
[73] "ka" "kab" "kam" "kde" "kea" "khq" "ki" "kk" "kkj" "kl" "kln" "km"
[85] "kn" "ko" "kok" "ks" "ksb" "ksf" "ksh" "kw" "ky" "lag" "lb" "lg"
[97] "lkt" "ln" "lo" "lt" "lu" "luo" "luy" "lv" "mas" "mer" "mfe" "mg"
[109] "mgh" "mgo" "mk" "ml" "mn" "mr" "ms" "mt" "mua" "my" "naq" "nb"
[121] "nd" "ne" "nl" "nmg" "nn" "nnh" "nus" "nyn" "om" "or" "os" "pa"
[133] "pl" "ps" "pt" "qu" "rm" "rn" "ro" "rof" "ru" "rw" "rwk" "sah"
[145] "saq" "sbp" "se" "seh" "ses" "sg" "shi" "si" "sk" "sl" "smn" "sn"
[157] "so" "sq" "sr" "sv" "sw" "ta" "te" "teo" "th" "ti" "to" "tr"
[169] "twq" "tzm" "ug" "uk" "ur" "uz" "vai" "vi" "vun" "wae" "xog" "yav"
[181] "yi" "yo" "zgh" "zh" "zu"
parse_date(example_date, format = "%d%. %B %Y", locale = locale(date_names = "de"))
[1] "2000-02-29"
Now you know how to parse number and date vectors yourself. This is nice, but normally you do not want to read in data, put it into character vectors and then parse it to the right data format. You want to read in a data set and get a Tibble whose columns consist of data which have been parsed to the right type already.
2.1.1.3 Parsing entire files
As mentioned earlier, the read_*
functions take the first 1000 rows and then guess the columns format. I emulated this using the guess_parser()
function.
If readr
finds values in a column that do not match the type of the column which it has guessed in first place, or entirely fails to parse a column (e.g., because it only consists of NAs
), it returns parsing failures. They can be obtained using problems()
.
<- read_csv(readr_example("challenge.csv")) challenge
Rows: 2000 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): x
date (1): y
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(challenge)
# A tibble: 6 × 2
x y
<dbl> <date>
1 404 NA
2 4172 NA
3 3004 NA
4 787 NA
5 37 NA
6 2332 NA
problems(challenge)
# A tibble: 0 × 5
# … with 5 variables: row <int>, col <int>, expected <chr>, actual <chr>,
# file <chr>
When looking at the parsing failures here, what catches the eye is that the first 1000 values of challenge$y
seem to be NA. Because readr
only takes the first 1000 rows into account, it parses challenge$y
as logical
. However, it should be considered a date
column. You can specify this using col_types =
:
<- read_csv("/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv",
challenge_w_date col_types = cols(
x = col_number(),
y = col_date()
))
In general, every parse_*
function has its col_*
counterpart.
If you want to read in data and change the column specifications, there is a little shortcut:
First, read in your data set:
<- read_csv(readr_example("challenge.csv")) challenge
Rows: 2000 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): x
date (1): y
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(challenge)
cols(
x = col_double(),
y = col_date(format = "")
)
Second, you can copy the column specification from the output to your clipboard:
#This part:
#cols(
# x = col_double(),
# y = col_logical()
#)
Third, provide it your read_csv()
call as a col_types =
argument (by simply pasting it):
#challenge <- read_csv(readr_example("challenge.csv"),
# col_types = cols(
# x = col_double(),
# needs to be modified --> y = col_logical()
# ))
Fourth, modify the argument:
#challenge_w_date <- read_csv(readr_example("challenge.csv"),
# col_types = cols(
# x = col_number(),
# modified --> y = col_date()
# ))
Fifth, read it in:
<- read_csv(readr_example("challenge.csv"),
challenge_w_date col_types = cols(
x = col_number(),
y = col_date()
))
2.2 .rds
and .RData
files
.rds
files can be used to store singular R-specific objects (such as lists), .RData
files can be used to store multiple R-specific objects. The former can be read in using read_rds("file.rds")
, the latter with load("file.RData")
. More on read_rds()
here and .RData
here
2.3 Write data
2.3.1 write_csv()
Writing data is fairly straight-forward. Most of the times, you will work with plain Tibbles which consist of different kinds of vectors except for lists. If you want to store them, I recommend you to simply use write_csv(tibble, path = "file.csv")
. If you plan on working on the .csv
file in Excel, use write_excel_csv(tibble, path = "file.csv")
2.3.2 write_rds()
Sometimes, however, it might be impossible to create a .csv
file of your data – e.g., if you want to store a list. This is what you can use write_rds(r_specific_object, path = "file.rds")
for.
2.3.3 save()
Akin to .rds
files are .RData
files. They can contain multiple objects and be written using save(r_specific_object_1, r_specific_object_2, r_specific_object_n, file = "file.RData")
. You can save your entire work space as well by calling save.image(file = "file.RData")
.
2.4 Alternative ways to read in and write data
There do also other packages exist for different data types. I will explain the ones which might be of particular use for you and their main-functions only briefly.
2.4.1 haven
You can use haven
(Wickham and Miller 2020) for reading and writing SAS (suffixes .sas7bdat
, .sas7bcat
, and .xpt
), SPSS (suffixes .sav
and .por
), and STATA (suffix .dta
) files.
The functions then are:
read_sas("file.sas7bdate"
andwrite_sas(tibble, "file.sas7bdat")
for both.sas7bdat
and.sas7bcat
files.read_xpt("file.xpt")
reads.xpt
files
read_sav("file.sav")
andread_por("file.por")
for.sav
and.por
files.write_sav(tibble, "file.sav"
writes a the Tibbletibble
to the filefile.sav
read_dta("file.dta")
andwrite_dta(tibble, "file.dta")
read and write.dta
files
The additional arguments can be found in the vignette.
2.4.2 readxl
readxl
(Wickham et al. 2019) can be used to read Excel files. read_excel("file.xls")
works for both .xls
and .xlsx
files alike. It guesses the data type from the suffix. Excel files often consist of multiple sheets. excel_sheets("file.xlsx")
returns the name of the singular sheets. When dealing with an Excel file that contains multiple sheets, you need to specify the sheet you are after in the read_excel()
function: read_excel("file.xlsx", sheet = "sheet_1")
. Please note that it only takes one sheet at a time.
More on the readxl
package can be found here.
2.4.3 vroom
vroom
(Hester et al. 2020) has been introduced recently. It claims to be able to read in delimited files with up to 1.4 GB/s. Regarding its arguments, vroom
works in the same way as the read_*()
functions from the readr
package. I would recommend you to use vroom
as soon as your data set’s size exceeds ~100 MB.
2.5 Further readings
- Information on working directories
- Websites of the singular packages: readr, haven, readxl, vroom
readr
Cheatsheet
- Chapter in R for Data Science (Wickham and Grolemund 2016) regarding data import