## 2.4 Data Import/Export
Format | Typical Extension | Import Package | Export Package | Installed by Default |
---|---|---|---|---|
Comma-separated data | .csv | data.table | data.table | Yes |
Pipe-separated data | .psv | data.table | data.table | Yes |
Tab-separated data | .tsv | data.table | data.table | Yes |
CSVY (CSV + YAML metadata header) | .csvy | data.table | data.table | Yes |
SAS | .sas7bdat | haven | haven | Yes |
SPSS | .sav | haven | haven | Yes |
SPSS (compressed) | .zsav | haven | haven | Yes |
Stata | .dta | haven | haven | Yes |
SAS XPORT | .xpt | haven | haven | Yes |
SPSS Portable | .por | haven | | Yes |
Excel | .xls | readxl | | Yes |
Excel | .xlsx | readxl | openxlsx | Yes |
R syntax | .R | base | base | Yes |
Saved R objects | .RData, .rda | base | base | Yes |
Serialized R objects | .rds | base | base | Yes |
Epiinfo | .rec | foreign | | Yes |
Minitab | .mtp | foreign | | Yes |
Systat | .syd | foreign | | Yes |
“XBASE” database files | .dbf | foreign | foreign | Yes |
Weka Attribute-Relation File Format | .arff | foreign | foreign | Yes |
Data Interchange Format | .dif | utils | | Yes |
Fortran data | no recognized extension | utils | | Yes |
Fixed-width format data | .fwf | utils | utils | Yes |
gzip comma-separated data | .csv.gz | utils | utils | Yes |
Apache Arrow (Parquet) | .parquet | arrow | arrow | No |
EViews | .wf1 | hexView | | No |
Feather R/Python interchange format | .feather | feather | feather | No |
Fast Storage | .fst | fst | fst | No |
JSON | .json | jsonlite | jsonlite | No |
Matlab | .mat | rmatio | rmatio | No |
OpenDocument Spreadsheet | .ods | readODS | readODS | No |
HTML Tables | .html | xml2 | xml2 | No |
Shallow XML documents | .xml | xml2 | xml2 | No |
YAML | .yml | yaml | yaml | No |
Clipboard | default is tsv | clipr | clipr | No |
Google Sheets | as Comma-separated data | | | |
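
Many of these formats can also be handled through a single front end: the `export()` calls used later in this section come from the `rio` package, which dispatches to the packages in the table based on the file extension. A minimal import sketch, assuming `rio` is installed and using an illustrative file name:

```r
# install.packages("rio") # if not yet installed
library(rio)

# import() infers the format from the extension and calls the matching
# package from the table above (e.g., data.table::fread for .csv)
dat <- import("data.csv") # illustrative file name
head(dat)
```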
R limitations:

- By default, R uses only one CPU core.
- R puts data into memory (a limit of around 2-4 GB), while SAS reads data from files on demand.

Categorization:

- Medium-size file: fits within the RAM limit, around 1-2 GB.
- Large file: 2-10 GB; there may still be workaround solutions.
- Very large file: more than 10 GB; you have to use distributed or parallel computing.

Solutions:

- Buy more RAM.
- Use HPC packages (see the `parallel` sketch after this list):
    - Explicit parallelism
    - Implicit parallelism
    - Large memory
    - Map/Reduce
- Specify the number of rows and columns when importing, typically with an argument such as `nrows =`.
- Use packages that store data differently:
    - `bigmemory`, `biganalytics`, `bigtabulate`, `synchronicity`, `bigalgebra`, `bigvideo` use C++ to store matrices, but they support only one class type. For multiple class types, use the `ff` package.
- For very large datasets, use packages such as `RHadoop`, `HadoopStreaming`, or `Rhipe`.
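
As a quick illustration of explicit parallelism, the base `parallel` package can spread independent tasks over several cores; the toy bootstrap task below is only an assumption for demonstration:

```r
library(parallel)

n_cores <- max(1, detectCores() - 1) # leave one core free

# toy task: bootstrap the mean of a numeric vector many times
boot_mean <- function(i, x) mean(sample(x, length(x), replace = TRUE))

cl <- makeCluster(n_cores)                       # works on Windows and Unix
res <- parLapply(cl, 1:1000, boot_mean, x = rnorm(1e4))
stopCluster(cl)

summary(unlist(res))
```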
### 2.4.1 Medium size
To import multiple files in a directory:
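
A minimal sketch using `rio::import_list()`, which reads a vector of file paths into a named list of data frames (the `.csv` pattern is illustrative):

```r
library(rio)

# collect every .csv file in the working directory and import them all
files <- list.files(pattern = "\\.csv$")
data_list <- import_list(files)

str(data_list, 1)
```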
To export a single data file:

```r
export(data, "data.csv")
export(data, "data.dta")
export(data, "data.txt")
export(data, "data_cyl.rds")
export(data, "data.rdata")
export(data, "data.R")
export(data, "data.csv.zip")
export(data, "list.json")
```
To export multiple data files:

```r
# "data_file_type" should be substituted with one of the extensions listed above
export(list(mtcars = mtcars, iris = iris), "data_file_type")
```
To convert between data file types:
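
One way to do this is `rio::convert()`, which imports a file and re-exports it in the target format in a single call (file names are illustrative):

```r
library(rio)

# read data.csv and write it back out as a Stata .dta file
convert("data.csv", "data.dta")
```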
### 2.4.2 Large size
- Use R on a cluster
    - e.g., Amazon Web Services (AWS): about $1/hr
- Import files as chunks:

```r
# open a read connection and pull the file in line-sized chunks
file_in <- file("in.csv", "r")
chunk_size <- 100000 # choose the best size for you
x <- readLines(file_in, n = chunk_size)
close(file_in)
```
- `data.table` method, e.g., `fread()` with its `nrows` and `skip` arguments (see the sketch below):
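
A minimal sketch of chunk-wise reading with `data.table::fread()`; the file name and chunk size are illustrative. Note that `skip` still has to scan past the earlier lines on each call (the same caveat discussed at the end of this section).

```r
library(data.table)

chunk_size <- 100000

# first chunk: the header plus the first chunk_size data rows
chunk1 <- fread("in.csv", nrows = chunk_size)

# second chunk: skip the header and the rows already read,
# and reuse the column names from the first chunk
chunk2 <- fread("in.csv",
                nrows = chunk_size,
                skip = chunk_size + 1,
                header = FALSE,
                col.names = names(chunk1))
```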
- `ff` package (this method does not allow you to pass connections):
library("ff")
x <- read.csv.ffdf(
file = "file.csv",
nrow = 10,
header = TRUE,
VERBOSE = TRUE,
first.rows = 10000,
next.rows = 50000,
colClasses = NA
)
- `bigmemory` package (a sketch follows below):
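
A minimal sketch with `bigmemory::read.big.matrix()`, which stores an all-numeric table in a file-backed matrix so it does not have to fit in RAM; file names and the use of `biganalytics` are illustrative:

```r
library(bigmemory)
library(biganalytics)

# read a large, all-numeric CSV into a file-backed big.matrix
x <- read.big.matrix("file.csv",
                     header = TRUE,
                     type = "double",
                     backingfile = "file.bin",
                     descriptorfile = "file.desc")

dim(x)
colmean(x) # biganalytics provides summaries that work on big.matrix objects
```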
- `sqldf` package:
```r
library(sqldf)

my_data <- read.csv.sql("in.csv")

iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where Species = 'setosa'")
```
- `RSQLite` package:
    - Download SQLite; for Windows 10, pick "A bundle of command-line tools for managing SQLite database files".
    - Unzip the file and open `sqlite3.exe`.
    - At the prompt, type `sqlite> .cd 'C:\Users\data'` to point SQLite at your desired directory, then `sqlite> .open database_name.db` to open (or create) a database.
    - To import the CSV file into the database, type `sqlite> .mode csv` to tell SQLite that the incoming file is a .csv file, then `sqlite> .import file_name.csv table_name` to load it into a table.
    - When you are done, type `sqlite> .exit` to exit the sqlite program.
    - Back in R, you can then query the database:
```r
library(DBI)
library(dplyr)
library(RSQLite)

setwd("") # set this to the directory containing your .db file

con <- dbConnect(RSQLite::SQLite(), "data_base.db")
tbl <- tbl(con, "data_table")

tbl %>%
    filter() %>%  # add your filtering conditions here
    select() %>%  # add the columns you need here
    collect()     # to actually pull the data into the workspace

dbDisconnect(con)
```
- `arrow` package (a sketch follows below):
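
A minimal sketch with the `arrow` package, which reads Parquet/Feather files and can query a folder of files without loading everything into memory; paths and column names are illustrative:

```r
library(arrow)
library(dplyr)

# read a single Parquet file into a data frame
dat <- read_parquet("data.parquet")

# treat a directory of Parquet files as one dataset; dplyr verbs are
# evaluated lazily and only the result of collect() is pulled into R
ds <- open_dataset("data_folder/")
ds %>%
  filter(year == 2020) %>%
  select(id, value) %>%
  collect()
```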
- `vroom` package:
```r
library(vroom)

# vroom() guesses column types; spec() shows the guessed specification
spec(vroom(file_path)) # file_path is the path to your file

# vroom can read compressed files directly
compressed <- vroom_example("mtcars.csv.zip")
vroom(compressed)
```
- `data.table` package (see `fread()` in the comparison below).
Comparing the memory size of the imported object across these packages:

```r
test <- ff::read.csv.ffdf(file = "")
object.size(test) # worst

test1 <- data.table::fread(file = "")
object.size(test1) # best

test2 <- readr::read_csv("")
object.size(test2) # 2nd

test3 <- vroom::vroom(file = "")
object.size(test3) # equal to read_csv
```
To work with big data, you can convert it to `.csv.gz`, but R typically requires you to load the whole dataset before exporting it, so with data greater than 10 GB we have to do this sequentially. Even though `read.csv` is much slower than `readr::read_csv`, we still have to use it because it accepts an open connection, which lets you loop through the file sequentially. `readr::read_csv` cannot resume from an open connection; you have to rely on its `skip` argument instead, and the skipped lines still have to be read (and discarded) on every call. For example, if you call `read_csv(file, n_max = 100, skip = 0)` and then `read_csv(file, n_max = 200, skip = 100)`, the second call re-reads the first 100 rows, whereas `read.csv` on an open connection simply continues at the 100-row mark.
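
A minimal sketch of that sequential pattern, reading from an open connection so each call to `read.csv()` continues where the previous one stopped; the file name and chunk size are illustrative, and the header split assumes no quoted commas:

```r
con <- file("in.csv", open = "r") # switch to "rb" if you hit the binary-connection error below

# read the header line once
header <- strsplit(readLines(con, n = 1), ",")[[1]]

chunk_size <- 100000
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = header, nrows = chunk_size),
    error = function(e) NULL # no lines left to read
  )
  if (is.null(chunk)) break
  # ... process / export `chunk` here ...
  if (nrow(chunk) < chunk_size) break # last (partial) chunk reached
}

close(con)
```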
Notice that you might sometimes get an error like "Error in (function (con, what, n = 1L, size = NA_integer_, signed = TRUE, : can only read from a binary connection". If so, change the connection mode from `"r"` to `"rb"` (read binary). Even though the package author suggests that `file()` should recognize the appropriate mode automatically, in my case it did not.