2.5 Data Import/Export

Extended Manual by R

Table by Rio Vignette
Format Typical Extension Import Package Export Package Installed by Default
Comma-separated data .csv data.table data.table Yes
Pipe-separated data .psv data.table data.table Yes
Tab-separated data .tsv data.table data.table Yes
CSVY (CSV + YAML metadata header) .csvy data.table data.table Yes
SAS .sas7bdat haven haven Yes
SPSS .sav haven haven Yes
SPSS (compressed) .zsav haven haven Yes
Stata .dta haven haven Yes
SAS XPORT .xpt haven haven Yes
SPSS Portable .por haven Yes
Excel .xls readxl Yes
Excel .xlsx readxl openxlsx Yes
R syntax .R base base Yes
Saved R objects .RData, .rda base base Yes
Serialized R objects .rds base base Yes
Epiinfo .rec foreign Yes
Minitab .mtp foreign Yes
Systat .syd foreign Yes
“XBASE” database files .dbf foreign foreign Yes
Weka Attribute-Relation File Format .arff foreign foreign Yes
Data Interchange Format .dif utils Yes
Fortran data no recognized extension utils Yes
Fixed-width format data .fwf utils utils Yes
gzip comma-separated data .csv.gz utils utils Yes
Apache Arrow (Parquet) .parquet arrow arrow No
EViews .wf1 hexView No
Feather R/Python interchange format .feather feather feather No
Fast Storage .fst fst fst No
JSON .json jsonlite jsonlite No
Matlab .mat rmatio rmatio No
OpenDocument Spreadsheet .ods readODS readODS No
HTML Tables .html xml2 xml2 No
Shallow XML documents .xml xml2 xml2 No
YAML .yml yaml yaml No
Clipboard default is tsv clipr clipr No
Google Sheets as Comma-separated data

R limitations:

  • By default, R use 1 core in CPU

  • R puts data into memory (limit around 2-4 GB), while SAS uses data from files on demand

  • Categorization

    • Medium-size file: within RAM limit, around 1-2 GB

    • Large file: 2-10 GB, there might be some workaround solution

    • Very large file > 10 GB, you have to use distributed or parallel computing

Solutions:

  • buy more RAM

  • HPC packages

    • Explicit Parallelism

    • Implicit Parallelism

    • Large Memory

    • Map/Reduce

  • specify number of rows and columns, typically including command nrow =

  • Use packages that store data differently

    • bigmemory, biganalytics, bigtabulate , synchronicity, bigalgebra, bigvideo use C++ to store matrices, but also support one class type

    • For multiple class types, use ff package

  • Very Large datasets use

    • RHaddop package
    • HadoopStreaming
    • Rhipe

2.5.1 Medium size

library("rio")

To import multiple files in a directory

str(import_list(dir()), which = 1)

To export a single data file

export(data, "data.csv")
export(data,"data.dta")
export(data,"data.txt")
export(data,"data_cyl.rds")
export(data,"data.rdata")
export(data,"data.R")
export(data,"data.csv.zip")
export(data,"list.json")

To export multiple data files

export(list(mtcars = mtcars, iris = iris), "data_file_type") # where data_file_type should substituted with the extension listed above


To convert between data file types

# convert Stata to SPSS
convert("data.dta", "data.sav")

2.5.2 Large size

Use R on a cluster

  • Amazon Web Service (AWS): $1/hr

Import files as chunks

file_in    <- file("in.csv","r")
chunk_size <- 100000 # choose the best size for you
x          <- readLines(file_in, n=chunk_size)

data.table method

require(data.table)
mydata = fread("in.csv", header = T)

ff package: this method does not allow you to pass connections

library("ff")
x <- read.csv.ffdf(
    file = "file.csv",
    nrow = 10,
    header = TRUE,
    VERBOSE = TRUE,
    first.rows = 10000,
    next.rows = 50000,
    colClasses = NA
)

bigmemory package

my_data <- read.big.matrix('in.csv', header = T)

sqldf package

library(sqldf)
my_data <- read.csv.sql('in.csv')

iris2 <- read.csv.sql("iris.csv", 
    sql = "select * from file where Species = 'setosa' ")
library(RMySQL)
## Loading required package: DBI

RQLite package

  • Download SQLite, pick “A bundle of command-line tools for managing SQLite database files” for Window 10
  • Unzip file, and open sqlite3.exe.
  • Type in the prompt
    • sqlite> .cd 'C:\Users\data' specify path to your desired directory
    • sqlite> .open database_name.db to open a database
    • To import the CSV file into the database
      • sqlite> .mode csv specify to SQLite that the next file is .csv file
      • sqlite> .import file_name.csv datbase_name to import the csv file to the database
    • sqlite> .exit After you’re done, exit the sqlite program
library(DBI)
library(dplyr)
library("RSQLite")
setwd("")
con <- dbConnect(RSQLite::SQLite(), "data_base.db")
tbl <- tbl(con, "data_table")
tbl %>% 
    filter() %>%
    select() %>%
    collect() # to actually pull the data into the workspace
dbDisconnect(con)

arrow package

library("arrow")
read_csv_arrow()

vroom package

library(vroom)
spec(vroom(file_path))
compressed <- vroom_example("mtcars.csv.zip")
vroom(compressed)

data.table package

s = fread("sample.csv")

Comparisons regarding storage space

test = ff::read.csv.ffdf(file = "")
object.size(test) # worst

test1 = data.table::fread(file = "")
object.size(test1) # best

test2 = readr::read_csv(""))
object.size(test2) # 2nd

test3 = vroom(file = "")
object.size(test3) # equal to read_csv

To work with big data, you can convert it to csv.gz , but since typically, R would require you to load the whole data then export it. With data greater than 10 GB, we have to do it sequentially. Even though read.csv is much slower than readr::read_csv , we still have to use it because it can pass connection, and it allows you to loop sequentially. On the other, because currently readr::read_csv does not have the skip function, and even if we can use the skip, we still have to read and skip lines in previous loop.

For example, say you read_csv(, n_max = 100, skip =0) and then read_csv(, n_max = 200, skip = 100) you actually have to read again the first 100 rows. However, read.csv without specifying anything, will continue at the 100 mark.

Notice, sometimes you might have error looking like this

“Error in (function (con, what, n = 1L, size = NA_integer_, signed = TRUE, : can only read from a binary connection”

then you can change it instead of r in the connection into rb . Even though an author of the package suggested that file should be able to recognize the appropriate form, so far I did not prevail.