2.4 Data Import/Export

For an extended reference, see the R Data Import/Export manual.

The table below is adapted from the rio package vignette.
| Format | Typical Extension | Import Package | Export Package | Installed by Default |
|---|---|---|---|---|
| Comma-separated data | .csv | data.table | data.table | Yes |
| Pipe-separated data | .psv | data.table | data.table | Yes |
| Tab-separated data | .tsv | data.table | data.table | Yes |
| CSVY (CSV + YAML metadata header) | .csvy | data.table | data.table | Yes |
| SAS | .sas7bdat | haven | haven | Yes |
| SPSS | .sav | haven | haven | Yes |
| SPSS (compressed) | .zsav | haven | haven | Yes |
| Stata | .dta | haven | haven | Yes |
| SAS XPORT | .xpt | haven | haven | Yes |
| SPSS Portable | .por | haven | | Yes |
| Excel | .xls | readxl | | Yes |
| Excel | .xlsx | readxl | openxlsx | Yes |
| R syntax | .R | base | base | Yes |
| Saved R objects | .RData, .rda | base | base | Yes |
| Serialized R objects | .rds | base | base | Yes |
| Epiinfo | .rec | foreign | | Yes |
| Minitab | .mtp | foreign | | Yes |
| Systat | .syd | foreign | | Yes |
| “XBASE” database files | .dbf | foreign | foreign | Yes |
| Weka Attribute-Relation File Format | .arff | foreign | foreign | Yes |
| Data Interchange Format | .dif | utils | | Yes |
| Fortran data | no recognized extension | utils | | Yes |
| Fixed-width format data | .fwf | utils | utils | Yes |
| gzip comma-separated data | .csv.gz | utils | utils | Yes |
| Apache Arrow (Parquet) | .parquet | arrow | arrow | No |
| EViews | .wf1 | hexView | | No |
| Feather R/Python interchange format | .feather | feather | feather | No |
| Fast Storage | .fst | fst | fst | No |
| JSON | .json | jsonlite | jsonlite | No |
| Matlab | .mat | rmatio | rmatio | No |
| OpenDocument Spreadsheet | .ods | readODS | readODS | No |
| HTML Tables | .html | xml2 | xml2 | No |
| Shallow XML documents | .xml | xml2 | xml2 | No |
| YAML | .yml | yaml | yaml | No |
| Clipboard | default is tsv | clipr | clipr | No |
| Google Sheets | as Comma-separated data | | | |

R limitations:

  • By default, R uses only a single CPU core.

  • R loads data into memory (historically limited to about 2–4 GB, and always bounded by available RAM), whereas SAS reads data from files on disk on demand.

  • A rough categorization by file size:

    • Medium-sized file: fits within RAM, roughly 1–2 GB.

    • Large file: roughly 2–10 GB; workarounds such as chunked import or on-disk storage are usually needed.

    • Very large file: more than 10 GB; distributed or parallel computing is required.

Solutions:

  • Buy more RAM.

  • Use high-performance computing (HPC) packages:

    • Explicit Parallelism

    • Implicit Parallelism

    • Large Memory

    • Map/Reduce

  • Limit what is read in the first place, e.g., via the nrows and colClasses arguments of read.csv() (see the sketch after this list).

  • Use packages that store data differently:

    • bigmemory, biganalytics, bigtabulate, synchronicity, bigalgebra, bigvideo use C++ to store matrices, but each object supports only one data type.

    • For columns of multiple types, use the ff package.

  • For very large datasets, use Hadoop-based packages:

    • RHadoop
    • HadoopStreaming
    • Rhipe
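
A minimal sketch of the nrows/colClasses approach in base R, assuming a hypothetical file data.csv with three columns of known types (adjust the classes and row limit to your data):

# Declaring column classes and a row limit up front avoids full type
# inference and caps memory use; all values below are illustrative.
dat <- read.csv(
  "data.csv",
  nrows      = 100000,
  colClasses = c("integer", "numeric", "character")
)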

2.4.1 Medium size

library("rio")

To import multiple files in a directory

str(import_list(dir()), which = 1)

To export a single data file

export(data, "data.csv")
export(data,"data.dta")
export(data,"data.txt")
export(data,"data_cyl.rds")
export(data,"data.rdata")
export(data,"data.R")
export(data,"data.csv.zip")
export(data,"list.json")

To export multiple data files

export(list(mtcars = mtcars, iris = iris), "data_file_type")
# where data_file_type should be replaced with a file name using one of the extensions listed above

To convert between data file types

# convert Stata to SPSS
convert("data.dta", "data.sav")

2.4.2 Large size

2.4.2.1 Cloud Computing: Using AWS for Big Data

Amazon Web Services (AWS): compute resources can be rented for roughly $1 per hour. Use AWS to process large datasets without overwhelming your local machine.
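
One common pattern (not covered in detail by this section) is to keep the raw file in S3 and read it from an EC2 instance with the aws.s3 package; the bucket, object, and credentials below are placeholders:

library(aws.s3)

# Credentials are read from environment variables (placeholder values shown).
Sys.setenv(
  "AWS_ACCESS_KEY_ID"     = "your-key",
  "AWS_SECRET_ACCESS_KEY" = "your-secret",
  "AWS_DEFAULT_REGION"    = "us-east-1"
)

# Stream the object from S3 and parse it with fread().
dat <- s3read_using(
  FUN    = data.table::fread,
  object = "file.csv",
  bucket = "my-bucket"
)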

2.4.2.2 Importing Large Files as Chunks

2.4.2.2.1 Using Base R
file_in <- file("in.csv", "r")  # Open a connection to the file
chunk_size <- 100000            # Define chunk size
x <- readLines(file_in, n = chunk_size)  # Read data in chunks
close(file_in)                  # Close the file connection
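
The readLines() call above returns a character vector of raw text lines; a minimal sketch for parsing a chunk into a data frame (assuming the chunk still contains the header line):

chunk_df <- read.csv(text = x, header = TRUE)  # parse the chunk's lines into a data frame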
2.4.2.2.2 Using the data.table Package
library(data.table)
mydata <- fread("in.csv", header = TRUE)  # Fast and memory-efficient
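
fread() can also restrict what is imported, which often matters as much as speed for large files; the column names and row cap below are illustrative:

mydata_subset <- fread(
  "in.csv",
  select = c("id", "date", "amount"),  # read only these columns (hypothetical names)
  nrows  = 1e6                         # cap the number of rows read
)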
2.4.2.2.3 Using the ff Package
library(ff)
x <- read.csv.ffdf(
  file = "file.csv",
  nrows = 10,         # maximum number of rows to read (example value)
  header = TRUE,      # first row contains column names
  VERBOSE = TRUE,     # display progress
  first.rows = 10000, # size of the initial chunk
  next.rows = 50000,  # size of subsequent chunks
  colClasses = NA
)
2.4.2.2.4 Using the bigmemory Package
library(bigmemory)
my_data <- read.big.matrix('in.csv', header = TRUE)
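
A file-backed variant keeps the matrix on disk instead of in RAM (a sketch; the file names are illustrative, and all columns must share a single numeric type):

my_data <- read.big.matrix(
  "in.csv",
  header         = TRUE,
  type           = "double",   # a big.matrix holds one type for all columns
  backingfile    = "in.bin",   # on-disk backing file
  descriptorfile = "in.desc"   # descriptor for re-attaching the matrix later
)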
2.4.2.2.5 Using the sqldf Package
library(sqldf)
my_data <- read.csv.sql('in.csv')

# Example: Filtering during import
iris2 <- read.csv.sql("iris.csv", 
    sql = "SELECT * FROM file WHERE Species = 'setosa'")
2.4.2.2.6 Using the RMySQL Package
library(RMySQL)
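
A minimal sketch of pulling a filtered subset from a MySQL server rather than loading an entire table into R; the database name, table, credentials, and query below are hypothetical:

library(DBI)

con <- dbConnect(
  RMySQL::MySQL(),
  dbname   = "my_database",
  host     = "localhost",
  user     = "my_user",
  password = "my_password"
)

# Let the database do the filtering, then pull only the result into R.
my_data <- dbGetQuery(con, "SELECT * FROM my_table WHERE year >= 2015")

dbDisconnect(con)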

RSQLite package

  • Download SQLite; pick “A bundle of command-line tools for managing SQLite database files” for Windows 10.
  • Unzip the file and open sqlite3.exe.
  • Type at the prompt:
    • sqlite> .cd 'C:\Users\data' to set the path to your desired directory
    • sqlite> .open database_name.db to open (or create) a database
    • To import the CSV file into the database:
      • sqlite> .mode csv to tell SQLite that the next file is a .csv file
      • sqlite> .import file_name.csv table_name to import the CSV file into a table
    • sqlite> .exit to exit the sqlite program when you are done
library(DBI)
library(dplyr)
library(RSQLite)

setwd("")  # path to the directory containing the .db file (left blank here)
con <- dbConnect(RSQLite::SQLite(), "data_base.db")
data_tbl <- tbl(con, "data_table")  # a lazy reference; nothing is loaded yet
data_tbl %>% 
    filter() %>%              # add row conditions here, e.g., filter(year >= 2015)
    select(everything()) %>%  # or list only the columns you need
    collect()                 # collect() actually pulls the data into the workspace
dbDisconnect(con)
2.4.2.2.7 Using the arrow Package
library(arrow)
data <- read_csv_arrow("file.csv")
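
For data that do not fit in memory, arrow can also scan a directory of Parquet (or CSV) files lazily and materialize only the filtered result; the folder name, column, and cutoff below are hypothetical:

library(arrow)
library(dplyr)

ds <- open_dataset("data_folder")  # defaults to Parquet files in the folder
result <- ds %>%
  filter(year >= 2015) %>%         # evaluated by arrow, outside R's memory
  collect()                        # pull only the matching rows into R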
2.4.2.2.8 Using the vroom Package
library(vroom)

# Import a compressed CSV file
compressed <- vroom_example("mtcars.csv.zip")
data <- vroom(compressed)
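
vroom reads lazily, and you can also restrict the import explicitly; a small follow-up sketch using the same example file (mtcars column names):

data_sub <- vroom(compressed, col_select = c(mpg, cyl, wt))  # read only selected columns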
2.4.2.2.9 Using the data.table Package
s = fread("sample.csv")
2.4.2.2.10 Comparisons Regarding Storage Space
test = ff::read.csv.ffdf(file = "")
object.size(test) # Highest memory usage

test1 = data.table::fread(file = "")
object.size(test1) # Lowest memory usage

test2 = readr::read_csv(file = "")
object.size(test2) # Second lowest memory usage

test3 = vroom::vroom(file = "")
object.size(test3) # Similar to read_csv

To work with large datasets, you can compress them into the csv.gz format. However, R typically needs to load the entire dataset into memory before exporting it, which becomes impractical for data over 10 GB. In such cases, the data must be processed sequentially. Although read.csv is slower than readr::read_csv, it can read from an open connection and therefore supports sequential looping over chunks, which makes it useful for very large files.

Currently, readr::read_csv does not handle the skip argument efficiently for large files: even when skip is specified, the function re-reads all preceding lines. For instance, running read_csv(file, n_max = 100, skip = 0) followed by read_csv(file, n_max = 200, skip = 100) re-reads the first 100 rows. In contrast, read.csv, when reading from an open connection, continues from where it left off without re-reading previous rows.
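
If you prefer to stay within readr, its chunked reader processes a file piece by piece through a callback; a sketch (the file name, chunk size, and filter are illustrative):

library(readr)

# Apply a function to each chunk and row-bind the results.
result <- read_csv_chunked(
  "file.csv",
  callback   = DataFrameCallback$new(function(chunk, pos) {
    subset(chunk, Species == "setosa")  # keep only the rows you need (hypothetical filter)
  }),
  chunk_size = 100000
)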

If you encounter an error such as:

“Error in (function (con, what, n = 1L, size = NA_integer_, signed = TRUE): can only read from a binary connection”,

you can modify the connection mode from "r" to "rb" (read binary). Although the file function is designed to detect the appropriate format automatically, this workaround can help resolve the issue when it does not behave as expected.
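
For example (a minimal sketch; the file name is hypothetical):

file_conn <- file("file.csv.gz", open = "rb")  # "rb" = read binary, instead of "r"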

2.4.2.3 Sequential Processing for Large Data

# Open the file for sequential reading
file_conn <- file("file.csv", open = "r")
# Read the header once so each chunk is parsed with the right column names
col_names <- strsplit(readLines(file_conn, n = 1), ",")[[1]]
repeat {
  # read.csv() errors at end of file, so treat an error as "no more rows"
  data_chunk <- tryCatch(
    read.csv(file_conn, nrows = 1000, header = FALSE, col.names = col_names),
    error = function(e) NULL
  )
  if (is.null(data_chunk) || nrow(data_chunk) == 0) break
  # Process the chunk here
}
close(file_conn)  # Close the connection