2.4 Data Import/Export
Format | Typical Extension | Import Package | Export Package | Installed by Default |
---|---|---|---|---|
Comma-separated data | .csv | data.table | data.table | Yes |
Pipe-separated data | .psv | data.table | data.table | Yes |
Tab-separated data | .tsv | data.table | data.table | Yes |
CSVY (CSV + YAML metadata header) | .csvy | data.table | data.table | Yes |
SAS | .sas7bdat | haven | haven | Yes |
SPSS | .sav | haven | haven | Yes |
SPSS (compressed) | .zsav | haven | haven | Yes |
Stata | .dta | haven | haven | Yes |
SAS XPORT | .xpt | haven | haven | Yes |
SPSS Portable | .por | haven | | Yes |
Excel | .xls | readxl | | Yes |
Excel | .xlsx | readxl | openxlsx | Yes |
R syntax | .R | base | base | Yes |
Saved R objects | .RData, .rda | base | base | Yes |
Serialized R objects | .rds | base | base | Yes |
Epiinfo | .rec | foreign | | Yes |
Minitab | .mtp | foreign | | Yes |
Systat | .syd | foreign | | Yes |
“XBASE” database files | .dbf | foreign | foreign | Yes |
Weka Attribute-Relation File Format | .arff | foreign | foreign | Yes |
Data Interchange Format | .dif | utils | | Yes |
Fortran data | no recognized extension | utils | | Yes |
Fixed-width format data | .fwf | utils | utils | Yes |
gzip comma-separated data | .csv.gz | utils | utils | Yes |
Apache Arrow (Parquet) | .parquet | arrow | arrow | No |
EViews | .wf1 | hexView | | No |
Feather R/Python interchange format | .feather | feather | feather | No |
Fast Storage | .fst | fst | fst | No |
JSON | .json | jsonlite | jsonlite | No |
Matlab | .mat | rmatio | rmatio | No |
OpenDocument Spreadsheet | .ods | readODS | readODS | No |
HTML Tables | .html | xml2 | xml2 | No |
Shallow XML documents | .xml | xml2 | xml2 | No |
YAML | .yml | yaml | yaml | No |
Clipboard | default is tsv | clipr | clipr | No |
Google Sheets | as Comma-separated data | | | |
R limitations:

- By default, R uses only one CPU core.
- R loads data into memory (a limit of roughly 2-4 GB), while SAS reads data from files on demand.
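As a minimal sketch of working around the single-core default with the base `parallel` package (the two-worker cluster size and the toy computation are arbitrary choices, not from the original text):

```r
library(parallel)

detectCores()          # number of cores available on this machine

# Explicit parallelism: run a function across two workers instead of one core
cl <- makeCluster(2)
parSapply(cl, 1:10, function(x) x^2)
stopCluster(cl)
```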
Categorization:

- Medium-size file: fits within RAM, roughly 1-2 GB.
- Large file: 2-10 GB; workaround solutions may still be possible.
- Very large file: over 10 GB; you have to use distributed or parallel computing.

Solutions:

- Buy more RAM.
- Use HPC packages:
  - Explicit parallelism
  - Implicit parallelism
  - Large memory
  - Map/Reduce
- Specify the number of rows and columns to read, typically via arguments such as `nrow =`.
- Use packages that store data differently (see the sketch after this list):
  - `bigmemory`, `biganalytics`, `bigtabulate`, `synchronicity`, `bigalgebra`, and `bigvideo` use C++ to store matrices, but support only one class type.
  - For multiple class types, use the `ff` package.
- For very large datasets, use `RHadoop`, `HadoopStreaming`, or `Rhipe`.
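A minimal sketch of the disk-backed approach, assuming a hypothetical numeric `large.csv`; the file, backing-file, and descriptor names are placeholders:

```r
library(bigmemory)

# Store a large numeric matrix on disk rather than in RAM (single class type)
x <- read.big.matrix("large.csv", header = TRUE, type = "double",
                     backingfile = "large.bin", descriptorfile = "large.desc")

library(ff)

# ff data frames allow columns of different classes
y <- read.csv.ffdf(file = "large.csv", header = TRUE)
```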
2.4.1 Medium size
To import multiple files in a directory:
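A minimal sketch using `rio::import_list()`; the directory path and file pattern are placeholders:

```r
library(rio)

# Import every .csv file in a directory into a named list of data frames
files <- list.files("path/to/dir", pattern = "\\.csv$", full.names = TRUE)
data_list <- import_list(files)

# Or stack them into a single data frame
data_all <- import_list(files, rbind = TRUE)
```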
To export a single data file:

```r
library(rio)  # export() comes from the rio package

export(data, "data.csv")
export(data, "data.dta")
export(data, "data.txt")
export(data, "data_cyl.rds")
export(data, "data.rdata")
export(data, "data.R")
export(data, "data.csv.zip")
export(data, "list.json")
```
To export multiple data files:

```r
export(list(mtcars = mtcars, iris = iris), "data_file_type")
# where data_file_type should be substituted with one of the extensions listed above
```
To convert between data file types:
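For example, with `rio::convert()`; the file names are illustrative:

```r
library(rio)

# Convert a Stata file to CSV without importing it into your workspace
convert("data.dta", "data.csv")
```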
2.4.2 Large size
2.4.2.1 Cloud Computing: Using AWS for Big Data
Amazon Web Services (AWS) rents compute resources at approximately $1/hr, so you can process large datasets in the cloud without overwhelming your local machine.
2.4.2.2 Importing Large Files as Chunks
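One option, not necessarily the one used in the subsections below, is `readr::read_csv_chunked()`, which applies a callback to each chunk; the file name and the `cyl` filter are illustrative:

```r
library(readr)

# Keep only rows with cyl == 4, processing 10,000 rows at a time
keep_cyl4 <- function(chunk, pos) subset(chunk, cyl == 4)

result <- read_csv_chunked("large_file.csv",
                           callback = DataFrameCallback$new(keep_cyl4),
                           chunk_size = 10000)
```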
2.4.2.2.6 Using the RMySQL
Package

A related option is SQLite, which you can query from R through the RSQLite package. To load a large CSV file into a SQLite database from the command line:
- Download SQLite; for Windows 10, pick “A bundle of command-line tools for managing SQLite database files”.
- Unzip the file and open `sqlite3.exe`.
- At the prompt, type `sqlite> .cd 'C:\Users\data'` to specify the path to your desired directory, then `sqlite> .open database_name.db` to open (or create) a database.
- To import the CSV file into the database:
  - `sqlite> .mode csv` tells SQLite that the next file is a .csv file.
  - `sqlite> .import file_name.csv table_name` imports the CSV file into the database table `table_name`.
- After you’re done, `sqlite> .exit` exits the sqlite program.
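Back in R, the same database can then be queried with `DBI` and `RSQLite`; the path, database, and table names below follow the placeholders used above:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "C:/Users/data/database_name.db")

dbListTables(con)   # confirm the imported table exists

# Pull only the rows you need rather than the whole table
dat <- dbGetQuery(con, "SELECT * FROM table_name LIMIT 100")

dbDisconnect(con)
```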
2.4.2.2.10 Comparisons Regarding Storage Space
```r
test <- ff::read.csv.ffdf(file = "")
object.size(test)   # highest memory usage

test1 <- data.table::fread(file = "")
object.size(test1)  # lowest memory usage

test2 <- readr::read_csv(file = "")
object.size(test2)  # second-lowest memory usage

test3 <- vroom::vroom(file = "")
object.size(test3)  # similar to read_csv
```
To work with large datasets, you can compress them into the `csv.gz` format. However, R typically needs to load the entire dataset before exporting it, which can be impractical for data over 10 GB. In such cases, processing the data sequentially becomes necessary. Although `read.csv` is slower than `readr::read_csv`, it can read from connections and therefore supports sequential looping, which makes it useful for large files.

Currently, `readr::read_csv` does not support the `skip` argument efficiently for large data. Even if you specify `skip`, the function reads all preceding lines again. For instance, if you run `read_csv(file, n_max = 100, skip = 0)` followed by `read_csv(file, n_max = 200, skip = 100)`, the first 100 rows are re-read. In contrast, `read.csv` reading from an open connection can continue from where it left off without re-reading previous rows.
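A minimal sketch of that connection-based pattern; the file name and chunk size are illustrative:

```r
# Open the file once; each read.csv() call continues from the current position
con <- file("large_file.csv", open = "r")

# First call reads the header plus the first 100 data rows
chunk1 <- read.csv(con, nrows = 100, header = TRUE)

# Next call picks up at row 101 without re-reading anything
chunk2 <- read.csv(con, nrows = 100, header = FALSE,
                   col.names = names(chunk1))

close(con)
```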
If you encounter an error such as:

“Error in (function (con, what, n = 1L, size = NA_integer_, signed = TRUE): can only read from a binary connection”,

you can change the connection mode from `"r"` to `"rb"` (read binary). Although the `file` function is designed to detect the appropriate format automatically, this workaround can help when it does not behave as expected.
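For example, when reading a compressed file through a connection (file name illustrative):

```r
# "rb" (read binary) instead of "r" avoids the binary-connection error
con <- file("large_file.csv.gz", open = "rb")

dat <- read.csv(con, nrows = 100, header = TRUE)

close(con)
```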