2.4 Data Import/Export
R is a powerful and flexible tool for data analysis, but it was originally designed for in-memory statistical computing. This imposes several practical limitations, especially when handling large datasets.
2.4.1 Key Limitations of R
Single-Core Default Execution
By default, R utilizes only one CPU core, limiting performance for compute-intensive tasks unless parallelization is explicitly implemented.
Memory-Based Data Handling
R loads data into RAM. This approach becomes problematic when working with datasets that exceed the available memory.
- Medium-Sized Files: Fit within typical RAM (1–2 GB). Processing is straightforward.
- Large Files: Between 2 and 10 GB. May require memory-efficient coding or specialized packages.
- Very Large Files: Exceed 10 GB. Necessitate distributed or parallel computing solutions.
2.4.2 Solutions and Workarounds
- Upgrade Hardware
- Increase RAM
A simple but often effective solution for moderately large datasets.
- Leverage High-Performance Computing (HPC) in R
There are several HPC strategies and packages in R that facilitate working with large or computationally intensive tasks:
Explicit Parallelism
Use packages like parallel, foreach, doParallel, future, and snow to define how code runs across multiple cores or nodes (see the sketch after this list).
Implicit Parallelism
Certain functions in packages such as data.table, dplyr, or caret internally optimize performance across cores when available.
Large-Memory Computation
Use memory-efficient structures or external-memory algorithms.
MapReduce Paradigm
Useful in distributed computing environments, especially with big data infrastructure such as Hadoop.
- Efficient Data Loading
- Limit Rows and Columns
Use arguments such as nrows = in functions like read.csv() or fread() to load subsets of large datasets.
- Use Specialized Packages for Large Data
In-Memory Matrix Packages (Single Class Type Support)
These packages interface with C++ to handle large matrices more efficiently: bigmemory, biganalytics, bigtabulate, synchronicity, bigalgebra, and bigvideo.
Out-of-Memory Storage (Multiple Class Types)
For datasets with various data types, the ff package stores data on disk and accesses it as needed, making it suitable for mixed-type columns.
- Handling Very Large Datasets (>10 GB)
When data size exceeds the capacity of a single machine, distributed computing becomes necessary:
- Hadoop Ecosystem Integration
RHadoop: A suite of R packages that integrate with the Hadoop framework.
HadoopStreaming: Enables R scripts to be used as Hadoop mappers and reducers.
Rhipe: Provides a more R-like interface to Hadoop, using Google’s Protocol Buffers for serialization.
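The sketch of explicit parallelism referenced above: the parallel package that ships with base R can spread an expensive computation across several cores. The slow_task() function and its inputs are illustrative placeholders, not part of the original text.

library(parallel)

slow_task <- function(x) {
  Sys.sleep(0.1)  # stand-in for an expensive computation
  x^2
}

n_cores <- max(1, detectCores() - 1)        # leave one core free
cl <- makeCluster(n_cores)                  # works on Windows, macOS, and Linux
results <- parLapply(cl, 1:100, slow_task)  # apply slow_task() across the cluster
stopCluster(cl)                             # always release the workers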
2.4.3 Medium Size
To import multiple files in a directory
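A sketch using the rio package (consistent with the export() calls below); the directory path and file pattern are illustrative assumptions.

library(rio)

# Import every .csv file in a directory into a named list of data frames
files <- list.files("data_folder", pattern = "\\.csv$", full.names = TRUE)
data_list <- import_list(files)

# Equivalent base-R approach
data_list <- lapply(files, read.csv)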
To export a single data file
export(data, "data.csv")
export(data,"data.dta")
export(data,"data.txt")
export(data,"data_cyl.rds")
export(data,"data.rdata")
export(data,"data.R")
export(data,"data.csv.zip")
export(data,"list.json")
To export multiple data files
export(list(mtcars = mtcars, iris = iris), "data_file_type")
# where data_file_type should be substituted with one of the file extensions listed above
To convert between data file types
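A sketch using rio::convert(); the file names are illustrative.

library(rio)

# Convert a Stata file to CSV in a single step
convert("data.dta", "data.csv")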
2.4.4 Large Size
2.4.4.1 Cloud Computing: Using AWS for Big Data
Amazon Web Services (AWS): Compute resources can be rented for roughly $1 per hour, depending on the instance type. Use AWS to process large datasets without overwhelming your local machine.
2.4.4.2 Importing Large Files as Chunks
2.4.4.2.6 Using the RMySQL Package
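A minimal sketch of pulling a large table in chunks through DBI and RMySQL; the connection details, table name, and chunk size are illustrative assumptions, not values from the original text.

library(DBI)
library(RMySQL)

# Connection details are placeholders
con <- dbConnect(
  RMySQL::MySQL(),
  dbname   = "my_database",
  host     = "localhost",
  user     = "my_user",
  password = "my_password"
)

res <- dbSendQuery(con, "SELECT * FROM big_table")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 10000)  # fetch 10,000 rows at a time
  # Process 'chunk' here
}
dbClearResult(res)
dbDisconnect(con)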
The RSQLite package offers a similar workflow for SQLite databases. First, build the database with the SQLite command-line tools:
- Download SQLite; pick “A bundle of command-line tools for managing SQLite database files” for Windows 10.
- Unzip the file and open sqlite3.exe.
- At the prompt, type sqlite> .cd 'C:\Users\data' to change to your desired directory, then sqlite> .open database_name.db to open (or create) a database.
- To import the CSV file into the database: sqlite> .mode csv tells SQLite that the next file is a .csv file, and sqlite> .import file_name.csv table_name imports the CSV file into a table.
- When you are done, type sqlite> .exit to leave the SQLite program.
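Once the database exists, a brief sketch of reading it back into R through DBI and RSQLite (the database and table names follow the placeholders above):

library(DBI)
library(RSQLite)

# Connect to the database created with the command-line tools
con <- dbConnect(RSQLite::SQLite(), "database_name.db")
dbListTables(con)  # confirm the import worked

# Pull a manageable subset instead of the full table
subset_data <- dbGetQuery(con, "SELECT * FROM table_name LIMIT 10000")

dbDisconnect(con)

The same dbSendQuery()/dbFetch() loop shown above for RMySQL works here as well when the table must be processed in chunks.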
2.4.4.2.10 Comparisons Regarding Storage Space
test <- ff::read.csv.ffdf(file = "")
object.size(test)   # Highest memory usage
test1 <- data.table::fread(file = "")
object.size(test1)  # Lowest memory usage
test2 <- readr::read_csv(file = "")
object.size(test2)  # Second-lowest memory usage
test3 <- vroom::vroom(file = "")
object.size(test3)  # Similar to read_csv
When dealing with large datasets, especially those exceeding 10 GB, standard data-loading strategies in R can become impractical due to memory constraints. One common approach is to store these datasets in a compressed format such as .csv.gz, which saves disk space while preserving compatibility with many tools.
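A brief sketch of writing a data frame to a compressed CSV with base R; the object and file names are illustrative.

# Write a data frame to a gzip-compressed CSV through a gzfile() connection
con <- gzfile("my_data.csv.gz", open = "wt")
write.csv(my_data, con, row.names = FALSE)
close(con)

Recent versions of readr::write_csv() also compress automatically when the output file name ends in .gz.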
Compressed files (e.g., .csv.gz) are especially useful for archiving and transferring large datasets. However, R typically loads the entire dataset into memory before writing or processing, which is inefficient or even infeasible for very large files.
In such cases, sequential processing becomes essential. Rather than reading the entire dataset at once, you can process it in chunks or rows, minimizing memory usage.
Comparison: read.csv() vs. readr::read_csv()

| Function | Supports Connections | Sequential Access | Performance | Notes |
|---|---|---|---|---|
| read.csv() | Yes | Yes | Slower | Base R; can work with gz-compressed files and allows looped reading |
| readr::read_csv() | Limited | No | Faster | High performance, but re-reads lines even when using skip |
While readr::read_csv() is faster and more efficient for smaller files, it has a limitation for large files when used with skip. The skip argument does not avoid reading the skipped rows; it simply discards them after reading. This leads to redundant I/O and significantly slows down performance for large files.
readr::read_csv(file, n_max = 100, skip = 0) # Reads rows 1–100
readr::read_csv(file, n_max = 200, skip = 100) # Re-reads rows 1–100, then reads 101–300
This approach is inefficient when processing data in chunks.
read.csv() can read directly from a file or a connection, and unlike readr::read_csv(), it maintains the connection state, allowing you to loop over chunks without re-reading prior rows.
Example: Sequential Reading with Connection
con <- gzfile("large_file.csv.gz", open = "rt")  # Open the compressed file in text mode
headers <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]  # Column names from the header line
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 1000, header = FALSE, col.names = headers),
    error = function(e) NULL  # read.csv() errors once the connection is exhausted
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # Process 'chunk' here
}
close(con)
Occasionally, when reading compressed or encoded files, you might encounter the following error:
Error in (function (con, what, n = 1L, size = NA_integer_, signed = TRUE):
can only read from a binary connection
This can occur if the file is not correctly interpreted as text. A workaround is to explicitly set the connection mode to binary using "rb":
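A minimal sketch, reusing the illustrative large_file.csv.gz from above and assuming the error came from a readr-style reader that expects a binary connection.

# Open the compressed file in binary mode ("rb") instead of text mode ("rt")
con <- gzfile("large_file.csv.gz", open = "rb")
data <- readr::read_csv(con)  # readr reads the whole connection into memory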
Although file() and gzfile() generally detect formats automatically, setting the mode explicitly can resolve issues when the behavior is inconsistent.