2.4 Data Import/Export

R is a powerful and flexible tool for data analysis, but it was originally designed for in-memory statistical computing. This imposes several practical limitations, especially when handling large datasets.

2.4.1 Key Limitations of R

  • Single-Core Default Execution
    By default, R utilizes only one CPU core, limiting performance for compute-intensive tasks unless parallelization is explicitly implemented.

  • Memory-Based Data Handling
    R loads data into RAM, which becomes problematic when a dataset exceeds the available memory (a rough size estimate appears after this list).

    • Medium-Sized Files: Fit within typical RAM (1–2 GB); processing is straightforward.
    • Large Files: Between 2 and 10 GB; may require memory-efficient coding or specialized packages.
    • Very Large Files: Exceed 10 GB; necessitate distributed or parallel computing solutions.
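
As a quick check before importing, you can estimate a file's in-memory footprint from its dimensions, since numeric values occupy about 8 bytes per cell. A minimal sketch with illustrative row and column counts:

rows <- 5e6
cols <- 40
rows * cols * 8 / 1e9   # roughly 1.6 GB of RAM, before any intermediate copies

object.size(mtcars)     # size of an object already in memory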

2.4.2 Solutions and Workarounds

  1. Upgrade Hardware
  • Increase RAM
    A simple but often effective solution for moderately large datasets.
  2. Leverage High-Performance Computing (HPC) in R

There are several HPC strategies and packages in R that facilitate working with large datasets or computationally intensive tasks:

  • Explicit Parallelism
    Use packages like parallel, foreach, doParallel, future, and snow to define how code runs across multiple cores or nodes (a minimal example follows this list).

  • Implicit Parallelism
    Certain functions in packages such as data.table, dplyr, or caret internally optimize performance across cores when available.

  • Large-Memory Computation
    Use memory-efficient structures or external memory algorithms.

  • MapReduce Paradigm
    Useful in distributed computing environments, especially with big data infrastructure like Hadoop.
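
A minimal sketch of explicit parallelism with parallel, foreach, and doParallel; the computation inside the loop is only a placeholder:

library(parallel)
library(foreach)
library(doParallel)

n_cores <- max(1, detectCores() - 1)   # leave one core free
cl <- makeCluster(n_cores)
registerDoParallel(cl)

# Run 100 independent tasks across the workers and row-bind the results
results <- foreach(i = 1:100, .combine = rbind) %dopar% {
  data.frame(task = i, value = mean(rnorm(1e5)))
}

stopCluster(cl)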

  3. Efficient Data Loading
  • Limit Rows and Columns
    Use arguments such as nrows = in functions like read.csv() or fread() to load only the subset of a large dataset you actually need (a short sketch follows this list).
  4. Use Specialized Packages for Large Data
  • In-Memory Matrix Packages (Single Class Type Support)
    These packages interface with C++ to handle large matrices more efficiently:

    • bigmemory, biganalytics, bigtabulate, synchronicity, bigalgebra, bigvideo
  • Out-of-Memory Storage (Multiple Class Types)
    For datasets with various data types:

    • ff package: Stores data on disk and accesses it as needed, suitable for mixed-type columns.
  5. Handling Very Large Datasets (>10 GB)

When data size exceeds the capacity of a single machine, distributed computing becomes necessary:

  • Hadoop Ecosystem Integration
    • RHadoop: A suite of R packages that integrate with the Hadoop framework.
    • HadoopStreaming: Enables R scripts to be used as Hadoop mappers and reducers.
    • Rhipe: Provides a more R-like interface to Hadoop, using Google’s Protocol Buffers for serialization.
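
As an illustration of item 3 above (efficient data loading), here is a minimal sketch that reads only some rows and columns of a hypothetical large CSV file:

# Base R: read only the first 100,000 rows
subset_base <- read.csv("big_file.csv", nrows = 100000)

# data.table: read only selected columns (names are illustrative) and cap the rows
library(data.table)
subset_dt <- fread("big_file.csv", select = c("id", "date", "value"), nrows = 100000)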

2.4.3 Medium-Sized Files

library("rio")

To import multiple files in a directory

str(import_list(dir()), max.level = 1)  # summarize the imported list one level deep

To export a single data file

export(data, "data.csv")
export(data,"data.dta")
export(data,"data.txt")
export(data,"data_cyl.rds")
export(data,"data.rdata")
export(data,"data.R")
export(data,"data.csv.zip")
export(data,"list.json")

To export multiple data files

export(list(mtcars = mtcars, iris = iris), "data_file_type")
# replace data_file_type with a file name using one of the extensions listed above;
# the format must support multiple objects (e.g., .rdata, or .xlsx with one sheet per object)

To convert between data file types

# convert Stata to SPSS
convert("data.dta", "data.sav")

2.4.4 Large Files

2.4.4.1 Cloud Computing: Using AWS for Big Data

Amazon Web Services (AWS): compute resources can be rented for roughly $1 per hour, depending on the instance type. Use AWS to process large datasets without overwhelming your local machine.

2.4.4.2 Importing Large Files in Chunks

2.4.4.2.1 Using Base R
file_in <- file("in.csv", "r")  # Open a connection to the file
chunk_size <- 100000            # Define chunk size
x <- readLines(file_in, n = chunk_size)  # Read data in chunks
close(file_in)                  # Close the file connection
2.4.4.2.2 Using the data.table Package
library(data.table)
mydata <- fread("in.csv", header = TRUE)  # Fast and memory-efficient
2.4.4.2.3 Using the ff Package
library(ff)
x <- read.csv.ffdf(
  file = "file.csv",
  nrows = 10,         # Maximum number of rows to read (-1 reads the whole file)
  header = TRUE,      # Include headers
  VERBOSE = TRUE,     # Display progress
  first.rows = 10000, # Size of the initial chunk
  next.rows = 50000,  # Size of subsequent chunks
  colClasses = NA     # Let ff infer the column classes
)
2.4.4.2.4 Using the bigmemory Package
library(bigmemory)
# big.matrix objects hold a single atomic type, so this suits all-numeric data
my_data <- read.big.matrix("in.csv", header = TRUE)
2.4.4.2.5 Using the sqldf Package
library(sqldf)
my_data <- read.csv.sql('in.csv')

# Example: Filtering during import
iris2 <- read.csv.sql("iris.csv", 
    sql = "SELECT * FROM file WHERE Species = 'setosa'")
2.4.4.2.6 Using the RMySQL Package
library(RMySQL)
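
A minimal sketch of querying a MySQL server through DBI; the connection details and table name are placeholders:

library(DBI)

con <- dbConnect(
  RMySQL::MySQL(),
  dbname   = "my_database",
  host     = "localhost",
  user     = "my_user",
  password = "my_password"
)

# Let the database do the filtering and return only the result set
my_data <- dbGetQuery(con, "SELECT * FROM my_table WHERE year >= 2020")

dbDisconnect(con)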

RSQLite package

  • Download SQLite; pick “A bundle of command-line tools for managing SQLite database files” for Windows 10.
  • Unzip the file and open sqlite3.exe.
  • Type in the prompt:
    • sqlite> .cd 'C:\Users\data' to specify the path to your desired directory
    • sqlite> .open database_name.db to open (or create) a database
    • To import the CSV file into the database:
      • sqlite> .mode csv to tell SQLite that the next file is a .csv file
      • sqlite> .import file_name.csv table_name to import the CSV file into a table
    • sqlite> .exit to exit the sqlite program after you are done
library(DBI)
library(dplyr)
library(RSQLite)
setwd("")                                           # path to the folder containing the database
con <- dbConnect(RSQLite::SQLite(), "data_base.db")
data_tbl <- tbl(con, "data_table")                  # lazy reference; nothing is read yet
data_tbl %>%
    filter() %>%                                    # add row conditions here
    select() %>%                                    # add the columns you need here
    collect()                                       # actually pull the data into the workspace
dbDisconnect(con)
2.4.4.2.7 Using the arrow Package
library(arrow)
data <- read_csv_arrow("file.csv")
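
For larger-than-memory data, arrow can also scan a directory of files lazily and materialize only the rows you request; the path and column names below are placeholders:

library(arrow)
library(dplyr)

ds <- open_dataset("data_folder/", format = "csv")  # lazily scanned, not loaded into RAM
result <- ds %>%
  filter(year >= 2020) %>%
  select(id, value) %>%
  collect()                                          # pull only the filtered rows into R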
2.4.4.2.8 Using the vroom Package
library(vroom)

# Import a compressed CSV file
compressed <- vroom_example("mtcars.csv.zip")
data <- vroom(compressed)
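
Because vroom indexes the file lazily, restricting the import to the columns you need is cheap; the column names below come from the mtcars example file:

data_subset <- vroom(compressed, col_select = c(mpg, cyl, hp))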
2.4.4.2.9 Comparisons Regarding Storage Space
# "" is a placeholder for the path to the same CSV file in each call
test <- ff::read.csv.ffdf(file = "")
object.size(test)   # Highest memory usage

test1 <- data.table::fread(file = "")
object.size(test1)  # Lowest memory usage

test2 <- readr::read_csv(file = "")
object.size(test2)  # Second lowest memory usage

test3 <- vroom::vroom(file = "")
object.size(test3)  # Similar to read_csv

When dealing with large datasets—especially those exceeding 10 GB—standard data-loading strategies in R can become impractical due to memory constraints. One common approach is to store these datasets in a compressed format such as .csv.gz, which saves disk space while preserving compatibility with many tools.

Compressed files (e.g., .csv.gz) are especially useful for archiving and transferring large datasets. However, R typically loads the entire dataset into memory before writing or processing it, which is inefficient or even infeasible for very large files.
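
As a small sketch of producing such a compressed file (the object and file names are illustrative), a data frame can be written directly through a gzfile() connection:

gz_out <- gzfile("large_file.csv.gz", open = "wt")   # compressed text connection
write.csv(my_big_df, gz_out, row.names = FALSE)      # my_big_df is a placeholder data frame
close(gz_out)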

In such cases, sequential processing becomes essential. Rather than reading the entire dataset at once, you can process it in chunks or rows, minimizing memory usage.

Comparison: read.csv() vs readr::read_csv()

  • read.csv(): supports connections and sequential access, but is slower. Base R; works with gz-compressed files and allows looped reading.
  • readr::read_csv(): limited connection support, no sequential access, but faster. High performance, yet it re-reads skipped lines even when using skip.

While readr::read_csv() is faster and more efficient for smaller files, it has a limitation when reading large files in chunks with skip. The skip argument does not avoid reading the skipped rows; it simply discards them after reading. This leads to redundant I/O operations and significantly slows down performance for large files.

readr::read_csv(file, n_max = 100, skip = 0)        # Reads rows 1–100
readr::read_csv(file, n_max = 200, skip = 100)      # Still reads through rows 1–100 before returning rows 101–300

This approach is inefficient when processing data in chunks.

read.csv() can read directly from a file or a connection, and unlike readr::read_csv(), it maintains the connection state, allowing you to loop over chunks without re-reading prior rows.

Example: Sequential Reading with Connection

con <- gzfile("large_file.csv.gz", open = "rt")  # Text mode
headers <- read.csv(con, nrows = 1)              # Consumes the header and first data row; names(headers) holds the column names
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 1000, header = FALSE, col.names = names(headers)),
    error = function(e) NULL                     # read.csv() errors once no lines remain
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # Process 'chunk' here
}
close(con)

Occasionally, when reading compressed or encoded files, you might encounter the following error:

Error in (function (con, what, n = 1L, size = NA_integer_, signed = TRUE) :
  can only read from a binary connection

This can occur if the file is not correctly interpreted as text. A workaround is to explicitly set the connection mode to binary using "rb":

con <- gzfile("large_file.csv.gz", open = "rb")  # Binary mode

Although file() and gzfile() generally detect formats automatically, setting the mode explicitly can resolve issues when the behavior is inconsistent.

2.4.4.3 Sequential Processing for Large Data

# Open the file for sequential reading
file_conn <- file("file.csv", open = "r")
headers <- read.csv(file_conn, nrows = 1)          # Consumes the header and first data row
repeat {
  # Read the next chunk without re-reading earlier rows
  data_chunk <- tryCatch(
    read.csv(file_conn, nrows = 1000, header = FALSE, col.names = names(headers)),
    error = function(e) NULL                       # read.csv() errors when the connection is exhausted
  )
  if (is.null(data_chunk) || nrow(data_chunk) == 0) break
  # Process the chunk here
}
close(file_conn)  # Close the connection