11 Reading and Writing Data
Data comes from various sources and in various formats. They can be structured and unstructured. They can be numbers, texts, networks, audio, and visual and more.
In this chapter, we discuss how to read and write files in R. We’ll primarily be working with structured data. We’ll discuss unstructured textual data in the case study in the final module.
In R, we will be primarily working with data frames. A data frame is the recommended R data structure for storing and representing tabular data. If the data is not already a data frame, we will turn it into a format that R is able to process.
11.1 Datasets available in installed packages
First of all, before we load any external datasets, the base package provides a variety of toy datasets in the package datasets
. data()
lists all the datasets in datasets
.
Additionally, many add-on packages contain datasets that accompany the functions for users to test the code and explore the functions. We can load those datasets after loading the package. data(package = .packages(all.available = TRUE))
lists all datasets in all available packages.
To load a dataset from a package, we can use the function data()
. For example, data("starwars", package = "dplyr")
loads the starwars
dataset in the package dplyr
.
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list>
## 1 Luke Skyw… 172 77 blond fair blue 19 male mascu… Tatooine Human <chr> <chr>
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooine Droid <chr> <chr>
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… Naboo Droid <chr> <chr>
## # ℹ 1 more variable: starships <list>
We can also use the pattern pkg::name
to load an object from a package.
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list>
## 1 Luke Skyw… 172 77 blond fair blue 19 male mascu… Tatooine Human <chr> <chr>
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooine Droid <chr> <chr>
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… Naboo Droid <chr> <chr>
## # ℹ 1 more variable: starships <list>
11.2 Reading tabular text files
In terms of external datasets, the format of data that we may be most familiar with is probably spreadsheet-like text files. These are structured data in a tabular, or rectangular, shape.
In these files, each line is a record/case/observation. A record contains variables (columns/fields/dimensions/features) associated with that record; the variables are often separated by a one-character delimiter, such as a whitespace, tab, colon, or comma. Each record contains the same number of columns.
Tabular data is commonly stored in comma-separated values and tab-delimited values files. The comma-separated values file, CSV file, is a popular format for storing data, and many programs can import and export data in that format.
In base R, the function read.table()
and its family members read.csv()
and read.delim()
can read tabular text files. read.csv()
and read.delim()
(and a few others) are convenience functions based on read.table()
to read comma-separated and tab-separated files.
readr
provides read.table()
equivalents that are faster than base R functions. As with the read.table()
family, read_delim()
reads a delimited file, and read_csv()
and read_tsv()
are special cases of the general read_delim()
function.
Both groups of functions read data from our working directory, and then create a data frame.
We can also use both functions to read CSV files via a URL. They substitute a URL for a file name.
The readr
way:
header, separator, skipping lines, encoding
When reading files into R, there are a few elements in a file that we should keep an eye on: the header, the first few lines of a file, the separator, and encoding of the file. Does the first line of the file contain the names of the variables? What is the correct field separator character? Do the first few lines need to be skipped, and how many? Besides, we need to make sure that character fields are read in the correct encoding (e.g., when we work with Chinese characters), unless the file to be imported from is entirely in ASCII.
Let’s see an sample file. In this dataset, the first three lines have no data. The fourth line contains variable names. Then the next line is still empty.
If we use read.table()
, the first few lines that don’t have data items are skipped. By default header
is FALSE
. The delimiter is whitespace by default.
## V1 V2 V3
## 1 ticker evtdate car
## 2 EGAS 12FEB2014 0.02356
## 3 EGAS 27JUL2012 -0.00276
## 4 HGR 03JUL2002 0.01478
## 5 HGR 03MAY2005 -0.05070
## 6 HGR 05AUG2002 0.02337
We can also use the readr
function to read tabular data. Here we have specified the col_names
to be turned off, and skip
to be 3.
## # A tibble: 6 × 3
## X1 X2 X3
## <chr> <chr> <chr>
## 1 ticker evtdate car
## 2 EGAS 12FEB2014 0.02356
## 3 EGAS 27JUL2012 -0.00276
## 4 HGR 03JUL2002 0.01478
## 5 HGR 03MAY2005 -0.05070
## 6 HGR 05AUG2002 0.02337
With both methods, R has generated column names for us (V1, V2, and V3 from read.table()
; X1, X2 and X3 from read_table2()
).
Use ?read_table2
to find out the difference between read_table()
and read_table2()
.
We can overwrite these column names by names()
.
## ticker evtdate car
## 1 ticker evtdate car
## 2 EGAS 12FEB2014 0.02356
## 3 EGAS 27JUL2012 -0.00276
And then drop the first row.
## ticker evtdate car
## 2 EGAS 12FEB2014 0.02356
## 3 EGAS 27JUL2012 -0.00276
## 4 HGR 03JUL2002 0.01478
## 5 HGR 03MAY2005 -0.05070
## 6 HGR 05AUG2002 0.02337
## 7 HGR 20JAN2004 0.04206
11.3 Writing to CSV files
If we want to save a data frame to a CSV file, one option is to use write.csv()
, or write_csv()
from readr
.
write_csv(x, "filename.csv")
The function takes a data frame and a file to write to. The file is saved to our working directory. By default the columns names are written to the file (col_names = TRUE
).
11.4 Reading other text documents for storing data
In addition to delimiter-separated values files, there are many more formats and standards of text documents for storing data. These include JavaScript Object Notation (JSON), and eXtensible Markup Language (XML).
If you collect data from the Internet, most likely data is stored in JSON formats. JSON is the de facto standard for web applications to pass data between themselves. However, JSON is not native to R. Therefore, we need to translate it to a data structure friendly to R, such as a list, or a data frame.
In R, rjson
and jsonlite
, among other packages, read and write JSON data. jsonlite
has additional cleaning functionalities (e.g., flattening nested data frames) and can read ndjson data.
If we are lucky enough, we can use jsonlite::fromJSON()
to directly read JSON objects into R. But quite often, we need to take extra steps before R can readily read JSON objects. For instance, if we download the Yelp dataset from Kaggle, we need a middleman to help read those datasets to R data structures. See the discussions below.
11.5 Reading unstructured texts
To read unstructured texts that are not organized into tabular forms, we can use readLines()
from base R or read_lines()
from the package readr
. These functions read individual lines and then process them as strings to extract data items. The functions will return a character vector, with one value corresponding to each row in the file.
Back to the Yelp dataset:
library(jsonlite)
con <- read_lines("yelp_academic_dataset_business.json")
df <- stream_in(textConnection(con))
stream_in()
implements line-by-line processing of JSON data in the ndjson format over a connection (e.g., url, file).
Reading unstructured data as corpora for text analysis is another story. In these cases, we can use packages that are developed specifically for text mining or natural language processing.
11.6 Reading Excel files
Package readxl
and package xlsx
allow us to deal with Excel files of different versions conveniently. Note that xlsx
requires Java.
library(readxl)
dataset <- read_excel()
Excel file format is proprietary. When you have a choice, do not save your files to proprietary file formats so that your data will not be locked into a platform that you lack control of.
11.7 Reading Stata, SPSS, SAS files
Package haven
provides methods to read data from the major statistical software Stata, SPSS and SAS.
SPSS
library(haven)
dataset <- read_sav()
Stata
library(haven)
dataset <- read_stata()
SAS
library(haven)
dataset <- read_sas()
Package foreign
reads data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, and dBase. Note that foreign
does not support Stata after version 12.
11.8 Reading data from databases
There are many R packages developed to work with relational and non-relational databases. Look up for the package that can connect to your database on this CRAN Task View page Databases with R.
11.9 Reading many files
Sometimes, we have hundreds of datasets in our working directory to read into our workspace. They are stored in the same file format, and record the same sets of attributes of a topic we are probing into.
As a first step, we can use list.files()
to list all the files in the working directory, and observe if there is a pattern pertinent to all files. For instance, they have the same suffix, or their file names contain the same string.
Next, we may append them vertically into a new data frame in a batch.