11 Inputs and Outputs

Data come from various sources and in various formats. They can be structured or unstructured; they can be numbers, text, networks, audio, video, and more.

In this chapter, we discuss how to read and write files in R. We’ll primarily be working with structured data. We’ll discuss unstructured textual data in the case study in the final module.

In the language of R, we will be primarily working with data frames. A data frame is the recommended R data structure for storing and representing tabular data. If the data is not already a data frame, we will turn it into a format that R is able to process.

11.1 Datasets available in installed packages

Before we load any external datasets, note that R ships with a variety of toy datasets in the package datasets, which is loaded by default. Calling data() lists all the datasets in datasets.

Additionally, many add-on packages contain datasets that accompany the functions for users to test the code and explore the functions. We can load those datasets after loading the package. data(package = .packages(all.available = TRUE)) lists all datasets in all available packages.

To load a dataset from a package, we can use the function data(). For example, data("starwars", package = "dplyr") loads the starwars dataset in the package dplyr.

library(dplyr)
data("starwars", package = "dplyr")
head(starwars, 3)
## # A tibble: 3 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue         19 male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow      112 none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red          33 none  mascu… Naboo  
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

We can also use the pkg::name syntax to access a dataset from a package without attaching the package.

starwars <- dplyr::starwars
head(starwars, 3)
## # A tibble: 3 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue         19 male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow      112 none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red          33 none  mascu… Naboo  
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

11.2 Reading tabular text files

Among external datasets, the format we are probably most familiar with is the spreadsheet-like text file. These files hold structured data in a tabular, or rectangular, shape.

In these files, each line is a record/case/observation. A record contains variables (columns/fields/dimensions/features) associated with that record; the variables are often separated by a one-character delimiter, such as a space, tab, colon, or comma. Each record contains the same number of columns.

Tabular data is commonly stored in comma-separated values and tab-delimited values files. The comma-separated values file, CSV file, is a popular format for storing data, and many programs can import and export data in that format. For instance, R, Excel and most statistical packages allow us to import and export CSV files easily.


In base R, the function read.table() and its family members read.csv() and read.delim() can read these tabular text files. read.csv() and read.delim() (and a few others) are convenience functions based on read.table() to read comma-separated and tab-separated files.

sp500tickers <- read.csv("sp500tickers.csv")
head(sp500tickers)
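If you don't have the sp500tickers.csv file at hand, the mechanics are easy to try with a throwaway file. The sketch below (with invented column names and values) writes a tiny CSV to a temporary file and reads it back with read.csv():

```r
# Hedged sketch: a made-up two-column CSV written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ticker,price",
             "AAPL,150.1",
             "MSFT,250.2"), csv_path)

prices <- read.csv(csv_path)
str(prices)     # a data frame: ticker is character, price is numeric
head(prices)
```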

readr provides read.table() equivalents that are faster than base R functions. As with the read.table() family, read_delim() reads a delimited file, and read_csv() and read_tsv() are special cases of the general read_delim() function.

library(readr)
sp500tickers <- read_csv("sp500tickers.csv")
head(sp500tickers)

Both groups of functions read files relative to our working directory and return a data frame (the readr functions return a tibble, a modern variant of the data frame).


We can also use both functions to read CSV files via a URL: simply supply the URL in place of a file name.

read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

The readr way:

read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

header, separator, skipping lines, encoding

When reading files into R, a few elements of the file deserve attention: the header, the first few lines, the field separator, and the file encoding. Does the first line of the file contain the names of the variables? What is the correct field separator character? Do the first few lines need to be skipped, and if so, how many? In addition, unless the file is entirely in ASCII, we need to make sure that character fields are read in the correct encoding (e.g. when we work with Chinese characters).
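Encoding is the easiest of these to get wrong silently. As a hedged, self-contained sketch (the file and its contents are invented), we can write a UTF-8 file and then declare the encoding explicitly when reading it, in base R via fileEncoding and in readr via locale():

```r
# Write a small UTF-8 file containing non-ASCII characters
path <- tempfile(fileext = ".csv")
con <- file(path, open = "w", encoding = "UTF-8")
writeLines(c("name,city", "Li,\u5317\u4eac"), con)
close(con)

# Base R: declare the file encoding when reading
d1 <- read.csv(path, fileEncoding = "UTF-8")

# readr: set the encoding through the locale argument
library(readr)
d2 <- read_csv(path, locale = locale(encoding = "UTF-8"))
```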

Let’s look at a sample file. In this dataset, the first three lines contain no data, the fourth line contains the variable names, and the fifth line is again empty.

If we use read.table(), blank lines are skipped by default (blank.lines.skip = TRUE), header defaults to FALSE, and the delimiter defaults to whitespace.

output <- read.table("output.txt")
head(output)
##       V1        V2       V3
## 1 ticker   evtdate      car
## 2   EGAS 12FEB2014  0.02356
## 3   EGAS 27JUL2012 -0.00276
## 4    HGR 03JUL2002  0.01478
## 5    HGR 03MAY2005 -0.05070
## 6    HGR 05AUG2002  0.02337
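Alternatively, we can tell read.table() about the file's layout directly. The sketch below recreates a file shaped like the sample (three empty lines, a header line, then data; the values are copied from the output above) and reads it with skip and header set, so the header row becomes the column names and the remaining columns get sensible types:

```r
# Recreate a file with the same shape as the sample
path <- tempfile(fileext = ".txt")
writeLines(c("", "", "",
             "ticker evtdate car",
             "",
             "EGAS 12FEB2014 0.02356",
             "HGR  03JUL2002 0.01478"), path)

# Skip the three empty lines and treat the next line as the header
out <- read.table(path, skip = 3, header = TRUE)
str(out)   # car is read as numeric, not character
```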

We can also use readr to read this tabular data. Here we set col_names = FALSE and skip = 3 to skip the first three lines.

library(readr)
output2 <- read_table2("output.txt", col_names = FALSE, skip = 3)
head(output2)
## # A tibble: 6 × 3
##   X1     X2        X3      
##   <chr>  <chr>     <chr>   
## 1 ticker evtdate   car     
## 2 EGAS   12FEB2014 0.02356 
## 3 EGAS   27JUL2012 -0.00276
## 4 HGR    03JUL2002 0.01478 
## 5 HGR    03MAY2005 -0.05070
## 6 HGR    05AUG2002 0.02337

With both methods, R has generated column names for us (V1, V2, and V3 from read.table(); X1, X2, and X3 from read_table2()). Use ?read_table2 to find out the difference between read_table() and read_table2().

We can overwrite these column names with names().

names(output) <- output[1,]
head(output, 3)
##   ticker   evtdate      car
## 1 ticker   evtdate      car
## 2   EGAS 12FEB2014  0.02356
## 3   EGAS 27JUL2012 -0.00276

And then drop the first row.

output_new <- output[-1, ]
head(output_new)
##   ticker   evtdate      car
## 2   EGAS 12FEB2014  0.02356
## 3   EGAS 27JUL2012 -0.00276
## 4    HGR 03JUL2002  0.01478
## 5    HGR 03MAY2005 -0.05070
## 6    HGR 05AUG2002  0.02337
## 7    HGR 20JAN2004  0.04206
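One caveat: because the header row was read as data, every column is still character, including car. In R 4.0 and later, type.convert() can re-infer the column types. A hedged sketch, using a small hand-built copy of the data:

```r
# Columns are all character after promoting the first row to names
output_new <- data.frame(ticker  = c("EGAS", "HGR"),
                         evtdate = c("12FEB2014", "03JUL2002"),
                         car     = c("0.02356", "0.01478"))

# Re-infer column types (requires R >= 4.0 for the data frame method)
output_new <- type.convert(output_new, as.is = TRUE)
str(output_new)   # car is now numeric
```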

more on controlling input file

Both read.table() and the readr functions have many parameters for controlling how they read and interpret the input file; check the help pages for details. For instance, both let us specify which strings should be treated as missing values (na.strings in read.table(), na in readr).
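As a self-contained sketch (the file content and missing-value codes are invented), suppose a file marks missing scores as "N/A" or "-999":

```r
# A made-up file with two different missing-value codes
path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,99", "2,N/A", "3,-999"), path)

# Base R uses na.strings; readr uses the na argument
d1 <- read.csv(path, na.strings = c("N/A", "-999"))
library(readr)
d2 <- read_csv(path, na = c("N/A", "-999"))

sum(is.na(d1$score))   # 2
```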

“Import Dataset” feature

In RStudio, we can also use the convenient “Import Dataset” feature to read files.

11.3 Writing to CSV files

If we want to save a data frame to a CSV file, one option is to use write.csv(), or write_csv() from readr.

write_csv(x, "filename.csv")

The function takes a data frame and a file name to write to. The file is saved to our working directory. By default the column names are written to the file (col_names = TRUE).
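A quick round trip makes the behavior concrete. This sketch (with an invented data frame and a temporary file) writes with base R's write.csv() and reads the result back; note that write.csv() adds row names unless told otherwise, while readr's write_csv() never does:

```r
# A small made-up data frame
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
path <- tempfile(fileext = ".csv")

# Suppress the row-name column so the file round-trips cleanly
write.csv(df, path, row.names = FALSE)
df2 <- read.csv(path)
identical(dim(df), dim(df2))   # TRUE
```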

11.4 Reading other text documents for storing data

In addition to delimiter-separated values files, there are many more formats and standards of text documents for storing data. These include JavaScript Object Notation (JSON), and eXtensible Markup Language (XML).

If you collect data from the Internet, the data is most likely stored in JSON format. JSON is the de facto standard for web applications to pass data between themselves. However, JSON is not native to R, so we need to translate it into a data structure friendly to R, such as a list or a data frame.

In R, rjson and jsonlite, among other packages, read and write JSON data. jsonlite has additional cleaning functionalities (e.g. flattening nested data frames) and can read ndjson data.

If we are lucky enough, we can use jsonlite::fromJSON() to directly read JSON objects into R. But quite often, we need to take extra steps before R can readily read JSON objects. For instance, if we download the Yelp dataset from Kaggle, we need a middleman to help read those datasets to R data structures. See the discussions below.
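When the JSON is well-formed, fromJSON() accepts a string, file path, or URL, and simplifies arrays of objects into a data frame. A minimal sketch with invented records:

```r
library(jsonlite)

# A made-up JSON array of two business records
json <- '[{"name":"A","stars":4.5},{"name":"B","stars":3.0}]'

biz <- fromJSON(json)   # simplified into a 2-row data frame
biz$stars
```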

11.5 Reading unstructured texts

To read unstructured texts that are not organized into tabular form, we can use readLines() from base R or read_lines() from the package readr. These functions read individual lines, which we can then process as strings to extract data items. Both return a character vector with one element per line of the file.
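A self-contained sketch (the file content is invented): read raw lines from a file, then split each line into fields with string functions:

```r
# A made-up two-line text file
path <- tempfile(fileext = ".txt")
writeLines(c("alpha 1", "beta 2"), path)

lines <- readLines(path)    # character vector, one element per line
length(lines)               # 2
strsplit(lines, " ")        # split each line into its fields
```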

Back to the Yelp dataset:

library(jsonlite)
library(readr)
lines <- read_lines("yelp_academic_dataset_business.json")
df <- stream_in(textConnection(lines))

stream_in() implements line-by-line processing of JSON data in the ndjson format over a connection (e.g. url, file).
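To see the same pipeline without the Yelp file, here is a sketch with two invented ndjson records, each line one JSON object:

```r
library(jsonlite)

# Two made-up ndjson records standing in for the Yelp file
ndjson <- c('{"name":"A","stars":4.5}',
            '{"name":"B","stars":3.0}')

df <- stream_in(textConnection(ndjson), verbose = FALSE)
nrow(df)   # 2
```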

Reading unstructured data as corpora for text analysis is another story. In these cases, we can use packages that are developed specifically for text mining or natural language processing.

11.6 Reading Excel files

Packages readxl and xlsx allow us to deal with Excel files of different versions conveniently. Note that xlsx requires Java.

library(readxl)
dataset <- read_excel("dataset.xlsx")

Excel file format is proprietary. When you have a choice, do not save your files to proprietary file formats so that your data will not be locked into a platform that you lack control of.

11.7 Reading Stata, SPSS, SAS files

Package haven provides methods to read data from the major statistical software Stata, SPSS and SAS.

SPSS

library(haven)
dataset <- read_sav("dataset.sav")

Stata

library(haven)
dataset <- read_stata("dataset.dta")

SAS

library(haven)
dataset <- read_sas("dataset.sas7bdat")

Package foreign reads data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, and dBase. Note that foreign does not support Stata after version 12.

11.8 Reading data from databases

Many R packages have been developed to work with relational and non-relational databases. Look up the package that can connect to your database on the CRAN Task View page Databases with R.

11.9 Reading many files

Sometimes we have hundreds of datasets in our working directory to read into our workspace. They are stored in the same file format and record the same set of attributes of the topic we are studying.

As a first step, we can use list.files() to list all the files in the working directory, and observe if there is a pattern pertinent to all files. For instance, they have the same suffix, or their file names contain the same string.

Next, we may write a loop to append them vertically into a new data frame.

# create an empty data frame to accumulate the results
df <- data.frame()

# list.files() takes a regular expression, so anchor the extension;
# it returns a character vector of matching file names
files <- list.files(pattern = "\\.csv$")

# append the data frames vertically into one new data frame
library(readr)
for (i in seq_along(files)) {
  temp <- read_csv(files[i])
  df <- rbind(df, temp)
}
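To try this end to end without a prepared directory, the sketch below (file names and contents invented) creates three small CSVs in a temporary directory, lists them, and binds them in one step with lapply() and do.call(), which avoids growing a data frame inside a loop:

```r
# Create three made-up CSVs in a fresh temporary directory
dir <- file.path(tempdir(), "many_csvs")
dir.create(dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(id = i, value = i * 10),
            file.path(dir, paste0("part", i, ".csv")),
            row.names = FALSE)
}

# List the files and stack them vertically in one call
files <- list.files(dir, pattern = "^part.*\\.csv$", full.names = TRUE)
df <- do.call(rbind, lapply(files, read.csv))
nrow(df)   # 3
```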

11.10 Saving and loading objects

We can save the objects we have created in an R session with save(), and load them into our workspace later with load().

save(object, file = "object.RData")
load("object.RData")
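A self-contained sketch (object and file name are invented): save an object to a file, remove it from the workspace, and restore it; load() brings the object back under its original name:

```r
# A made-up object to save
scores <- c(88, 92, 75)
path <- tempfile(fileext = ".RData")
save(scores, file = path)

rm(scores)    # remove the object from the workspace
load(path)    # restores it under its original name
scores
```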

11.11 Collecting web data

To read data from the Internet into R directly, we either download them via an API provided by the online service that we would like to access, or scrape the data contained inside an HTML or XML page. We’ll go back to this topic in the chapter Collecting Web Data.