2.5 Using files in R

2.5.1 Reading files

Very often, the data we need to analyse is stored in external files, so the ability of R to access files is very important. Fortunately R provides a number of convenient ways to import data. One of the simplest is from a comma-separated values (CSV) file. This is also very flexible, since data is very commonly stored in Excel spreadsheets: users can save a spreadsheet as a CSV file and then use R’s built-in functionality to read and manipulate the data.

# R code to read a remote file (on GitHub) into a data frame pubsDataFrame
# Because the path url is quite long I've first copied it to a character string fname
# to improve readability

fname <- "https://raw.githubusercontent.com/mjshepperd/CS5702-Data/master/pubs.csv"
pubsDataFrame <- read.csv(fname, header = TRUE, stringsAsFactors = FALSE)

head(pubsDataFrame)

##              pubName  open        town weeklySales foodSales
## 1 The Dead Albatross  TRUE    Uxbridge        2735      1209
## 2   The Island Queen  TRUE   Islington        3644         0
## 3        Johnnys Bar FALSE Vladivostok           0         0
## 4           Red Lion  TRUE    Habrough        3263        NA
## 5          The Crown FALSE    Haccombe           0         0
## 6          Royal Oak FALSE      Haceby           0         0

In the above R code note that we use the function read.csv(), which has several arguments. The first argument is the path name of the CSV file we wish to import; in this example it is the character string fname, which holds the URL of a remote file. Next, header = TRUE tells R that the first row of the file contains column descriptions, as is typical for a spreadsheet, so these are used to name each column in R. Lastly, we have the argument stringsAsFactors = FALSE. This is something of a technicality, but in versions of R before 4.0.0 the default behaviour was to convert character (string) variables into factors (i.e., categories such as female and male), which isn’t generally the desired behaviour. Since R 4.0.0 the default is FALSE, but stating it explicitly keeps the code consistent across versions.
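To see what the stringsAsFactors argument changes, the following minimal sketch reads the same small piece of CSV text twice, once with each setting. The data are made up for illustration (reusing two pub names from the output above) and are supplied inline through the text argument of read.csv() so that no external file is needed; the names csvText, asFactors and asStrings are assumptions introduced here.

```r
# Made-up sample data for illustration, supplied inline via the text argument
csvText <- "pubName,town,weeklySales
Red Lion,Habrough,3263
The Crown,Haccombe,0"

# Read the same text with and without conversion of strings to factors
asFactors <- read.csv(text = csvText, header = TRUE, stringsAsFactors = TRUE)
asStrings <- read.csv(text = csvText, header = TRUE, stringsAsFactors = FALSE)

class(asFactors$town)   # "factor"
class(asStrings$town)   # "character"
```

Numeric columns such as weeklySales are unaffected by this argument; only character columns are converted.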

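Note that PC users need to be careful about path names. Windows uses the \ symbol to separate the levels of a hierarchy of sub-directories, whilst R uses / (similarly to macOS and Linux). Within an R character string the backslash is an escape character, so a Windows path must be written either with forward slashes or with doubled backslashes (\\).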
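As a minimal sketch of the options, the code below writes one path in three ways. The directory and file names are hypothetical, chosen only for illustration; file.path() is a base R function that joins its arguments with /.

```r
# Hypothetical Windows location, written with doubled (escaped) backslashes
winPath <- "C:\\Users\\me\\data\\pubs.csv"

# The same location written with forward slashes, which R accepts on Windows too
fwdPath <- "C:/Users/me/data/pubs.csv"

# Building the path from its components; file.path() inserts the / separators
builtPath <- file.path("C:", "Users", "me", "data", "pubs.csv")

# Converting backslashes to forward slashes in an existing string
convertedPath <- gsub("\\\\", "/", winPath)

identical(builtPath, fwdPath)
identical(convertedPath, fwdPath)
```

Building paths with file.path() avoids typing separators by hand, which helps when the same script is shared between Windows, macOS and Linux users.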

Sometimes when we read in a file, it is more convenient to be flexible about the path name rather than hard-code it. In that case we can nest the function file.choose() as the first argument, which will prompt the user to select the relevant file. For example, we could have myDataFrame <- read.csv(file.choose(), header = TRUE).

Frequently we deal with text files, since they can be processed by a wide range of software (not just one system). Typically each row ends with a newline (return) and columns are separated by some predefined special character. The most common is a comma, as in a CSV file; however, sometimes we need to deal with other separators, e.g., semicolon, space or tab. Base R provides a more general function read.table() where we can specify the separator, e.g., myDF <- read.table("myTabSepFile", sep = "\t", header = TRUE) where \t denotes a tab character. As with read.csv(), the argument header = TRUE conveys that the first row of the file contains the variable names.
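The following self-contained sketch illustrates read.table() with a tab separator. So that it can be run without any existing data, it first writes a small tab-separated file to a temporary location; the file contents and the name tmpFile are made up for illustration.

```r
# Create a small tab-separated file in a temporary location (illustrative data)
tmpFile <- tempfile(fileext = ".tsv")
writeLines(c("pubName\ttown\tweeklySales",
             "Red Lion\tHabrough\t3263",
             "The Crown\tHaccombe\t0"),
           tmpFile)

# Read it back, stating that columns are separated by tabs
# and that the first row holds the variable names
myDF <- read.table(tmpFile, sep = "\t", header = TRUE,
                   stringsAsFactors = FALSE)

str(myDF)
```

For a semicolon-separated file the only change needed would be sep = ";".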

2.5.2 Writing files

So far we have reviewed some methods for importing data into R from external files, but there are occasions when we’ll want to go the other way, i.e., exporting data from R so that the data can be archived or used by external applications.

Analogous to the read.csv() function is write.csv(), which outputs an R object (typically a data frame) to a comma-separated text file. The format is:

write.csv(myDF, "mydata.csv")

Here we are writing the data frame myDF to a file named mydata.csv. There is no need to specify the separator: write.csv() always uses commas, and attempts to set the sep or append arguments are ignored with a warning. If the file already exists, write.csv() will overwrite it. When a different separator is needed, or rows must be appended to an existing file, the more general function write.table() accepts both sep and append.
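One way to check an export is to write a data frame and read it straight back. The sketch below does this with a small made-up data frame and a file in R's temporary directory; the names outFile and checkDF are assumptions introduced here. It also sets row.names = FALSE, an optional argument that stops R writing its row names as an extra first column.

```r
# A small data frame with made-up values for illustration
myDF <- data.frame(pubName = c("Red Lion", "The Crown"),
                   weeklySales = c(3263L, 0L),
                   stringsAsFactors = FALSE)

# Write to the session's temporary directory rather than the working directory
outFile <- file.path(tempdir(), "mydata.csv")
write.csv(myDF, outFile, row.names = FALSE)

# Read the file back and compare it with the original
checkDF <- read.csv(outFile, header = TRUE, stringsAsFactors = FALSE)
all.equal(myDF, checkDF)
```

Without row.names = FALSE the file would gain an unnamed first column holding the row names, which read.csv() would then import as an extra variable.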

For a lot more detail, see the encyclopaedic R Data Import/Export manual which, whilst it might seem rather intimidating, covers almost every conceivable data import and export scenario.