Chapter 9 Importing Data

library(tidyverse)      # Our usual data manipulation tools
library(readxl)         # For reading MS Excel files
library(googlesheets4)  # For reading Google Sheets

Reading data from external sources is necessary. It is most common for data to be in a data-frame like storage, such as a MS Excel workbook, so we will concentrate on reading data into a data.frame.

In the typical way data is organized, we think of each column of data representing some trait or variable that we might be interested in. In general, we might wish to investigate the relationship between variables. In contrast, the rows of our data represent a single object on which the column traits are measured. For example, in a grade book for recording students scores throughout the semester, their is one row for every student and columns for each assignment. A greenhouse experiment data set will have a row for every plant and columns for treatment type and biomass.

9.1 Working directory

One concept that will be important is to recognize that every time you start up RStudio, it picks an appropriate working directory. This is the directory where it will first look for script files or data files. By default when you double click on an R script or Rmarkdown file to launch RStudio, it will set the working directory to be the directory that the file was in. Similarly, when you knit an Rmarkdown file, the working directory will be set to the directory where the Rmarkdown file is. For both of these reasons, I always program my scripts assuming that paths to any data files will be relative to where where my Rmarkdown file is. To set the working directory explicitly, you can use the GUI tools Session -> Set Working Directory....

The functions that we will use in this lab all accept a character string that denotes the location of the file. This location could be a web address, it could be an absolute path on your computer, or it could be a path relative to the location of your Rmarkdown file.

`'MyFile.csv'`	Look in the working directory for `MyFile.csv`.
`'data/Myfile.csv'`	In the working directory, there is a sub-directory called `data` and inside that folder there is a filed called `MyFile.csv`.
`'./data/Myfile.csv'`	In the working directory, there is a sub-directory called `data` and inside that folder there is a filed called `MyFile.csv`. The `./` is indicating The current working directory explicitly instead of implicitly.
`'../Project2/data/Myfile.csv'`	From the current working directory, go UP one directory level. In that folder there is a folder called `Project2` which in turn has a `data` directory, and inside that folder there is a filed called `MyFile.csv`. The `../` is indicating to go UP.
`'~/NAU/Research/data.csv'`	In Unix based operating systems (which includes MacOSX), a `~` denotes the user’s home directory. So this path is from my home directory, to a sub-directory `NAU` which has a Research sub-directory.

I strongly recommend using relative paths for specifying when doing data import so that the paths are not broken when you share your code with your collaborators or organize your directories to group your work. For example, if I were to specify the path ~/NAU/Research/deHeer/data-raw/FitKids.xls, then when I share my folder with Dr. deHeer’s graduate student (who works on a PC), the path won’t work. But if I code the path as data-raw/FitKids.xls and the Rmarkdown document lives in the same directory as the data-raw folder, then we are set and I can share the deHeer folder and everything will work regardless of who is running the code.

9.1.1 Recommended Project Folder Structure

When working with many projects and collaborators, it is useful to have a “standard” project directory structure so that everybody knows where certain project components should be located.

Folder	Description
`'data-raw'`	This directory has the raw data such as excel and csv files. It also contains scripts that import and wrangle the data in order to produce a clean data file.
`'data'`	This directory contains the cleaned up data in the form of `.rda` files.
`'docs'`	This directory contains all the Rmarkdown files that will generate reports and analysis summaries using the clean data.

9.2 R Binary Files

R objects have a file representation that allow for saving and loading that respects things like factor sort orders. These files are portable across all R platforms and the structure is open-source and many other programs can import this format.

Suffix	Description
`.rda` or `.RData`	One or more data objects. I prefer the `.RData` suffix as it is more descriptive, but unfortunately some operating systems don’t like the long suffix, so the `.rda` suffix is preferred.
`.rds`	A single R data object. I don’t recommend using this because it is unnecessary to have whole different suffix just to distiguish between a single object verse multiple objects.
`.rdx` and `.rdb`	A combination of files that is used when a package supports Lazy Loading. These should used primarily for packages with wide spread use.

For a given R object or objects, the function save() will save the objects into a .rda file which can be subsequently loaded using the load() command.

# In the data-raw directory, I might have script `Clean.R` that
# imports and cleans data. At the end of that script, I would save
# the file into the project data director
save(trees, file = 'data/trees.rda')      # If working directory is the project
save(trees, file = '../data/trees.rda')   # If working directory is data-raw

Generally I’ll have my the scripts that generate an analysis first import the cleaned up data

load('data/trees.rda')      # If working directory is the project folder
load('../data/trees.rda') # If working directory is the project/docs folder

If your project only has a few data files, one cleaning script, and one analysis Rmarkdown file, it is tempting to ignore creating the data-raw, data, and docs directories. However it is really obnoxious to impose order when things are already out of control and you need to modify 10 or 20 scripts. If there is any chance of a project growing in scope beyond just myself, it pays to organize where your raw-data, clean data, and analysis scripts live at the beginning of the project.

9.3 Comma Separated Data

To consider how data might be stored, we first consider the simplest file format… the comma separated values file. In this file time, each of the “cells” of data are separated by a comma. For example, the data file storing scores for three students might be as follows:

Able, Dave, 98, 92, 94
Bowles, Jason, 85, 89, 91
Carr, Jasmine, 81, 96, 97

Typically when you open up such a file on a computer with Microsoft Excel installed, Excel will open up the file assuming it is a spreadsheet and put each element in its own cell. However, you can also open the file using a more primitive program (say Notepad in Windows, TextEdit on a Mac) you’ll see the raw form of the data.

Having just the raw data without any sort of column header is problematic (which of the three exams was the final??). Ideally we would have column headers that store the name of the column.

LastName, FirstName, Exam1, Exam2, FinalExam
Able, Dave, 98, 92, 94
Bowles, Jason, 85, 89, 91
Carr, Jasmine, 81, 96, 97

To see another example, open the “Body Fat” dataset from the Lock $^{5}$ introductory text book at the website [http://www.lock5stat.com/datasets3e/BodyFat.csv]. The first few rows of the file are as follows:

Bodyfat,Age,Weight,Height,Neck,Chest,Abdomen,Ankle,Biceps,Wrist
32.3,41,247.25,73.5,42.1,117,115.6,26.3,37.3,19.7
22.5,31,177.25,71.5,36.2,101.1,92.4,24.6,30.1,18.2
22,42,156.25,69,35.5,97.8,86,24,31.2,17.4
12.3,23,154.25,67.75,36.2,93.1,85.2,21.9,32,17.1
20.5,46,177,70,37.2,99.7,95.6,22.5,29.1,17.7

To make R read in the data arranged in this format, we need to tell R three things:

Where does the data live? Often this will be the name of a file on your computer, but the file could just as easily live on the internet (provided your computer has internet access).
Is the first row data or is it the column names?
What character separates the data? Some programs store data using tabs to distinguish between elements, some others use white space. R’s mechanism for reading in data is flexible enough to allow you to specify what the separator is.

The primary function that we’ll use to read data from a file and into R is the function read.table(). This function has many optional arguments but the most commonly used ones are outlined in the table below.

Argument	Default	What it does
`file`		A character string denoting the file location
`header`	FALSE	Is the first line column headers?
`sep`	" "	What character separates columns. " " == any whitespace
`skip`	0	The number of lines to skip before reading data. This is useful when there are lines of text that describe the data or aren’t actual data
`na.strings`	‘NA’	What values represent missing data. Can have multiple. E.g. `c('NA', -9999)`
`quote`	" and ’	For character strings, what characters represent quotes.

To read in the “Body Fat” data set we could run the R command:

BodyFat <- read.table( 
  file   = 'http://www.lock5stat.com/datasets3e/BodyFat.csv', # where the data lives
  header = TRUE,                                              # first line is column names
  sep    = ',' )                                              # Data is separated by commas

str(BodyFat)

## 'data.frame':    100 obs. of  10 variables:
##  $ Bodyfat: num  32.3 22.5 22 12.3 20.5 22.6 28.7 21.3 29.9 21.3 ...
##  $ Age    : int  41 31 42 23 46 54 43 42 37 41 ...
##  $ Weight : num  247 177 156 154 177 ...
##  $ Height : num  73.5 71.5 69 67.8 70 ...
##  $ Neck   : num  42.1 36.2 35.5 36.2 37.2 39.9 37.9 35.3 42.1 39.8 ...
##  $ Chest  : num  117 101.1 97.8 93.1 99.7 ...
##  $ Abdomen: num  115.6 92.4 86 85.2 95.6 ...
##  $ Ankle  : num  26.3 24.6 24 21.9 22.5 22 23.7 21.9 24.8 25.2 ...
##  $ Biceps : num  37.3 30.1 31.2 32 29.1 35.9 32.1 30.7 34.4 37.5 ...
##  $ Wrist  : num  19.7 18.2 17.4 17.1 17.7 18.9 18.7 17.4 18.4 18.7 ...

Looking at the help file for read.table() we see that there are variants such as read.csv() that sets the default arguments to header and sep more intelligently. Also, there are many options to customize how R responds to different input.

9.4 MS Excel

Commonly our data is stored as a MS Excel file. There are two approaches you could use to import the data into R.

From within Excel, export the worksheet that contains your data as a comma separated values (.csv) file and proceed using the tools in the previous section. Any formulas in the worksheet are not saved, but the result of the calculation is presented.
Use functions within R that can work with the worksheet directly. Critically formulas in Excel are not imported, but the result of the formula is. One package that works nicely for this is the readxl package.

I generally prefer using option 2 because all of my collaborators can’t live without Excel and I’ve resigned myself to this. However if you have complicated formulas in your Excel file, it is often times safer to export it as a .csv file to guarantee the data imported into R is correct. Furthermore, other spreadsheet applications (such as Google sheets) requires you to export the data as a .csv file so it is good to know both paths.

Because R can only import a complete worksheet, the desired data worksheet must be free of notes to yourself about how the data was collected, preliminary graphics, or other stuff that isn’t the data. I find it very helpful to have a worksheet in which I describe the sampling procedure and describe what each column means (and give the units!), then a second worksheet where the actual data is, and finally a third worksheet where my “Excel Only” collaborators have created whatever plots and summary statistics they need.

The simplest package for importing Excel files seems to be the package readxl. Another package that does this is the XLConnect which does the Excel -> .csv conversion using Java. Another package the works well is the xlsx package, but it also requires Java to be installed. The nice thing about these two packages is that they also allow you to write Excel files as well. The RODBC package allows R to connect to various databases and it is possible to make it consider an Excel file as an extremely crude database.

The readxl package provides a function read_exel() that allows us to specify which sheet within the Excel file to read and what character specifies missing data (it assumes a blank cell is missing data if you don’t specifying anything). For the most part, the arguments are the same as read.csv but below are the most important changes and additions.

Argument	Meaning
`path`	The file argument is called `path` instead.
`sheet`	Which sheet to read. Either the sheet name or sheet number.
`range`	The cell range to read from. E.g. “A5:G98”

From GitHub, download the files Example_1.xls, through Example_5.xls, from the directory [https://github.com/dereksonderegger/444/tree/master/data-raw]. Place these files in the same directory that you store your course work or make a subdirectory data to store the files in. Make sure that the working directory that RStudio is using is that same directory (Session -> Set Working Directory).

# load the library that has the read.xls function. 
library(readxl)

# Where does the data live relative to my current working location? 
#
# In my directory where this Rmarkdown file lives, I have made a subdirectory 
#    named 'data-raw' to store all the data files. So the path to my data
#    file will be 'data-raw/Example_1.xls'.
# If you stored the files in the same directory as your RMarkdown script, you
#    don't have to add any additional information and you can just tell it the
#    file name 'Example_1.xls'
# Alternatively I could give the full path to this file starting at the root
#    directory which, for me, is '~/GitHub/444/data-raw/Example_1.xls'
#    but for Windows users it might be 'Z:/444/Lab9/Example_1.xls'. This looks
#    odd because Windows usually uses a backslash to represent the directory
#    structure, but a backslash has special meaning in R and so it wants 
#    to separate directories via forward slashes.

# read the first worksheet of the Example_1 file
# data.1 <- read_excel('~/GitHub/444/data-raw/Example_1.xls')  # absolute path
data.1 <- read_excel( 'data-raw/Example_1.xls'  )   # relative to this Rmarkdown file

# read the second worksheet where the second worksheet is named 'data'
data.2 <- read_excel('data-raw/Example_2.xls', sheet=2     )   
data.2 <- read_excel('data-raw/Example_2.xls', sheet='data')

There is one additional problem that shows up while reading in Excel files. Blank columns often show up in Excel files because at some point there was some text in a cell that got deleted but a space remains and Excel still thinks there is data in the column. To fix this, you could find the cell with the space in it, or you can select a bunch of columns at the edge and delete the entire columns. Alternatively, you could remove the column after it is read into R using typical data frame manipulation tools.

Open up the file Example_4.xls in Excel and confirm that the data sheet has name columns out to carb. Read in the data frame using the following code:

data.4 <- read_excel('data-raw/Example_4.xls', sheet='data')   # Extra Column Example

## New names:
## * `` -> ...13
## * `` -> ...14

str(data.4)

## tibble[,14] [34 × 14] (S3: tbl_df/tbl/data.frame)
##  $ model: chr [1:34] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg  : num [1:34] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl  : num [1:34] 6 6 4 6 8 6 8 4 4 6 ...
##  $ disp : num [1:34] 160 160 108 258 360 ...
##  $ hp   : num [1:34] 110 110 93 110 175 105 245 62 95 123 ...
##  $ drat : num [1:34] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt   : num [1:34] 2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec : num [1:34] 16.5 17 18.6 19.4 17 ...
##  $ vs   : num [1:34] 0 0 1 1 0 1 0 1 1 1 ...
##  $ am   : num [1:34] 1 1 1 0 0 0 0 0 0 0 ...
##  $ gear : num [1:34] 4 4 4 3 3 3 3 4 4 4 ...
##  $ carb : num [1:34] 4 4 1 1 2 1 4 2 2 4 ...
##  $ ...13: logi [1:34] NA NA NA NA NA NA ...
##  $ ...14: logi [1:34] NA NA NA NA NA NA ...

We notice that after reading in the data, there is an additional column that just has missing data (the NA stands for not available which means that the data is missing) and a row with just a single blank. Go back to the Excel file and go to row 4 column N and notice that the cell isn’t actually blank, there is a space. Delete the space, save the file, and then reload the data into R. You should notice that the extra columns are now gone.

9.5 Google Sheets

The Google alternative to MS Excel is Google Sheets where the data lives on Google’s cloud services. You could download a .csv version and import it but it is more convenient to use Google’s Sheets API, which is now on version 4. The R package that implements the API is googlesheets4. This package is built using a few other packages, notably the googledrive package which allows a user to access the contents of a Google drive.

It is useful to know that Google Sheets supports a robust notation for specifying cell ranges:

Sheet1!A1:B2 refers to the first two cells in the top two rows of Sheet1.
Sheet1!A:A refers to all the cells in the first column of Sheet1.
Sheet1!1:2 refers to all the cells in the first two rows of Sheet1.
Sheet1!A5:A refers to all the cells of the first column of Sheet 1, from row 5 onward.
A1:B2 refers to the first two cells in the top two rows of the first visible sheet.
Sheet1 refers to all the cells in Sheet1.

The first time you run a googledrive function, it will ask you to authenticate to the Google services and to store the authentication key in your R profile. It will warn you that you are allowing R to access your data. You’ll need to agree to this and R will store this authentication token.

9.5.1 Reading

The main function to read from a Google Sheet is read_sheet() The most pertinent parameters are below, but there are a few other interesting options.

Parameter	Description
`ss`	Something that identifies the Google sheet: a file ID, a URL or `dribble` which is the “drive tibble” that the `googledrive` package uses to represent Drive files.
`sheet`	This represents the tab within the spreadsheet file. It can be the numerical position or tab name.
`range`	The cell range to read from.

# The following link is to a data set on my Google Drive STA 444 directory
# I have it set so that anyone with the link can read the speadsheet so the
# code here will work for anybody.
link <- 'https://docs.google.com/spreadsheets/d/1pPbeJOqlT3N9fEXfO2sJbYddVWgeB8VSt61V0WNK6Vc'

# Just using the direct link
df <- read_sheet(link, sheet='Data')
head(df)

This works well for individual Sheets that you have either an ID or URL for. However, it would be nice to not have to look that information and instead give the path within your Google Drive. To do that, we need to authenticate and give R the rights to access the drive.

If your code is running in a non-interactive environment (for example when knitting your Rmarkdown document) you’ll need to make the authentication happen by finding a prior authorization token.

# Force google drive to use a prior cached authentication token...
# Notice I still had to agree to letting R have access to my Google Drive
# at some point in time!
options(gargle_oauth_email = "derek.sonderegger@gmail.com")

You can read about the details of how/why this works at the gargle package vinnette about non-interactive authentication

# Now read using the path within my Google Drive
df <- googledrive::drive_get('STA_444/CherryTrees') %>%
        read_sheet( sheet='Data' )

Another consideration is that Google limits the number of read requests in a short period of time and if you are tweaking your Rmarkdown, you might run into this issue. You might try caching the R chunk.

9.5.2 Writing

The googlesheets4 package also supports writing to the spreadsheet.

The function write_sheet writes a data frame to a sheet, whereas range_write will write data to a specific location within a sheet.

# Here I'm only allowed to write to the Sheet because I've authenticated
# with Google. I have this particular sheet set up so that anybody with the
# link can read the file, but cannot write to the file. So this code should
# not work for you!
write_sheet(data=trees, ss=link, sheet='Data')

9.6 Multiple files

There are several cases where our data is stored in multiple files and we want to read them in. If the data sources all have an identical column format, we can just stack the data frames together using the rbind command.

files <- c('file1.csv', 'file2.csv', 'file3.csv')  # Files to be read in.
data <- NULL   # what will the output data frame be named
for( file in files){  # for each element of our files vector
  temp.data <- read.csv(file)     # read in the file 
  data <- rbind(data, temp.data)  # Append it to our final data set
}

In the example above, we might need to modify the read.csv() command, but fortunately it isn’t too hard to combine a for loop with set of statements to read in the data set.

Generally when I have to read in multiple files, I need to inspect a directory and read all the files. Fortunately the function list.files() will return a of character strings containing all the files in a directory.

files <- list.files('~/GitHub/141/data-raw/') 
files %>% head()

## [1] "birth_weights.csv"       "cities.csv"             
## [3] "CO2.csv"                 "COVID-19"               
## [5] "COVID-19_LectureContent" "Export_R_Datasets.R"

9.7 Exercises

Download from GitHub the data file Example_5.xls. Open it in Excel and figure out which sheet of data we should import into R. At the same time figure out how many initial rows need to be skipped. Import the data set into a data frame and show the structure of the imported data using the str() command. Make sure that your data has $n=31$ observations and the three columns are appropriately named. If you make any modifications to the data file, comment on those modifications.
Download from GitHub the data file Example_3.xls. Import the data set into a data frame and show the structure of the imported data using the tail() command which shows the last few rows of a data table. Make sure the Tesla values are NA where appropriate and that both -9999 and NA are imported as NA values. If you make any modifications to the data file, comment on those modifications.
Download all of the files from GitHub data-raw/InsectSurveys directory which I’ve linked here. Each month’s file contains a sheet contains site level information about each of the sites that was surveyed. The second sheet contains information about the number of each species that was observed at each site. Import the data for each month and create a single site data frame with information from each month. Do the same for the observations. Document any modifications you make to the data files. Comment on the importance of consistency of your data input sheets.