Chapter 9 Importing Data
library(tidyverse) # Our usual data manipulation tools
library(readxl) # For reading MS Excel files
library(googlesheets4) # For reading Google Sheets
Reading data from external sources is necessary. It is most common for data to
be in a data-frame like storage, such as a MS Excel workbook, so we will concentrate
on reading data into a data.frame
.
In the typical way data is organized, we think of each column of data representing some trait or variable that we might be interested in. In general, we might wish to investigate the relationship between variables. In contrast, the rows of our data represent a single object on which the column traits are measured. For example, in a grade book for recording students scores throughout the semester, their is one row for every student and columns for each assignment. A greenhouse experiment data set will have a row for every plant and columns for treatment type and biomass.
9.1 Working directory
One concept that will be important is to recognize that every time you start up
RStudio, it picks an appropriate working directory. This is the directory where
it will first look for script files or data files. By default when you double
click on an R script or Rmarkdown file to launch RStudio, it will set the working
directory to be the directory that the file was in. Similarly, when you knit an
Rmarkdown file, the working directory will be set to the directory where the
Rmarkdown file is. For both of these reasons, I always program my scripts
assuming that paths to any data files will be relative to where where my
Rmarkdown file is. To set the working directory explicitly, you can use the
GUI tools Session -> Set Working Directory...
.
The functions that we will use in this lab all accept a character string that denotes the location of the file. This location could be a web address, it could be an absolute path on your computer, or it could be a path relative to the location of your Rmarkdown file.
'MyFile.csv' |
Look in the working directory for MyFile.csv . |
'data/Myfile.csv' |
In the working directory, there is a sub-directory
called data and inside that folder there is
a filed called MyFile.csv . |
'./data/Myfile.csv' |
In the working directory, there is a sub-directory
called data and inside that folder there is
a filed called MyFile.csv . The ./ is indicating
The current working directory explicitly instead of
implicitly. |
'../Project2/data/Myfile.csv' |
From the current working directory, go UP one
directory level. In that folder there is a folder
called Project2 which in turn has a data
directory, and inside that folder there is
a filed called MyFile.csv . The ../ is indicating
to go UP. |
'~/NAU/Research/data.csv' |
In Unix based operating systems (which includes MacOSX),
a ~ denotes the user’s home directory. So this path
is from my home directory, to a sub-directory NAU which
has a Research sub-directory. |
I strongly recommend using relative paths for specifying when doing data import
so that the paths are not broken when you share your code with your collaborators
or organize your directories to group your work. For example, if I were to specify
the path ~/NAU/Research/deHeer/data-raw/FitKids.xls
, then when I share my folder
with Dr. deHeer’s graduate student (who works on a PC), the path won’t work. But
if I code the path as data-raw/FitKids.xls
and the Rmarkdown document lives
in the same directory as the data-raw folder, then we are set and I can share
the deHeer
folder and everything will work regardless of who is running the code.
9.1.1 Recommended Project Folder Structure
When working with many projects and collaborators, it is useful to have a “standard” project directory structure so that everybody knows where certain project components should be located.
Folder | Description |
---|---|
'data-raw' |
This directory has the raw data such as excel and csv files. It also contains scripts that import and wrangle the data in order to produce a clean data file. |
'data' |
This directory contains the cleaned up data in the form of .rda files. |
'docs' |
This directory contains all the Rmarkdown files that will generate reports and analysis summaries using the clean data. |
9.2 R Binary Files
R objects have a file representation that allow for saving and loading that respects things like factor sort orders. These files are portable across all R platforms and the structure is open-source and many other programs can import this format.
Suffix | Description |
---|---|
.rda or .RData |
One or more data objects. I prefer the .RData suffix as it is more descriptive, but unfortunately some operating systems don’t like the long suffix, so the .rda suffix is preferred. |
.rds |
A single R data object. I don’t recommend using this because it is unnecessary to have whole different suffix just to distiguish between a single object verse multiple objects. |
.rdx and .rdb |
A combination of files that is used when a package supports Lazy Loading. These should used primarily for packages with wide spread use. |
For a given R object or objects, the function save()
will save the objects into
a .rda
file which can be subsequently loaded using the load()
command.
# In the data-raw directory, I might have script `Clean.R` that
# imports and cleans data. At the end of that script, I would save
# the file into the project data director
save(trees, file = 'data/trees.rda') # If working directory is the project
save(trees, file = '../data/trees.rda') # If working directory is data-raw
Generally I’ll have my the scripts that generate an analysis first import the cleaned up data
load('data/trees.rda') # If working directory is the project folder
load('../data/trees.rda') # If working directory is the project/docs folder
If your project only has a few data files, one cleaning script, and one analysis
Rmarkdown file, it is tempting to ignore creating the data-raw
, data
, and docs
directories. However it is really obnoxious
to impose order when things are already out of control and you need to modify
10 or 20 scripts. If there is any chance of a project growing in scope beyond just
myself, it pays to organize where your raw-data, clean data, and analysis scripts
live at the beginning of the project.
9.3 Comma Separated Data
To consider how data might be stored, we first consider the simplest file format… the comma separated values file. In this file time, each of the “cells” of data are separated by a comma. For example, the data file storing scores for three students might be as follows:
Able, Dave, 98, 92, 94
Bowles, Jason, 85, 89, 91
Carr, Jasmine, 81, 96, 97
Typically when you open up such a file on a computer with Microsoft Excel installed, Excel will open up the file assuming it is a spreadsheet and put each element in its own cell. However, you can also open the file using a more primitive program (say Notepad in Windows, TextEdit on a Mac) you’ll see the raw form of the data.
Having just the raw data without any sort of column header is problematic (which of the three exams was the final??). Ideally we would have column headers that store the name of the column.
LastName, FirstName, Exam1, Exam2, FinalExam
Able, Dave, 98, 92, 94
Bowles, Jason, 85, 89, 91
Carr, Jasmine, 81, 96, 97
To see another example, open the “Body Fat” dataset from the Lock\(^{5}\) introductory text book at the website [http://www.lock5stat.com/datasets3e/BodyFat.csv]. The first few rows of the file are as follows:
Bodyfat,Age,Weight,Height,Neck,Chest,Abdomen,Ankle,Biceps,Wrist
32.3,41,247.25,73.5,42.1,117,115.6,26.3,37.3,19.7
22.5,31,177.25,71.5,36.2,101.1,92.4,24.6,30.1,18.2
22,42,156.25,69,35.5,97.8,86,24,31.2,17.4
12.3,23,154.25,67.75,36.2,93.1,85.2,21.9,32,17.1
20.5,46,177,70,37.2,99.7,95.6,22.5,29.1,17.7
To make R read in the data arranged in this format, we need to tell R three things:
Where does the data live? Often this will be the name of a file on your computer, but the file could just as easily live on the internet (provided your computer has internet access).
Is the first row data or is it the column names?
What character separates the data? Some programs store data using tabs to distinguish between elements, some others use white space. R’s mechanism for reading in data is flexible enough to allow you to specify what the separator is.
The primary function that we’ll use to read data from a file and into R is the
function read.table()
. This function has many optional arguments but the most
commonly used ones are outlined in the table below.
Argument | Default | What it does |
---|---|---|
file |
A character string denoting the file location | |
header |
FALSE | Is the first line column headers? |
sep |
" " | What character separates columns. " " == any whitespace |
skip |
0 | The number of lines to skip before reading data. This is useful when there are lines of text that describe the data or aren’t actual data |
na.strings |
‘NA’ | What values represent missing data. Can have multiple. E.g. c('NA', -9999) |
quote |
" and ’ | For character strings, what characters represent quotes. |
To read in the “Body Fat” data set we could run the R command:
<- read.table(
BodyFat file = 'http://www.lock5stat.com/datasets3e/BodyFat.csv', # where the data lives
header = TRUE, # first line is column names
sep = ',' ) # Data is separated by commas
str(BodyFat)
## 'data.frame': 100 obs. of 10 variables:
## $ Bodyfat: num 32.3 22.5 22 12.3 20.5 22.6 28.7 21.3 29.9 21.3 ...
## $ Age : int 41 31 42 23 46 54 43 42 37 41 ...
## $ Weight : num 247 177 156 154 177 ...
## $ Height : num 73.5 71.5 69 67.8 70 ...
## $ Neck : num 42.1 36.2 35.5 36.2 37.2 39.9 37.9 35.3 42.1 39.8 ...
## $ Chest : num 117 101.1 97.8 93.1 99.7 ...
## $ Abdomen: num 115.6 92.4 86 85.2 95.6 ...
## $ Ankle : num 26.3 24.6 24 21.9 22.5 22 23.7 21.9 24.8 25.2 ...
## $ Biceps : num 37.3 30.1 31.2 32 29.1 35.9 32.1 30.7 34.4 37.5 ...
## $ Wrist : num 19.7 18.2 17.4 17.1 17.7 18.9 18.7 17.4 18.4 18.7 ...
Looking at the help file for read.table()
we see that there are variants such
as read.csv()
that sets the default arguments to header and sep more intelligently.
Also, there are many options to customize how R responds to different input.
9.4 MS Excel
Commonly our data is stored as a MS Excel file. There are two approaches you could use to import the data into R.
From within Excel, export the worksheet that contains your data as a comma separated values (.csv) file and proceed using the tools in the previous section. Any formulas in the worksheet are not saved, but the result of the calculation is presented.
Use functions within R that can work with the worksheet directly. Critically formulas in Excel are not imported, but the result of the formula is. One package that works nicely for this is the
readxl
package.
I generally prefer using option 2 because all of my collaborators can’t live without Excel and I’ve resigned myself to this. However if you have complicated formulas in your Excel file, it is often times safer to export it as a .csv file to guarantee the data imported into R is correct. Furthermore, other spreadsheet applications (such as Google sheets) requires you to export the data as a .csv file so it is good to know both paths.
Because R can only import a complete worksheet, the desired data worksheet must be free of notes to yourself about how the data was collected, preliminary graphics, or other stuff that isn’t the data. I find it very helpful to have a worksheet in which I describe the sampling procedure and describe what each column means (and give the units!), then a second worksheet where the actual data is, and finally a third worksheet where my “Excel Only” collaborators have created whatever plots and summary statistics they need.
The simplest package for importing Excel files seems to be the package
readxl
. Another package that does this is the XLConnect
which does the
Excel -> .csv conversion using Java. Another package the works well is the
xlsx
package, but it also requires Java to be installed. The nice thing about
these two packages is that they also allow you to write Excel files as well. The
RODBC
package allows R to connect to various databases and it is possible to
make it consider an Excel file as an extremely crude database.
The readxl
package provides a function read_exel()
that allows us to specify
which sheet within the Excel file to read and what character specifies missing
data (it assumes a blank cell is missing data if you don’t specifying anything).
For the most part, the arguments are the same as read.csv
but below are the
most important changes and additions.
Argument | Meaning |
---|---|
path |
The file argument is called path instead. |
sheet |
Which sheet to read. Either the sheet name or sheet number. |
range |
The cell range to read from. E.g. “A5:G98” |
From GitHub, download the files Example_1.xls
, through Example_5.xls
, from
the directory [https://github.com/dereksonderegger/444/tree/master/data-raw].
Place these files in the same directory that you store your course work or make
a subdirectory data to store the files in. Make sure that the working directory
that RStudio is using is that same directory (Session -> Set Working Directory).
# load the library that has the read.xls function.
library(readxl)
# Where does the data live relative to my current working location?
#
# In my directory where this Rmarkdown file lives, I have made a subdirectory
# named 'data-raw' to store all the data files. So the path to my data
# file will be 'data-raw/Example_1.xls'.
# If you stored the files in the same directory as your RMarkdown script, you
# don't have to add any additional information and you can just tell it the
# file name 'Example_1.xls'
# Alternatively I could give the full path to this file starting at the root
# directory which, for me, is '~/GitHub/444/data-raw/Example_1.xls'
# but for Windows users it might be 'Z:/444/Lab9/Example_1.xls'. This looks
# odd because Windows usually uses a backslash to represent the directory
# structure, but a backslash has special meaning in R and so it wants
# to separate directories via forward slashes.
# read the first worksheet of the Example_1 file
# data.1 <- read_excel('~/GitHub/444/data-raw/Example_1.xls') # absolute path
.1 <- read_excel( 'data-raw/Example_1.xls' ) # relative to this Rmarkdown file
data
# read the second worksheet where the second worksheet is named 'data'
.2 <- read_excel('data-raw/Example_2.xls', sheet=2 )
data.2 <- read_excel('data-raw/Example_2.xls', sheet='data') data
There is one additional problem that shows up while reading in Excel files. Blank columns often show up in Excel files because at some point there was some text in a cell that got deleted but a space remains and Excel still thinks there is data in the column. To fix this, you could find the cell with the space in it, or you can select a bunch of columns at the edge and delete the entire columns. Alternatively, you could remove the column after it is read into R using typical data frame manipulation tools.
Open up the file Example_4.xls
in Excel and confirm that the data sheet has
name columns out to carb
. Read in the data frame using the following code:
.4 <- read_excel('data-raw/Example_4.xls', sheet='data') # Extra Column Example data
## New names:
## * `` -> ...13
## * `` -> ...14
str(data.4)
## tibble[,14] [34 × 14] (S3: tbl_df/tbl/data.frame)
## $ model: chr [1:34] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ mpg : num [1:34] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num [1:34] 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num [1:34] 160 160 108 258 360 ...
## $ hp : num [1:34] 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num [1:34] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num [1:34] 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num [1:34] 16.5 17 18.6 19.4 17 ...
## $ vs : num [1:34] 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num [1:34] 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : num [1:34] 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : num [1:34] 4 4 1 1 2 1 4 2 2 4 ...
## $ ...13: logi [1:34] NA NA NA NA NA NA ...
## $ ...14: logi [1:34] NA NA NA NA NA NA ...
We notice that after reading in the data, there is an additional column that just has missing data (the NA stands for not available which means that the data is missing) and a row with just a single blank. Go back to the Excel file and go to row 4 column N and notice that the cell isn’t actually blank, there is a space. Delete the space, save the file, and then reload the data into R. You should notice that the extra columns are now gone.
9.5 Google Sheets
The Google alternative to MS Excel is Google Sheets where the data lives on
Google’s cloud services. You could download a .csv
version and import it but
it is more convenient to use Google’s Sheets API, which is now on version 4. The
R package that implements the API is googlesheets4
. This package is built using
a few other packages, notably the googledrive
package which allows a user to
access the contents of a Google drive.
It is useful to know that Google Sheets supports a robust notation for specifying cell ranges:
- Sheet1!A1:B2 refers to the first two cells in the top two rows of Sheet1.
- Sheet1!A:A refers to all the cells in the first column of Sheet1.
- Sheet1!1:2 refers to all the cells in the first two rows of Sheet1.
- Sheet1!A5:A refers to all the cells of the first column of Sheet 1, from row 5 onward.
- A1:B2 refers to the first two cells in the top two rows of the first visible sheet.
- Sheet1 refers to all the cells in Sheet1.
The first time you run a googledrive
function, it will ask you to authenticate
to the Google services and to store the authentication key in your R profile. It
will warn you that you are allowing R to access your data. You’ll need to agree
to this and R will store this authentication token.
9.5.1 Reading
The main function to read from a Google Sheet is read_sheet()
The most pertinent parameters are below, but there are a few other interesting options.
Parameter | Description |
---|---|
ss |
Something that identifies the Google sheet: a file ID, a URL or dribble which is the “drive tibble” that the googledrive package uses to represent Drive files. |
sheet |
This represents the tab within the spreadsheet file. It can be the numerical position or tab name. |
range |
The cell range to read from. |
# The following link is to a data set on my Google Drive STA 444 directory
# I have it set so that anyone with the link can read the speadsheet so the
# code here will work for anybody.
<- 'https://docs.google.com/spreadsheets/d/1pPbeJOqlT3N9fEXfO2sJbYddVWgeB8VSt61V0WNK6Vc'
link
# Just using the direct link
<- read_sheet(link, sheet='Data')
df head(df)
This works well for individual Sheets that you have either an ID or URL for. However, it would be nice to not have to look that information and instead give the path within your Google Drive. To do that, we need to authenticate and give R the rights to access the drive.
If your code is running in a non-interactive environment (for example when knitting your Rmarkdown document) you’ll need to make the authentication happen by finding a prior authorization token.
# Force google drive to use a prior cached authentication token...
# Notice I still had to agree to letting R have access to my Google Drive
# at some point in time!
options(gargle_oauth_email = "derek.sonderegger@gmail.com")
You can read about the details of how/why this works at the gargle
package
vinnette about
non-interactive authentication
# Now read using the path within my Google Drive
<- googledrive::drive_get('STA_444/CherryTrees') %>%
df read_sheet( sheet='Data' )
Another consideration is that Google limits the number of read requests in a short period of time and if you are tweaking your Rmarkdown, you might run into this issue. You might try caching the R chunk.
9.5.2 Writing
The googlesheets4
package also supports writing to the spreadsheet.
The function write_sheet
writes a data frame to a sheet, whereas range_write
will write data to a specific location within a sheet.
# Here I'm only allowed to write to the Sheet because I've authenticated
# with Google. I have this particular sheet set up so that anybody with the
# link can read the file, but cannot write to the file. So this code should
# not work for you!
write_sheet(data=trees, ss=link, sheet='Data')
9.6 Multiple files
There are several cases where our data is stored in multiple files and we want
to read them in. If the data sources all have an identical column format, we
can just stack the data frames together using the rbind
command.
<- c('file1.csv', 'file2.csv', 'file3.csv') # Files to be read in.
files <- NULL # what will the output data frame be named
data for( file in files){ # for each element of our files vector
<- read.csv(file) # read in the file
temp.data <- rbind(data, temp.data) # Append it to our final data set
data }
In the example above, we might need to modify the read.csv()
command, but
fortunately it isn’t too hard to combine a for
loop with set of statements to
read in the data set.
Generally when I have to read in multiple files, I need to inspect a directory
and read all the files. Fortunately the function list.files()
will return a
of character strings containing all the files in a directory.
<- list.files('~/GitHub/141/data-raw/')
files %>% head() files
## [1] "birth_weights.csv" "cities.csv"
## [3] "CO2.csv" "COVID-19"
## [5] "COVID-19_LectureContent" "Export_R_Datasets.R"
9.7 Exercises
Download from GitHub the data file Example_5.xls. Open it in Excel and figure out which sheet of data we should import into R. At the same time figure out how many initial rows need to be skipped. Import the data set into a data frame and show the structure of the imported data using the
str()
command. Make sure that your data has \(n=31\) observations and the three columns are appropriately named. If you make any modifications to the data file, comment on those modifications.Download from GitHub the data file Example_3.xls. Import the data set into a data frame and show the structure of the imported data using the
tail()
command which shows the last few rows of a data table. Make sure the Tesla values areNA
where appropriate and that both-9999
andNA
are imported as NA values. If you make any modifications to the data file, comment on those modifications.Download all of the files from GitHub
data-raw/InsectSurveys
directory which I’ve linked here. Each month’s file contains a sheet contains site level information about each of the sites that was surveyed. The second sheet contains information about the number of each species that was observed at each site. Import the data for each month and create a singlesite
data frame with information from each month. Do the same for theobservations
. Document any modifications you make to the data files. Comment on the importance of consistency of your data input sheets.