Where does data (e.g., a data frame or tibble) come from? If we don’t enter it ourselves (e.g., with the tibble() or tribble() commands; see Chapter 5), we usually import it from an external source. The scope of such sources is vast, and here we only cover the most common candidates: data that is already stored in text form or in other file formats that can easily be coerced into linear or rectangular data structures.
Orientation and navigation
An important pre-requisite to loading data is that we are able to orient ourselves on our computer and can navigate or point to the location at which data files may be stored. The two key questions to ask and answer prior to reading or writing data are:
- Where am I?
- Where is my data?
Working directory
The first question (“Where am I?”) addresses the notion of our current working directory.
This is typically the directory on our computer in which we started our R environment, the location of our current R script, or — if we are working within an RStudio project — the home directory of our current project.
We can use the function getwd()
to determine (or rather obtain the name of) our current working directory:
# Get current working directory (wd):
my_wd <- getwd()
my_wd
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book"
Note that the getwd() function returns a character string in which forward slashes (/) separate the hierarchy levels of different directories. R uses forward slashes in this string on all platforms, including Windows, where the operating system itself usually displays paths with backward slashes (\).
This character string represents the address of our current working directory (or the global path of our current working directory).
We can think of this hierarchy as an inverted tree of directories: The top levels are the most general (or global) locations, whereas the directories get increasingly specific (or local) as the path gets longer. When using functions for reading or writing files, we need to specify their locations in our commands (see the distinction between global and local file paths below).
Corresponding to getwd()
, the function setwd()
(with its only argument dir
specifying a string that points to an existing location on our computer) allows setting (i.e., changing) our current working directory to dir
:
# Set current working directory:
setwd(dir = my_wd) # set dir to my_wd (set above)
getwd() # same dir (as set to my_wd)
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book"
And list.files()
provides a list of all files and directories in our current working directory:
# List files and directories:
list.files() # in current working directory
list.files(my_wd) # in some specific directory
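Before reading a file, it can help to check whether a directory or file actually exists at the location we have in mind. The following is a minimal sketch using only base R functions. The directory wd_demo and the two file names are hypothetical and are created in a temporary location purely for illustration:

```r
# Create a temporary directory with two empty files (for illustration only):
tmp_dir <- file.path(tempdir(), "wd_demo")
dir.create(tmp_dir, showWarnings = FALSE)
file.create(file.path(tmp_dir, c("data_a.csv", "notes.txt")))

# Is there such a directory? Is there such a file?
dir.exists(tmp_dir)
file.exists(file.path(tmp_dir, "data_a.csv"))

# List only those files whose names end in ".csv":
csv_files <- list.files(tmp_dir, pattern = "\\.csv$")
csv_files
```

The pattern argument of list.files() takes a regular expression, which is why the dot is escaped and the end of the name is marked by $.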
File paths
The second question (“Where is my data?”) implies that data doesn’t necessarily need to be stored at the location of our current working directory. Let’s suppose that we want to load some data file (called data_t1.csv
), which we downloaded from an online source at http://rpository.com/ds4psy/data/data_t1.csv.
But rather than saving it in our current working directory my_wd
, the file data_t1.csv
is stored in a sub-directory of it (called data). In this case, there are two principal ways of loading our data file:
- Change our current working directory to a different directory that contains the data:
# (1) Changing working directory to load data:
# Get working directory:
my_wd <- getwd()
my_wd # prints the current wd:
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book"
# Change the current working directory to the "data" sub-directory:
# setwd("/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data") # absolute path (user-specific)
setwd("./data") # relative to current directory "."
getwd() # verify new location:
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data"
# Read data from the NEW working directory:
t1 <- read_csv("data_t1.csv") # read csv data file
# Return to the original working directory:
setwd(my_wd) # setwd to original directory
getwd() # back in my_wd
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book"
- Stay at our current working directory, but read data from a different directory:
# (2) Reading data from another directory:
# (a) provide absolute/full path of the data file (user-specific, hence commented out):
# t2 <- read_csv("/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data/data_t1.csv")
t2 <- read_csv("./data/data_t1.csv") # portable stand-in: relative path
# (b) provide relative path of the data file:
t3 <- read_csv("./data/data_t1.csv")
# (c) provide path relative to the (platform-dependent) home directory (user-specific):
# t4 <- read_csv("~/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data/data_t1.csv")
t4 <- t3 # stand-in for the commented-out command
# (d) provide the path to an online source of the data file:
t0 <- read_csv("http://rpository.com/ds4psy/data/data_t1.csv")
The second method — staying where we are, but importing files from other directories — is typically preferred.
Nevertheless, being able to change our working directory or to point to other directories gives us several ways of specifying the location of a data file. But before we discuss the difference between global and local file paths, let’s quickly verify that all of the above methods actually loaded the same data:
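# Check whether t0 to t4 are all equal:
# (all.equal() returns TRUE or a description of differences, so we wrap it in isTRUE())
isTRUE(all.equal(t0, t1)) &&
isTRUE(all.equal(t1, t2)) &&
isTRUE(all.equal(t2, t3)) &&
isTRUE(all.equal(t3, t4))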
#> [1] TRUE
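If we do change the working directory (as in the first method), it is easy to forget the final step of returning to where we started. One way to guard against this, sketched here with base R only, is to wrap the detour in a function that restores the original directory when it exits. The helper name run_in_dir() is hypothetical, and tempdir() merely stands in for some other directory:

```r
# Hypothetical helper: call the function f() inside the directory dir,
# then return to the original working directory (even if f() fails).
run_in_dir <- function(dir, f) {
  old_wd <- setwd(dir)     # setwd() invisibly returns the previous working directory
  on.exit(setwd(old_wd))   # restore it when this function exits
  f()
}

start_wd <- getwd()

# List the files of another directory without staying there:
files_elsewhere <- run_in_dir(tempdir(), list.files)

# Verify that we are back where we started:
identical(getwd(), start_wd)
```

Because the restoring step is registered with on.exit(), it also runs when f() stops with an error, which a manual setwd(my_wd) at the end of a script would not.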
Global vs. local file paths
The main distinction when specifying locations on our computers is between global and local paths:
A global path is the full (or absolute) address of a directory and/or file on our computer.
Evaluating getwd()
in R yields the global path of our current working directory.
A local path is the address of a directory and/or file relative to our current location (or working directory).
When specifying local paths, a dot symbol . denotes our current location, and two dots .. denote “one level up in the directory hierarchy.” Thus, if our current working directory is at /ds4psy/stuff/code, the relative path ./.. denotes /ds4psy/stuff (i.e., one level above our current location .) and the relative path ./../data denotes a directory /ds4psy/stuff/data (i.e., both data and code are sub-directories of /ds4psy/stuff).
A fact that initially is confusing to many people is that local and global file paths can point to the same locations — they really are just two different ways of pointing to an address (typically directories or files on our computer).
Thinking of two different locations on a map may help: To find out how to get from our current location \(A\) to another location \(B\), we can either look up the global address of \(B\) (i.e., using street names, numbers, or the map’s coordinate system) or describe the way in a local fashion by adopting \(A\)’s perspective (i.e., “keep going, turn right on the 2nd street, then straight ahead, before turning left at…”). Thus, although the global and local descriptions are different, they can both get us from \(A\) to \(B\).
If both types of path can describe the same location, why does their difference matter for us?
Importantly, global paths always contain the top-level directories of a particular computer, whereas a local path typically ignores top-level details. As global paths differ between different computers, they should not be used in our code (or should be clearly marked as being user-specific, if they are being used). In other words: We should ideally only use local paths (and keep all files belonging together in or below a single working directory) to enable transferring code to other people or machines.
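That a local and a global path can describe the same location can be checked in R with normalizePath(), which translates a path into its global form. The following sketch builds a small directory tree in a temporary location; the directory names mirror the /ds4psy/stuff example but are otherwise assumptions made for illustration:

```r
# Build a small directory tree in a temporary location (for illustration only):
root <- file.path(tempdir(), "ds4psy", "stuff")
dir.create(file.path(root, "code"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)

# Move into the "code" directory (and remember where we came from):
old_wd <- setwd(file.path(root, "code"))

# Translate local paths into global paths:
global_up   <- normalizePath("./..")       # one level up (the "stuff" directory)
global_data <- normalizePath("./../data")  # the sibling directory "data"

# The local and the global description point to the same location:
same_place <- identical(global_data, normalizePath(file.path(root, "data")))
same_place

# Return to the original working directory:
setwd(old_wd)
```

The global paths printed by this sketch will differ between machines, whereas the local paths "./.." and "./../data" stay the same. This is the reason for preferring local paths in shared code.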
Practice
Suppose our current working directory is at "/Users/Me/Desktop/stuff/uni/courses/ds4psy/ch_06"
:
Where do the following (local) paths point to?
"./.."
"./../../"
"./../courses/ds4psy/ch_04"
"./../../courses/ds4psy/ch_05"
At which level of our computer system are the following directories located?
"./data"
/Users/myself/Downloads/weirdo.jpg
"./../ch_05/code"
./../../psychopathics
"/Users/Tea/Documents/work/taxes-2020.xls"
What’s the difference between "."
and ".."
? And what’s the difference between .
and "."
?
Please note: Knowing how to specify file paths has about as much to do with R as knowing how to use our keyboard.
It’s not R, of course, but a pre-requisite for using it productively.
Sharing scripts and data files
To share your R scripts and data files with others, it is best to work with an RStudio project and store all related scripts and files within this project.
In R projects, it is customary to save all your R scripts in a specific directory (e.g., called R
) and store all data files in a dedicated data directory (e.g., data
).
Importantly, using only relative file paths (i.e., relative to the current script or to the project’s working directory) also allows archiving or transferring your scripts and data files, as long as the directory structure remains intact. By archiving your entire project folder (e.g., as my_project.zip
, or a folder that includes the subfolders R
and data
), you can transfer your archive to another person or computer and your scripts will keep working.
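Within such a project, relative paths can be assembled with the base R function file.path(), which joins its arguments with forward slashes. A minimal sketch, assuming a project that contains a data sub-directory and using the file name data_t1.csv from the examples above:

```r
# Relative path to a data file (relative to the project's directory):
data_file <- file.path("data", "data_t1.csv")
data_file

# Check for the file before trying to read it:
if (file.exists(data_file)) {
  t1 <- read.csv(data_file)  # base R reader; readr::read_csv() accepts the same path
} else {
  message("No file found at ", data_file, " (relative to ", getwd(), ")")
}
```

As the path contains no user-specific parts, the same code works on any machine on which the project's directory structure is intact.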
Being here
A modern alternative to using the getwd()
and setwd()
functions is provided by the here package (Müller, 2017), which answers the question “Where am I?” in a straightforward manner:
We are here()
.
here determines the path to our current working directory (or project directory) when it is loaded and provides a here()
function that returns the name of this directory or other directories, whose names can be provided as additional arguments (of type character) and separated by commas:
library(here) # loads the package
here() # the current directory
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book"
here("data") # the sub-directory "./data"
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data"
here("_book", "images") # a sub-sub-directory "./_book/images"
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book/_book/images"
The brilliant idea of here is that all paths within a project can be specified relative to a single directory (our working directory when the package was loaded or, within a project, the project’s top-level directory), which is located and denoted as here().
Note: As the lubridate package (covered in Chapter 10: Time) also contained a (now deprecated) function named here(), we may have to use here::here() to make explicit that we want to use the here() function from the here package (i.e., its corresponding namespace). If only the here package is loaded, calling here() suffices.
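Paths created by here() can be passed directly to functions that read files. The following sketch assumes that the here package is installed and that a file data_t1.csv may exist in a data sub-directory (as in the examples above). Both checks are included so that the code also runs when these assumptions do not hold:

```r
# Only proceed if the here package is available:
if (requireNamespace("here", quietly = TRUE)) {

  # Build the full path to the data file:
  data_path <- here::here("data", "data_t1.csv")
  print(data_path)

  # Read the file only if it exists at this location:
  if (file.exists(data_path)) {
    t5 <- read.csv(data_path)  # base R reader; readr::read_csv() accepts the same path
  }
}
```

Writing here::here() rather than here() avoids any ambiguity if another loaded package also provides a function of this name.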
Data used
In this chapter, we will use a variety of data files. As many of them are stored in non-standard formats,
they are not included in the ds4psy package, but stored on a web server (at http://rpository.com).
Below, we will illustrate how they can be imported directly from their online source.
Alternatively, you can use a web browser to download the files to a directory on your computer
(e.g., in a sub-directory called data
) and import them from there.
Getting ready
This chapter formerly assumed that you had read and worked through Chapter 11: Import data of the r4ds textbook (Wickham & Grolemund, 2017). It can now be read by itself, but reading that chapter of r4ds is still recommended.
Please do the following to get started:
Create an R Markdown (.Rmd
) document (for instructions, see Appendix F and the templates linked in Section F.2).
Structure your document by inserting headings and empty lines between different parts.
Here’s an example of how your initial file could look:
---
title: "Chapter 6: Importing data"
author: "Your name"
date: "2022 July 15"
output: html_document
---
Add text or code chunks here.
# Exercises (06: Importing data)
## Exercise 1
## Exercise 2
etc.
<!-- The end (eof). -->
Create an initial code chunk below the header of your .Rmd
file that loads the R packages of the tidyverse (and see Section F.3.3 if you want to get rid of the messages and warnings of this chunk in your HTML output).
Save your file (e.g., as 06_import.Rmd in the R folder of your current project) and remember to save and knit it regularly as you keep adding content to it.
Now that we can orient ourselves on our computers and navigate between various directories,
we are ready to read more about reading data with readr.