Where does data (e.g., a data frame or tibble) come from? If we don’t enter it ourselves (e.g., with the tibble()
or tribble()
commands; see Chapter 5), we usually import it from an external source. The scope of such sources is vast and here we only cover the most common candidates: Data that is already stored in text form or other file formats that can easily be coerced into linear or rectangular data structures.
Orientation and navigation
In most modern computer operating systems, files are organized in a hierarchical system of directories or folders.
An important prerequisite for loading data is that we can orient ourselves on our computer and can navigate or point to the locations at which data files are stored.
The two key questions to ask and answer prior to reading or writing data are:
Where am I?
Where is my data?
Working directory
The first question (“Where am I?”) addresses the notion of our current working directory.
This is typically the directory on our computer in which we start our R environment, the location of our current R script, or — if we are working within an RStudio project — the main directory of our current project.
We can use the function getwd()
to determine (or rather obtain the name of) our current working directory:
# Get current working directory (wd):
(my_wd <- getwd())
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book"
Note that the getwd()
function returns a character string that uses forward slashes (/
) to separate the hierarchy levels of different directories. In R, this holds even on Windows, where the operating system itself displays paths with backward slashes (\
).
This character string represents the address of our current working directory (or the global path of our current working directory).
We can think of this hierarchy as an inverted tree of directories: The top level are the most general (or global) locations, whereas the (sub-)directories get increasingly specific (or local) as the path gets longer. When using functions for reading or writing files, we need to specify their locations in our function calls (see the distinction between global and local file paths below).
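To make the idea of hierarchy levels concrete, the following minimal sketch takes the path shown above apart. It only uses base R functions; the variable names path and path_levels are chosen for illustration:

```r
# Example path (copied from the getwd() output above):
path <- "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book"

# Split the path into its hierarchy levels (from global to local):
path_levels <- strsplit(path, split = "/", fixed = TRUE)[[1]]
path_levels <- path_levels[path_levels != ""]  # drop the empty string before the leading "/"
path_levels

# The most local (last) level of the path:
basename(path)

# Everything above the last level (i.e., the parent directory):
dirname(path)
```

Here, basename() and dirname() only manipulate the character string: They do not check whether the corresponding directory actually exists on our computer.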
Corresponding to getwd()
, the function setwd()
(with its only argument dir
specifying a string that points to an existing location on our computer) allows setting (i.e., changing) our current working directory to dir
:
# Set current working directory:
setwd(dir = my_wd) # set dir to my_wd (set above)
getwd() # same dir (as set to my_wd)
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book"
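As setwd() invisibly returns the previous working directory, a temporary change of location can be undone without calling getwd() first. The following is a minimal sketch of this pattern; the helper function in_dir() is hypothetical (i.e., defined here for illustration, not part of R) and the temporary directory merely serves as a location that exists on every machine:

```r
# setwd() invisibly returns the PREVIOUS working directory:
old_wd <- setwd(tempdir())  # move to a temporary directory
getwd()                     # now in the temporary directory
setwd(old_wd)               # return to where we started

# Hypothetical helper: call a function in another directory, then return.
in_dir <- function(dir, fun) {
  old <- setwd(dir)    # change directory and remember the old one
  on.exit(setwd(old))  # restore the old directory, even if fun() fails
  fun()
}

in_dir(tempdir(), list.files)  # list files elsewhere without moving permanently
```

Using on.exit() ensures that the working directory is restored even when the function call in the other directory results in an error.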
And list.files()
provides a list of all files and directories in our current working directory:
# List files and directories:
list.files() # in current working directory
list.files(my_wd) # in some specific directory
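The list.files() function accepts further arguments that help when searching for data files. The following sketch creates a small directory structure in a temporary location (the names list_files_demo, notes.txt, and data_t1.csv are made up for illustration) and then lists its contents in three ways:

```r
# Create a temporary directory with one file and one sub-directory:
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(file.path(demo_dir, "data"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "notes.txt"))
file.create(file.path(demo_dir, "data", "data_t1.csv"))

# Files and directories at the top level only:
top_level <- list.files(demo_dir)
top_level

# Files at all levels (paths are relative to demo_dir):
all_files <- list.files(demo_dir, recursive = TRUE)
all_files

# Only files ending in ".csv", returned with their full paths:
csv_files <- list.files(demo_dir, pattern = "\\.csv$",
                        recursive = TRUE, full.names = TRUE)
csv_files
```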
File paths
The second question (“Where is my data?”) implies that data doesn’t necessarily need to be stored at the location of our current working directory. Let’s suppose that we want to load some data file (called data_t1.csv
), which we downloaded from an online source at http://rpository.com/ds4psy/data/data_t1.csv.
But rather than saving it in our current working directory my_wd
, the file data_t1.csv
is stored in a sub-directory of my_wd (called data
). In this case, there are two principal ways of loading our data file:
- Change our current working directory to a different directory that contains the data:
# (1) Changing working directory to load data:
library(readr) # provides read_csv() (part of the tidyverse)
# Get working directory:
my_wd <- getwd()
my_wd # prints the current wd:
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book"
# Change the current working directory to the "data" subdirectory:
# setwd("/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data") # absolute path (user-specific)
setwd("./data") # relative to current directory "."
getwd() # verify new location:
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data"
# Read data from the NEW working directory:
t1 <- read_csv("data_t1.csv") # read csv data file
# Return to the original working directory:
setwd(my_wd) # setwd to original directory
getwd() # back in my_wd
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book"
- Stay at our current working directory, but read data from a different directory:
# (2) Reading data from another directory:
# (a) provide absolute/full path of the data file (user-specific, hence commented out):
# t2 <- read_csv("/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data/data_t1.csv")
t2 <- read_csv(file.path(my_wd, "data", "data_t1.csv")) # absolute path, built from my_wd
# (b) provide relative path of the data file:
t3 <- read_csv("./data/data_t1.csv")
# (c) relative to (platform dependent) home directory (user-specific, hence commented out):
# t4 <- read_csv("~/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data/data_t1.csv")
t4 <- t3 # stand-in for (c)
# (d) provide the path to an online source of the data file:
t0 <- read_csv("http://rpository.com/ds4psy/data/data_t1.csv")
The second method — staying where we are, but importing files from other directories — is typically preferred.
Nevertheless, being able to change our working directory or to point to other directories gives us a variety of ways to specify the location of a data file. But before we discuss the difference between global and local file paths, let’s quickly verify that all of the above methods actually loaded the same data:
# Check whether t0 to t4 are all equal:
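# (all.equal() returns TRUE or a description of differences, hence isTRUE())
isTRUE(all.equal(t0, t1)) &&
isTRUE(all.equal(t1, t2)) &&
isTRUE(all.equal(t2, t3)) &&
isTRUE(all.equal(t3, t4))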
#> [1] TRUE
Absolute vs. relative paths
A path describes the location of a directory or file on a computer system (typically as a character string).
We can think of it as the “address” of a computer directory or file.
Paths can be expressed in two main ways — as absolute or relative paths:
An absolute path is the full (or global) address of a directory or file on our computer.
Evaluating getwd()
in R yields the global path of our current working directory.
A relative (or local) path is the address of a directory or file relative to our current location (i.e., working directory).
When specifying paths, some abbreviations are helpful. The most common ones are:
.
(i.e., the dot symbol) denotes the current location (i.e., working directory)
..
(i.e., two dot symbols) denotes the parent directory of the current location (i.e., “one level up in the hierarchy”)
~
(i.e., the squiggly tilde symbol) denotes a user’s home directory
/
(i.e., the forward slash) denotes the machine’s root directory (on UNIX systems)
As the interpretation of .
and ..
is always based on the current location, these symbols are used when specifying relative (or local) paths. By contrast, symbols like ~
or /
are abbreviations for an absolute (or global) path.
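To see which locations these abbreviations stand for on our own machine, we can ask R to translate them into absolute paths. The following minimal sketch uses the base R functions normalizePath() and path.expand(); its results depend on the current working directory and user account:

```r
# Translate the relative abbreviations into absolute paths:
here_abs   <- normalizePath(".")   # the current working directory
parent_abs <- normalizePath("..")  # its parent directory
here_abs
parent_abs

# Expand the tilde into the absolute path of the home directory:
home_abs <- path.expand("~")
home_abs

# "." and getwd() describe the same location:
normalizePath(".") == normalizePath(getwd())
```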
For instance, if my current working directory is at /ds4psy/stuff/code
, the relative path ./..
denotes /ds4psy/stuff
(i.e., one level above my current location .
).
By contrast, when adopting the perspective from a working directory /ds4psy/stuff/code
, the relative path ./data
denotes a sub-directory /ds4psy/stuff/code/data
(i.e., one level below my current location .
).
But if we had first moved one level up to change our working directory to /ds4psy/stuff
, then the same relative path ./data
would point to a sub-directory that is at the same level as /ds4psy/stuff/code
(i.e., both data
and code
are sub-directories of /ds4psy/stuff
).
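The following sketch replays this example with real directories. As a directory /ds4psy is unlikely to exist at the root of our computer, the hierarchy is created below a temporary directory (an assumption made for illustration); the same relative path ./data is then resolved from two different working directories:

```r
# Build the example hierarchy in a temporary location:
root <- file.path(tempdir(), "ds4psy")
dir.create(file.path(root, "stuff", "code", "data"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "stuff", "data"), showWarnings = FALSE)

# Resolve "./data" from the working directory .../stuff/code:
old_wd <- setwd(file.path(root, "stuff", "code"))
from_code <- normalizePath("./data")   # .../stuff/code/data

# Move one level up and resolve the same relative path from .../stuff:
setwd("..")
from_stuff <- normalizePath("./data")  # .../stuff/data

setwd(old_wd)  # restore the original working directory

from_code
from_stuff
from_code == from_stuff  # same relative path, but different locations
```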
Importantly, absolute and relative paths can point to the same locations — they really are just two different ways of pointing to an address (typically directories or files on our computer). But as local paths are always interpreted relative to the current working directory, the same local path points to different locations when we begin in a different working directory.
Thinking of two different locations on a map may help:
To find out how to get from our current location \(A\) to another location \(B\), we can either look up the absolute/global address of \(B\) (i.e., using street names, numbers, or the map’s coordinate system) or provide directions in a relative/local fashion by adopting \(A\)’s perspective (i.e., “keep going, turn right on the 2nd street, then straight ahead, before turning left at…”).
Whereas the absolute address of \(B\) is independent of our current location \(A\), providing relative directions (e.g., “turn right”) assumes knowledge of our current location \(A\).
Thus, although the absolute and relative descriptions differ, they can both get us from \(A\) to \(B\).
If both types of file paths can describe the same location, why does their difference matter for us?
Importantly, global paths always contain the top-level directories of a particular computer, whereas a local path typically ignores top-level details. As long as all files that belong together are transferred together with their directory structure, local paths will work (in the same way as local directions work if we transfer the part of the map that they refer to). By contrast, global file paths differ between different computers and should be avoided in our code (or must be clearly marked as being user-specific, if they are being used). In other words: We should ideally only use local paths (and keep all files belonging together in or below a single working directory) to enable transferring code to other people or machines.
Practice
Suppose our current working directory is at "/Users/Me/Desktop/stuff/uni/courses/ds4psy/ch_06"
:
Where do the following (local) paths point to?
"./.."
"./../../"
"./../courses/ds4psy/ch_04"
"./../../courses/ds4psy/ch_05"
At which level of our computer system are the following directories located?
"./data"
"/Users/myself/Downloads/weird_thingy.jpg"
"./../ch_05/code"
"./../../psychopathics"
"/Users/Tea/Documents/work/taxes-2023.xls"
What’s the difference between "."
and ".."
? (And what’s the difference between "."
and getwd()
?)
Explain the following statements and provide corresponding examples:
- “Absolute and relative paths are different ways to point to a location.”
- “The same relative path can point to different locations.”
Please note:
Knowing how to specify file paths has about as much to do with R as knowing how to use our keyboard.
It’s not R, of course, but a prerequisite for using R productively.
Sharing scripts and data files
To share your R scripts and data files with others it is best to work with an RStudio project and store all related scripts and files within this project.
In R projects, it is customary to save all your R scripts in a specific directory (e.g., called R
) and store all data files in a dedicated data directory (e.g., data
).
Importantly, using only relative file paths (i.e., relative to the current script or to the project’s working directory) also allows archiving or transferring your scripts and data files, as long as the directory structure remains intact. By archiving your entire project folder (e.g., as my_project.zip
, or a folder that includes the subfolders R
and data
), you can transfer your archive to another person or computer and your scripts will keep working.
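A minimal sketch of such a project layout, assuming a project folder named my_project that is created in a temporary location and contains a small made-up data file demo.csv (both names are hypothetical). To remain self-contained, the sketch uses the base R functions write.csv() and read.csv() rather than their readr counterparts:

```r
# Create a mini-project with the two customary sub-directories:
proj <- file.path(tempdir(), "my_project")
dir.create(file.path(proj, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(proj, "data"), showWarnings = FALSE)

# Store a small (made-up) data file in the data directory:
write.csv(data.frame(id = 1:3, score = c(10, 20, 30)),
          file = file.path(proj, "data", "demo.csv"), row.names = FALSE)

# Working from the project directory, a relative path is all we need:
old_wd <- setwd(proj)
demo <- read.csv(file.path("data", "demo.csv"))  # same as "data/demo.csv"
setwd(old_wd)

demo
```

As the call to read.csv() contains no user-specific directory names, it would keep working if the entire my_project folder was moved to another location or computer.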
Being here
An alternative to using the getwd()
and setwd()
functions is provided by the here package (Müller, 2017), which answers the question “Where am I?” in a straightforward manner:
We are here()
.
here determines the path to our current working directory (or project directory) when it is loaded and provides a here()
function that returns the name of this directory or other directories, whose names can be provided as additional arguments (of type character) and separated by commas:
library(here) # loads the package
here() # the current directory
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book"
here("data") # the sub-directory "./data"
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book/data"
here("_book", "images") # a sub-sub-directory "./_book/images"
#> [1] "/Users/hneth/Desktop/stuff/Dropbox/GitHub/ds4psy_book/_book/images"
The single but brilliant idea of here is that all paths within a project can be specified relative to one fixed location (the project’s top-level directory, which usually is our current working directory) that is located and determined by here()
.
Note:
As the lubridate package (covered in Chapter 10: Time) also contained a (now deprecated) function named here()
, we may have to use here::here()
to explicate that we want to use the here()
function from the here package (i.e., its corresponding namespace). If only the here package is loaded, calling here()
is sufficient.
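A minimal sketch of how here() could be used to build the path to a data file. It assumes that the here package may or may not be installed (hence the check) and that the project contains a file data/data_t1.csv, which is why the sketch tests for the file rather than reading it:

```r
if (requireNamespace("here", quietly = TRUE)) {
  # Build a path from the project's top-level directory:
  data_path <- here::here("data", "data_t1.csv")
  print(data_path)
  # Check whether the file exists before trying to read it:
  print(file.exists(data_path))
} else {
  message("The here package is not installed.")
}
```

Using the explicit here::here() notation avoids any ambiguity with functions of the same name in other packages.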
Getting ready
This chapter formerly assumed that you had read and worked through Chapter 11: Import data of the r4ds textbook (Wickham & Grolemund, 2017). It can now be read by itself, but reading Chapter 11: Import data of r4ds is still recommended.
Please do the following to get started:
Create an R Markdown (.Rmd
) document (for instructions, see Appendix F and the templates linked in Section F.2).
Structure your document by inserting headings and empty lines between different parts.
Here’s an example of how your initial file could look:
---
title: "Chapter 6: Importing data"
author: "Your name"
date: "2024 February 22"
output: html_document
---
Add text or code chunks here.
# Exercises (06: Importing data)
## Exercise 1
## Exercise 2
etc.
<!-- The end (eof). -->
Create an initial code chunk below the header of your .Rmd
file that loads the R packages of the tidyverse (and see Section F.3.3 if you want to get rid of the messages and warnings of this chunk in your HTML output).
Save your file (e.g., as 06_import.Rmd
in the R folder of your current project) and remember to save and knit it regularly as you keep adding content to it.
Now that we can orient ourselves on our computers and navigate between various directories,
we are ready to read more about reading data with readr.