5.2 Importing data
Importing data is one of the most important, but also most mundane steps in analyzing data. Unfortunately, anything that goes wrong at this step is likely to affect everything else that follows.
As we typically deal with tabular (or rectangular) data, the utils package of R contains a range of read.table() functions that read files into data frames from various formats. The most commonly used of these are:
- read.csv() and read.csv2() for importing comma-separated value (csv) files
- read.delim() and read.delim2() for importing other delimited files (e.g., using the TAB character to separate the values of different variables)
- read.fwf() for reading fixed width format (fwf) files
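To illustrate how these functions differ, the following minimal sketch writes the same small table in three formats and reads each one back. The file names and values are made up for this illustration, and the files are stored in a temporary directory rather than in a project folder:

```r
# Hypothetical example files in a temporary directory:
tmp <- tempdir()
csv_file  <- file.path(tmp, "demo_comma.csv")      # "," separator, "." decimal
csv2_file <- file.path(tmp, "demo_semicolon.csv")  # ";" separator, "," decimal
tab_file  <- file.path(tmp, "demo_tab.txt")        # TAB separator, "." decimal

writeLines(c("id,score", "1,2.5", "2,3.5"), csv_file)
writeLines(c("id;score", "1;2,5", "2;3,5"), csv2_file)
writeLines(c("id\tscore", "1\t2.5", "2\t3.5"), tab_file)

# Each function assumes a different separator and decimal mark by default:
df_comma <- read.csv(csv_file)
df_semi  <- read.csv2(csv2_file)
df_tab   <- read.delim(tab_file)

# All three should describe the same data frame:
all.equal(df_comma, df_semi)
all.equal(df_comma, df_tab)
```

Choosing the function that matches a file's separator and decimal mark avoids having to set the sep and dec arguments of read.table() by hand.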
The readr package of the tidyverse provides similar and additional functions for reading (or “parsing”) vectors and importing data files into a simplified type of data frame (known as a “tibble”).
Both the utils and the readr packages also provide a range of write() functions that allow exporting (and storing) data files in various formats.
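As a minimal sketch of exporting with the utils package, the following code writes a small data frame to a csv file and reads it back. The data frame and the temporary file are hypothetical stand-ins for real project data:

```r
# A small made-up data frame:
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Export to a temporary csv file (without writing row names as a column):
out_file <- tempfile(fileext = ".csv")
write.csv(df, file = out_file, row.names = FALSE)

# Re-import and compare with the original:
df_back <- read.csv(out_file)
all.equal(df, df_back)
```

Setting row.names = FALSE matters here: by default, write.csv() stores row names as an additional first column, which then reappears as an extra variable when the file is read.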
Importing and exporting files also assume some knowledge about how to denote paths to files or computer locations (on a local file system or a remote server).
As these topics are covered in Chapter 6: Importing data, the rest of this section only contains some excerpts and examples. More details are available in the following sections:
- Section 6.1.2 Orientation and navigation describes how to denote paths
- Section 6.2 Essential readr commands introduces key readr functions
5.2.1 File locations and paths
A well-organized project typically contains various (sub-)directories for storing different types of data. For instance, many projects contain dedicated sub-directories for data, images, or code files.
The fact that not all files are stored in the same directory makes it necessary to know or set one’s current working directory, as well as point to the locations of files in other directories. When working with RStudio projects, R sets a session’s original working directory to the project folder.
File paths are descriptions of locations on a computer, typically encoded as character strings. They usually need to be specified when loading a data file or linking to an image, as well as other files.
To make an R project as self-contained as possible (i.e., independent of the particular folder structure on our personal computer), all files needed in a project should be stored in the project folder or its sub-directories. When including a file from some folder, always use relative file paths to specify its location.
Key commands for getting and setting file paths in R include:
# (1) Getting and setting the working directory:
getwd() # get current working directory (as an absolute file path)
wd <- getwd() # store this file path
setwd(wd) # set working directory (to the stored absolute file path)
# (2) Navigating relative file paths:
setwd(".") # "." marks current location
setwd("./data") # move 1 level down into "data" (if "data" exists)
setwd("..") # move 1 level upwards
setwd("./..") # move 1 level upwards (from current location)
# Assuming 2 sub-directories ("./code" and "./data"):
setwd("code") # move down into directory "code"
setwd("../data") # move into parallel directory "data"
setwd("../code") # move into parallel directory "code"
setwd("..") # move 1 level up
The here package (Müller, 2017) simplifies these commands, but also requires an understanding of file paths.
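The navigation commands above can be tried without touching a real project. The following sketch uses only base R functions and assumes a made-up project folder (demo_project, with data and code sub-directories) inside a temporary directory:

```r
# Remember the current working directory, so that it can be restored:
old_wd <- getwd()

# Create a hypothetical project with 2 sub-directories:
project <- file.path(tempdir(), "demo_project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "code"), showWarnings = FALSE)

# From the project folder, the data file has a relative path:
setwd(project)
rel_path <- file.path("data", "scores.csv")  # i.e., "data/scores.csv"
writeLines(c("id,score", "1,2.5"), rel_path)
found_from_root <- file.exists(rel_path)

# From "code", the same file is 1 level up and then down into "data":
setwd("code")
found_from_code <- file.exists(file.path("..", "data", "scores.csv"))

# Restore the original working directory:
setwd(old_wd)
```

Building paths with file.path() rather than pasting strings keeps the code independent of the path separator, and restoring the working directory at the end avoids side effects on later commands.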
5.2.2 Reading and writing files
The main way to get data into R is by importing (or “reading”) data files. Doing this requires not only the existence of the file, but also knowing its storage location. Storage locations can be local (on our own computer) or remote (on some online server), with various intermediate cases (e.g., on another drive or computer on the same network).
In R, all functions that read or write files use a flexible file argument that typically describes a path to a file (as a character string), but can also specify a connection (to a server), or even literal data (as a single string or a raw vector).
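The three kinds of input can be sketched with functions of the utils package, which run without installing additional packages. Note one difference that this sketch assumes: read.csv() accepts a path or a connection as its file argument, but takes literal data through a separate text argument. The data values are made up:

```r
# Literal csv data as a single string:
csv_text <- "id,score\n1,2.5\n2,3.5"

# (a) Reading from a path (as a character string):
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)
from_path <- read.csv(file = path)

# (b) Reading from a connection (here: to a local file):
con <- file(path, open = "r")
from_con <- read.csv(file = con)
close(con)

# (c) Reading from literal data (via the text argument):
from_text <- read.csv(text = csv_text)

# All three routes should yield the same data frame:
all.equal(from_path, from_con)
all.equal(from_path, from_text)
```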
Key readr functions include:
- read_csv() vs. read_csv2() for reading comma-separated data files
- read_delim() for reading data files not delimited by commas
- write_csv() vs. write_csv2() for writing comma-separated data files
- write_delim() for writing data files not delimited by commas
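A minimal round trip with these functions could look as follows. This sketch assumes that the readr package is installed (and is skipped otherwise). The data frame and the temporary files are made up, and col_types = readr::cols() is only used to suppress the messages on guessed column types:

```r
if (requireNamespace("readr", quietly = TRUE)) {

  # A small made-up data frame:
  df <- data.frame(id = 1:2, score = c(2.5, 3.5))

  comma_file <- tempfile(fileext = ".csv")
  semi_file  <- tempfile(fileext = ".csv")
  tab_file   <- tempfile(fileext = ".txt")

  # Writing: "," vs. ";" vs. TAB as the delimiter:
  readr::write_csv(df, comma_file)
  readr::write_csv2(df, semi_file)  # uses ";" as separator and "," as decimal mark
  readr::write_delim(df, tab_file, delim = "\t")

  # Reading with the matching functions (each returns a tibble):
  tb_comma <- readr::read_csv(comma_file, col_types = readr::cols())
  tb_semi  <- readr::read_csv2(semi_file, col_types = readr::cols())
  tb_tab   <- readr::read_delim(tab_file, delim = "\t", col_types = readr::cols())
}
```

Reading a file with a function that does not match its format (e.g., using read_csv() on a file written by write_csv2()) does not necessarily raise an error, but tends to yield a single column of unparsed text, so inspecting the imported tibble is a sensible first check.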
Dealing with problems:
- For parsing vectors, see Section 11.3 Parsing a vector of the r4ds book (Wickham & Grolemund, 2017) or Section 6.2.1 Parsing vectors of the ds4psy book (Neth, 2022a).