8 Importing
8.1 Reading
Hadley Wickham and Garrett Grolemund, R for Data Science
J.D. Long and Paul Teetor, R Cookbook (2nd ed.)
8.2 Getting started
For this hands-on exercise, we will be loading the data contained in a CSV file. Base R has the ability to read CSV files with the read.csv()
function, but we’ll be using the read_csv()
function, which is in the {readr} package. This package loads when we run library(tidyverse)
.
We will also be loading the contents of an Excel file. While the {readxl} package is part of the tidyverse, it is not part of the core that loads with library(tidyverse)
: we have to run the library()
function separately.
8.3 Reading a CSV file
CSV files are very commonly used for storing flat files. They don’t have any formatting—just the number or text in the cell.
The package {readr} is designed to make importing these files simple.
(This example comes straight from https://readr.tidyverse.org/index.html)
Run this chunk to create an object called mtcars
, from a CSV file of the same name.
# example
mtcars <- read_csv("mtcars.csv")
You’ll see that the console displays the type that has been assigned to each variable. In this case, they are all of the type “double”, which is a numeric type (think numbers with decimals or exponents; not integers).
Adding the cols()
argument allows us to alter what {readr} has decided for us. For example, we could set the cyl
variable to be an integer.
# example
mtcars <- read_csv("mtcars.csv",
col_types =
cols(cyl = col_integer()))
Insert an R chunk and rerun the example above, but with the am
variable also set to integer and gear
set to character.
# solution
read_csv("mtcars.csv",
col_types =
cols(cyl = col_integer(),
am = col_integer(),
gear = col_character())
)
## # A tibble: 32 × 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <chr> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # … with 22 more rows
The {readr} package allows a lot of control over how the file is read. Of particular utility are
na = ""
– specify which values you want to be turned intoNA
skip = 0
– specify how many rows to skipn_max = Inf
– the maximum number of records to read
Read the first 5 rows of the “mtcars.csv” file.
# solution
read_csv("mtcars.csv",
n_max = 5)
## # A tibble: 5 × 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
In some instances, we may wish to read a sub-set of rows in the middle of our file, and retain the headers. In this code below, we do just that.
The first line uses the names()
function to read the variable names, and assign them to an object that we can use in the subsequent read_csv()
function.
# set the column names
mtcars_col_names <- names(read_csv("mtcars.csv", n_max = 0))
# read file starting at 5th row, adding column names
read_csv("mtcars.csv",
col_names = mtcars_col_names,
skip = 5,
n_max = 10)
## # A tibble: 10 × 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 2 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 3 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 4 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 5 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 6 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## 7 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
## 8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
## 9 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
## 10 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
8.4 Reading an Excel file
If anything, Excel files are more common than CSV and other plain-text data files. They seem to multiply like coat hangers in the closet…
And as we see in the article by Karl Broman & Kara Woo, “Data Organization in Spreadsheets”, the features of Excel files can encourage some ways of storing information that makes it hard for us to analyze. Excel files can also contain a wide variety of data format types. {readxl} tries to figure out what’s going on, but like {readr}, it allows you to override some of those automatic decisions.
In this example, our Excel file is in a different directory, one below our current directory, called “data”. Working with file paths like this, we can import the contents of data files and worksheets that are anywhere on our computer, the local network, or the internet.
deaths <- read_excel("data/deaths.xlsx")
deaths
## # A tibble: 18 × 6
## `Lots of people` ...2 ...3 ...4 ...5 ...6
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 simply cannot resist writing <NA> <NA> <NA> <NA> some notes
## 2 at the top <NA> of their spread…
## 3 or merging <NA> <NA> <NA> cells
## 4 Name Profession Age Has kids Date of birth Date of death
## 5 David Bowie musician 69 TRUE 17175 42379
## 6 Carrie Fisher actor 60 TRUE 20749 42731
## 7 Chuck Berry musician 90 TRUE 9788 42812
## 8 Bill Paxton actor 61 TRUE 20226 42791
## 9 Prince musician 57 TRUE 21343 42481
## 10 Alan Rickman actor 69 FALSE 16854 42383
## 11 Florence Henderson actor 82 TRUE 12464 42698
## 12 Harper Lee author 89 FALSE 9615 42419
## 13 Zsa Zsa Gábor actor 99 TRUE 6247 42722
## 14 George Michael musician 53 FALSE 23187 42729
## 15 Some <NA> <NA> <NA> <NA> <NA>
## 16 <NA> also like to write stuff <NA> <NA> <NA> <NA>
## 17 <NA> <NA> at the bottom, <NA> <NA>
## 18 <NA> <NA> <NA> <NA> <NA> too!
What do you notice about the “Date of birth” and “Date of death” columns?
Read in the “deaths” file, but use the range =
option to specify when to start reading the file:
# Example
deaths <- read_excel("data/deaths.xlsx",
range = "A1:B3")
deaths
## # A tibble: 2 × 2
## `Lots of people` ...2
## <chr> <chr>
## 1 simply cannot resist writing <NA>
## 2 at the
# solution
deaths <- read_excel(
"data/deaths.xlsx",
range = "A5:F15"
)
deaths
## # A tibble: 10 × 6
## Name Profession Age `Has kids` `Date of birth` `Date of death`
## <chr> <chr> <dbl> <lgl> <dttm> <dttm>
## 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
## 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
## 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
## 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
## 5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
## 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 2016-01-14 00:00:00
## 7 Florence Henderson actor 82 TRUE 1934-02-14 00:00:00 2016-11-24 00:00:00
## 8 Harper Lee author 89 FALSE 1926-04-28 00:00:00 2016-02-19 00:00:00
## 9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 2016-12-18 00:00:00
## 10 George Michael musician 53 FALSE 1963-06-25 00:00:00 2016-12-25 00:00:00
Use the skip =
and n_max =
options to achieve the same thing:
# solution
deaths <- read_excel(
"data/deaths.xlsx",
skip = 4,
n_max = 10
)
deaths
## # A tibble: 10 × 6
## Name Profession Age `Has kids` `Date of birth` `Date of death`
## <chr> <chr> <dbl> <lgl> <dttm> <dttm>
## 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
## 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
## 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
## 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
## 5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
## 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 2016-01-14 00:00:00
## 7 Florence Henderson actor 82 TRUE 1934-02-14 00:00:00 2016-11-24 00:00:00
## 8 Harper Lee author 89 FALSE 1926-04-28 00:00:00 2016-02-19 00:00:00
## 9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 2016-12-18 00:00:00
## 10 George Michael musician 53 FALSE 1963-06-25 00:00:00 2016-12-25 00:00:00
If we know how many rows to skip at the top, and how many to omit at the bottom, we can insert a calculation as to what our n_max()
value needs to be. This would be useful in the circumstance where we have multiple sheets we need to read, each having the same structure with a fixed number of header and footer rows that do not contain relevant data, but the number of rows in between varies.
In the case of the “deaths.xlsx” file, we have four rows of junk data at the top (prior to the variable names), and another four rows at the bottom. By subtracting those numbers from the total count of rows in the file, we have the object valid_rows
that we can apply as the value for the n_max =
in our read_excel()
function.
# first, read the data file
deaths <- read_excel("data/deaths.xlsx")
# determine the total number of rows in the sheet
rowcount <- nrow(deaths)
# calculate the number of rows with relevant data "rowcount2"
valid_rows <- rowcount - 4 - 4 # 4 rows of junk data at the top after the header, 4 rows of junk data at the bottom
read_excel("data/deaths.xlsx",
skip = 4,
n_max = valid_rows)
## # A tibble: 10 × 6
## Name Profession Age `Has kids` `Date of birth` `Date of death`
## <chr> <chr> <dbl> <lgl> <dttm> <dttm>
## 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
## 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
## 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
## 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
## 5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
## 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 2016-01-14 00:00:00
## 7 Florence Henderson actor 82 TRUE 1934-02-14 00:00:00 2016-11-24 00:00:00
## 8 Harper Lee author 89 FALSE 1926-04-28 00:00:00 2016-02-19 00:00:00
## 9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 2016-12-18 00:00:00
## 10 George Michael musician 53 FALSE 1963-06-25 00:00:00 2016-12-25 00:00:00
For more information on reading areas of an Excel sheet, see this article at the {readxl} site: https://readxl.tidyverse.org/articles/sheet-geometry.html
Excel files often (almost always?) have multiple sheets in them. It’s possible to specify which one you want to use with the sheets = ""
option. You can also use the position number (if you happen to know it).
Note that if you don’t specify the sheet, {readxl} will default to the first one.
The excel_sheets()
function will tell you the name of the sheets in an Excel file.
datasets <- "data/datasets.xlsx"
excel_sheets(datasets)
## [1] "iris" "mtcars" "chickwts" "quakes"
# Example
read_excel(datasets, "iris")
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 140 more rows
Now, read in the “mtcars” sheet using the name of the sheet.
# Solution
read_excel(datasets, "mtcars")
## # A tibble: 32 × 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # … with 22 more rows
And finally, read in the “quakes” sheet, using the position.
# Solution
read_excel(datasets, 4)
## # A tibble: 1,000 × 5
## lat long depth mag stations
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -20.4 182. 562 4.8 41
## 2 -20.6 181. 650 4.2 15
## 3 -26 184. 42 5.4 43
## 4 -18.0 182. 626 4.1 19
## 5 -20.4 182. 649 4 11
## 6 -19.7 184. 195 4 12
## 7 -11.7 166. 82 4.8 43
## 8 -28.1 182. 194 4.4 15
## 9 -28.7 182. 211 4.7 35
## 10 -17.5 180. 622 4.3 19
## # … with 990 more rows
Methods for saving dataframes will be introduced later in the course, in the “Exporting data & graphics” chapter.
8.5 Exercise
8.5.1 Discussion
The Excel portion of this chapter relies on some common methods to deal with how spreadsheet files are often used (Excel isn’t the only spreadsheet software, just very common.) Read the journal article by Karl Broman & Kara Woo, “Data Organization in Spreadsheets”(Broman and Woo 2017) and come to class ready to discuss the following questions.
Discussion questions:
How many people in your discussion group have used spreadsheets, in their research, courses, or at home?
How many people in your discussion group have accidentally done something that made them frustrated or sad? How many of them have done the sorts of things that Broman and Woo advise against?
What are basic principles for using spreadsheets for good data organization?
What are good approaches for handling dates in spreadsheets?
8.6 Reference Material
You can find more details in R for Data Science: https://r4ds.had.co.nz/data-import.html
{readr} https://readr.tidyverse.org/