8 Importing

8.1 Reading

Hadley Wickham and Garrett Grolemund, R for Data Science

J.D. Long and Paul Teetor, R Cookbook (2nd ed.)

8.2 Getting started

For this hands-on exercise, we will be loading the data contained in a CSV file. Base R has the ability to read CSV files with the read.csv() function, but we’ll be using the read_csv() function, which is in the {readr} package. This package loads when we run library(tidyverse).

We will also be loading the contents of an Excel file. While the {readxl} package is part of the tidyverse, it is not part of the core that loads with library(tidyverse): we have to run the library() function separately.

library(tidyverse)  # we will be using the functions of {readr}
library(readxl)     

8.3 Reading a CSV file

CSV files are very commonly used for storing flat files. They don’t have any formatting—just the number or text in the cell.

The package {readr} is designed to make importing these files simple.

(This example comes straight from https://readr.tidyverse.org/index.html)

Run this chunk to create an object called mtcars, from a CSV file of the same name.

# example
mtcars <- read_csv("mtcars.csv")

You’ll see that the console displays the type that has been assigned to each variable. In this case, they are all of the type “double”, which is a numeric type (think numbers with decimals or exponents; not integers).

Adding the cols() argument allows us to alter what {readr} has decided for us. For example, we could set the cyl variable to be an integer.

# example
mtcars <- read_csv("mtcars.csv",
                   col_types =
                     cols(cyl = col_integer()))

Insert an R chunk and rerun the example above, but with the am variable also set to integer and gear set to character.

# solution
read_csv("mtcars.csv",
         col_types = 
           cols(cyl = col_integer(),
                am = col_integer(),
                gear = col_character())
)
## # A tibble: 32 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am gear   carb
##    <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <chr> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1 4         4
##  2  21       6  160    110  3.9   2.88  17.0     0     1 4         4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1 4         1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0 3         1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0 3         2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0 3         1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0 3         4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0 4         2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0 4         2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0 4         4
## # … with 22 more rows

The {readr} package allows a lot of control over how the file is read. Of particular utility are

  • na = "" – specify which values you want to be turned into NA

  • skip = 0 – specify how many rows to skip

  • n_max = Inf – the maximum number of records to read

Read the first 5 rows of the “mtcars.csv” file.

# solution
read_csv("mtcars.csv", 
         n_max = 5)
## # A tibble: 5 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2

In some instances, we may wish to read a sub-set of rows in the middle of our file, and retain the headers. In this code below, we do just that.

The first line uses the names() function to read the variable names, and assign them to an object that we can use in the subsequent read_csv() function.

# set the column names
mtcars_col_names <- names(read_csv("mtcars.csv", n_max = 0))

# read file starting at 5th row, adding column names
read_csv("mtcars.csv", 
         col_names = mtcars_col_names, 
         skip = 5, 
         n_max = 10)
## # A tibble: 10 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  2  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  3  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  4  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  5  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
##  6  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
##  7  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
##  8  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
##  9  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
## 10  15.2     8  276.   180  3.07  3.78  18       0     0     3     3

8.4 Reading an Excel file

If anything, Excel files are more common than CSV and other plain-text data files. They seem to multiply like coat hangers in the closet…

And as we see in the article by Karl Broman & Kara Woo, “Data Organization in Spreadsheets”, the features of Excel files can encourage some ways of storing information that makes it hard for us to analyze. Excel files can also contain a wide variety of data format types. {readxl} tries to figure out what’s going on, but like {readr}, it allows you to override some of those automatic decisions.

In this example, our Excel file is in a different directory, one below our current directory, called “data”. Working with file paths like this, we can import the contents of data files and worksheets that are anywhere on our computer, the local network, or the internet.

deaths <- read_excel("data/deaths.xlsx")
deaths
## # A tibble: 18 × 6
##    `Lots of people`             ...2                     ...3   ...4     ...5          ...6         
##    <chr>                        <chr>                    <chr>  <chr>    <chr>         <chr>        
##  1 simply cannot resist writing <NA>                     <NA>   <NA>     <NA>          some notes   
##  2 at                           the                      top    <NA>     of            their spread…
##  3 or                           merging                  <NA>   <NA>     <NA>          cells        
##  4 Name                         Profession               Age    Has kids Date of birth Date of death
##  5 David Bowie                  musician                 69     TRUE     17175         42379        
##  6 Carrie Fisher                actor                    60     TRUE     20749         42731        
##  7 Chuck Berry                  musician                 90     TRUE     9788          42812        
##  8 Bill Paxton                  actor                    61     TRUE     20226         42791        
##  9 Prince                       musician                 57     TRUE     21343         42481        
## 10 Alan Rickman                 actor                    69     FALSE    16854         42383        
## 11 Florence Henderson           actor                    82     TRUE     12464         42698        
## 12 Harper Lee                   author                   89     FALSE    9615          42419        
## 13 Zsa Zsa Gábor                actor                    99     TRUE     6247          42722        
## 14 George Michael               musician                 53     FALSE    23187         42729        
## 15 Some                         <NA>                     <NA>   <NA>     <NA>          <NA>         
## 16 <NA>                         also like to write stuff <NA>   <NA>     <NA>          <NA>         
## 17 <NA>                         <NA>                     at the bottom,  <NA>          <NA>         
## 18 <NA>                         <NA>                     <NA>   <NA>     <NA>          too!

What do you notice about the “Date of birth” and “Date of death” columns?

Read in the “deaths” file, but use the range = option to specify when to start reading the file:

# Example
deaths <- read_excel("data/deaths.xlsx", 
                     range = "A1:B3")

deaths
## # A tibble: 2 × 2
##   `Lots of people`             ...2 
##   <chr>                        <chr>
## 1 simply cannot resist writing <NA> 
## 2 at                           the
# solution
deaths <- read_excel(
  "data/deaths.xlsx",
  range = "A5:F15"
)

deaths
## # A tibble: 10 × 6
##    Name               Profession   Age `Has kids` `Date of birth`     `Date of death`    
##    <chr>              <chr>      <dbl> <lgl>      <dttm>              <dttm>             
##  1 David Bowie        musician      69 TRUE       1947-01-08 00:00:00 2016-01-10 00:00:00
##  2 Carrie Fisher      actor         60 TRUE       1956-10-21 00:00:00 2016-12-27 00:00:00
##  3 Chuck Berry        musician      90 TRUE       1926-10-18 00:00:00 2017-03-18 00:00:00
##  4 Bill Paxton        actor         61 TRUE       1955-05-17 00:00:00 2017-02-25 00:00:00
##  5 Prince             musician      57 TRUE       1958-06-07 00:00:00 2016-04-21 00:00:00
##  6 Alan Rickman       actor         69 FALSE      1946-02-21 00:00:00 2016-01-14 00:00:00
##  7 Florence Henderson actor         82 TRUE       1934-02-14 00:00:00 2016-11-24 00:00:00
##  8 Harper Lee         author        89 FALSE      1926-04-28 00:00:00 2016-02-19 00:00:00
##  9 Zsa Zsa Gábor      actor         99 TRUE       1917-02-06 00:00:00 2016-12-18 00:00:00
## 10 George Michael     musician      53 FALSE      1963-06-25 00:00:00 2016-12-25 00:00:00

Use the skip = and n_max = options to achieve the same thing:

# solution
deaths <- read_excel(
  "data/deaths.xlsx",
  skip = 4,
  n_max = 10
)

deaths
## # A tibble: 10 × 6
##    Name               Profession   Age `Has kids` `Date of birth`     `Date of death`    
##    <chr>              <chr>      <dbl> <lgl>      <dttm>              <dttm>             
##  1 David Bowie        musician      69 TRUE       1947-01-08 00:00:00 2016-01-10 00:00:00
##  2 Carrie Fisher      actor         60 TRUE       1956-10-21 00:00:00 2016-12-27 00:00:00
##  3 Chuck Berry        musician      90 TRUE       1926-10-18 00:00:00 2017-03-18 00:00:00
##  4 Bill Paxton        actor         61 TRUE       1955-05-17 00:00:00 2017-02-25 00:00:00
##  5 Prince             musician      57 TRUE       1958-06-07 00:00:00 2016-04-21 00:00:00
##  6 Alan Rickman       actor         69 FALSE      1946-02-21 00:00:00 2016-01-14 00:00:00
##  7 Florence Henderson actor         82 TRUE       1934-02-14 00:00:00 2016-11-24 00:00:00
##  8 Harper Lee         author        89 FALSE      1926-04-28 00:00:00 2016-02-19 00:00:00
##  9 Zsa Zsa Gábor      actor         99 TRUE       1917-02-06 00:00:00 2016-12-18 00:00:00
## 10 George Michael     musician      53 FALSE      1963-06-25 00:00:00 2016-12-25 00:00:00

If we know how many rows to skip at the top, and how many to omit at the bottom, we can insert a calculation as to what our n_max() value needs to be. This would be useful in the circumstance where we have multiple sheets we need to read, each having the same structure with a fixed number of header and footer rows that do not contain relevant data, but the number of rows in between varies.

In the case of the “deaths.xlsx” file, we have four rows of junk data at the top (prior to the variable names), and another four rows at the bottom. By subtracting those numbers from the total count of rows in the file, we have the object valid_rows that we can apply as the value for the n_max = in our read_excel() function.

# first, read the data file
deaths <- read_excel("data/deaths.xlsx")
# determine the total number of rows in the sheet
rowcount <- nrow(deaths)

# calculate the number of rows with relevant data "rowcount2"
valid_rows <- rowcount - 4 - 4   # 4 rows of junk data at the top after the header, 4 rows of junk data at the bottom

read_excel("data/deaths.xlsx",
           skip = 4,
           n_max = valid_rows)
## # A tibble: 10 × 6
##    Name               Profession   Age `Has kids` `Date of birth`     `Date of death`    
##    <chr>              <chr>      <dbl> <lgl>      <dttm>              <dttm>             
##  1 David Bowie        musician      69 TRUE       1947-01-08 00:00:00 2016-01-10 00:00:00
##  2 Carrie Fisher      actor         60 TRUE       1956-10-21 00:00:00 2016-12-27 00:00:00
##  3 Chuck Berry        musician      90 TRUE       1926-10-18 00:00:00 2017-03-18 00:00:00
##  4 Bill Paxton        actor         61 TRUE       1955-05-17 00:00:00 2017-02-25 00:00:00
##  5 Prince             musician      57 TRUE       1958-06-07 00:00:00 2016-04-21 00:00:00
##  6 Alan Rickman       actor         69 FALSE      1946-02-21 00:00:00 2016-01-14 00:00:00
##  7 Florence Henderson actor         82 TRUE       1934-02-14 00:00:00 2016-11-24 00:00:00
##  8 Harper Lee         author        89 FALSE      1926-04-28 00:00:00 2016-02-19 00:00:00
##  9 Zsa Zsa Gábor      actor         99 TRUE       1917-02-06 00:00:00 2016-12-18 00:00:00
## 10 George Michael     musician      53 FALSE      1963-06-25 00:00:00 2016-12-25 00:00:00

For more information on reading areas of an Excel sheet, see this article at the {readxl} site: https://readxl.tidyverse.org/articles/sheet-geometry.html

Excel files often (almost always?) have multiple sheets in them. It’s possible to specify which one you want to use with the sheets = "" option. You can also use the position number (if you happen to know it).

Note that if you don’t specify the sheet, {readxl} will default to the first one.

The excel_sheets() function will tell you the name of the sheets in an Excel file.

datasets <- "data/datasets.xlsx"

excel_sheets(datasets)
## [1] "iris"     "mtcars"   "chickwts" "quakes"
# Example
read_excel(datasets, "iris")
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

Now, read in the “mtcars” sheet using the name of the sheet.

# Solution
read_excel(datasets, "mtcars")
## # A tibble: 32 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows

And finally, read in the “quakes” sheet, using the position.

# Solution
read_excel(datasets, 4)
## # A tibble: 1,000 × 5
##      lat  long depth   mag stations
##    <dbl> <dbl> <dbl> <dbl>    <dbl>
##  1 -20.4  182.   562   4.8       41
##  2 -20.6  181.   650   4.2       15
##  3 -26    184.    42   5.4       43
##  4 -18.0  182.   626   4.1       19
##  5 -20.4  182.   649   4         11
##  6 -19.7  184.   195   4         12
##  7 -11.7  166.    82   4.8       43
##  8 -28.1  182.   194   4.4       15
##  9 -28.7  182.   211   4.7       35
## 10 -17.5  180.   622   4.3       19
## # … with 990 more rows

Methods for saving dataframes will be introduced later in the course, in the “Exporting data & graphics” chapter.


8.5 Exercise

8.5.1 Discussion

The Excel portion of this chapter relies on some common methods to deal with how spreadsheet files are often used (Excel isn’t the only spreadsheet software, just very common.) Read the journal article by Karl Broman & Kara Woo, “Data Organization in Spreadsheets”(Broman and Woo 2017) and come to class ready to discuss the following questions.

Discussion questions:

  1. How many people in your discussion group have used spreadsheets, in their research, courses, or at home?

  2. How many people in your discussion group have accidentally done something that made them frustrated or sad? How many of them have done the sorts of things that Broman and Woo advise against?

  3. What are basic principles for using spreadsheets for good data organization?

  4. What are good approaches for handling dates in spreadsheets?


8.6 Reference Material

You can find more details in R for Data Science: https://r4ds.had.co.nz/data-import.html

{readr} https://readr.tidyverse.org/

{readxl} https://readxl.tidyverse.org/index.html