1.2 The Importance of Tidy Data

The learning objectives for this section are to:

  • Define tidy data and transform non-tidy data into tidy data

One unifying concept of this book is the notion of tidy data. As defined by Hadley Wickham in his 2014 paper published in the Journal of Statistical Software, a tidy dataset has the following properties:

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.

The purpose of defining tidy data is to highlight the fact that most data do not start out life as tidy. In fact, much of the work of data analysis may involve simply making the data tidy (at least this has been our experience). Once a dataset is tidy, it can be used as input into a variety of other functions that may transform, model, or visualize the data.

As a quick example, consider the following data, which show death rates (per 1,000 population) in Virginia in 1940 in a classic table format:

      Rural Male Rural Female Urban Male Urban Female
50-54       11.7          8.7       15.4          8.4
55-59       18.1         11.7       24.3         13.6
60-64       26.9         20.3       37.0         19.3
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0

While this format is canonical and useful for quickly observing the relationship between multiple variables, it is not tidy. It violates the tidy form because variables are stored in both the rows and the columns: the row names hold the age category, and each column header combines urban-ness and gender. The fourth variable, the death rate itself, appears only as the values inside the table.
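To see where the variables are hiding, it can help to inspect the object directly. The short sketch below assumes only base R and the built-in VADeaths dataset; it shows that the table is a matrix whose row names and column names carry three of the four variables.

```r
# VADeaths ships with base R (datasets package); no extra packages needed.
is_matrix <- is.matrix(VADeaths)   # the table is a numeric matrix, not a data frame
age_groups <- rownames(VADeaths)   # age category is stored in the row names
col_labels <- colnames(VADeaths)   # urban-ness and gender are fused in each column name

print(is_matrix)
print(age_groups)
print(col_labels)
```

Because none of these three variables is stored as a column, functions that expect one column per variable cannot use them directly.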

Converting these data to tidy format would give us:

library(tidyr)
library(dplyr)

VADeaths %>%
  as_tibble() %>%                        # convert the matrix to a tibble (row names are dropped)
  mutate(age = row.names(VADeaths)) %>%  # restore the age groups as a regular column
  gather(key, death_rate, -age) %>%      # stack the four rate columns into key/value pairs
  separate(key, c("urban", "gender"), sep = " ") %>%  # split "Rural Male" into two variables
  mutate(age = factor(age), urban = factor(urban), gender = factor(gender))
# A tibble: 20 x 4
   age   urban gender death_rate
   <fct> <fct> <fct>       <dbl>
 1 50-54 Rural Male         11.7
 2 55-59 Rural Male         18.1
 3 60-64 Rural Male         26.9
 4 65-69 Rural Male         41  
 5 70-74 Rural Male         66  
 6 50-54 Rural Female        8.7
 7 55-59 Rural Female       11.7
 8 60-64 Rural Female       20.3
 9 65-69 Rural Female       30.9
10 70-74 Rural Female       54.3
11 50-54 Urban Male         15.4
12 55-59 Urban Male         24.3
13 60-64 Urban Male         37  
14 65-69 Urban Male         54.6
15 70-74 Urban Male         71.1
16 50-54 Urban Female        8.4
17 55-59 Urban Female       13.6
18 60-64 Urban Female       19.3
19 65-69 Urban Female       35.1
20 70-74 Urban Female       50  
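Once the data are tidy, downstream steps such as grouped summaries become short. The following is a minimal sketch rather than the only way to do this: it rebuilds the same tidy table with pivot_longer(), the newer tidyr interface (assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later are installed), and then averages the death rate within each urban/gender group. The names va_tidy and rate_by_group are illustrative only.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Rebuild the tidy table: one row per age group, urban-ness, and gender.
va_tidy <- VADeaths %>%
  as.data.frame() %>%
  rownames_to_column("age") %>%          # move row names into an "age" column
  pivot_longer(
    cols = -age,
    names_to = c("urban", "gender"),     # split each column header into two variables
    names_sep = " ",
    values_to = "death_rate"
  )

# With one variable per column, a grouped summary is a two-step pipeline.
rate_by_group <- va_tidy %>%
  group_by(urban, gender) %>%
  summarize(mean_rate = mean(death_rate), .groups = "drop")

print(rate_by_group)
```

Note that pivot_longer() returns the rows in a different order than gather() does, so the printed table will be sorted differently from the output above even though it holds the same 20 observations.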

1.2.1 The “Tidyverse”

There are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. Many (but not all) of these packages are written by Hadley Wickham and the collection of packages is sometimes referred to as the “tidyverse” because of their dependence on and presumption of tidy data. “Tidyverse” packages include

  • ggplot2: a plotting system based on the grammar of graphics

  • magrittr: defines the %>% operator for chaining functions together in a series of operations on data

  • dplyr: a suite of (fast) functions for working with data frames

  • tidyr: easily tidy data with the spread() and gather() functions (superseded in tidyr 1.0.0 by pivot_wider() and pivot_longer(), although both still work)

We will be using these packages extensively in this book.
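As a brief illustration of what the %>% operator does, the sketch below (assuming the magrittr package is installed) writes the same computation twice, once as a pipeline and once as nested function calls. The variable names are illustrative only.

```r
library(magrittr)

x <- c(1, 4, 9)

# Piped form: read left to right, each result feeds the next function.
piped <- x %>% sqrt() %>% sum()

# Equivalent nested form: read inside out.
nested <- sum(sqrt(x))

print(identical(piped, nested))
```

The pipeline form tends to be easier to read when a chain of operations grows beyond two or three steps, which is why it appears throughout the examples that use dplyr and tidyr.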

The “tidyverse” package can be used to install all of the packages in the tidyverse at once, and loading it attaches the core tidyverse packages in a single call. For example, instead of starting an R script with this:

library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)

You can start with this:

library(tidyverse)
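Installing and loading are separate steps. As a small sketch, assuming the tidyverse package is available from your configured CRAN mirror, the one-time installation is shown commented out, followed by a call to tidyverse_packages(), which lists the packages that belong to the collection.

```r
# One-time installation (uncomment to run); this installs the whole collection.
# install.packages("tidyverse")

library(tidyverse)

# List every package that is part of the tidyverse, including non-core ones.
pkgs <- tidyverse_packages()
print(pkgs)
```

Only the core packages are attached by library(tidyverse); other packages in the list still need their own library() call before use.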