Chapter 9 Data preparation strategy

Topics:

  • Tidy data
  • tsibble objects for storing, manipulating, and visualizing time series data. Frequency of time series: the index parameter. key parameter(s).
  • Applying dplyr verbs to tsibble objects: filter, select, mutate, group_by, summarize

Readings:

Optional: To learn how to wrangle and visualize data using the Tidyverse packages, you may find it useful to go through the Tidyverse Fundamentals with R modules on Datacamp. Datacamp also offers a range of other learning modules for developing data science skills in R.

Assignment: Extract and prepare your data.

9.1 Overview: Extraction, transformation, and loading of data

Before undertaking any data analysis project, you need to organize your data into a format to make it ready for analysis. Very commonly, the data you wish to work with will not come to you in a nice format that makes it ready to analyze. Often it will be necessary to transform the data, applying a sequence of manipulations to get it into a nice format such as a single table.

If you have saved your prepared table to a database or local file, your may finally need to load the data into memory on your working machine as a prelude to commence analysis. These steps together are the extract-transform-load (ETL) stage of a data analysis project.

The bad news is that working data scientists generally report that the ETL stage is the most time-consuming part of a data science project. The good news is that the R tidyverse packages offer a number of helpful tools to somewhat ease the pain of ETL work, also known informally as data wrangling.1

The ETL steps needed for a given project will depend on the nature of the data and on how they are originally organized. We can characterize how we want the data to look at the end of the ETL stage. To the extent possible, we want the data to be tidy.

9.2 Organize your data into a tidy data frame

Readings: TSDS Sections 1.2-1.3

Hadley Wickham (???) codifies the concept of tidy data:

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

“Real datasets can, and often do, violate the three precepts of tidy data in almost every way imaginable. While occasionally you do get a dataset that you can start analysing immediately, this is the exception, not the rule.” In real life, the systems people and organizations use to collect, manage, and store data are governed by many priorities: end-user convenience, clarity of presentation, storage costs, processing speed, etc. Making data fit for analysis by data scientists is typically a minor consideration, if it is thought of at all.

Expanding on the theme, Wickham identifies five of the most common problems with non-tidy, “messy” datasets:

  • Column headers are values, not variable names.
  • Multiple variables are stored in one column.
  • Variables are stored in both rows and columns.
  • Multiple types of observational units are stored in the same table.
  • A single observational unit is stored in multiple tables.

The Tidyverse packages integrate a range of tools to help with transforming the messy data often encountered in the wild into tidy formats more suitable for analysis. See the Tidyverse website for an overview, or this Coursera course for more guidance.

  • Organize your data table(s) so that they ‘match’ your question.

9.3 Convert your data into a tsibble object

In this step of the assignment, you will convert your tidy data frame into a tsibble object. Doing so in effect tells R: “These data actually form a time series. One column, which I designate the index, contains time values, in equal intervals.” Taking this step enables the use of specific tools for data visualization, exploratory analysis, and forecasting for time series data. The tsibble package for R provides “a data infrastructure for tidy temporal data”. Review the documentation for instructions.

9.4 Data preparation strategy: Design your end-point data table(s)

Starting point: Multiple source files, mess, etc. This is real life as a data scientist.

What’s your desired end point? How will you prepare your data to make it ready to analyze?

Data preparation is a creative activity. (Your jobs are secure.)

9.4.1 Design your end point first (at least, in your head).

  • Which columns? In which order?
  • Which data types should the different columns have?
  • For tsibble objects:
    • Which column is the index? Must have date+time values, at regular intervals.
    • Which columns contain key values?
      • These are values that don’t change with time.
      • Each value of the key corresponds to a distinct time series.
      • Cannot have duplicate rows that share an index + key value.
    • Remaining columns contain observational data: one row for each time step and key value.
      • What data types should these be?
      • Might choose to drop columns you aren’t going to use, to reduce clutter.

9.4.2 Typical structure for a time series data table

Date + time Series Value_1 Value_2 Value_3
2020-02-01 “Virginia” 33.57 29 “friendly”
2020-02-01 “Idaho” 0.22 18 “hostile”
index key
[date] [fctr] [dbl] [int] [fctr]

Then wrangle your data to get to your desired end point.

Recommended practices:

  • Put index field in the left-most column.
  • Next, put all the key fields.
  • Then finally the data values. Start with the most important ones.

9.5 Additional resources and next steps

To learn how to wrangle and visualize data using the Tidyverse packages, you may find it useful to go through the Tidyverse Fundamentals with R modules on Datacamp, or the analogous course on Coursera. Datacamp also offers a range of other learning modules for developing data science skills in R.

With your data organized into a tsibble object, you are positioned to do some exploratory data analysis, the topic of the next assignment. If you want to get a head start, check out the functions for data visualization and exploratory analysis of time series in the feasts package. Or work through the examples in Chapter 2 or Chapter 4 of Forecasting: Principles and Practice, 3rd ed.


  1. “Wrangling” refers to work with cattle, sheep, and other livestock.↩︎