Chapter 8 Data preparation: Overview
Topics:
- Tidy data
tsibbleobjects for storing, manipulating, and visualizing time series data. Frequency of time series: theindexparameter.keyparameter(s).- Applying
dplyrverbs totsibbleobjects:filter,select,mutate,group_by,summarize - Periodicity. Seasonal periods.
Readings:
- FPP, Section 2.1
- Optional: To learn how to wrangle and visualize data using the Tidyverse packages, you may find it useful to go through the Tidyverse Fundamentals with R modules on Datacamp. Datacamp also offers a range of other learning modules for developing data science skills in R.
Assignment: Extract, transform, and load your data.
8.1 Data preparation strategy: Design your end-point data table(s)
Starting point: Multiple source files, mess, etc. This is real life as a data scientist!
What’s your desired end point?
Data ETL is a creative activity! (Your jobs are secure.)
8.1.1 Design your end point first (at least, in your head).
- Which columns? In which order?
- Which data types should the different columns have?
- For
tsibbleobjects:- Which column is the
index? Must have date+time values, at regular intervals. - Which columns contain
keyvalues?- These are values that don’t change with time.
- Each value of the
keycorresponds to a distinct time series. - Cannot have duplicate rows that share an
index+keyvalue.
- Remaining columns contain observational data: one row for each time step and
keyvalue.- What data types should these be?
- Might choose to drop columns you aren’t going to use, to reduce clutter.
- Which column is the
8.1.2 Typical structure for a time series data table:
| Date + time | Series | Value_1 | Value_2 | Value_3 |
|---|---|---|---|---|
| 2020-02-01 | “Virginia” | 33.57 | 29 | “friendly” |
| 2020-02-01 | “Idaho” | 0.22 | 18 | “hostile” |
| … | … | … | … | … |
index |
key |
|||
| [date] | [fctr] | [dbl] | [int] | [fctr] |
Then wrangle your data to get to your desired end point.
Recommended practices:
- Put
indexfield in the left-most column. - Next, put all the
keyfields. - Then finally the data values. Start with the most important ones.