Part 4 Week 1 Synchronous
Goal: discuss the article “Tidy Data” by Hadley Wickham and learn the principles of tidy data.
4.1 Why tidy data?
Tidy datasets are easy to manipulate, model and visualize.
It is often said that 80% of data analysis is spent on the process of cleaning and preparing
the data.
4.2 What is tidy data?
Happy families are all alike; every unhappy family is unhappy in its own way.
Leo Tolstoy
Like families, tidy datasets are all alike but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).
4.2.1 Data structure
- Most statistical datasets are rectangular tables made up of rows and columns.
- The columns are almost always labeled.
- The rows are sometimes labeled.
Look at Table 1. The table has two columns and three rows, and both rows and columns are labeled.
| | treatmenta | treatmentb |
|---|---|---|
| John Smith | NA | 13 |
| Jane Doe | 12 | 5 |
| Mary Johnson | 6 | 15 |

Table 1: Typical presentation dataset.
There are other ways to structure the same underlying data. Table 2 shows the same data as Table 1, but the rows and columns have been transposed. The data is the same, but the layout is different.
| | John Smith | Jane Doe | Mary Johnson |
|---|---|---|---|
| treatmenta | NA | 12 | 6 |
| treatmentb | 13 | 5 | 15 |

Table 2: The same data as in Table 1 but structured differently.
Our vocabulary of rows and columns is simply not rich enough to describe why the two tables represent the same data.
4.2.2 Data semantics
- A dataset is a collection of values:
  - numbers (if quantitative)
  - strings (if qualitative)
- Every value belongs to a variable and an observation.
- A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units.
- An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
Table 3 reorganizes Table 1 to make the values, variables and observations more clear.
The dataset contains
- 18 values
- 3 variables:
  - person, with three possible values (John Smith, Mary Johnson, and Jane Doe)
  - treatment, with two possible values (a and b)
  - result, with five or six values depending on how you think of the missing value (NA, 12, 6, 13, 5, 15)
- 6 observations
- Experimental design: every combination of person and treatment was measured (completely crossed design)
| name | trt | result |
|---|---|---|
| John Smith | a | NA |
| Jane Doe | a | 12 |
| Mary Johnson | a | 6 |
| John Smith | b | 13 |
| Jane Doe | b | 5 |
| Mary Johnson | b | 15 |

Table 3: The same data as in Table 1 but with variables in columns and observations in rows.
It is difficult to precisely define variables and observations in general.
Example 1:
- height and weight columns (clearly 2 variables)
- height and width columns (perhaps values of a single dimension variable?)

Example 2:
- home phone and work phone columns (2 variables?)
- or phone number and number type variables?

Rule of thumb:
- It is easier to describe functional relationships between variables (e.g., z is a linear combination of x and y, density is the ratio of weight to volume) than between rows.
- It is easier to make comparisons between groups of observations (e.g., average of group a vs. average of group b) than between groups of columns.
4.2.3 Tidy data
4.2.3.1 Tidy data
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
4.2.3.2 Messy data
Messy data is any other arrangement of the data.
Table 3 is the tidy version of Table 1. Each row represents an observation, the result of one treatment on one person, and each column is a variable.
The layout ensures that values of different variables from the same observation are always paired.
4.2.3.3 Order of variables and observations
While the order of variables and observations does not affect analysis, a good ordering makes it easier to scan the raw values. Consider the role of a variable in the analysis: is it fixed by the design of the data collection, or measured during the course of the experiment?
- Fixed variables describe the experimental design and are known in advance.
- Measured variables are what we actually measure in the study.
- Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous.
4.3 Tidying messy datasets
Real datasets can, and often do, violate the three precepts of tidy data in almost every way imaginable.
Tools: melting, string splitting, and casting.
4.3.1 Column headers are values, not variable names
Table 4 shows a subset of a typical dataset of this form.
This dataset explores the relationship
between income and religion in the US. It comes from a report produced by the Pew Research
Center, an American think-tank that collects data on attitudes to topics ranging from religion
to the internet, and produces many reports that contain datasets in this format.
This dataset has three variables, religion, income and frequency. To tidy it, we need to melt, or stack it. In other words, we need to turn columns into rows.
Note: although this arrangement is messy by our definition, in some cases it can be extremely useful.
| religion | <$10k | $10–20k | $20–30k | $30–40k | $40–50k | $50–75k |
|---|---|---|---|---|---|---|
| Agnostic | 27 | 34 | 60 | 81 | 76 | 137 |
| Atheist | 12 | 27 | 37 | 52 | 35 | 70 |
| Buddhist | 27 | 21 | 30 | 34 | 33 | 58 |
| Catholic | 418 | 617 | 732 | 670 | 638 | 1116 |
| Don’t know/refused | 15 | 14 | 15 | 11 | 10 | 35 |
| Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 |
| Hindu | 1 | 9 | 7 | 9 | 11 | 34 |
| Historically Black Prot | 228 | 244 | 236 | 238 | 197 | 223 |
| Jehovah’s Witness | 20 | 27 | 24 | 24 | 21 | 30 |
| Jewish | 19 | 19 | 25 | 25 | 30 | 95 |

Table 4: The first ten rows of data on income and religion from the Pew Forum. Three columns, $75-100k, $100-150k and >150k, have been omitted.
Melting is parameterized by a list of columns that are already variables, or colvars for short. The other columns are converted into two variables: a new variable called column that contains the repeated column headings, and a new variable called value that contains the concatenated data values from the previously separate columns. This is illustrated in Table 5 with a toy dataset. The result of melting is a molten dataset.
Table 5: Illustration of melting with a toy dataset (table omitted).
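As a rough sketch of what melting does (using a made-up toy data frame, not the paper's Table 5, and tidyr's pivot_longer as a modern stand-in for melt):

```r
library(tidyr)

# Hypothetical toy dataset: "row" is the only colvar; a and b hold values
raw <- data.frame(row = c("A", "B", "C"), a = c(1, 2, 3), b = c(4, 5, 6))

# Melt: keep "row"; stack the remaining column headings into "column"
# and their cell contents into "value"
molten <- pivot_longer(raw, cols = c(a, b),
                       names_to = "column", values_to = "value")
molten
```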
The Pew dataset has one colvar, religion, and melting yields Table 6. To better reflect their roles in this dataset, the column variable has been renamed to income, and the value column to freq.
This form is tidy because each column represents a variable and each row represents an observation, in this case a demographic unit corresponding to a combination of religion and income.
| religion | income | freq |
|---|---|---|
| Agnostic | <$10k | 27 |
| Agnostic | $10–20k | 34 |
| Agnostic | $20–30k | 60 |
| Agnostic | $30–40k | 81 |
| Agnostic | $40–50k | 76 |
| Agnostic | $50–75k | 137 |
| Agnostic | $75–100k | 122 |
| Agnostic | $100–150k | 109 |
| Agnostic | >150k | 84 |
| Agnostic | Don’t know/refused | 96 |

Table 6: The first ten rows of the tidied Pew survey dataset on income and religion. The column variable has been renamed to income, and value to freq.
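A minimal sketch of this melt in R, assuming a data frame pew laid out like Table 4 (the object name and column names are placeholders, not the actual Pew file):

```r
library(tidyr)
library(dplyr)

# Assumed layout: one religion column plus one column per income bracket
pew_tidy <- pew %>%
  pivot_longer(cols = -religion,      # religion is the only colvar
               names_to = "income",   # former column headings
               values_to = "freq")    # former cell values
```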
Another common use of this data format is to record regularly spaced observations over time.
For example, the Billboard dataset shown in Table 7 records the date a song first entered the Billboard Top 100. It has variables for artist, track, date.entered, rank and week. The rank in each week after it enters the top 100 is recorded in 75 columns, wk1 to wk75. If a song is in the Top 100 for less than 75 weeks the remaining columns are filled with missing values.
This form of storage is not tidy, but it is useful for data entry. It reduces duplication since otherwise each song in each week would need its own row, and song metadata like title and artist would need to be repeated.
| | year | artist | track | time | date.entered | wk1 | wk2 | wk3 |
|---|---|---|---|---|---|---|---|---|
| 1 | 2000 | 2 Pac | Baby Don’t Cry | 4:22 | 2000-02-26 | 87 | 82 | 72 |
| 2 | 2000 | 2Ge+her | The Hardest Part Of … | 3:15 | 2000-09-02 | 91 | 87 | 92 |
| 3 | 2000 | 3 Doors Down | Kryptonite | 3:53 | 2000-04-08 | 81 | 70 | 68 |
| 6 | 2000 | 98^0 | Give Me Just One Nig… | 3:24 | 2000-08-19 | 51 | 39 | 34 |
| 7 | 2000 | A*Teens | Dancing Queen | 3:44 | 2000-07-08 | 97 | 97 | 96 |
| 8 | 2000 | Aaliyah | I Don’t Wanna | 4:15 | 2000-01-29 | 84 | 62 | 51 |
| 9 | 2000 | Aaliyah | Try Again | 4:03 | 2000-03-18 | 59 | 53 | 38 |
| 10 | 2000 | Adams, Yolanda | Open My Heart | 5:30 | 2000-08-26 | 76 | 76 | 74 |

Table 7: The first eight Billboard top hits for 2000. Other columns not shown are wk4, wk5, …, wk75.
This dataset has colvars year, artist, track, time, and date.entered. Melting yields Table 8. The column variable has been converted to week by extracting the number, and date has been computed from date.entered and week.
| year | artist | time | track | date | week | rank |
|---|---|---|---|---|---|---|
| 2000 | 2 Pac | 4:22 | Baby Don’t Cry | 2000-02-26 | 1 | 87 |
| 2000 | 2 Pac | 4:22 | Baby Don’t Cry | 2000-03-04 | 2 | 82 |
| 2000 | 2 Pac | 4:22 | Baby Don’t Cry | 2000-03-11 | 3 | 72 |
| 2000 | 2 Pac | 4:22 | Baby Don’t Cry | 2000-03-18 | 4 | 77 |
| 2000 | 2 Pac | 4:22 | Baby Don’t Cry | 2000-03-25 | 5 | 87 |
| 2000 | 2 Pac | 4:22 | Baby Don’t Cry | 2000-04-01 | 6 | 94 |
| 2000 | 2 Pac | 4:22 | Baby Don’t Cry | 2000-04-08 | 7 | 99 |
| 2000 | 2Ge+her | 3:15 | The Hardest Part Of … | 2000-09-02 | 1 | 91 |
| 2000 | 2Ge+her | 3:15 | The Hardest Part Of … | 2000-09-09 | 2 | 87 |
| 2000 | 2Ge+her | 3:15 | The Hardest Part Of … | 2000-09-16 | 3 | 92 |
| 2000 | 3 Doors Down | 3:53 | Kryptonite | 2000-04-08 | 1 | 81 |
| 2000 | 3 Doors Down | 3:53 | Kryptonite | 2000-04-15 | 2 | 70 |
| 2000 | 3 Doors Down | 3:53 | Kryptonite | 2000-04-22 | 3 | 68 |
| 2000 | 3 Doors Down | 3:53 | Kryptonite | 2000-04-29 | 4 | 67 |
| 2000 | 3 Doors Down | 3:53 | Kryptonite | 2000-05-06 | 5 | 66 |

Table 8: First fifteen rows of the tidied Billboard dataset. The date column does not appear in the original table, but can be computed from date.entered and week.
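A sketch of the Billboard tidying, assuming a data frame billboard shaped like Table 7; the week extraction and date arithmetic are my reconstruction of the steps described above, not the paper's exact code:

```r
library(tidyr)
library(dplyr)

billboard_tidy <- billboard %>%
  pivot_longer(cols = starts_with("wk"),
               names_to = "week", values_to = "rank",
               values_drop_na = TRUE) %>%          # drop weeks after the song left the chart
  mutate(week = as.integer(gsub("wk", "", week)),  # "wk3" -> 3
         date = as.Date(date.entered) + 7 * (week - 1)) %>%
  select(year, artist, time, track, date, week, rank)
```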
4.3.2 Multiple variables stored in one column
After melting, the values of the column variable are often a combination of multiple underlying variable names.
This is illustrated by the tuberculosis (TB) dataset, a sample of which is shown in Table 9.
This dataset comes from the World Health Organization, and records the counts of confirmed tuberculosis cases by country, year, and demographic group. The demographic groups are broken down by sex (m, f) and age (0-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65+, unknown).
| | country | year | m014 | m1524 | m2534 | m3544 | m4554 | m5564 | m65 | mu | f014 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | AD | 2000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | NA |
| 37 | AE | 2000 | 2 | 4 | 4 | 6 | 5 | 12 | 10 | NA | 3 |
| 61 | AF | 2000 | 52 | 228 | 183 | 149 | 129 | 94 | 80 | NA | 93 |
| 88 | AG | 2000 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NA | 1 |
| 137 | AL | 2000 | 2 | 19 | 21 | 14 | 24 | 19 | 16 | NA | 3 |
| 166 | AM | 2000 | 2 | 152 | 130 | 131 | 63 | 26 | 21 | NA | 1 |
| 179 | AN | 2000 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | NA | 0 |
| 208 | AO | 2000 | 186 | 999 | 1003 | 912 | 482 | 312 | 194 | NA | 247 |
| 237 | AR | 2000 | 97 | 278 | 594 | 402 | 419 | 368 | 330 | NA | 121 |
| 266 | AS | 2000 | NA | NA | NA | NA | 1 | 1 | NA | NA | NA |

Table 9: Original TB dataset. Corresponding to each ‘m’ column for males, there is also an ‘f’ column for females, f1524, f2534 and so on. These are not shown to conserve space. Note the mixture of 0s and missing values (NA). This is due to the data collection process and the distinction is important for this dataset.
Table 10(a) shows the results of melting the TB dataset, and Table 10(b) shows the results of splitting the single column variable into two real variables: age and sex.
Table 10: Melting and splitting the TB dataset: (a) molten data; (b) molten data with the column variable split into age and sex (tables omitted).
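A sketch of these two steps, assuming a data frame tb laid out like Table 9; the value name cases and the fixed-position split are my choices, not the paper's:

```r
library(tidyr)
library(dplyr)

tb_tidy <- tb %>%
  # Melt: country and year are the colvars
  pivot_longer(cols = -c(country, year),
               names_to = "column", values_to = "cases",
               values_drop_na = TRUE) %>%
  # Split e.g. "m014" after the first character into sex = "m", age = "014"
  separate(column, into = c("sex", "age"), sep = 1)
```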
4.3.3 Variables are stored in both rows and columns
The most complicated form of messy data occurs when variables are stored in both rows and columns.
Table 11 shows daily weather data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for 5 months in 2010.
It has variables in
- individual columns (id, year, month)
- spread across columns (day, d1-d31)
- spread across rows (tmin, tmax) (minimum and maximum temperature).
- The element column is not a variable; it stores the names of variables.
| id | year | month | element | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MX17004 | 2010 | 1 | tmax | NA | NA | NA | NA | NA | NA | NA | NA |
| MX17004 | 2010 | 1 | tmin | NA | NA | NA | NA | NA | NA | NA | NA |
| MX17004 | 2010 | 2 | tmax | NA | 27.3 | 24.1 | NA | NA | NA | NA | NA |
| MX17004 | 2010 | 2 | tmin | NA | 14.4 | 14.4 | NA | NA | NA | NA | NA |
| MX17004 | 2010 | 3 | tmax | NA | NA | NA | NA | 32.1 | NA | NA | NA |
| MX17004 | 2010 | 3 | tmin | NA | NA | NA | NA | 14.2 | NA | NA | NA |
| MX17004 | 2010 | 4 | tmax | NA | NA | NA | NA | NA | NA | NA | NA |
| MX17004 | 2010 | 4 | tmin | NA | NA | NA | NA | NA | NA | NA | NA |
| MX17004 | 2010 | 5 | tmax | NA | NA | NA | NA | NA | NA | NA | NA |
| MX17004 | 2010 | 5 | tmin | NA | NA | NA | NA | NA | NA | NA | NA |

Table 11: Original weather dataset. There is a column for each possible day in the month. Columns d9 to d31 have been omitted to conserve space.
To tidy this dataset we first melt it with colvars id, year, month and the column that contains variable names, element. This yields Table 12(a). This dataset is mostly tidy, but we have two variables stored in rows: tmin and tmax, the type of observation. Fixing this requires the cast, or unstack, operation, which performs the inverse of melting by rotating the element variable back out into the columns (Table 12(b)). This form is tidy: there is one variable in each column, and each row represents a day’s observations.
Table 12: Tidying the weather dataset: (a) molten data; (b) tidy data with tmin and tmax cast back into columns (tables omitted).
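A sketch of the melt-then-cast sequence, assuming a data frame weather shaped like Table 11 (object and intermediate names are mine):

```r
library(tidyr)
library(dplyr)

weather_tidy <- weather %>%
  # Melt: colvars are id, year, month, element; d1..d31 become day/value pairs
  pivot_longer(cols = starts_with("d"),
               names_to = "day", values_to = "value",
               values_drop_na = TRUE) %>%
  mutate(day = as.integer(gsub("d", "", day))) %>%
  # Cast: rotate the element values (tmax, tmin) back out into columns
  pivot_wider(names_from = element, values_from = value)
```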
4.3.4 Multiple types in one table
During tidying, each type of observational unit should be stored in its own table.
The Billboard dataset described in Table 8 actually contains observations on two types of observational units: the song and its rank in each week.
This manifests itself through the duplication of facts about the song: artist and time are repeated for every song in each week.
The Billboard dataset needs to be broken down into two datasets: a song dataset which stores artist, song name and time, and a ranking dataset which gives the rank of the song in each week.
Table 13 shows these two datasets. You could also imagine a week dataset which would record background information about the week, maybe the total number of songs sold or similar demographic information.
Table 13: Normalized Billboard dataset split into a song table and a rank table (tables omitted).
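A sketch of this split with dplyr, assuming billboard_tidy from the earlier sketch; the song_id key is my addition for linking the two tables:

```r
library(dplyr)

# Song table: one row per song
song <- billboard_tidy %>%
  distinct(year, artist, track, time) %>%
  mutate(song_id = row_number())

# Rank table: one row per song-week, referring to songs by id
ranking <- billboard_tidy %>%
  left_join(song, by = c("year", "artist", "track", "time")) %>%
  select(song_id, date, week, rank)
```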
4.3.5 One type in multiple tables
It is also common for a single type of observational unit to be spread out over multiple tables or files. These tables and files are often split up by another variable, so that each represents a single year, person, or location. To combine them into a single tidy table (see the sketch after this list):
- Read the files into a list of tables.
- For each table, add a new column that records the original file name (because the file name is often the value of an important variable).
- Combine all tables into a single table.
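A sketch of these three steps in R; the directory and file-name pattern are hypothetical, and the year is recovered from the file name because it appears nowhere inside the files:

```r
library(dplyr)
library(purrr)
library(readr)

# Hypothetical layout: one CSV per year, e.g. data/yob1880.csv ... data/yob2020.csv
paths <- list.files("data", pattern = "^yob\\d{4}\\.csv$", full.names = TRUE)

combined <- map_dfr(paths, function(path) {
  read_csv(path) %>%
    mutate(year = as.integer(gsub("\\D", "", basename(path))))  # file name -> year
})
```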
4.3.5.1 List of tables
The example uses the 140 yearly baby name tables provided by the US Social Security Administration and combines them into a single table (Section 4.3.5.2).
Popularity in 1880
| Rank | Male name | Percent of total males | Female name | Percent of total females |
|---|---|---|---|---|
| 1 | John | 8.1546% | Mary | 7.2383% |
| 2 | William | 8.0507% | Anna | 2.6679% |
| 3 | James | 5.0060% | Emma | 2.0521% |
| 4 | Charles | 4.5169% | Elizabeth | 1.9866% |
| 5 | George | 4.3294% | Minnie | 1.7888% |
| 6 | Frank | 2.7382% | Margaret | 1.6167% |
| 7 | Joseph | 2.2230% | Ida | 1.5081% |
| 8 | Thomas | 2.1402% | Alice | 1.4487% |
| 9 | Henry | 2.0642% | Bertha | 1.3524% |
| 10 | Robert | 2.0397% | Sarah | 1.3196% |
Popularity in 2020
| Rank | Male name | Percent of total males | Female name | Percent of total females |
|---|---|---|---|---|
| 1 | Liam | 1.0734% | Olivia | 1.0014% |
| 2 | Noah | 0.9966% | Emma | 0.8898% |
| 3 | Oliver | 0.7725% | Ava | 0.7472% |
| 4 | Elijah | 0.7117% | Charlotte | 0.7426% |
| 5 | William | 0.6848% | Sophia | 0.7410% |
| 6 | James | 0.6689% | Amelia | 0.7255% |
| 7 | Benjamin | 0.6627% | Isabella | 0.6891% |
| 8 | Lucas | 0.6160% | Mia | 0.6372% |
| 9 | Henry | 0.5845% | Evelyn | 0.5394% |
| 10 | Alexander | 0.5543% | Harper | 0.5013% |
4.3.5.2 Single table
Baby names
| year | name | percent | sex |
|---|---|---|---|
| 1880 | John | 0.081541 | boy |
| 1880 | William | 0.080511 | boy |
| 1880 | James | 0.050057 | boy |
| 1880 | Charles | 0.045167 | boy |
| 1880 | George | 0.043292 | boy |
| 1880 | Frank | 0.02738 | boy |
| 1880 | Joseph | 0.022229 | boy |
| 1880 | Thomas | 0.021401 | boy |
| 1880 | Henry | 0.020641 | boy |
| 1880 | Robert | 0.020404 | boy |
| 1880 | Edward | 0.019965 | boy |
| 1880 | Harry | 0.018175 | boy |
| 1880 | Walter | 0.014822 | boy |
| 1880 | Arthur | 0.013504 | boy |
4.4 Tidy tools
Tidy tools are tools that take tidy datasets as input and return tidy datasets as output.
4.4.1 Manipulation
- Filter: subsetting or removing observations based on some condition.
- Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
- Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).
- Sort: changing the order of observations.
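A sketch of the four operations with dplyr, on a hypothetical tidy data frame df with columns group, weight, and volume:

```r
library(dplyr)

df %>%
  filter(!is.na(weight)) %>%                  # filter: drop incomplete observations
  mutate(density = weight / volume) %>%       # transform: derive a variable from others
  group_by(group) %>%
  summarise(mean_density = mean(density)) %>% # aggregate: collapse to one value per group
  arrange(desc(mean_density))                 # sort: reorder observations
```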
4.4.2 Visualization
- base plot()
- ggplot2
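For example, with a hypothetical tidy data frame df with columns x, y, and group, tidy variables map directly onto plot aesthetics:

```r
library(ggplot2)

# Each aesthetic is bound to one column of the tidy dataset
ggplot(df, aes(x = x, y = y, colour = group)) +
  geom_point()
```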
4.4.3 Modeling
- y ~ a + b + c * d
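Formula interfaces like this expect each term to be a column of a tidy data frame; a sketch with a hypothetical df:

```r
# Each variable in the formula is a column of the tidy dataset df
fit <- lm(y ~ a + b + c * d, data = df)
summary(fit)
```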