Chapter 2 Foundations

In this chapter:

  • The data preparation process

  • The principles of data quality, and how “dirty data” fits within data quality

  • Understanding what makes data “clean” or “dirty”, and why context matters

  • Understanding tidy data principles

2.1 The data preparation process

As we get started on the path of preparing our data, we should think first about where we want to go.

Ellis and Leek’s article “How to Share Data for Collaboration” (Ellis and Leek 2017) is aimed at scientists (“data generators”) who have collected the data and are preparing it for further analysis in an academic environment—the authors speak about “preparing data for a statistician.”

While Ellis and Leek have a specific context in mind, the principles in their paper have broad applicability, and the practices are essential in any environment, including a business, government, or non-profit organization. In those contexts, the roles, responsibilities, and division of labour in the workflow are often different than in an academic environment. The data collection and storage might exist to support a business need, with the use of the data for business intelligence a secondary benefit. As a result, there is often a middle person who isn’t involved in the collection of the data but who prepares it for modelling, visualization, and reporting. That same person might also be responsible for the modelling, visualization, and reporting.

At the end of the data preparation phase, before visualization and modelling can start, there should be:

1. The raw data.

2. A tidy dataset (Wickham 2014).

3. A code book describing each variable and its values in the tidy dataset.

4. An explicit and exact recipe used by the researcher to go from 1 to 2 to 3.

Over the next few chapters in this book, we will look at examples of creating a tidy dataset (including cleaning the data), creating the code book, and documenting our steps along the way, thereby creating an “explicit and exact recipe.”

Whether we are preparing data as part of our own analytic project or to make it available for sharing, these should be our goals.

2.1.1 Elements of an iterative process

The list below describes the elements of the process you will go through in preparing your data for analysis and modelling. These are not “steps”: data preparation is an iterative process, not a linear or even necessarily sequential one.

  1. Save a copy of the original data. Do not make any edits, and consider making the file “read-only”.

  2. Start a new blank “readme” file, and in that file record a few key points, including the project’s research objective and information about the data file’s origin. You can also draft an outline of the steps that you think you’re going to take in the preparation process, as the first contribution to a literate programming approach.

  3. Import the data into your R environment.

  4. Explore and validate the data, assessing the structure, looking for missing values and other “dirty” elements.

  5. Clean the data. (And then validate to make sure your cleaning has been successful.)

This whole process is iterative (you will be adding to the readme file at every step), and the individual steps themselves are iterative. For example, you may revisit your first import code as you evaluate and assign variable types through the various arguments in the read function, as in the sketch below. Or your first cleaning code may address one dirty element, only to expose a second that you hadn’t previously identified.
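As a minimal sketch, here is how a few of these steps might look in R. The file name and column names are hypothetical, and the details will vary with your data:

```r
library(readr)
library(dplyr)

# 1. Keep an untouched copy of the original file, and make it read-only.
file.copy("downloads/survey_2020.csv", "data-raw/survey_2020.csv")
Sys.chmod("data-raw/survey_2020.csv", mode = "0444")  # read-only

# 3. A first import, letting {readr} guess the column types.
survey_raw <- read_csv("data-raw/survey_2020.csv")

# 4. Explore the structure; suppose we find that `date_of_birth`
#    has been read as a character string.
glimpse(survey_raw)

# Revisit the import, this time assigning the variable types explicitly
# through the `col_types` argument of the read function.
survey_raw <- read_csv(
  "data-raw/survey_2020.csv",
  col_types = cols(
    respondent_id = col_integer(),
    date_of_birth = col_date(format = "%Y-%m-%d"),
    province      = col_character()
  )
)
```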

In addition, you will be using your data wrangling, manipulation, and visualization skills at various points along the way.

2.2 Data quality

Part of the data preparation process is to ask, “What is the data’s quality?”

One resource that is useful to frame your thinking on this is Statistics Canada’s Quality Guidelines (Statistics Canada 2019). Like other official statistical agencies around the world, Statistics Canada’s reputation is staked on making high quality data available to the widest possible audience. They have identified six elements of data quality.

  • Relevance: The degree to which the data meets the user’s needs and relates to the issues that the user cares about.

  • Timeliness and punctuality: The delay between the information reference period and the date when the data becomes available.

  • Accuracy and reliability: Accuracy “is the degree to which the information correctly describes the phenomena it was designed to measure.”

  • Accessibility and clarity: “The ease with which users can learn that the information (including metadata) exists, find it, view it and import it into their own work environment.”

  • Interpretability: “The availability of supplementary information and metadata needed to interpret and use statistical information appropriately.”

  • Coherence and comparability: “The degree to which it can be reliably combined and compared with other statistical information within a broad analytical framework over time.”

In our data preparation process, we want to ensure our own work is accessible and interpretable; this is the motivation behind documentation.

Another typology of “data quality” is found in Wang, Reddy, and Kon (1995). High quality data is:

  • Accessible

  • Interpretable

  • Useful

  • Believable

“Dirty data” is data that falls short on the believable dimension; in particular, on whether the data are complete, consistent, and accurate. These categories are not mutually exclusive; a variable might be simultaneously inconsistent and inaccurate.

Complete

For our purposes, “complete” means that no values are missing from any record in the data (the data are internally complete).

“Complete” does not mean that every possible member of the population is represented in the data. With a well-designed sample, a subset of the population can provide an accurate set of measures about the population, and it is possible to determine whether the records are an accurate representation of the whole.
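A quick first check of internal completeness in R is to count the missing values in each variable. A minimal sketch, assuming a data frame named survey_raw:

```r
# Count the NA values in each variable of the data frame.
sapply(survey_raw, function(x) sum(is.na(x)))
```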

Consistent

We will consider a measure to be “consistent” if the same value is reported in the same way every time.

Some examples of consistency:

  • Units are consistent. Temperature is one example: ensure that the values are consistently reported in degrees Celsius, not a mix of Fahrenheit and Kelvin. Another example is a survey or form with an international audience, where respondents might enter their salary in the values of their local currency.

  • Spelling is consistent. My home province in Canada is “British Columbia”, but it is often abbreviated to “B.C.” or “BC”. Or consider the 57 different ways that “Philadelphia” was spelled in the US Paycheck Protection Program (PPP) applications (Au 2020a). (A sketch of standardizing such values follows this list.)

  • In some cases, the mode of data collection can introduce differences in the value recorded. This can apply to everything from the variability of scientific instruments such as air quality sensors (Khreis et al. 2022) to how people respond to surveys conducted in different media (Abeysundera 2015; Holbrook, Green, and Krosnick 2003; St-Pierre and Béland 2004).
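Spelling inconsistencies like the “British Columbia” example above can often be repaired with a simple recoding step. Here is a minimal sketch using {dplyr}; the data frame and its values are hypothetical:

```r
library(dplyr)

# Hypothetical records with three spellings of the same province.
addresses <- tibble(
  city     = c("Victoria", "Vancouver", "Kelowna"),
  province = c("B.C.", "BC", "British Columbia")
)

# Map the variant spellings onto a single consistent value.
addresses <- addresses |>
  mutate(province = case_when(
    province %in% c("B.C.", "BC") ~ "British Columbia",
    TRUE ~ province  # leave all other values unchanged
  ))
```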

Accurate

When we say “accurate”, we mean that the value recorded in our data is the value measured.

Some examples of inaccurate data are:

  • The use of a default (or sentinel) value as a placeholder where the true value is unknown or missing. This is sometimes built into the data collection or database software, leading to values that should have been recorded as “unknown” being entered as the default. It can also be the consequence of a data entry validation process that requires an entry (that is, data entry cannot continue until the field has a value) or an entry in a particular format, such as a date. In these cases, a sentinel value can be entered instead of an explicit “NA”.

Dr David Lawrence, director of the safety-literature database the SafetyLit Foundation…tells me that ‘in most US states the quality of police crash reports is at best poor for use as a research tool. … Data-quality checks were rare and when quality was evaluated it was found wanting. For example, in Louisiana for most crashes in the 1980s most of the occupants were males who were born on January 1st, 1950. Almost all of the vehicles involved in crashes were the 1960 model year.’ Except they weren’t. These were just the default settings. (Perez 2019, 190)

  • A data entry error. Perhaps the most common are typographical errors, where the wrong value is entered. Another type is what we might call a “variable transposition error” (where the value is transposed one column over, something that happens all too often with address records, where the city name might end up in the state/province column).

  • Values are transformed automagically by software. Microsoft’s Excel spreadsheet program is the most famous offender, converting many non-date values into date format and assigning new values to store the data. For example, entering the character string “SEPT1” gets converted to September 1st of the current year. There is documented evidence that this software behaviour has caused errors in gene research (Zeeberg et al. 2004; Abeysooriya et al. 2021).

  • Dates may be stored as a numeric value (or “serial number”) representing the number of days elapsed from a fixed starting point, and that starting point varies by operating system. For example, Excel for Windows uses January 1, 1900 as day one, while earlier Macintosh computers, and by extension Excel for Macintosh, used January 1, 1904 as the start date. Thus the same date would be stored as different values (see “Differences between the 1900 and the 1904 date system in Excel”, https://docs.microsoft.com/en-us/office/troubleshoot/excel/1900-and-1904-date-system). A short sketch of this appears after this list.

  • Contradiction. For example, imagine that a single individual is entered into two databases, but when we compare them we see that the date of birth differs, perhaps due to a non-ISO 8601 entry: “07-08-79” and “08-07-79” contain the same digits, but one could be mm-dd-yy and the other dd-mm-yy. We just don’t know which one is correct, or whether one is a typo.

  • Cultural ignorance (for want of a better term).

“Prawo Jazdy” was a supposed Polish national who was listed by the Garda Síochána in a police criminal database as having committed more than 50 traffic violations in Ireland. A 2007 memorandum stated that an investigation revealed prawo jazdy [ˈpra.vɔ ˈjaz.dɨ] to be Polish for ‘driving licence’, with the error arising due to officers mistaking the phrase, printed on Polish driving licences, to be a personal name while issuing traffic tickets. (Wikipedia contributors 2021)
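Picking up the sentinel value and date examples above, here is a minimal sketch in R. The values are illustrative, and {lubridate} is just one of several packages that can parse dates:

```r
# A sentinel value such as -99 can be converted to an explicit missing value.
ages <- c(34, 61, -99, 47)
dplyr::na_if(ages, -99)
#> [1] 34 61 NA 47

# The same Excel serial number yields different dates under the two systems.
# For the 1900 date system, the matching origin in R is "1899-12-30",
# which absorbs Excel's treatment of 1900 as a leap year.
as.Date(44562, origin = "1899-12-30")
#> [1] "2022-01-01"
as.Date(44562, origin = "1904-01-01")  # the 1904 date system
#> [1] "2026-01-02"

# An ambiguous date string parses differently depending on the assumed order.
lubridate::dmy("07-08-79")
#> [1] "1979-08-07"
lubridate::mdy("07-08-79")
#> [1] "1979-07-08"
```

The {janitor} package’s excel_numeric_to_date() function provides a convenient wrapper for the serial number conversion.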

2.2.1 Tidy data

When we are cleaning our data, we should also consider the structure. The goal should be a tidy structure, one that meets the following three principles or rules:

  • Each variable must have its own column.

  • Each observation must have its own row.

  • Each value must have its own cell. (Wickham 2014)

In many instances, achieving a tidy structure may require reshaping the data, using the pivot_longer() and pivot_wider() functions from the {tidyr} package (Wickham 2021b). In others, the data may have multiple variables in each cell (such as the components of an address) which need to be split into a separate column for each variable.

Other common structural problems that fall outside the four broad data quality categories include:

  1. Column headers are values instead of variable names. A frequently encountered example is units of time, as in the table below, where the “values” are the years. Tables structured in this way can only display the values of one variable; in this case, it is the population of the country.
country         1952          2007
Canada          14,785,584    33,390,141
United States   157,553,000   301,139,947

A preferable structure would be to have a variable “year” and another “population”. This tidy structure also permits other variables measured during those years to be incorporated, rather than displayed in a separate table. (A sketch of this reshaping appears after this list.)

country         year   population    life expectancy
Canada          1952   14,785,584    68.75
Canada          2007   33,390,141    80.65
United States   1952   157,553,000   68.44
United States   2007   301,139,947   78.24
  2. Variable names that are duplicated (this is possible in some data storage formats, including plain-text and Excel files).

  3. Inappropriate data types, such as numbers or logical values stored as character strings.

  4. Multiple data types in a column. For example, some records are numbers and some are character strings. The values themselves could be entirely correct, but if some number entries include commas as thousands separators, the variable will be read and stored as a character type.
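Here is a minimal sketch of reshaping the population table above into its tidy form with {tidyr}’s pivot_longer(). The last lines also show how {readr}’s parse_number() can rescue numbers stored as character strings because of thousands separators:

```r
library(tidyr)
library(dplyr)

# The wide table from above: the column headers "1952" and "2007" are
# values of a year variable, not variable names.
population_wide <- tribble(
  ~country,         ~`1952`,    ~`2007`,
  "Canada",          14785584,   33390141,
  "United States",   157553000,  301139947
)

# Move the year headers into a `year` column and the cell contents into
# a `population` column.
population_tidy <- population_wide |>
  pivot_longer(
    cols = c(`1952`, `2007`),
    names_to = "year",
    values_to = "population"
  ) |>
  mutate(year = as.integer(year))  # the year names arrive as strings

# Numbers stored as character strings because of thousands separators
# can be parsed back into numeric values.
readr::parse_number(c("14,785,584", "33,390,141"))
#> [1] 14785584 33390141
```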

2.3 Cleaning the data

If the data fails to meet our standards of quality, we need to clean the data. Which doesn’t sound like a lot of fun. Didn’t we want to be data scientists or business intelligence experts or academic scientists uncovering the insights hiding in the data? Don’t we want to be doing analysis?

There is a strong argument to be made that the process of cleaning data is a fundamental part of the analytic process, or, put another way, that any analytic process requires data cleaning.

Randy Au has written: “The act of cleaning data imposes values/judgments/interpretations upon data intended to allow downstream analysis algorithms to function and give results. That’s exactly the same as doing data analysis. In fact, ‘cleaning’ is just a spectrum of reusable data transformations on the path towards doing a full data analysis.” (Au 2020a)

At this point we should ask: what are we doing when we say we are “cleaning the data”? And how can we confirm that it has been “cleaned” in the way that we have defined?

The first challenge: How do we find the things that are problematic with our data?

The second challenge: What can and should we do about them?

We will see some of these challenges and solutions in the next few chapters, as part of the data import process. We will also return to explicitly address these challenges in Chapters 11 and 12.

References

Abeysooriya, Mandhri, Megan Soria, Mary Sravya Kasu, and Mark Ziemann. 2021. “Gene Name Errors: Lessons Not Learned.” PLOS Computational Biology 17 (7): 1–13. https://doi.org/10.1371/journal.pcbi.1008984.
Abeysundera, Melanie. 2015. “Using Total Survey Error to Study Mode Effect and Other Applications.” In Proceedings of the Statistical Society of Canada Annual Meeting, Survey Methods Section.
Au, Randy. 2020a. “Data Cleaning IS Analysis, Not Grunt Work.” https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt.
Ellis, Shannon E., and Jeffrey T. Leek. 2017. “How to Share Data for Collaboration.” The American Statistician 72 (1): 53–57. https://doi.org/10.1080/00031305.2017.1375987.
Holbrook, Allyson L., Melanie C. Green, and Jon A. Krosnick. 2003. “Telephone Versus Face-to-Face Interviewing of National Probability Samples with Long Questionnaires: Comparisons of Respondent Satisficing and Social Desirability Response Bias.” Public Opinion Quarterly 67 (1): 79–125. https://doi.org/10.1086/346010.
Khreis, H, J Johnson, K Jack, B Dadashova, and ES Park. 2022. “Evaluating the Performance of Low-Cost Air Quality Monitors in Dallas, Texas.” Int J Environ Res Public Health 19 (3): 1647. https://doi.org/10.3390/ijerph19031647.
Perez, Caroline Criado. 2019. Invisible Women: Data Bias in a World Designed for Men. Abrams Press.
Statistics Canada. 2019. “Statistics Canada Quality Guidelines.” 12-539-X. Sixth edition. Statistics Canada. https://www150.statcan.gc.ca/n1/pub/12-539-x/12-539-x2019001-eng.htm.
St-Pierre, M., and Y. Béland. 2004. “Mode Effects in the Canadian Community Health Survey: A Comparison of CAPI and CATI.” In Proceedings of the American Statistical Association Meeting, Survey Research Methods.
Wang, Richard Y., M. P. Reddy, and Henry B. Kon. 1995. “Toward Quality Data: An Attribute-Based Approach.” Decision Support Systems 13 (3): 349–72.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23. https://doi.org/10.18637/jss.v059.i10.
———. 2021b. tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.
Wikipedia contributors. 2021. “Prawo Jazdy (Alleged Criminal) — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/wiki/Prawo_Jazdy_(alleged_criminal).
Zeeberg, Barry R., Joseph Riss, David W. Kane, Kimberly J. Bussey, Edward Uchio, W. Marston Linehan, J. Carl Barrett, and John N. Weinstein. 2004. “Mistaken Identifiers: Gene Name Errors Can Be Introduced Inadvertently When Using Excel in Bioinformatics.” BMC Bioinformatics 5 (80). https://doi.org/10.1186/1471-2105-5-80.
