Chapter 8 Data Cleaning and Validation

8.1 Introduction

Data scientists, by some accounts, spend 80% of their time cleaning data. Ensuring that your data is of high quality is an essential step in the data science process, because the quality of your analysis, and thus of your conclusions, depends on it.

8.2 Data Quality

Statistics Canada Quality Guidelines–6th edition (2019)

8.3 Data Cleaning

Data cleansing – Wikipedia entry

8.4 Data Validation

Data validation – Wikipedia entry

Marco Di Zio, Nadežda Fursova, et al., Methodology for data validation 1.0 (revised edition June 2016), ESSnet ValiDat Foundation {PDF}

8.5 Using R

Mark van der Loo and Edwin de Jonge, Statistical Data Cleaning with Applications in R (Loo and Jonge 2018)

Edwin de Jonge and Mark van der Loo, 2013, An introduction to data cleaning with R, Discussion Paper, Statistics Netherlands {PDF}

Samuel E. Buttrey, Lyn R. Whitaker, 2017, A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R (Buttrey and Whitaker 2017)

8.6 R Packages

8.6.1 `{validate}`

CRAN: validate: Data Validation Infrastructure – “Declare data validation rules and data quality indicators; confront data with them and analyze or visualize the results. The package supports rules that are per-field, in-record, cross-record or cross-dataset. Rules can be automatically analyzed for rule type and connectivity.”

Vignette: Introduction to Validate

Articles:

Mark van der Loo and Edwin de Jonge, 2019, Data Validation Infrastructure for R, arXiv:1912.09759v1 {PDF}
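
The workflow named in the CRAN description (declare rules, confront data with them, analyze the results) can be sketched as follows. This is a minimal illustration, not an excerpt from the package documentation: the data frame `survey` and its columns are hypothetical, and the three rules are simple per-field checks chosen only to show how passes, fails and missing values are reported.

```r
library(validate)

# Hypothetical toy data (an assumption for illustration only)
survey <- data.frame(
  id     = c(1, 2, 3, 4),
  age    = c(34, -2, 51, NA),
  income = c(52000, 48000, NA, 61000)
)

# Step 1: declare the validation rules
rules <- validator(
  age >= 0,
  age <= 120,
  income > 0
)

# Step 2: confront the data with the rules
checked <- confront(survey, rules)

# Step 3: summarise the results; one row per rule, with counts of
# items checked, passes, fails and NA outcomes
rule_summary <- summary(checked)
print(rule_summary)
```

In this toy data, the negative age counts as a fail of the first rule, while the missing age is reported separately as an `NA` outcome rather than as a pass or a fail. Keeping the rules in their own object means the same rule set can be applied to later versions of the data.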

8.6.2 `{VIM}`

Joachim Schork, “Report Missing Values in Data Frame in R”
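
As a rough sketch of what a missing-value report can look like, the example below counts missing values per column using base R only. It is not necessarily the approach taken in the article linked above; the data frame `df` and the names `na_report` and `n_complete` are hypothetical and chosen for illustration.

```r
# Hypothetical toy data with some missing values (illustration only)
df <- data.frame(
  x = c(1, NA, 3, NA),
  y = c("a", "b", NA, "d"),
  z = c(TRUE, FALSE, TRUE, TRUE)
)

# is.na() returns a logical matrix; column sums give counts of missing
# values and column means give the share of missing values per variable
na_report <- data.frame(
  variable      = names(df),
  n_missing     = colSums(is.na(df)),
  share_missing = colMeans(is.na(df)),
  row.names     = NULL
)
print(na_report)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(df))
print(n_complete)
```

For a graphical view of the same information, the `{VIM}` package provides `aggr()`, which plots the amount of missingness per variable and per combination of variables.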

-30-

References

Buttrey, Samuel E., and Lyn R. Whitaker. 2017. A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R. Wiley. https://onlinelibrary.wiley.com/doi/book/10.1002/9781119080053.

Loo, Mark van der, and Edwin de Jonge. 2018. Statistical Data Cleaning with Applications in R. Wiley. https://onlinelibrary.wiley.com/doi/book/10.1002/9781118897126.