Chapter 8 Data Cleaning and Validation
8.1 Introduction
Data scientists, by some accounts, spend 80% of their time cleaning data. Ensuring that your data is of high quality, so that your analysis and thus your conclusions are also of high quality, is an essential step in the data science process.
8.2 Data Quality
Statistics Canada Quality Guidelines–6th edition (2019)
8.3 Data Cleaning
Data cleansing–Wikipedia entry
8.4 Data Validation
Data validation–Wikipedia entry
Marco Di Zio, Nadežda Fursova, et al., Methodology for data validation 1.0 (revised edition June 2016), ESSnet ValiDat Foundation {PDF}
8.5 Using R
Mark van der Loo and Edwin de Jonge, Statistical Data Cleaning with Applications in R (Loo and Jonge 2018)
Edwin de Jonge and Mark van der Loo, 2013, An introduction to data cleaning with R, Discussion Paper, Statistics Netherlands {PDF}
Samuel E. Buttrey, Lyn R. Whitaker, 2017, A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R (Buttrey and Whitaker 2017)
8.6 R Packages
8.6.1 {validate}
CRAN: validate: Data Validation Infrastructure – “Declare data validation rules and data quality indicators; confront data with them and analyze or visualize the results. The package supports rules that are per-field, in-record, cross-record or cross-dataset. Rules can be automatically analyzed for rule type and connectivity.”
Vignette: Introduction to Validate
Articles:
- Edwin de Jonge and Mark van der Loo, 2019, Data Validation Infrastructure for R, arXiv:1912.09759v1 {PDF}
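The declare, confront, and summarize workflow described in the CRAN entry can be sketched in a few lines. This is a minimal illustration rather than a recommended rule set: the `people` data frame, its columns, and the three rules are hypothetical, invented here to show the shape of the workflow. The example assumes the {validate} package is installed.

```r
# Assumes the validate package is installed: install.packages("validate")
library(validate)

# Hypothetical toy data with a few deliberate problems
people <- data.frame(
  id     = c(1, 2, 3, 4),
  age    = c(34, -2, 51, NA),      # one negative value, one missing value
  height = c(172, 180, 165, 158),  # centimetres
  weight = c(70, 82, 0, 55)        # kilograms; zero is not plausible
)

# Declare the validation rules (per-field rules in this sketch)
rules <- validator(
  age_nonneg   = age >= 0,
  weight_pos   = weight > 0,
  height_range = height >= 50 & height <= 250
)

# Confront the data with the rules
result <- confront(people, rules)

# One row per rule, with counts of passes, fails, and missing results
summary(result)
```

The summary should flag the negative age and the zero weight as failures, and count the missing age separately from the failures. Keeping the rules in a `validator` object, apart from the data, means the same rules can be reused on later versions of the data set.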