3.2 FAIR data principles
The FAIR data principles (Wilkinson et al. 2016) are:
- Findability
- Accessibility
- Interoperability
- Reusability
These are considered so important that the G20 leaders, at the 2016 G20 Hangzhou summit, issued a statement endorsing the application of FAIR principles to research. However, confidentiality issues excepted, they can be considered to apply far more widely. The principles are intended to encourage or enhance the capability of computer systems to locate, access, integrate, and reuse data with little or no human intervention.
Apart from the need for a persistent, stable and available source pointed at by a unique DOI[^30], the principles also emphasise the role of meta-data:
- F2: data are described with rich metadata
- A1: meta-data are retrievable by their identifier [DOI] using a standardized communications protocol
- I1: meta-data use a formal, accessible, shared, and broadly applicable language for knowledge representation
- R1.2: meta-data are associated with detailed provenance
- R1.3: meta-data meet domain-relevant community standards
Meta-data is data about data. Meta-data describes how the data set is organised and the meanings of individual variables. For example, we might say that the variable Person ID comprises alpha-numeric characters (so it is a string) and, to be valid, it must be (i) unique and (ii) exactly 8 characters in length. This information facilitates meaningful processing and analysis.
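As a minimal sketch, such validity rules can be checked mechanically, for instance with the R package {validate} (which we return to in Section 3.2.1). The column name `Person_ID` and the toy data frame below are assumptions for illustration only:

```r
library(validate)

# Hypothetical rules for the Person ID example: every value must be
# unique and exactly 8 characters long.
rules <- validator(
  is_unique(Person_ID),
  nchar(Person_ID) == 8
)

# Toy data: the second ID is too short; the third duplicates the first.
people <- data.frame(Person_ID = c("AB12CD34", "XY99", "AB12CD34"))

summary(confront(people, rules))
```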
How we curate and share data is fundamental to meaningful data science, so it is a theme we will return to many times. Meaningful meta-data is crucial to this process. Two sources, if you wish to read further, are the JavaTpoint tutorial and a tutorial on meta-data in the context of extract-transform-load (ETL) systems.
3.2.1 A bad example of data sharing
The following example is drawn from a study of data quality problems in some widely shared NASA software defect data sets (Shepperd et al. 2013). Many researchers have been interested in using machine learning algorithms to classify software components as defective or not; for a systematic review of research in this area see Hall et al. (2011). At least 100 research papers were based on these publicly shared NASA[^31] software defect data sets, not least because they were in the public domain and easy to access.
However, upon closer inspection we realised that the data sets had many problems! These included negative values for counts, floating point values for counts, and multiple violations of referential integrity constraints. As an example (where LOC is lines of code):
\[
\text{TOTAL\_LOC} \geq \text{COMMENT\_LOC} + \text{BLANK\_LOC} + \text{EXECUTABLE\_LOC}
\]
By checking for these constraints we found many problems in the data sets and published cleaned versions for future research. But a question arises about the value of the previous research that was based on the incorrect data: how much were those results affected? This again highlights the importance of data quality checking.
In terms of implementing this kind of integrity checking, the R package {validate} is very useful: it allows rules to be specified and applied. You can get more information here.
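As a hedged sketch, the LOC constraint above, together with basic sanity checks on counts, might be expressed as {validate} rules. The column names are assumed to match the defect data set being checked, and the toy data frame is for illustration only:

```r
library(validate)

# Rules encoding the referential integrity constraint above, plus basic
# sanity checks on counts; column names are assumed to match the data set.
rules <- validator(
  TOTAL_LOC >= COMMENT_LOC + BLANK_LOC + EXECUTABLE_LOC,
  TOTAL_LOC >= 0,                  # counts cannot be negative
  TOTAL_LOC == floor(TOTAL_LOC)    # counts must be whole numbers
)

# Toy data: the second row violates the LOC constraint (50 < 40 + 15 + 30).
defects <- data.frame(
  TOTAL_LOC      = c(100, 50),
  COMMENT_LOC    = c(20, 40),
  BLANK_LOC      = c(10, 15),
  EXECUTABLE_LOC = c(60, 30)
)

cf <- confront(defects, rules)
summary(cf)              # pass/fail counts per rule
violating(defects, cf)   # the offending records
```

The `summary()` reports, for each rule, how many records pass, fail, or cannot be evaluated (e.g., because of missing values), which makes it easy to audit a shared data set before analysing it.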
References
[^30]: A DOI or digital object identifier is a unique link to some digital resource, e.g., a document, and is governed by the International DOI Foundation.
[^31]: To be clear, we are not blaming the errors on NASA but most likely on subsequent merging and pre-processing by researchers wishing to support their communities by engaging in data sharing. Unfortunately, it is all too easy to make mistakes, and these are then propagated.