3.2 FAIR data principles

The FAIR data principles (Wilkinson et al. 2016) are:

  • Findability
  • Accessibility
  • Interoperability
  • Reusability

These principles are considered so important that the G20 leaders, at the 2016 Hangzhou summit, issued a statement endorsing the application of FAIR principles to research. However, confidentiality issues aside, they can be applied far more widely. The principles are intended to enhance the capability of computer systems to locate, access, integrate, and reuse data with little or no human intervention.

Apart from the need for a persistent, stable and available source, pointed at by a unique DOI¹, the principles also emphasise the role of metadata:

  • F2: data are described with rich metadata
  • A1: metadata are retrievable by their identifier [DOI] using a standardised communications protocol
  • I1: metadata use a formal, accessible, shared, and broadly applicable language for knowledge representation
  • R1.2: metadata are associated with detailed provenance
  • R1.3: metadata meet domain-relevant community standards

Metadata is data about data. It describes how a data set is organised and what the individual variables mean. For example, we might say that the variable Person ID comprises alpha-numeric characters (so it is a string) and that, to be valid, it must be (i) unique and (ii) exactly 8 characters in length. This information facilitates meaningful processing and analysis.
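As a minimal sketch (the data frame, its column name and the values are all hypothetical), these two validity rules can be checked directly in R:

    # Hypothetical data: three records, where the third ID is too short
    people <- data.frame(PersonID = c("AB12CD34", "ZZ90QX17", "AB12"))

    # Rule (i): Person IDs must be unique
    ids_unique <- !anyDuplicated(people$PersonID)

    # Rule (ii): exactly 8 alpha-numeric characters
    ids_wellformed <- grepl("^[A-Za-z0-9]{8}$", people$PersonID)

    ids_unique       # TRUE: no duplicates
    ids_wellformed   # TRUE TRUE FALSE: the third ID fails the length rule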

How we curate and share data is fundamental to meaningful data science, so it is a theme we will return to many times. Meaningful metadata is crucial to this process. Two sources, if you wish to read further, are the JavaTpoint tutorial and a tutorial on metadata in the context of extract-transform-load (ETL) systems.

3.2.1 A bad example of data sharing

The following example is drawn from a study on data quality problems with some widely shared NASA software defect data sets (Shepperd et al. 2013). Many researchers have been interested in using machine learning algorithms to classify software components as defective or not; for a systematic review of research in this area see Hall et al. (2011). At least 100 research papers were based on these publicly shared NASA² software defect data sets, not least because they were in the public domain and easy to access.

However, upon closer inspection we realised that the data sets had many problems! These included negative values for counts, floating-point values for counts, and multiple violations of referential integrity constraints. As an example (where LOC is lines of code):

TOTAL_LOC \(\geq\) COMMENT_LOC + BLANK_LOC + EXECUTABLE_LOC

By checking for these constraints we found many problems in the data sets and published some cleaned versions for future research. But a question arises about the value of the previous research that was based on incorrect data: how much were the results impacted? This again highlights the importance of data quality checking.
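To make this concrete, here is a small sketch in base R, using invented toy values, showing how each kind of problem can be exposed:

    # Invented toy data exhibiting the three kinds of problem described above
    defects <- data.frame(
      TOTAL_LOC      = c(100, 50, 80),
      COMMENT_LOC    = c( 20, -5, 10),   # a negative count
      BLANK_LOC      = c( 10,  5, 7.5),  # a floating-point count
      EXECUTABLE_LOC = c( 90, 40, 60)
    )

    # Referential integrity: does each row satisfy the LOC constraint?
    with(defects, TOTAL_LOC >= COMMENT_LOC + BLANK_LOC + EXECUTABLE_LOC)
    # FALSE TRUE TRUE -- the first row violates it (20 + 10 + 90 > 100)

    # Any negative or non-integer counts in any column?
    sapply(defects, function(x) any(x < 0))
    sapply(defects, function(x) any(x != floor(x)))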

In terms of implementing this kind of integrity checking, R has a very useful package, {validate}, that allows rules to be specified and applied. You can get more information from here.
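As a sketch of how the LOC constraints above might be declared with {validate}, reusing the toy defects data frame from earlier (the rule set is our illustration, not the rules from the original study):

    library(validate)

    # Declare the integrity rules once, as data
    rules <- validator(
      loc_sum      = TOTAL_LOC >= COMMENT_LOC + BLANK_LOC + EXECUTABLE_LOC,
      non_negative = COMMENT_LOC >= 0,
      integer_only = BLANK_LOC == floor(BLANK_LOC)
    )

    # Confront the data with the rules and summarise the pass/fail counts
    out <- confront(defects, rules)
    summary(out)

A nice design feature is that the rules are themselves data, so they can be stored, shared and reused; violating(defects, rules) then extracts the offending records for inspection or cleaning.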

References

Hall, Tracy, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. 2011. “A Systematic Literature Review on Fault Prediction Performance in Software Engineering.” IEEE Transactions on Software Engineering 38 (6): 1276–1304.
Shepperd, Martin, Qinbao Song, Zhongbin Sun, and Carolyn Mair. 2013. “Data Quality: Some Comments on the NASA Software Defect Datasets.” IEEE Transactions on Software Engineering 39 (9): 1208–15.
Wilkinson, Mark D, Michel Dumontier, Isbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 1–9.

  1. A DOI, or digital object identifier, is a unique link to some digital resource, e.g., a document, and is governed by the International DOI Foundation.↩︎

  2. To be clear, we are not blaming NASA for the errors; most likely they arose during subsequent merging and pre-processing by researchers wishing to support their communities by engaging in data sharing. Unfortunately, it is all too easy to make mistakes, and these are then propagated.↩︎