How can we be more FAIR in science

A look at the A. fumigatus field

Genetics

Mariana Santos Silva

Genetics

Anna Fensel

AI & Data science

How and Why of data sharing

Have you ever been confused about data?

How can you plot this?

Are we missing data?

Do the colours mean anything?

In what units were the concentrations measured? Which coordinate system are we using?

Maybe you cannot access it?

What do we find?

Most research data is \(\dots\)

  • Stored mainly in Excel formats;

  • Fickle;

  • Impossible to validate;

  • In the best case, only well described for humans.

Consequences?

  • Delayed research;

  • Critical errors in data analysis [1];

  • Increased barrier to entry for labs that lack resources;

  • Impossible questions (missing metadata);

  • For important data this may directly affect people's lives.

Found requirements for data standardisation

What requirements did I find?

  • We need a knowledge base \(\rightarrow\) ASPAR_KR.

  • Excel documents should be central to the solution.

  • A solution should not require too much technical background.

  • It must integrate with existing standards such as ISA, MiXS and JERM.

What can make your work more FAIR?

The FAIRDS

  • Validates Excel sheets against ENA standards.

  • Converts them to a stable and scalable format.

  • Allows addition of new standards via Excel sheets.
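To make the validation step concrete, here is a minimal sketch of checking one metadata row against a rule set. It is not how FAIRDS is implemented: the field names, the regular expressions and the `validate_row` helper are all assumptions made up for illustration, and the rules are not taken from ENA or MIxS.

```python
import re

# Hypothetical rule set: field name -> (required?, regex the value must match).
# Illustrative only; not taken from ENA, MIxS, or FAIRDS.
RULES = {
    "sample_id": (True, r"[A-Za-z0-9_]+"),
    "collection_date": (True, r"\d{4}-\d{2}-\d{2}"),
    "latitude": (True, r"-?\d{1,2}(\.\d+)?"),
    "longitude": (True, r"-?\d{1,3}(\.\d+)?"),
    "notes": (False, r".*"),
}


def validate_row(row):
    """Return a list of human-readable problems found in one metadata row."""
    problems = []
    for field, (required, pattern) in RULES.items():
        value = (row.get(field) or "").strip()
        if not value:
            if required:
                problems.append(f"{field}: missing required value")
            continue
        if not re.fullmatch(pattern, value):
            problems.append(f"{field}: {value!r} does not match {pattern}")
    # Flag columns the rule set does not know about.
    for field in row:
        if field not in RULES:
            problems.append(f"{field}: unknown column")
    return problems


# Made-up example rows.
good = {"sample_id": "trap_01", "collection_date": "2023-05-01",
        "latitude": "51.98", "longitude": "5.91"}
bad = {"sample_id": "trap 02", "collection_date": "01/05/2023",
       "latitude": "51.98", "colour": "red"}

print(validate_row(good))
for problem in validate_row(bad):
    print(problem)
```

The point of the sketch is that the rules live in data rather than in code, which is roughly what adding a new standard via a sheet amounts to.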

Made by Nijsse et al. [2].

Jasper Koehorst

Bart Nijsse¹

Peter J Schaap

FAIR enough, but what does it do?

Going from Excel \(\rightarrow\) RDF (Resource Description Framework), written as Turtle (Terse RDF Triple Language).

How does it work?

  • Experimental design is part of the dataset.

  • Minimum information standards are used.

  • New templates can be introduced.

  • You make Excel templates.
  • Fill these in with your data.
  • Upload them; FAIRDS produces RDF.
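The sheet-to-RDF step can be sketched in a few lines. This is only an illustration under stated assumptions: the sheet is assumed to be exported as CSV, the `http://example.org/` namespace and the column names are hypothetical, and `rows_to_turtle` is not a FAIRDS function. The real tool also validates and attaches proper ontology terms.

```python
import csv
import io

# Hypothetical namespace and sample sheet; not FAIRDS's real vocabulary or output.
BASE = "http://example.org/"
SHEET = """sample_id,city,start_date,end_date
trap_01,Arnhem,2023-05-01,2023-05-08
trap_02,Nijmegen,2023-05-01,2023-05-08
"""


def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_turtle(csv_text, id_column="sample_id"):
    """Turn each row into one subject; every other column becomes a predicate."""
    lines = [f"@prefix ex: <{BASE}> .", ""]
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"ex:{row[id_column]}"
        pairs = [(col, val) for col, val in row.items()
                 if col != id_column and val]
        for index, (col, val) in enumerate(pairs):
            end = " ." if index == len(pairs) - 1 else " ;"
            lead = subject if index == 0 else "   "
            lines.append(f'{lead} ex:{col} "{escape_literal(val)}"{end}')
        lines.append("")
    return "\n".join(lines)


print(rows_to_turtle(SHEET))
```

Each row becomes one subject and each remaining column one predicate, so the same sheet can later be queried with SPARQL instead of being read by eye.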

A quick demo of the FAIRDS/ASPAR_KR

Imagine the following situation.

  • A researcher takes air samples in Arnhem and Nijmegen.
  • He wants to know whether the resistance fraction is higher in Arnhem or in Nijmegen.
  • He uses the air-sampling method of Hylke et al. [3] with delta traps.

The delta trap method – Image by Bo Briggeman
  • Per city, 4 locations are sampled.

  • Using the two-layer culture…

    • Strips are grown on Flamingo agar…
    • And Flamingo agar with ITR.
  • Let's see how to enter this in FAIRDS.
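Before entering anything, it helps to enumerate what the design implies. The sketch below lists one row per culture for the situation described above (2 cities, 4 locations each, 2 media). The column names and the `culture_id` scheme are hypothetical and are not a FAIRDS template.

```python
from itertools import product

cities = ["Arnhem", "Nijmegen"]
locations = [1, 2, 3, 4]  # four sampling locations per city
# Hypothetical short labels for the two media named on the slide.
media = {"control": "Flamingo agar", "ITR": "Flamingo agar with ITR"}

# One row per culture: the strip from every location is grown on both media.
design = [
    {
        "city": city,
        "location": location,
        "medium": medium,
        "culture_id": f"{city}_{location}_{label}",
    }
    for city, location, (label, medium) in product(cities, locations, media.items())
]

print(len(design))
for row in design[:2]:
    print(row)
```

Writing the design out this way makes missing rows visible: any culture that is in the design but absent from the filled-in sheet is a gap in the data.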

Map data from the OpenStreetMap project [4].

The data set to be FAIRified.

The FAIRification programme.

The FAIR data, how can we use this?

A use case: verifying data

Hylke’s air sampling data

PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX sf: <http://www.opengis.net/ont/sf#>
PREFIX cats: <http://cats.org/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX uom: <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX fair: <http://fairbydesign.nl/ontology/>
PREFIX jerm: <http://jermontology.org/ontology/JERMOntology#>
SELECT *
WHERE {
    
    # Get properties of the DeltaTrap.
    ?deltaTrap fair:packageName 'DeltaTrap' ;
        <https://w3id.org/mixs/terms/0000011> ?arrival_date ;
         geo:hasGeometry/geo:asWKT ?point .

    ?observationalUnit jerm:hasPart ?deltaTrap ;
                schema:identifier ?obsId .


    # Get properties of the culture
    ?deltaTrap fair:derives ?twoLayerCulture ;
        fair:start_date ?start_date ;
        fair:end_date ?end_date .
    # Assumption: the analysis date belongs to the culture; on ?deltaTrap the
    # same predicate would only repeat ?arrival_date.
    ?twoLayerCulture <https://w3id.org/mixs/terms/0000011> ?analysis_date .

    # How long were the seals exposed to the air?
    # Note: date subtraction is an engine extension, not standard SPARQL 1.1.
    BIND(day(?end_date - ?start_date) as ?air_exposure_days)
    # What is the transfer time?
    BIND(day(?arrival_date - ?end_date) as ?transfer_time)
    # Time until analysis after arrival?
    BIND(day(?analysis_date - ?arrival_date) as ?time_to_analysis)

    # Distance from WUR
    SERVICE <https://query.wikidata.org/sparql> {
        # What things are a municipality?
        ?municipality wdt:P31 wd:Q2039348.
        # What things have a place?
        ?municipality wdt:P625 ?placeOfInterest .
        # Take only the thing that is near the place of interest.
        FILTER(?municipality = wd:Q1305) . # Arnhem
    }
    BIND (geof:distance(?point, ?placeOfInterest, uom:kilometre) as ?d_km)
}
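The derived columns in the query above (exposure time, transfer time, time to analysis, distance) can be cross-checked outside the triple store. The sketch below does so for one record. All dates and coordinates are made up, the field names only mirror the query variables, and the haversine formula on a sphere is a stand-in that may differ slightly from whatever `geof:distance` computes in a given engine.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt


def days_between(earlier, later):
    """Whole days from `earlier` to `later` (negative if the order is wrong)."""
    return (later - earlier).days


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


# Hypothetical record for one trap; dates and coordinates are made up.
trap = {
    "start_date": date(2023, 5, 1),
    "end_date": date(2023, 5, 8),
    "arrival_date": date(2023, 5, 10),
    "analysis_date": date(2023, 5, 11),
    "lat": 51.98,
    "lon": 5.91,
}
reference = (51.97, 5.67)  # rough, assumed coordinates of the reference point

derived = {
    "air_exposure_days": days_between(trap["start_date"], trap["end_date"]),
    "transfer_time": days_between(trap["end_date"], trap["arrival_date"]),
    "time_to_analysis": days_between(trap["arrival_date"], trap["analysis_date"]),
    "d_km": haversine_km(trap["lat"], trap["lon"], *reference),
}

# Verification step: a negative interval means dates were entered out of order.
suspicious = [name for name, value in derived.items() if value < 0]

print(derived)
print("suspicious fields:", suspicious)
```

A negative interval or an implausibly large distance is exactly the kind of entry error that is hard to spot in a spreadsheet and easy to spot once the data is structured.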

The results

Transfer time to the Lab

Relation between time and distance
Figure 1: Relation between distance and time in the postal system.

Outlook

What lies ahead?

  • Get started today!
    Is your data FAIR yet?

  • Every bit helps.

I remind you: it is possible

Future perspectives

Recommendation

  • Departmental databases?

  • One central repository?

    • ASPAR_KR
  • Usages?

How a database system can be established, redrawn from [5]

Usage of FAIR data to answer questions

Besides transparency, FAIR data can also be used to answer questions such as \(\dots\)

  • How is the genotype related to environmental factors?
    Knowing the coordinates in a standard format is a good first step.

  • Can we better predict the antifungal resistance of fungi?
    Having all resistance assays described in one standard way is a good first step.

Conclusion

  • Researchers need to be mindful of their data.

  • An Excel-based workflow is prone to error.

  • FAIR data tooling like FAIRDS needs to be adopted and improved.

  • Thank you!

Additional slides

  1. Challenges encountered during the project

  2. Why I picked the FAIRDS

  3. Improvements to the FAIRDS

Challenges

Personal

  • Interviewing people.

  • Contributing to the Java programme.

Domain

  • Running all of the software correctly.

  • Understanding what people are doing in the lab.

Why the FAIRDS

There is no programme yet in widespread use for minimal metadata validation.

| Alternative | + | - |
|---|---|---|
| seek4science | Supports ISA. Sharing of templates online. | A bit more complicated to contribute to. Not available locally via Excel sheets. |
| FAIRshare | Locally available. | Limited in scope to genomics or immunology. Not clear how to expand it. |
| ENA and NCBI | Allow upload of large datasets with searchable metadata. | Do not enforce the quality of their metadata. |
| 4TU and Zenodo | General storage of large datasets. | Only a small amount of metadata annotation is supported. |

Improvements to the FAIRDS

  • Should allow more user-friendly addition of new standards.

  • Non-interactive batch mode should be supported.

  • Should inter-operate with more data formats:

    • SEEK.
    • Should allow local data storage for non-WUR users.
  • Should inter-operate with more databases:

    • Zenodo & 4TU: automatic deposit.
  • Complete refactor of the programme is required.

References

More information

My thesis could not have been written without the help of many collaborators:

Jasper, for offering his help with the FAIRDS; Anna and Mariana, for support and guidance during the thesis; Martin and Hylke, for their data and help with the use cases; Murambiwa and Christopher, for their data and critical discussion; and Sijmen, for the insight on the structure of my thesis.

Cited works

1. Abeysooriya M, Soria M, Kasu MS, Ziemann M (2021) Gene name errors: Lessons not learned. PLOS Computational Biology 17(7):e1008984. https://doi.org/10.1371/journal.pcbi.1008984
2. Nijsse B, Schaap PJ, Koehorst JJ (2023) FAIR data station for lightweight metadata management and validation of omics studies. GigaScience 12:giad014. https://doi.org/10.1093/gigascience/giad014
3.
4. OpenStreetMap contributors (2017) Planet dump retrieved from https://planet.osm.org
5. Costello MJ, Appeltans W, Bailly N, et al (2014) Strategies for the sustainability of online open-access biodiversity databases. Biological Conservation 173:155–165. https://doi.org/10.1016/j.biocon.2013.07.042

Footnotes

  1. Does not want to share his likeness