How can we be more FAIR in science?

A look at the A. fumigatus field

Sibbe Bakker

Genetics

Mariana Santos Silva

Genetics

Anna Fensel

AI & Data science

What do we find?

Most research data is \(\dots\)

Stored in Excel formats;
Fickle;
Impossible to validate;
In the best case, only well described for humans.

Consequences?

Delayed research;
Critical errors in data analysis [1];
Increased barrier to entry for labs that lack resources;
Impossible questions (missing metadata);
For important data, this may directly affect people's lives.
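
One well-documented example of such an analysis error is spreadsheet software silently turning gene symbols into dates, as reported in [1] (for instance MARCH1 becoming 1-Mar). The sketch below shows how a column could be screened for this; the pattern, the function name and the example column are assumptions made for illustration, not an exhaustive check.

```python
import re

# Illustrative pattern for cells that look like day-month dates such as "1-Mar"
# or "2-Sep", the form gene symbols like MARCH1 or SEPT2 can take after
# spreadsheet auto-conversion. The pattern is an assumption for this sketch.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def find_date_like_cells(cells):
    """Return (index, value) pairs for cells that look like converted dates."""
    return [
        (i, value)
        for i, value in enumerate(cells)
        if DATE_LIKE.match(value.strip())
    ]


# Hypothetical column of gene symbols exported from a spreadsheet.
gene_column = ["cyp51A", "1-Mar", "TP53", "2-Sep", "BRCA1"]
suspicious = find_date_like_cells(gene_column)
print(suspicious)
```

A check like this only flags damage after the fact; the original symbol cannot always be recovered from the converted value.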

Found requirements for data standardisation

What requirements did I find?

We need a knowledge base \(\rightarrow\) ASPAR_KR.
Excel documents should be central to the solution.
A solution should not require too much technical background.
It must integrate with existing standards such as ISA, MIxS and JERM.

What can make your work more FAIR?

The FAIRDS

Validates Excel sheets against ENA standards.
Converts them to a stable and scalable format.
Allows addition of new standards via Excel sheets.

Made by Nijsse et al. [2].

FAIR enough, but what does it do?

Going from Excel \(\rightarrow\) terse Resource Description Framework (RDF).
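
To make the Excel to RDF step concrete, here is a minimal sketch of how one spreadsheet row could be written as RDF. It is not the FAIRDS implementation: the Turtle output syntax, the function names, the subject IRI and the identifier value are all assumptions for illustration. Only the two prefixes and the `schema:identifier` and `fair:packageName` properties come from the query shown later in this presentation.

```python
# Prefixes taken from the query used later in this presentation.
PREFIXES = {
    "schema": "http://schema.org/",
    "fair": "http://fairbydesign.nl/ontology/",
}


def escape_literal(text):
    """Escape backslashes and double quotes for a single-line Turtle string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def row_to_turtle(subject_iri, row):
    """Serialise one row (a dict of prefixed property -> value) as Turtle."""
    lines = [f"@prefix {p}: <{iri}> ." for p, iri in PREFIXES.items()]
    lines.append("")
    pairs = [
        f'    {prop} "{escape_literal(str(value))}"'
        for prop, value in row.items()
    ]
    lines.append(f"<{subject_iri}>")
    lines.append(" ;\n".join(pairs) + " .")
    return "\n".join(lines)


# Hypothetical row; the subject IRI and identifier value are made up.
row = {"schema:identifier": "trap-01", "fair:packageName": "DeltaTrap"}
turtle = row_to_turtle("http://example.org/sample/trap-01", row)
print(turtle)
```

Every value is written as a plain string literal here; a real converter would also assign datatypes (dates, numbers) and handle multi-line values.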

How does it work?

Experimental design is part of the dataset.
Minimum information standards are used.
New templates can be introduced.

You make Excel templates.
You fill these in with your data.
You upload them, and FAIRDS generates RDF.
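
The idea of checking a filled-in template against a minimum information standard can be sketched as follows. This is an illustration under stated assumptions, not how FAIRDS validates: the field names, the patterns and the `validate_row` helper are hypothetical and not taken from any real standard.

```python
import re

# Hypothetical template: field name -> (required?, regex the value must match).
# Field names and patterns are illustrative, not taken from a real standard.
TEMPLATE = {
    "sample_id": (True, r"[A-Za-z0-9_-]+"),
    "collection_date": (True, r"\d{4}-\d{2}-\d{2}"),
    "latitude": (True, r"-?\d+(\.\d+)?"),
    "longitude": (True, r"-?\d+(\.\d+)?"),
    "notes": (False, r".*"),
}


def validate_row(row, template=TEMPLATE):
    """Return a list of human-readable problems found in one filled-in row."""
    problems = []
    for field, (required, pattern) in template.items():
        value = row.get(field, "").strip()
        if not value:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not re.fullmatch(pattern, value):
            problems.append(f"invalid value for {field}: {value!r}")
    return problems


# Two hypothetical rows: one complete, one with typical spreadsheet problems.
good = {"sample_id": "ARN-01", "collection_date": "2023-05-01",
        "latitude": "51.98", "longitude": "5.91"}
bad = {"sample_id": "ARN 02", "collection_date": "01/05/2023",
       "latitude": "51.98"}

good_problems = validate_row(good)
bad_problems = validate_row(bad)
print(good_problems)
print(bad_problems)
```

Keeping the rules in a data structure rather than in code mirrors the point above that new templates can be introduced without reprogramming.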

A quick demo of the FAIRDS/ASPAR_KR

Imagine the following situation.

A researcher takes air samples in Arnhem and Nijmegen.
He wants to know if the resistance fraction is higher in Arnhem or Nijmegen.
He uses the air sampling method of Kortenbosch et al. [3] with delta traps.

The delta trap method – Image by Bo Briggeman

Per city, 4 locations are sampled.
Using the two layer culture…
- Strips are grown on Flamingo agar…
- And Flamingo agar with ITR.
Let's see how to enter these things in FAIRDS.
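
The sampling design above can be written out as one row per culture, which is roughly what has to be entered in a sample sheet. This is a sketch under assumptions: the column names are hypothetical, and the definition of the resistance fraction (colonies on agar with ITR divided by colonies on agar without ITR) is an assumed reading of the method in [3].

```python
from itertools import product

cities = ["Arnhem", "Nijmegen"]
locations = [1, 2, 3, 4]  # 4 sampled locations per city
media = ["Flamingo agar", "Flamingo agar with ITR"]

# One row per culture: every sampled location yields one culture per medium.
design = [
    {"city": city, "location": location, "medium": medium}
    for city, location, medium in product(cities, locations, media)
]
print(len(design))


def resistance_fraction(colonies_with_itr, colonies_without_itr):
    """Colonies on ITR agar relative to agar without ITR (assumed definition)."""
    if colonies_without_itr == 0:
        raise ValueError("no colonies on the agar without ITR")
    return colonies_with_itr / colonies_without_itr


# Hypothetical colony counts, for illustration only.
example_fraction = resistance_fraction(5, 100)
print(example_fraction)
```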

Map data from the OpenStreetMap project [4].

The data set to be FAIRified.

The FAIRification programme.

The FAIR data, how can we use this?

A use case – verifying data

Hylke’s air sampling data

PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX sf: <http://www.opengis.net/ont/sf#>
PREFIX cats: <http://cats.org/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX uom: <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX fair: <http://fairbydesign.nl/ontology/>
PREFIX jerm: <http://jermontology.org/ontology/JERMOntology#>
SELECT *
WHERE {
    
    # Get properties of the DeltaTrap.
    ?deltaTrap fair:packageName 'DeltaTrap' ;
        <https://w3id.org/mixs/terms/0000011> ?arrival_date ;
         geo:hasGeometry/geo:asWKT ?point .

    ?observationalUnit jerm:hasPart ?deltaTrap ;
                schema:identifier ?obsId .


    # Get properties of the culture
    ?deltaTrap fair:derives ?twoLayerCulture ;
        fair:start_date ?start_date ;
        fair:end_date ?end_date ;
        <https://w3id.org/mixs/terms/0000011> ?analysis_date .
    
    # What is the amount of time the seals were exposed?
    BIND(day(?end_date - ?start_date) as ?air_exposure_days)
    # What is the transfer time?
    BIND(day(?arrival_date - ?end_date) as ?transfer_time)
    # Time until analysis after arrival?
    BIND(day(?analysis_date - ?arrival_date) as ?time_to_analysis)

    # Distance from WUR
    SERVICE <https://query.wikidata.org/sparql> {
        # What things are a municipality?
        ?municipality wdt:P31 wd:Q2039348.
        # What things have a place?
        ?municipality wdt:P625 ?placeOfInterest .
        # Take only the thing that is near the place of interest.
        FILTER(?municipality = wd:Q1305) . # Arnhem
    }
    BIND (geof:distance(?point, ?placeOfInterest, uom:kilometre) as ?d_km)
}
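
To show what the BIND expressions in the query compute, the same arithmetic can be sketched in plain Python. All dates and coordinates below are hypothetical. The haversine formula on a sphere is an assumption standing in for whatever geof:distance does in a given triple store, and the WKT points are assumed to be in longitude, latitude order.

```python
import math
import re
from datetime import date

WKT_POINT = re.compile(
    r"\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*",
    re.IGNORECASE,
)


def parse_wkt_point(wkt):
    """Parse 'POINT(lon lat)' into a (lon, lat) tuple of floats."""
    match = WKT_POINT.fullmatch(wkt)
    if match is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (lon, lat) pairs on a sphere."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*point_a, *point_b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Hypothetical dates for one delta trap.
start_date = date(2023, 5, 1)      # trap opened
end_date = date(2023, 5, 29)       # trap closed
arrival_date = date(2023, 6, 2)    # sample arrived at the lab
analysis_date = date(2023, 6, 5)   # culture analysed

air_exposure_days = (end_date - start_date).days
transfer_time = (arrival_date - end_date).days
time_to_analysis = (analysis_date - arrival_date).days

# Hypothetical points, one degree of latitude apart on the same meridian.
point = parse_wkt_point("POINT(5.9 52.0)")
place_of_interest = parse_wkt_point("Point(5.9 51.0)")
d_km = haversine_km(point, place_of_interest)

print(air_exposure_days, transfer_time, time_to_analysis, round(d_km, 2))
```

Negative intervals or implausibly large distances in such derived columns are the kind of data-entry problem this use case is meant to surface.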

The results

Outlook

What lies ahead?

Get started today!
Is your data FAIR yet?
Every bit helps.

I remind you: it is possible

Future perspectives

Recommendation

Departmental databases?
One central repository?
- ASPAR_KR
Usages?

How a database system can be established, redrawn from [5]

Usage of FAIR data to answer questions

Besides just transparency, FAIR data can also be used for \(\dots\)

How is the genotype related to environmental factors?
Knowing the coordinates in a standard format is a good first step.
Can we predict the anti-fungal resistance of fungi better?
Having all resistance assays be described in one way is a good first step.

Conclusion

Researchers need to be mindful of their data.
An Excel-based workflow is prone to error.
FAIR data tooling like FAIRDS needs to be adopted and improved.
Thank you!

Additional slides

Challenges encountered during the project
Why I picked the FAIRDS
Improvements to the FAIRDS

Challenges

Personal

Interviewing people.
Contributing to the Java programme.

Domain

Running all of the software correctly.
Understanding what people are doing in the lab.

Why the FAIRDS

There is no programme yet in widespread use for minimal metadata validation.

Alternative	Advantages	Disadvantages
SEEK4Science	Supports ISA; sharing of templates online.	A bit more complicated to contribute to; not available locally via Excel sheets.
FAIRshare	Locally available.	Limited in scope to genomics or immunology; not clear how to expand it.
ENA and NCBI	Allow upload of large datasets with searchable metadata.	Do not enforce the quality of their metadata.
4TU and Zenodo	General storage of large datasets.	Only a small amount of metadata annotation is supported.

Improvements to the FAIRDS

Should allow more user-friendly addition of new standards.
Non-interactive batch mode should be supported.
Should inter-operate with more data formats:
- SEEK.
Should allow local data storage for non-WUR users.
Should inter-operate with more databases:
- Zenodo & 4TU – automatic deposit.
Complete refactor of the programme is required.

References

More information

sibbe.bakker@wur.nl

Read my thesis.

My thesis could not have been made without the help of a lot of collaborators:

Jasper, for offering his help with the FAIRDS; Anna and Mariana, for support and guidance during the thesis; Martin and Hylke, for their data and help with the use cases; Murambiwa and Christopher, for their data and critical discussion; and Sijmen, for the insight on the structure of my thesis.

Cited works

[1] Abeysooriya M, Soria M, Kasu MS, Ziemann M (2021) Gene name errors: Lessons not learned. PLOS Computational Biology 17(7):e1008984. https://doi.org/10.1371/journal.pcbi.1008984

[2] Nijsse B, Schaap PJ, Koehorst JJ (2023) FAIR data station for lightweight metadata management and validation of omics studies. GigaScience 12:giad014. https://doi.org/10.1093/gigascience/giad014

[3] Kortenbosch HH, Van Leuven F, Zwaan BJ, Snelders E (2022) Catching more air: An effective and simple-to-use air sampling approach to assess aerial resistance fractions in Aspergillus fumigatus. Microbiology

[4] OpenStreetMap contributors (2017) Planet dump retrieved from https://planet.osm.org

[5] Costello MJ, Appeltans W, Bailly N, et al (2014) Strategies for the sustainability of online open-access biodiversity databases. Biological Conservation 173:155–165. https://doi.org/10.1016/j.biocon.2013.07.042

Footnotes

Does not want to share his likeness