Chapter 3 Data integration

This chapter explains the different steps of the automated integration R-script. This script can be downloaded from the Github-page for this project.

3.1 Spreadsheet preparation

Besides the two thesauri (chapter 2), the R-script also requires the plant trait spreadsheets containing all measurement data as input. A few requirements have to be met to make these spreadsheet suitable for integration:

  • All spreadsheets have to be comma-seperated value (.csv) files. If not the case, these can easily be converted from e.g. Excel files.
  • All values in columns assigned to the “trait” category should be in the same unit of measurement (e.g. not one value in meters and the next in centimeters).
  • Columns assigned to the “trait” category should only contain numeric values. If the measurement unit is also give within a value cell, it should be removed. The unit of a certain column should be specified through the units thesaurus only.
  • Within each individual spreadsheet, the use of decimal seperators should be consistent (not ‘,’ in one and ‘.’ in another column).

3.2 Loading and preparing input data

3.2.2 Validate dataset and metadata thesaurus

This function checks if all datasets are in the right format for the workflow and that the metadata thesaurus is filled out correctly. In the case of common mistakes (as in Fig. 2 of the manuscript) the name of the dataset and column header is given. It is also checked if column headers entered in the metadata thesaurus are present in each dataset.

3.2.3 Loading thesauri and preperation

read_excel() is used to load in both thesauri. Therefore, they don’t have to be converted to .csv files like the measurement spreadsheets.

Several redundant rows and columns are stripped from both thesauri to make them machine-readable. This includes the removal of references, basisOfRecord and coordinates information. A data frame is constructed that relates verbatim trait names to standardized terms.

All cleaned spreadsheets (3.1) should be in the same folder called “raw_datasets”. Their file path and names are loaded. A data frame is also made containing various metadata information for reach spreadsheet.

3.3 Integrating datasets

The different datasets are integrated and combined with the use of a loop. Column header standardization, adding of metadata information, unit harmonization and the conversion to a long-table format are done seperately for every dataset and combined to a total dataframe after every iteration.

An empty data frame is made to append all datasets to.

Column headers for a given dataset are selected and loaded from the corresponding .csv file. Numeric and character columns are loaded seperately and bound together.

Columns containing trait data are converted from wide to long format. Columns “traitValue” and “traitName” are added.

References, basis of record, coordinate system and SRS are added from the metadata thesaurus.

Measurement units are added as a column. These are standardized to “cm”.

Add verbatim trait names as column. Trait identifiers are added.

Rows are bound for every dataset.