Missouri Botanical Garden Open Conference Systems, TDWG 2016 ANNUAL CONFERENCE

Defining dataset specifications to communicate data quality characteristics
Peter Desmet, Stijn Van Hoey, Dimitri Brosens

Building: CTEC
Room: Auditorium
Date: 2016-12-06 04:00 PM – 04:15 PM
Last modified: 2016-10-15


The Darwin Core standard provides a list of community-ratified terms for sharing biodiversity information. Although some terms have strict definitions, most leave users some freedom in how to interpret them. This freedom has enabled a wide range of biodiversity data to be mapped to Darwin Core, but it complicates automated data aggregation and processing. One way to resolve this is to write community-specific guidelines describing how data should be mapped, but few have been created or adopted. Moreover, such guidelines are intended for humans only.

Inspired by existing data validation specifications in other fields, we propose the use of a specification file that describes the constraints with which the data should comply. Its syntax is both human- and machine-readable, so it can be used to communicate expected data quality and conformity as well as to validate data automatically. The scope of the set of rules can be specific to a dataset, publisher or community, which allows both bottom-up and top-down adoption.

In this talk, we will present a prototype format for these specifications, in which the rules are defined at the level of individual terms and expressed as a YAML file. We will also present prototype software that validates data against these specifications. We hope this will trigger a discussion on how to express data specifications and mapping guidelines.
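
To make the idea of term-level rules concrete, the following is a minimal sketch of what such a specification and validator might look like. It is not the prototype format or software presented in the talk: the rule names (`required`, `allowed`, `numeric_range`, `regex`), the function `validate_record` and the sample values are all hypothetical, and the specification is written as a Python dictionary mirroring what a YAML file could contain, so that the example runs with the standard library only.

```python
import re

# Hypothetical specification: one set of rules per Darwin Core term.
# In a YAML file the first entry might be written as:
#   basisOfRecord:
#     required: true
#     allowed: [HumanObservation, PreservedSpecimen]
SPEC = {
    "basisOfRecord": {
        "required": True,
        "allowed": ["HumanObservation", "PreservedSpecimen"],
    },
    "decimalLatitude": {"required": True, "numeric_range": [-90, 90]},
    "eventDate": {"required": False, "regex": r"\d{4}-\d{2}-\d{2}"},
}


def validate_record(record, spec):
    """Return a list of human-readable rule violations for one record.

    Values are expected to be strings, as read from a delimited text file.
    """
    errors = []
    for term, rules in spec.items():
        value = record.get(term)
        if value is None or value == "":
            # Empty values only violate the spec when the term is required.
            if rules.get("required"):
                errors.append(f"{term}: missing required value")
            continue
        if "allowed" in rules and value not in rules["allowed"]:
            errors.append(f"{term}: {value!r} is not an allowed value")
        if "numeric_range" in rules:
            low, high = rules["numeric_range"]
            try:
                number = float(value)
            except ValueError:
                errors.append(f"{term}: {value!r} is not numeric")
            else:
                if not low <= number <= high:
                    errors.append(f"{term}: {value!r} is outside [{low}, {high}]")
        if "regex" in rules and not re.fullmatch(rules["regex"], value):
            errors.append(f"{term}: {value!r} does not match the expected pattern")
    return errors


if __name__ == "__main__":
    record = {
        "basisOfRecord": "Fossil",
        "decimalLatitude": "120",
        "eventDate": "06/12/2016",
    }
    for message in validate_record(record, SPEC):
        print(message)
```

Because the same structure serves as documentation and as validator input, a publisher or community could in principle share one such file both to state its expectations and to check incoming data against them.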