Missouri Botanical Garden Open Conference Systems, TDWG 2016 ANNUAL CONFERENCE

A Standards Architecture for Integrating Information in Biodiversity Science
Donald Hobern, Andrea Hahn

Building: CTEC
Room: Auditorium
Date: 2016-12-05 11:00 AM – 11:15 AM
Last modified: 2016-10-15


In this presentation, we will identify what we believe are the essential elements in a standards architecture for how we represent, share, and use biodiversity data. Our shared vision should include enabling human users and machines to find all of the information and to traverse all of the data connections that a knowledgeable researcher can see in the biodiversity literature, collections and other resources. We should be able to start from any point in the biodiversity data graph and find the meaningful links to associated data objects: from specimen to taxon concept to taxon name to publication, from sequence to associated sequences to taxon concepts to species occurrences, and so on.
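As a rough illustration of what "starting from any point in the data graph" could mean in practice, the following sketch stores a handful of records and typed links and walks them breadth-first. The identifiers, the class labels and the relationship names hasName and derivedFrom are assumptions made for the example; only identifiedAs and publishedIn are taken from this abstract.

```python
from collections import deque

# Hypothetical records: identifier -> class label.
RECORDS = {
    "spec:1": "Specimen",
    "tc:1": "TaxonConcept",
    "tn:1": "TaxonName",
    "pub:1": "Publication",
    "seq:1": "Sequence",
}

# Hypothetical typed links as (subject, relationship, object) triples.
LINKS = [
    ("spec:1", "identifiedAs", "tc:1"),
    ("tc:1", "hasName", "tn:1"),        # assumed relationship name
    ("tn:1", "publishedIn", "pub:1"),
    ("seq:1", "derivedFrom", "spec:1"),  # assumed relationship name
]


def neighbours(node):
    """Yield records linked to `node`, following links in either direction."""
    for subj, _rel, obj in LINKS:
        if subj == node:
            yield obj
        elif obj == node:
            yield subj


def reachable(start):
    """Breadth-first walk: every record id reachable from `start`, in visiting order."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen
```

Because links are followed in both directions, a walk that starts at the publication reaches the sequence just as a walk from the sequence reaches the publication.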

This means that our data architecture needs to pay attention to the following matters (quite independently of the challenges of delivering the infrastructures that underpin their successful implementation):
Agreement on the set of core data classes within the biodiversity domain which we consider important enough to standardise (specimen, collection, taxon name, taxon concept, sequence, gene, publication, taxon trait, or whatever set we all agree on).
Agreement on the set of core relationships between instances of these classes that we consider important enough to standardise (specimen identifiedAs taxon concept, taxon name publishedIn publication, etc.).
Making sure that our data publishing mechanisms (cores, extensions, etc.) align accurately with these classes and support these relationships. This mainly means reworking the current confused interplay between cores, DwC classes, use of dcterms:type and use of basisOfRecord. Every record should be clearly identified as an instance of a class (or a view of several linked class instances) and, for the core data classes, this should form the basis for inference and interpretation.
An ongoing process of defining, for each core class, which properties are mandatory (maybe only id and class), highly desirable (depending on the class, things like decimal coordinates, scientific name, identifiedAs, publishedIn), generally agreed (many other properties for which we have working vocabularies and do not want unnecessary multiplication, e.g. waterBody, maximumDepthInMeters) or optional/bespoke (anything else that any data publisher wishes to include). In other words, allow any properties to be shared but ensure that the contours of the data are clear to standard tools.
A set of good examples of datasets mapped into this model, using various serialisations.
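The property tiers described above could be checked by a standard tool along the following lines. This is a minimal sketch: the tier assignments per class, the class labels and the function name profile are assumptions for illustration, not an agreed standard.

```python
# Tier assignments are illustrative assumptions, not an agreed standard.
MANDATORY = {"id", "class"}
HIGHLY_DESIRABLE = {
    "Specimen": {"decimalLatitude", "decimalLongitude", "scientificName", "identifiedAs"},
    "TaxonName": {"scientificName", "publishedIn"},
}
GENERALLY_AGREED = {"waterBody", "maximumDepthInMeters"}


def profile(record):
    """Sort a record's properties into the four tiers and report what is missing."""
    keys = set(record)
    desirable = HIGHLY_DESIRABLE.get(record.get("class"), set())
    return {
        "missing_mandatory": sorted(MANDATORY - keys),
        "missing_desirable": sorted(desirable - keys),
        "agreed": sorted(GENERALLY_AGREED & keys),
        # Anything not covered by a tier is accepted as optional/bespoke.
        "bespoke": sorted(keys - MANDATORY - desirable - GENERALLY_AGREED),
    }


example = {
    "id": "spec:1",
    "class": "Specimen",
    "scientificName": "Puma concolor",
    "waterBody": "Lake Example",
    "collectorMood": "cheerful",  # bespoke property, kept rather than rejected
}
report = profile(example)
```

Nothing in the record is rejected: unknown properties are reported as bespoke, while gaps in the mandatory and highly desirable tiers stay visible to tools.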
While accommodating plain text and URIs in the same fields enables data publishing from the widest possible range of sources, it leaves problems for data aggregators and users.
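One consequence is that every consumer has to guess whether a given value is a resolvable identifier or a literal string. The sketch below shows the kind of heuristic an aggregator might apply. The function name classify_value and the rule itself (absolute HTTP(S) URIs and URNs count as identifiers, everything else as text) are assumptions for illustration.

```python
from urllib.parse import urlparse


def classify_value(value):
    """Return ("uri", value) for absolute HTTP(S) URIs and URNs, else ("text", value)."""
    candidate = value.strip()
    parts = urlparse(candidate)
    # Absolute web identifiers need both a scheme and a host.
    if parts.scheme in ("http", "https") and parts.netloc:
        return ("uri", candidate)
    # URNs (for example LSIDs) have a scheme and a path but no host.
    if parts.scheme == "urn" and parts.path:
        return ("uri", candidate)
    return ("text", candidate)


values = [
    "http://example.org/taxon/1",
    "urn:lsid:example.org:names:123",
    "Puma concolor (Linnaeus, 1771)",
]
kinds = [classify_value(v)[0] for v in values]
```

A heuristic like this can only guess at the publisher's intent, which is the problem the sentence above points to.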