Publishing sample data using the GBIF IPT
Éamonn Ó Tuama, Markus Döring, Kyle Braak, Tim Robertson, Olaf Bánki

Building: Elmia Congress Centre, Jönköping
Room: Rydbergsalen
Date: 2014-10-29 11:05 AM – 11:20 AM
The Darwin Core (DwC) vocabulary is intended to facilitate the sharing of information on the occurrence of taxa in nature “as documented by observations, specimens, samples, and related information”. Particularly as used in Darwin Core Archives (DwC-A) (http://rs.tdwg.org/dwc/terms/guides/text/index.htm), it has enabled the publication, in a standardised way, of an unprecedented number of records pertaining to observations and specimens. Now, GBIF, in association with its partners in the EU BON project (http://eubon.eu), is exploring how DwC can be extended to support publication of sample-based data, i.e., observations related by a standard sampling protocol and where quantitative information about species occurrences is recorded. We report here on how the addition of a minimal set of new properties allows encoding of some essential characteristics of sample-based data while drawing on the existing capabilities and supporting infrastructure offered by the GBIF Integrated Publishing Toolkit (http://www.gbif.org/ipt) for working with DwC-A.

DwC is a flat glossary of terms which deliberately eschews the semantic rigour of more formal ontologies in order to support ease-of-use. Given this design choice, how well suited is DwC as an exchange format for "sample-based" data? The aim is not to establish how data should be captured or modelled, but rather to demonstrate one way in which data can be exposed to maximise discoverability and reuse, even if what is exposed is only a view of some aspects of a data set.

DwC already provides a rich set of terms, organised into several classes (e.g., Occurrence, Event, Location, Taxon, Identification), many of which are relevant for describing sample-based data. Drawing on several sources of input (a GBIF-organised workshop on sample data, May 2013; discussions on the EU BON and TDWG mailing lists), a small set of terms (eventID, samplingProtocol, sampleSize, sampleSizeUnit, quantity and quantityType), of which the latter four are new, was identified as essential.
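
As an illustration of how the six terms might work together, the following minimal Python sketch pairs one sampling event with one occurrence record and derives a quantity per unit of sample. Only the term names come from the text above; all values, the scientificName field (an existing DwC term) and the helper quantity_per_unit_sample are hypothetical and chosen purely for illustration.

```python
# Hypothetical records: the keys are DwC terms, the values are invented.
event = {
    "eventID": "plot-A:2014-06-01",
    "samplingProtocol": "quadrat count",
    "sampleSize": 10.0,                # new term
    "sampleSizeUnit": "square metre",  # new term
}

occurrence = {
    "eventID": "plot-A:2014-06-01",    # links the occurrence to its event
    "scientificName": "Bellis perennis",
    "quantity": 25.0,                  # new term
    "quantityType": "individuals",     # new term
}


def quantity_per_unit_sample(event, occurrence):
    """Return (value, unit label) for quantity divided by sample size.

    Illustrative helper only; it is not part of DwC or the IPT.
    """
    if event["eventID"] != occurrence["eventID"]:
        raise ValueError("occurrence does not belong to this event")
    value = occurrence["quantity"] / event["sampleSize"]
    unit = f'{occurrence["quantityType"]} per {event["sampleSizeUnit"]}'
    return value, unit


density, unit = quantity_per_unit_sample(event, occurrence)
```

Keeping the unit and type alongside the numbers is what would let a consumer compare quantities across data sets that use different sampling efforts.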

DwC-A imposes a relatively simple, one-to-many relational model in which a row in a (central) core table can be linked to many rows in one or more (surrounding) extension tables. Table column headers typically map to Darwin Core terms, although terms from other vocabularies can also be used. Currently, the IPT and GBIF.org services support the use of two cores: Taxon and Occurrence. In order to encode sample-based data, we propose here a third, new core, the Event (i.e., sampling event) core, to be used in association with an Occurrence extension. The Event core elements are mainly drawn from the DwC classes Event, Location and GeologicalContext, with the addition of the two new terms sampleSize and sampleSizeUnit. The Occurrence extension draws from the Occurrence, Taxon and Identification classes, with the addition of the two new terms quantity and quantityType.
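
The one-to-many link between an Event core and an Occurrence extension can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the IPT's implementation: the two tables are inlined as CSV text with invented rows, eventID is assumed to be the shared key column, and read_rows and join_core_to_extension are hypothetical helper names.

```python
import csv
import io
from collections import defaultdict

# Invented example tables; column headers are DwC terms.
EVENT_CORE = """eventID,samplingProtocol,sampleSize,sampleSizeUnit
E1,pitfall trap,3,trap
E2,pitfall trap,5,trap
"""

OCCURRENCE_EXT = """eventID,scientificName,quantity,quantityType
E1,Carabus nemoralis,4,individuals
E1,Pterostichus niger,7,individuals
E2,Carabus nemoralis,2,individuals
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(text)))


def join_core_to_extension(core_rows, ext_rows, key="eventID"):
    """Attach to each core row the extension rows that share its key."""
    by_key = defaultdict(list)
    for row in ext_rows:
        by_key[row[key]].append(row)
    return {
        row[key]: {"event": row, "occurrences": by_key.get(row[key], [])}
        for row in core_rows
    }


joined = join_core_to_extension(read_rows(EVENT_CORE), read_rows(OCCURRENCE_EXT))
```

Here each event appears once in the core, while its occurrences repeat only the eventID; further extensions could be attached to the same core rows in the same way.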

In this presentation, we demonstrate the use of the Event core and Occurrence extension with several example data sets and show how they can be augmented with additional extensions for, e.g., environmental measurements and vegetation plot data.


Acknowledgement: EU BON partners, TDWG mailing list contributors and GBIF sample data workshop participants informed this work and are gratefully acknowledged.