Occurrence data at the INBO: opening up our data publication workflow
Peter Desmet, Dimitri Brosens

Building: Elmia Congress Centre, Jönköping
Room: Rum 11
Date: 2014-10-30 12:10 PM – 12:25 PM
Last modified: 2014-10-03


The Research Institute for Nature and Forest (INBO) collects biodiversity data for research purposes and to support policy in Flanders, Belgium. Since 2011, we have been gradually publishing our data in the framework of the Global Biodiversity Information Facility (GBIF, http://www.gbif.org). The INBO has so far published 16 datasets covering 5.2 million occurrences, including plant surveys, fish observations, and bird tracking data, and is one of the major biodiversity data publishers in Belgium.

In collaboration with the LifeWatch project (http://lifewatch.inbo.be), the INBO is now publishing its data as open data under a Creative Commons Zero waiver, referencing norms for data use (https://github.com/LifeWatchINBO/norms-for-data-use) modelled on the Canadensys and VertNet norms. We are also opening up our data publication process. Each new dataset now starts out as an internal database view, a test IPT environment, and a public GitHub repository (e.g. https://github.com/LifeWatchINBO/vis-inland-occurrences) that shares its shortname with the dataset to be published. Metadata are written in Markdown, in article format, and are versioned (e.g. https://github.com/LifeWatchINBO/bird-tracking-gull-occurrences/blob/master/metadata.md); this document often forms the basis of a data paper. The quality of the data is assessed using several tools, such as OpenRefine and QGIS. Issues and tasks regarding the data and metadata are publicly recorded on GitHub (e.g. https://github.com/LifeWatchINBO/vis-inland-occurrences/issues), which allows external users to see current issues and submit their own. Once most issues are resolved, the metadata are imported into our production IPT (http://data.inbo.be/ipt/) and a Darwin Core archive is created, published, and registered. To maintain the same level of quality and consistency across all our datasets, we have started to document our data publication guidelines (https://github.com/LifeWatchINBO/data-publication-guidelines).
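As an illustration of the kind of check that could complement the interactive quality assessment described above, the following is a minimal sketch in Python, not the INBO's actual tooling. It assumes the occurrence data are available as a comma-separated file with standard Darwin Core column names. The sample rows, the identifiers (INBO:EX:1 and so on), and the function names are hypothetical, and only plain ISO 8601 dates are accepted, whereas Darwin Core also allows date ranges and date-times in eventDate.

```python
import csv
import io
from datetime import date

# Hypothetical sample rows; the column names are standard Darwin Core terms,
# but the identifiers and values are invented for illustration only.
SAMPLE = """occurrenceID,scientificName,eventDate,decimalLatitude,decimalLongitude
INBO:EX:1,Anguilla anguilla,2013-05-14,51.03,4.48
INBO:EX:2,Larus fuscus,2013-13-01,51.23,2.93
INBO:EX:3,,2013-06-02,95.10,3.72
"""


def check_row(row):
    """Return a list of human-readable problems found in one occurrence row."""
    problems = []
    if not row["occurrenceID"].strip():
        problems.append("missing occurrenceID")
    if not row["scientificName"].strip():
        problems.append("missing scientificName")
    try:
        # Simplification: only plain YYYY-MM-DD dates are treated as valid.
        date.fromisoformat(row["eventDate"])
    except ValueError:
        problems.append("eventDate is not a valid ISO 8601 date")
    try:
        lat = float(row["decimalLatitude"])
        lon = float(row["decimalLongitude"])
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append("coordinates out of range")
    except ValueError:
        problems.append("coordinates are not numeric")
    return problems


def check_occurrences(text):
    """Map occurrenceID to its problems, for every row with at least one."""
    report = {}
    for row in csv.DictReader(io.StringIO(text)):
        problems = check_row(row)
        if problems:
            report[row["occurrenceID"]] = problems
    return report


report = check_occurrences(SAMPLE)
for occurrence_id, problems in report.items():
    print(occurrence_id, "; ".join(problems))
```

Each flagged row could then be recorded as an issue in the dataset's GitHub repository, so that the problem and its resolution stay publicly traceable.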

Numerous open issues remain in our data publication workflow, including how to assess data quality for bigger datasets, how to interpret certain Darwin Core fields, how to better integrate GitHub and the IPT, and how to reassess resource metadata on the IPT. We hope that by opening this discussion, we can create data publication guidelines in collaboration with others, so that we collectively publish datasets that are more consistent, better documented, and of higher quality.