Missouri Botanical Garden Open Conference Systems, TDWG 2013 ANNUAL CONFERENCE

Beyond Barriers: Exporting data quality assessments from Spain
Arturo H. Ariño, Francisco Pando, Javier Otegui

Building: Grand Hotel Mediterraneo
Room: America del Nord (Theatre I)
Date: 2013-10-29 02:50 PM – 03:05 PM
Last modified: 2013-10-05

Abstract


Over the past decade, two research groups in Spain have converged on a common goal: ensuring that the quality and fitness-for-use of the primary biodiversity data (PBD) mobilized through GBIF can be assessed. Fitness-for-use (FFU), or the ability to assess whether data or datasets can be sensibly used for some purpose, mandates a candid release of any limits, problems, or uncertainties that may lie in the data. Often, these boundary conditions can only be known through data analysis, aimed at forecasting what patterns should exist in the data and then exposing outliers or oddities in the data distribution. On the other hand, functional quality control (inserting quality checks in the data flow from data collection to data publishing) may greatly reduce the extent of the problems discovered and increase general FFU. The two approaches feed on each other: quality checks can be devised from a logical analysis of the workflow, but also as a reaction to strange patterns found in the data that may indicate a quality problem or impose limits on their FFU, for example by restricting their use to a set of purposes.

GBIF.ES, the Spanish coordination node of GBIF, among many other contributions such as Data Quality Workshops or the Biodiversity Data Quality hub (BDQ)[1], has produced a set of rules to validate Darwin Core data, implemented in DarwinTest (DT)[2], a software application that validates and checks records from tables in DarwinCoreV1.2, DarwinCoreV1.4, or DarwinCoreArchive format. DT allows for easy checking of many common errors arising from digitization and, at least as often, from migration from other platforms or databases. The system highlights conformance problems by enforcing the standard upon the data, marking any records that do not conform before they can be published. It also calculates the Apparent Quality Index[3] of the dataset. Although DT can be used by the data provider, it is often the node that tests the incoming data, as a much sought-after function of GBIF.ES is hosting PBD on behalf of many institutions and research facilities throughout Spain. This type of functional quality control greatly reduces the amount of noise in the published data, as it allows data to be corrected iteratively before indexing.
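As an illustration of what rule-based checking before publication might look like, the following is a minimal Python sketch. It is not DarwinTest's actual rule set: the choice of required terms, the specific checks, and the function names `check_record` and `flag_nonconforming` are assumptions made for this example. Only the Darwin Core term names themselves (`scientificName`, `basisOfRecord`, `decimalLatitude`, `decimalLongitude`, `eventDate`) come from the standard.

```python
from datetime import date

# Assumption: terms treated as mandatory for this sketch only.
REQUIRED_TERMS = ("scientificName", "basisOfRecord")


def check_record(record):
    """Return a list of problems found in one record (a dict of term -> value)."""
    problems = []

    # Rule 1: required terms must be present and non-blank.
    for term in REQUIRED_TERMS:
        if not str(record.get(term, "")).strip():
            problems.append(f"missing {term}")

    # Rule 2: coordinates, when given, must be numeric and within range.
    for term, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
        raw = record.get(term)
        if raw in (None, ""):
            continue
        try:
            value = float(raw)
        except (TypeError, ValueError):
            problems.append(f"{term} is not numeric")
            continue
        if abs(value) > limit:
            problems.append(f"{term} out of range")

    # Rule 3: the event date, when given, must be an ISO date not in the future.
    raw_date = record.get("eventDate")
    if raw_date not in (None, ""):
        try:
            parsed = date.fromisoformat(str(raw_date))
        except ValueError:
            problems.append("eventDate is not an ISO date")
        else:
            if parsed > date.today():
                problems.append("eventDate is in the future")

    return problems


def flag_nonconforming(records):
    """Map the index of each nonconforming record to its list of problems."""
    flagged = {}
    for index, record in enumerate(records):
        problems = check_record(record)
        if problems:
            flagged[index] = problems
    return flagged
```

Running `flag_nonconforming` over a table before each submission, correcting the flagged records, and running it again mirrors the iterative correction described above.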

The research group at the University of Navarra, on the other hand, focuses on finding patterns in indexed data. Through a number of visualization and statistical techniques, the group analyzes data for spatial, chronological, taxonomic, and other types of patterns, either alone or in combination, aiming to make outliers (be they data, datasets, or data groups) stand out for further scrutiny. Together with GBIF.ES, the group analyzed the status of the data published from Spain[4],[5], and also produced an online tool (BIDDSAT)[6] that can be used on any GBIF publisher or dataset to quickly look for a specific set of patterns, thereby helping the publisher discover potential quality issues. In collaboration with the GBIF Secretariat's science team, the group extended these analyses to the assessment of entire iterations of the global GBIF index as well[7], helping the informatics team devise further quality control checks, especially in the time domain.
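To make the idea of a time-domain pattern check concrete, here is a small Python sketch. It is not the method used in BIDDSAT or in the GBIF index assessments: the function name `calendar_day_outliers`, the use of the median absolute deviation, and the default threshold are all assumptions chosen for illustration. It counts records per calendar day and flags days whose counts sit far above the typical count, the kind of spike that might, for instance, point to a placeholder date.

```python
from collections import Counter
from statistics import median


def calendar_day_outliers(dates, threshold=5.0):
    """Return (month, day) pairs with an unusually high number of records.

    `dates` is an iterable of datetime.date objects. A day is flagged when
    its count exceeds the median count by more than `threshold` times the
    median absolute deviation (MAD) of the counts.
    """
    counts = Counter((d.month, d.day) for d in dates)
    if not counts:
        return []

    values = list(counts.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    # Fall back to a unit scale when more than half the counts are identical.
    scale = mad if mad > 0 else 1.0

    return sorted(
        day for day, n in counts.items() if (n - centre) / scale > threshold
    )
```

A flagged day is only a candidate for further scrutiny, not evidence of an error: a genuine mass-collecting event would stand out in the same way.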

In our contribution we will briefly describe the ideas behind pattern analysis and summarize how DarwinTest and BIDDSAT work and contribute to the data quality and fitness-for-use of the primary biodiversity data served from GBIF.ES and beyond.


[1] http://www.gbif.es/BDQ.php

[2] http://www.gbif.es/darwin_test/Darwin_Test_in.php

[3] http://www.gbif.es/ica.php

[4] Ariño AH, Otegui J (2009) Meta-análisis de los datos de biodiversidad suministrados a través de gbif.es. Universidad de Navarra. Available: http://www.gbif.es/ficheros/CSeg/MetaGBIFEs.pdf

[5] Otegui J, Ariño AH, Encinas MA, Pando F (2013) Assessing the Primary Data Hosted by the Spanish Node of the Global Biodiversity Information Facility (GBIF). PLoS ONE 8(1): e55144. doi:10.1371/journal.pone.0055144

[6] Otegui J, Ariño AH (2012) BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics 28(16): 2207-2208.

[7] Gaiji S, Chavan V, Ariño AH, Otegui J, Hobern D, Sood R, Robles E (2013) Content assessment of the primary biodiversity data published through GBIF network: Status, Challenges and Potentials. Biodiversity Informatics 8(2): 94-172.