TDWG 2013 Annual Conference

Linking systems to improve data quality
Javier Otegui, Robert Guralnick

Building: Grand Hotel Mediterraneo
Room: America del Nord (Theatre I)
Date: 2013-10-28 04:00 PM – 04:05 PM
Last modified: 2013-10-02

Abstract
Data quality remains a key challenge for the biodiversity informatics community. Given maturing data publishing mechanisms and new tools for data quality assessment, an important next step is a more unified approach to assessing fitness for use across very large datasets. Many individual efforts, often herculean, yield very useful results, but the impact of those results is usually narrow in scope, and there has been little success in developing a broader, organization-independent framework for the data quality improvement workflow. Effective communication and sharing of information between initiatives can yield results greater than the sum of the parts. Here we present a case scenario of integration between large-scale species occurrence aggregators, VertNet (http://www.vertnet.org) and GBIF (http://www.gbif.org), and Map of Life (http://www.mappinglife.org). VertNet and GBIF are networks of institutions that share global primary information about species presence. Map of Life is an online resource that collates and links different sources of biodiversity information (expert range maps, regional checklists…) under a common infrastructure. By sharing information between these initiatives, we have built a framework in which records from VertNet and GBIF are checked against different Map of Life resources. This allows the detection, and in some cases resolution, of taxonomic and spatial issues based on the distribution of the records and the extents of the range maps, improving the quality of the data VertNet serves. We also show overall results from an analysis of over 200 million point occurrences, focusing on error rates and, most importantly, asking which factors, such as inherent properties of the records themselves and dataset publishing trends, might explain the variability in those error rates.
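
To illustrate the kind of check the framework performs, the sketch below flags occurrence records whose coordinates fall outside the expert range map for their species, or whose taxon name cannot be matched to any range map. This is a minimal conceptual example, not the authors' implementation: the record fields (Darwin Core-style names), the range-map lookup, and the tolerance parameter are all assumptions made for illustration, and the geometry test uses the Shapely library.

    # Minimal sketch (not the authors' implementation) of a spatial/taxonomic
    # consistency check of occurrence records against expert range maps.
    # Record and range-map structures are hypothetical placeholders.
    from shapely.geometry import Point, shape

    def flag_issues(records, range_maps, buffer_deg=1.0):
        """Yield (record, issue) pairs for records that fail a check.

        records:    iterable of dicts with 'scientificName',
                    'decimalLatitude', 'decimalLongitude'.
        range_maps: dict mapping scientific name -> GeoJSON-like polygon.
        buffer_deg: tolerance (in degrees) added around the range polygon.
        """
        for rec in records:
            name = rec.get("scientificName")
            lat = rec.get("decimalLatitude")
            lon = rec.get("decimalLongitude")

            if name not in range_maps:
                yield rec, "taxon not matched to any range map"   # taxonomic issue
                continue
            if lat is None or lon is None:
                yield rec, "missing coordinates"                  # completeness issue
                continue

            polygon = shape(range_maps[name]).buffer(buffer_deg)
            if not polygon.contains(Point(lon, lat)):
                yield rec, "coordinates outside expert range map"  # spatial issue

    # Toy usage: a Puma concolor record located in Europe is flagged.
    if __name__ == "__main__":
        maps = {"Puma concolor": {
            "type": "Polygon",
            "coordinates": [[(-120, -55), (-35, -55), (-35, 60),
                             (-120, 60), (-120, -55)]]}}
        recs = [{"scientificName": "Puma concolor",
                 "decimalLatitude": 48.2, "decimalLongitude": 16.4}]
        for rec, issue in flag_issues(recs, maps):
            print(rec["scientificName"], "->", issue)

In practice such flags would be reported back to the publishing network (e.g. VertNet) so that issues can be reviewed and, where possible, resolved at the source.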