Missouri Botanical Garden Open Conference Systems, TDWG 2013 ANNUAL CONFERENCE

Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions
Christian Gendreau, David P. Shorthouse, Peter Desmet

Building: Grand Hotel Mediterraneo
Room: America del Nord (Theatre I)
Date: 2013-10-29 02:15 PM – 02:30 PM
Last modified: 2013-11-11

Abstract


Canadensys (http://www.canadensys.net) is a network of 11 Canadian universities, 3 botanical gardens, and 2 museums that digitize occurrence records and make them available via a local instance of GBIF’s Integrated Publishing Toolkit. Records from host institutions are independently published under a Public Domain waiver and subsequently aggregated on a dynamic Explorer (http://data.canadensys.net) based on CartoDB's Windshaft tile server. Our goal is to enrich the Explorer with better filtering and download capabilities and to provide structured reports on data quality to our stakeholders.

We are building an open source, extensible, i18n-ready data quality library that can flag suspect occurrence records and, where appropriate, normalize, transform, and populate empty fields of data. We are building this code library in stages with the participation of members of SiB Colombia, GBIF, VertNet, iDigBio, FilteredPush, and other informatics projects that face similar challenges. The first stage of development includes a framework for a scalable processing engine that transforms and normalizes independent fields of data requiring few secondary or contextual validations (e.g. standardizing the spelling of state or province names when the country is known). Future stages of development will use this infrastructure to flexibly combine two or more fields within a record or across records for contextual data validation routines. Where possible, we expose the logic within the processing engine through simple web forms and web-based application programming interfaces. These allow the research community to validate or transform data of particular types singly or in bulk, or to incorporate these services into their applications at the point of data entry.
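
As an illustration of the first-stage example above (standardizing state or province names when the country is known), a single-field processor might look like the following minimal sketch. It is not the project's actual code: the function name `normalize_state_province`, the `Result` type, and the small `PROVINCE_VARIANTS` lookup table are all hypothetical and were chosen only to show the flag-or-normalize idea.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table: (country code, lowercased variant) -> standard name.
# A real library would load a much larger, localized table.
PROVINCE_VARIANTS = {
    ("CA", "qc"): "Quebec",
    ("CA", "que."): "Quebec",
    ("CA", "quebec"): "Quebec",
    ("CA", "on"): "Ontario",
    ("CA", "ont."): "Ontario",
    ("CA", "ontario"): "Ontario",
}


@dataclass
class Result:
    value: Optional[str]  # normalized value, or the cleaned input if unmatched
    flagged: bool         # True when the record should be reviewed by a person
    note: str = ""


def normalize_state_province(raw: Optional[str], country_code: str) -> Result:
    """Normalize one state/province value, given a known country code."""
    if raw is None or not raw.strip():
        # Nothing to normalize; an empty field is not flagged as suspect here.
        return Result(None, False, "empty value")
    cleaned = raw.strip()
    standard = PROVINCE_VARIANTS.get((country_code.upper(), cleaned.lower()))
    if standard is None:
        # Unknown spelling: keep the original text and flag it for review.
        return Result(cleaned, True, "unrecognized state or province")
    return Result(standard, False)


if __name__ == "__main__":
    for raw in [" QC ", "Ont.", "Kebec", ""]:
        print(repr(raw), normalize_state_province(raw, "CA"))
```

The sketch keeps the original value when no match is found rather than guessing, so that flagging and normalization remain separate outcomes for the same field.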