Missouri Botanical Garden Open Conference Systems, TDWG 2013 ANNUAL CONFERENCE

Proposal of a methodology for dealing with Biodiversity Data Quality
Allan Koch Veiga, Antonio Mauro Saraiva, Etienne Americo Cartolano Jr.

Building: Grand Hotel Mediterraneo
Room: America del Nord (Theatre I)
Date: 2013-10-28 02:10 PM – 02:30 PM
Last modified: 2013-10-05

Abstract


Data Quality (DQ) is a major concern in Biodiversity Informatics. Data acquisition and digitization are distributed, and some data sub-domains, such as taxonomic and geographic data, pose specific difficulties. These and other aspects make it important to discuss and propose a methodology that standardizes how the Biodiversity Informatics community deals with DQ.

We propose a methodology that aims to improve the fitness for use of biodiversity data by end users. It is composed of four main steps: (1) Identifying DQ needs; (2) Defining a DQ policy; (3) Implementing DQ mechanisms; and (4) Generating DQ measurements and data provenance metadata.

The aim of the first step is to understand what end users expect regarding DQ. This must be based on requirements defined by end users to assess data fitness for use and includes identifying and describing which types of data are used, for what purpose they are used, what quality aspects (dimensions) are relevant for assessing the fitness for use, and what degrades the quality of these aspects. This step must support the definition of an appropriate DQ policy.

The aim of the second step is to define a DQ policy that declares how data must be presented to satisfy end users' DQ needs. The policy is composed of a set of statements. Each statement can declare the meaning of a DQ dimension, how it is measured, or the condition that data must satisfy to have quality, for example: “Coordinates must be supplied” or “Taxon name must abide by one taxonomic authority”.
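
As an illustration only, the two example statements could be encoded as testable conditions. The sketch below is not part of the proposed methodology: the `DQStatement` class, the `ACCEPTED_NAMES` authority list and the use of Darwin Core style field names (`scientificName`, `decimalLatitude`, `decimalLongitude`) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# A record is assumed to be a flat mapping of field names to string values.
Record = Dict[str, Optional[str]]


@dataclass(frozen=True)
class DQStatement:
    """One policy statement: a description plus a condition data must satisfy."""
    description: str
    condition: Callable[[Record], bool]


# Hypothetical stand-in for a single taxonomic authority.
ACCEPTED_NAMES = {"Apis mellifera", "Bombus terrestris"}


def has_coordinates(record: Record) -> bool:
    # True only when both coordinate fields are present and non-empty.
    return bool(record.get("decimalLatitude")) and bool(record.get("decimalLongitude"))


def name_in_authority(record: Record) -> bool:
    return record.get("scientificName") in ACCEPTED_NAMES


POLICY: List[DQStatement] = [
    DQStatement("Coordinates must be supplied", has_coordinates),
    DQStatement("Taxon name must abide by one taxonomic authority", name_in_authority),
]


def failed_statements(record: Record, policy: List[DQStatement]) -> List[str]:
    """Return the descriptions of all statements the record does not satisfy."""
    return [s.description for s in policy if not s.condition(record)]
```

Calling `failed_statements(record, POLICY)` on a record with no longitude would list the coordinates statement, which gives later steps something concrete to enforce and report on.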

To meet this DQ policy, a set of DQ mechanisms must be implemented. The aim of the third step is to implement techniques, tools and procedures for enforcing the DQ policy. A DQ mechanism may be classified as: Prevention – for avoiding errors; Detection – for detecting errors; Correction – for correcting detected errors; Recommendation – for suggesting a correction to detected errors; Validation – for checking whether data are compliant with a DQ statement; and Measurement – for assigning a quantitative or qualitative value to a DQ dimension.
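
To make three of these classes concrete, the following minimal sketch shows one possible Detection, Recommendation and Measurement mechanism. The function names, the field names, the `ACCEPTED_NAMES` list, the similarity cutoff of 0.8 and the choice of completeness as the measured dimension are assumptions for illustration, not mechanisms prescribed here.

```python
from difflib import get_close_matches
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Hypothetical authority list and required fields, chosen for the example.
ACCEPTED_NAMES = ["Apis mellifera", "Bombus terrestris"]
REQUIRED_FIELDS = ["scientificName", "decimalLatitude", "decimalLongitude"]


def detect_coordinate_errors(record: Record) -> List[str]:
    """Detection: report coordinates that are missing, non-numeric or out of range."""
    errors = []
    for field, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
        try:
            number = float(record.get(field))
        except (TypeError, ValueError):
            errors.append(f"{field} is missing or not numeric")
            continue
        if abs(number) > limit:
            errors.append(f"{field} is out of range")
    return errors


def recommend_name(name: str) -> Optional[str]:
    """Recommendation: suggest the closest accepted name, or None if none is close."""
    matches = get_close_matches(name, ACCEPTED_NAMES, n=1, cutoff=0.8)
    return matches[0] if matches else None


def measure_completeness(record: Record) -> float:
    """Measurement: fraction of required fields that are filled in."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f) not in (None, ""))
    return filled / len(REQUIRED_FIELDS)
```

Keeping Recommendation separate from Correction leaves the decision to accept a suggested name with a curator or a later, explicitly configured step.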

Furthermore, it is necessary to provide users with sufficient metadata about DQ and data provenance. This information can help users assess whether data are fit for use. The aim of the fourth step is to provide metadata about DQ measurements, about compliance of data with the DQ policy and about data provenance.
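
One possible shape for such metadata is sketched below. The `build_dq_report` function and every key in the resulting structure are hypothetical; no particular metadata schema or provenance standard is implied.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_dq_report(
    record_id: str,
    measurements: Dict[str, float],
    compliance: Dict[str, bool],
    source: str,
    processing_steps: List[str],
) -> Dict[str, Any]:
    """Bundle DQ measurements, policy compliance and provenance for one record."""
    return {
        "recordId": record_id,
        "generatedAt": datetime.now(timezone.utc).isoformat(),
        # Value assigned to each DQ dimension, e.g. {"completeness": 0.67}.
        "measurements": measurements,
        # Outcome of each policy statement, keyed by its description.
        "policyCompliance": compliance,
        # Where the data came from and what was done to it.
        "provenance": {"source": source, "processingSteps": processing_steps},
    }


report = build_dq_report(
    record_id="example-0001",  # hypothetical identifier
    measurements={"completeness": 1.0},
    compliance={"Coordinates must be supplied": True},
    source="hypothetical herbarium database",
    processing_steps=["digitized from label", "coordinates validated"],
)
print(json.dumps(report, indent=2))
```

Publishing a report like this alongside each record would let users judge fitness for their own purpose rather than rely on a single pass or fail flag.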

Acknowledgement: The Research Center on Biodiversity and Computing (Biocomp) – University of Sao Paulo - Brazil.