Building: Grand Hotel Mediterraneo
Room: America del Nord (Theatre I)
Date: 2013-10-28 04:20 PM – 04:40 PM
Last modified: 2013-10-05
Abstract
Data Quality (DQ) must be assessed before data is used for any purpose, so sufficient metadata is needed for users to determine whether the data is fit for use. We propose a new metadata schema to support DQ assessment. The schema is based on the description of (1) DQ measurements; (2) DQ policy compliance; and (3) data provenance.
Quality assessment can be performed based on DQ measurements described through quality dimensions. Dimensions are qualitative or quantitative indicators related to specific DQ aspects, such as accuracy, completeness or consistency. These metadata can be described by terms such as: dimension name, dimension description, measuring method, measured value, and reference to available measuring mechanisms.
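As an illustration only, the measurement terms listed above could be carried in a small record type. The sketch below is not part of the proposed schema: the class name, the attribute names, the `completeness` helper and the example field names are all assumptions chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DQMeasurement:
    """One DQ measurement; attribute names are illustrative, not the schema's."""
    dimension_name: str
    dimension_description: str
    measuring_method: str
    measured_value: float
    mechanism_reference: Optional[str] = None  # pointer to a measuring tool, if any


def completeness(records, required_fields):
    """Fraction of required field slots that hold a non-empty value."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total


def measure_completeness(records, required_fields):
    """Wrap the computed value in a DQMeasurement record."""
    return DQMeasurement(
        dimension_name="completeness",
        dimension_description="Share of required fields that are filled",
        measuring_method="filled required slots / total required slots",
        measured_value=completeness(records, required_fields),
    )
```

A user could then compare `measured_value` against a threshold that suits their own purpose, which is where fitness for use is decided.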
Because quality is an idiosyncratic concept, reporting what quality means in a given context is strongly desirable. DQ policy statements should describe this meaning, and each statement should declare how data must be presented in order to have quality. If data complies with the DQ policy statements, then the data has quality. These metadata can be described by terms such as: statement description, compliance with the statement, methods used for compliance checking, statement formalization, reference to available checking mechanisms, and reference to ontology.
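A minimal sketch of how policy statements and compliance results might be represented follows. It assumes each statement can be formalized as a predicate over a single record; the names `PolicyStatement`, `assess_compliance`, `has_quality` and `EXAMPLE_POLICY`, as well as the two example statements and their field names, are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PolicyStatement:
    """One DQ policy statement; attribute names are illustrative."""
    description: str                # human-readable statement
    checking_method: str            # how compliance is checked
    check: Callable[[dict], bool]   # formalization as a predicate (assumption)


def assess_compliance(record, statements):
    """Return one compliance entry per policy statement."""
    return [
        {
            "statement": s.description,
            "method": s.checking_method,
            "complies": s.check(record),
        }
        for s in statements
    ]


def has_quality(record, statements):
    """Data has quality if it complies with every policy statement."""
    return all(entry["complies"] for entry in assess_compliance(record, statements))


# Hypothetical policy used only for this example.
EXAMPLE_POLICY = [
    PolicyStatement(
        description="Latitude must lie between -90 and 90 degrees",
        checking_method="range check",
        check=lambda r: r.get("decimalLatitude") is not None
        and -90 <= r["decimalLatitude"] <= 90,
    ),
    PolicyStatement(
        description="Scientific name must not be empty",
        checking_method="non-empty check",
        check=lambda r: bool(r.get("scientificName")),
    ),
]
```

Keeping the per-statement results, rather than only the final verdict, lets a user with a different policy reuse the individual compliance entries.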
Moreover, data provenance has a significant role in DQ assessment. A provenance description allows users to know by whom and how data was created, modified and corrected over time. It allows tracking modifications performed by DQ mechanisms, such as taxonomic nomenclature correction or georeferencing from a location description, without losing the original values, and it describes the methods and the people or software involved in the correction activity. These metadata can be described using PROV-DM (http://www.w3.org/TR/prov-dm). The core of this data model consists of three types for describing provenance: Agent - who or what created or modified the data; Activity - the action that created or modified the data, associated with the methods used (e.g. georeferencing using Google API; taxonomic identification correction by a taxonomist); and Entity - the data itself, where each modification generates a new entity holding a new version of the same data, without losing older versions.
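The versioning idea can be sketched with plain Python classes that mirror the three PROV-DM types. This is a simplified illustration under stated assumptions, not a PROV-DM implementation: the class attributes and the helper functions `revise` and `history` are hypothetical, and only the relation names in the comments come from PROV-DM.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Agent:
    name: str  # who or what acted (a person or a piece of software)


@dataclass(frozen=True)
class Activity:
    method: str   # e.g. "taxonomic identification correction"
    agent: Agent  # corresponds to PROV wasAssociatedWith


@dataclass(frozen=True)
class Entity:
    value: str
    version: int = 1
    derived_from: Optional["Entity"] = None  # corresponds to PROV wasDerivedFrom
    generated_by: Optional[Activity] = None  # corresponds to PROV wasGeneratedBy


def revise(entity, new_value, activity):
    """Create a new version; the older entity stays reachable and unchanged."""
    return Entity(new_value, entity.version + 1, entity, activity)


def history(entity):
    """List all versions from the original value to the given entity."""
    chain = []
    while entity is not None:
        chain.append(entity)
        entity = entity.derived_from
    return chain[::-1]
```

For example, correcting a misspelled name with `revise(Entity("Apis melifera"), "Apis mellifera", Activity("taxonomic identification correction", Agent("taxonomist")))` yields a version 2 entity whose `history` still contains the original value.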
Besides allowing DQ assessment by humans and computers, this metadata schema can help the Biodiversity Informatics community share and reuse DQ policies, dimension definitions and DQ mechanisms, avoiding rework and helping the community converge on future standards for biodiversity DQ policies, measuring methods and DQ improvement mechanisms.
Acknowledgement: The Research Center on Biodiversity and Computing (Biocomp) – University of Sao Paulo - Brazil.