Last modified: 2011-10-12
Abstract
Species occurrence data available in portals such as GBIF and IABIN are widely used in studies on the conservation and sustainable use of natural resources. The quality of those data, especially taxonomic and location data, is essential for the quality of the numerous studies, assessments and models derived from them.
Data quality (DQ) is a multidimensional concept: each dimension represents one aspect of quality, which allows DQ to be measured and managed more objectively. The dimensions of DQ include Completeness (sufficiency to perform a task), Consistency (absence of contradictions), Credibility (reputation of the source), Accuracy (correctness or veracity) and Precision (resolution or granularity).
DQ in each dimension may be improved by preventing common error patterns such as domain value redundancy, missing data values, incorrect data values, nonatomic data values, duplicate occurrences, inconsistent data values and information quality contamination.
To improve the quality of the data digitized in the Biodiversity Data Digitizer (BDD), a web-based system, two computational resources were designed and integrated into it. They aim to reduce digitization errors in location and taxonomic data and to improve data completeness, consistency, credibility, accuracy and precision.
To prevent errors in location data, a web-based resource was added to BDD. It is organized in three steps: (1) primary data insertion, (2) data source selection and (3) uncertainty reporting. In the first step the user enters primary data in one of three ways: typing known geographical coordinates; giving a locality description, such as “25 km NNE from New Orleans”; or using a three-dimensional interactive map to find an approximate location. In the second step the primary data are used to obtain complementary information such as latitude, longitude, geodetic datum, altitude, municipality, state and country from three data sources, BioGeomancer, Google Maps and GeoNames, any of which can be selected by the user. In the third step the user can report the coordinate uncertainty in meters. A circle representing the uncertainty is plotted on a map to make it easier to visualize.
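The geometry behind the locality description and the uncertainty circle can be illustrated with a minimal Python sketch. It is not the BDD implementation: the function names, the spherical Earth model, the fixed mean radius and the approximate New Orleans coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius (spherical model)


def destination(lat, lon, bearing_deg, distance_m):
    """Point reached from (lat, lon) after distance_m along bearing_deg."""
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance_m / EARTH_RADIUS_M  # angular distance in radians
    phi2 = math.asin(
        math.sin(phi1) * math.cos(delta)
        + math.cos(phi1) * math.sin(delta) * math.cos(theta)
    )
    lam2 = lam1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2),
    )
    # Normalize longitude to the range [-180, 180).
    lon2 = (math.degrees(lam2) + 540.0) % 360.0 - 180.0
    return math.degrees(phi2), lon2


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def uncertainty_circle(lat, lon, radius_m, vertices=36):
    """Vertices of a circle of radius_m meters around (lat, lon)."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range")
    if radius_m <= 0:
        raise ValueError("uncertainty must be positive")
    step = 360.0 / vertices
    return [destination(lat, lon, i * step, radius_m) for i in range(vertices)]


# "25 km NNE from New Orleans": NNE is a bearing of 22.5 degrees.
# The city coordinates below are approximate and only illustrative.
point = destination(29.95, -90.07, 22.5, 25_000)
circle = uncertainty_circle(point[0], point[1], 1_000)
```

Under these assumptions, `point` is an estimate for the described locality and `circle` is a list of (latitude, longitude) pairs that a map layer could draw as a polygon. A production georeferencing service would also account for the datum and for the extent of the named place, which this sketch ignores.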
As for taxonomic data, errors are prevented by a resource that assists the entry of taxonomic names and hierarchies, based on the Catalogue of Life (CoL) database and on a local collection database. While the user fills in a taxon field, a list of taxon names is suggested. The suggestions are retrieved from the local database using a fuzzy matching technique, which retrieves textual data that is orthographically similar to what was typed. When a taxon name is selected, its hierarchy is obtained from the local database. If the typed taxon name does not exist in the local database, or is not validated according to CoL, the CoL database is queried. Suggestions of valid taxon names and hierarchies according to CoL, also obtained by fuzzy matching, are then presented.
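The idea of suggesting orthographically similar names can be sketched in Python with the standard library's `difflib`. This is an illustration only: the abstract does not say which fuzzy matching algorithm BDD uses, and the name list, function name and similarity cutoff below are hypothetical.

```python
import difflib

# Hypothetical stand-in for the local collection database.
LOCAL_TAXA = [
    "Apis mellifera",
    "Apis cerana",
    "Bombus terrestris",
    "Melipona quadrifasciata",
]


def suggest_taxa(typed, names=LOCAL_TAXA, limit=3, cutoff=0.6):
    """Return up to `limit` names that are orthographically close to `typed`."""
    # Compare in lower case so that capitalization does not affect the match.
    by_lower = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        typed.lower(), list(by_lower), n=limit, cutoff=cutoff
    )
    return [by_lower[hit] for hit in hits]


# A misspelled entry still finds the intended name as the best suggestion.
suggestions = suggest_taxa("Apis melifera")
```

In a workflow like the one described, an empty result from the local list would be the point at which the same kind of query is sent to the CoL database instead.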
These computational resources, integrated into BDD, allowed a significant improvement of DQ in the dimensions mentioned by preventing the user from inadvertently introducing errors such as value redundancy, missing data, incorrect values, nonatomic values, duplicate occurrences, inconsistent values and information quality contamination.