Building: Grand Hotel Mediterraneo
Room: America del Nord (Theatre I)
Date: 2013-10-28 04:05 PM – 04:20 PM
Last modified: 2013-10-05
Abstract
Experimental data are prone to errors; data quality management (DQM) is therefore an integral part of the Biodiversity Exploratories Information System (BExIS), the data repository and information exchange platform of the Biodiversity Exploratories project. BExIS is currently being redesigned to be modular, scalable, and extensible, and a new DQM framework module is being developed for this new version. Within it, a user can specify DQM criteria on (groups of) variables at data structure design time. Such criteria include integrity constraints, data types, regex patterns, series definitions, variable dependencies, complex business rules (e.g., condition-based patterns), and domain value ranges (e.g., a (reference to a) list for string-valued variables). In addition, users will be able to specify other variables/datasets that are related to a variable/dataset being created, as well as thresholds for indicating dataset/variable/tuple completeness and redundancy. Based on some of the specified DQM criteria, a user will be able to download Microsoft Excel templates with macros that guide data entry and validate it before dataset upload. At upload time, the dataset will be validated against some of the user-specified DQM criteria.
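To make the design-time criteria concrete, the following sketch shows one way variable-level DQM criteria (data type, regex pattern, value range, domain list) could be represented and checked in Python; the class and field names are hypothetical illustrations, not the actual BExIS API.

    # Hypothetical representation of design-time DQM criteria on one variable;
    # names and structure are illustrative, not the BExIS implementation.
    import re
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class VariableCriteria:
        name: str
        data_type: type                      # expected type of each value
        pattern: Optional[str] = None        # regex the raw value must match
        value_range: Optional[tuple] = None  # inclusive (min, max) for numeric variables
        domain: Optional[set] = None         # allowed values, e.g., a list for string variables

        def validate(self, raw: str) -> list:
            """Return a list of violation messages for one raw value."""
            errors = []
            if self.pattern and not re.fullmatch(self.pattern, raw):
                errors.append(f"{self.name}: '{raw}' does not match /{self.pattern}/")
            try:
                value = self.data_type(raw)
            except ValueError:
                return errors + [f"{self.name}: '{raw}' is not a {self.data_type.__name__}"]
            if self.value_range and not (self.value_range[0] <= value <= self.value_range[1]):
                errors.append(f"{self.name}: {value} outside range {self.value_range}")
            if self.domain and value not in self.domain:
                errors.append(f"{self.name}: '{value}' not in the domain value list")
            return errors

    # Example: a pH variable with a numeric pattern and a plausible value range.
    ph = VariableCriteria("soil_pH", float, pattern=r"\d+(\.\d+)?", value_range=(0.0, 14.0))
    print(ph.validate("6.5"))   # -> []
    print(ph.validate("17.2"))  # -> out-of-range violation

An Excel template generated from such criteria could embed the same pattern and range checks in its data-entry macros, so that violations surface before upload.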
Afterwards, dataset auditing will be carried out as a batch process. Its first part is data profiling, which describes the components of the dataset using various statistical metrics. The second part consists of several data analytics and mining procedures that use information such as the related datasets/variables provided at data structure design time. The analyses will include outlier detection, redundancy analysis (using exact or fuzzy matching techniques), text-based analytics (for detecting, e.g., spelling errors, groups of misspelled terms, and cryptic names), and the generation of DQM criteria not specified by the users (e.g., patterns and variable dependencies), against which the dataset will then be validated. Furthermore, the data structure will be examined, e.g., for non-atomic columns, and GIS-related checks are performed at this stage as well. A report combining the profiling statistics and the probable errors found by the analytics will then be sent to the data owner. Finally, a correction and enrichment component will assist data owners in (mass-)correcting their datasets based on the report. This component also provides tools that suggest possible corrections (e.g., for missing values) and enrichment information (e.g., links to the Catalogue of Life for species datasets).
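As a flavor of the batch auditing analytics, the sketch below shows a simple univariate outlier check based on Tukey's interquartile-range fences; the framework's actual methods may differ, and this function is only an illustrative assumption.

    # One possible outlier check for dataset auditing: Tukey's IQR fences.
    def iqr_outliers(values, k=1.5):
        """Return (index, value) pairs falling outside the Tukey fences."""
        ordered = sorted(values)
        n = len(ordered)
        q1, q3 = ordered[n // 4], ordered[(3 * n) // 4]  # rough quartile positions
        lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
        return [(i, v) for i, v in enumerate(values) if v < lo or v > hi]

    print(iqr_outliers([5.1, 5.3, 4.9, 5.0, 5.2, 19.7]))  # -> [(5, 19.7)]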
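Likewise, redundancy analysis and the detection of groups of misspelled terms could rest on a string-similarity measure; the sketch below uses Python's difflib, with the 0.9 threshold as an illustrative assumption rather than the framework's actual setting.

    # Sketch of fuzzy matching for redundancy analysis and misspelling groups,
    # using difflib's similarity ratio; threshold and library choice are assumptions.
    from difflib import SequenceMatcher
    from itertools import combinations

    def fuzzy_duplicates(records, threshold=0.9):
        """Return index pairs of records whose similarity meets the threshold."""
        return [(i, j)
                for (i, a), (j, b) in combinations(enumerate(records), 2)
                if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold]

    species = ["Fagus sylvatica", "Fagus sylvatika", "Picea abies"]
    print(fuzzy_duplicates(species))  # -> [(0, 1)]: likely the same species, misspelled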