Building: CTEC
Room: Auditorium
Date: 2016-12-05 04:45 PM – 05:00 PM
Last modified: 2016-10-15
Abstract
Data cleaning can improve the chances that people and computers will find and use relevant data. This is true for researchers as well as for large-scale data aggregators. In the biodiversity realm, Darwin Core provides a convenient scope and framework for data cleaning tools and vocabularies.
One way to address data cleaning tasks is to use workflows that act on a combination of original data, controlled vocabularies, algorithms, and services to detect inconsistencies and errors, recommend changes, and augment the original data with improvements and additions. For flexibility, there are advantages to constructing such workflows from specialized, reusable "actors" -- building blocks that each perform a specific task, such as producing a list of the distinct values of a field in a data set.
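As an illustration only, a minimal sketch of such a "distinct values" actor is shown below as a plain Python function. This is not the Kurator-Akka actor interface; the function, file, and field names are hypothetical, chosen to show the kind of single-purpose building block the workflows are composed from.

    # Illustrative sketch: a "distinct values" actor as a plain Python
    # function (not the Kurator-Akka API). It tallies the distinct values
    # of one field across a stream of records.
    import csv
    from collections import Counter

    def distinct_values(records, field):
        """Count the distinct values of one field in a record stream."""
        return Counter(rec.get(field, "") for rec in records)

    # Hypothetical usage on a Darwin Core occurrence file:
    # with open("occurrences.csv", newline="", encoding="utf-8") as f:
    #     print(distinct_values(csv.DictReader(f), "country"))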
The Kurator project uses Akka, a Java-based framework, to construct workflows from actors written in a variety of programming languages, and even in a combination of them. In this presentation, we will explore the process of building actors and combining them into Akka workflows that perform a variety of data cleaning and reporting tasks, inspired by the VertNet process of mobilizing institutional data sets for large-scale aggregators such as VertNet, iDigBio, and the Global Biodiversity Information Facility. Ultimately, the goal of this work might be, given a biodiversity data set, to provide an improved version of that data set in the form of a Darwin Core archive that includes a data quality extension (not yet developed) reporting what was found, what was done to it, and what could still be done to improve it further.
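To make the composition idea concrete, here is a minimal, self-contained sketch of chaining single-purpose actors into a data cleaning pipeline. It uses plain Python generators rather than the Kurator-Akka workflow syntax, and the field names and controlled vocabulary are hypothetical examples; the point is only that each actor consumes a record stream, does one job (normalize, or flag against a vocabulary), and passes augmented records downstream.

    # Minimal sketch of a cleaning pipeline built from single-purpose
    # "actors" (plain Python generators, not Kurator-Akka actors).
    # The field names and vocabulary below are hypothetical.

    COUNTRY_VOCAB = {"United States", "Mexico", "Canada"}

    def trim_whitespace(records, field):
        """Actor: strip stray whitespace from one field."""
        for rec in records:
            rec[field] = rec.get(field, "").strip()
            yield rec

    def flag_unknown_terms(records, field, vocabulary):
        """Actor: flag values not found in a controlled vocabulary."""
        for rec in records:
            if rec.get(field) not in vocabulary:
                rec.setdefault("flags", []).append("unknown " + field)
            yield rec

    def run_pipeline(records):
        """Compose the actors into one workflow and collect the results."""
        step1 = trim_whitespace(records, "country")
        step2 = flag_unknown_terms(step1, "country", COUNTRY_VOCAB)
        return list(step2)

    if __name__ == "__main__":
        data = [{"country": " Mexico "}, {"country": "Mexiico"}]
        for rec in run_pipeline(data):
            print(rec)

Run as-is, the first record is normalized silently while the second acquires a "flags" entry, mirroring the abstract's detect-recommend-augment cycle on a toy scale.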