Building: Grand Hotel Mediterraneo
Room: Sala dei Continenti
Date: 2013-10-29 02:54 PM – 03:12 PM
Last modified: 2013-10-05
Abstract
Natural science collections data are rife with errors and inconsistencies. Reuse of collections data to address scientific questions raises concerns about data quality and fitness for use within the collections management community. We have implemented data quality control using Kepler scientific workflows [1] embedded within an analytical capability of FilteredPush network instances [2,3,4]. These workflows apply quality control criteria to harvested data and generate actionable annotations as feedback to data curators, who can then correct their data. A user of the system can launch a quality control workflow from a web client onto a FilteredPush network. The resulting annotations identify data quality issues, and each annotation's processing history records how a data curator acted on it to apply changes to authoritative records in a Specify-6 database.
The workflow identifies data quality problems and generates annotations expressing either proposed solutions or the presence of problems that the workflow is unable to resolve. Interested parties, such as data curators for affected collections, are notified of annotations and may retrieve and act upon them to modify their data. Under the hood, we use an open-source stack: Fedora as a document store for annotations and workflows, Fuseki as a triple store for semantic queries and reasoning, and MongoDB as a store for data harvested over the OAI-PMH protocol. Notification is implemented by matching annotations against interests expressed as SPARQL queries. Research questions of differing scope may repeatedly detect the same data quality issues, highlighting the records the research community considers most in need of correction.
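The notification step above can be sketched in plain Java. This is a minimal, self-contained illustration of matching an annotation (modeled as RDF-style triples) against a curator's interest; in the real system the interest would be a SPARQL query evaluated against Fuseki, which a `Predicate` stands in for here, and all names (`fp:issue`, `ann:42`, `InterestMatcher`) are illustrative, not the FilteredPush API.

```java
import java.util.List;
import java.util.function.Predicate;

// Minimal RDF-style triple: (subject, predicate, object).
record Triple(String s, String p, String o) {}

public class InterestMatcher {
    // An "interest" stands in for a SPARQL ASK query over the annotation
    // graph: the annotation matches if any of its triples satisfies it.
    static boolean matches(List<Triple> annotation, Predicate<Triple> interest) {
        return annotation.stream().anyMatch(interest);
    }

    public static void main(String[] args) {
        // A hypothetical annotation flagging a georeference problem.
        List<Triple> annotation = List.of(
            new Triple("ann:42", "oa:hasTarget", "urn:occurrence:1234"),
            new Triple("ann:42", "fp:issue", "geospatial-mismatch"));
        // A curator's interest: any annotation raising a geospatial issue.
        Predicate<Triple> interest =
            t -> t.p().equals("fp:issue") && t.o().startsWith("geospatial");
        System.out.println(matches(annotation, interest)); // prints "true"
    }
}
```

When a match is found, the system notifies the corresponding curator, who can retrieve the annotation and decide whether to apply the proposed change.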
For our curation workflows, we extended the generic Kepler development system for better deployment: instead of the default build-and-run system, we use a simplified but more efficient execution framework. We use Kepler as a library, i.e., without the standard Graphical User Interface (GUI), and couple it directly to the FilteredPush system in Java. Unlike the alternative, a loosely coupled approach that invokes headless Kepler through a shell, our approach uses only a single Java Virtual Machine (JVM), and thus is more efficient in both runtime and memory. We are also working on improving the curation workflow runtimes further by exploiting parallel execution. Neither the default Kepler execution model nor the Kepler/COMAD model [5] automatically executes independent service instances in parallel. We have conducted experiments with Akka [6], a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant event-driven applications on the JVM, which indicate that we can accelerate curation workflow execution significantly through parallel service invocation.
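The gist of parallel service invocation can be sketched with the JDK's standard concurrency utilities; this is not the Akka-based implementation described above, but a minimal stdlib analogue in which each independent curation service call runs as its own task, and `curate` and `ParallelCuration` are hypothetical names, not a FilteredPush or Kepler API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCuration {
    // Hypothetical curation service call: validates one record
    // (e.g. a georeference check) and returns a result.
    static String curate(String recordId) {
        return recordId + ":validated";
    }

    // Submit one task per record so independent service instances run
    // concurrently; results are collected in submission order.
    static List<String> curateAll(List<String> recordIds) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String id : recordIds) {
                futures.add(pool.submit(() -> curate(id)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks until each task completes
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(curateAll(List.of("r1", "r2", "r3")));
        // prints "[r1:validated, r2:validated, r3:validated]"
    }
}
```

An actor toolkit such as Akka adds supervision, fault tolerance, and distribution on top of this basic pattern, which is why we chose it for our experiments.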
References
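[1] https://kepler-project.org/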
[2] http://wiki.filteredpush.org
[3] Morris, P.J., J. A. Macklin, J. Hanken, M. Kelly, S. Koehler, D. Lowery, B. Ludäscher, R.A. Morris, T. Song. 2013. Improving Natural Science Collections data through quality control for research using Kepler workflows embedded in a FilteredPush network. SPNHC: 28th annual meeting of the Society for the Preservation of Natural History Collections.
[4] Dou, L., G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for Data Curation Workflows. Procedia Computer Science 9: 1614–1619. http://dx.doi.org/10.1016/j.procs.2012.04.177.
[5] McPhillips, T., S. Bowers, D. Zinn, B. Ludäscher. 2009. Scientific workflow design for mere mortals. Future Generation Computer Systems 25(5): 541–551.
[6] http://akka.io/