Missouri Botanical Garden Open Conference Systems, TDWG 2015 ANNUAL CONFERENCE

Data cleaning with the Kurator toolkit: Bridging the gap between conventional scripting and high-performance workflow automation

Building: Windsor Hotel
Room: Oak Room
Date: 2015-09-30 03:15 PM – 03:30 PM
Last modified: 2015-08-29

Abstract


The Kurator project aims to facilitate the development, documentation, and efficient execution of scripts and workflows for cleaning biodiversity data. Kurator tools, currently under development and available as prototypes in the Kurator GitHub repositories (http://github.com/kurator-org/), support traditional scripting as well as high-performance, actor-oriented workflow approaches to validating, annotating, and cleaning data. The Kurator-Akka framework (http://github.com/kurator-org/kurator-akka) makes it easy to develop and run high-performance data cleaning workflows that employ the Akka actor toolkit by shielding actor developers and workflow users alike from the complexities of the Akka API (application programming interface). Kurator-Akka actors can currently be written in either Python or Java, and workflows may be specified using a language based on YAML (YAML Ain't Markup Language) that defines how data flows between the actors at run time. A workflow can be composed from existing actors by editing a simple text file and then executed by providing this file to the Kurator-Akka workflow runtime. Actors in a Kurator-Akka workflow execute concurrently in different threads, potentially yielding pipeline parallelism and thus higher throughput than is achievable in conventional scripts.
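
The pipeline parallelism described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the Kurator-Akka API: the names `run_stage`, `run_pipeline`, `trim`, and `normalize_case` are hypothetical, and each stage is assumed to be a simple function applied to one record at a time. Each stage runs in its own thread and passes results downstream through a queue, so a later stage can work on one record while an earlier stage is already handling the next.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


def run_stage(func, inbox, outbox):
    """Apply func to each item taken from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(items, stages):
    """Run every stage in its own thread, connected by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    # Collect results from the last queue until the sentinel arrives.
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


def trim(record):
    """Hypothetical cleaning step: remove surrounding whitespace."""
    return record.strip()


def normalize_case(record):
    """Hypothetical cleaning step: capitalize the first letter only."""
    return record.capitalize()


cleaned = run_pipeline(["  mytilus EDULIS ", "carcinus maenas  "],
                       [trim, normalize_case])
```

Because each stage has a single worker and the queues are first-in, first-out, record order is preserved. In CPython the global interpreter lock limits the gain for CPU-bound steps, so this threaded form mainly helps with I/O-bound stages such as web service calls.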

Recognizing that conventional scripts written, for example, in Bash, Python or R, also represent an effective means of automating data cleaning workflows, Kurator is leading an interdisciplinary effort to develop the YesWorkflow toolkit (http://yesworkflow.org/yw). YesWorkflow (YW) aims to provide many of the benefits of using a scientific workflow management system without having to rewrite scripts for execution within a workflow engine. Instead, a YesWorkflow user simply adds special YesWorkflow comments to existing scripts. These comments declare how data is used and results produced, step by step, by the script. The YesWorkflow tools interpret the YW comments and produce graphical output that reveals the stages of computation and the flow of data in the script. A means for reconstructing and querying the provenance of the outputs of a script marked up with YesWorkflow annotations is currently under development.
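
As a rough illustration of the markup approach described above, the sketch below annotates a small Python function with YesWorkflow-style comments. It assumes the `@begin`, `@in`, `@out`, and `@end` keywords from the YesWorkflow documentation; the block names, variable names, and the cleaning steps themselves are hypothetical and are not taken from the Kurator repositories.

```python
# @begin clean_names
# @in raw_names
# @out cleaned_names

def clean_names(raw_names):
    """Strip whitespace from each name and drop entries that end up empty."""

    # @begin strip_whitespace
    # @in raw_names
    # @out stripped_names
    stripped_names = [name.strip() for name in raw_names]
    # @end strip_whitespace

    # @begin drop_blanks
    # @in stripped_names
    # @out cleaned_names
    cleaned_names = [name for name in stripped_names if name]
    # @end drop_blanks

    return cleaned_names

# @end clean_names
```

The comments do not change how the script runs. They only declare that `stripped_names` flows from the first step into the second, which is the kind of information a tool can read to draw the stages and data flow.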

Because scripts marked up with YesWorkflow annotations may be used as actors in the Kurator-Akka framework, Kurator tools span the scripting and workflow automation paradigms. The Kurator-Validation GitHub repository (http://github.com/kurator-org/kurator-validation) provides example scripts, actors, workflows, and documentation that demonstrate how Kurator tools effectively integrate scripting and automated workflow approaches to cleaning biodiversity data. Using a simple Python class that wraps the WoRMS (World Register of Marine Species) web service as a starting point, documentation available in the Kurator-Validation GitHub repository demonstrates (1) how a Python script can make use of this WoRMS service class to validate names against the standard WoRMS taxonomy; (2) how to annotate this script with YesWorkflow comments so the script can be modeled, visualized and analyzed as a workflow; (3) how a Kurator-Akka actor invoking the WoRMS service class can be written in Python; and (4) how a workflow employing this actor can be specified in YAML and executed by the Kurator-Akka framework with each actor running concurrently.
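
The service-class pattern in step (1) might look roughly like the sketch below. It is an assumption-laden stand-in, not the class from the Kurator-Validation repository: `StubNameService`, its `lookup` method, and `validate_names` are hypothetical names, and the lookup uses a small in-memory reference list instead of calling the WoRMS web service.

```python
class StubNameService:
    """Hypothetical stand-in for a taxonomic name service wrapper (no network access)."""

    def __init__(self, accepted_names):
        # Index the reference names by lowercase form for case-insensitive lookup.
        self._by_key = {name.lower(): name for name in accepted_names}

    def lookup(self, name):
        """Return the reference spelling of name, or None if it is unknown."""
        return self._by_key.get(name.strip().lower())


def validate_names(names, service):
    """Annotate each input name with the outcome of the service lookup."""
    results = []
    for name in names:
        match = service.lookup(name)
        if match is None:
            status = "not found"
        elif match == name:
            status = "exact match"
        else:
            status = "corrected"
        results.append({"input": name, "matched": match, "status": status})
    return results


# Toy reference list and placeholder inputs, for illustration only.
service = StubNameService(["Mytilus edulis"])
report = validate_names(
    ["Mytilus edulis", "mytilus EDULIS", "Unknownus nameus"], service
)
```

Keeping the lookup behind a small class means the same `validate_names` function could be reused with a real web service wrapper, or called from an actor, without changing the validation logic.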