Building: Grand Hotel Mediterraneo
Room: Sala dei Continenti
Date: 2013-10-29 02:36 PM – 02:54 PM
Last modified: 2013-10-05
Abstract
A scientific workflow describes a process for accomplishing a scientific objective, usually in terms of tasks (implemented by software components, or actors) and their dataflow dependencies. Scientific workflows have become an increasingly popular paradigm and tool in many eSciences and can be used to improve various computation-intensive and data-intensive processes in biodiversity informatics.
The FilteredPush project [1] is developing automatic and semi-automatic data curation pipelines and integrating them into a broader community-annotation infrastructure [2]. These data curation workflows employ a combination of local and remote services, and even human curators "in the workflow loop," to check, assess, and improve the quality of specimen-collection datasets.
In addition to workflow automation and the possible scaling up of pipeline execution using parallel environments, it is the capability to automatically capture provenance information during workflow runs that makes a scientific workflow approach so promising in biodiversity informatics and data curation. Provenance is a critically important form of metadata used to assess data quality, to support result interpretation and validation, and to provide transparency and reproducibility of results. Consequently, a number of systems, including Kepler, Taverna, and VisTrails, support provenance capture. Various groups are also developing provenance technologies to efficiently store, query, analyze, and visualize provenance information. However, by design, neither the original Open Provenance Model (OPM) nor its W3C successor PROV provides a standardized way to enrich trace-level provenance information with higher-level information from the workflow specifications themselves.
The DataONE Working Group on Provenance in Scientific Workflows is developing just such an extension to the W3C PROV standard [3]. The extension takes workflow-specific elements into account and thus allows users of provenance not only to query low-level workflow traces, but also to link these to higher-level conceptual information, represented by the workflow description and possibly further annotated with concepts from a community ontology. Users can then express more intuitive and more expressive queries that combine all three information layers of provenance: execution traces, workflow specification, and community ontology.
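As a rough illustration of what a query across the three layers might look like, the following Python sketch uses plain dictionaries as stand-ins for a provenance store. It does not use the vocabulary of the PROV extension itself; the step names, record identifiers, ontology concepts, and the helpers `is_a` and `generated_by` are all hypothetical and chosen only for this example.

```python
# Hypothetical three-layer provenance store; every name here is illustrative
# and is NOT taken from the PROV extension or from any FilteredPush workflow.

# Layer 1: execution trace (which invocation of which step used/generated what).
TRACE = [
    {"invocation": "inv1", "step": "SciNameValidator",
     "used": "rec1", "generated": "rec1_v2"},
    {"invocation": "inv2", "step": "GeoRefValidator",
     "used": "rec1_v2", "generated": "rec1_v3"},
    {"invocation": "inv3", "step": "CsvWriter",
     "used": "rec1_v3", "generated": "out.csv"},
]

# Layer 2: workflow specification (each step annotated with an ontology concept).
WORKFLOW = {
    "SciNameValidator": "NameValidation",
    "GeoRefValidator": "GeoreferenceValidation",
    "CsvWriter": "Export",
}

# Layer 3: community ontology (concept -> parent concept).
ONTOLOGY = {
    "NameValidation": "Curation",
    "GeoreferenceValidation": "Curation",
    "Curation": "Activity",
    "Export": "Activity",
}


def is_a(concept, ancestor):
    """Return True if `concept` equals `ancestor` or is one of its descendants."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)  # walk up; None once past the root
    return False


def generated_by(ancestor):
    """List artifacts generated by any step whose annotation is a kind of `ancestor`."""
    return [
        t["generated"]
        for t in TRACE
        if is_a(WORKFLOW[t["step"]], ancestor)
    ]


if __name__ == "__main__":
    # "Which artifacts were produced by any curation step?"
    print(generated_by("Curation"))
```

The query names only the ontology concept `Curation`. The workflow specification maps that concept to concrete steps, and the trace maps those steps to the artifacts they generated, so a single question draws on all three layers without the user needing to know individual step names.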
We will present an overview of our PROV extension for scientific workflows and discuss example queries from data curation scenarios. We also solicit community input on which forms of provenance are most critical for different use cases in the biodiversity informatics community.
References
[1] Morris, P.J., E. Gilbert, J. Hanken, M. Kelly, S. Koehler, D. Lowery, B. Ludäscher, J.A. Macklin, R.A. Morris, T. Song. 2013. Expanding the scope of collections databases without schema modifications: Using annotations, rules, and semantic web technologies to add a open world layer to natural science collections data. SPNHC 2013. Program book and abstracts for the 28th annual meeting of the Society for the Preservation of Natural History Collections. p.33.
[2] Morris, P.J., J. A. Macklin, J. Hanken, M. Kelly, S. Koehler, D. Lowery, B. Ludäscher, R.A. Morris, T. Song. 2013. Improving Natural Science Collections data through quality control for research using Kepler workflows embedded in a FilteredPush network. SPNHC 2013. Program book and abstracts for the 28th annual meeting of the Society for the Preservation of Natural History Collections. p.33.
[3] Missier, P., Dey, S., Belhajjame, K., Cuevas-Vicenttín, V., Ludäscher, B. (2013, April). D-PROV: extending the PROV provenance model with workflow structure. In Proceedings of the 5th USENIX conference on Theory and Practice of Provenance (pp. 9-9). USENIX Association.