Missouri Botanical Garden Open Conference Systems, TDWG 2014 ANNUAL CONFERENCE

Analytical Workflows for Large Data
Christian Authmann, Christian Beilschmidt, Johannes Drönner, Michael Mattig, Bernhard Seeger

Building: Elmia Congress Centre, Jönköping
Room: Rum 10
Date: 2014-10-28 04:28 PM – 04:41 PM
Last modified: 2014-10-03


To support data-driven biodiversity research, scientists need fast and unrestricted access to data as well as the ability to correlate and aggregate arbitrary data sets. New insights are often gained from data via an explorative approach: a workflow is created and executed, its results are visualized, and the workflow is then kept, discarded, adjusted or extended and executed again. We call these workflows analytical to differentiate them from the common operative workflows, which refer to a well-specified scientific task. The highly interactive approach of analytical workflows is only viable when (intermediate) results are produced in a timely manner. This poses unique challenges when large data sets are involved, e.g. remote sensing data whose size can exceed hundreds of gigabytes. It is therefore crucial to minimize the data transfers among the processing operators of an analytical workflow and to enable parallelism whenever possible.

Within our visualization, aggregation and transformation system (VAT-system), which is an integral part of the GFBio and IDESSA projects, we provide a first approach to supporting analytical workflows on large biodiversity data sets, including occurrence, remote sensing and various trait data. Our system offers the following innovative techniques. First, it provides an approach for computing the results only for a small tile or a user-defined region of interest, including low-resolution previews. To this end, we introduce operator-specific query contexts into our system. Each operator computes its results for a specific context, e.g. an area of interest or a time interval. The query context is propagated through the workflow from the sink to the sources. Each operator may adjust the query context so that it receives exactly the data it requires from its sources, thus reducing the data flow as early as possible during execution. Adjusting the context is not always trivial, for example for operators that change the coordinate system.
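The propagation of query contexts from sink to sources can be illustrated with a minimal Python sketch. It is not the VAT-system's implementation: the names `QueryContext`, `Operator`, `SyntheticRaster` and `FocalMean3x3` are hypothetical, and the context is simplified to a pixel window on a shared grid. The focal operator enlarges the context by one pixel on each side, so the source delivers exactly the data the operator needs and nothing more.

```python
from dataclasses import dataclass
from typing import List

Grid = List[List[float]]


@dataclass(frozen=True)
class QueryContext:
    """Hypothetical context: a pixel window on a shared global grid."""
    x0: int
    y0: int
    width: int
    height: int


class Operator:
    """Base class: forwards an (optionally adjusted) context to its sources."""

    def __init__(self, *sources: "Operator") -> None:
        self.sources = sources

    def adjust_context(self, ctx: QueryContext) -> QueryContext:
        # By default the sources are queried with the unchanged context.
        return ctx

    def compute(self, ctx: QueryContext, inputs: List[Grid]) -> Grid:
        raise NotImplementedError

    def query(self, ctx: QueryContext) -> Grid:
        source_ctx = self.adjust_context(ctx)
        inputs = [source.query(source_ctx) for source in self.sources]
        return self.compute(ctx, inputs)


class SyntheticRaster(Operator):
    """Stand-in data source: the value at (x, y) is x + 10 * y."""

    def __init__(self) -> None:
        super().__init__()
        self.requests: List[QueryContext] = []  # records what was asked for

    def compute(self, ctx: QueryContext, inputs: List[Grid]) -> Grid:
        self.requests.append(ctx)
        return [[float(x + 10 * y)
                 for x in range(ctx.x0, ctx.x0 + ctx.width)]
                for y in range(ctx.y0, ctx.y0 + ctx.height)]


class FocalMean3x3(Operator):
    """3x3 mean filter; needs a one-pixel border around the requested window."""

    def adjust_context(self, ctx: QueryContext) -> QueryContext:
        return QueryContext(ctx.x0 - 1, ctx.y0 - 1, ctx.width + 2, ctx.height + 2)

    def compute(self, ctx: QueryContext, inputs: List[Grid]) -> Grid:
        grid = inputs[0]
        return [[sum(grid[row + dy][col + dx]
                     for dy in range(3) for dx in range(3)) / 9
                 for col in range(ctx.width)]
                for row in range(ctx.height)]


source = SyntheticRaster()
sink = FocalMean3x3(source)
result = sink.query(QueryContext(x0=5, y0=5, width=2, height=2))
```

In this sketch the sink is asked for a 2x2 window, and the source is queried once, for the enlarged 4x4 window. An operator that changes the coordinate system would have to reproject the region in `adjust_context` instead, which is the harder case mentioned above.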

Secondly, workflows are executed on demand. This allows very fast delivery of early or approximate answers. Users can then request more detailed answers or results for different areas of interest, which are computed lazily as required. Similar approaches have been used in other domains, such as database systems.
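On-demand execution can be sketched as follows, under the assumption that results are addressed by tiles at different zoom levels; the abstract does not specify this, and `LazyTiledResult` and `demo_workflow` are hypothetical names. A tile is computed only when it is first requested, so a coarse preview is available without computing any detailed tile.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[float]]
TileKey = Tuple[int, int, int]  # (zoom level, tile column, tile row)


class LazyTiledResult:
    """Computes workflow output tile by tile, only when a tile is requested."""

    def __init__(self, compute_tile: Callable[[int, int, int], Grid]) -> None:
        self._compute_tile = compute_tile
        self._cache: Dict[TileKey, Grid] = {}
        self.computed: List[TileKey] = []  # order in which tiles were computed

    def tile(self, zoom: int, col: int, row: int) -> Grid:
        key = (zoom, col, row)
        if key not in self._cache:
            # First request for this tile: run the workflow for it now.
            self._cache[key] = self._compute_tile(zoom, col, row)
            self.computed.append(key)
        return self._cache[key]


def demo_workflow(zoom: int, col: int, row: int, size: int = 4) -> Grid:
    """Hypothetical stand-in workflow: pixel value = global x + global y."""
    return [[float(col * size + x + row * size + y) for x in range(size)]
            for y in range(size)]


result = LazyTiledResult(demo_workflow)
preview = result.tile(0, 0, 0)   # coarse overview, delivered first
detail = result.tile(3, 5, 2)    # detailed tile, computed only on request
again = result.tile(3, 5, 2)     # served from the cache, not recomputed
```

Only the two tiles that were actually requested are ever computed; the repeated request returns the cached result.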

Thirdly, the VAT-system supports function-shipping: instead of transferring large data to a service that implements a function, we support shipping the function to the data. Custom algorithms can easily be created by uploading scripts (for example in R), which are executed directly on the server, close to the input data. This reduces network transfers of large data; externally located services are only invoked when absolutely necessary.
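The idea of function-shipping can be sketched in a few lines. The abstract mentions scripts in R; this sketch uses Python in their place, and the `DataServer` class, its methods and the convention that a script defines `run(values)` are all assumptions made for illustration. The script travels to the server, and only its small result travels back.

```python
from typing import Any, Callable, Dict, List


class DataServer:
    """Hypothetical server that holds large data and runs uploaded scripts."""

    def __init__(self, datasets: Dict[str, List[float]]) -> None:
        self._datasets = datasets
        self._scripts: Dict[str, Callable[[List[float]], Any]] = {}

    def upload_script(self, name: str, source: str) -> None:
        # Assumed convention: the script defines a function run(values).
        # A real system would have to sandbox this; exec() here does not.
        namespace: Dict[str, Any] = {}
        exec(source, namespace)
        run = namespace.get("run")
        if not callable(run):
            raise ValueError("script must define a function run(values)")
        self._scripts[name] = run

    def execute(self, script: str, dataset: str) -> Any:
        # The data never leaves the server; only the result is returned.
        return self._scripts[script](self._datasets[dataset])


server = DataServer({"elevation": [float(v) for v in range(1, 1001)]})
server.upload_script(
    "mean",
    "def run(values):\n    return sum(values) / len(values)\n",
)
summary = server.execute("mean", "elevation")
```

Here a script of a few dozen bytes is shipped to the data, and a single number is returned instead of the full data set.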

The workflow component of the VAT-system offers a service-oriented interface: the execution of a registered workflow can be triggered via a web service. In addition, operative workflow systems might outsource data-intensive analytical processing tasks to our VAT-system; the results can then be included in the larger workflow using standard protocols.

The VAT-system uses parallel computing (GPU- and CPU-based) to further reduce response times. Currently, the VAT-system can process many common workflows on large data in near real-time. Future work will focus on improving performance even further.