Missouri Botanical Garden Open Conference Systems, TDWG 2014 ANNUAL CONFERENCE

Workflow Support for Continuous Data Quality Control in a FilteredPush Network
Bertram Ludaescher, James Hanken, David Lowery, James Macklin, Paul J Morris, Robert Morris, Tianhong Song

Room: Rum 10
Date: 2014-10-28 04:41 PM – 04:54 PM
Last modified: 2014-10-03


In supporting the Southwest Arthropod Collections Network (SCAN) Thematic Collections Network (TCN), the FilteredPush project has implemented a data quality workflow that runs on the Akka framework and provides data quality reports back to collections participating in the network. In this workflow, three actors perform data quality checks on scientific name, georeference, and date collected. Inputs are provided through an actor that loads DarwinCore records in JSON or CSV form, either from the file system or from a pool of harvested records in MongoDB. Outputs are provided as JSON, written either into a MongoDB collection or to the file system, with a utility to convert the output to a multi-sheet spreadsheet with data quality annotations. The output includes a quality-controlled copy of the original record and, for each quality control actor, the assertions made by that actor concerning the record data and provenance.
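The shape of this workflow (load records, pass each through a series of actors, emit a curated copy plus per-actor assertions) can be illustrated with a minimal Python sketch. This is not the FilteredPush or Akka implementation: the function names, the output keys (`original`, `curated`, `assertions`), and the placeholder actor are assumptions made for illustration, and MongoDB input and output are omitted.

```python
import csv
import io
import json


def load_records(text, fmt):
    """Parse Darwin Core records from CSV or JSON text into a list of dicts."""
    if fmt == "csv":
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    if fmt == "json":
        return json.loads(text)
    raise ValueError(f"unsupported format: {fmt}")


def run_workflow(records, actors):
    """Apply each actor to each record and collect per-actor assertions.

    Each actor is a (name, function) pair. The function receives a working
    copy of the record, may modify it in place to record a correction, and
    returns a list of assertion strings.
    """
    results = []
    for original in records:
        curated = dict(original)  # quality-controlled copy; original is kept
        assertions = {}
        for name, actor in actors:
            assertions[name] = actor(curated)
        results.append(
            {"original": original, "curated": curated, "assertions": assertions}
        )
    return results


def scientific_name_actor(record):
    """Hypothetical placeholder actor: trims whitespace in scientificName."""
    name = record.get("scientificName", "")
    if name != name.strip():
        record["scientificName"] = name.strip()
        return ["scientificName: removed surrounding whitespace"]
    return ["scientificName: no change"]


csv_text = "scientificName,eventDate\n Apis mellifera ,1998-05-12\n"
report = run_workflow(
    load_records(csv_text, "csv"),
    [("scientificName", scientific_name_actor)],
)
print(json.dumps(report, indent=2))
```

Keeping the original record next to the curated copy means a collection receiving the report can see exactly what each actor changed and why.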

The actors follow a common pattern of checks on the data. First, checks are performed to determine whether data have the expected format: e.g., for a field expected to contain a date, whether or not a valid date (or a value parsable as a valid date) is present. Second, checks on internal consistency among the fields relevant to the actor are executed. For example, does a georeference assert a coordinate within the country boundary for the given country name? Similarly, a check is made whether a scientific name in a single field is consistent with other fields that hold parts of that name (e.g., the genus name may occur both in the scientific name field and in its own field). Third, consistency checks can also be performed against remote services to make sure that, e.g., collection dates are plausible in view of biographical information about the collectors.
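The three-stage pattern can be sketched for a collection date check as follows. This is an illustrative sketch only, not the project's code: the function name is hypothetical, the Darwin Core terms `eventDate`, `year`, and `recordedBy` are used as record keys, and the `collector_lifespans` dictionary is an assumed stand-in for a remote biographical service.

```python
from datetime import date


def check_event_date(record, collector_lifespans):
    """Return assertions about eventDate, following the three-stage pattern."""
    assertions = []
    raw = record.get("eventDate", "")

    # Stage 1: format. Is the value parsable as an ISO 8601 calendar date?
    try:
        collected = date.fromisoformat(raw)
    except ValueError:
        return [f"eventDate '{raw}' is not a valid ISO date"]
    assertions.append("eventDate has a valid format")

    # Stage 2: internal consistency with the separate year field, if present.
    year = str(record.get("year", ""))
    if year.isdigit() and int(year) != collected.year:
        assertions.append(
            f"eventDate year {collected.year} conflicts with year field {year}"
        )

    # Stage 3: plausibility against external knowledge. The lookup table
    # stands in for a remote service (an assumption of this sketch).
    lifespan = collector_lifespans.get(record.get("recordedBy"))
    if lifespan is not None:
        born, died = lifespan
        if not born <= collected.year <= died:
            assertions.append(
                f"eventDate year {collected.year} is outside the collector's "
                f"lifespan {born}-{died}"
            )
    return assertions


# Made-up example record and lifespan data.
example = {"eventDate": "1998-05-12", "year": "1997", "recordedBy": "A. Collector"}
for line in check_event_date(example, {"A. Collector": (1900, 1980)}):
    print(line)
```

The stages are ordered so that the cheap local checks run first and the later ones are skipped when the value cannot be parsed at all.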

The FilteredPush project originally assessed data quality in FilteredPush networks using Kepler workflows with actors from the Kepler Kuration package. These actors employ a workflow director (COMAD) designed for workflow pipelines over structured data collections. We faced two difficulties in deploying Kepler to support automated quality control of harvested data: first, running Kepler headless, without its graphical user interface, and second, a low rate of throughput. We were able, though with noticeable overhead, to run Kepler headless as a service within a FilteredPush node. We then explored the use of the Akka platform as a lower-level, higher-throughput workflow system. To this end, we refactored the logic of the data quality control actors into a FilteredPush library, which can be used by both Kepler Kuration and Akka actors. Initial benchmarking experiments on the Akka workflow suggest a local optimum in our deployment at about five parallel threads for each of the actors in the workflow.
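A benchmark of this kind, varying the number of parallel workers and timing the same batch of records, can be sketched in Python as follows. This does not reproduce the experiments described above: the simulated check, its 10 ms latency, the batch size, and the worker counts are all assumptions. Because the stand-in check only sleeps, the sketch will not necessarily show an optimum near five workers; where the optimum falls depends on the real checks and the deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_check(record):
    """Stand-in for a quality check that waits on a remote service."""
    time.sleep(0.01)  # hypothetical service latency
    return record


def time_with_workers(records, workers):
    """Run the check over all records with a fixed-size thread pool."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(simulated_check, records))
    return time.perf_counter() - start, results


records = list(range(40))  # placeholder records
timings = {}
for workers in (1, 2, 5, 10):
    elapsed, results = time_with_workers(records, workers)
    timings[workers] = elapsed
    print(f"{workers:>2} workers: {elapsed:.3f} s")
```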