Building: Windsor Hotel
Room: Oak Room
Date: 2015-09-30 02:00 PM – 02:15 PM
Last modified: 2015-08-29
Abstract
In the FilteredPush (FP) and Kurator projects we have built tools for quality control of biodiversity data. One of these, FP-Akka, is derived from earlier work on the Kepler Kuration package, in which, to run data-curation workflows within a FilteredPush node infrastructure, code was refactored into an external service wrapper layer, a data validation logic layer, and a workflow layer that composes elements of the logic layer into actors in a record-centric workflow (sketched below). The service wrapper layer and data validation layer are packaged in an FP-KurationServices library, which can be composed with Kepler Kuration workflows or with workflows written in the Akka parallelization framework. In developing and maintaining FP-Akka, we encountered multiple challenges arising from the interplay between external services and workflow components: discovery of pertinent services, technical documentation and integration of services, documentation of the domain-specific assumptions made by the services, the wide variety of technologies used by service providers, and maintenance of our code base in the face of changing services.

To find services pertinent to the data-quality needs of the science goals of US Thematic Collections Networks (TCNs), principally quality control of scientific names, georeferences, and collecting event dates, we have looked, ad hoc, to the usual suspects for quality data in the relevant domains and have done some discovery using the Biodiversity Catalogue service registry. We also wrote a service of our own (for name and date data concerning entomologists) in a case where a shallow search for pertinent services returned no results.

Service documentation at the technical level has ranged from Web Services Description Language (WSDL) files from which we could generate code, to example response documents, to none at all; in the last case, we simply coded to the observed responses of the service. Much more difficult has been documentation of domain concepts: the information needed to tell what assumptions the service provider makes about queries sent to the service, and what assumptions are embedded in the responses. In a simple case, such as a service that provides information related to scientific names, are the responses making nomenclatural assertions, taxonomic assertions, or a mixture of both? Also relevant to understanding service use are how clean and authoritative the dataset behind the service is and, where quality varies within a dataset, whether row-level assertions about data quality are present.

Biodiversity-related service implementations use a wide variety of exchange technologies, in effect requiring consumers to do something different for each service they interact with; as a result, each component in our service-wrapper layer is wholly different from the others. Differing technologies and differing domain assumptions made by similar services have together made it difficult for us to cleanly write a layer containing our validation logic and compose it with a layer that abstracts the services. Instead, we have had to bring some of the logic dealing with the differing assumptions of different services down into the service-wrapper layer.

To maintain the code, we have responded to both documented and silent changes to service Application Programming Interfaces (APIs). In order to detect API changes that would cause our code to fail, we have unit tests that query the live services.
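A rough illustration of such a live-service test follows; the class name, endpoint URL, and expected response field are hypothetical placeholders, not the actual FP-Akka test code.

    // Hypothetical live-service test; the URL, class name, and expected field
    // are placeholders, not the actual FP-Akka test code.
    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.junit.Test;

    public class NameServiceLiveTest {

        @Test
        public void lookupReturnsExpectedFields() throws Exception {
            // Placeholder endpoint standing in for a real name service.
            URL url = new URL("http://example.org/nameservice/lookup?name=Quercus+alba");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);

            // Fails on an API change, but also on a transient outage.
            assertEquals("Service did not return HTTP 200",
                    HttpURLConnection.HTTP_OK, conn.getResponseCode());

            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }
            // Assert only on the fields our wrapper layer actually consumes,
            // so the test detects breaking API changes rather than cosmetic ones.
            assertTrue("Expected field missing from response",
                    body.toString().contains("scientificName"));
        }
    }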
This has created challenges for our development framework: when a service has a transient outage, builds that run these tests fail (and, in the case of automated build systems, tend not to leave clear traces of the cause of failure). Our single largest challenge has not been technological but social: understanding the domain-specific assumptions of biodiversity data service providers.
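For concreteness, a minimal, hypothetical Java sketch of the three-layer separation described above is given below; the class and method names are invented for illustration and are not the actual FP-KurationServices API.

    // Hypothetical sketch of the three-layer separation; class and method
    // names are invented and do not reflect the actual FP-KurationServices API.
    import akka.actor.UntypedActor;

    // Service wrapper layer: hides one provider's exchange technology.
    interface NameServiceWrapper {
        // Returns null when the provider has no match for the name.
        NameLookupResult lookup(String scientificName);
    }

    class NameLookupResult {
        final String matchedName;
        // Whether this flag reflects a nomenclatural or a taxonomic assertion
        // is exactly the kind of provider assumption discussed above.
        final boolean accepted;
        NameLookupResult(String matchedName, boolean accepted) {
            this.matchedName = matchedName;
            this.accepted = accepted;
        }
    }

    // Data validation logic layer: a provider-independent quality-control rule.
    class ScientificNameValidator {
        private final NameServiceWrapper service;
        ScientificNameValidator(NameServiceWrapper service) {
            this.service = service;
        }
        String validate(String scientificName) {
            NameLookupResult result = service.lookup(scientificName);
            if (result == null) {
                return "UNABLE_TO_VALIDATE";
            }
            if (!result.matchedName.equals(scientificName)) {
                return "CURATED:" + result.matchedName;
            }
            return result.accepted ? "VALID" : "VALID_BUT_UNACCEPTED";
        }
    }

    // Workflow layer: an Akka actor composing the validator into a
    // record-centric workflow (Akka 2.x classic Java API).
    class ScientificNameValidationActor extends UntypedActor {
        private final ScientificNameValidator validator;
        ScientificNameValidationActor(NameServiceWrapper wrapper) {
            this.validator = new ScientificNameValidator(wrapper);
        }
        @Override
        public void onReceive(Object message) {
            if (message instanceof String) {
                getSender().tell(validator.validate((String) message), getSelf());
            } else {
                unhandled(message);
            }
        }
    }

In principle, only the wrapper should need to change when a provider changes its exchange technology; in practice, as described above, differing provider assumptions have forced some of this logic down into the service-wrapper layer.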