Missouri Botanical Garden Open Conference Systems, TDWG 2011 Annual Conference

Cooking Class: Boiling Down ABCD to DarwinCore Archives
Joerg Holetschek, Johan Dunfalk, Anton Guentsch, Walter G. Berendsohn

Last modified: 2011-10-12

Abstract


International biodiversity data networks link the vast amounts of primary data provided by natural history specimen collections and holders of species occurrence databases worldwide. They promote the idea of free and open data access and offer portals for browsing this unified data pool, as well as web services for the selective retrieval of data subsets by machines. Examples are the Global Biodiversity Information Facility (GBIF, www.gbif.org) and the Biological Collection Access Service (BioCASe, www.biocase.org).

In the past, data flow from provider to portal in this domain was based on web services: the data provider mapped the information to be published to one of the existing XML data standards, either the complex and hierarchical ABCD (Access to Biological Collection Data) schema or the rather simple and flat DarwinCore schema. Once expressed as ABCD or DarwinCore, the data was exposed as a web service that could be queried with one of three existing protocols, namely DiGIR (Distributed Generic Information Retrieval), TAPIR (TDWG Access Protocol for Information Retrieval) or BioCASe. Harvesters crawled through the published datasets by querying these web services repeatedly and storing data items of interest in so-called index databases, which then served as the basis for data portals. Users were also able to go back to the individual provider and retrieve the full data records.
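The repeated-query harvesting described above can be sketched as a simple paging loop. This is an illustrative Python sketch, not any of the actual DiGIR/TAPIR/BioCASe harvesters; `fetch_page` is a hypothetical stand-in for an HTTP request to the provider's web service, and the page size is arbitrary.

```python
# Hypothetical paged retrieval: the harvester asks the provider's web
# service for records chunk by chunk until no more records come back,
# mimicking the repeated querying done by DiGIR/TAPIR/BioCASe harvesters.
def fetch_page(start, limit):
    # Stand-in for a real HTTP request to the provider; here we just
    # slice a fixed list of 100 dummy record identifiers.
    data = [f"record-{i}" for i in range(100)]
    return data[start:start + limit]

def harvest(page_size=30):
    index = []                    # the harvester's local "index database"
    start = 0
    while True:
        page = fetch_page(start, page_size)
        if not page:              # empty page: dataset exhausted
            break
        index.extend(page)
        start += page_size
    return index

records = harvest()
```

Each iteration is a full request/response round trip, which is exactly why this approach scales poorly for large datasets.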

This harvesting strategy has two disadvantages: first, it requires a constantly running web server, which can be problematic for small data providers. Second, harvesting large datasets by querying a web service repeatedly is obviously inefficient. To overcome this, GBIF introduced the dumping of whole datasets into text files, which are zipped together with a metadata file and a descriptor into a so-called DarwinCore-Archive. Once placed on a web server or mailed to the harvester, such an archive can be ingested much faster. Even though GBIF still supports data publication through traditional web services, it strongly encourages the use of DarwinCore-Archives for new providers and supports this with the new Integrated Publishing Toolkit. Individual providers using this, however, cannot be queried directly anymore.
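The archive layout described above can be sketched in a few lines. This is a minimal, illustrative example assuming a single occurrence core file with three columns; the record values, file name `dataset.dwca.zip`, and the reduced set of DarwinCore terms are made up for the demonstration, and a real archive would also carry an EML metadata file.

```python
import zipfile

# Illustrative specimen records: catalogue number, name, country.
rows = [
    ("B 10 0000001", "Abies alba", "Germany"),
    ("B 10 0000002", "Picea abies", "Austria"),
]

# Core data file: plain text, one tab-separated record per line.
occurrence = "\n".join("\t".join(r) for r in rows) + "\n"

# The descriptor (meta.xml) declares which DarwinCore term each
# column of the data file carries.
meta = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/country"/>
  </core>
</archive>
"""

# Zip data file and descriptor together into the archive.
with zipfile.ZipFile("dataset.dwca.zip", "w") as z:
    z.writestr("occurrence.txt", occurrence)
    z.writestr("meta.xml", meta)
```

Because the whole dataset travels as one static file, a consumer can download and ingest it in a single pass instead of thousands of web-service round trips.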

Many data providers use the BioCASe Provider Software in conjunction with ABCD to publish their data to GBIF and other networks. BioCASe adapts to the latest trends: the Provider Software now allows dumping whole datasets into single archive files in the same formats supported by the respective BioCASe web service, e.g. ABCD. This dump can be downloaded directly from the web server or emailed to the harvester of a specialised interest network that requires a rich or specialised data schema such as ABCD or one of its extensions. For DarwinCore-Archive consumers such as GBIF, the ABCD dump can be transformed into DarwinCore-Archive files in a subsequent step. The continuing availability of the BioCASe web service maintains the ability to query the provider directly.
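The subsequent transformation step boils the hierarchical ABCD XML down to flat DarwinCore fields. The following is a simplified sketch, not the BioCASe implementation: the ABCD fragment is heavily trimmed, and the path-to-term mapping covers only two illustrative fields out of the hundreds ABCD 2.06 defines.

```python
import xml.etree.ElementTree as ET

# A trimmed ABCD 2.06 fragment with a single specimen unit.
ABCD = """<DataSets xmlns="http://www.tdwg.org/schemas/abcd/2.06">
  <DataSet><Units>
    <Unit>
      <UnitID>B 10 0000001</UnitID>
      <Identifications><Identification><Result><TaxonIdentified>
        <ScientificName>
          <FullScientificNameString>Abies alba Mill.</FullScientificNameString>
        </ScientificName>
      </TaxonIdentified></Result></Identification></Identifications>
    </Unit>
  </Units></DataSet>
</DataSets>"""

NS = {"abcd": "http://www.tdwg.org/schemas/abcd/2.06"}

# Map deep ABCD element paths onto flat DarwinCore terms
# (an illustrative subset of the full mapping).
MAPPING = {
    "catalogNumber": "abcd:DataSet/abcd:Units/abcd:Unit/abcd:UnitID",
    "scientificName": ("abcd:DataSet/abcd:Units/abcd:Unit/abcd:Identifications/"
                       "abcd:Identification/abcd:Result/abcd:TaxonIdentified/"
                       "abcd:ScientificName/abcd:FullScientificNameString"),
}

root = ET.fromstring(ABCD)
# One flat DarwinCore record, ready to be written as a row of the archive.
record = {term: root.findtext(path, namespaces=NS)
          for term, path in MAPPING.items()}
```

The flattening is inherently lossy: everything in ABCD without a DarwinCore counterpart is simply dropped, which is why the rich ABCD dump remains valuable for specialised networks.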

The computer demo will show these new features: how a dataset published with BioCASe can be exported into a rich ABCD dump, and how this dump can then be transformed into a slim DarwinCore-Archive.