Missouri Botanical Garden Open Conference Systems, TDWG 2011 Annual Conference

SYNTHESYS II: Updating the BioCASe Technology Suite
Joerg Holetschek, Johan Dunfalk, Anton Guentsch, Walter G. Berendsohn

Last modified: 2011-10-12

Abstract


Natural history collections, observational databases, and living collections throughout the world form a unique archive of biodiversity. They are the subject of research and preserve the expertise of innumerable biologists, past and present.

International networks and initiatives such as the Biological Collection Access Service (www.biocase.org) and the Global Biodiversity Information Facility (www.gbif.org) share a vision of free and open access to the world's primary biodiversity data, linking together natural history and species occurrence data from a large number of databases worldwide. They offer data portals for searching and browsing this unified data pool and web services that allow for a selective retrieval of data subsets.

BioCASe Technology refers to a suite of standards and software packages developed over the past 10 years, namely: a comprehensive data schema used for storing biodiversity data to be sent across the Internet (Access to Biological Collection Data, ABCD); the BioCASe protocol for sending queries and responses between network components; the BioCASe Provider Software for publishing natural history and occurrence data to biodiversity networks using these two standards; and the BioCASe data portal and SYNTHESYS cache generator system for setting up regional, taxonomic, or discipline-specific special interest networks. Auxiliary software was developed for finding duplicate records, creating virtual specimen annotations, and integrating taxonomic or geographic thesauri into data portals.

Recently, harvesting strategies in biodiversity networks have undergone a paradigm shift. In the past, crawlers paged through datasets by querying the provider’s web services repeatedly, which is inefficient for large datasets. To overcome this, the Darwin Core Archive (DwCA) was introduced: a whole dataset is dumped into several text files, which are zipped together with a metadata file and a descriptor. Once mailed to the harvester or downloaded from a server, such an archive can be ingested much faster than the numerous packages that would be retrieved from a web service. Even though GBIF still supports data publication through traditional web services, it strongly encourages the use of DwCA for new providers and supports this with the new Integrated Publishing Toolkit (IPT).
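The difference between the two harvesting strategies can be illustrated with a minimal Python sketch. It is not taken from the BioCASe or GBIF software: the record fields, the function names, and the file name `occurrence.txt` are assumptions made for illustration, the web service is replaced by a local stand-in function, and the descriptor and metadata files are empty placeholders rather than valid DwCA documents.

```python
import csv
import io
import zipfile

# Hypothetical dataset standing in for a provider's database.
RECORDS = [{"id": str(i), "scientificName": f"Species {i}"} for i in range(10)]


def fetch_page(start, limit):
    """Stand-in for one web service request (a real crawler would send a query here)."""
    return RECORDS[start:start + limit]


def harvest_by_paging(page_size):
    """Page through the dataset and count the requests that were needed."""
    records, requests, start = [], 0, 0
    while True:
        page = fetch_page(start, page_size)
        requests += 1
        records.extend(page)
        if len(page) < page_size:  # a short page marks the end of the dataset
            break
        start += page_size
    return records, requests


def build_archive(records):
    """Dump the whole dataset into one zipped archive held in memory."""
    data = io.StringIO()
    writer = csv.DictWriter(data, fieldnames=["id", "scientificName"],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", data.getvalue())  # assumed data file name
        archive.writestr("meta.xml", "<archive/>")  # placeholder descriptor
        archive.writestr("eml.xml", "<eml/>")  # placeholder metadata
    return buffer.getvalue()


def harvest_from_archive(archive_bytes):
    """Read every record from the archive in a single pass."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        with archive.open("occurrence.txt") as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            return list(csv.DictReader(text, delimiter="\t"))
```

In this sketch the paged harvest needs one request per page, so the number of round trips grows with the size of the dataset, whereas the archive is transferred once and read sequentially.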

BioCASe is adapting to these trends. On the one hand, DwCA providers do not support the retrieval of individual records, which the current BioCASe data portal expects, so the portal software must be extended to handle DwCA providers. On the other hand, the BioCASe Provider Software should support the concept of dumping whole datasets into archives.

In order to preserve the richness of data that can be published with BioCASe and ABCD, this archiving feature will be implemented in two steps. First, the Provider Software will be complemented with a function that dumps whole datasets into single archive files in the same formats supported by the BioCASe web service, e.g. ABCD. A subsequent step will transform these into DwCA files, which can be used by DwCA consumers such as GBIF. Special interest networks that require richer data can use the ABCD dump files, thus combining the abundance of standardised data items that ABCD and its extensions offer with the harvesting efficiency of dataset archives.
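The two-step idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the element names `DataSet`, `Unit`, and `Identification` are simplified stand-ins and not the actual ABCD schema, the sample records are invented, and keeping only the last identification in the flat output is an arbitrary choice made to show where detail is lost.

```python
import xml.etree.ElementTree as ET

# Hypothetical units; each may carry several identifications.
UNITS = [
    {"id": "B-001", "identifications": ["Abies alba", "Abies alba Mill."]},
    {"id": "B-002", "identifications": ["Picea abies"]},
]


def dump_rich(units):
    """Step 1: dump the whole dataset into one nested XML document."""
    root = ET.Element("DataSet")
    for unit in units:
        node = ET.SubElement(root, "Unit", id=unit["id"])
        for name in unit["identifications"]:
            ET.SubElement(node, "Identification").text = name
    return ET.tostring(root, encoding="unicode")


def flatten(xml_text):
    """Step 2: derive flat rows, one per unit, keeping only the last identification."""
    rows = []
    for node in ET.fromstring(xml_text).iter("Unit"):
        names = [ident.text for ident in node.findall("Identification")]
        rows.append({"id": node.get("id"), "scientificName": names[-1]})
    return rows
```

A consumer that needs the full identification history would read the output of the first step, while a consumer of flat files would use the output of the second.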