Missouri Botanical Garden Open Conference Systems, TDWG 2013 ANNUAL CONFERENCE

iDigBio Data Integration with Self-Referential Schemas
Sarfaraz Ahmed Soomro, Andrea Matsunaga, José A. B. Fortes

Building: Grand Hotel Mediterraneo
Room: Africa (formerly America del Sud)
Date: 2013-10-29 04:00 PM – 04:15 PM
Last modified: 2013-10-08

Abstract


For decades, and in some cases centuries, museums and other institutions have collected, catalogued, curated, digitized, and stored biodiversity collections data using a variety of storage media, data formats, levels of granularity, data structures, vocabularies, protocols, information systems, and infrastructures. A challenge faced by the biodiversity community is to consolidate data from these disparate sources so that they conform to one target standard or data schema without losing information. Such a consolidated data store makes it easier for researchers to pose a wide range of scientific queries that would otherwise require significant effort in gathering and transforming data from multiple sources.

The branch of computer science that deals with combining data from several disparate sources into a mediated schema is called data integration. It addresses issues including, but not limited to, structural differences between the source and target schemas, semantic anomalies in their vocabularies, and data inconsistency. While multiple standards for exchanging biodiversity information have been developed to serve as mediated schemas (e.g., the various versions of Darwin Core, ABCD, and the Ecological Metadata Language, EML), the data management systems used by museums often store occurrence data in a structure that is customized to the user’s needs and not entirely compliant with those standards. This creates a need to Extract, Transform, and Load (ETL) data from one schema into another.

A previously identified recurring class of data transformations, also observed by iDigBio while combining data from natural history museums across the United States, is one where either the source or the target is a ranked self-referential schema. This type of schema is well suited to representing entities that are hierarchical in nature, such as geographical locations, organism taxonomy, geological time scales, and organizational trees.
Our research focuses on automating the conversion of these hierarchical representations into flat representations, or vice versa, given minimal information about the hierarchy in the schema. This paper illustrates the most common use cases, along with the key challenges that make the conversion a long and cumbersome process. We also present a prototype solution consisting of a new component developed for an existing ETL tool, and demonstrate its qualities by converting to and from two popular relational database management systems for biological and paleontological collections, namely Specify and Symbiota.
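To make the two directions of conversion concrete, the following is a minimal Python sketch, not the prototype component described in the abstract. It assumes that a ranked self-referential schema stores each record with a link to a parent record in the same table plus a rank label, and that the flat form has one column per rank. The table layout, the `GEOGRAPHY` sample rows, and the `flatten` and `unflatten` helpers are all hypothetical and do not reflect the actual Specify or Symbiota schemas.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical ranked self-referential table: each row is
# (node_id, parent_id, rank, name); parent_id is None at the root.
Node = Tuple[int, Optional[int], str, str]

GEOGRAPHY: List[Node] = [
    (1, None, "continent", "North America"),
    (2, 1, "country", "United States"),
    (3, 2, "state", "Florida"),
    (4, 3, "county", "Alachua"),
    (5, 2, "state", "Missouri"),
]


def flatten(nodes: List[Node], node_id: int) -> Dict[str, str]:
    """Walk parent links from node_id to the root, emitting one column per rank."""
    by_id = {n[0]: n for n in nodes}
    row: Dict[str, str] = {}
    seen = set()
    current: Optional[int] = node_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle in parent links")
        seen.add(current)
        _, parent_id, rank, name = by_id[current]
        row[rank] = name
        current = parent_id
    return row


def unflatten(rows: List[Dict[str, str]], ranks: List[str]) -> List[Node]:
    """Rebuild parent-linked nodes from flat rows, given ranks ordered root-first.

    Assumption: no row skips a rank in the middle of its path.
    """
    nodes: List[Node] = []
    ids: Dict[Tuple[str, ...], int] = {}  # path of names from the root -> node id
    for row in rows:
        path: Tuple[str, ...] = ()
        parent: Optional[int] = None
        for rank in ranks:
            if rank not in row:
                break  # this row ends at a higher rank
            path += (row[rank],)
            if path not in ids:
                ids[path] = len(nodes) + 1
                nodes.append((ids[path], parent, rank, row[rank]))
            parent = ids[path]
    return nodes
```

Calling `flatten(GEOGRAPHY, 4)` yields a flat record with `continent`, `country`, `state`, and `county` columns, and passing such records to `unflatten` with the ranks listed root-first rebuilds the parent-linked rows. Keying nodes by their full path of names, rather than by name alone, is a deliberate choice in this sketch so that identically named places under different parents are not merged. Real collections data raise further issues, such as skipped ranks and inconsistent vocabularies, that this sketch does not handle.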