Missouri Botanical Garden Open Conference Systems, TDWG 2011 Annual Conference

Implementation of a metadata catalogue system for the GBIF network
Tim Robertson, Eamonn O Tuama, Federico Mendez

Last modified: 2011-10-12

Abstract


Large distributed networks such as GBIF's bring together many publishers and consumers of data. To guide consumers in discovering data that is fit for purpose, all datasets should be accompanied, when published, by metadata that describe critical aspects of the data such as sampling procedures and methods, data quality, provenance, ownership, data format, access, and intellectual property rights. Once generated, metadata are typically stored in online catalogues (databases) that can be browsed and searched. Metadata are thus a central component of an expanding GBIF network, in which discovery of and access to data and services must be well coordinated through registries, metadata catalogues, and generated indexes. To meet this need, GBIF has implemented a metadata system for its network that provides unified access to all participating catalogues. In designing the system, GBIF chose to combine several lightweight components rather than adopt an existing turnkey solution such as Metacat [1] or GeoNetwork [2], because its requirements were more modest and deep integration with the GBIF data portal was paramount.

The principal components of the system are: i) a metadata registry/harvester, ii) a central catalogue holding copies of all metadata published on the network, iii) one or more participating external metadata catalogues, and iv) a set of protocols and data exchange standards that allow metadata to flow through the network. A key requirement was the ability to support the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [3], a widely used open protocol for harvesting across online metadata repositories, which is also supported by Metacat and GeoNetwork. The system should also not be restricted to one metadata format but should accept metadata in all popular standards, e.g., Ecological Metadata Language (EML), ISO 19115/19139, Dublin Core (DC), Directory Interchange Format (DIF) and Content Standard for Digital Geospatial Metadata (CSDGM). A system specification document is available [4].
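
As a rough illustration of the harvesting step, the sketch below builds OAI-PMH ListRecords requests and reads one response page. It is not the GBIF harvester: the function names, the example.org endpoint and the sample response are assumptions made for illustration. Only the protocol elements (the ListRecords verb, the metadataPrefix and resumptionToken arguments, and the OAI 2.0 namespace) come from the OAI-PMH specification [3].

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: it is sent with the verb only
        params = [("verb", "ListRecords"), ("resumptionToken", resumption_token)]
    else:
        params = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        header.findtext(OAI_NS + "identifier")
        for header in root.iter(OAI_NS + "header")
    ]
    token_el = root.find(OAI_NS + "ListRecords/" + OAI_NS + "resumptionToken")
    # An absent or empty token means the list is complete
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Invented sample response, trimmed to the elements the parser reads
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("http://example.org/oai")
ids, token = parse_page(SAMPLE_PAGE)
next_url = list_records_url("http://example.org/oai", resumption_token=token)
```

A harvester would repeat the request with each returned token until a page arrives without one. Requesting another format would only change the metadataPrefix value, which each repository defines for itself apart from the mandatory oai_dc.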

The metadata registry and harvester were implemented using a simple XML list of catalogue endpoints and the Java-based OAICat [5], which supports the OAI-PMH functionality. For its central catalogue, GBIF chose Apache Solr [6], an open-source enterprise search server based on open standards and offering many useful features such as indexing, text search, results highlighting, faceted navigation, query spell correction, auto-suggest queries and “more like this” for finding similar metadata documents. The interface to the catalogue follows an Ajax/Single Page Design [7]: the main functions are exposed through a small web application built with Ajax and jQuery, which improves the user experience, integrates easily with the Solr REST Application Programming Interface and minimizes the technological requirements for the production system. OAICat is also used to serve the aggregated metadata in the GBIF catalogue onwards to other aggregators, e.g., the EuroGEOSS broker [8].
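
To make the Solr features named above concrete, here is a minimal sketch of how a client might compose a search request with faceting and highlighting. The base URL and the field names (publisher, abstract) are hypothetical and do not describe the GBIF catalogue's schema. The parameters themselves (q, wt, rows, facet, facet.field, hl, hl.fl) are standard Solr query parameters.

```python
from urllib.parse import urlencode


def solr_search_url(solr_base, text, facet_fields=(), highlight_field=None, rows=10):
    """Compose a Solr select URL (hypothetical helper, not GBIF code)."""
    params = [("q", text), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        # Faceted navigation: ask for value counts on each listed field
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    if highlight_field:
        # Results highlighting: return matching snippets from this field
        params.append(("hl", "true"))
        params.append(("hl.fl", highlight_field))
    return solr_base.rstrip("/") + "/select?" + urlencode(params)


# Assumed local Solr instance and invented field names
url = solr_search_url(
    "http://localhost:8983/solr/metadata",
    "bird survey",
    facet_fields=["publisher"],
    highlight_field="abstract",
)
```

A single-page interface would issue such a request from the browser and render the JSON response, so that no server-side application layer is needed between the page and Solr.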

The system is now in its testing phase [9] and has so far proven to be flexible and robust. Work is in progress to support those GBIF Participants wishing to connect their metadata catalogues to the GBIF network.

URLs:  

[1] http://knb.ecoinformatics.org/software/metacat/

[2] http://geonetwork-opensource.org/

[3] http://www.openarchives.org/pmh/

[4] http://links.gbif.org/gbif_metadata_catalogue_specification.pdf

[5] http://www.oclc.org/research/activities/oaicat/default.htm

[6] http://lucene.apache.org/solr/

[7] http://msdn.microsoft.com/en-us/magazine/cc507641.aspx

[8] http://www.eurogeoss.eu/broker/

[9] http://metadata.gbif.org