Building: Grand Hotel Mediterraneo
Room: Sala dei Continenti
Date: 2013-10-30 03:08 PM – 03:17 PM
Last modified: 2013-10-07
Abstract
The digitization of museum specimens can make information already associated with a specimen accessible on the Internet, including through a semantic resource such as BiSciCol. Until recently, this label information could only be stored in a museum’s own database. With oversight by skilled workers, more general-purpose Optical Character Recognition (OCR) and parsing tools can convert label data to Darwin Core (DwC) formats to facilitate ingest into multiple systems, including BiSciCol. Unfortunately, since many fields are optional when implementing DwC, institutions do not always preserve important semantics of the information on museum labels. This makes import into a semantic resource such as BiSciCol difficult and less useful. For example, specimens may have a series of identifications over time. During digitization and export to DwC, some institutions add only the latest identification or (re)determination and do not record dwc:identifiedBy or dwc:dateIdentified, making instantiation of the Identification class difficult. This loss of information makes it difficult for the semantic store to associate the specimen with different publications and other resources, including other specimen records. In the AOCR Hackathon and the related digitization effort with iDigBio, we established a simple scheme in which columns represent DwC elements and rows represent independent labels on a specimen, including, for example, the primary label, redetermination labels, museum ownership labels, institution ownership labels, expedition/project labels and other labels. This format made it possible for project participants to readily edit files. However, it proved inadequate for representing items with multiple values, such as multiple collectors or multiple species on the same specimen. Such values ended up as strings with embedded “and”, commas, semicolons and other punctuation as they occur on labels, making machine ingest and semantic interpretation difficult.
This points to the need for tools that allow digitizers to populate semantically unambiguous representations without the loss of information.
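As a minimal sketch of what such an unambiguous representation might look like, the Python fragment below stores collectors as a list rather than a punctuation-delimited string and keeps every identification as its own record. The class names `Identification` and `Occurrence`, the `latest_identification` helper, the use of `urn:uuid:` identifiers, and all example values are assumptions made for illustration. They are not part of DwC, BiSciCol, or the scheme used in the hackathon; only the DwC term names in the comments come from the standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import uuid


def new_guid() -> str:
    """Return a fresh identifier (hypothetical choice: a random UUID URN)."""
    return f"urn:uuid:{uuid.uuid4()}"


@dataclass
class Identification:
    """One determination event for a specimen."""
    scientific_name: str                   # dwc:scientificName
    identified_by: Optional[str] = None    # dwc:identifiedBy
    date_identified: Optional[str] = None  # dwc:dateIdentified, as YYYY-MM-DD
    identification_id: str = field(default_factory=new_guid)  # dwc:identificationID


@dataclass
class Occurrence:
    """A specimen record that keeps multi-valued data as lists."""
    occurrence_id: str = field(default_factory=new_guid)       # dwc:occurrenceID
    recorded_by: List[str] = field(default_factory=list)       # one entry per collector
    identifications: List[Identification] = field(default_factory=list)

    def latest_identification(self) -> Optional[Identification]:
        """Return the most recent dated identification, if any.

        Full YYYY-MM-DD strings sort chronologically as plain text.
        """
        dated = [i for i in self.identifications if i.date_identified]
        return max(dated, key=lambda i: i.date_identified) if dated else None


# Illustrative values only; the names and dates are invented.
specimen = Occurrence(recorded_by=["Collector One", "Collector Two"])
specimen.identifications.append(
    Identification("Quercus sp.", "Determiner A", "1950-06-01"))
specimen.identifications.append(
    Identification("Quercus alba", "Determiner B", "1987-03-15"))
```

A flat DwC export can still be derived from this structure by taking `latest_identification()`, while the earlier determinations remain available to a semantic store.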
Services like BiSciCol need to allow repeating classes, such as the DwC Identification class, to account for the fact that different people may assign different names at different times to a specimen or other Occurrence. This necessitates the creation or use of an identification GUID to bind the identification facts in the knowledge base. The same repetition applies to many other classes, including Occurrence, since, for example, one lichen specimen may include three different species of lichen all attached to the same substrate (e.g. a Quercus alba branch). If a fuller semantic representation is preserved, existing semantic resources can be used to facilitate the digitization process itself. Curated “facts” in a semantic store can be used to resolve errors in the digitization process, resolve ambiguities, or flag potential inconsistencies in the label data. For example, OCR errors are common when digitizing collector names on labels. These can sometimes be corrected automatically using a database of names and approximate matching algorithms such as Levenshtein distance. However, this can yield multiple matches that require human intervention. An alternative approach could use a SPARQL query to identify occurrence data for specimens from the same expedition and greatly reduce the number of potential collector names, but only if we include the entire history of identification of a specimen or occurrence over time.
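The effect of narrowing the candidate list can be illustrated with a short Python sketch. It implements the standard dynamic-programming Levenshtein distance and compares matching an OCR-damaged name against a full name list versus a list restricted to one expedition. The function names, the `max_distance` threshold, and all collector names are hypothetical, and the expedition-restricted list is hard-coded here as a stand-in for the result of a SPARQL query, which is not shown.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_names(ocr_name: str, candidates: List[str],
                  max_distance: int = 2) -> List[str]:
    """Return all candidates tied for the smallest distance within the threshold."""
    scored = [(levenshtein(ocr_name.lower(), c.lower()), c) for c in candidates]
    scored = [(d, c) for d, c in scored if d <= max_distance]
    if not scored:
        return []
    best = min(d for d, _ in scored)
    return sorted(c for d, c in scored if d == best)


# Invented names for illustration only.
all_collectors = ["J. Smith", "J. Smyth", "L. Smith", "A. Jones"]
# Stand-in for collectors returned by a query on the same expedition.
expedition_collectors = ["J. Smyth", "A. Jones"]

ambiguous = closest_names("J. Smlth", all_collectors)
resolved = closest_names("J. Smlth", expedition_collectors)
print(ambiguous)  # two equally close names, so a person would have to choose
print(resolved)   # a single name once the candidates are restricted
```

Against the full list the OCR string is one edit away from two names, so matching alone cannot decide; against the expedition list only one candidate remains.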