Missouri Botanical Garden Open Conference Systems, TDWG 2014 ANNUAL CONFERENCE

Two approaches to finding botanical duplicate occurrences (and other relations among specimens) in the face of noisy data
Robert A Morris, James Hanken, David B. Lowery, Bertram Ludäscher, James A. Macklin, Chuck McCallum, Paul J. Morris, Tianhong Song

Building: Elmia Congress Centre, Jönköping
Room: Rum 11
Date: 2014-10-30 02:15 PM – 02:30 PM
Last modified: 2014-10-03


Botanists intentionally collect multiple specimens from the same organism and distribute them to several collections for curation. During or after the initial digitization of legacy paper occurrence metadata, the resulting digital records may diverge for a number of reasons. Some divergence results from differing curation velocities; e.g., one duplicate may receive a taxonomically current species assignment while another does not. Other divergence arises simply from human transcription error. Still other divergence reflects differing local curatorial practices and may not be regarded as an error at all; for example, one collection may record the collector’s name in a single field as first name followed by last name, whereas another may record the last name first. Similar issues arise with legacy event dates. These “non-erroneous” divergences, whether or not they involve duplicates, can reduce the fitness of occurrence data for many applications, such as analyzing changes in species’ geographic distributions over time.

We report two approaches explored by the FilteredPush project. The first is an ad hoc approach that encodes known common data errors and practices as text pattern matching and normalization; one example is phonetic matching, such as Soundex comparison. When the necessary data are available, this may be augmented by tests of semantic constraints: for example, a match may be deemed suspicious if the occurrence collection date falls before the birth date, or during the infancy, of the reported collector. Some of these tests may be augmented, or even replaced, simply by consulting authority lists, e.g., the Harvard University Herbaria and Libraries Index of Botanists (http://kiki.huh.harvard.edu/databasesbotanist_index.html). The purported duplicates are then presented in data entry or quality control interfaces for human or machine examination. This query-time mechanism for finding existing potential duplicate records is effective for improving the rate of transcription of legacy paper records in botanical collections. We will briefly demonstrate one human-facing example based on deployment in existing FilteredPush installations.
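
As a rough illustration of the kind of test the first approach relies on, the following Python sketch combines a standard American Soundex comparison of collector surnames with a simple date plausibility check. It is not the FilteredPush implementation. The helper names (`surname`, `collectors_match`, `plausible_collection_date`) and the minimum collector age of 10 years are assumptions made only for this example, and the birth year is assumed to come from an authority list.

```python
import unicodedata
from datetime import date

# Standard American Soundex letter groups.
_CODES = {}
for _digit, _letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                         "4": "L", "5": "MN", "6": "R"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    # Strip diacritics so that e.g. "Ludäscher" and "Ludascher" agree.
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    letters = [c for c in ascii_name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (code + "000")[:4]


def surname(collector):
    """Hypothetical normalizer: handles 'Last, First' and 'First Last'."""
    if "," in collector:
        return collector.split(",")[0].strip()
    parts = collector.split()
    return parts[-1] if parts else ""


def collectors_match(a, b):
    """True if the two collector strings have phonetically equal surnames."""
    code_a, code_b = soundex(surname(a)), soundex(surname(b))
    return code_a != "" and code_a == code_b


def plausible_collection_date(collected, birth_year, min_age=10):
    """Flag dates before birth or in infancy; min_age is an assumed cutoff."""
    return collected.year - birth_year >= min_age
```

A phonetic match such as `collectors_match("James A. Macklin", "Maklin, J.")` would then be passed on as a candidate duplicate, while a record failing `plausible_collection_date` would be marked as suspicious for review rather than rejected outright.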

The second approach is based on cluster analysis using the Mahout text-mining framework. It attempts to model Darwin Core and other data as vectors in a low-dimensional vector space. To date, this approach has come nowhere near the effectiveness of the ad hoc implementation: it yields so many false positives that it is not yet useful. We continue the investigation because, in principle, it accounts for the noise on its own and would be quickly deployable with other data schemas.
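
To make the vector-space idea concrete, the sketch below represents each record as a bag of character trigrams, compares records by cosine similarity, and groups records whose similarity exceeds a threshold. It is a plain Python illustration and does not use Mahout or reproduce the project's model. The choice of Darwin Core terms, the trigram representation, the threshold of 0.8, and the single-link grouping are all assumptions made for this example.

```python
import math
from collections import Counter

# Darwin Core terms used to build the vector (an assumed selection).
FIELDS = ("recordedBy", "eventDate", "locality", "scientificName")


def record_vector(record, fields=FIELDS, n=3):
    """Represent a record as counts of character n-grams over selected fields."""
    text = " ".join(str(record.get(f, "")).lower() for f in fields)
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def cluster(records, threshold=0.8):
    """Single-link grouping: records join a cluster if any pair is similar."""
    vectors = [record_vector(r) for r in records]
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Because nothing in this representation encodes which differences are harmless (name order, date format) and which are meaningful, a loose threshold will readily group unrelated records that share a locality or a collector, which is one way false positives can arise in this kind of model.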