Missouri Botanical Garden Open Conference Systems, TDWG 2011 Annual Conference

Georeferencing botanical data using text analysis tools
Clare A Llewellyn, Elspeth M Haston, Claire Grover

Last modified: 2011-10-12

Abstract


Automatic text analysis tools have significant potential to improve the productivity of those who organise large collections of data. However, to be effective, they must be both technically efficient and support a productive interaction with the user.

Georeferencing is the process of converting textual descriptions of where a specimen was collected into machine-readable geographic locations, generally using a map-based coordinate system. Historically, the locality descriptions on plant specimens have been vague. Identifying and correcting plant specimen records that contain errors is time-consuming and expensive for curators, so anything that improves the speed and accuracy of this process would be valuable. Currently, georeferencing is conducted manually, using resources such as gazetteers and maps to find the coordinates (latitude and longitude) of the place names that curators have identified in the plant specimen records.
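In practical terms, georeferencing adds resolved coordinates (and ideally an indication of their uncertainty) to the verbatim locality text already held for a specimen. A minimal sketch of what this looks like for a single record, with hypothetical field names and illustrative values:

    # Illustrative only: hypothetical field names for a specimen record
    # before and after georeferencing.
    record = {
        "catalogue_number": "E00123456",       # hypothetical identifier
        "verbatim_locality": "Hills above Aberfeldy, Perthshire",
    }

    # After the place name has been looked up in a gazetteer and checked on a map:
    record.update({
        "decimal_latitude": 56.62,             # approximate, illustrative values
        "decimal_longitude": -3.87,
        "coordinate_uncertainty_m": 5000,      # how precise the locality description is
        "georeference_source": "gazetteer + map",
    })

    print(record)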

A tool has been created by the University of Edinburgh and the Royal Botanic Garden Edinburgh that allows users to enhance botanical data by adding geographic locations. The user interacts with locations that have been automatically extracted from information about botanical specimens, and can correct or add to this generated output in order to specify an exact georeference for where the botanical sample was collected.

This project identifies and improves the usability of text analysis in a practical context. Output can be generated automatically through content analysis, natural language processing and text mining, yet widespread use of text analysis has not been achieved. The main barrier to uptake is that accuracy levels usually fall short of the expectations and needs of the user. It is proposed that this problem can be addressed through interface extensions to existing text analysis tools that allow the user to correct and enhance the automatically created output, combining the efficiency of automatic processing with the accuracy of manual annotation.

The text analysis tools used are the Edinburgh Informatics information extraction tools, including LT-TTT2 and the Edinburgh Geoparser. These are well-established tools that process text and XML to identify place names and provide geographic coordinates for those locations. The Geoparser is made up of two main components: the Geotagger, which performs place name recognition (identifying text strings as places), and the Georesolver, which performs geographic referencing (looking up the names in a geographic gazetteer and ranking the possible interpretations). The tools have been adapted so that an interface extension provides an interaction with the curator.
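The two-stage structure can be sketched roughly as follows. This is not the Edinburgh Geoparser's actual API: the function names, the toy gazetteer and the population-based ranking are illustrative assumptions standing in for proper named entity recognition and the Georesolver's ranking heuristics.

    import re

    # Toy gazetteer: place name -> candidates as (latitude, longitude, population).
    # The real pipeline queries a geographic gazetteer service instead.
    GAZETTEER = {
        "Perth": [(56.40, -3.43, 47000),        # Perth, Scotland
                  (-31.95, 115.86, 2100000)],   # Perth, Australia
        "Aberfeldy": [(56.62, -3.87, 2000)],
    }

    def geotag(text):
        """Place name recognition: return strings that look like known places."""
        candidates = re.findall(r"[A-Z][a-z]+", text)
        return [c for c in candidates if c in GAZETTEER]

    def georesolve(place):
        """Geographic referencing: look the name up in the gazetteer and
        rank the possible interpretations (here, naively, by population)."""
        return sorted(GAZETTEER.get(place, []), key=lambda c: c[2], reverse=True)

    locality = "Hills above Aberfeldy, near Perth"
    for place in geotag(locality):
        print(place, "->", georesolve(place))

A naive ranking puts Perth, Australia above Perth, Scotland; ambiguities of exactly this kind are what the curator-facing interface allows the user to inspect and correct.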

The software comprises a backend MySQL database, command-line scripts for the initial processing of the records (including text mining and National Grid Reference conversions), CGI scripts to query the database and the map APIs, XSLT to build the specific HTML for each record, and the HTML pages themselves. During the design phase it was noted that a large amount of information was being presented in a small space; JavaScript is therefore used extensively to hide and reveal items as desired.
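One of the initial processing steps mentioned above is National Grid Reference conversion. A minimal sketch of parsing an Ordnance Survey grid reference into metre eastings and northings is given below; it is not the project's actual script, and the further datum transformation from the OSGB36 grid to WGS84 latitude and longitude, normally handled by a projection library, is omitted.

    def ngr_to_easting_northing(gridref):
        """Convert an Ordnance Survey National Grid Reference such as
        'NT 27 73' into metre eastings and northings on the OSGB36 grid."""
        ref = gridref.replace(" ", "").upper()
        letters, digits = ref[:2], ref[2:]
        if len(digits) % 2:
            raise ValueError("grid reference needs an even number of digits")

        def index(letter):
            # Letters run A..Z with 'I' omitted from the grid lettering scheme.
            i = ord(letter) - ord("A")
            return i - 1 if i > 7 else i

        l1, l2 = index(letters[0]), index(letters[1])
        # The first letter selects a 500 km square, the second a 100 km square within it.
        e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
        n100km = 19 - (l1 // 5) * 5 - (l2 // 5)

        half = len(digits) // 2
        scale = 10 ** (5 - half)   # remaining digits are offsets within the 100 km square
        easting = e100km * 100000 + int(digits[:half]) * scale
        northing = n100km * 100000 + int(digits[half:]) * scale
        return easting, northing

    # Example: a 1 km grid square in central Edinburgh.
    print(ngr_to_easting_northing("NT 27 73"))   # (327000, 673000)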