Missouri Botanical Garden Open Conference Systems, TDWG 2016 ANNUAL CONFERENCE

Geographic entities extraction from biological textual sources
Moisés Alberto Acuña-Chaves

Building: CTEC
Room: Auditorium
Date: 2016-12-06 09:00 AM – 09:15 AM
Last modified: 2016-10-15


This work explores and applies entity extraction techniques to identify and encode the geographic locations mentioned in the geographic distribution sections of botanical documents, such as the manual of plant species of Costa Rica. Achieving this objective requires combining several technologies. One is Natural Language Processing (NLP), which supports entity extraction; an example is the ANNIE module of the GATE framework, which relies on gazetteers. Another is rule-based processing (regular expressions, deterministic automata, context-free grammars), of which Freeling is an example.
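As a rough illustration of the gazetteer idea mentioned above, the following minimal Python sketch matches place names from a small lookup table against a sentence. It is not ANNIE, GATE, or Freeling code: the `GAZETTEER` entries, the `find_places` function, and the sample sentence are all invented for illustration, and a real gazetteer would be loaded from an authoritative source.

```python
import re

# Hypothetical mini-gazetteer (assumption): place name -> feature type.
GAZETTEER = {
    "costa rica": "country",
    "cordillera de talamanca": "mountain range",
    "panama": "country",
    "mexico": "country",
}

def find_places(text, gazetteer=GAZETTEER):
    """Return (surface form, feature type, start offset) for each gazetteer hit."""
    # Try longer names first so a multi-word name wins over a shorter overlapping one.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.group(1), gazetteer[m.group(1).lower()], m.start())
        for m in pattern.finditer(text)
    ]

if __name__ == "__main__":
    sample = "Mexico to Panama; in Costa Rica on the Cordillera de Talamanca."
    for name, kind, offset in find_places(sample):
        print(offset, name, kind)
```

Plain string matching of this kind cannot tell a place from a homonym, which is one reason gazetteers are usually combined with rules or linguistic analysis.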

Beyond identification and encoding, it is important to bind the geocoded entities to authoritative sources such as GeoNames. This work also enriches each entry with additional information extracted from the paragraphs that describe the distribution. An algorithm using Freeling 3.1 and Solr 5.5 is presented. The values of interest include Holdridge life zones, world distribution, distribution within Costa Rica, elevation, and flowering months. Once these values are identified, the information is structured so that it can be processed by diverse applications, such as geographic information systems, and reused by other research projects.
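To make the structuring step more concrete, here is a minimal sketch of how two of the values of interest, elevation and flowering months, might be pulled from a distribution paragraph with regular expressions. It does not reproduce the Freeling and Solr algorithm presented in this work; the function `parse_distribution`, the record layout, the English month names, and the sample sentence are assumptions made for illustration only.

```python
import re

FULL_MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
ABBREVIATIONS = [name[:3] for name in FULL_MONTHS]

# Matches ranges such as "500-1200 m" (hyphen or en dash); assumed notation.
ELEVATION_RE = re.compile(r"\b(\d{1,4})\s*[-–]\s*(\d{1,4})\s*m\b")

# Matches a capitalized three-letter abbreviation or the full month name.
MONTH_RE = re.compile(
    r"\b("
    + "|".join(
        n[:3] + (f"(?:{n[3:]})?" if len(n) > 3 else "") for n in FULL_MONTHS
    )
    + r")\b"
)

def parse_distribution(paragraph):
    """Return a structured record with the elevation range and flowering months."""
    record = {"elevation_m": None, "flowering_months": []}
    elevation = ELEVATION_RE.search(paragraph)
    if elevation:
        record["elevation_m"] = (int(elevation.group(1)), int(elevation.group(2)))
    for match in MONTH_RE.finditer(paragraph):
        number = ABBREVIATIONS.index(match.group(1)[:3]) + 1
        if number not in record["flowering_months"]:
            record["flowering_months"].append(number)
    return record

if __name__ == "__main__":
    print(parse_distribution("Wet forest, 500-1200 m; flowering Feb, Mar and April."))
```

A record of this shape can be serialized and passed to downstream tools. The month pattern is deliberately naive: a sentence beginning with the word "May", for instance, would be misread as a month.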

The results were evaluated by manually judging a randomly selected sample to establish whether the algorithm yielded useful data. Each entity extracted and geocoded from the world distribution and the Costa Rica distribution was assigned one of three values (GOOD, BAD, UNKNOWN), judged against its context in the source. The goal is to minimize the percentage of BAD judgments. The algorithm performs relatively well at geocoding and binding the world distribution and life zones; the distribution within Costa Rica needs more work.
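The evaluation procedure described above can be sketched as follows: draw a random sample of geocoded entities, have a person label each one, and report the share of each label. The helper names `draw_sample` and `judgment_percentages`, the fixed seed, and the example labels are hypothetical; the labels shown are not results from this work.

```python
import random
from collections import Counter

LABELS = ("GOOD", "BAD", "UNKNOWN")

def draw_sample(entities, size, seed=0):
    """Pick up to `size` entities at random; the seed keeps the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(entities, min(size, len(entities)))

def judgment_percentages(judgments):
    """Return the percentage of each label among the manual judgments."""
    counts = Counter(judgments)
    unexpected = set(counts) - set(LABELS)
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    total = len(judgments)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: 100.0 * counts[label] / total for label in LABELS}

if __name__ == "__main__":
    # Made-up judgments, only to show the shape of the output.
    print(judgment_percentages(["GOOD", "GOOD", "BAD", "UNKNOWN"]))
```

Reporting UNKNOWN separately, rather than folding it into BAD, keeps cases where the source context was insufficient apart from cases the judge considered wrong.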