TDWG 2016 ANNUAL CONFERENCE

Semi-Automatic Extraction of Plants Morphological Characters from Taxonomic Descriptions
Maria Mora, José Enrique Araya

Date: 2016-12-06 09:30 AM – 09:45 AM
Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. Unfortunately, most taxonomic information is available as text in scientific publications. The volume of publications is very large, so processing it manually is complex and very expensive. The Biodiversity Heritage Library (BHL) estimates that more than 120 million pages have been published in over 5.4 million books since 1469, plus about 800,000 monographs and 40,000 journal titles (12,500 of these are current).

Standards and software tools are needed to extract and integrate this knowledge and to publish it in existing free and open-access repositories that support science, education, and biodiversity conservation.

This talk presents an algorithm, based on computational linguistics techniques, that extracts structured information from morphological descriptions of plants written in Spanish. The algorithm builds on the work of Dr. Hong Cui (University of Arizona) and uses semantic analysis, ontologies, and a repository of knowledge acquired from the descriptions themselves. It was applied to the book Trees of Costa Rica Volume III and to a subset of descriptions from the Manual of Plants of Costa Rica, with very competitive results (average performance above 94.1%). The system receives the morphological descriptions in tabular format and generates XML documents that follow the schema proposed by Dr. Cui (available at https://github.com/biosemantics/schemas/blob/master/semanticMarkupOutput.xsd). The schema supports documenting structures, characters, and relations between characters and structures. Each extracted object is documented with attributes such as name, value, modifiers, restrictions, and ontology term id, among others.
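
As a rough illustration of the output side of such a pipeline, the following Java sketch serializes one description, one structure, and its characters as XML. It is not the authors' implementation: the class name MarkupSketch, the method toXml, and the element and attribute names (statement, text, structure, character, name, value) are assumptions chosen for readability and are not taken from the referenced XSD. The sample values are invented.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class MarkupSketch {

    // Builds one <statement> holding the source text, a structure, and its characters.
    // All element and attribute names are hypothetical, not copied from the real schema.
    static String toXml(String text, String structure, Map<String, String> characters)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element statement = doc.createElement("statement");
        doc.appendChild(statement);

        // Keep the original sentence next to the extracted markup.
        Element textEl = doc.createElement("text");
        textEl.setTextContent(text);
        statement.appendChild(textEl);

        // One structure (for example a leaf) with its name as an attribute.
        Element structureEl = doc.createElement("structure");
        structureEl.setAttribute("name", structure);
        statement.appendChild(structureEl);

        // Each character becomes a child element carrying a name/value pair.
        for (Map.Entry<String, String> c : characters.entrySet()) {
            Element characterEl = doc.createElement("character");
            characterEl.setAttribute("name", c.getKey());
            characterEl.setAttribute("value", c.getValue());
            structureEl.appendChild(characterEl);
        }

        // Serialize the DOM tree to a string without the XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Invented sample input: a Spanish description and two extracted characters.
        Map<String, String> characters = new LinkedHashMap<>();
        characters.put("arrangement", "alterno");
        characters.put("shape", "ovado");
        System.out.println(toXml("Hojas alternas, ovadas.", "hoja", characters));
    }
}
```

A real implementation would also record modifiers, restrictions, ontology term ids, and relations between structures, and would validate the result against the published XSD.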

The implemented tool is free software developed in Java. It integrates existing technology such as FreeLing, the Plant Ontology (PO), the Ontology Term Organizer (OTO), and the Flora Mesoamericana English-Spanish Glossary.