Missouri Botanical Garden Open Conference Systems, TDWG 2016 ANNUAL CONFERENCE

Semi-automatical classification and structuring of text fragments from biological documents
Jose E Araya-Monge

Building: CTEC
Room: Auditorium
Date: 2016-12-06 09:15 AM – 09:30 AM
Last modified: 2016-10-15


An enormous body of information required for effective biodiversity conservation is stored in books and papers. Unfortunately, that makes it harder to synthesize knowledge from it. In this talk, we present a tool that helps users semi-automatically extract and structure knowledge from scientific literature about the flora of Costa Rica. The tool is still under development, and it is not yet integrated with the tools that extract morphological characters from taxonomic descriptions or with those that extract geographic entities.

As its first goal, the tool allows users to mark fragments of text from a botanical document (Flora de Costa Rica, Árboles de Costa Rica) and assign each fragment one of several semantically meaningful categories that describe its content. These categories include morphological descriptions, distributions, dichotomous keys, and diagnostic descriptions. Depending on the information available, the process will range from fully manual to, in the future, fully automatic. There will be four levels of processing:

  1. manual mark up and manual assignment of categories
  2. manual mark up and automatic suggestions of categories
  3. manual mark up and automatic assignment of categories
  4. automatic mark up and automatic assignment of categories

We have already implemented the first two levels. The next two levels are under development.
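To make the second level concrete, here is a minimal Python sketch of manual mark-up combined with automatic category suggestions. The abstract does not say how the tool computes its suggestions, so the cue-word scoring below is purely an illustrative assumption. The `Fragment` class, the `suggest_categories` function, and the cue lists are hypothetical names introduced for this example; only the four category names come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Category names follow the abstract; the cue words are assumptions
# chosen only to illustrate the idea, not the tool's real vocabulary.
CATEGORY_CUES = {
    "morphological description": ["leaves", "petals", "flowers", "cm long"],
    "distribution": ["elevation", "slope", "province", "endemic"],
    "dichotomous key": ["1a.", "1b.", "2a.", "2b."],
    "diagnostic description": ["distinguished", "differs from", "recognized by"],
}


@dataclass
class Fragment:
    """A span of text that a user marked by hand in a source document."""
    document: str
    start: int
    end: int
    text: str
    category: Optional[str] = None  # filled in once the user accepts a suggestion


def suggest_categories(fragment: Fragment) -> List[str]:
    """Rank categories by how many cue words occur in the fragment text."""
    lowered = fragment.text.lower()
    scores = {
        category: sum(lowered.count(cue) for cue in cues)
        for category, cues in CATEGORY_CUES.items()
    }
    matching = [category for category, score in scores.items() if score > 0]
    # Highest score first; categories with no matching cue are left out.
    return sorted(matching, key=lambda category: -scores[category])


text = "Leaves alternate, 5-10 cm long; flowers white, petals 5."
fragment = Fragment("Flora de Costa Rica", 0, len(text), text)
suggestions = suggest_categories(fragment)
if suggestions:
    fragment.category = suggestions[0]  # level 2: the user confirms the choice
```

In this sketch the user still decides which suggestion to accept. Level 3 would correspond to assigning the top-ranked category without confirmation.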

The second goal of the tool is to allow users to invoke specialized tools that extract structured information from a subset of fragments according to their categories. We are in the process of integrating the tools that have been developed to:

  • semi-automatically structure the morphological descriptions to extract characters
  • semi-automatically structure the distribution descriptions to extract geographical entities

As the final goal, modules are being developed to take advantage of the available structured information. They will allow users to:

  • query the morphological descriptions
  • query the distribution descriptions
  • select a set of taxa and generate input for taxon-character matrix software (like Lucid)
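The last module can be pictured as a pivot from per-taxon character records to a rectangular matrix. The Python sketch below assumes the structured information is available as a dictionary mapping each taxon to its character states, and it writes a generic CSV table. It does not reproduce the import format of Lucid or of any other specific program. The function names, the placeholder taxa, and the use of `?` for a missing state are all assumptions made for illustration.

```python
import csv
import io
from typing import Dict, List


def build_matrix(records: Dict[str, Dict[str, str]], taxa: List[str]) -> List[List[str]]:
    """Build a taxon-by-character table for the selected taxa."""
    selected = {taxon: records.get(taxon, {}) for taxon in taxa}
    # Columns are the union of all characters seen in the selected taxa.
    characters = sorted({name for states in selected.values() for name in states})
    rows = [["taxon"] + characters]
    for taxon in taxa:
        # "?" marks a character with no recorded state (an assumption).
        rows.append([taxon] + [selected[taxon].get(name, "?") for name in characters])
    return rows


def matrix_to_csv(rows: List[List[str]]) -> str:
    """Serialize the table as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Placeholder records, not real botanical data.
records = {
    "Taxon A": {"leaf arrangement": "alternate", "petal count": "5"},
    "Taxon B": {"leaf arrangement": "opposite"},
    "Taxon C": {"leaf arrangement": "whorled", "petal count": "4"},
}
matrix = build_matrix(records, ["Taxon A", "Taxon B"])
csv_text = matrix_to_csv(matrix)
```

A converter for a particular matrix program would replace `matrix_to_csv` with a writer for that program's documented input format.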