Missouri Botanical Garden Open Conference Systems, TDWG 2014 ANNUAL CONFERENCE

ENVIRONMENTS-EOL: identification of Environment Ontology terms in text and the annotation of the Encyclopedia of Life
Evangelos Pafilis, Sune Frankild, Lucia Fanini, Sarah Faulwetter, Christina Pavloudi, Julia Schnetzer, Aikaterini Vasileiadou, Umer Ijaz, Christos Arvanitidis, Robert Stevenson, Lars Juhl Jensen

Building: Elmia Congress Centre, Jönköping
Room: Rydbergsalen
Date: 2014-10-27 04:40 PM – 05:00 PM
Last modified: 2014-10-03

Abstract


Knowing when and where a species is most likely to be found informs the understanding of ecological relationships and supports management decisions. Associating organisms with environment types can provide researchers and policy makers with useful insights, complementing geo-referenced observation data.

Descriptions of species in the Encyclopedia of Life (EOL, http://eol.org, Parr et al. 2014, http://bdj.pensoft.net/articles.php?id=1079) contain terms such as "desert", "lagoon", and "forest" that readers readily understand. However, to make this information available for extraction and analysis, the terms must be identified and tagged as habitat-related. This task was undertaken using ENVIRONMENTS (http://environments.hcmr.gr, Pafilis et al. 2013, http://www.plosone.org/article//info:doi/10.1371/journal.pone.0065390), a dictionary-based, open source, named entity recognition tool. ENVIRONMENTS identifies Environment Ontology (EnvO) terms in plain text (http://environmentontology.org/, Buttigieg et al. 2013, www.jbiomedsem.com/content/4/1/43). EnvO, a controlled and structured vocabulary for biomes, environmental features, materials and conditions, provides ENVIRONMENTS with a hierarchically related term set to drive this identification process.

To improve recognition, orthographic expansion of dictionary names and flexible term matching help capture the different ways in which terms may be written in text, as compared to their representation in the ontology. To prevent the false positives that may arise from these two processes, an extensive, manually curated stopword list is employed.
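The matching strategy described above can be illustrated with a minimal Python sketch. It is not the ENVIRONMENTS implementation: the dictionary entries, the placeholder identifiers, the stopword, and the function names (expand, build_lookup, tag) are all assumptions made for illustration, and simple plural and hyphen variants plus case-insensitive matching stand in for the tool's actual expansion and flexible matching rules.

```python
import re

# Hypothetical mini-dictionary: term name -> placeholder identifier.
# These identifiers are NOT real EnvO accessions.
DICTIONARY = {
    "coral reef": "ENVO:PLACEHOLDER_A",
    "desert": "ENVO:PLACEHOLDER_B",
    "spring": "ENVO:PLACEHOLDER_C",
}

# Hypothetical stopword: "spring" is blocked because it is
# ambiguous with the season; the plural "springs" is still allowed.
STOPWORDS = {"spring"}


def expand(name):
    """Generate simple orthographic variants: plural and hyphenated forms."""
    variants = {name, name + "s"}
    for variant in list(variants):
        if " " in variant:
            variants.add(variant.replace(" ", "-"))
    return variants


def build_lookup(dictionary, stopwords):
    """Map every allowed variant to its identifier, skipping stopwords."""
    lookup = {}
    for name, term_id in dictionary.items():
        for variant in expand(name):
            if variant not in stopwords:
                lookup[variant] = term_id
    return lookup


def tag(text, lookup):
    """Return (offset, matched text, identifier) for each match in text."""
    # Longest variants first, so "deserts" is tried before "desert".
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,  # case-insensitive matching
    )
    return [
        (m.start(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


lookup = build_lookup(DICTIONARY, STOPWORDS)
for hit in tag("In spring it leaves the Coral-reefs for desert springs.", lookup):
    print(hit)
```

In this sketch the seasonal "spring" is suppressed by the stopword entry, while "Coral-reefs", "desert", and "springs" are tagged despite differing from the dictionary names in case, hyphenation, or number.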

Based on the EOL website as of 17 September 2014, 231,759 Taxon Pages were annotated and 1,476,155 unique "EOL taxon – EOL text section – matched term – EnvO identifier" associations were extracted. These associations were derived from English-language text in informative sections, such as "Habitat", "Biology Description", "Trophic Strategy", "Reproduction" and others.

In addition to the taxon – environment relation, the EOL text section provenance can help address biology-focused questions, e.g. in which environments may an organism be found during the migration stage of its life (if one applies). Such pieces of information are highly relevant to citizen-science project planning.
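As a rough illustration of how the four-part associations and their section provenance might be used, the Python sketch below deduplicates a handful of records and then looks up the environments reported for a taxon in one text section. The records, taxon names, placeholder identifiers, the "Migration" section label, and the helper environments_for are invented for this example; they do not reflect the actual layout of the ENVIRONMENTS-EOL dataset.

```python
from collections import namedtuple

# One record per "taxon - text section - matched term - identifier" association.
Association = namedtuple("Association", "taxon section matched_term envo_id")

# Made-up records; the identifiers are placeholders, not real EnvO accessions.
records = [
    Association("Taxon A", "Habitat", "lagoon", "ENVO:PLACEHOLDER_1"),
    Association("Taxon A", "Habitat", "lagoon", "ENVO:PLACEHOLDER_1"),  # duplicate
    Association("Taxon A", "Migration", "desert", "ENVO:PLACEHOLDER_2"),
    Association("Taxon B", "Habitat", "forest", "ENVO:PLACEHOLDER_3"),
]

# Named tuples are hashable, so a set keeps only the unique associations.
unique = set(records)


def environments_for(associations, taxon, section):
    """List the matched terms for one taxon, restricted to one text section."""
    return sorted(
        {
            a.matched_term
            for a in associations
            if a.taxon == taxon and a.section == section
        }
    )


print(len(unique))
print(environments_for(unique, "Taxon A", "Migration"))
```

Keeping the text section in each record is what allows the second lookup: the same taxon can be associated with different environments depending on which part of its description the term came from.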

To reach a wide range of users, beyond data experts and information technology specialists, the extracted ENVIRONMENTS-EOL annotations have been incorporated into the EOL system (in collaboration with Dr. J. Hammock, Dr. P. Leary, Dr. K. Schulz, and Dr. C. Parr; see also Parr et al. under review, http://www.semantic-web-journal.net/content/traitbank-practical-semantics-organism-attribute-data). EnvO terms associated with a taxon can be seen in the Overview, Quick Facts, and under the Data Tab of an EOL Taxon Page.

Monthly updates of the ENVIRONMENTS-EOL annotation extraction keep pace with the increasing incorporation of biodiversity knowledge into the EOL system. The ENVIRONMENTS tagger is distributed as open source (under the BSD license) and can be downloaded from: http://environments.hcmr.gr/. The ENVIRONMENTS-EOL annotation dataset is available at: http://download.jensenlab.org/EOL/.

ENVIRONMENTS-EOL is funded by the EOL-Rubenstein 2013 Program. Some of the visualization scripts were developed in hackathons funded by the EU COST ES1103 Action. EP and LF have received funding from the EU FP7 MARBIGEN project (grant agreement No 264089). LJJ and SuF are funded by the Novo Nordisk Foundation Center for Protein Research (NNFCPR). A visit by EP to NNFCPR was funded by an EMBO Short Term Fellowship (356-2011). The collaboration with RS was initiated at NESCent's BHL-EOL Research Sprint 2014.