Missouri Botanical Garden Open Conference Systems, TDWG 2013 ANNUAL CONFERENCE

Identification of Environment Ontology terms in Text and Annotation of Biodiversity (ENVIRONMENTS-EOL) and Genomics (SEQenv) Information
Evangelos Pafilis, Sune Frankild, Umer Ijaz, Lucia Fanini, Sarah Faulwetter, Christina Pavloudi, Julia Schnetzer, Aikaterini Vasileiadou, Christos Arvanitidis, Christopher Quince, Lars Juhl Jensen

Building: Grand Hotel Mediterraneo
Room: Sala dei Continenti
Date: 2013-11-01 11:20 AM – 11:29 AM
Last modified: 2013-10-07

Abstract


Knowing the species reported to exist in a specific environment (e.g. invertebrates living in coral reefs or microbes growing in hydrothermal vents) is a key piece of information in exploring biodiversity and ecological patterns.

Such knowledge currently exists as free text in the relevant sections (e.g. Habitat) of species descriptions, in the text fields of sequence records, and/or in related publications.

Identifying environment descriptors such as “terrestrial”, “marine”, “forest” and “coral reef” in free text, and mapping them to community semantic models, can support linking to this crucial environmental context information.

ENVIRONMENTS (http://environments.hcmr.gr), an open source named entity recognition tool, follows a dictionary-based approach to support such term identification. The controlled and structured vocabulary of the Environment Ontology (EnvO, http://environmentontology.org/) for biomes, environmental features, materials and conditions serves as the source of names for this identification process.

Built on the same fast performing tagger engine as in Pafilis et al., 2013, ENVIRONMENTS addresses the challenge posed by the ever-increasing amount of biomedical and biodiversity literature and data.

Orthographic expansion of the dictionary names and flexible matching improve the agreement between EnvO terms as they exist in the ontology and the forms in which they may appear in text. An extensive, manually curated stopword list protects against an increase in false positives.
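
The dictionary-based approach described above can be illustrated with a minimal Python sketch. It is not the ENVIRONMENTS tagger itself: the dictionary entries, the placeholder identifiers (not real EnvO accessions), the stopword, and the helper names `expand`, `build_index` and `tag` are all assumptions made for illustration, and the expansion rules (plural and hyphenated variants) are only one simple possibility.

```python
import re

# Hypothetical mini-dictionary: name -> placeholder identifier
# (illustrative only; these are NOT real EnvO accessions).
DICTIONARY = {
    "coral reef": "ENV:0001",
    "hydrothermal vent": "ENV:0002",
    "forest": "ENV:0003",
    "spring": "ENV:0004",
}

# Hypothetical stopword: an ambiguous name ("spring" as a season)
# that would otherwise produce false positives.
STOPWORDS = {"spring"}


def expand(name):
    """Return simple orthographic variants: plural and hyphenated forms."""
    variants = {name, name + "s"}
    variants |= {v.replace(" ", "-") for v in set(variants)}
    return variants


def build_index(dictionary, stopwords):
    """Map every expanded variant of every non-blocked name to its identifier."""
    index = {}
    for name, term_id in dictionary.items():
        if name in stopwords:
            continue
        for variant in expand(name):
            index[variant] = term_id
    return index


def tag(text, index):
    """Return (start, end, matched text, identifier) for each dictionary hit."""
    # Longer names first, so the longest variant is preferred at each position.
    names = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), index[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


index = build_index(DICTIONARY, STOPWORDS)
matches = tag("Invertebrates living in Coral-Reefs and forests bloom in spring.", index)
```

In this sketch the case-insensitive match on "Coral-Reefs" and the plural "forests" are recovered through the expanded variants, while "spring" is suppressed by the stopword list.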

To support biodiversity and ecology research, ENVIRONMENTS is being employed to link both species and genetic sequences to their environment:

ENVIRONMENTS and the Encyclopedia of Life
ENVIRONMENTS-EOL (http://environments-eol.blogspot.com) is a project that aims to process the Taxon pages of the Encyclopedia of Life (EOL, http://eol.org) to extract descriptions of their environmental context. This information will subsequently be employed to answer integrative, large-scale biological questions. One example is retrieving all species that belong to a specific group (e.g. Invertebrates), are associated with a particular environment (e.g. coral reefs) and occur in a specific region (e.g. the Indo-Pacific Ocean). By collecting the available information about a given taxon, the EOL is a one-stop shop that greatly facilitates exploring such questions.

SEQenv: annotating sequences with environments
SEQenv (http://environments.hcmr.gr/seqenv.html) is a pipeline that annotates genetic sequences based on the EnvO terms occurring in the records (the “isolation source” field of GenBank records) and in the related literature (PubMed abstracts) of highly similar sequences. Sequence analysis and web data retrieval, along with statistical analysis and visualizations that use the EnvO structure, are employed to this end. SEQenv has been applied to highlight the source environments of the sequences in novel samples and in different microbial taxa.
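
One way an ontology structure can feed into such statistics is by propagating term counts to ancestor terms, so that hits on specific environments also count towards broader ones. The sketch below is a hedged illustration and not SEQenv's actual method: the `PARENT` links are hypothetical labels rather than the real EnvO hierarchy (which is a graph where a term may have several parents, simplified here to a single parent), and `ancestors` and `rollup` are names invented for this example.

```python
from collections import Counter

# Hypothetical child -> parent links (illustrative labels only;
# not the actual EnvO hierarchy, and simplified to one parent per term).
PARENT = {
    "coral reef": "marine biome",
    "hydrothermal vent": "marine biome",
    "marine biome": "biome",
    "forest": "terrestrial biome",
    "terrestrial biome": "biome",
}


def ancestors(term):
    """Yield each ancestor of a term, walking up to the root."""
    while term in PARENT:
        term = PARENT[term]
        yield term


def rollup(hits):
    """Count each tagged term and add the same count to all its ancestors."""
    counts = Counter()
    for term in hits:
        counts[term] += 1
        for ancestor in ancestors(term):
            counts[ancestor] += 1
    return counts


# Hypothetical terms tagged in the records of sequences similar to a query.
hits = ["coral reef", "coral reef", "hydrothermal vent", "forest"]
counts = rollup(hits)
```

With these made-up hits, the three marine terms accumulate under "marine biome", which could make a broad environmental signal visible even when the specific terms differ between records.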

References 
The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. Pafilis E, Frankild SP, et al. (2013). PLoS ONE 8(6): e65390.