Building: Grand Hotel Mediterraneo
Room: Sala dei Continenti
Date: 2013-11-01 10:24 AM – 10:30 AM
Last modified: 2013-10-07
Abstract
The increased availability of high-capacity sensors in various scientific domains is causing exponential growth in the amount of scientific data generated. This is also the case in the Biodiversity domain. This work is being partly developed in the context of the Brazilian Biodiversity Information System (SiBBr), which harvests, indexes and disseminates this type of data, which is usually available through textual documents on the Web. In this scenario, one needs to retrieve and extract information from these unstructured documents, which is not a trivial task. One can define Information Retrieval (IR) as the task of fulfilling a demand for information from a collection of unstructured documents, mainly text. Ontologies, in turn, are frequently seen as an answer to the problem of semantic interoperability in current information systems. Most work involving ontologies has focused on issues regarding their construction and update. However, in addition to approaches where ontologies are built manually, one can consider building them semi-automatically using, for instance, IR techniques. In this context, Ontology Learning (OL) emerges as an interesting strategy. This technique uses knowledge from various areas such as machine learning, knowledge acquisition, natural language processing, and IR. The adoption of OL has the benefit of reducing the time and effort required for developing ontologies. OL can be simply defined as the task of identifying terms, concepts, relationships and, eventually, axioms from textual information. Given the amount of biodiversity information produced daily, the need for an automated mechanism that enables or supports the extraction and reuse of knowledge is clear. Therefore, it is expected that OL will be an efficient tool in the Biodiversity domain. In this work, the application of some of the available techniques for building ontologies from texts related to biodiversity has been evaluated.
NLTK 2.0 has been used for tokenization, named entity extraction and the labeling of terms according to grammatical class, after which a specific algorithm is applied to measure the frequency of each term in the text. For the acquisition of synonyms, WordNet® has been employed. WordNet is a large lexical database where nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. The validation of the extracted concepts has been performed using controlled vocabularies, which help to obtain relevant classes in the Biodiversity domain. Documents can also be enriched through semantic annotations to automate the process of tagging document terms with the aid of a domain ontology. Furthermore, a simplified algorithm has been developed for determining relationships between the classes obtained in the previous steps. This technology has applications in other areas of the Semantic Web such as Linked Open Data, which enables the integration of data and information on the Web.
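The term-frequency and vocabulary-validation steps described above can be illustrated with a minimal sketch. It is not the algorithm used in this work: a plain regular-expression tokenizer stands in for NLTK, and the controlled vocabulary, the stop-word list, the sample sentence and all function names are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary; a real one would come from a
# domain thesaurus or glossary for biodiversity.
CONTROLLED_VOCABULARY = {"species", "habitat", "forest", "specimen"}

# Tiny illustrative stop-word list (an assumption, far from exhaustive).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "this", "each", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(text):
    """Count how often each non-stop-word token occurs in the text."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(tokens)


def candidate_concepts(text, vocabulary=CONTROLLED_VOCABULARY, min_count=1):
    """Keep only frequent terms that also appear in the controlled vocabulary."""
    freqs = term_frequencies(text)
    return {term: n for term, n in freqs.items()
            if n >= min_count and term in vocabulary}


sample = "The species lives in the forest. This forest is the habitat of many species."
concepts = candidate_concepts(sample)
print(concepts)
```

In this sketch, terms such as "lives" and "many" are counted but discarded because they are absent from the vocabulary, which mirrors the role of the controlled vocabularies as a filter for domain-relevant classes. No stemming or lemmatization is applied, so singular and plural forms would be counted separately.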
Acknowledgement: The authors would like to thank UNEP for its support.