Building: CTEC
Room: Auditorium
Date: 2016-12-06 04:30 PM – 04:45 PM
Last modified: 2016-10-15
Abstract
Integrating biological data from multiple sources lets us investigate biology on grander scales. Scientific names of organisms make it possible to aggregate information about the same taxa from many different sources. Such aggregation is impeded, however, because a taxon often has more than one name, and one name may apply to more than one taxon. Names are often spelled with variations, and are sometimes misspelled, abbreviated, or annotated. Author information often varies dramatically.
To deal with biodiversity information we need tools that disambiguate spelling variants, find names in spite of misspellings, and find synonyms and currently accepted names for a taxon. To mobilize information from scientific literature in general, and from the Biodiversity Heritage Library in particular, we need fast and reliable name-finding, reconciliation, and resolution tools. With advances in DNA and RNA sequencing there is a dramatic increase in the use of "surrogate" names that do not follow the established rules of nomenclature. These mandate new nomenclatural systems to map legacy literature to molecular knowledge systems.
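As a rough illustration of matching names despite misspellings and author annotations, the following minimal Python sketch reduces a name-string to a crude two-word canonical form and fuzzy-matches it against a reference list. It is not the Global Names implementation: the reference list, the function names `canonical_form` and `match_name`, and the 0.85 similarity cutoff are all assumptions chosen for illustration, and real name-strings need far more careful parsing.

```python
import difflib
import re

# Hypothetical reference list of accepted names (illustrative only).
REFERENCE_NAMES = ["Homo sapiens", "Pan troglodytes", "Quercus robur"]


def canonical_form(name_string):
    """Return a crude canonical form: the first two words of the name-string.

    Assumption: the name is a simple binomial followed by optional
    authorship or annotations. Punctuation and digits are discarded.
    """
    cleaned = re.sub(r"[^A-Za-z\s]", " ", name_string)
    words = cleaned.split()
    if len(words) < 2:
        return None
    return words[0].capitalize() + " " + words[1].lower()


def match_name(name_string, cutoff=0.85):
    """Return the closest reference name, or None if nothing is similar enough.

    The cutoff is an arbitrary illustrative threshold, not a tuned value.
    """
    canonical = canonical_form(name_string)
    if canonical is None:
        return None
    matches = difflib.get_close_matches(
        canonical, REFERENCE_NAMES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

Under these assumptions, `match_name("Homo sapiens Linnaeus, 1758")` and `match_name("Homo sapeins")` both resolve to the reference entry "Homo sapiens", while a name absent from the list, such as "Escherichia coli", returns `None`.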
As part of the Global Names Architecture project we are developing a new generation of high-quality tools with an emphasis on scalability and speed. Our current goal is to create name-parsing, name-resolving, and name-finding programs that can process the whole corpus of biological literature in a few days and re-index the Biodiversity Heritage Library system in 1-2 days. Such speeds will improve the quality of scientific name services globally.
In this presentation we introduce the second generation of the Global Names Parser and the Global Names Resolver tools. Both projects are based heavily on the Scala language and offer an orders-of-magnitude increase in throughput over previous tools. For example, the parser is able to process 30 million names/hour per CPU thread with 99% accuracy, and the name-string resolver Application Program Interface (API) is able to match 1000 name-strings/second per request. We will also discuss our approach to the global name-string finding effort, which is planned for release by the middle of 2017. We strive to rescan the Biodiversity Heritage Library routinely every time we release an update of the name-string handling algorithms and, as a result, to continuously increase the quality of biodiversity bibliographic indexes and mapping.