Article Semanticizer - stitching data mining services into a standalone search appliance
David Peter Shorthouse, Dmitry Mozzherin

Building: Elmia Congress Centre, Jönköping
Room: Rydbergsalen
Date: 2014-10-30 09:00 AM – 09:05 AM
The Biodiversity Heritage Library, traditional publishers, scientific societies, and many other organizations with large or small collections of unstructured biological texts need a simple, scalable mechanism to create search indices. These indices are immediately valuable to the public and also afford opportunities to enrich their content through internal and external crosslinks. Here, we describe an MIT-licensed application that combines the strengths of two services from Global Names (http://www.globalnames.org), one from the Encyclopedia of Life (http://eol.org), and one from AlchemyAPI (http://www.alchemyapi.com/). In combination, these services discover and resolve scientific names in raw text (or images), expand them to their vernacular equivalents in multiple languages, and extract known entities such as surnames, place names (with geographic coordinates from GeoNames), and organization names. The resultant database of indexed terms and the full text are then ingested into ElasticSearch for immediate autocomplete and full-text search capabilities. A proof of concept was constructed from back issues of The Canadian Entomologist (1868-2002) and temporarily made available at http://canent.shorthouse.net. Code is available at https://github.com/dshorthouse/article_semanticizer.
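
The final ingestion step described above can be illustrated with a minimal Python sketch. It is not taken from the article_semanticizer code base. It assumes the name-discovery, vernacular-expansion, and entity-extraction steps have already run, and it only shows how their outputs might be merged into one record and serialized in the newline-delimited JSON layout accepted by the ElasticSearch bulk endpoint. The field names, the index name "articles", the identifier "canent-0001", and the sample values are all hypothetical and chosen for illustration only.

```python
import json


def build_article_document(article_id, full_text, scientific_names,
                           vernacular_names, entities):
    """Merge the outputs of the upstream services into one record.

    All field names here are hypothetical; the real application may
    organize its index differently.
    """
    return {
        "id": article_id,
        "full_text": full_text,
        # De-duplicate and sort so repeated mentions are indexed once.
        "scientific_names": sorted(set(scientific_names)),
        "vernacular_names": vernacular_names,
        "people": entities.get("people", []),
        "places": entities.get("places", []),
        "organizations": entities.get("organizations", []),
    }


def to_bulk_ndjson(index_name, documents):
    """Serialize records as a bulk request body.

    Each record becomes two lines: an action line naming the target
    index and document id, then the record itself. The body must end
    with a newline.
    """
    lines = []
    for doc in documents:
        action = {"index": {"_index": index_name, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Hand-written sample values, not real output from any of the services.
doc = build_article_document(
    article_id="canent-0001",
    full_text="Notes on Danaus plexippus collected near Ottawa.",
    scientific_names=["Danaus plexippus", "Danaus plexippus"],
    vernacular_names={"en": ["monarch butterfly"], "fr": ["monarque"]},
    entities={
        "places": [
            {"name": "Ottawa", "location": {"lat": 45.42, "lon": -75.69}}
        ],
    },
)

payload = to_bulk_ndjson("articles", [doc])
print(payload)
```

The resulting payload could be sent to a cluster's bulk endpoint with any HTTP client. Autocomplete behaviour would additionally depend on how the index mapping is defined, which this sketch does not cover.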