Missouri Botanical Garden Open Conference Systems, TDWG 2014 ANNUAL CONFERENCE

Unlocking biodiversity knowledge through Text Mining and Crowdsourced Tagging of Legacy Literature
William Ulate

Building: Elmia Congress Centre, Jönköping
Room: Rum 10
Date: 2014-10-28 02:30 PM – 02:45 PM
Last modified: 2014-10-03


Since its beginnings, the Biodiversity Heritage Library (BHL) has applied a broad definition of biodiversity science, digitizing books and journals that cover very diverse topics within that field.  Also since day one, BHL has paid particular attention to the literature needs of taxonomists, trying to make available the most extensive and relevant literature possible.  Most taxonomists are mainly interested in taxon names, their hierarchy and their history as the key entry point to biodiversity knowledge.  Other scientists, such as ecologists and evolutionary scientists, rely on very different access paths that include diverse named entities, their traits, and the relationships between taxa and their surroundings.  Professionals from other disciplines, such as history, the humanities and conservation science, are attracted by other information, such as the names of places and people through time.  BHL’s collaborative process has made images of the text pages and their corresponding OCR freely available to audiences around the world, but this OCR comes with its own set of errors, which depend on such aspects as typography, the current state of the book and even the software version.  Our community has expressed that OCR quality needs to improve before the full potential of this knowledge can be exploited through automated means.  So far, text mining techniques developed by partner initiatives have been successfully employed to find candidate taxon names concealed in the text and, more recently, other algorithms have also proved effective in identifying the pages that contain illustrations within the BHL corpus.  Likewise, as part of its new innovation projects, BHL is evaluating the effectiveness of a gaming approach to help resolve minor differences in transcriptions and OCR text of different types of materials (notebooks, seed lists, etc.) after an automated process has spotted the conflicting areas.  Yet another innovation project, Mining Biodiversity, will tag named entities and their relationships to create a training subset of legacy literature, so that automatic engines can process the BHL content and tag it accordingly.  This will form the basis for enabling semantic searches in BHL.
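
As an illustration only, the step in which an automated process spots conflicting areas between two transcriptions could be sketched as below. This is not BHL's actual pipeline: the function name `conflicting_spans`, the word-level comparison, and the two sample strings are all assumptions made for the example, and the diff relies on Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def conflicting_spans(text_a, text_b):
    """Return (span_a, span_b) pairs where two transcriptions disagree.

    Hypothetical helper: compares the texts word by word and keeps
    every region that is not identical in both versions.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False avoids the heuristic that ignores very frequent tokens.
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    conflicts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            conflicts.append(
                (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
            )
    return conflicts


# Made-up sample strings standing in for two OCR outputs of the same line.
ocr_a = "Quercus robur L. is common in Europe"
ocr_b = "Quercus rohur L. is comrnon in Europe"
conflicts = conflicting_spans(ocr_a, ocr_b)
for span_a, span_b in conflicts:
    print(f"{span_a!r} vs {span_b!r}")
```

Only the disagreeing spans would then need to be shown to a person, for instance a player in the game, rather than the whole page.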