Missouri Botanical Garden Open Conference Systems, TDWG 2016 ANNUAL CONFERENCE

Identifiers for Biodiversity Informatics: The Global Names Approach
Dmitry Y Mozzherin, Richard Pyle

Building: CTEC
Room: Auditorium
Date: 2016-12-08 09:30 AM – 09:45 AM
Last modified: 2016-10-16


Scientific names are perhaps the most persistent global identifiers in biology. They have been used to aggregate and exchange biodiversity information for 250 years, and their importance is hard to overestimate. Advances in informatics have brought new opportunities and challenges for organizing information, and biology is rapidly moving into the realm of “Big Data”. Connecting information through scientific names is not trivial: the same name has many spelling variants, binomial names are unstable because new genus-species combinations are created, and there are homonyms, misapplied names, and similar complications. The Global Names Architecture (GNA) is designing better global identifiers for biology and mapping scientific names to these identifiers.

Our identifier design follows several goals. The identifiers must be globally unique so that they can be minted without checking a global registry. They should be optimized for computer-to-computer interaction and should be independent of encoding, resolution, or transport protocols. Identifiers should be used for identification only; other important features, such as addressability or resolution, are achieved by embedding the identifiers in formats already in use (e.g., PURL, URI, LSID). The Global Names Architecture uses two kinds of identifiers: Name-String Identifiers and Global Names Usage Bank identifiers.
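The separation between identification and addressability can be illustrated with a minimal Python sketch. The UUID value, the `example.org` host, the LSID authority and namespace, and the function names are all placeholders chosen for illustration; none of them come from the Global Names Architecture itself.

```python
import uuid

# An arbitrary example identifier; the value carries no meaning.
identifier = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")


def as_urn(u: uuid.UUID) -> str:
    """Wrap the bare identifier in the standard UUID URN form."""
    return u.urn


def as_http_uri(u: uuid.UUID, base: str = "https://example.org/id/") -> str:
    """Wrap the identifier in a resolvable HTTP URI (hypothetical base URL)."""
    return base + str(u)


def as_lsid(u: uuid.UUID, authority: str = "example.org",
            namespace: str = "names") -> str:
    """Wrap the identifier in an LSID (hypothetical authority and namespace)."""
    return f"urn:lsid:{authority}:{namespace}:{u}"


# The same identifier appears unchanged inside every wrapper.
for wrapped in (as_urn(identifier), as_http_uri(identifier), as_lsid(identifier)):
    print(wrapped)
```

In this sketch the identifier stays the same whichever wrapper is used, so a resolution service or protocol can change without minting new identifiers.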

A name-string is a combination of characters that represents a scientific name. Each scientific name can be expressed by many name-strings. We use the UUID version 5 standard to convert a name-string into a Name-String Identifier. Such an identifier can be generated independently in any programming language, and the result is exactly the same for the same name-string, so biological information bound to name-strings from many different sources is easily interconnected. These identifiers are always 128 bits long and can be managed easily with existing tools. When printed on paper, the identifier is as unambiguous as its electronic version.
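A minimal Python sketch of this kind of deterministic generation follows. UUID version 5 needs a namespace UUID in addition to the name; the abstract does not state which namespace is used, so the one below, derived from the domain name `globalnames.org`, is an assumption, as are the function name and the example name-strings.

```python
import uuid

# Assumption: a namespace derived from a domain name. The abstract does not
# specify the namespace, so "globalnames.org" is only illustrative here.
ASSUMED_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")


def name_string_id(name_string: str) -> uuid.UUID:
    """Return a deterministic UUID v5 for the exact characters of a name-string."""
    return uuid.uuid5(ASSUMED_NAMESPACE, name_string)


# The same name-string always yields the same identifier ...
first = name_string_id("Homo sapiens Linnaeus, 1758")
second = name_string_id("Homo sapiens Linnaeus, 1758")

# ... while a different name-string for the same name yields a different one.
variant = name_string_id("Homo sapiens L.")

print(first == second)   # True
print(first == variant)  # False
print(first.version, len(first.bytes) * 8)  # 5 128
```

Because the identifier depends only on the namespace and the characters of the name-string, two parties who agree on the namespace can compute matching identifiers without contacting each other or a registry. Any normalization of the string (whitespace, Unicode form) would have to be agreed on separately, since UUID v5 hashes the input as given.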

Global Names Usage Bank identifiers (GNUB IDs) are created at the nomenclatural level. They provide a solid foundation for tracking Taxon Name Usages (TNUs) worldwide and make it easy to organize the nomenclatural events associated with them. Every scientific name has an “original” Protonym TNU, and a UUID is created for every Protonym TNU. These UUIDs can then be combined to represent a mapping to scientific names. For example, a binomial scientific name is expressed as a combination of the UUIDs generated for the Protonym TNUs of its genus and its species. GNUB IDs exclude the possibility of homonyms and greatly simplify finding and organizing species information.
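The combination of Protonym UUIDs can be sketched roughly as follows. The abstract does not describe how GNUB stores or combines these identifiers, so the `Protonym` and `Binomial` types, the use of random (version 4) UUIDs, and the helper functions are all hypothetical. The example uses the genus name Aotus, which is spelled identically for a genus of monkeys and a genus of plants.

```python
import uuid
from typing import NamedTuple


class Protonym(NamedTuple):
    """Hypothetical record for a Protonym TNU: a UUID plus a label and rank."""
    id: uuid.UUID
    label: str
    rank: str


class Binomial(NamedTuple):
    """A binomial expressed as a pair of Protonym UUIDs (assumed representation)."""
    genus_id: uuid.UUID
    species_id: uuid.UUID


def mint_protonym(label: str, rank: str) -> Protonym:
    # Assumption: each Protonym TNU gets its own freshly minted random UUID.
    return Protonym(uuid.uuid4(), label, rank)


def combine(genus: Protonym, species: Protonym) -> Binomial:
    """Build the binomial identifier from a genus and a species Protonym."""
    if genus.rank != "genus" or species.rank != "species":
        raise ValueError("expected a genus Protonym and a species Protonym")
    return Binomial(genus.id, species.id)


# Two homonymous genera: same spelling, separate Protonyms, separate UUIDs.
aotus_monkey = mint_protonym("Aotus", "genus")
aotus_plant = mint_protonym("Aotus", "genus")

monkey = combine(aotus_monkey, mint_protonym("trivirgatus", "species"))
plant = combine(aotus_plant, mint_protonym("ericoides", "species"))

print(aotus_monkey.label == aotus_plant.label)  # True: identical strings
print(monkey.genus_id == plant.genus_id)        # False: distinct identifiers
```

In this sketch the two genera cannot be confused even though their spellings match, because the comparison is made on the UUIDs rather than on the text of the name.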