Missouri Botanical Garden Open Conference Systems, TDWG 2016 ANNUAL CONFERENCE

Semantic Annotation for Tabular Data
John Deck

Building: CTEC
Room: Auditorium
Date: 2016-12-05 02:45 PM – 03:00 PM
Last modified: 2016-10-15


Tabular data, expressed as spreadsheets and tab- or comma-delimited files, are a convenient and common way to store and transmit biodiversity data. However, tabular data are all too often “dark” data, lacking context and consistency, with little clarity about exactly what the data refer to: for example, whether a set of fields in a “row” refers to a curated specimen, a living individual that is being tracked on an ongoing basis, or an observation. Common difficulties in working with dark data include values with no units, identifiers that are missing or only local in scope, and especially a lack of context for the relationships between data values in different columns. These issues are a true impediment to sharing and integrating data from distributed sources. While this topic has received a lot of attention in recent years, implementations that offer usable solutions for helping users improve semantic clarity and create instance identifiers have lagged. This talk will explore a method for validating and classifying instance data based on project management rules expressed in an XML (Extensible Markup Language) configuration file, in a form intended to be useful for biologists and data managers. Beginning with the necessary steps of project configuration and data validation, we will follow a sample input file from the National Phenology Network (NPN) as it is loaded into the Biocode Field Information Management System (http://biscicol.org/), then look at the resulting triples and discuss implications and future directions.
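
The workflow described in the abstract (rules held in an XML configuration file, validation of tabular rows, then conversion to triples) can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration: the element and attribute names in the configuration, the `example.org` URIs, the column names, and the sample rows are hypothetical and do not reproduce the actual Biocode FIMS configuration format or real NPN data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical project configuration. The element and attribute names are
# invented for this sketch and are NOT the real Biocode FIMS schema.
CONFIG_XML = """
<project>
  <entity idColumn="observation_id"
          class="http://example.org/terms/Observation"/>
  <rule type="required" column="observation_id"/>
  <rule type="required" column="species"/>
  <rule type="numeric" column="day_of_year"/>
  <attribute column="species" uri="http://example.org/terms/species"/>
  <attribute column="day_of_year" uri="http://example.org/terms/dayOfYear"/>
</project>
"""

# Made-up sample input; the second data row is deliberately invalid.
SAMPLE_CSV = """observation_id,species,day_of_year
obs1,Acer rubrum,95
obs2,,one hundred
"""

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def validate_row(row, config):
    """Check one row against every <rule> and return a list of messages."""
    problems = []
    for rule in config.findall("rule"):
        column = rule.get("column")
        value = (row.get(column) or "").strip()
        if rule.get("type") == "required" and not value:
            problems.append(f"{column} is required")
        elif rule.get("type") == "numeric" and value and not is_number(value):
            problems.append(f"{column} must be numeric")
    return problems


def to_triples(row, config, base="http://example.org/id/"):
    """Turn one valid row into (subject, predicate, object) tuples."""
    entity = config.find("entity")
    # Build an instance identifier from the locally scoped ID column.
    subject = base + row[entity.get("idColumn")]
    triples = [(subject, RDF_TYPE, entity.get("class"))]
    for attribute in config.findall("attribute"):
        value = (row.get(attribute.get("column")) or "").strip()
        if value:
            triples.append((subject, attribute.get("uri"), value))
    return triples


config = ET.fromstring(CONFIG_XML)
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

errors = {}   # observation_id -> list of validation messages
triples = []  # triples produced from rows that passed validation
for row in rows:
    problems = validate_row(row, config)
    if problems:
        errors[row["observation_id"]] = problems
    else:
        triples.extend(to_triples(row, config))

for triple in triples:
    print(triple)
for row_id, problems in errors.items():
    print(row_id, "rejected:", "; ".join(problems))
```

Keeping the rules in the configuration rather than in the code means that, under these assumptions, a data manager could adjust required columns or term mappings without touching the loader itself.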