Missouri Botanical Garden Open Conference Systems, TDWG 2016 ANNUAL CONFERENCE

Fresh Data: what's new and what's interesting?
Jennifer Hammock, Jorrit Poelen

Building: CTEC
Room: Auditorium
Date: 2016-12-07 05:15 PM – 05:30 PM
Last modified: 2016-10-15


This talk describes a use case in which Big Data analysis fosters transparency and communication among several communities interested in biodiversity data.

Fresh Data is a suite of services for monitoring new biodiversity data matching specific queries across multiple biodiversity data sources, and for notifying data providers when their data has been requested. Our goal is to connect time-sensitive data consumers (primarily researchers) with data producers (wildlife observers) in a meaningful but unobtrusive way. The community we seek to serve is made up of non-professional observers on platforms such as http://citsci.org/ and http://www.inaturalist.org/, who would not otherwise know they were documenting scientifically relevant data.

For these contacts to be useful, they must be fast. A subscribed researcher with a saved query should learn of a relevant new data point within a few days of the observation, and an observer should learn as quickly as possible that they have reported something a researcher needed. This will encourage timely reactions: additional reports by the observer, recruitment of other observers, and direct communication from the researcher if desired.

To attract researchers to the monitoring tool, its search must be comprehensive, including the data sources they already rely on. Thus, the search index includes both GBIF and iDigBio data, as well as orphan data sources not yet aggregated.

Each data source is updated individually, on a schedule set appropriately for that source; priority communities with short internal lag times (e.g., iNaturalist) are updated most frequently. Whole aggregator datasets (GBIF, iDigBio) are refreshed as frequently as capacity permits. Some of the priority communities have their data hosted at GBIF; their datasets are also indexed separately to allow faster update schedules, and records are deduplicated by occurrence ID.
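
As an illustration of the deduplication step, the following is a minimal Python sketch, not the Fresh Data implementation. The `Record` type, its field names, and the rule of keeping the most recently indexed copy are assumptions made for the example; the abstract states only that records are deduplicated by occurrence ID.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    occurrence_id: str  # identifier shared by copies of the same record
    source: str         # hypothetical source label, e.g. "inaturalist" or "gbif"
    indexed_on: date    # hypothetical field: date this copy entered the index


def deduplicate(records):
    """Keep one record per occurrence ID.

    Assumption for this sketch: when the same occurrence arrives from two
    sources, the most recently indexed copy is kept.
    """
    best = {}
    for rec in records:
        kept = best.get(rec.occurrence_id)
        if kept is None or rec.indexed_on > kept.indexed_on:
            best[rec.occurrence_id] = rec
    return list(best.values())
```

Under this assumed rule, a record indexed directly from a fast-updating community would replace an older copy of the same occurrence that arrived through an aggregator refresh.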

Services are documented at https://github.com/gimmefreshdata/freshdata/wiki/api and include:

- all occurrence records, filtered by taxonomic and geographic parameters, occurrence date, and date added to Fresh Data (supports data monitors for interested researchers)

- monitored occurrence records only, filtered by the same parameters and also by data source (supports data usage reports per interested data source)

- query parameters for all monitors, filtered by all the above parameters and also by occurrence ID (supports query dissemination, e.g., your Urania Swallowtail report was sent to a researcher interested in Lepidoptera in the Caribbean)
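
The matching behind these services can be pictured with a small in-memory Python sketch. It does not call the Fresh Data API. The `Occurrence` and `Monitor` types, their field names, and the bounding-box representation are all hypothetical, chosen only to show how a saved query with taxonomic, geographic, and date-added filters might be tested against a new record.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Occurrence:
    occurrence_id: str
    taxon_path: Tuple[str, ...]  # hypothetical lineage, e.g. ("Animalia", "Insecta", "Lepidoptera")
    lat: float
    lng: float
    observed_on: date
    added_on: date               # date the record was added to the index


@dataclass(frozen=True)
class Monitor:
    taxon: str                                # taxon name that must appear in the lineage
    bbox: Tuple[float, float, float, float]   # (min_lat, min_lng, max_lat, max_lng)
    added_after: Optional[date] = None        # only match records added after this date


def matches(monitor, occ):
    """Return True if the occurrence satisfies every filter of the monitor."""
    min_lat, min_lng, max_lat, max_lng = monitor.bbox
    if monitor.taxon not in occ.taxon_path:
        return False
    # Simple box test; does not handle boxes that cross the antimeridian.
    if not (min_lat <= occ.lat <= max_lat and min_lng <= occ.lng <= max_lng):
        return False
    if monitor.added_after is not None and occ.added_on <= monitor.added_after:
        return False
    return True


def monitors_for(occ, monitors):
    """List the monitors a single occurrence satisfies (query dissemination)."""
    return [m for m in monitors if matches(m, occ)]
```

In this sketch, `monitors_for` corresponds loosely to the third service: given one observer's record, it returns the saved queries that the record satisfied, which is the information an observer would need to learn who was interested in their report.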