Missouri Botanical Garden Open Conference Systems, TDWG 2016 ANNUAL CONFERENCE

Why you must clean your big-data
Tomer Gueta, Yohay Carmel

Building: CTEC
Room: Auditorium
Date: 2016-12-06 05:00 PM – 05:15 PM
Last modified: 2016-10-15


Aggregated big biodiversity databases are prone to numerous data errors and biases. Improving the quality of biodiversity research depends, in some measure, on improving user-level data cleaning tools and skills. Adopting a more comprehensive approach that incorporates data cleaning into data analysis will not only improve the quality of biodiversity data but also promote more appropriate use of such data. We estimated the effect of user-level data cleaning on species distribution model (SDM) performance and demonstrated the value of more intensive, case-specific data cleaning, which is rarely conducted by users of biodiversity big data. We implemented several relatively simple, easy-to-execute data cleaning procedures and tested the resulting improvement in SDM performance, using GBIF occurrence data for Australian mammals at six different spatial scales.

Occurrence data for all Australian mammals (1,041,941 records, 297 species) were downloaded from the Australian GBIF node. In parallel, 24 raster layers of environmental variables in Australia (elevation, land use, NDVI, and 21 climatic variables) were compiled at a spatial resolution of 1 km². A maximum entropy (MaxEnt) model was fitted for each species in each grid cell, once with the data before user-level cleaning and once with the data after it. We compared model performance before and after cleaning using a one-tailed paired Z-test. The cleaning procedures used in this research improved SDM performance significantly, across all scales and for all performance measures. This finding showcases the value of user-level data cleaning for big data, regardless of spatial scale.
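The before/after comparison can be sketched as a one-tailed paired Z-test on per-model performance scores. This is a minimal illustration in Python, not the authors' code: the function name `paired_z_test_greater` and the sample scores are hypothetical, and the normal approximation assumes a large number of paired models (the five pairs below only keep the example short).

```python
import math
from statistics import NormalDist, mean, stdev


def paired_z_test_greater(before, after):
    """One-tailed paired Z-test of H1: mean(after - before) > 0.

    Returns (z, p), where p is the upper-tail probability under the
    standard normal distribution. Assumes many pairs, so that the sample
    standard deviation of the differences is a good estimate.
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all differences are identical; Z is undefined")
    z = mean(diffs) / (sd / math.sqrt(n))
    p = 1.0 - NormalDist().cdf(z)  # upper tail only: one-tailed test
    return z, p


# Hypothetical performance scores for the same five models,
# fitted before and after cleaning (invented values, not study results).
scores_before = [0.70, 0.72, 0.68, 0.75, 0.71]
scores_after = [0.74, 0.75, 0.73, 0.77, 0.76]

z_stat, p_value = paired_z_test_greater(scores_before, scores_after)
print(f"z = {z_stat:.3f}, one-tailed p = {p_value:.3g}")
```

Pairing matters here because each model is evaluated twice on the same species and scale, so the test works on the per-model differences rather than on two independent groups.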

In typical research, data are very expensive, and filtering out or removing a large proportion of the data is inconceivable. In contrast, in the big-data world, data are plentiful and relatively inexpensive, and it is sometimes worthwhile to discard large volumes of data for the sake of data quality. Here, for example, we discarded half a million records, which constituted 50% of the database, in order to increase data quality. Thus, tools for easy yet advanced querying of the data are as important as tools for detecting and correcting errors. The results of our study stress the need for data validation and cleaning tools that incorporate customizable techniques. We plan to develop an R package that will facilitate comprehensive, structured, and reproducible data cleaning.
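To make the idea of customizable, reproducible cleaning concrete, here is a small Python sketch of a rule-based filter over occurrence records. It is an assumption-laden illustration, not the procedures used in the study and not the planned R package: the rule functions, the thresholds, and the rough bounding box for Australia are all hypothetical, and the only borrowed names are the Darwin Core field names `decimalLatitude`, `decimalLongitude`, and `coordinateUncertaintyInMeters`.

```python
from collections import Counter


def has_coordinates(rec):
    """Keep records that carry both a latitude and a longitude."""
    return (rec.get("decimalLatitude") is not None
            and rec.get("decimalLongitude") is not None)


def within_bbox(min_lat, max_lat, min_lon, max_lon):
    """Build a rule that keeps records inside a latitude/longitude box."""
    def check(rec):
        lat, lon = rec.get("decimalLatitude"), rec.get("decimalLongitude")
        if lat is None or lon is None:
            return False
        return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
    return check


def max_uncertainty(limit_m):
    """Build a rule that keeps records with a stated uncertainty <= limit_m."""
    def check(rec):
        u = rec.get("coordinateUncertaintyInMeters")
        return u is not None and u <= limit_m
    return check


def clean(records, rules):
    """Apply named rules in order; report how many records each rule dropped."""
    kept, dropped = [], Counter()
    for rec in records:
        for name, rule in rules:
            if not rule(rec):
                dropped[name] += 1  # attribute the drop to the first failing rule
                break
        else:
            kept.append(rec)
    return kept, dropped


# Hypothetical rule set; the box is only a rough outline of Australia.
rules = [
    ("has_coordinates", has_coordinates),
    ("within_bbox", within_bbox(-44.0, -10.0, 112.0, 154.0)),
    ("max_uncertainty", max_uncertainty(1000)),
]

# Invented example records, not GBIF data.
records = [
    {"species": "Macropus giganteus", "decimalLatitude": -33.9,
     "decimalLongitude": 151.2, "coordinateUncertaintyInMeters": 100},
    {"species": "Macropus giganteus", "decimalLatitude": None,
     "decimalLongitude": None, "coordinateUncertaintyInMeters": None},
    {"species": "Vombatus ursinus", "decimalLatitude": 51.5,
     "decimalLongitude": -0.1, "coordinateUncertaintyInMeters": 50},
    {"species": "Vombatus ursinus", "decimalLatitude": -37.8,
     "decimalLongitude": 145.0, "coordinateUncertaintyInMeters": 25000},
]

kept, dropped = clean(records, rules)
print(len(kept), dict(dropped))
```

Keeping the rules as an ordered, named list means a study can publish its exact rule set alongside the per-rule drop counts, which is one way to make heavy filtering auditable.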