Missouri Botanical Garden Open Conference Systems, TDWG 2016 ANNUAL CONFERENCE

Data Quality at Scale: Bridging the Gap between Datum and Data
Alexander Thompson, Matthew Collins

Building: CTEC
Room: Auditorium
Date: 2016-12-07 05:00 PM – 05:15 PM
Last modified: 2016-10-15


This talk takes a practical look at implementing high-throughput, high-volume data quality processing to provide efficient and effective feedback on data quality at the scale of an aggregator with tens of millions of records. Topics include the tradeoffs between coverage and accuracy, using the Apache Spark processing framework to iterate rapidly on data quality workflows across large volumes of data, and methods for capturing the results of large-scale data quality work for distribution back to data providers. The examples are drawn from work the iDigBio team has done implementing data quality workflows across all of the data we have collected, and the talk also compares and contrasts our methods with those of other projects.