Missouri Botanical Garden Open Conference Systems, TDWG 2014 ANNUAL CONFERENCE

Scalable and Provenance-Enabled Scientific Workflows for Species Distribution Modeling
Luiz Gadelha, Guilherme Gall, Andrea Sanchez-Tapia, Marinez Ferreira de Siqueira, Jorge Velásquez, Daniel Lopez

Building: Elmia Congress Centre, Jönköping
Room: Rum 10
Date: 2014-10-28 04:54 PM – 05:07 PM
Last modified: 2014-10-03


Tools for analysis and synthesis of biodiversity data, such as species distribution modeling (SDM) [Townsend Peterson et al. 2011], are widely used. These analyses typically employ several different applications executed in a loosely coupled manner, a typical use case for scientific workflow management systems [Deelman et al. 2009]. For example, in the case of SDM, global climatological data is retrieved from environmental data providers, while species occurrence data is obtained from providers such as GBIF. The data often have to be adjusted with geographical information system tools or filtered using data quality control tools. After these pre-processing steps, SDM algorithms, such as Maxent [Phillips et al. 2004], are applied to predict the potential distribution of species from the environmental and species occurrence data acquired and prepared during pre-processing. Finally, a post-processing step is performed, in which statistical and data visualization tools are used to analyze the modeling results. This process is computationally demanding, which makes it important to use scalable tools. The Brazilian Biodiversity Information System (SiBBr) implemented a prototype scientific workflow for SDM [SiBBr Github 2014], which allows the execution of various algorithms available in the openModeller [Muñoz 2011] SDM library. The implementation was done in Swift [Wilde et al. 2011], a system for managing scientific workflows with an emphasis on parallelism and distribution. The implemented parallelization strategy uses one thread of execution per species modeled in the workflow. Other opportunities for parallelism that can be exploited include running one thread per SDM algorithm, per climatological scenario provided by the Intergovernmental Panel on Climate Change (IPCC) [IPCC 2014], or per set of SDM algorithm parameters.
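The per-species parallelization strategy described above can be illustrated with a minimal Python sketch. This is not the SiBBr workflow, which is written in the Swift parallel scripting language; the function names `model_species` and `run_workflow`, the placeholder "model", and the sample data structure are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def model_species(species, occurrences):
    """Hypothetical stand-in for one SDM run for a single species.

    A real workflow would invoke a modeling tool here; this placeholder
    only counts the occurrence records it was given.
    """
    return {"species": species, "n_records": len(occurrences)}


def run_workflow(occurrences_by_species, max_workers=4):
    """Submit one independent task per species and collect results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            species: pool.submit(model_species, species, records)
            for species, records in occurrences_by_species.items()
        }
        # result() blocks until the corresponding task has finished.
        return {species: fut.result() for species, fut in futures.items()}
```

Because each species is modeled independently, the same pattern would extend to the other axes mentioned above (algorithm, climate scenario, parameter set) by submitting one task per combination instead of one per species.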
The workflow was executed on a shared-memory machine with 72 GB of memory and 24 processing cores, and showed good scalability. One advantage of using Swift to implement the scientific workflow for SDM is its native support for recording data provenance [Gadelha et al. 2012], which facilitates both the reproducibility of the computational experiment and its analysis and validation. Currently, additional scripts based on the DISMO library [DISMO 2014] for R are being added to the scientific workflow. Finally, as future work, a web interface for these scientific workflows will be provided, allowing both SDM experts and taxonomists to interact, similarly to BioModelos [BioModelos 2014], to provide high-quality predictions of species distributions.
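To make the idea of provenance recording concrete, the following Python sketch logs a fingerprint of each task's inputs and output so that a run can later be checked or compared. It does not reproduce Swift's provenance model; the helper names `fingerprint` and `run_with_provenance` and the record layout are hypothetical.

```python
import hashlib
import json


def fingerprint(obj):
    """Return the SHA-256 hex digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_provenance(task_name, func, inputs, log):
    """Run func(**inputs) and append a provenance record to log.

    The record layout (task name plus input and output digests) is an
    illustrative assumption, not a standard provenance format.
    """
    output = func(**inputs)
    log.append({
        "task": task_name,
        "inputs_sha256": fingerprint(inputs),
        "output_sha256": fingerprint(output),
    })
    return output
```

Re-running a task with identical inputs yields the same input digest, so differing output digests would flag a non-reproducible step.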