Missouri Botanical Garden Open Conference Systems, TDWG 2013 ANNUAL CONFERENCE

Numerical tree representations and their application to MCMC phylogenetic inference, phylogenetic signal, and ... database query
Saverio Vicario

Building: Grand Hotel Mediterraneo
Room: America del Nord (Theatre I)
Date: 2013-10-31 09:35 AM – 09:45 AM
Last modified: 2013-10-05

Abstract


Phylogenetic information is structured as a tree (topology plus branch lengths). There is neither a unique nor a complete way to transform this information into a scalar or vector numerical variable. This complexity makes it difficult to reuse methods from other fields of science for both analysis and archiving, and it has led to a proliferation of ad hoc data formats and comparison metrics that are not necessarily well grounded or particularly efficient. I advocate that choosing a numerical representation of a phylogenetic tree that is correct for the given problem should guide development in the field. I will show two applications, also implemented in BioVeL, to illustrate the point.

Billera's tree space (Billera et al. 2001) is a geometric tree representation. Distance in this space has been shown to be a correct generalization of the Robinson-Foulds distance and to be monotonically correlated with the likelihood score under a great variety of conditions (Battagliero et al. 2011). We applied this metric to quantify the overlap between the tree distributions of independent MCMC runs, in order to estimate the probability that the runs have converged within a Kolmogorov-Smirnov test framework. An effective sample size (ESS) of the tree parameter is computed to estimate the correct degrees of freedom. The software implementation of the approach is available at http://mblabproject.it/geoks/, and it is also exposed as a web service at https://www.biodiversitycatalogue.org/rest_methods/115.
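
The following is a minimal sketch of the Kolmogorov-Smirnov step only, not the geoks implementation. It assumes that each MCMC run has already been reduced to a list of scalar distances (for example, from each sampled tree to a common reference tree) and that the ESS replaces the raw sample count in the standard asymptotic p-value. The function names, the reduction to distances from a reference tree, and the numbers are all illustrative assumptions.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in a + b:  # the largest gap is always reached at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_pvalue(d, ess_a, ess_b, terms=100):
    """Asymptotic KS p-value using effective sample sizes (an assumption here)."""
    n_eff = ess_a * ess_b / (ess_a + ess_b)
    lam = math.sqrt(n_eff) * d
    if lam < 0.2:  # the series converges slowly here and the p-value is ~1
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))


# Made-up distances of sampled trees to a reference tree, one list per run.
run_1 = [0.10, 0.12, 0.15, 0.11, 0.14, 0.13]
run_2 = [0.11, 0.13, 0.16, 0.12, 0.15, 0.14]
d = ks_statistic(run_1, run_2)
print(d, ks_pvalue(d, ess_a=6, ess_b=6))
```

Because MCMC samples are autocorrelated, using the raw number of sampled trees would overstate the evidence against convergence; a smaller ESS gives a larger, more conservative p-value for the same statistic.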

The service is used within a workflow (http://www.myexperiment.org/packs/371.html) that includes an AIC-based method for defining the partitioned model, Bayesian phylogenetic inference, and a posterior predictive test. This workflow, which can be run in the BioVeL portal (https://portal.biovel.eu/), is designed to detect the main phylogenetic pitfalls.

In the context of community diversity, phylogenetic entropy has been proposed as a measure (Allen et al. 2009; Chao et al. 2010). It quantifies the uncertainty about the ancestry of an organism taken at random from the leaves of a tree. I extended this concept to estimate species turnover (beta diversity) across groups of samples (http://www.myexperiment.org/workflows/3569.html).
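
As an illustration, phylogenetic entropy in the sense of Allen et al. (2009) can be written as H_p = -sum over branches b of L(b) * p(b) * ln p(b), where L(b) is the branch length and p(b) is the total relative abundance of the leaves descending from b. The sketch below computes it in a single post-order traversal. The nested-tuple tree encoding and the function name are assumptions made for this example and are not the format used by the BioVeL workflow.

```python
import math

# Assumed encoding: a node is (branch_length, payload), where payload is
# either a leaf name (str) or a list of child nodes.


def phylogenetic_entropy(tree, abundance):
    """H_p = -sum_b L(b) * p(b) * ln p(b), computed in one post-order pass."""
    total = sum(abundance.values())
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")

    def visit(node):
        length, payload = node
        if isinstance(payload, str):      # leaf: p is its relative abundance
            p, h = abundance.get(payload, 0) / total, 0.0
        else:                             # internal node: sum over children
            p, h = 0.0, 0.0
            for child in payload:
                child_p, child_h = visit(child)
                p += child_p
                h += child_h
        if p > 0:                         # 0 * ln 0 is taken as 0
            h -= length * p * math.log(p)
        return p, h

    return visit(tree)[1]


# Toy tree ((A:1,B:1):1,C:2) with equal abundances (made-up data).
toy_tree = (0.0, [(1.0, [(1.0, "A"), (1.0, "B")]), (2.0, "C")])
print(phylogenetic_entropy(toy_tree, {"A": 1, "B": 1, "C": 1}))
```

For a star tree with unit branch lengths, this expression reduces to the ordinary Shannon entropy of the leaf abundances, which is a convenient sanity check.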

From a research infrastructure point of view, phylogenetic entropy could be used as a checksum. Regardless of the file format used, the phylogenetic entropy is a unique scalar number that describes the structure of the tree. A pre-calculated phylogenetic checksum would detect identity between different versions of the same tree and, together with the list of leaf identifiers, would form an intrinsic unique identifier.
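
One hypothetical way to combine a pre-calculated entropy value with the leaf identifiers is sketched below. The function name, the rounding precision, and the use of SHA-256 over a JSON string are assumptions chosen for illustration; the abstract does not specify how the identifier would be built.

```python
import hashlib
import json


def tree_identifier(entropy, leaf_ids, digits=10):
    """Hash a rounded entropy value together with the sorted leaf identifiers."""
    # Rounding absorbs floating-point noise from different traversal orders
    # (the number of digits is an arbitrary choice for this sketch).
    key = json.dumps([f"{entropy:.{digits}e}", sorted(leaf_ids)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# The same tree stored with leaves listed in a different order gets the same identifier.
print(tree_identifier(0.6931471805599453, ["B", "A"]))
print(tree_identifier(0.6931471805599453, ["A", "B"]))
```

Sorting the leaf identifiers makes the result independent of the order in which a particular file format happens to list them.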

From the same perspective, a phylogenetic beta diversity estimate could measure the mutual information between a tree structure and a given categorical vector, making it possible to retrieve the trees that best match the pattern of a categorical variable. The estimate can be obtained with three tree traversals, two of which could be pre-calculated.
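
One plausible reading of this quantity, assuming the usual additive partition of entropy (beta = gamma minus mean alpha), is given below. The symbols are assumptions introduced for illustration: c indexes the categories of the categorical vector, w_c is the weight of category c (with the weights summing to 1), p_c(b) is the relative abundance below branch b within category c, and L(b) is the length of branch b.

```latex
\begin{aligned}
H_p^{(c)}      &= -\sum_{b \in T} L(b)\, p_c(b) \ln p_c(b), \\
H_p^{\gamma}   &= -\sum_{b \in T} L(b)\, \bar{p}(b) \ln \bar{p}(b),
\qquad \bar{p}(b) = \sum_{c} w_c\, p_c(b), \\
I(T; C)        &= H_p^{\gamma} - \sum_{c} w_c\, H_p^{(c)} \;\ge\; 0 .
\end{aligned}
```

For a star tree with unit branch lengths this reduces to the ordinary mutual information between leaf identity and category. The quantity is zero when every category has the same abundance profile along the tree and grows as the categories occupy different parts of it.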

 

References

Billera LJ, Holmes SP, Vogtmann K. Geometry of the Space of Phylogenetic Trees. Advances in Applied Mathematics. 2001;27(4):733-767.

Battagliero S, Puglia G, Vicario S, et al. An Efficient Algorithm for Approximating Geodesic Distances in Tree Space. IEEE; 2011:1196-1207.

Chao A, Chiu C-H, Jost L. Phylogenetic diversity measures based on Hill numbers. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences. 2010;365(1558):3599-3609.