Building: Main Building 1st Floor
Room: Salone degli Oceani
Last modified: 2013-10-01
Abstract
The absence of accessible and computable phenotypic data is perhaps the most pressing scientific bottleneck to integration across a diverse set of key fields in biology, including genomics, genetics, development, evolution, ecology, and taxonomy. Over the past several decades, the biological community has embraced the goal of integrative biology, forming cross-disciplinary initiatives, departments, and societies, and bringing together areas of research that had been historically separated, e.g., development and evolution under ‘devo-evo’. These efforts, largely involving ‘manual’ integration of limited datasets, have been hugely successful, but they are increasingly reliant on the large and rapidly growing genomic and genetic data stores. At this juncture, discoveries in many areas of biology rely on integrating genomic data with phenotypic data, and this integration is at an impasse because computable and accessible phenotypic data are lacking across species. Here we describe the strategy we have successfully employed to computationally compare phenotypic data across mammalian taxonomic groups, thereby gaining insight into genotype-to-phenotype relationships.
There are significant challenges to this undertaking, but it can increase efficiency, reduce data loss and duplication of effort, and facilitate cross-domain reuse of phenotype data. Phenotypic data are complex, spanning many levels of biological organization and a variety of possible observations, and involving a temporal developmental component and an environmental context. Phenotypes are straightforward to describe in words, i.e., in ‘natural language’ or ‘free text’, but they are more difficult to capture in a form that computers can process. The legacy of comparative biology is locked in this free-text form.
Our comparison algorithm is based on an initial pairwise semantic similarity of individual phenotypic features. The resulting pairwise phenotypic relevance scores are based on logical definitions (Entity-Quality statements) and identify biologically equivalent phenotypes across species where simple lexical matching is not possible. Logical definitions are provided for phenotypic features related to biological processes, small molecules, cell types, and anatomical structures. The semantic matching approach allows similar but non-exact phenotypes to be detected. For each pair of phenotypes it generates a score that reflects both how similar the two phenotypes are and how specific the match is: generalized phenotypes that are seen in many diseases receive a lower score. An overall similarity score is obtained by averaging across all the pairwise comparisons.
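The scoring idea can be illustrated with a minimal sketch. It is not the implementation used in this work: the toy ontology, the annotation counts, and the function names are all hypothetical, and specificity is modelled here with an information-content measure (the score of a pair is the information content of its most specific shared ancestor), which is one common way to make widely seen phenotypes score lower.

```python
import math
from itertools import product

# Hypothetical toy ontology (child -> parents); not taken from the abstract.
PARENTS = {
    "phenotype": [],
    "abnormal eye": ["phenotype"],
    "abnormal heart": ["phenotype"],
    "small eye": ["abnormal eye"],
    "absent eye": ["abnormal eye"],
}

# Hypothetical counts of diseases annotated directly to each term.
DIRECT_COUNTS = {
    "phenotype": 0,
    "abnormal eye": 2,
    "abnormal heart": 6,
    "small eye": 1,
    "absent eye": 1,
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Specificity: terms covering many annotations get a low value."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(n for t, n in DIRECT_COUNTS.items() if term in ancestors(t))
    return -math.log(covered / total)


def pairwise_score(a, b):
    """Score two phenotypes by their most specific shared ancestor."""
    shared = ancestors(a) & ancestors(b)
    return max(information_content(t) for t in shared)


def overall_score(profile_a, profile_b):
    """Average the pairwise scores over all cross-profile comparisons."""
    scores = [pairwise_score(a, b) for a, b in product(profile_a, profile_b)]
    return sum(scores) / len(scores)
```

In this sketch, "small eye" and "absent eye" match non-exactly through their shared parent "abnormal eye", whereas "small eye" and "abnormal heart" share only the root term and therefore score zero.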