Developing a phyloinformatic framework for analysing multigene data matrices: grass database, missing data and mixed models

Applicant Salamin Nicolas
Number 116412
Funding scheme Project funding (Div. I-III)
Research institution Département d'Ecologie et d'Evolution Faculté de Biologie et de Médecine Université de Lausanne
Institution of higher education University of Lausanne - LA
Main discipline Botany
Start/End 01.10.2007 - 31.12.2010
Approved amount 261'515.00
Keywords phylogenetic trees; phyloinformatics; multigene alignment; DNA sequences; maximum likelihood; computer simulations

This project proposes to develop a phyloinformatic framework to improve several aspects of the analysis of multigene data matrices. The project has four distinct goals.

(i) The implementation of a database storing aligned DNA sequences for the grass family. The database will be regularly updated by querying existing sequence databases. Automated tools will sift the sequences to obtain the maximum number of DNA regions and species usable for phylogenetic studies. Besides alignments, the database will store a phylogenetic tree for each DNA region considered, as well as a tree for the combined DNA regions, which will represent the largest existing grass phylogenetic tree. An instantaneous view of grass evolutionary history will be available online and will help future sampling strategies aimed at building the grass Tree of Life.

(ii) The development and assessment of algorithms allowing efficient tree reconstruction from multigene matrices containing large amounts of missing data. Such matrices are becoming increasingly common in phylogenetic studies, and the inclusion of large amounts of missing data can affect the accuracy, resolution and support of the estimated phylogenetic tree. Removing taxa with missing data is often not efficient in macro-evolutionary studies, and obtaining the most resolved tree is important for the statistical power of analyses that use trees as starting points.

(iii) The characterisation of model parameters important in the analysis of multigene data matrices. When analysing multigene matrices, either a single model of DNA evolution is applied to all partitions, or different models are applied to each partition separately. In the first case, oversimplification can result in inconsistent inference, while the second case could lead to overparameterisation. We will use computer simulations to investigate the effect of using different model parameters on the accuracy of the estimated topologies and branch lengths. This part of the project will propose guidelines on how best to analyse multigene data matrices.

(iv) The development of a tool selecting appropriate models of DNA evolution for multigene data matrices. This tool will determine which models are shared among partitions and which model parameters should be linked or unlinked across partitions of the data set. It will be built upon a performance-based approach to model selection. This tool will help in selecting appropriate models of DNA evolution and should therefore be useful in reducing inconsistent inference.
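
To make the notion of a multigene matrix with missing data concrete, the following minimal Python sketch concatenates per-region alignments into one combined matrix and reports how much of it is missing. It is an illustration under stated assumptions, not the project's implementation: the function names, the use of "?" as the missing-data symbol, and the toy region and species names are all hypothetical.

```python
from typing import Dict, List, Tuple

MISSING = "?"  # assumed symbol for a site with no data for a taxon


def build_supermatrix(
    regions: Dict[str, Dict[str, str]]
) -> Tuple[Dict[str, str], List[Tuple[str, int, int]]]:
    """Concatenate per-region alignments into one matrix.

    `regions` maps a region name to {taxon: aligned sequence}.
    Taxa absent from a region are padded with MISSING.
    Returns the matrix and the partition boundaries as
    (region name, start, end) with a half-open interval.
    """
    taxa = sorted({taxon for aln in regions.values() for taxon in aln})
    pieces: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    partitions: List[Tuple[str, int, int]] = []
    start = 0
    for name in sorted(regions):
        aln = regions[name]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"region {name!r} is empty or not aligned")
        width = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(aln.get(taxon, MISSING * width))
        partitions.append((name, start, start + width))
        start += width
    matrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return matrix, partitions


def missing_fraction(matrix: Dict[str, str]) -> Tuple[Dict[str, float], float]:
    """Return the per-taxon and overall fraction of MISSING sites."""
    per_taxon = {
        taxon: seq.count(MISSING) / len(seq) for taxon, seq in matrix.items()
    }
    total_sites = sum(len(seq) for seq in matrix.values())
    total_missing = sum(seq.count(MISSING) for seq in matrix.values())
    return per_taxon, total_missing / total_sites


# Toy input (hypothetical names and sequences, for illustration only).
toy_regions = {
    "region1": {"sp_A": "ACGT", "sp_B": "ACGA", "sp_C": "ACGG"},
    "region2": {"sp_A": "TTGACA", "sp_C": "TTGACC"},
}
toy_matrix, toy_partitions = build_supermatrix(toy_regions)
toy_per_taxon, toy_overall = missing_fraction(toy_matrix)
```

In this toy case, sp_B has no sequence for region2, so its row is padded with "?" over that partition. The partition boundaries returned alongside the matrix are the units to which a shared model or separate models of DNA evolution would be applied, as discussed under goals (iii) and (iv).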
