Project

Frequentist estimation of the evolutionary history of sequences with substitutions and indels

English title Frequentist estimation of the evolutionary history of sequences with substitutions and indels
Applicant Anisimova Maria
Number 176316
Funding scheme Project funding (Div. I-III)
Research institution Zürcher Hochschule für Angewandte Wissenschaften
Institution of higher education Zurich University of Applied Sciences - ZHAW
Main discipline Genetics
Start/End 01.05.2018 - 30.04.2022
Approved amount 800'000.00

All Disciplines (2)

Discipline
Genetics
Information Technology

Keywords (6)

molecular evolution; sequence alignment; ancestral sequence; joint inference; indel model; phylogeny

Lay Summary (French, translated)

Lead
High-throughput sequencing allows scientists to observe an incredible molecular diversity between species. Since all observed molecular sequences are the result of a long evolutionary history, informative inferences can be made only when the sequences are analysed from an evolutionary perspective. Molecular sequences are routinely aligned to define character homology based on common ancestry. These alignments are used to derive phylogenetic trees, which are in turn used to test biological hypotheses, for example regarding functional divergence or natural selection. We aim to develop new computational methods for reconstructing the past of ancient molecules. Such inferences will have a variety of applications, from biomedicine and protein engineering to forensic science and ecology.
Lay summary

Existing evolutionary models can support inferences of sequence alignment, phylogeny, and the history of mutations and of character insertions and deletions (indels). All these inferences are generally performed as independent steps. Yet these objects are tightly interconnected, and the simplifications made at each step affect the accuracy of estimation. Therefore, MSAs, trees, ancestral states and indels should be inferred jointly for the whole set of homologous sequences. Some Bayesian methods of joint inference exist but are currently suitable only for small datasets. This is due to the high computational complexity of explicit evolutionary models with indels.

Recently, we developed a new fast method for simultaneous alignment and tree inference. Our approach uses an explicit evolutionary model of indels described as a Poisson process. This is the first fast frequentist method with a rigorous mathematical formulation of indel evolution.

We will advance our approach to also infer ancestral sequences simultaneously with MSAs and trees. The indel model will be adapted to reflect the natural variability of rates. Since selection leaves a strong imprint on genomic sequences, we will additionally couple our method with codon models that enable the estimation of selection on the protein. This will alleviate the problems stemming from approaches that infer MSA, tree and ancestors in independent steps. Considering that codon models describe protein-coding genes more realistically by explicitly including the structure of the genetic code and selection, these models have the potential to substantially improve the accuracy of the joint estimation of the complete molecular history.

Our own collaborations with industry show that our new method will be in high demand not only in academic projects but also in the pharmaceutical and biotechnology industries. Reconstruction of molecular history with substitutions and indels is of interest to a wide range of researchers from different fields, from evolution and ecology to applications in biomedicine, forensics and protein engineering.

Last update: 17.04.2018

Lay Summary (English)

Lead
High throughput sequencing technologies have permitted a wide range of scientists to observe an astonishing molecular diversity across all domains of life. Since all observed molecular sequences are a result of a long evolutionary history, most informative inferences can be made only when analysing genomic sequences from an evolutionary perspective. Molecular sequences are routinely aligned to define character homology based on common ancestry. These alignments are used to infer molecular phylogenetic trees, which are in turn used for testing various biological hypotheses, for example, with respect to functional divergence or natural selection. We aim to develop new computational methods for reconstructing the past of ancient molecules. Such inferences will be valuable in diverse molecular studies of functional properties, with applications from biomedicine and protein engineering to forensics and ecology.
Lay summary
Existing molecular evolutionary models can support inferences of multiple sequence alignment (MSA), phylogeny, and ancestral history of mutations and character insertions and deletions (indels). All these inferences are typically performed as independent steps. Yet these objects are tightly interconnected, and decisions such as model choice and simplifications made at each step affect the accuracy of estimation. Therefore, MSAs, trees, ancestral states and indels should be inferred jointly for any given set of homologous sequences. Some Bayesian implementations of joint inference exist but are currently suitable only for small datasets. This is due to the high computational complexity of explicit evolutionary models with indels.

Recently, we have developed a new fast method for simultaneous alignment and tree inference. Our approach uses an explicit evolutionary model of indels described as a Poisson process, for which the likelihood can be computed in linear time. This is the first fast frequentist aligner with a rigorous mathematical formulation of indel evolution.

Here we will advance our approach to also infer ancestral sequences simultaneously with MSAs and trees. The indel model will be adapted to reflect the natural variability of indel rates. Since positive selection leaves a strong imprint on genomic sequences, we will additionally couple our method with codon models that enable the estimation of selection on the protein. This will alleviate the problems stemming from approaches that infer MSA, tree and ancestors in independent sequential steps. Considering that codon models describe protein-coding genes more realistically by explicitly including the structure of the genetic code and selection, these models have the potential to substantially improve the accuracy of the joint estimation of the comprehensive molecular history.

Our own collaborations with industry show that our new method will be in high demand not only in academic projects but also in the pharmaceutical and biotech industries. Reconstruction of molecular history with substitutions and indels is of interest to a wide variety of researchers from different domains, from evolution and ecology to applications in biomedicine, forensics, and protein engineering.
Last update: 17.04.2018

Associated projects

Number | Title | Start | Funding scheme
157064 | Fast joint estimation of alignment and phylogeny from genomic sequences in a frequentist framework | 01.02.2015 | Project funding (Div. I-III)
174836 | C16.0072: Discovering evolutionary innovations by assessing variation and natural selection in protein tandem repeats | 01.01.2017 | COST (European Cooperation in Science and Technology)

Abstract

NGS technologies have permitted a wide range of scientists to observe an astonishing diversity of molecular data. Molecular sequences are routinely aligned to define character homology. These alignments are used to infer phylogenetic trees, which are in turn used to test various biological hypotheses or to support further analyses such as the inference of natural selection. Taking a glimpse into the past of ancient molecules may prove valuable in diverse molecular studies of functional properties, with applications from biomedicine and protein engineering to forensics and ecology. Many existing molecular evolution models and methods provide the means to infer multiple sequence alignments (MSAs), trees, ancestral characters and insertion-deletion events (indels), but these inferences are typically performed as independent steps. Yet these objects are tightly connected, and decisions such as model choice and simplifications made at each step affect the accuracy of estimation. Therefore, given a set of homologous sequences, MSAs, trees, ancestral states and indels should be inferred jointly. Some joint MSA-tree estimation algorithms have been implemented in the Bayesian framework, relying on the classic evolutionary birth-death model TKF91, which describes both substitutions and indels. These implementations are suitable for relatively small datasets but are computationally difficult for large NGS data because of TKF91’s exponential complexity and the need to sample multiple parameters, including MSAs and trees. Faster frequentist methods have been lagging behind in this respect. Recently, we have been developing a new frequentist method, JATI, that uses the Poisson indel process (PIP) for simultaneous MSA-tree inference. This development was possible because the PIP model is a modification of TKF91 that allows the marginal likelihood to be computed in linear time. Here we propose to advance our frequentist approach to simultaneously infer the tree, alignment, and ancestral sequences.
Our method should compare favorably with Bayesian methods in terms of speed, and will fill an important methodological gap as a frequentist method. In addition, since our current JATI work builds on our previous developments, specifically CodonPhyML and ProGraphMSA, we will be able to exploit their advantages. In particular, since positive selection leaves a strong imprint on genomic sequences, coupling our joint inference method with codon models is well motivated: joint estimation of tree, MSA and ancestors is the natural way to simultaneously obtain estimates of selection at the protein level. This avoids the questions raised by approaches that infer MSA, tree and ancestors in independent sequential steps. Considering that codon models describe protein-coding genes more realistically by explicitly including the structure of the genetic code and selection, these models have the potential to substantially improve the accuracy of the joint estimation of the comprehensive molecular history. Our own collaborations with industry show that our new method will be in high demand not only in academic projects but also in the pharmaceutical and biotech industries.
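
As a rough illustration of the kind of indel model the abstract refers to, the sketch below computes expected sequence length along a single branch under a simple Poisson insertion and independent-deletion process. It assumes insertions arrive at a constant rate `lam` and each character is deleted independently at rate `mu`. The function names and parameter values are hypothetical and chosen only for illustration. This is not the JATI implementation and does not reproduce the linear-time marginal likelihood computation.

```python
import math


def survival_prob(mu: float, t: float) -> float:
    """Probability that a character present at the start of a branch
    of length t is still present at the end, given deletion rate mu."""
    return math.exp(-mu * t)


def expected_length_after(n0: float, lam: float, mu: float, t: float) -> float:
    """Expected number of characters at the end of a branch of length t.

    n0  : expected number of characters at the start of the branch
    lam : insertion rate (assumed constant, independent of length)
    mu  : per-character deletion rate
    """
    s = survival_prob(mu, t)
    # Characters present at the start that survive the whole branch.
    survivors = n0 * s
    # Characters inserted along the branch that survive to its end:
    # integral of lam * exp(-mu * (t - u)) for u from 0 to t.
    newcomers = (lam / mu) * (1.0 - s)
    return survivors + newcomers


if __name__ == "__main__":
    lam, mu, t = 2.0, 0.1, 0.5  # illustrative values only
    stationary = lam / mu       # mean length at equilibrium
    print(expected_length_after(stationary, lam, mu, t))
```

Under these assumptions, a sequence that starts at the equilibrium mean length `lam / mu` keeps that expected length along any branch, which is the stationarity property that makes such a process convenient for likelihood-based inference.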