Project

Fast joint estimation of alignment and phylogeny from genomic sequences in a frequentist framework

English title: Fast joint estimation of alignment and phylogeny from genomic sequences in a frequentist framework
Applicant: Anisimova Maria
Number: 157064
Funding scheme: Project funding (Div. I-III)
Research institution: Zürcher Hochschule für Angewandte Wissenschaften
Institution of higher education: Zurich University of Applied Sciences - ZHAW
Main discipline: Genetics
Start/End: 01.02.2015 - 31.01.2018
Approved amount: CHF 454'000.00

Keywords (7)

sequence alignment; maximum likelihood inference; phylogeny; indels; joint estimation; genomic sequences; molecular evolution

Lay Summary (translated from French)

Lead
The availability of large quantities of molecular data calls for accurate and fast bioinformatics methods to analyze them. Molecular sequences of common origin are used to infer phylogenies, which help to test various biological hypotheses or to support subsequent analyses. Phylogenetic inference relies on sequence alignments, which are usually inferred during a heuristic search navigated by a guide tree. To resolve this circularity, we will develop methods for the simultaneous inference of phylogeny and alignment. This project will deliver a fast and practical solution.
Lay summary

The goal is to develop a fast and accurate algorithm for the simultaneous inference of alignment and tree using frequentist statistics. The algorithm will be available in a software package for analyzing large genomic or metagenomic datasets with thousands of sequences. We will connect our recent efficient methods, provided as the independent packages CodonPhyML (for fast maximum likelihood inference of the phylogeny of protein-coding genes) and ProGraphMSA (for fast, probabilistic, graph-based evolutionary alignment). To circumvent the computational difficulties, we will model the indel process using a modification of the classical model with linear time complexity. High-performance computing will ensure that our software is optimized for memory usage and speed.

The new method will support phylogenetic analyses of genomic data with thousands of sequences from microbial pathogens or antibody data from infected donors. Based on our own current collaborations with industry, the new method promises to be in high demand not only in academic projects but also in the pharmaceutical and biotechnology industries.


 

Last update: 08.01.2015

Lay Summary (English)

Lead
The availability of large volumes of molecular data demands accurate and fast bioinformatics methods to analyze them. Molecular sequences of common origin are used to infer phylogenetic trees, which help to test various biological hypotheses or to support subsequent analyses. Phylogeny inference relies on sequence alignments, which are usually inferred during a heuristic search navigated by a guide tree. This circularity calls for methods that infer phylogeny and alignment jointly. This project will develop a fast and practical solution.
Lay summary

The goal is to develop a fast and accurate joint alignment and tree inference algorithm in the frequentist framework, which will be implemented in a user-friendly software package and applicable to large genomic and metagenomic datasets with thousands of sequences. We will connect our recent successful methods implemented in independent packages: CodonPhyML for fast maximum likelihood phylogeny inference for protein-coding genes and ProGraphMSA for fast probabilistic graph-based phylogeny-aware alignment. To circumvent the computational difficulties, we will use the Poisson indel process, a modification of the classical model with linear time complexity. High-performance computing techniques, including parallelization, will ensure that the implementation is optimized for memory usage and speed.

The new method will support phylogenetic analyses of genomic data with thousands of sequences from microbial pathogens or antibody data from infected donors. Based on our own current collaborations with industry, the new method promises to be in high demand not only in academic projects but also in the pharmaceutical and biotech industries.

Last update: 08.01.2015

Publications

Maiolo Massimo, Zhang Xiaolei, Gil Manuel, Anisimova Maria (2018), Progressive multiple sequence alignment with indel evolution, in BMC Bioinformatics, 19(1), 331.
Gatti Lorenzo, Zhang Jitao David, Anisimova Maria, Schutten Martin, Osterhaus Albert, van der Vries Erhard (2017), Cross-reactive immunity drives global oscillation and opposed alternation patterns of seasonal influenza A viruses, in bioRxiv, 226613.
Nater Alexander, Mattle-Greminger Maja P, Nowak Matthew G, Anisimova Maria et al. (2017), Morphometric, behavioral, and genomic evidence for a new Orangutan species, in Current Biology, 27(22), 3487-3498.
Maiolo Massimo, Zhang Xiaolei, Gil Manuel, Anisimova Maria (2017), Progressive Multiple Sequence Alignment With The Poisson Indel Process, in bioRxiv, 123513.
Zhang Xiaolei (2016), Analysis of Bias and Reliability of Progressive Poisson Indel Process Algorithm, MSc thesis, ETH Zurich, Zurich.
Balakirev Evgeniy S, Anisimova Maria, Pavlyuchkov Vladimir A, Ayala Francisco J (2016), DNA polymorphism and selection at the bindin locus in three Strongylocentrotus sp. (Echinoidea), in BMC Genetics, 17(1), 66.
Wang Keke, Remigi Philippe, Anisimova Maria, Lonjon Fabien, Kars Ilona, Kajava Andrey, Li Chien-Hui, Cheng Chiu-Ping, Vailleau Fabienne, Genin Stéphane et al. (2016), Functional assignment to positively selected sites in the core type III effector RipG7 from Ralstonia solanacearum, in Molecular Plant Pathology, 17(4), 553-564.
Gil Manuel, Anisimova Maria (2015), Methodologies for Phylogenetic Inference, in eLS, 1-5.
Tan Ge, Gil Manuel, Löytynoja Ari P., Goldman Nick, Dessimoz Christophe (2015), Simple chained guide trees give poorer multiple sequence alignments than inferred trees in simulation and phylogenetic benchmarks, in Proceedings of the National Academy of Sciences, 112(2), 99-100.
SIB members, including Lorenzo Gatti, Massimo Maiolo, Manuel Gil, Maria Anisimova (2015), The SIB Swiss Institute of Bioinformatics’ Resources: Focus on Curated Databases, in Nucleic Acids Research, 44(D1), D27-D37.

Collaboration

Group / person | Country (types of collaboration listed beneath each entry)
Roche Pharma Research and Early Development | Switzerland (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
- Industry/business/other use-inspired collaboration
BSSE, ETH Zürich (Xiaolei Zhang, MSc thesis) | Switzerland (Europe)
- Publication
- Exchange of personnel
Veterinary University Hannover | Germany (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
IMLS and Department of Anthropology, University of Zurich | Switzerland (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
- Exchange of personnel
Far Eastern Branch of the Russian Academy of Sciences, Vladivostok | Russia (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication

Scientific events

Active participation

Title | Type of contribution | Title of article or contribution | Date | Place | Persons involved
Phylogroup XI | Talk given at a conference | Inferring Multiple Sequence Alignments with Explicit Model of Indel Evolution | 08.03.2018 | London, Great Britain and Northern Ireland | Maiolo Massimo
Biology 2018 (invited workshop for conference participants) | Talk given at a conference | Fast and accurate reconstruction of multiple sequence alignments using an evolutionary indel model | 14.02.2018 | Neuchatel, Switzerland | Ulzega Simone; Gatti Lorenzo; Maiolo Massimo
Bioinfo seminar, IMLS - UZH | Talk given at a conference | Phylogenetic tree optimisation under gap-aware evolutionary models: A computationally feasible algorithm for phylogenetic tree optimisation under the Poisson InDel Process (PIP) | 06.12.2017 | Zurich, Switzerland | Gatti Lorenzo; von Mering Christian
BioComp 2017, EPFL | Poster | Phylogenetic tree optimisation under gap-aware evolutionary models: A computationally feasible algorithm for phylogenetic tree optimisation under the Poisson InDel Process (PIP) | 04.10.2017 | Lausanne, Switzerland | Gatti Lorenzo
XLAB Alumni Symposium, XLAB - Goettinger Experimentallabor fuer junge Leute | Talk given at a conference | Seasonal influenza dynamics with phylogeographic models | 21.02.2017 | Goettingen, Germany | Gatti Lorenzo
Computational Life Sciences @ Bayer 2016 | Talk given at a conference | Cross-reactive immunity drives global oscillation and opposed alternation patterns of seasonal influenza A viruses | 16.11.2016 | Berlin, Germany | Gatti Lorenzo
Big Data in Biology and Health 2016, Wellcome Genome Campus, EMBL | Talk given at a conference | Cross-reactive immunity drives global oscillation and opposed alternation patterns of seasonal influenza A viruses | 25.09.2016 | Heidelberg, Germany | Gatti Lorenzo
PhyloSIB 2016 | Talk given at a conference | Fast and accurate Joint Alignment-Tree Inference method in the Frequentist framework | 05.09.2016 | Wädenswil, Switzerland | Maiolo Massimo
Phylogeny and Detecting Natural Selection, University of Zurich | Individual talk | Workshop for the PhD program for plant sciences | 28.06.2016 | Zurich, Switzerland | Anisimova Maria; Gatti Lorenzo
SIB Days 2016 | Talk given at a conference | [Selected lightning talk] Fast joint estimation of alignment and phylogeny from genomic sequences in a frequentist framework | 08.06.2016 | Bienne, Switzerland | Maiolo Massimo
SIB Days 2016 | Talk given at a conference | Seasonal influenza dynamics revealed with phylogeographic models | 07.06.2016 | Biel-Bienne, Switzerland | Gatti Lorenzo
Bioinfo meeting, UZH | Talk given at a conference | Fast joint estimation of alignment and phylogeny from genomic sequences in a frequentist framework | 16.03.2016 | Zurich, Switzerland | Maiolo Massimo
Life in numbers 2 | Poster | Fast joint estimation of alignment and phylogeny from genomic sequences in a frequentist framework | 01.09.2015 | Wädenswil, Switzerland | Anisimova Maria; Maiolo Massimo
Molecular Life Sciences PhD Retreat | Poster | Fast joint estimation of alignment and phylogeny from genomic sequences in a frequentist framework | 27.08.2015 | Engelberg, Switzerland | Maiolo Massimo
Bioinfo seminar, IMLS - UZH | Talk given at a conference | Seasonal influenza dynamics uncovered with phylogeographic models | 10.08.2015 | Zurich, Switzerland | Gatti Lorenzo; von Mering Christian
Invited speaker in Prof. Tanja Stadler lab, BSSE ETH Zurich | Individual talk | Influenza alignments with evolutionary indel model | 29.07.2015 | Basel, Switzerland | Gatti Lorenzo


Self-organised

Title | Date | Place
PhyloSIB 2016 | 01.09.2016 | Wädenswil, Switzerland
Life in numbers 2 | 01.09.2015 | Wädenswil, Switzerland

Knowledge transfer events

Active participation

Title | Type of contribution | Date | Place | Persons involved
NanoTalks, University of Zurich | Talk | 26.05.2016 | Zurich, Switzerland | Gatti Lorenzo


Self-organised

Title | Date | Place
Computational Molecular Evolution: Phylogenetics and Detecting Positive Selection | 12.06.2017 | University of Lund (SE), Sweden

Communication with the public

Communication | Title | Media | Place | Year
Media relations: print media, online media | Die Tapanuli-Orang-Utans: Zürcher Forschende bestimmen neuen Menschenaffen | NZZ | German-speaking Switzerland | 2017
Media relations: print media, online media | Phylogenetics reveals competition of human flu subtypes | Transfer | International | 2017
Media relations: print media, online media | Fast joint estimation of alignment and phylogeny from genomic sequences in a frequentist framework | Transfer | International | 2015

Awards

Title | Year
Best Lightning Talk, SIB scientific days 2016, Swiss Institute of Bioinformatics, Biel-Bienne (CH) | 2016
Best scientific poster at ETHZ-UZH MLS PhD retreat 2015, UZH-ETH Zurich Ph.D Program in Molecular Life Sciences (CH) | 2015

Associated projects

Number | Title | Start | Funding scheme
176316 | Frequentist estimation of the evolutionary history of sequences with substitutions and indels | 01.05.2018 | Project funding (Div. I-III)

Abstract

With rapidly growing molecular data from high-throughput technologies, bioinformatics methods must keep pace, providing the scientific community with accurate and scalable computational solutions to analyze these data. For molecular sequence analysis, evolutionary thinking provides a powerful framework for disentangling underlying biological mechanisms: molecular sequences of common origin are routinely used to infer phylogenies, which provide a test bed for various biological hypotheses or support further downstream analyses.

Phylogeny inference typically relies on multiple sequence alignments (MSA), which are in turn usually inferred during a heuristic search navigated by a guide tree. In addition to this apparent circularity, modeling simplifications at each step affect the accuracy of MSA and phylogeny estimation. Ideally, phylogeny and alignment should be inferred jointly. Several joint alignment-tree inference (JATI) algorithms have been implemented in the Bayesian framework, relying on the classic evolutionary model TKF91, which describes sequence changes (substitutions and indels) by an infinite-state continuous-time birth-death process. These implementations are useful for relatively small datasets but are not realistic for large modern-day NGS data due to computational limitations, namely the exponential complexity of TKF91 and the intensive MCMC sampling of multiple parameters, including unconventional parameters such as alignment and tree.

We propose to develop a fast and accurate JATI algorithm in the frequentist framework, which will be implemented in a user-friendly software package and applicable to large genomic and metagenomic datasets with thousands of sequences. This proposal will connect and build upon our two recent successful methods, currently implemented in independent packages: (1) CodonPhyML for fast maximum likelihood phylogeny inference for protein-coding genes (Gil et al. 2013, Mol Biol Evol), and (2) ProGraphMSA for fast probabilistic graph-based phylogeny-aware alignment (Szalkowski 2012, PLoS ONE; Szalkowski, Anisimova 2013, Nucleic Acids Res). To circumvent the computational difficulties posed by combining indel and substitution processes, we will use the recent Poisson indel process (PIP; Bouchard-Côté, Jordan 2013), which is a modification of TKF91 but has linear time complexity. In addition, the existing arsenal of CodonPhyML's substitution models will be used to improve the accuracy of joint estimation, since these models describe protein-coding genes more realistically by explicitly including the structure of the genetic code and selection pressures at the protein level. Further, ProGraphMSA provides one of the fastest and most accurate alignment heuristics, which distinguishes and correctly penalizes insertions and deletions, accounts for sequence divergence, and is the only alignment method that incorporates sequence content heterogeneity, alternative splicing and repeats.

Combining these state-of-the-art methods for the first time in a maximum likelihood JATI algorithm, we aim to: (1) develop new efficient heuristics to search through the joint space of alignment and tree structures, combined with optimization of model parameters and branch lengths, and (2) develop a sampling procedure during the joint heuristic search that enables approximation of marginal likelihoods over MSAs and of confidence sets for inferred point phylogenies. As a consequence, the new methodology will allow not only more accurate phylogeny and alignment inference but will also facilitate the estimation of statistical supports for inferred tree partitions and the reconstruction of indel and substitution history. High-performance computing techniques will ensure that the implementation is optimized for memory usage and speed using parallelization (also on GPUs).

This will support the phylogenetic analyses of genomic/metagenomic data or NGS data with thousands of sequences from viral/bacterial pathogens or antibody data from infected donors. Based on our own current collaborations with industry, we see that our new JATI method will be in high demand not only in academic projects but also in the pharmaceutical and biotech industries.
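
As an illustration of the indel model named in the abstract, the sketch below simulates how sequence length changes along a single branch under a Poisson indel process. It is not taken from the project's software. It assumes the usual description of PIP: insertions arrive at a constant rate that does not depend on sequence length, and each residue is deleted independently at its own rate. The names `lam`, `mu`, `expected_length` and `simulate_length` are hypothetical and chosen only for this example. Under these assumptions the mean length after time t has a simple closed form, and the long-run mean is lam / mu.

```python
import math
import random


def expected_length(n0, lam, mu, t):
    """Mean length after time t, starting from n0 residues.

    Assumes insertions at total rate lam and per-residue deletion rate mu.
    Survivors of the initial residues plus surviving insertions.
    """
    survive = math.exp(-mu * t)
    return n0 * survive + (lam / mu) * (1.0 - survive)


def simulate_length(n0, lam, mu, t, rng):
    """Gillespie simulation of sequence length along one branch of duration t."""
    n, clock = n0, 0.0
    while True:
        total_rate = lam + n * mu          # insertion rate + total deletion rate
        clock += rng.expovariate(total_rate)
        if clock > t:
            return n
        if rng.random() < lam / total_rate:
            n += 1                         # insertion event
        else:
            n -= 1                         # deletion event (only possible if n > 0)


if __name__ == "__main__":
    rng = random.Random(0)
    runs = [simulate_length(0, 5.0, 0.05, 100.0, rng) for _ in range(200)]
    print("simulated mean:", sum(runs) / len(runs))
    print("analytic mean: ", expected_length(0, 5.0, 0.05, 100.0))
```

The insertion rate here is a property of the whole sequence rather than of each position. That is the assumed difference from a TKF91-style birth-death process, and it is what the sketch is meant to make visible. The simulation says nothing about alignment or tree search.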