sequence alignment; maximum likelihood inference; phylogeny; indels; joint estimation; genomic sequences; molecular evolution
Maiolo Massimo, Zhang Xiaolei, Gil Manuel, Anisimova Maria (2018), Progressive multiple sequence alignment with indel evolution, in BMC Bioinformatics, 19(1), 331-331.
Gatti Lorenzo, Zhang Jitao David, Anisimova Maria, Schutten Martin, Osterhaus Albert, van der Vries Erhard (2017), Cross-reactive immunity drives global oscillation and opposed alternation patterns of seasonal influenza A viruses, in bioRxiv
Nater Alexander, Mattle-Greminger Maja P, Nowak Matthew G, others, Anisimova Maria et al. (2017), Morphometric, behavioral, and genomic evidence for a new Orangutan species, in Current Biology, 27(22), 3487-3498.
Maiolo Massimo, Zhang Xiaolei, Gil Manuel, Anisimova Maria (2017), Progressive Multiple Sequence Alignment With The Poisson Indel Process, in bioRxiv
Zhang Xiaolei (2016), Analysis of Bias and Reliability of Progressive Poisson Indel Process Algorithm, MSc thesis, ETH Zurich, Zurich.
Balakirev Evgeniy S, Anisimova Maria, Pavlyuchkov Vladimir A, Ayala Francisco J (2016), DNA polymorphism and selection at the bindin locus in three Strongylocentrotus sp. (Echinoidea), in BMC Genetics, 17(1), 66-66.
Wang Keke, Remigi Philippe, Anisimova Maria, Lonjon Fabien, Kars Ilona, Kajava Andrey, Li Chien-Hui, Cheng Chiu-Ping, Vailleau Fabienne, Genin Stéphane, others (2016), Functional assignment to positively selected sites in the core type III effector RipG7 from Ralstonia solanacearum, in Molecular Plant Pathology, 17(4), 553-564.
Gil Manuel, Anisimova Maria (2015), Methodologies for Phylogenetic Inference, in eLS
Tan Ge, Gil Manuel, Löytynoja Ari P., Goldman Nick, Dessimoz Christophe (2015), Simple chained guide trees give poorer multiple sequence alignments than inferred trees in simulation and phylogenetic benchmarks, in Proceedings of the National Academy of Sciences, 112(2), 99-100.
SIB members, including Lorenzo Gatti, Massimo Maiolo, Manuel Gil, Maria Anisimova (2015), The SIB Swiss Institute of Bioinformatics’ Resources: Focus on Curated Databases, in Nucleic Acids Research, 44(D1), D27-D37.
With molecular data from high-throughput technologies growing rapidly, bioinformatics methods must keep pace and provide the scientific community with accurate and scalable computational solutions for analyzing these data. For molecular sequence analysis, evolutionary thinking provides a powerful framework for disentangling the underlying biological mechanisms: molecular sequences of common origin are routinely used to infer phylogenies, which provide a basis for testing various biological hypotheses or support further downstream analyses.
Phylogeny inference typically relies on multiple sequence alignments (MSAs), which are, in turn, usually inferred by a heuristic search directed by a guide tree. In addition to this apparent circularity, modeling simplifications at each step affect the accuracy of MSA and phylogeny estimation. Ideally, phylogeny and alignment should be inferred jointly. Several joint alignment-tree inference (JATI) algorithms have been implemented in the Bayesian framework, relying on the classic evolutionary model TKF91, which describes sequence changes (substitutions and indels) by an infinite-state continuous-time birth-death process. These implementations are useful for relatively small datasets but are not practical for large modern-day NGS data because of computational limitations, namely the exponential complexity of TKF91 and the intensive MCMC sampling of multiple parameters, including unconventional parameters such as alignment and tree.
We propose to develop a fast and accurate JATI algorithm in the frequentist framework, which will be implemented in a user-friendly software package and will be applicable to large genomic and metagenomic datasets with thousands of sequences. This proposal will connect and build upon our two recent successful methods, currently implemented in independent packages: (1) CodonPhyML for fast maximum likelihood phylogeny inference for protein-coding genes (Gil et al. 2013 Mol Biol Evol), and (2) ProGraphMSA for fast probabilistic graph-based phylogeny-aware alignment (Szalkowski 2012 PLoS ONE; Szalkowski and Anisimova 2013 Nucleic Acids Res). To circumvent the computational difficulties posed by combining indel and substitution processes, we will use the recent Poisson indel process (PIP; Bouchard-Côté and Jordan 2013), which is a modification of TKF91 with linear time complexity. In addition, CodonPhyML's existing arsenal of substitution models will be used to improve the accuracy of joint estimation, since these models describe protein-coding genes more realistically by explicitly including the structure of the genetic code and selection pressures at the protein level. Further, ProGraphMSA provides one of the fastest and most accurate alignment heuristics, which distinguishes and correctly penalizes insertions and deletions, accounts for sequence divergence, and is the only alignment method that incorporates sequence content heterogeneity, alternative splicing and repeats.
Combining these state-of-the-art methods for the first time in a maximum likelihood JATI algorithm, we aim to: (1) develop new efficient heuristics to search the joint space of alignments and tree structures, combined with optimization of model parameters and branch lengths, and (2) develop a sampling procedure for use during the joint heuristic search that enables approximation of marginal likelihoods over MSAs and of confidence sets for inferred point phylogenies. As a consequence, the new methodology will not only allow more accurate phylogeny and alignment inference but will also facilitate the estimation of statistical support for inferred tree partitions and the reconstruction of indel and substitution histories. High-performance computing techniques will ensure that the implementation is optimized for memory usage and speed through parallelization, including on GPUs.
This will support phylogenetic analyses of genomic and metagenomic data, of NGS data with thousands of sequences from viral or bacterial pathogens, and of antibody data from infected donors. Based on our current collaborations with industry, we expect our new JATI method to be in high demand not only in academic projects but also in the pharmaceutical and biotech industries.