Lead
The availability of large molecular datasets demands accurate and fast bioinformatics methods to analyze them. Molecular sequences of common origin are used to infer phylogenetic trees, which help to test biological hypotheses or to support subsequent analyses. Phylogeny inference relies on sequence alignments, which in turn are usually inferred by a heuristic search directed by a guide tree. This circularity calls for methods that infer phylogeny and alignment jointly. This project will develop a fast and practical solution.

Lay summary

The goal is to develop a fast and accurate algorithm for joint alignment and tree inference in the frequentist framework. The algorithm will be implemented in a user-friendly software package and will be applicable to large genomic and metagenomic datasets with thousands of sequences. We will combine two of our recent methods, currently implemented in separate packages: CodonPhyML, for fast maximum likelihood phylogeny inference for protein-coding genes, and ProGraphMSA, for fast probabilistic graph-based phylogeny-aware alignment. To circumvent the computational difficulties, we will use the Poisson indel process, a modification of the classical indel model with linear time complexity. High-performance computing techniques, including parallelization, will be used to optimize the implementation for memory usage and speed.

The new method will support phylogenetic analyses of genomic data with thousands of sequences, such as data from microbial pathogens or antibody data from infected donors. Our current collaborations with industry suggest that the method will be in high demand not only in academic projects but also in the pharmaceutical and biotech industries.