Technological advances in DNA sequencing are making it easier to decode the genome of numerous organisms. The challenge that this mass of variable-quality data presents for biologists is how to analyse it efficiently and consistently. This project aims to develop new computational approaches capable of processing genomic data of variable quality in order to compare the genomes of different organisms. Modelling the interactions between genes with the help of machine learning methods will make it possible to understand, for example, the evolution of groups of genes involved in metabolic processes.

Lay summary

First of all, the project will focus on developing tools capable of organising genomic data and deducing comparable biological elements from it, such as genes that are similar between different species. Using different types of genomic data, these tools will make it possible to analyse more species, which is important for gaining a better understanding of the processes involved in the evolution of species. The second area of focus will consist in developing new machine learning algorithms capable of identifying which of the tens of thousands of genes present in the genomes show the most interesting characteristics. Studying them in depth with the help of modelling methods will enable their interactions and evolution to be understood.

Identifying the genes that are key to an organism’s development enables scientists to determine which genes relate to functions that are essential to the organism’s survival. In medicine, for example, it is vital to know whether a gene identified in a model organism such as a mouse has the same function in human beings. Answering questions of this kind requires complex computing methods and high-quality data. Such questions are therefore restricted to a small number of organisms that have been studied in great depth and ignore the enormous quantity of poorer-quality data that is currently being generated.

The project’s scope is in full conformity with the issue of Big Data, since it addresses the size, heterogeneity and quality of genomic data in biology. It also has implications that go beyond this single discipline, since establishing approaches for managing and comparing data is essential in other fields, such as language analysis. Moreover, machine learning is a key component of computational sciences.