Integrating metadata is currently expensive and tedious because it has proved very difficult to automate. This project aims to develop new techniques for the efficient, automatic integration of metadata taken, for example, from the Web or social networks. The project is divided into two parts. The first consists of developing and then testing new data extraction techniques in order to characterise the available data automatically, understand the relationships between pieces of data and model the distribution of their values. In the second, this information will be used to facilitate the analysis and integration of the available data.

Lay summary

It will be necessary to develop new techniques capable of creating data patterns on demand and providing abstraction layers. The ultimate goal is to provide processes which allow data sets to be easily combined while preserving their specific features and history.

One of the cornerstones of Big Data is the combination of several sources of information in order to model a specific phenomenon. Most current methods rely on the analysis of data patterns, and particularly on the metadata that unambiguously defines the structure of the information to be combined. In practice, however, these patterns often turn out to be incomplete, e.g. for data originating from social networks or the Web. Because such data cannot currently be combined automatically, experts have no choice but to prepare and integrate it manually. The resulting loss of time is one of the major problems of Big Data.

The aim of this project is to devise new techniques for the automatic or semi-automatic integration of data. Because the data structure is often not defined in advance, the central challenge for our research is to understand it retrospectively, by reconstructing patterns using the available data.
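As an illustration only, the idea of reconstructing a pattern retrospectively can be sketched in a few lines of Python. This sketch is not the project's method: the function name `infer_schema`, the sample records and the "coverage" measure are all assumptions made for this example. It scans records whose structure was never declared and reports, for each field, which value types occur and how often the field is present.

```python
from collections import Counter, defaultdict


def infer_schema(records):
    """Summarise, per field, the observed value types and the share of
    records in which the field appears (hypothetical helper)."""
    type_counts = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            # Count each observed type name for this field.
            type_counts[field][type(value).__name__] += 1

    total = len(records)
    schema = {}
    for field, counts in type_counts.items():
        present = sum(counts.values())
        schema[field] = {
            "types": dict(counts),
            "coverage": present / total,
        }
    return schema


# Made-up records standing in for loosely structured Web or social-network data.
records = [
    {"user": "alice", "followers": 120, "city": "Bern"},
    {"user": "bob", "followers": "n/a"},
    {"user": "carol", "followers": 87, "city": "Fribourg"},
]

schema = infer_schema(records)
for field, info in schema.items():
    print(field, info)
```

On this toy input, the summary shows that `followers` mixes integers and strings and that `city` is missing from some records, which is the kind of incompleteness an analyst would otherwise have to discover and repair by hand.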

This project is particularly important because of the mismatch between the ever-increasing volume of available data and the limited time analysts have to process it. The results of this project will help to substantially speed up the process of turning raw data into models and visualisations. Numerous fields that require the combination of heterogeneous data sets (e.g. smart cities, personalised healthcare and e-science) stand to benefit from new methods of combining different data sets, resulting in more powerful analyses and models.