Project

SimDiversity: Similarity-reduced measures of diversity in the social and natural sciences

Applicant Bavaud François
Number 190377
Funding scheme Spark
Research institution Sciences du langage et de l'information Faculté des lettres Université de Lausanne
Institution of higher education University of Lausanne - LA
Main discipline Mathematics
Start/End 01.02.2020 - 31.01.2021
Approved amount 89'990.00

All Disciplines (5)

Discipline
Mathematics
Applied linguistics
Political science
Ecology
Social geography and ecology

Keywords (7)

similarities between types; similarity-reduced biodiversity; information theory; phase transitions; political diversity; effective diversity; textual richness

Lay Summary (translated from French)

Lead
Classical measures of the variety of a configuration (Shannon entropy and its variants) are based on a census of distinct categories or types, together with their frequencies of occurrence. Yet, whatever level of granularity is chosen, these types are not completely distinct: they exhibit more or less pronounced similarities, which tend to reduce the value of the classical variety:
• words exhibit semantic similarities (lexical variety in a text)
• geographical places are co-frequented (spatial variety)
• species are morphologically or genetically related (biodiversity)
• the votes of members of parliament affiliated with distinct political groups often coincide (political variety).
Lay summary

CONTENT AND OBJECTIVES OF THE RESEARCH WORK
The SimDiversity project is devoted to the study of measures of variety that take the similarity between types into account, in particular two measures: the reduced entropy, recently introduced in biodiversity studies, and a new quantity, the effective entropy, whose iterative computation reveals strong links with statistical mechanics and phase transitions, clustering and vector quantization, optimal transport, and models of perception and choice in psychology.

The effective entropy is always concave (i.e. it possesses a within-/between-group decomposition), unlike the reduced entropy, whose concavity depends on additional conditions on the similarity matrix that remain to be fully characterized.

In its empirical part, the SimDiversity project will collect, in dialogue with specialists from each discipline (linguistics, geography, biology, political science), large datasets on which these reduced diversity measures are computed and analyzed, with regard to both their numerical behavior and their disciplinary interpretation.

SCIENTIFIC AND SOCIAL CONTEXT OF THE RESEARCH WORK
The long-term objective of the work is to help promote and unify the use of these reduced measures of variety beyond disciplinary particularities, and to encourage their diffusion in the general practice of Data Analysis.

Last update: 25.11.2019

Publications

Publication
Similarity-reduced diversities: the effective entropy and the reduced entropy
Bavaud François, Similarity-reduced diversities: the effective entropy and the reduced entropy, in Journal of Classification.

Datasets

simdiversity/data-politics v0.2

Author Egloff, Mattia; Bavaud, François
Publication date 15.03.2021
Persistent Identifier (PID) DOI: 10.5281/zenodo.4485413
Repository SimDiversity
Abstract
This package contains four datasets: swiss_legislator_49; swiss_legislator_50; italian_legislator_17; italian_legislator_18.

Each dataset contains data about a legislative period, split into 4 datatables. The first contains information about the members of parliament; the second, information about all the votes; the third is a numeric matrix containing the vote cast by each member of parliament in each vote; the fourth gives the meaning of the numbers in the numeric matrix (e.g. Yes, No, Abstention...).

The Swiss datasets are obtained by combining the xls files available at https://www.parlament.ch/de/ratsbetrieb/abstimmungen/abstimmung-nr-xls with some information from http://ws-old.parlament.ch/, the Swiss parliamentary data web service.

The Italian datasets are obtained by querying the SPARQL endpoint of the Italian parliament, https://dati.camera.it/.

For more details see the scripts in the "data-raw" folder.
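
As an illustration of how a member-by-vote numeric matrix of this kind might be turned into a similarity matrix between legislators, here is a minimal Python sketch. It does not use the package itself: the toy matrix, the orientation (rows = members, columns = votes), the numeric codes (1 = Yes, 2 = No, 0 = missing) and the function name `agreement_similarity` are all assumptions made for the example, and the actual coding is the one documented in the fourth datatable.

```python
import numpy as np

def agreement_similarity(votes, missing=0):
    """Pairwise agreement rate between legislators (hypothetical helper).

    votes: 2-D array, rows = members, columns = votes (assumed orientation).
    missing: code marking a vote not cast (assumed value, not from the package).
    """
    votes = np.asarray(votes)
    n = votes.shape[0]
    S = np.eye(n)  # each member agrees fully with themselves
    for i in range(n):
        for j in range(i + 1, n):
            # only compare votes in which both members took part
            both = (votes[i] != missing) & (votes[j] != missing)
            if both.any():
                rate = np.mean(votes[i, both] == votes[j, both])
                S[i, j] = S[j, i] = rate
            # pairs with no common vote keep similarity 0
    return S

# Toy data: 3 members, 4 votes; codes are illustrative only.
toy_votes = np.array([
    [1, 1, 2, 1],
    [1, 1, 2, 2],
    [2, 2, 1, 0],
])
S_toy = agreement_similarity(toy_votes)
print(S_toy)
```

The agreement rate is only one possible choice of similarity; the project abstract stresses that the resulting diversity measures depend on the nature of the similarities used.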

Abstract

The SimDiversity project proposes to define, study and disseminate new measures of diversity in which the similarities between the constituents (types, categories) are explicitly taken into account. These measures extend the classical diversity measures (Shannon, Tsallis, Rényi and variants), which are recovered in the limit of infinitely distinct types. They depend crucially on the nature of the similarities at stake. Conversely, similarity-reduced measures of diversity help to characterize and understand the nature of similarities in Data Analysis, which are much less investigated than dissimilarities.

Firmly grounded in a robust and innovative mathematical formalism, the project aims at unifying and stimulating the use of similarity-reduced measures of diversity in the social and natural sciences, in particular for textual data, geography, political science, and biology-ecology. Its interdisciplinary nature makes it less likely to be supported by more conventional mono-disciplinary funding schemes.

Building on currently available preliminary results and numerical code applied to toy datasets, the project consists of:
(1) gathering large datasets: at least two "real" datasets of interest for each of the four disciplines mentioned above, constituting a rich corpus exhibiting various facets of (dis)similarity structures;
(2) further investigating the mathematical formalism (characterization of "dominating types" and "phase transitions"; creation and testing of new indices); optimizing and developing the presently available numerical codes (R and Python), and distributing the material in an accessible, open and user-friendly manner;
(3) conducting scientific exchanges with local and international specialists in the target disciplines (to maximize the potential impact and relevance of the project); disseminating the results through conferences and open journal articles.

The project will be carried out over 12 months by the applicant and another scientific collaborator, both possessing significant research experience (postdoc level or more) covering all the necessary skills and scientific expertise, on a 130% full-time-equivalent basis.

The project pertains a priori to all the disciplines concerned with the concepts of diversity, variety or richness, that is, most of the disciplines in the social and natural sciences. Its potential impact in Data Analysis, currently scantily developed regarding similarity studies, is large. It also aims at encouraging disciplines to pay more attention to the issue of similarity between types or categories, and to use well-defined similarity measures in a well-controlled manner.

The project intends to contribute significantly:
• to the unification of the studies and uses of measures of diversity, which (in contrast to the concept of variance, say) have so far largely been developed and applied in an idiosyncratic way, within tightly segregated disciplines; and to encouraging the practice of quantification when concepts such as "linguistic richness", "political variety", or "biodiversity" are referred to.
• to the unification and further development of formal approaches studied so far separately in Information Theory, Statistical Mechanics, Optimal Transportation, and biased choice models and confusion models in Psychology; and to stressing the potential of the proposal for capturing and describing phenomena such as phase transitions in stimulus recognition, the emergence of prototypes, or the very nature of the categorization process itself, of direct relevance for many disciplines.
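
To make concrete the statement that classical measures are recovered in the limit of infinitely distinct types, the following Python sketch compares the Shannon entropy with one similarity-reduced entropy. This record does not state the project's formulas, so the form used here, R = -sum_i f_i log (S f)_i, with S a similarity matrix having ones on the diagonal and entries in [0, 1], is an assumption borrowed from similarity-sensitive diversity indices in the biodiversity literature. It is not claimed to be the project's exact definition, and the effective entropy is not attempted. The function names and the toy numbers are illustrative.

```python
import numpy as np

def shannon_entropy(f):
    """Classical Shannon entropy (natural log) of a frequency vector f."""
    f = np.asarray(f, dtype=float)
    nz = f > 0  # types with zero frequency contribute nothing
    return float(-np.sum(f[nz] * np.log(f[nz])))

def reduced_entropy(f, S):
    """Assumed similarity-reduced entropy: -sum_i f_i log (S f)_i.

    S[i, j] in [0, 1] is the similarity between types i and j, with S[i, i] = 1.
    This form is an assumption for illustration, not taken from the project.
    """
    f = np.asarray(f, dtype=float)
    S = np.asarray(S, dtype=float)
    banality = S @ f  # how "ordinary" each type is, given its similar neighbours
    nz = f > 0
    return float(-np.sum(f[nz] * np.log(banality[nz])))

f = np.array([0.5, 0.3, 0.2])       # toy frequencies of three types
S_distinct = np.eye(3)              # totally distinct types
S_similar = np.array([              # types 1 and 2 strongly similar
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
])
S_identical = np.ones((3, 3))       # all types indistinguishable

H = shannon_entropy(f)
R_distinct = reduced_entropy(f, S_distinct)
R_similar = reduced_entropy(f, S_similar)
R_identical = reduced_entropy(f, S_identical)
print(H, R_distinct, R_similar, R_identical)
```

Under this assumed form, the identity similarity matrix gives back the Shannon entropy, any additional similarity lowers the value, and fully indistinguishable types give zero diversity.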