Project


RODI: ROle based speaker DIarization

English title RODI: ROle based speaker DIarization
Applicant Bourlard Hervé
Number 135463
Funding scheme Project funding (Div. I-III)
Research institution IDIAP Institut de Recherche
Institution of higher education Idiap Research Institute - IDIAP
Main discipline Information Technology
Start/End 01.11.2011 - 31.10.2014
Approved amount 165'738.00

Keywords (3)

Speaker clustering/diarization; Human conversation analysis; Automatic role recognition

Lay Summary (English)

Speaker Diarization is the task of inferring "who spoke when" in an audio stream. It is an essential step for facilitating the search and indexing of audio archives, enriching automatic transcriptions and extracting high-level information about human conversations. Most recent efforts in the domain have addressed the problem with machine learning and signal processing techniques. However, current approaches neglect the fact that the data represents instances of human conversations, which exhibit predictable patterns induced by the role that each participant plays in the discussion. In recent years, many studies have shown that the turn-taking patterns extracted from speaker diarization can be statistically modeled and used to classify the role that each speaker has in the conversation. Roles can be coded according to a number of schemes, including formal/informal, social and functional roles. Conversely, we propose to integrate into the diarization system the statistics on speaker interactions induced by their roles. The goal of this proposal is to enhance speaker diarization of meeting and broadcast data by combining traditional audio processing techniques with information on the conversation structure derived from the participants' roles.

The project is organized in two research tracks:
1. Statistical representation and estimation of the speakers' interactions conditioned on their roles (see the illustrative sketch below).
2. Integration of this information into the speaker diarization system.

Development and evaluation will be carried out on meeting recordings and broadcast audio data collected in the framework of the Rich Transcription evaluations. Progress will be measured in terms of the Diarization Error Rate, the official metric proposed by NIST for benchmarking this task. The research proposed in RODI aims to bridge the gap between two closely related fields: automatic speaker segmentation and the analysis of human conversations.
Last update: 21.02.2013
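
Research track 1 concerns statistics of speaker interactions conditioned on roles. As an illustration of the kind of model this implies, the following minimal sketch estimates role-to-role turn-transition probabilities from a role-labelled turn sequence by relative frequency; the function name, role labels and data are invented for illustration and do not describe the actual models developed in the project.

from collections import defaultdict

def estimate_role_transitions(turns):
    # turns: chronological list of (speaker, role) pairs, one per speaker turn.
    # Returns P(next turn has role_b | current turn has role_a), estimated by
    # relative frequency, i.e. a simple role-conditioned turn-taking model.
    counts = defaultdict(int)
    totals = defaultdict(int)
    for (_, role_a), (_, role_b) in zip(turns, turns[1:]):
        counts[(role_a, role_b)] += 1
        totals[role_a] += 1
    return {(a, b): n / totals[a] for (a, b), n in counts.items()}

# Hypothetical meeting excerpt with functional roles (labels are invented).
turns = [
    ("spk1", "project_manager"),
    ("spk2", "designer"),
    ("spk1", "project_manager"),
    ("spk3", "marketing_expert"),
    ("spk1", "project_manager"),
]
print(estimate_role_transitions(turns))
# {('project_manager', 'designer'): 0.5, ('designer', 'project_manager'): 1.0,
#  ('project_manager', 'marketing_expert'): 0.5, ('marketing_expert', 'project_manager'): 1.0}

Such role-conditioned transition statistics are the kind of prior information that research track 2 would feed back into the diarization system.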

Responsible applicant and co-applicants

Employees

Name Institute

Publications

Publication
Artificial neural network features for speaker diarization
Yella Sree Harsha, Stolcke Andreas, Slaney Malcolm (2014), Artificial neural network features for speaker diarization, in IEEE Spoken Language Technology Workshop, South Lake Tahoe.
Improving speaker diarization using social role information
Sapru Ashtosh, Yella Sree Harsha, Bourlard Hervé (2014), Improving speaker diarization using social role information, in IEEE ICASSP, Florence.
Inferring social relationships in a phone call from a single party's speech
Yella Sree Harsha, Anguera Xavier, Luque Jordi (2014), Inferring social relationships in a phone call from a single party's speech, in IEEE ICASSP, Florence.
Overlapping Speech Detection Using Long-Term
Yella Sree Harsha, Bourlard Hervé (2014), Overlapping Speech Detection Using Long-Term, in IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(12), 1688-1700.
Phoneme Background Model for Information Bottleneck based Speaker Diarization
Yella Sree Harsha, Bourlard Hervé, Motlicek Petr (2014), Phoneme Background Model for Information Bottleneck based Speaker Diarization, in Interspeech, Singapore.
Improved Overlap Speech Diarization of Meeting Recordings using Long-term Conversational Features
Yella Sree Harsha, Bourlard Hervé (2013), Improved Overlap Speech Diarization of Meeting Recordings using Long-term Conversational Features, in IEEE ICASSP, Vancouver, Canada, 2013.
Automatic detection of conflict escalation in spoken conversations
Kim Samuel, Yella Sree Harsha, Valente Fabio (2012), Automatic detection of conflict escalation in spoken conversations, in Interspeech.
Information Bottleneck Features for HMM/GMM Speaker Diarization of Meetings Recordings
Yella Sree Harsha, Valente Fabio (2011), Information Bottleneck Features for HMM/GMM Speaker Diarization of Meetings Recordings, in Interspeech, 953-956.
Understanding Social Signals in Multi-party Conversations: Automatic Recognition of Socio-Emotional Roles in the AMI Meeting Corpus
Valente Fabio, Vinciarelli Alessandro, Yella Sree Harsha (2011), Understanding Social Signals in Multi-party Conversations: Automatic Recognition of Socio-Emotional Roles in the AMI Meeting Corpus, in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics.
Speaker diarization of overlapping speech based on silence distribution in meeting recordings
Yella Sree Harsha, Valente Fabio (2012), Speaker diarization of overlapping speech based on silence distribution in meeting recordings, in Interspeech.

Collaboration

Group / person Country
Types of collaboration
ICSI United States of America (North America)
- in-depth/constructive exchanges on approaches, methods or results
Microsoft Research United States of America (North America)
- Exchange of personnel
Telefonica Research Spain (Europe)
- Exchange of personnel

Scientific events

Active participation

Title Type of contribution Title of article or contribution Date Place Persons involved
Interspeech 2014 conference Talk given at a conference Phoneme Background Model for Information Bottleneck based Speaker Diarization 29.05.2014 Singapore, Singapore Yella Sree Harsha;
IEEE ICASSP Talk given at a conference Improved Overlap Speech Diarization of Meeting Recordings using Long-term Conversational Features 26.05.2013 Vancouver, Canada Yella Sree Harsha;
Interspeech 2012 conference Talk given at a conference Speaker diarization of overlapping speech based on silence distribution in meeting recordings 09.09.2012 Portland, United States of America Yella Sree Harsha;


Abstract

Speaker Diarization (SD) is the task of inferring who spoke when in an audio stream. It is an essential step for facilitating the search and indexing of audio archives, enriching automatic transcriptions and extracting high-level information about human conversations. It involves two simultaneous unsupervised tasks: (1) estimating the number of speakers and (2) associating speech segments with each speaker. Typical diarization applications consist of broadcast audio and meeting recordings. Most recent efforts in the domain have addressed the problem with machine learning and signal processing techniques. However, current approaches neglect the fact that the data represents instances of human conversations, which exhibit predictable patterns induced by the role that each participant plays in the discussion.

Conversations are one of the most common forms of human interaction and, "while appearing unconstrained and spontaneous, are governed by principles and laws which give rise to ordered and predictable behavioral patterns" [Orestrom83]. In recent years, many studies have shown that the turn-taking patterns extracted from speaker diarization can be statistically modeled and used to classify the role that each speaker has in the conversation. Those roles can be formal (as in broadcast recordings) or informal (as in meeting recordings). Conversely, we propose to integrate into the diarization system the statistics on speaker interactions induced by their roles. The goal of this proposal is to enhance speaker diarization of meeting and broadcast data by combining traditional audio processing techniques with information on the conversation structure derived from the participants' roles.

The project is organized in two research tracks:
1. Statistical representation and estimation of the speakers' interactions conditioned on their roles.
2. Integration of this information into the speaker diarization system.

We propose three case scenarios of increasing difficulty to address the problem:
- Case scenario A assumes that the number of speakers and their roles are known. The research will focus on the statistical modeling of turn-taking and on the integration of this information into the diarization.
- Case scenario B assumes that the number of speakers is known, but their roles are unknown. The research will focus on the estimation and integration of the role information obtained by an automatic classifier.
- Case scenario C assumes that both the number of speakers and their roles are unknown. The research will focus on how the information on turn-taking and roles affects the estimation of the number of speakers and the associated speaking time.

Development and evaluation will be carried out on meeting recordings and broadcast audio data collected in the framework of the Rich Transcription evaluations. Progress will be measured in terms of the Diarization Error Rate (DER), the official metric proposed by NIST for benchmarking this task (a simplified computation is sketched below). The project will build on a recently completed thesis on speaker diarization, which produced significant advances in information fusion for this task. The research proposed in RODI aims to bridge the gap between two closely related fields, automatic speaker segmentation and the analysis of human conversations; we therefore request funding for a dedicated PhD student to support the project. The student will join IDIAP and be enrolled in the EPFL EDEE doctoral school.
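
The Diarization Error Rate combines missed speech, false-alarm speech and speaker-confusion time, normalized by the total reference speech time. The following is a minimal frame-based sketch, assuming the hypothesis speaker labels have already been mapped to reference speakers and ignoring the forgiveness collar and overlapping speech handled by the official NIST scoring tool; segment times and labels are invented for illustration.

def diarization_error_rate(reference, hypothesis, frame=0.01):
    # reference, hypothesis: lists of (start_sec, end_sec, speaker) segments.
    # Frame-based DER = (missed + false alarm + confusion) / reference speech,
    # computed here on a single-speaker-per-frame grid for simplicity.
    n = int(round(max(seg[1] for seg in reference + hypothesis) / frame))

    def to_frames(segments):
        labels = [None] * n
        for start, stop, spk in segments:
            for i in range(int(round(start / frame)), int(round(stop / frame))):
                labels[i] = spk
        return labels

    ref, hyp = to_frames(reference), to_frames(hypothesis)
    missed = sum(1 for r, h in zip(ref, hyp) if r and not h)
    false_alarm = sum(1 for r, h in zip(ref, hyp) if h and not r)
    confusion = sum(1 for r, h in zip(ref, hyp) if r and h and r != h)
    return (missed + false_alarm + confusion) / sum(1 for r in ref if r)

# Hypothetical two-speaker excerpt; hypothesis labels already mapped to A/B.
ref = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
hyp = [(0.0, 6.0, "A"), (6.0, 10.0, "B")]
print(diarization_error_rate(ref, hyp))  # 0.1: one second attributed to the wrong speaker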