Speaker clustering/diarization; Human conversation analysis; Automatic role recognition
Yella Sree Harsha, Stolcke Andreas, Slaney Malcolm (2014), Artificial neural network features for speaker diarization, in IEEE Spoken Language Technology Workshop, South Lake Tahoe.
Sapru Ashtosh, Yella Sree Harsha, Bourlard Hervé (2014), Improving speaker diarization using social role information, in IEEE ICASSP
Yella Sree Harsha, Anguera Xavier, Luque Jordi (2014), Inferring social relationships in a phone call from a single party's speech, in IEEE ICASSP
Yella Sree Harsha, Bourlard Hervé (2014), Overlapping Speech Detection Using Long-Term Conversational Features, in IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(12), 1688-1700.
Yella Sree Harsha, Bourlard Hervé, Motlicek Petr (2014), Phoneme Background Model for Information Bottleneck based Speaker Diarization, in Interspeech
Yella Sree Harsha, Bourlard Hervé (2013), Improved Overlap Speech Diarization of Meeting Recordings using Long-term Conversational Features, in IEEE ICASSP, Vancouver, Canada.
Kim Samuel, Yella Sree Harsha, Valente Fabio (2012), Automatic detection of conflict escalation in spoken conversations, in Interspeech
Yella Sree Harsha, Valente Fabio (2011), Information Bottleneck Features for HMM/GMM Speaker Diarization of Meeting Recordings, in Interspeech
Valente Fabio, Vinciarelli Alessandro, Yella Sree Harsha (2011), Understanding Social Signals in Multi-party Conversations: Automatic Recognition of Socio-Emotional Roles in the AMI Meeting Corpus, in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics
Yella Sree Harsha, Valente Fabio (2012), Speaker diarization of overlapping speech based on silence distribution in meeting recordings, in Interspeech
Speaker Diarization (SD) is the task of inferring who spoke when in an audio stream. It is an essential step for facilitating the search and indexing of audio archives, enriching automatic transcriptions, and extracting high-level information from human conversations. It involves two simultaneous unsupervised tasks: (1) estimating the number of speakers, and (2) associating speech segments with each speaker. Typical diarization applications include broadcast audio and meeting recordings. Most recent efforts in the domain have addressed the problem with machine learning and signal processing techniques. However, current approaches completely neglect the fact that the data represents instances of human conversations, which exhibit predictable patterns induced by the role that each participant has in the discussion.

Conversations are one of the most common forms of human interaction and, "while appearing unconstrained and spontaneous, are governed by principles and laws which give rise to ordered and predictable behavioral patterns" [Orestrom83]. In recent years, many studies have shown that the turn-taking extracted from speaker diarization can be statistically modeled and used to classify the role that each speaker has in the conversation. Those roles can be formal (as in broadcast recordings) or informal (as in meeting recordings). Conversely, we propose to integrate into the diarization system the statistics on speaker interactions induced by their roles.

The goal of this proposal is to enhance speaker diarization of meetings and broadcast data by combining traditional audio processing techniques with information on the conversation structure derived from participants' roles. The project is organized in two research tracks:

1- Statistical representation and estimation of the speaker interactions conditioned on their roles.
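As a minimal illustration of the first track, the turn-taking statistics conditioned on roles could be captured by a first-order Markov chain over role labels, estimated from a sequence of role-annotated speaker turns. The role names and the turn sequence below are hypothetical, loosely inspired by the AMI socio-emotional roles; the modeling choice (first-order Markov) is an assumption for the sketch, not the project's committed method.

```python
from collections import defaultdict

def estimate_turn_transitions(turn_sequence):
    """Estimate first-order Markov transition probabilities
    P(next role | previous role) from a role-labelled turn sequence.
    Illustrative sketch; real role inventories and smoothing may differ."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(turn_sequence, turn_sequence[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for prev, nxts in counts.items():
        total = sum(nxts.values())
        probs[prev] = {role: c / total for role, c in nxts.items()}
    return probs

# Hypothetical sequence of role-labelled turns in a meeting
turns = ["protagonist", "supporter", "protagonist", "attacker",
         "protagonist", "supporter", "protagonist"]
model = estimate_turn_transitions(turns)
# e.g. model["protagonist"] gives the distribution over who speaks next
```

Such transition probabilities can then serve as a prior over speaker sequences inside the diarization system.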
2- Integration of this information into the speaker diarization system.

We propose three case scenarios of increasing difficulty to address the problem. Case scenario A assumes that the number of speakers and their roles are known; the research will focus on the statistical modeling of the turn-taking and on the integration of this information into the diarization. Case scenario B assumes that the number of speakers is known but their roles are unknown; the research will focus on the estimation and integration of the role information obtained by an automatic classifier. Case scenario C assumes that both the number of speakers and their roles are unknown; the research will focus on how the information on turn-taking and roles affects the estimation of the number of speakers and the associated speaker time.

The development and evaluation will be carried out on meeting recordings and broadcast audio data collected in the framework of the Rich Transcription evaluations. Progress will be measured in terms of the Diarization Error Rate, the official metric proposed by NIST for benchmarking this task. The project will build on a recently completed thesis on speaker diarization, which produced significant advances in information fusion for this task. The research proposed in RODI will try to bridge the gap between two closely related fields (automatic speaker segmentation and analysis of human conversations); we therefore ask to fund a dedicated PhD student to support the project. The student will join IDIAP and be enrolled in the EPFL EDEE doctoral school.
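To make the evaluation metric concrete, the Diarization Error Rate combines missed speech, false alarm, and speaker confusion time over the total scored speech time. The sketch below is a simplified frame-level version under the assumption that hypothesis speaker labels have already been mapped to reference labels; the official NIST scoring tool additionally applies a forgiveness collar around segment boundaries and an optimal speaker mapping.

```python
def frame_level_der(ref, hyp):
    """Simplified frame-level Diarization Error Rate.
    ref, hyp: per-frame speaker labels (None = silence).
    Returns (miss + false alarm + confusion) / total reference speech frames.
    Assumes hypothesis labels are already mapped to reference labels."""
    miss = false_alarm = confusion = 0
    scored = 0  # number of reference speech frames
    for r, h in zip(ref, hyp):
        if r is not None:
            scored += 1
            if h is None:
                miss += 1        # reference speech, no hypothesis speech
            elif h != r:
                confusion += 1   # speech attributed to the wrong speaker
        elif h is not None:
            false_alarm += 1     # hypothesis speech during reference silence
    return (miss + false_alarm + confusion) / scored

ref = ["A", "A", "B", "B", None, "A"]
hyp = ["A", "B", "B", None, "B", "A"]
# one confusion, one miss, one false alarm over 5 scored frames -> 0.6
print(frame_level_der(ref, hyp))
```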