
Online Diarization Enhanced by recent Speaker identification and Structured prediction Approaches (ODESSA)

English title Online Diarization Enhanced by recent Speaker identification and Structured prediction Approaches (ODESSA)
Applicant Marcel Sébastien
Number 164336
Funding scheme Project funding (Div. I-III)
Research institution IDIAP Institut de Recherche
Institution of higher education Idiap Research Institute - IDIAP
Main discipline Information Technology
Start/End 01.05.2016 - 31.10.2019
Approved amount 296'160.00

Keywords (5)

speaker diarization; speaker recognition; deep learning; i-vectors; cyber-criminality

Lay Summary (translated from French)

The ODESSA project aims to use or adapt recent techniques developed in automatic speaker recognition (e.g. i-vectors, deep learning, session compensation, ...) for speaker segmentation and clustering, i.e. speaker diarization. Speaker diarization is an unsupervised process that consists of identifying, in an audio stream, the number of speakers and their identities. It is a very important technology in many fields, notably for combating cyber-criminality.

Last update: 21.10.2015

Responsible applicant and co-applicants



The Speed Submission to DIHARD II: Contributions & Lessons Learned
Sahidullah Md, Patino Jose, Cornell Samuele, Yin Ruiqing, Sivasankaran Sunit, Bredin Hervé, Korshunov Pavel, Brutti Alessio, Serizel Romain, Vincent Emmanuel, Evans Nicholas, Marcel Sébastien, Squartini Stefano, Barras Claude (2019), The Speed Submission to DIHARD II: Contributions & Lessons Learned, arXiv (Idiap-RR-1).
Low-latency speaker spotting with online diarization and detection
Patino Jose, Yin Ruiqing, Delgado Héctor, Bredin Hervé, Komaty Alain, Wisniewski Guillaume, Barras Claude, Evans Nicholas, Marcel Sébastien (2018), Low-latency speaker spotting with online diarization and detection, in Odyssey 2018 The Speaker and Language Recognition Workshop, ISCA, Les Sables d’Olonne.
Bob Speaks Kaldi
Cernak Milos, Komaty Alain, Mohammadi Amir, Anjos André, Marcel Sébastien (2017), Bob Speaks Kaldi, in Interspeech 2017, Stockholm, ISCA, Stockholm.


Author Komaty, Alain; Marcel, Sébastien
Publication date 17.01.2019
Persistent Identifier (PID) ODESSA
Repository ODESSA
The database contains 42 conversations in English, each between 2 speakers. The total number of speakers is 14. All participants signed a consent form agreeing that the collected data may be used for research purposes. In the conversation scenario, the two speakers read scripted lines from a script prepared in advance. There is no cross talk between speakers. Each speaker used a PC to connect. Each recording session is a brief transcribed VoIP conversation between two speakers. The session manager used a third PC to record the session while muting himself. All audio files are manually annotated by Idiap, and the ground truth is stored in Text and RTTM formats. The annotations include the beginnings and ends of the speech for each speaker, with local and global speaker IDs. Each transcribed reference is associated with its corresponding session, so that the database can also be used for speaker diarization, speech recognition, speaker recognition, and low-latency speaker spotting tasks.

Use-inspired outputs


Name Year
Bob Kaldi 2018


Speaker diarization is an unsupervised process which aims to identify each speaker within an audio stream and to determine when each speaker is active. It assumes that the number of speakers, their identities and their speech turns are all unknown.

Speaker diarization has become a key technology in many domains such as content-based information retrieval, voice biometrics, forensics and social-behavioral analysis. Examples of applications of speaker diarization include speech and speaker indexing, speaker recognition (in the presence of multiple speakers), speaker role detection, speech-to-text transcription, speech-to-speech translation and audiovisual content structuring.

Although speaker diarization has been studied for almost two decades, current state-of-the-art systems suffer from many limitations. Such systems are extremely domain-dependent. For instance, a speaker diarization system trained on radio/TV broadcast news experiences drastically degraded performance when tested on different types of recordings such as radio/TV debates, meetings, lectures, conversational telephone speech or conversational voice-over-IP speech. Overlapping speech, spontaneous speaking style, background noise, music and other non-speech sources (laughter, applause, etc.) are all nuisance factors which badly affect the quality of speaker diarization.

Furthermore, most existing work addresses the problem of offline speaker diarization: the system has full access to the entire audio recording beforehand and no real-time processing is required. Therefore, multi-pass processing over the same data is feasible and a range of elegant machine learning tools can be used. Nevertheless, such compromises are not admissible in real-time applications, especially when it comes to public security and the fight against terrorism and cyber-criminality.

Moreover, after an initial step of segmentation into speech turns, most approaches address speaker diarization as a bag-of-speech-turns clustering problem and do not take into account the inherent temporal structure of interactions between speakers. One goal of the project is to integrate this information and to exploit structured prediction techniques to improve over standard hierarchical clustering methods.

Speaker diarization is inherently related to speaker recognition. In recent years, the performance of state-of-the-art speaker recognition systems has improved enormously on account of new recognition paradigms such as i-vectors and deep learning, new session compensation techniques such as probabilistic linear discriminant analysis, and new score normalization techniques such as adaptive symmetric score normalization. However, existing speaker diarization systems do not take full advantage of these new techniques. Therefore, another goal of the project is to adapt them to speaker diarization, thereby filling this research gap in the current literature.

To evaluate the proposed algorithms and to ensure their generality, different existing databases will be considered, such as NIST SRE 2008 summed-channel telephone data, NIST RT 2003-2004 conversational telephone data, REPERE TV broadcast data and the AMI meeting corpus. Furthermore, we will collect a medium-size database that suits our main application involving the fight against cyber-criminality.
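
To make the "bag-of-speech-turns" baseline mentioned in the abstract concrete, the following is a minimal Python sketch of hierarchical (agglomerative) clustering over speech-turn embeddings. It is an illustration under stated assumptions, not the project's method: the function name `cluster_speech_turns`, the cosine distance, the average linkage, the stopping threshold and the toy 2-D embeddings are all hypothetical choices. The sketch also ignores the temporal order of the turns, which is the limitation the abstract proposes to address with structured prediction.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_speech_turns(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering of speech turns.

    Returns one cluster label per speech turn. The threshold is an
    assumed stopping criterion, not a value taken from the project.
    """
    # Start with every speech turn in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        # Stop once the closest pair is too dissimilar to merge.
        if best[0] > threshold:
            break
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy 2-D embeddings for four speech turns (hypothetical values).
turns = [
    (1.0, 0.1),  # turn 0
    (0.1, 1.0),  # turn 1
    (0.9, 0.2),  # turn 2
    (0.2, 0.9),  # turn 3
]
labels = cluster_speech_turns(turns, threshold=0.5)
print(labels)
```

With these toy vectors, turns 0 and 2 fall in one cluster and turns 1 and 3 in another. The order in which the turns occurred plays no role in the result, so any information carried by the turn-taking structure of the conversation is discarded.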