speaker diarization; speaker recognition; deep learning; i-vectors; cyber-criminality
Sahidullah Md, Patino Jose, Cornell Samuele, Yin Ruiqing, Sivasankaran Sunit, Bredin Hervé, Korshunov Pavel, Brutti Alessio, Serizel Romain, Vincent Emmanuel, Evans Nicholas, Marcel Sébastien, Squartini Stefano, Barras Claude (2019), The Speed Submission to DIHARD II: Contributions & Lessons Learned, arXiv preprint (Idiap-RR-1).
Patino Jose, Yin Ruiqing, Delgado Héctor, Bredin Hervé, Komaty Alain, Wisniewski Guillaume, Barras Claude, Evans Nicholas, Marcel Sébastien (2018), Low-latency speaker spotting with online diarization and detection, in Odyssey 2018 The Speaker and Language Recognition Workshop, ISCA, Les Sables d'Olonne.
Cernak Milos, Komaty Alain, Mohammadi Amir, Anjos André, Marcel Sébastien (2017), Bob Speaks Kaldi, in Interspeech 2017, ISCA, Stockholm.
Komati, Alain; Marcel, Sébastien
The database contains 42 conversations in English between 2 speakers. The total number of speakers is 14. All participants have signed a consent form agreeing for the collected data to be used for research purposes. The conversation scenario involves the two speakers reading scripted lines from a script prepared in advance. There is no cross-talk between speakers. Each speaker used a PC to connect, and each recording session is a brief transcribed VoIP conversation between two speakers. The session manager used a third PC to record the session while muting himself.

All audio files are manually annotated by Idiap, and the ground truth is stored in Text and RTTM formats. The annotations include the beginning and end of speech for each speaker, with local and global speaker IDs. Each transcribed reference is associated with its corresponding session, so that the database can also be used for speech diarization, speech recognition, speaker recognition, and low-latency speaker spotting tasks.
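The RTTM ground-truth files mentioned above follow the standard NIST RTTM convention and can be read with a few lines of Python. The sketch below is a generic illustration, not part of the database's tooling; only the onset, duration and speaker-name fields are used, and any file path is hypothetical.

```python
def load_rttm(path):
    """Return a list of (speaker_id, start_sec, end_sec) tuples from an RTTM file."""
    segments = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Keep only speaker records; RTTM field order is:
            # SPEAKER file chan onset duration ortho stype name conf slat
            if not fields or fields[0] != "SPEAKER":
                continue
            onset, duration = float(fields[3]), float(fields[4])
            speaker = fields[7]
            segments.append((speaker, onset, onset + duration))
    return segments
```

Segments loaded this way give, per speaker, the begin/end times needed to score diarization or speaker spotting output against the reference.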
Speaker diarization is an unsupervised process which aims to identify each speaker within an audio stream and to determine when each speaker is active. It assumes that the number of speakers, their identities and their speech turns are all unknown.

Speaker diarization has become a key technology in many domains such as content-based information retrieval, voice biometrics, forensics or social-behavioral analysis. Examples of applications of speaker diarization include speech and speaker indexing, speaker recognition (in the presence of multiple speakers), speaker role detection, speech-to-text transcription, speech-to-speech translation and audiovisual content structuring.

Although speaker diarization has been studied for almost two decades, current state-of-the-art systems suffer from many limitations. Such systems are extremely domain-dependent. For instance, a speaker diarization system trained on radio/TV broadcast news experiences drastically degraded performance when tested on a different type of recording, such as radio/TV debates, meetings, lectures, conversational telephone speech or conversational voice-over-IP speech. Overlapping speech, spontaneous speaking styles, background noise, music and other non-speech sources (laughter, applause, etc.) are all nuisance factors which degrade the quality of speaker diarization.

Furthermore, most existing work addresses the problem of offline speaker diarization: the system has full access to the entire audio recording beforehand and no real-time processing is required. Multi-pass processing over the same data is therefore feasible, and a range of elegant machine learning tools can be used.
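Because the number of speakers is unknown, the core step of most diarization systems is a threshold-stopped clustering of speech turns rather than a fixed-K partition. The sketch below illustrates this with average-linkage agglomerative clustering of speech-turn embeddings in NumPy; it is a generic illustration under assumed inputs (the embeddings and the cosine-distance threshold), not the project's actual system.

```python
import numpy as np

def cluster_turns(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering of speech-turn embeddings.

    Returns one integer cluster label per turn. Merging stops when the
    closest pair of clusters exceeds the cosine-distance threshold, so the
    number of speakers is not fixed in advance."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    clusters = [[i] for i in range(len(X))]           # one singleton per turn

    def dist(a, b):
        # Average cosine distance between all member pairs of two clusters.
        return 1.0 - np.mean(X[a] @ X[b].T)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break                                     # stopping criterion
        clusters[i] = clusters[i] + clusters[j]       # merge j into i
        del clusters[j]

    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

The quadratic pair search keeps the sketch short; production systems typically use more efficient linkage updates and stronger distance measures (e.g. PLDA scores between i-vectors) in place of raw cosine distance.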
Nevertheless, these compromises are not admissible in real-time applications, especially when it comes to public security and the fight against terrorism and cyber-criminality.

Moreover, after an initial step of segmentation into speech turns, most approaches address speaker diarization as a bag-of-speech-turns clustering problem and do not take into account the inherent temporal structure of interactions between speakers. One goal of the project is to integrate this information and to exploit structured prediction techniques to improve over standard hierarchical clustering methods.

Speaker diarization is inherently related to speaker recognition. In recent years, the performance of state-of-the-art speaker recognition systems has improved enormously on account of new recognition paradigms such as i-vectors and deep learning, new session compensation techniques such as probabilistic linear discriminant analysis, and new score normalization techniques such as adaptive symmetric score normalization. However, existing speaker diarization systems do not take full advantage of these new techniques. Another goal of the project therefore involves adapting them to speaker diarization, thus filling a gap in the current literature.

To evaluate the proposed algorithms and to ensure their generality, different existing databases will be considered, such as NIST SRE 2008 summed-channel telephone data, NIST RT 2003-2004 conversational telephone data, REPERE TV broadcast data and the AMI meeting corpus. Furthermore, we will collect a medium-size database that suits our main application involving the fight against cyber-criminality.
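To make the score-normalization technique named above concrete: adaptive symmetric score normalization (as-norm) rescales a trial score using the mean and standard deviation of the top-K highest scores that the enrollment and test utterances each obtain against a background cohort. A minimal sketch, in which the cohort score vectors and the value of K are assumed inputs:

```python
import numpy as np

def adaptive_snorm(score, enroll_cohort_scores, test_cohort_scores, top_k=100):
    """Adaptive symmetric score normalization (as-norm) of one trial score.

    enroll_cohort_scores / test_cohort_scores: scores of the enrollment and
    test utterances against a background cohort (hypothetical inputs)."""
    def top_stats(cohort_scores):
        # Keep only the K highest-scoring (i.e. closest) cohort members.
        top = np.sort(np.asarray(cohort_scores, dtype=float))[-top_k:]
        return top.mean(), top.std()

    mu_e, sd_e = top_stats(enroll_cohort_scores)
    mu_t, sd_t = top_stats(test_cohort_scores)
    # Symmetric normalization: average of the two z-normalized scores.
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)
```

Selecting only the top-K cohort scores is what makes the normalization "adaptive": the statistics are computed from the cohort speakers most similar to each side of the trial, rather than from the whole cohort.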