
Unified Speech Processing Framework for Trustworthy Speaker Recognition (UniTS)

Applicant Magimai-Doss Mathew
Number 159886
Funding scheme Project funding (Div. I-III)
Research institution IDIAP Institut de Recherche
Institution of higher education Idiap Research Institute - IDIAP
Main discipline Information Technology
Start/End 01.07.2015 - 30.06.2019
Approved amount 334'085.00

Keywords (5)

Speaker verification; Machine learning; Biometrics; Anti-spoofing; Speaker Recognition

Lay Summary (French)

Lead
The principle of automatic speaker recognition is to authenticate or identify a person through their voice. The goal of the project "Unified Speech Processing Framework for Trustworthy Speaker Recognition (UniTS)" is to develop novel features and models for trustworthy speaker recognition. Specifically, it consists of:

1. Learning novel speaker-related features directly from the raw signal using deep learning.

2. Developing anti-spoofing techniques against identity fraud.

The UniTS project is a collaboration between researchers from the Speech and Audio Processing group (Dr. Mathew Magimai Doss, PI) and the Biometrics group (Dr. Sebastien Marcel, co-PI) at Idiap. The project funds a PhD student for a period of three years and a postdoctoral researcher for a period of one year.
Lay summary
The goal of automatic speaker recognition is to authenticate or identify a person through their voice. State-of-the-art speaker recognition systems are typically based on short-term spectral features and employ a series of compensation methods to achieve low error rates. This has two main limitations. First, sufficient data must be available to train a model for each speaker. Second, speaker verification systems can be spoofed or attacked; in particular, systems based on standard spectral features are vulnerable to malicious attacks.

The project "Unified Speech Processing Framework for Trustworthy Speaker Recognition (UniTS)" aims to address these two limitations by:

1. Learning novel speaker-related features directly from the raw signal using deep learning.

2. Developing anti-spoofing techniques against identity fraud.

The UniTS project is a collaboration between researchers from the Speech and Audio Processing group (Dr. Mathew Magimai Doss, PI) and the Biometrics group (Dr. Sebastien Marcel, co-PI) at Idiap. The project funds a PhD student for a period of three years and a postdoctoral researcher for a period of one year.

Last update: 13.07.2015

Lay Summary (English)

Lead
Automatic speaker recognition is about authenticating or identifying a person through the speech signal. The goal of the project "Unified Speech Processing Framework for Trustworthy Speaker Recognition (UniTS)" is to develop novel features and models for trustworthy speaker recognition. Specifically, it focuses on:

1. Learning novel speaker-related features directly from the raw speech signal using up-and-coming deep learning techniques.

2. Developing anti-spoofing techniques, i.e. countermeasures, to protect speaker recognition systems against spoofing attacks.

The UniTS project is a collaboration between researchers from the Speech and Audio Processing group (Dr. Mathew Magimai Doss, PI) and the Biometrics group (Dr. Sebastien Marcel, co-PI) at Idiap. It funds a PhD student for a period of three years and a postdoctoral researcher for a period of one year.
Lay summary

The goal of automatic speaker recognition is to authenticate (referred to as speaker verification) or to identify (referred to as speaker identification) a person through the speech signal. State-of-the-art speaker recognition systems are typically based on short-term spectral features and employ a series of compensation methods to achieve low error rates. This approach has two main limitations. First, it requires sufficient data to train a model for each speaker. Second, speaker verification systems can be spoofed or attacked; in particular, systems based on standard spectral features are vulnerable to malicious attacks.

The project "Unified Speech Processing Framework for Trustworthy Speaker Recognition (UniTS)" aims to address these two limitations by:

1. Learning novel speaker-related features directly from the raw speech signal using up-and-coming deep learning techniques.

2. Developing anti-spoofing techniques, i.e. countermeasures, to protect speaker recognition systems against spoofing attacks.

The UniTS project is a collaboration between researchers from the Speech and Audio Processing group (Dr. Mathew Magimai Doss, PI) and the Biometrics group (Dr. Sebastien Marcel, co-PI) at Idiap. It funds a PhD student for a period of three years and a postdoctoral researcher for a period of one year.
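The first research line, learning features directly from the raw waveform, can be pictured as a convolutional first layer sliding short filters over the signal samples. The sketch below is a minimal NumPy illustration of that single step only; the filter bank, kernel length, and stride are made-up values for demonstration, not the project's actual architecture (in a real CNN the filters would be learned from data by backpropagation):

```python
import numpy as np

def conv1d(signal, filters, stride):
    """Valid 1-D convolution: slide each filter over the raw waveform."""
    flen = filters.shape[1]
    n_out = (len(signal) - flen) // stride + 1
    out = np.empty((filters.shape[0], n_out))
    for i in range(n_out):
        out[:, i] = filters @ signal[i * stride : i * stride + flen]
    return out

rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)     # 1 s of synthetic "speech" at 16 kHz
filters = rng.standard_normal((20, 300))  # 20 kernels of ~19 ms (illustrative, untrained)
features = np.maximum(conv1d(waveform, filters, stride=10), 0.0)  # ReLU
# features.shape == (20, 1571): one activation map per filter over time
```

In a trained system, deeper convolutional and fully connected layers would map these activations to speaker identities; the point here is only that the network's input is the waveform itself rather than hand-crafted spectral features.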

Last update: 13.07.2015

Responsible applicant and co-applicants

Employees

Publications

Publication
Understanding and Visualizing Raw Waveform-Based CNNs
Muckenhirn Hannah, Abrol Vinayak, Magimai-Doss Mathew, Marcel Sébastien (2019), Understanding and Visualizing Raw Waveform-Based CNNs, in Interspeech 2019, Graz, Austria, International Speech Communication Association, International Speech Communication Association Archive.
On Learning to Identify Genders from Raw Speech Signal Using CNNs
Kabil Selen Hande, Muckenhirn Hannah, Magimai.-Doss Mathew (2018), On Learning to Identify Genders from Raw Speech Signal Using CNNs, in Interspeech 2018, International Speech Communication Association, International Speech Communication Association Archive.
On Learning Vocal Tract System Related Speaker Discriminative Information from Raw Signal Using CNNs
Muckenhirn Hannah, Magimai.-Doss Mathew, Marcel Sébastien (2018), On Learning Vocal Tract System Related Speaker Discriminative Information from Raw Signal Using CNNs, in Interspeech 2018, Hyderabad, India, ISCA, ISCA Archive.
Towards directly modeling raw speech signal for speaker verification using CNNs
Muckenhirn Hannah, Magimai.-Doss Mathew, Marcel Sébastien (2018), Towards directly modeling raw speech signal for speaker verification using CNNs, in IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, IEEE Xplore.
End-to-End Convolutional Neural Network-based Voice Presentation Attack Detection
Muckenhirn Hannah, Magimai.-Doss Mathew, Marcel Sébastien (2017), End-to-End Convolutional Neural Network-based Voice Presentation Attack Detection, in Proceedings of International Joint Conference on Biometrics, IEEE, IEEE Xplore.
Long-Term Spectral Statistics for Voice Presentation Attack Detection
Muckenhirn Hannah, Korshunov Pavel, Magimai.-Doss Mathew, Marcel Sébastien (2017), Long-Term Spectral Statistics for Voice Presentation Attack Detection, in IEEE/ACM Transactions on Audio, Speech and Language Processing, 25(11), 2098-2111.
Overview of BTAS 2016 Speaker Anti-spoofing Competition
Korshunov Pavel, Marcel Sébastien, Muckenhirn Hannah, Gonçalves A. R., Mello A. G. Souza, Violato R. P. Velloso, Simões Flávio, Uliani Neto Mário, de Assis Angeloni Marcus, Stuchi J. A., Dinkel H, Chen N, Qian Yanmin, Paul D, Saha G, Sahidullah Md (2016), Overview of BTAS 2016 Speaker Anti-spoofing Competition, in IEEE International Conference on Biometrics: Theory, Applications and Systems, IEEE, IEEE Xplore.
Presentation Attack Detection Using Long-Term Spectral Statistics for Trustworthy Speaker Verification
Muckenhirn Hannah, Magimai.-Doss Mathew, Marcel Sébastien (2016), Presentation Attack Detection Using Long-Term Spectral Statistics for Trustworthy Speaker Verification, in International Conference of the Biometrics Special Interest Group (BIOSIG), IEEE, IEEE Xplore.

Collaboration

Group / person Country
Types of collaboration
Department of Signal Theory, Telematics and Communications, University of Granada Spain (Europe)
- Exchange of personnel

Scientific events

Active participation

Title Type of contribution Title of article or contribution Date Place Persons involved
Interspeech Poster Understanding and Visualizing Raw Waveform-based CNNs 17.09.2019 Graz, Austria Magimai-Doss Mathew;
Valais AI Workshop (5th edition) Talk given at a conference Visualizing and understanding raw speech modeling with convolutional neural networks 03.05.2019 Martigny, Switzerland Muckenhirn Hannah;
Interspeech Poster On Learning Vocal Tract System Related Speaker Discriminative Information from Raw Signal Using CNNs 04.09.2018 Hyderabad, India Magimai-Doss Mathew;
Interspeech Poster On Learning to Identify Genders from Raw Speech Signal Using CNNs 03.09.2018 Hyderabad, India Magimai-Doss Mathew;
Google's 3rd Speech Technology Summit Poster Towards Modeling Raw Speech Signal for Speaker Verification in an End-to-End Manner Using CNNs 02.05.2018 London, Great Britain and Northern Ireland Muckenhirn Hannah;
IEEE International Conference on Acoustics, Speech and Signal Processing Talk given at a conference Towards directly modeling raw speech signal for speaker verification using CNNs 18.04.2018 Calgary, Canada Muckenhirn Hannah;
International Joint Conference on Biometrics Talk given at a conference End-to-End Convolutional Neural Network-based Voice Presentation Attack Detection 01.10.2017 Denver, United States of America Muckenhirn Hannah;
Swiss Machine Learning Day 2016 Talk given at a conference CNN-based presentation attack detection for trustworthy speaker verification 23.11.2016 Lausanne, Switzerland Muckenhirn Hannah;
International Conference of the Biometrics Special Interest Group (BIOSIG) Talk given at a conference Presentation Attack Detection Using Long-Term Spectral Statistics for Trustworthy Speaker Verification 22.09.2016 Darmstadt, Germany Muckenhirn Hannah;


Abstract

The goal of automatic speaker recognition is to recognize persons through their voice. Automatic speaker verification is a subtask of speaker recognition in which the goal is to verify or authenticate a person. State-of-the-art speaker verification systems typically model short-term spectrum-based features, such as mel-frequency cepstral coefficients (MFCCs), with a generative model such as Gaussian mixture models (GMMs), and employ a series of compensation methods to achieve low error rates. This approach has two main limitations. First, it requires sufficient training data for each speaker for robust modeling, and sufficient test data to apply the series of compensation techniques when verifying a speaker. Second, the speaker verification system is prone to malicious attacks, for instance through a voice conversion (VC) or text-to-speech (TTS) system. The main reason is that the front-end features and back-end models of the speaker verification system, namely MFCCs and GMMs, are similar to those of VC and TTS systems.

The proposed project aims to address these limitations through the development of novel approaches for trustworthy speaker verification. To achieve that, through a collaboration between researchers from the Speech and Audio Processing group and the Biometrics group at Idiap, the proposed project focuses on two lines of work:

1. In the ongoing DeepSTD project funded by the HASLER Foundation, it was shown, in the context of speech recognition, that recognition systems can be built by directly modeling raw speech signals using artificial neural networks. The proposed project aims to build on that approach to develop a generic approach that can be used for both speaker verification and speaker diarization.

2. In a collaborative study with researchers from the University of Eastern Finland and Nanyang Technological University (Singapore), Idiap has developed a countermeasure approach for state-of-the-art speaker verification systems. The proposed project aims to extend this approach, along with developing novel anti-spoofing countermeasures using binary features and text-dependent speaker verification.
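To make the conventional pipeline described above concrete, the toy sketch below scores a verification trial as a log-likelihood ratio between a claimed speaker's GMM and a universal background model (UBM), both with diagonal covariances. All parameters are hand-set illustrative values and `verify_trial` is a hypothetical name; a real system would extract MFCC frames from speech and MAP-adapt the speaker model from a UBM trained on many speakers:

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X under a diagonal-covariance GMM."""
    diff = X[:, None, :] - means[None, :, :]                      # (frames, comps, dim)
    log_comp = (-0.5 * (np.log(2 * np.pi * variances) + diff**2 / variances).sum(axis=-1)
                + np.log(weights))                                # (frames, comps)
    m = log_comp.max(axis=1, keepdims=True)                       # log-sum-exp trick
    return float(np.mean(m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))))

def verify_trial(frames, claimed_model, ubm, threshold=0.0):
    """Accept the identity claim if the log-likelihood ratio exceeds the threshold."""
    llr = gmm_loglik(frames, *claimed_model) - gmm_loglik(frames, *ubm)
    return llr, llr > threshold

rng = np.random.default_rng(1)
dim = 5                                                           # toy feature dimension
claimed = (np.array([1.0]), np.full((1, dim), 2.0), np.ones((1, dim)))  # 1-comp "speaker GMM"
ubm = (np.array([1.0]), np.zeros((1, dim)), np.ones((1, dim)))          # 1-comp "UBM"
frames = rng.standard_normal((50, dim)) + 2.0                     # frames near the claimed model
llr, accepted = verify_trial(frames, claimed, ubm)                # positive ratio -> accept
```

A spoofing attack succeeds precisely when synthetic or converted speech pushes this ratio above the threshold, which is why the abstract argues for countermeasures that look beyond the same MFCC/GMM representation.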