Project


PHASER-QUAD: Parsimonious Hierarchical Automatic Speech Recognition and Query Detection

English title PHASER-QUAD: Parsimonious Hierarchical Automatic Speech Recognition and Query Detection
Applicant Bourlard Hervé
Number 169398
Funding scheme Project funding (Div. I-III)
Research institution IDIAP Institut de Recherche
Institution of higher education Idiap Research Institute - IDIAP
Main discipline Information Technology
Start/End 01.10.2016 - 30.09.2019
Approved amount 344'131.00

Keywords (8)

Hierarchical sparse coding; Searching large multilingual audio archives; Hierarchical posterior-based acoustic modeling; Deep architectures; Parsimonious structures; Multilingual; Query-by-Example Spoken Term Detection; Automatic Speech Recognition

Lay Summary (French)

Lead
PHASER-QUAD takes a new perspective on the Automatic Speech Recognition (ASR) and Query-by-Example Spoken Term Detection (QbE-STD) problems, expressing them as hierarchical sparse recovery problems. Artificial neural networks (ANNs) are used to extract linguistic features from the speech signal in the form of posterior probabilities. These posteriors form a sparse representation of speech that lives in a low-dimensional space. The proposed approach is thus based on subspace modeling of posteriors, exploiting model parsimony and phono-lexical temporal properties through hierarchical decoding and detection procedures.
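As a rough illustration of the sparse-recovery view described above, the sketch below codes a single DNN posterior frame as a sparse combination of posterior exemplars. It is a toy example, not the project's implementation: the exemplar dictionary, the dimensions, and the plain ISTA solver are assumptions made only for illustration.

# A minimal sketch (assumed setup): a posterior frame x is modelled as a
# sparse combination of posterior exemplars stacked as columns of D,
# recovered with ISTA (iterative soft-thresholding).
import numpy as np

def ista_sparse_code(x, D, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 by ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

# Toy illustration with hypothetical dimensions: 40 phone classes and a
# dictionary of 300 posterior exemplars.
rng = np.random.default_rng(0)
D = np.abs(rng.standard_normal((40, 300)))
D /= D.sum(axis=0)                          # columns behave like posterior vectors
x = 0.7 * D[:, 5] + 0.3 * D[:, 42]          # a frame lying in a low-dimensional subspace
a = ista_sparse_code(x, D)
print("active exemplars:", np.flatnonzero(np.abs(a) > 1e-3))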
Lay summary

Parsimonious Hierarchical Automatic Speech Recognition and Query Detection

Lead

PHASER-QUAD takes shape as a new approach to the ASR and QbE-STD problems, formulating them as hierarchical sparse recovery problems. Artificial Neural Networks (ANNs) are used to extract linguistic features from the speech signal in the form of posterior probabilities. These posterior probabilities correspond to a sparse representation of speech in a low-dimensional space. The proposed approach is therefore based on modeling the posterior probabilities in this subspace, exploiting model parsimony as well as phono-lexical temporal properties, through hierarchical decoding and detection procedures.

Content and objectives of the research project

The goal of this project is to exploit and integrate recent advances in posterior-probability-based ASR and QbE-STD, hybrid HMM/ANN systems exploiting Hidden Markov Models (HMMs), Deep Neural Networks, compressive sensing, and hierarchical sparse coding, in order to build ASR and QbE-STD systems that are both more accurate and less dependent on the language or application domain.

Scientific and social context of the research project

The proposed ASR and QbE-STD systems are able to exploit the hidden structure of large amounts of speech data with only few a priori assumptions. The systems will be released as open source, distributed by Idiap, and evaluated on challenging speech recognition and query detection tasks, including multilingual speech corpora with noise and accented conversational speech.

Keywords

Hierarchical parsimonious models, deep neural networks, structure, low-dimensional models.

Last update: 27.09.2016

Responsible applicant and co-applicants

Employees

Publications

Low-rank and sparse subspace modeling of speech for DNN based acoustic modeling
Dighe Pranay, Asaei Afsaneh, Bourlard Hervé (2019), Low-rank and sparse subspace modeling of speech for DNN based acoustic modeling, in Speech Communication, 109, 34-45.
Multilingual Bottleneck Features for Query by Example Spoken Term Detection
Ram Dhananjay, Miculicich Lesly, Bourlard Hervé (2019), Multilingual Bottleneck Features for Query by Example Spoken Term Detection, in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU); arXiv preprint arXiv:1907.00443.
Sparse Subspace Modeling for Query by Example Spoken Term Detection
Ram Dhananjay, Asaei Afsaneh, Bourlard Herve (2018), Sparse Subspace Modeling for Query by Example Spoken Term Detection, in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(6), 1130-1143.
CNN based Query by Example Spoken Term Detection
Ram Dhananjay, Miculicich Lesly, Bourlard Hervé (2018), CNN based Query by Example Spoken Term Detection, in Proceedings of Interspeech 2018, India, 92-96.
Phonetic subspace features for improved query by example spoken term detection
Ram Dhananjay, Asaei Afsaneh, Bourlard Hervé (2018), Phonetic subspace features for improved query by example spoken term detection, in Speech Communication, 103, 27-36.
Phonological Posterior Hashing for Query by Example Spoken Term Detection
Asaei Afsaneh, Ram Dhananjay, Bourlard Hervé (2018), Phonological Posterior Hashing for Query by Example Spoken Term Detection, in Proceedings of Interspeech 2018, Hyderabad, 2067-2071.
Perceptual Information Loss due to Impaired Speech Production
Asaei Afsaneh, Cernak Milos, Bourlard Herve (2017), Perceptual Information Loss due to Impaired Speech Production, in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12), 2433-2443.
Sparse Pronunciation Codes for Perceptual Phonetic Information Assessment
Asaei Afsaneh, Cernak Milos, Bourlard Hervé, Ram Dhananjay (2017), Sparse Pronunciation Codes for Perceptual Phonetic Information Assessment, in Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS), Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS), Cambridge.
Subspace Regularized Dynamic Time Warping for Spoken Query Detection
Ram Dhananjay, Asaei Afsaneh, Bourlard Hervé (2017), Subspace Regularized Dynamic Time Warping for Spoken Query Detection, in Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS), Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS), Lisbon.
Sparse Modeling of Posterior Exemplars for Keyword Detection
Ram Dhananjay, Asaei Afsaneh, Dighe Pranay, Bourlard Hervé (2016), Sparse Modeling of Posterior Exemplars for Keyword Detection, in Proceedings of Interspeech 2015, Germany, 3690-3694.
Subspace detection of DNN posterior probabilities via sparse representation for query by example spoken term detection
Ram Dhananjay, Asaei Afsaneh, Bourlard Hervé (2016), Subspace detection of DNN posterior probabilities via sparse representation for query by example spoken term detection, in Proceedings of Interspeech 2016, 918-922.

Datasets

AMI corpus

Author AMI, corpus
Persistent Identifier (PID) n/a
Repository AMI corpus


Mediaparl

Author Imseng, David
Persistent Identifier (PID) n/a
Repository Mediaparl


Associated projects

Number Title Start Funding scheme
175589 Sparse and hierarchical Structures for Speech Modeling (SHISSM) 01.03.2018 Project funding (Div. I-III)
144281 Adaptive Multilingual Speech Processing 01.10.2012 Project funding (Div. I-III)
153507 PHASER: Parsimonious Hierarchical Automatic Speech Recognition 01.06.2014 Project funding (Div. I-III)

Abstract

This project proposal is intended to merge and extend the support for two ongoing and strongly complementary projects on "Parsimonious Hierarchical Automatic Speech Recognition" (PHASER, 200021-153507) and "Adaptive Multilingual Speech Processing" (A-MUSE, 200020-144281), to cover the last two years of the respectively funded PhD students Pranay Dighe (PHASER) and Dhananjay Ram (A-MUSE), in addition to Dr. Afsaneh Asaei (PHASER), one of the leading postdocs in the field, who will assist with PhD student supervision.

After a brief overview of the last two years of achievements in the two projects, we describe in detail the research activities foreseen to further anchor the novel parsimonious and hierarchical paradigm for the closely related tasks of speech recognition and query detection (hence the project acronym PHASER-QUAD). The goal of this project is to exploit and integrate in a principled way recent developments in posterior-based Automatic Speech Recognition (ASR) and Query-by-Example Spoken Term Detection (QbE-STD) systems, hybrid HMM/ANN systems exploiting Hidden Markov Models (HMMs) and Artificial Neural Networks (ANNs), Deep Neural Networks (a particular form of ANN with a deep/hierarchical and nonlinear architecture), compressive sensing, subspace modeling, and hierarchical sparse coding for ASR and QbE-STD.

The resulting framework, which we have been building upon quite successfully, relies on strong relationships between standard HMM techniques (with HMM states as latent variables) and the standard compressive sensing formalism, where the atoms of the compressive dictionary are directly related to posterior distributions of HMM states. The proposed research thus takes a new perspective on speech acoustic modeling as a sparse recovery problem, which takes low-dimensional observations (at the rate of acoustic features) and provides a high-dimensional sparse inference (at the rate of words) while preserving the linguistic information as well as temporal and lexical constraints.

To that end, we have proposed a novel paradigm for speech recognition and spoken query detection based on sparse subspace modeling of posterior exemplars. To further develop the hierarchical, sparsity-based ASR and QbE-STD systems, and to demonstrate their potential on the hardest benchmark tasks, several challenging problems will be addressed over the next two years (resulting in two distinct high-quality PhDs): (1) sparse posterior modeling tailored to speech recognition and detection objectives, (2) exploiting the low-dimensional structure of the posterior space for unsupervised adaptation in unseen acoustic conditions, and (3) hierarchical structured architectures that can go beyond the topological constraints of HMMs for high-level linguistic inference.

Exploiting and further developing our various state-of-the-art speech processing tools (often available as Idiap open source or integrated in other systems such as Kaldi), the resulting systems will be evaluated on three different, very challenging databases: GlobalPhone (multilingual), AMI (noisy and conversational accented speech), and MediaEval for QbE-STD.
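As a rough illustration of the posterior-based QbE-STD component mentioned above, the sketch below scores a spoken query against a search utterance by aligning sequences of posterior vectors with subsequence dynamic time warping. It is a toy example under assumed settings: the distance measure, the DTW variant, and all dimensions are illustrative and not the project's actual pipeline.

# A minimal sketch (assumed setup): query and utterance are sequences of
# DNN posterior frames (rows); subsequence DTW finds the best-matching
# region and the normalized path cost serves as the detection score.
import numpy as np

def frame_distance(q, u, eps=1e-8):
    """Pairwise -log inner product between posterior frames (rows)."""
    return -np.log(q @ u.T + eps)

def subsequence_dtw_score(query, utterance):
    """Best average-per-frame alignment cost of query inside utterance."""
    dist = frame_distance(query, utterance)
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, :] = 0.0                         # query may start at any utterance frame
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                                 acc[i - 1, j],
                                                 acc[i, j - 1])
    return acc[n, 1:].min() / n             # lower score means a more likely hit

# Toy illustration with hypothetical 40-dimensional posterior frames.
rng = np.random.default_rng(1)
def random_posteriors(t, d=40):
    p = np.abs(rng.standard_normal((t, d)))
    return p / p.sum(axis=1, keepdims=True)

query = random_posteriors(20)
utterance = np.vstack([random_posteriors(50), query, random_posteriors(50)])
print("detection score:", subsequence_dtw_score(query, utterance))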