
Sparse and hierarchical Structures for Speech Modeling (SHISSM)

Applicant Bourlard Hervé
Number 175589
Funding scheme Project funding (Div. I-III)
Research institution IDIAP Institut de Recherche
Institution of higher education Idiap Research Institute - IDIAP
Main discipline Information Technology
Start/End 01.03.2018 - 28.02.2022
Approved amount 600'000.00

Keywords (8)

human auditory modeling; Automatic Speech Recognition (ASR); hierarchical sparse coding; deep neural networks (DNN); sparse recovery modeling; speech intelligibility modeling; deep architectures; hierarchical posterior-based ASR

Lay Summary (translated from French)

Lead
SHISSM (Sparse and Hierarchical Structures for Speech Modeling) aims at investigating and integrating emerging research areas in the context of speech modeling, including (1) Deep Neural Networks (DNNs); (2) posterior-based features and systems (as usually resulting from the DNN outputs); (3) sparse coding, seeking sparse representations of the processed signals; (4) compressive sensing and sparse recovery, aiming at modeling the speech signal in large-dimensional sparse spaces, usually resulting in simpler processing (e.g., recognition) algorithms; and (5) full exploitation of modern compute resources (big data, large-scale GPU-based processing).

Keywords: Automatic Speech Recognition (ASR), hierarchical posterior-based ASR, deep architectures, deep neural networks (DNN), sparse recovery modeling, hierarchical sparse coding, human auditory modeling, speech intelligibility modeling.
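To make item (2) concrete — an illustrative sketch, not the project's own code: posterior-based features are per-frame probability distributions over phone classes, typically obtained by applying a softmax to the outputs of a DNN acoustic model. A minimal NumPy version, with hypothetical dimensions standing in for a real network:

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_phones = 4, 10          # hypothetical: 4 frames, 10 phone classes
logits = rng.standard_normal((n_frames, n_phones))  # stand-in for DNN outputs

# Softmax per frame: each row becomes a posterior P(phone | acoustic frame).
logits -= logits.max(axis=1, keepdims=True)         # numerical stability
posteriors = np.exp(logits)
posteriors /= posteriors.sum(axis=1, keepdims=True)

print(posteriors.sum(axis=1))  # each row sums to 1: a valid distribution
```

These per-frame posterior vectors are what "posterior-based features" refers to: they can be fed into hierarchical or HMM-based back-ends instead of raw spectral features.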
Lay summary

SHISSM (Sparse and Hierarchical Structures for Speech Modeling) aims to investigate and integrate, in the context of speech modeling, the most recent and often complementary techniques, namely: (1) artificial neural networks, and in particular deep neural networks (DNNs); (2) systems based entirely on hierarchical posterior distributions; (3) sparse coding and high-dimensional sparse representations; (4) compressive sampling and sparse recovery; (5) exploitation of ever-growing compute (GPU) and data resources.

In this context, SHISSM will develop a better theoretical understanding within which deep hierarchical architectures and sparse (possibly binary) architectures can be formally integrated, following advanced statistical modeling methods and certain recently established links between sparse coding/representations and models such as hidden Markov models (HMMs).

SHISSM is thus a fairly multidisciplinary project integrating these different approaches in the context of speech modeling in general, and speech recognition in particular, although the project's impact should extend well beyond speech.

 

Last update: 08.03.2018

Responsible applicant and co-applicants

Employees

Associated projects

Number Title Start Funding scheme
169398 PHASER-QUAD: Parsimonious Hierarchical Automatic Speech Recognition and Query Detection 01.10.2016 Project funding (Div. I-III)
153507 PHASER: Parsimonious Hierarchical Automatic Speech Recognition 01.06.2014 Project funding (Div. I-III)

Abstract

Computer science is currently witnessing the emergence of major research activities, arising (more or less independently) from multiple disciplines (statistics, linear algebra, human brain research) and being applied in different ways in multiple application areas, including big data mining, statistical pattern recognition, computer vision, and speech processing (synthesis and recognition). These emerging research areas, which will be investigated in the context of SHISSM, include (1) an increasing focus on posterior-based features and systems; (2) the revival of (brain-inspired) artificial neural networks in the form of deep/hierarchical architectures (referred to as Deep Neural Networks, DNNs); (3) full exploitation of modern compute resources (big data, large-scale GPU-based processing); (4) sparse coding, seeking sparse representations of the processed signals; and (5) compressive sensing and sparse recovery, aiming at modeling the speech signal in large-dimensional sparse spaces (reminiscent of what is believed to happen in the human brain), resulting in simpler processing (e.g., recognition) algorithms.

Building upon several predecessor projects, which resulted in strong theoretical and experimental outcomes, SHISSM is thus very ambitious and aims at developing a better theoretical understanding allowing for a principled combination of 'deep' (hierarchical) and 'sparse' architectures, driven by advanced statistical modeling (posterior-based approaches, as estimated at the DNN outputs), compressive sensing, sparse recovery, and the formal links recently identified by the PI of this project between Hidden Markov Models (HMMs) and compressive sensing, also exploiting posterior distributions estimated by DNNs. SHISSM is thus an interdisciplinary project tying together these important emerging areas in the context of speech modeling, and Automatic Speech Recognition (ASR) in particular, although its impact is expected to go far beyond speech.
Ideally, the targeted framework should result in a unified model that is more performant on complex pattern recognition tasks, while also providing interesting biological motivations. In the particular context of speech modeling, and as already demonstrated through some of the PI's work (discussed later), the resulting approach should be more efficient and more performant than standard HMMs (or hybrid HMM/DNN systems, currently considered state-of-the-art), while also being more biologically sound, and relevant to other related areas such as speech synthesis, speech coding, and human speech intelligibility modeling.

This project proposal is intended to consolidate and extend the efforts currently being developed in the context of the Swiss NSF project PHASER-QUAD (2 years, 200020 169398), supporting two excellent PhD students (Pranay Dighe and Dhananjay Ram, who should have nearly defended their PhD theses by the start of the present project).

Exploiting recent developments around the above research areas, and further developing our various state-of-the-art machine learning and speech processing tools (often available as Idiap open source libraries, as well as through GitHub, or integrated in other open source toolkits like Kaldi), the resulting systems will be evaluated on several international benchmark databases, including, depending on the type of research, TIMIT (phones), Phonebook (words), Switchboard (sentences), AMI (noisy and conversational accented speech), GlobalPhone (multilingual), and ITU subjectively scored data (for human auditory modeling).
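As a concrete illustration of the sparse-recovery setting the abstract refers to — a minimal sketch, not drawn from the project's codebase — the following recovers a k-sparse vector from compressed linear measurements y = A x via orthogonal matching pursuit (OMP), a standard greedy algorithm; all dimensions and names here are illustrative assumptions:

```python
import numpy as np

def omp(A, y, k):
    """Greedy sparse recovery: estimate a k-sparse x from y = A @ x."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Select the dictionary column most correlated with the residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        support.append(j)
        # Least-squares refit on the current support; update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 100))      # 30 measurements of a 100-dim signal
A /= np.linalg.norm(A, axis=0)          # unit-norm dictionary columns
x_true = np.zeros(100)
x_true[[5, 40, 77]] = [1.0, -2.0, 0.5]  # a 3-sparse "signal"
x_hat = omp(A, A @ x_true, k=3)
print(float(np.max(np.abs(x_hat - x_true))))  # recovery error
```

The point of the example is the one the abstract makes: once the signal is known to live in a large-dimensional but sparse space, recognition-style processing can reduce to simple greedy or convex procedures over a dictionary.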