
NAST: Neural Architectures for Speech Technology

Applicant Garner Philip
Number 185010
Funding scheme Project funding (Div. I-III)
Research institution IDIAP Institut de Recherche
Institution of higher education Idiap Research Institute - IDIAP
Main discipline Information Technology
Start/End 01.02.2020 - 31.01.2024
Approved amount 496'238.00

Keywords (3)

Physiology; Speech processing; Neural networks

Lay Summary (translated from German)

NAST aims to develop a toolbox of components that model physical processes, yet also fit into general neural networks.
Lay summary
In recent years, deep learning has made quite significant advances in many areas, including speech processing. One reason is that neural networks can model and learn complex functions. However, even to experts, they appear as something of a black box. This presents a dilemma for scientists who want to model biological systems, which operate according to clear mechanical principles.

In NAST, we want to build components that are much closer to these mechanical systems than the current neural black boxes. However, we will design these components so that they fit into the general learning mechanism of neural networks, allowing the resulting deep networks to use each component as required.

Some of these components will be designed to work with biological neural networks, which emit impulses (spikes) rather than numbers. We will investigate how best to combine such spiking neural networks with conventional deep learning and physical components.

The result will be a kind of toolkit of new biologically inspired and interpretable components. Besides improving the resulting speech technology, we hope to enable connections with scientists who study the biological processes that these components model.
Last update: 28.06.2019

Responsible applicant and co-applicants


Associated projects

Number Title Start Funding scheme
152495 SP2: Scopes Project on Speech Prosody 01.04.2014 SCOPES
141903 SIWIS: Spoken Interaction with Interpretation in Switzerland 01.12.2012 Sinergia
165545 MASS: Multilingual Affective Speech Synthesis 01.05.2017 Project funding (Div. I-III)


Recent years have seen the replacement of many component technologies in speech processing with deep learning using deep neural networks (DNNs). Such systems typically perform much better than the older component technologies. The same period has (not independently) seen these technologies move from academia to industry. Areas such as speech recognition and synthesis that were once regarded as research fields are now ubiquitous applications, typically in the context of the "GAFA" (Google, Apple, Facebook, Amazon) companies. These companies have led advances based on large data and computational resources, and more recently on end-to-end approaches.

Responding to this trend, much academic research in speech technology has moved into what might be called peripheral technologies. In the case of the applicant here in Switzerland, that has meant focusing on geographical issues, in particular multilinguality and the closely related issues of paralinguistics and adaptation. Of course, many of the solutions to these issues lie in deep learning; however, the data resources can be rather small, putting us in a position to compete with the GAFA companies.

In research threads on multilingual recognition and emotional synthesis, we are finding that, in order to do deep learning with such limited resources, it is helpful to cast techniques from signal processing, from bio-inspired computing and from Bayesian statistics into neural components.
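To illustrate what casting a physical process into a neural component can mean, the sketch below (our own illustration, not the project's code; the function and parameter names are hypothetical) implements a Fujisaki-style critically damped second-order muscle response as a two-pole recursion. Because the recursion is differentiable in its single time constant, such a filter could in principle sit inside a larger network and be trained by gradient descent like any other recurrent unit:

```python
import math

def muscle_response(x, alpha, dt=0.005):
    """Critically damped second-order filter with continuous-time impulse
    response alpha^2 * t * exp(-alpha * t), as used in Fujisaki-style
    intonation models.  Realised as a two-pole IIR recursion, so it acts
    like a tiny linear recurrent unit whose single parameter alpha could
    be learned end-to-end."""
    p = math.exp(-alpha * dt)        # repeated real pole of the discretised system
    b = alpha * alpha * dt * dt * p  # gain matching the sampled impulse response
    y = [0.0] * len(x)
    for n in range(len(x)):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        xd = x[n - 1] if n >= 1 else 0.0
        y[n] = 2.0 * p * y1 - p * p * y2 + b * xd
    return y
```

Driven by an impulse, the output rises smoothly and peaks near t = 1/alpha, mimicking the delayed, smooth contraction of a muscle rather than the instantaneous response of an abstract activation.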
Integrated into neural networks, these components provide "explainability" that is not present in abstract sigmoid units; it is then clear how they might be adapted to variations in speaker and language.

In NAST, the objective is to consolidate the two (application-directed) research themes above into a single theme around neural architectures. Specifically, the EU H2020 project SUMMA, although geared towards multilingual speech recognition, is yielding results in Bayesian methods and recurrence. The SNSF project MASS, focusing on emotional synthesis, has cast muscle models as neural components. We intend to blur the distinction between recognition and synthesis, since the proposed techniques are applicable to both; this also reflects theories of physiological processes.

We aim to create what might be called a toolkit of neural techniques. This toolkit already contains rudimentary muscle models, initial Bayesian recurrence and vocal tract warping. A key feature of all the neural techniques is that they will be trainable in an end-to-end manner. This will allow them to be fully optimised in the context of the application at hand, be it recognition or synthesis of speech or emotion.

In a first thread, we propose to extend the muscle models developed for intonation synthesis by driving them with spiking neurons. Whilst quite ambitious, this thread builds on initial work by a masters student, and is written with multiple opportunities to back off to more conventional techniques. Indeed, the most likely and influential outcome will be a hybrid of spiking and conventional neurons in a coherent framework.

In a second, more incremental thread, we propose to consolidate the work of two doctoral students. In finishing their doctoral studies, they will provide neural components for the toolbox that will feed into their own work, into the first thread above, and into a new task on factoring waveform synthesis.
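As a concrete illustration of the spiking/conventional hybrid idea (again our own sketch under stated assumptions, not the project's implementation; names and constants are hypothetical), a leaky integrate-and-fire neuron emits binary spikes through a hard threshold. The threshold is not differentiable, so end-to-end training commonly keeps the hard spikes on the forward pass and substitutes a smooth surrogate gradient on the backward pass:

```python
def lif_forward(inputs, beta=0.9, threshold=1.0):
    """Leaky integrate-and-fire neuron: the membrane potential decays by a
    factor `beta` per step, integrates the input, and emits a binary spike
    (followed by a soft reset) whenever it reaches `threshold`."""
    v = 0.0
    spikes = []
    for x in inputs:
        v = beta * v + x              # leak, then integrate
        s = 1.0 if v >= threshold else 0.0
        v -= s * threshold            # soft reset on spiking
        spikes.append(s)
    return spikes

def surrogate_grad(v, threshold=1.0, slope=5.0):
    """Stand-in derivative for the non-differentiable spike function: the
    derivative of a fast sigmoid centred on the threshold.  A hybrid
    network would use the hard spikes forward and this gradient backward,
    letting spiking units coexist with conventional ones in one graph."""
    return 1.0 / (1.0 + slope * abs(v - threshold)) ** 2
```

Under a constant sub-threshold input the neuron fires intermittently rather than on every step, which is the pulse-like behaviour the lay summary contrasts with conventional number-emitting units.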
Each thread is written to allow interactions both within the thread and between the two, with many components being reused across tasks.

In a technology sense, we hope that the work will allow adaptation to speakers, to emotions and to languages with better quality and on smaller amounts of data. For particularly under-resourced or localised dialects, we hope to enable these capabilities where they would not otherwise exist. The resulting networks will have fewer parameters, allowing them to be smaller and faster. More generally, the tools will enable the concept of "explainability" in DNNs; rather than seeking meaning in networks of otherwise abstract activations, we provide activations that are fundamentally based on explainable processes.

The toolbox, whilst being distributed across several open-source packages, will enable transfer of the technology to the academic community, to industrial collaborators here in Switzerland, and hopefully to the GAFA companies. The students will be complemented by several post-doctoral researchers in the Idiap speech group working on Innosuisse, EU and industrial projects; these researchers will aid both the research and its potential impact. In a more philosophical sense, we hope to build a bridge between the engineering of the GAFA companies and the speech "sciences" to which academic speech technology has often looked for inspiration.