Project


SP2: Scopes Project on Speech Prosody

English title SP2: Scopes Project on Speech Prosody
Applicant Garner Philip
Number 152495
Funding scheme SCOPES
Research institution IDIAP Institut de Recherche
Institution of higher education Idiap Research Institute - IDIAP
Main discipline Information Technology
Start/End 01.04.2014 - 31.05.2016
Approved amount 239'820.00

Keywords (6)

speech recognition; speech synthesis; prosody; multilingual; speech-to-speech translation; emotion

Lay Summary (French)

Lead
Prosody is an important yet poorly understood aspect of human speech. It helps convey information such as emotions, intentions and attitudes. Automatic speech processing must take these functions of prosody into account, which can make it possible to synthesise emotions in speech and to reflect particular attitudes while still sounding like a specific person's voice. On the automatic speech recognition side, these functions must be detected and analysed, especially when the goal is automatic interpretation, that is, automatic speech translation, which combines the two core techniques, synthesis and recognition.
Lay summary

Content and objectives of the research work
Since research on prosody is very intensive and diverse, the efforts of the 4 partners are manifold but convergent, and aim to (i) better understand the phenomenon of prosody and its role in communication, (ii) test different prosodic models and their effectiveness, (iii) develop and evaluate technologies for capturing and/or synthesising prosodic elements, (iv) test possible ways of transferring prosody from one language to another, (v) extend existing databases where necessary, etc.

Scientific and social context of the research project
Automatic speech processing is a priority research direction today. Prosody, not yet fully integrated into this process, is a key element of applications that assist or complement high-level human and human-machine communication. The research results will be directly applicable in the field of speech synthesis and recognition. Another important contribution of this cooperation is the broadening of the range of languages studied and the transfer of results and technology between the partners, which should also promote the creation of an expert group.





Last update: 15.04.2014

Publications

Publication
Prosodic stress detection for fixed stress languages using formal atom decomposition and a statistical hidden Markov hybrid
Szaszák György, Tündik Máté Ákos, Gerazov Branislav (2018), Prosodic stress detection for fixed stress languages using formal atom decomposition and a statistical hidden Markov hybrid, in Speech Communication, 102, 14-26.
Intonation modelling using a muscle model and perceptually weighted matching pursuit
Honnet Pierre-Edouard, Gerazov Branislav, Gjoreski Aleksandar, Garner Philip N. (2018), Intonation modelling using a muscle model and perceptually weighted matching pursuit, in Speech Communication, 97, 81-93.
Atom decomposition based stress detection and automatic phrasing of speech
Tündik Máté Ákos, Gerazov Branislav, Gjoreski Aleksandar, Szaszák György (2016), Atom decomposition based stress detection and automatic phrasing of speech, in 2016 7th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), Wroclaw, Poland, IEEE, Poland.
Estimating the Sincerity of Apologies in Speech by DNN Rank Learning and Prosodic Analysis
Gosztolya Gábor, Grósz Tamás, Szaszák György, Tóth László (2016), Estimating the Sincerity of Apologies in Speech by DNN Rank Learning and Prosodic Analysis, in Proceedings of Interspeech, San Francisco, ISCA, San Francisco.
Improving HMM speech synthesis of interrogative sentences by pitch track transformations
Nagy Péter, Németh Géza (2016), Improving HMM speech synthesis of interrogative sentences by pitch track transformations, in Speech Communication, 82, 97-112.
Probabilistic Amplitude Demodulation Features in Speech Synthesis for Improving Prosody
Lazaridis Alexandros, Cernak Milos, Garner Philip N. (2016), Probabilistic Amplitude Demodulation Features in Speech Synthesis for Improving Prosody, in Proceedings of Interspeech, San Francisco, ISCA, San Francisco.
Unified Prosody Model based on Atom Decomposition for Emphasis Detection
Gerazov Branislav, Gjoreski Aleksandar, Melov Aleksandar, Honnet Pierre-Edouard, Ivanovski Zoran, Garner Philip N. (2016), Unified Prosody Model based on Atom Decomposition for Emphasis Detection, in ETAI, Struga, ETAI, Struga.
An Algorithm for Phase Manipulation in a Speech Signal
Pekar Darko, Suzić Siniša, Mak Robert, Friedlander Meir, Sečujski Milan (2016), An Algorithm for Phase Manipulation in a Speech Signal, in Proceedings of SPECOM, Budapest, Springer, Budapest.
Automatic Summarization of Highly Spontaneous Speech
Beke András, Szaszák György (2016), Automatic Summarization of Highly Spontaneous Speech, in Proceedings of SPECOM, Budapest, Springer, Budapest.
Combining Atom Decomposition of the F0 Track and HMM-based Phonological Phrase Modelling for Robust Stress Detection in Speech
Szaszák György, Tündik Máté Ákos, Gerazov Branislav, Gjoreski Aleksandar (2016), Combining Atom Decomposition of the F0 Track and HMM-based Phonological Phrase Modelling for Robust Stress Detection in Speech, in Proceedings of SPECOM, Budapest, Springer, Budapest.
Design of a Speech Corpus for Research on Cross-Lingual Prosody Transfer
Sečujski Milan, Gerazov Branislav, Csapó Tamás Gábor, Delić Vlado, Garner Philip N., Gjoreski Aleksandar, Guennec David, Ivanovski Zoran, Melov Aleksandar, Németh Géza, Stojković Ana, Szaszák György (2016), Design of a Speech Corpus for Research on Cross-Lingual Prosody Transfer, in Proceedings of SPECOM, Budapest, Springer, Budapest.
Modeling unvoiced sounds in statistical parametric speech synthesis with a continuous vocoder
Csapó Tamás Gábor, Németh Géza, Cernak Milos, Garner Philip N. (2016), Modeling unvoiced sounds in statistical parametric speech synthesis with a continuous vocoder, in 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, EURASIP, Budapest.
A linguistic interpretation of the atom decomposition of fundamental frequency contour for American English
Delić Tijana, Gerazov Branislav, Popović Branislav, Sečujski Milan (2016), A linguistic interpretation of the atom decomposition of fundamental frequency contour for American English, in Proceedings of SPECOM, Budapest.
An agonist-antagonist pitch production model
Gerazov Branislav, Garner Philip N. (2016), An agonist-antagonist pitch production model, in Proceedings of SPECOM, Budapest.
Continuous Fundamental Frequency Prediction with Deep Neural Networks
Tóth Bálint Pál, Csapó Tamás Gábor (2016), Continuous Fundamental Frequency Prediction with Deep Neural Networks, in Proceedings of EUSIPCO, Budapest, EURASIP, Budapest.
DNN-Based Duration Modeling for Synthesizing Short Sentences
Nagy Péter, Németh Géza (2016), DNN-Based Duration Modeling for Synthesizing Short Sentences, in Proceedings of SPECOM, Budapest.
Sound Pattern Matching for Automatic Prosodic Event Detection
Cernak Milos, Asaei Afsaneh, Honnet Pierre-Edouard, Garner Philip N., Bourlard Hervé (2016), Sound Pattern Matching for Automatic Prosodic Event Detection, in Proceedings of Interspeech, San Francisco.
An Empirical Model of Emphatic Word Detection
Cernak Milos, Honnet Pierre-Edouard (2015), An Empirical Model of Emphatic Word Detection, in Proceedings of Interspeech, Dresden.
An Investigation of Muscle Models for Physiologically Based Intonation Modelling
Gerazov Branislav, Garner Philip N. (2015), An Investigation of Muscle Models for Physiologically Based Intonation Modelling, in Proceedings of the 23rd Telecommunications Forum, Belgrade.
Atom Decomposition-Based Intonation Modelling
Honnet Pierre-Edouard, Gerazov Branislav, Garner Philip N. (2015), Atom Decomposition-Based Intonation Modelling, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane.
Atom-decomposition based analysis for the purpose of emphatic word detection
Gjoreski Aleksandar, Gerazov Branislav, Ivanovski Zoran (2015), Atom-decomposition based analysis for the purpose of emphatic word detection, in XII International Conference ETAI, Ohrid.
Automatic transformation of irregular to regular voice by residual analysis and synthesis
Csapó T. G., Németh G. (2015), Automatic transformation of irregular to regular voice by residual analysis and synthesis, in Proceedings of Interspeech, Dresden.
Emphatic Word Detection Based on Relative Phoneme Energies within Syllables
Stojković Ana, Gerazov Branislav, Ivanovski Zoran (2015), Emphatic Word Detection Based on Relative Phoneme Energies within Syllables, in XII International Conference ETAI, Ohrid.
Emphatic word detection based on syllable durations
Melov Aleksandar, Gerazov Branislav, Ivanovski Zoran (2015), Emphatic word detection based on syllable durations, in XII International Conference ETAI, Ohrid.
From text to formants - indirect model for trajectory prediction based on a multi-speaker parallel speech database
Abari K., Csapó T. G., Tóth B. P., Olaszy G. (2015), From text to formants - indirect model for trajectory prediction based on a multi-speaker parallel speech database, in Proceedings of Interspeech, Dresden.
Implementation of optimized matching pursuit techniques in weighted correlation based atom decomposition intonation modelling
Gerazov Branislav, Gjoreski Aleksandar, Ivanovski Zoran (2015), Implementation of optimized matching pursuit techniques in weighted correlation based atom decomposition intonation modelling, in 3rd International Acoustics and Audio Engineering Conference TAKTONS, Novi Sad.
Joint Atom-Decomposition Based Analysis of Energy and Intonation for Emphatic Word Detection
Gjoreski Aleksandar, Gerazov Branislav, Ivanovski Zoran (2015), Joint Atom-Decomposition Based Analysis of Energy and Intonation for Emphatic Word Detection, in 3rd International Acoustics and Audio Engineering Conference TAKTONS, Novi Sad.
Residual-Based Excitation with Continuous F0 Modeling in HMM-Based Speech Synthesis
Csapó Tamás Gábor, Németh Géza, Cernak Milos (2015), Residual-Based Excitation with Continuous F0 Modeling in HMM-Based Speech Synthesis, in Dediu Adrian-Horia (ed.), 27-38.
Synthesis of Speaking Styles with Corpus- and HMM-Based Approaches
Nagy P., Zainkó Cs., Németh G. (2015), Synthesis of Speaking Styles with Corpus- and HMM-Based Approaches, in 6th IEEE Conference on Cognitive Infocommunications CogInfoCom 2015, Gyor, Hungary.
Towards Exploring the Role of Disfluencies from an Acoustic Point of view: a New Aspect of (Dis)continuous Speech Prosody Modelling
Szaszák G., Beke A. (2015), Towards Exploring the Role of Disfluencies from an Acoustic Point of view: a New Aspect of (Dis)continuous Speech Prosody Modelling, in Text Speech and Dialogue, Pilsen.
Towards Extracting the Global Component from the Syllable Duration Contour for Emphatic Word Detection
Melov Aleksandar, Gerazov Branislav, Ivanovski Zoran (2015), Towards Extracting the Global Component from the Syllable Duration Contour for Emphatic Word Detection, in 3rd International Acoustics and Audio Engineering Conference TAKTONS, Novi Sad.
Towards speech emotion recognition in Macedonian
Gerazov Branislav, Peev Gjorgi, Hristov Martin, Ivanovski Zoran (2015), Towards speech emotion recognition in Macedonian, in XII International Conference ETAI, Ohrid, Macedonia.
Using Automatic Stress Extraction from Audio for Improved Prosody Modelling in Speech Synthesis
Szaszák G., Beke A., Olaszy G., Tóth B. P. (2015), Using Automatic Stress Extraction from Audio for Improved Prosody Modelling in Speech Synthesis, in Proceedings of Interspeech, Dresden.
Weighted Correlation based Atom Decomposition Intonation Modelling
Gerazov Branislav, Honnet Pierre-Edouard, Gjoreski Aleksandar, Garner Philip N. (2015), Weighted Correlation based Atom Decomposition Intonation Modelling, in Proceedings of Interspeech, Dresden.
Combining NLP techniques and acoustic analysis for semantic focus detection in speech
Beke A., Szaszák G. (2014), Combining NLP techniques and acoustic analysis for semantic focus detection in speech, in Proceedings of the 5th IEEE International Conference on Cognitive Infocommunications: CogInfoCom, Vietri sul Mare.
Modeling the language of prosody
Gerazov Branislav, Honnet Pierre-Edouard, Ivanovski Zoran (2014), Modeling the language of prosody, in Proceedings of the DOGS - Digital speech and image processing conference, Novi Sad.
The SP2 SCOPES Project on Speech Prosody
Szaszák György, Csapó Tamás Gábor, Garner Philip N., Gerazov Branislav, Ivanovski Zoran, Németh Géza, Tóth Bálint, Sečujski Milan, Delić Vlado (2014), The SP2 SCOPES Project on Speech Prosody, in Proceedings of the DOGS - Digital speech and image processing conference, Novi Sad.

Datasets

SP2 speech corpus

Author Gerazov, Branislav
Publication date 27.01.2016
Persistent Identifier (PID) https://github.com/SP2-Consortium/SP2-Speech-Corpus
Repository GitHub
Abstract
This is a multilingual speech corpus containing prosodically rich sentences designed for research in the domain of cross-lingual prosody transfer in the context of expressive speech synthesis. The corpus has been created within the research project "SP2: SCOPES Project on Speech Prosody" supported by the Swiss National Science Foundation.

Scientific events



Self-organised

Title Date Place
Special session at the DOGS - Digital speech and image processing conference 02.10.2014 Novi Sad, Serbia

Associated projects

Number Title Start Funding scheme
185010 NAST: Neural Architectures for Speech Technology 01.02.2020 Project funding (Div. I-III)
141903 SIWIS: Spoken Interaction with Interpretation in Switzerland 01.12.2012 Sinergia
165545 MASS: Multilingual Affective Speech Synthesis 01.05.2017 Project funding (Div. I-III)

Abstract

This is a proposal to work on speech prosody; that is, extraction and synthesis of pitch, duration and intensity of speech.

Prosody is an important, yet not well understood aspect of speech. It is important because it imparts “feeling”, i.e., emotion and intent in natural speech. A speech synthesiser must reproduce prosody in order to both sound natural and to impart the intended emotion. It is not well understood because it is often ignored. In (automatic) speech recognition (ASR), it is ignored because it does not affect the textual representation of the words that are spoken. In text-to-speech synthesis (TTS), it is often ignored because the synthesiser is capable of learning the correct prosody from training data and context in the same way that the other surface acoustic features are learned.

With the advent of statistical TTS, however, we have the opportunity to manipulate prosodic (as well as spectral) parameters. This may be in order for the TTS to sound like a different person, or to impart a new emotion that was not represented in training data. This in turn requires models of speech that go beyond surface acoustics, describing real natural cues and production mechanisms.

In an application such as a dialogue system, it is sufficient for a dialogue manager to produce prosodic cues. However, in a speech-to-speech translation system, such cues must come from the utterance being translated. This causes the requirement to feed back all the way to ASR, which is now required to extract such cues when uttered by real people.

The four partners are all actively but independently working on different aspects of speech prosody. The partners have not worked together before, but the process of writing the proposal has made it clear that they could work together to significant mutual benefit. The proposal describes four research plans based on work currently underway at each institution. It also describes how these plans could be brought together, allowing each institution to benefit from the others' work. It also allows for the identification of other synergies that can only arise from working together for an extended period. Taken together, this represents a transition from independent work to a homogeneous collaboration.

Although we hope to share resources and publish jointly, the primary goal of the work in this proposal is to reach a state where the partners can submit strong joint proposals for EU funding (including FP8 and Horizon 2020), in turn allowing significant and directed long-term collaboration. This will require advancing another semantic level, from familiarity with each other's technology to understanding how it can integrate into dialogue and translation systems.