Project

SIWIS: Spoken Interaction with Interpretation in Switzerland

English title SIWIS: Spoken Interaction with Interpretation in Switzerland
Applicant Garner Philip
Number 141903
Funding scheme Sinergia
Research institution IDIAP Institut de Recherche
Institution of higher education Idiap Research Institute - IDIAP
Main discipline Information Technology
Start/End 01.12.2012 - 30.11.2016
Approved amount 1'325'034.00

Keywords (4)

Cross-lingual adaptation; Translation; Swiss languages; Prosody

Lay Summary (English)

Lead
Lay summary
The SIWIS project is about speech to speech translation. It will allow a person to speak to a machine in their native language and have it automatically recognised, translated and spoken in a different language. One characteristic of recent technology to achieve this is that the spoken synthetic voice can sound like the original speaker instead of a generic speaker or robot.

SIWIS has two broad goals. The first is to customise the current state of the art to the Swiss language scenario. This entails constructing recognition, translation and synthesis modules in the main Swiss languages. The second goal is to advance this state of the art, focussing especially on prosody. Prosody describes aspects of speech such as volume, rhythm and pitch; these are the parts that carry emotion and emphasis, adding extra nuance and personality to the words.

In order to do this, SIWIS brings together four partners. These partners possess world-leading expertise in each of the component parts of a translation system. Further, three of the partners are physically located in cities representing the major language groups of Switzerland; this includes English, represented by the University of Edinburgh. The location will be important not only for data collection (ensuring we have the tools to enable the project), but also for evaluation, providing native listeners to ensure our results are as good as can be hoped.
Last update: 21.02.2013

Publications

Publication
The SIWIS French Speech Synthesis Database. Design and recording of a high quality French database for speech synthesis
Honnet Pierre-Edouard (2017), The SIWIS French Speech Synthesis Database. Design and recording of a high quality French database for speech synthesis, Idiap Research Institute, Martigny.
Speech vocoding for laboratory phonology
Cernak Milos, Beňuš Štefan, Lazaridis Alexandros (2017), Speech vocoding for laboratory phonology, in Computer Speech and Language, 42, 100-121.
Composition of Deep and Spiking Neural Networks for Very Low Bit Rate Speech Coding
Cernak Milos, Lazaridis Alexandros, Asaei Afsaneh, Garner Philip N. (2016), Composition of Deep and Spiking Neural Networks for Very Low Bit Rate Speech Coding, in IEEE/ACM Trans. on Audio, Speech and Language Processing, 24(12), 2301.
Detecting Emphasised Spoken Words by Considering Them Prosodic Outliers and Taking Advantage of HMM-Based TTS Framework
Liang Hui (2016), Detecting Emphasised Spoken Words by Considering Them Prosodic Outliers and Taking Advantage of HMM-Based TTS Framework, in Speech Prosody, ISCA, Web.
Emphasis Recreation for TTS using Intonation Atoms
Honnet Pierre-Edouard, Garner Philip N. (2016), Emphasis Recreation for TTS using Intonation Atoms, in Proceedings of the 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, ISCA, Web.
Intonation atom based emphasis transfer
Honnet Pierre-Edouard, Garner Philip N. (2016), Intonation atom based emphasis transfer, Idiap, Martigny.
Investigating Spectral Amplitude Modulation Phase Hierarchy Features in Speech Synthesis
Lazaridis Alexandros, Cernak Milos, Honnet Pierre-Edouard, Garner Philip N. (2016), Investigating Spectral Amplitude Modulation Phase Hierarchy Features in Speech Synthesis, in 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, ISCA, Web.
Parallel and cascaded deep neural networks for text-to-speech synthesis
Ribeiro Manuel Sam, Watts Oliver, Yamagishi Junichi (2016), Parallel and cascaded deep neural networks for text-to-speech synthesis, in 9th ISCA Workshop on Speech Synthesis, Sunnyvale, CA, ISCA, Web.
Probabilistic Amplitude Demodulation Features in Speech Synthesis for Improving Prosody
Lazaridis Alexandros, Cernak Milos, Garner Philip N. (2016), Probabilistic Amplitude Demodulation Features in Speech Synthesis for Improving Prosody, in Proceedings of Interspeech, San Francisco, CA, ISCA, Web.
Sound Pattern Matching for Automatic Prosodic Event Detection
Cernak Milos, Asaei Afsaneh, Honnet Pierre-Edouard, Garner Philip N., Bourlard Hervé (2016), Sound Pattern Matching for Automatic Prosodic Event Detection, in Proceedings of Interspeech, ISCA, Web.
Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis
Ribeiro Manuel Sam, Watts Oliver, Yamagishi Junichi (2016), Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis, in Proceedings of Interspeech, San Francisco, ISCA, Web.
The SIWIS database: a multilingual speech database with acted emphasis
Goldman Jean-Philippe, Honnet Pierre-Edouard, Clark Rob, Garner Philip N., Ivanova Maria, Lazaridis Alexandros, Liang Hui, Macedo Tiago, Pfister Beat, Ribeiro Manuel Sam, Wehrli Eric, Yamagishi Junichi (2016), The SIWIS database: a multilingual speech database with acted emphasis, in Proceedings of Interspeech, ISCA, Web.
Unified Prosody Model based on Atom Decomposition for Emphasis Detection
Gerazov Branislav, Gjoreski Aleksandar, Melov Aleksandar, Honnet Pierre-Edouard, Ivanovski Zoran, Garner Philip N. (2016), Unified Prosody Model based on Atom Decomposition for Emphasis Detection, in Proceedings of ETAI, Struga, Macedonia, ETAI, Web.
Wavelet-based decomposition of f0 as a secondary task for DNN-based speech synthesis with multi-task learning
Ribeiro Manuel Sam, Watts Oliver, Yamagishi Junichi, Clark Robert A. J. (2016), Wavelet-based decomposition of f0 as a secondary task for DNN-based speech synthesis with multi-task learning, in IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, IEEE, Web.
A Multi-Level Representation of f0 using the Continuous Wavelet Transform and the Discrete Cosine Transform
Ribeiro Manuel Sam, Clark Robert A. J. (2015), A Multi-Level Representation of f0 using the Continuous Wavelet Transform and the Discrete Cosine Transform, in IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Web.
A Perceptual Investigation of Wavelet-based Decomposition of f0 for Text-to-Speech Synthesis
Ribeiro Manuel Sam, Yamagishi Junichi, Clark Robert A. J. (2015), A Perceptual Investigation of Wavelet-based Decomposition of f0 for Text-to-Speech Synthesis, in Proceedings of Interspeech, ISCA, Web.
An Empirical Model of Emphatic Word Detection
Cernak Milos, Honnet Pierre-Edouard (2015), An Empirical Model of Emphatic Word Detection, in Proceedings of Interspeech, ISCA, Web.
Atom Decomposition-based Intonation Modelling
Honnet Pierre-Edouard, Gerazov Branislav, Garner Philip N. (2015), Atom Decomposition-based Intonation Modelling, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Web.
Detecting Strong Prosodic Events
Hunziker Vanessa, Thoma Janine (2015), Detecting Strong Prosodic Events, ETH Zurich, Zurich.
DNN-based Speech Synthesis: Importance of input features and training data
Lazaridis Alexandros, Potard Blaise, Garner Philip N. (2015), DNN-based Speech Synthesis: Importance of input features and training data, in International Conference on Speech and Computer, SPECOM 2015, Springer, Berlin Heidelberg.
Identification of Noun-Noun Compounds in the Context of Speech-to-Speech Translation
Ivanova Maria, Wehrli Eric (2015), Identification of Noun-Noun Compounds in the Context of Speech-to-Speech Translation, in Proceedings of The 18th International Conference of Text, Speech and Dialogue (TSD), Springer, Berlin Heidelberg.
Multiword Expressions in Machine Translation: The Case of German Compounds
Ivanova Maria, Wehrli Eric, Nerima Luka (2015), Multiword Expressions in Machine Translation: The Case of German Compounds, in Proceedings of the 2nd Workshop on Multi-word Units in Machine Translation &c, ??, ??.
Weighted Correlation based Atom Decomposition Intonation Modelling
Gerazov Branislav, Honnet Pierre-Edouard, Gjoreski Aleksandar, Garner Philip N. (2015), Weighted Correlation based Atom Decomposition Intonation Modelling, in Proceedings of Interspeech, ISCA, Web.
Capturing Speaker-Independent Prosodic Information by Syntax Tree-Based Prosody Modelling
Liang Hui, Hoffmann Sarah (2014), Capturing Speaker-Independent Prosodic Information by Syntax Tree-Based Prosody Modelling, Laboratory TIK, ETH Zurich, ETH Zurich.
Importance of Prosody in Swiss French Accent for Speech Synthesis
Honnet Pierre-Edouard, Garner Philip N. (2014), Importance of Prosody in Swiss French Accent for Speech Synthesis, in Nouveaux cahiers de linguistique française, 31, 205.
Investigation into Transferability of Duration of Emphasised Words from Original Expression to Spoken Translation
Liang Hui (2014), Investigation into Transferability of Duration of Emphasised Words from Original Expression to Spoken Translation, ETH, Zurich.
Prosody in Swiss French Accents: Investigation using Analysis by Synthesis
Honnet Pierre-Edouard, Lazaridis Alexandros, Goldman Jean-Philippe, Garner Philip N. (2014), Prosody in Swiss French Accents: Investigation using Analysis by Synthesis, in Proceedings of the 7th Speech Prosody Conference, Dublin, Ireland, ISCA, Web.
SVR vs MLP for Phone Duration Modelling in HMM-based Speech Synthesis
Lazaridis Alexandros, Honnet Pierre-Edouard, Garner Philip N. (2014), SVR vs MLP for Phone Duration Modelling in HMM-based Speech Synthesis, in Proceedings of the 7th Speech Prosody Conference, Dublin, Ireland, ISCA, Web.
Swiss French Regional Accent Identification
Lazaridis Alexandros, Khoury Elie, Goldman Jean-Philippe, Avanzi Mathieu, Marcel Sébastien, Garner Philip N. (2014), Swiss French Regional Accent Identification, in Proceedings of Odyssey 2014: The Speaker and Language Recognition Workshop, Finland, ISCA, Web.
Syllable-based Regional Swiss French Accent Identification using Prosodic Features
Lazaridis Alexandros, Garner Philip N. (2014), Syllable-based Regional Swiss French Accent Identification using Prosodic Features, in Nouveaux cahiers de linguistique française, 31, 297.
Translation and Prosody in Swiss Languages
Garner Philip N., Clark Rob, Goldman Jean-Philippe, Honnet Pierre-Edouard, Ivanova Maria, Lazaridis Alexandros, Liang Hui, Pfister Beat, Ribeiro Manuel Sam, Wehrli Eric, Yamagishi Junichi (2014), Translation and Prosody in Swiss Languages, in Nouveaux cahiers de linguistique française, 31, 211.

Datasets

The SIWIS database

Author Goldman, Jean-Philippe
Publication date 16.11.2016
Persistent Identifier (PID) http://bit.ly/siwisData
Repository University of Geneva
Abstract
We describe here a collection of speech data of bilingual and trilingual speakers of English, French, German and Italian. In the context of speech to speech translation (S2ST), this database is designed for several purposes and studies: training CLSA systems (cross-language speaker adaptation), conveying emphasis through S2ST systems, and evaluating TTS systems. More precisely, 36 speakers judged as accentless (22 bilingual and 14 trilingual speakers) were recorded for a set of 171 prompts in two or three languages, amounting to a total of 24 hours of speech. These sets of prompts include 100 sentences from news, 25 sentences from Europarl, the same 25 sentences with one acted emphasised word, 20 semantically unpredictable sentences, and finally a 240-word long text. All in all, it yielded 64 bilingual session pairs of the six possible combinations of the four languages. The database is freely available for non-commercial use and scientific research purposes.

The SIWIS French Speech Synthesis Database

Author Yamagishi, Junichi
Publication date 06.02.2017
Persistent Identifier (PID) http://dx.doi.org/10.7488/ds/1705
Repository Edinburgh Datashare
Abstract
The SIWIS French Speech Synthesis Database includes high quality French speech recordings and associated text files, aimed at building TTS systems and investigating multiple styles and emphasis. A total of 9750 utterances from various sources such as parliament debates and novels were uttered by a professional French voice talent. A subset of the database contains emphasised words in many different contexts. The database includes more than ten hours of speech data and is freely available.

Collaboration

Group / person Country
Types of collaboration
National Institute of Informatics Japan (Asia)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
- Exchange of personnel

Scientific events

Active participation

Title | Type of contribution | Title of article or contribution | Date | Place | Persons involved
Swiss Workshop on Prosody | Talk given at a conference | Importance of Prosody in Swiss French Accent for Speech Synthesis | 10.09.2014 | Geneva, Switzerland | Garner Philip; Honnet Pierre-Edouard
Swiss Workshop on Prosody | Talk given at a conference | Translation and Prosody in Swiss Languages | 10.09.2014 | Geneva, Switzerland | Lazaridis Alexandros; Garner Philip; Goldman Jean-Philippe; Honnet Pierre-Edouard
Swiss Workshop on Prosody | Talk given at a conference | Syllable-based Regional Swiss French Accent Identification using Prosodic Features | 10.09.2014 | Geneva, Switzerland | Garner Philip; Goldman Jean-Philippe; Lazaridis Alexandros
Speech Prosody | Poster | Prosody in Swiss French Accents: Investigation using Analysis by Synthesis | 20.05.2014 | Dublin, Ireland | Honnet Pierre-Edouard; Lazaridis Alexandros
Speech Prosody | Poster | SVR vs MLP for Phone Duration Modelling in HMM-based Speech Synthesis | 20.05.2014 | Dublin, Ireland | Lazaridis Alexandros; Honnet Pierre-Edouard


Associated projects

Number | Title | Start | Funding scheme
165545 | MASS: Multilingual Affective Speech Synthesis | 01.05.2017 | Project funding (Div. I-III)
127510 | Improving the coherence of machine translation output by modeling intersentential relations | 01.03.2010 | Sinergia
132619 | Interactive Cognitive Systems (ICS) | 01.10.2010 | Project funding (Div. I-III)
152495 | SP2: Scopes Project on Speech Prosody | 01.04.2014 | SCOPES

Abstract

Speech-to-speech translation (S2ST) is the translation of spoken sentences from one language to another. S2ST is not a mature technology but a topic in the research community. Commonly, an S2ST system consists of three components: speech-to-text conversion, machine translation, and text-to-speech conversion. Such an S2ST system translates only the words and neglects the personality associated with them. Therefore, the translated speech reflects its original only incompletely.

The SIWIS project aims for a better S2ST system that also translates important cues in speech such as identity of the speaker, focus, contrast or emphasis; it therefore transfers the user's intention more naturally and completely. Furthermore, the S2ST system should be adaptive with respect to two aspects: the speech-to-text component should adapt to the user's voice in order to optimise the speech recognition rate, and the text-to-speech component should be adaptive to allow the user to define the sound of the generated speech by means of some speech samples.

Switzerland is an ideal place for S2ST research, because it works day to day in five different languages simultaneously. Four of these are national languages, augmented by English, the latter being as important as any of them for international communication. The situation is even more complicated if one also considers local accents and dialects. This language mix leads to obvious difficulties, with many people working and even living in a non-native language. More positively, however, this multi-linguality makes Switzerland a predestined place for multi-lingual research, not only as a geographic location to conduct research, but also as the country most likely to benefit from the results of such research.

The SIWIS project will begin from a baseline defined by the union of the output of the EU FP7 EMIME project and the complementary expertise of four partner institutions. In a series of core tasks, the partners will pool resources to place the Swiss language research community at the state-of-the-art in speech-to-speech translation in the major languages of Switzerland and Swiss commerce. In a series of group tasks, the partners will advance the state-of-the-art in the field, capitalising on the unique location, language mix and expertise of Switzerland and the partners.

All tasks will focus on one or more of the following common themes:

Swiss languages. Whilst a focus on Swiss language is not a research issue in itself, the Swiss locality puts SIWIS in a position to focus data collection and enable research that can only take place effectively in a multi-lingual environment.

Translation. We have at our disposal a capable translation framework, hence an end-to-end recognition, translation and synthesis chain.

Prosody. Prosody will be a significant research focus of SIWIS. That is, we will translate not only spoken words, but also important prosodic cues associated with them. The concept of prosody transfer across distinct languages and speakers is a largely untouched research area. It is something that can only be attacked given the pooled recognition, translation and synthesis resources of a consortium.

Cross-lingual adaptation. Adaptation is the process that allows speech synthesis to mimic the voice of a speaker in another language. This will be driven by the unique availability of bilingual speakers in Switzerland.

The research will result in a unique speech-to-speech translation capability, the synthesis in a target language mimicking both the spectral and prosodic characteristics of the speaker in the source language.