Project


Flexible Grapheme-Based Automatic Speech Recognition (FlexASR)

English title: Flexible Grapheme-Based Automatic Speech Recognition (FlexASR)
Applicant: Magimai-Doss Mathew
Number: 146229
Funding scheme: Project funding (Div. I-III)
Research institution: IDIAP Institut de Recherche
Institution of higher education: Idiap Research Institute - IDIAP
Main discipline: Information Technology
Start/End: 01.05.2013 - 30.04.2014
Approved amount: 46'800.00

Keywords (5)

Grapheme; Phoneme; Hidden Markov Models; Kullback-Leibler Divergence; Automatic Speech Recognition

Lay Summary (translated from French)

Lead
Current automatic speech recognition (ASR) systems use phonemes as subword units. Developing such systems therefore requires a well-developed phonetic lexicon. However, not all languages or domains have such well-developed lexical resources. The FlexASR project focuses on the development of systems that use graphemes as subword units.
Lay summary

The objective of FlexASR is to develop, within the framework of an acoustic modelling technique called Kullback-Leibler divergence based hidden Markov model (KL-HMM), a flexible grapheme-based automatic speech recognition (ASR) system for monolingual, cross-lingual and multilingual settings.

Using graphemes (the written unit corresponding to the spoken unit, the phoneme) as subword units for ASR is useful for the following reasons:

a) it facilitates lexicon development,
b) it provides a unique representation for each word.

It could thus facilitate extending ASR systems to new languages and domains. The main challenge, however, is that the link between graphemes and the acoustic signal is strongly language-dependent: in English the link is weak, whereas in Spanish it is strong.

Our work addresses this issue by first modelling the relationship between the acoustics and linguistic units, such as phonemes, with a multilayer perceptron (MLP). The phoneme class-conditional probabilities (also called posterior features) estimated by the MLP are then used directly as feature observations for HMMs (hidden Markov models) whose states represent graphemes.
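
As an illustration of this pipeline, the sketch below scores phoneme posterior vectors against grapheme HMM states with a Kullback-Leibler divergence, the quantity the KL-HMM framework builds on. It is a minimal Python/NumPy sketch under stated assumptions, not the project's code: the trained MLP is replaced by random placeholder posteriors, the grapheme state distributions are random placeholders rather than trained parameters, and the names (kl_divergence, n_phonemes, n_frames, grapheme_states) are illustrative.

    import numpy as np

    def kl_divergence(y, z, eps=1e-10):
        """KL(y || z) between two categorical distributions."""
        y = np.clip(y, eps, 1.0)
        z = np.clip(z, eps, 1.0)
        return float(np.sum(y * np.log(y / z)))

    n_phonemes = 40   # size of the MLP output layer (illustrative value)
    n_frames = 100    # number of acoustic frames in an utterance (illustrative value)
    rng = np.random.default_rng(0)

    # Stand-in for the MLP output: one phoneme posterior vector per frame.
    mlp_posteriors = rng.dirichlet(np.ones(n_phonemes), size=n_frames)

    # Each grapheme HMM state holds a categorical distribution over phonemes
    # (learned during training; random placeholders here).
    grapheme_states = {g: rng.dirichlet(np.ones(n_phonemes)) for g in "abc"}

    # Local score of frame t for each grapheme state: KL divergence between the
    # state distribution and the frame's posterior vector (one common variant).
    t = 0
    scores = {g: kl_divergence(y_g, mlp_posteriors[t])
              for g, y_g in grapheme_states.items()}
    print(scores)

In a full KL-HMM system these local scores play the role of the state emission scores during decoding; the sketch shows only the frame-level scoring step.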

This one-year renewal aims to bring the thesis work to a successful conclusion by investigating:
(a) the combination of grapheme- and phoneme-based ASR systems,
(b) KL-HMM (Kullback-Leibler divergence based hidden Markov model) ASR without the use of phonemes.

Last update: 11.06.2013

Responsible applicant and co-applicants

Employees


Publications

Publication
Articulatory feature based continuous speech recognition using probabilistic lexical modeling
Rasipuram Ramya, Magimai.-Doss Mathew (2016), Articulatory feature based continuous speech recognition using probabilistic lexical modeling, in Computer Speech & Language, 36, 233-259.
Integrated pronunciation learning for automatic speech recognition using probabilistic lexical modeling
Rasipuram Ramya, Razavi Marzieh, Magimai-Doss Mathew (2015), Integrated pronunciation learning for automatic speech recognition using probabilistic lexical modeling, in ICASSP 2015 - 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Queensland, Australia, IEEE.
Acoustic and Lexical Resource Constrained ASR using Language-Independent Acoustic Model and Language-Dependent Probabilistic Lexical Model
Rasipuram Ramya, Magimai.-Doss Mathew (2015), Acoustic and Lexical Resource Constrained ASR using Language-Independent Acoustic Model and Language-Dependent Probabilistic Lexical Model, in Speech Communication, 68, 23-40.
Grapheme-based Automatic Speech Recognition using Probabilistic Lexical Modeling
Rasipuram Ramya (2014), Grapheme-based Automatic Speech Recognition using Probabilistic Lexical Modeling, EPFL, Lausanne, Switzerland.
On Learning Grapheme-to-Phoneme Relationships through the Acoustic Speech Signal
Magimai.-Doss Mathew, Rasipuram Ramya (2014), On Learning Grapheme-to-Phoneme Relationships through the Acoustic Speech Signal, in The Phonetician, 109-110, 6-23.
On Modeling Context-Dependent Clustered States: Comparing HMM/GMM, Hybrid HMM/ANN and KL-HMM Approaches
Razavi Marzieh, Rasipuram Ramya, Magimai.-Doss Mathew (2014), On Modeling Context-Dependent Clustered States: Comparing HMM/GMM, Hybrid HMM/ANN and KL-HMM Approaches, in Proceedings of ICASSP, Florence, Italy, IEEE.
Probabilistic Lexical Modeling and Unsupervised Training for Zero-Resourced ASR
Rasipuram Ramya, Razavi Marzieh, Magimai.-Doss Mathew (2013), Probabilistic Lexical Modeling and Unsupervised Training for Zero-Resourced ASR, in Proceedings of IEEE International Workshop on Automatic Speech Recognition and Understanding (ASRU), Olomouc, Czech Republic, IEEE.

Scientific events

Active participation

Title | Type of contribution | Title of article or contribution | Date | Place | Persons involved
ICASSP 2014 | Talk given at a conference | On Modeling Context-Dependent Clustered States: Comparing HMM/GMM, Hybrid HMM/ANN and KL-HMM Approaches | 05.05.2014 | Florence, Italy | Magimai-Doss Mathew
Google Doctoral Workshop on Speech Technology 2014 | Poster | Probabilistic Lexical Modeling and Unsupervised Training for Zero-Resourced ASR | 28.04.2014 | London, Great Britain and Northern Ireland | Rasipuram Ramya
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2013 | Poster | Probabilistic Lexical Modeling and Unsupervised Training for Zero-Resourced ASR | 10.12.2013 | Olomouc, Czech Republic | Rasipuram Ramya
Interspeech 2013 | Talk given at a conference | Improving Grapheme-based ASR by Probabilistic Lexical Modeling Approach | 26.08.2013 | Lyon, France | Rasipuram Ramya


Associated projects

Number | Title | Start | Funding scheme
124985 | Flexible Grapheme-Based Automatic Speech Recognition (FlexASR) | 01.01.2010 | Project funding (Div. I-III)

Abstract

Current state-of-the-art automatic speech recognition (ASR) systems commonly use hidden Markov models (HMMs), where phonemes (phones) are assumed to be the intermediate subword units. Given the high (speaker and contextual) variability of those elementary units, state-of-the-art systems have to rely on complex statistical modelling (multidimensional Gaussian mixture models with a large number of mixture components). Such systems also require some minimum phonetic expertise, since every word to be recognized has to be explicitly modelled in terms of a Markov model capturing its official phonetic transcription (usually found in a dictionary) as well as its pronunciation variants. In spite of its relative success, this approach remains quite cumbersome when dealing with (unavoidable) new words or when deploying new languages.

Given this, there has always been an interest in directly using the grapheme (orthographic) transcription of the word, without explicit phonetic modelling. However, while limiting the variability at the word representation level, the link to the acoustic waveform becomes weaker (depending on the language), as the standard acoustic features characterize phonemes. Most recent attempts were based on mapping the orthography of the words onto HMM states using phonetic information, or on extending conventional HMM-based ASR systems by improving context-dependent modelling for grapheme units.

The goal of the present project is to exploit new statistical models recently developed at Idiap that are potentially better suited to deal with the grapheme representation of the lexicon words, and to exploit in a principled way both the grapheme representation and phoneme information. This will be done by extending a novel acoustic modelling approach referred to as KL-HMM (Kullback-Leibler divergence based HMM), which has recently been shown to be much simpler and more flexible, while yielding state-of-the-art performance (on phoneme-based ASR systems) and opening up multiple opportunities for further development and research. In a KL-HMM system, acoustic features are replaced by posterior probability distributions over elementary units (e.g. phonemes), and HMM states are modelled through multinomial distributions in that posterior space. We believe this can be generalized to grapheme-based systems. Also, while working in posterior probability spaces, it is much easier to combine evidence coming from multiple sources of information.

The present project proposal is thus particularly well suited as a PhD project since it will allow:
1. Building upon a strong PhD thesis (1) and extending a new and very promising approach towards flexible speech recognition systems.
2. Investigating further its generalization properties towards new types of models based on grapheme word representation.

(1) Guillermo Aradilla, "Acoustic Models for Posterior Features in Speech Recognition", PhD Thesis, No. 4164, École Polytechnique Fédérale de Lausanne, 2008.
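
As a concrete reading of the KL-HMM principle stated above (the notation here is chosen for exposition and is not taken from the project documents): each HMM state d is parameterised by a categorical (multinomial) distribution y_d over the elementary-unit classes, and a frame-level posterior vector z_t is scored against that state with a Kullback-Leibler divergence, for instance

    D_{\mathrm{KL}}(y_d \,\|\, z_t) = \sum_{k=1}^{K} y_d(k) \, \log \frac{y_d(k)}{z_t(k)},

where the sum runs over the K phoneme (or, more generally, subword-unit) classes. Reverse and symmetrised variants of this divergence also appear in the KL-HMM literature.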