Project


Flexible Grapheme-Based Automatic Speech Recognition (FlexASR)

English title Flexible Grapheme-Based Automatic Speech Recognition (FlexASR)
Applicant Magimai-Doss Mathew
Number 124985
Funding scheme Project funding (Div. I-III)
Research institution IDIAP Institut de Recherche
Institution of higher education Idiap Research Institute - IDIAP
Main discipline Information Technology
Start/End 01.01.2010 - 30.04.2013
Approved amount 155'526.00

Keywords (5)

Phoneme; Grapheme; Automatic Speech Recognition; Hidden Markov Models; Kullback-Leibler Divergence

Lay Summary (English)

Lead
Lay summary
Current state-of-the-art Automatic Speech Recognition (ASR) systems represent words in terms of their phonetic transcription, possibly enriched by pronunciation variants. Each phonetic unit (or other subword unit) is typically represented by a Hidden Markov Model (HMM), and the parameters of the HMMs are trained on large amounts of speech data.

While such approaches yield reasonable performance, they require explicit phonetic transcriptions of the words (which are not always available, and which suffer from large phonological variability); they still lack robustness to phonetic variation introduced by the speaker; they do not explicitly or directly model additional sources of information, such as the orthographic form of the words; and they are not easily extendable to new languages.

The goal of the present project is to investigate new approaches to HMM-based ASR in which new information sources can be integrated into the statistical models, primarily targeting the use of orthographic transcriptions as subword units and different phonetic-level information as the feature representation (as opposed to standard spectral-based acoustic features). To achieve this goal, we believe (following recent developments at Idiap) that at least two adaptations of the standard HMM-based ASR model are required: (1) working in posterior feature spaces (where information combination can be achieved more efficiently) and (2) adapting the parametrization of the HMMs (e.g., replacing Gaussian mixtures by multinomial distributions, and local likelihoods by the Kullback-Leibler divergence between distributions). Research in this area will be carried out in the context of both monolingual and multilingual ASR.
Last update: 21.02.2013

Responsible applicant and co-applicants

Employees

Name Institute

Publications

Publication
Grapheme and Multilingual Posterior Features for Under-Resourced Speech Recognition: A Study on Scottish Gaelic
Rasipuram Ramya, Bell Peter, Magimai.-Doss Mathew (2013), Grapheme and Multilingual Posterior Features for Under-Resourced Speech Recognition: A Study on Scottish Gaelic, in Proceedings of IEEE International Conference on Acoustics Speech Signal Processing (ICASSP) 2013, Vancouver, Canada, IEEE.
Improving Grapheme-based ASR by Probabilistic Lexical Modeling Approach
Rasipuram Ramya, Magimai.-Doss Mathew (2013), Improving Grapheme-based ASR by Probabilistic Lexical Modeling Approach, in Proceedings of Interspeech, Lyon, France, ISCA.
KL-HMM and Probabilistic Lexical Modeling
Rasipuram Ramya, Magimai.-Doss Mathew (2013), KL-HMM and Probabilistic Lexical Modeling, Idiap Research Report Idiap-RR-04-2013, Martigny, Switzerland.
Probabilistic Lexical Modeling and Grapheme-based Automatic Speech Recognition
Rasipuram Ramya, Magimai.-Doss Mathew (2013), Probabilistic Lexical Modeling and Grapheme-based Automatic Speech Recognition, Idiap Research Report Idiap-RR-15-2013, Martigny, Switzerland.
Acoustic data-driven grapheme-to-phoneme conversion using KL-HMM
Rasipuram Ramya, Magimai.-Doss Mathew (2012), Acoustic data-driven grapheme-to-phoneme conversion using KL-HMM, in Proceedings of IEEE International Conference on Acoustics Speech Signal Processing (ICASSP) 2012, Kyoto, Japan, IEEE.
Combining Acoustic Data Driven G2P and Letter-to-Sound Rules for Under Resource Lexicon Generation
Rasipuram Ramya, Magimai.-Doss Mathew (2012), Combining Acoustic Data Driven G2P and Letter-to-Sound Rules for Under Resource Lexicon Generation, in Proceedings of Interspeech 2012, Portland, USA, ISCA.
Fast and flexible Kullback-Leibler divergence based acoustic modeling for non-native speech recognition
Imseng David, Rasipuram Ramya, Magimai.-Doss Mathew (2011), Fast and flexible Kullback-Leibler divergence based acoustic modeling for non-native speech recognition, in Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2011, Hawaii, USA, IEEE.
Grapheme-based automatic speech recognition using KL-HMM
Magimai.-Doss Mathew, Rasipuram Ramya, Aradilla Guillermo (2011), Grapheme-based automatic speech recognition using KL-HMM, in Proceedings of Interspeech 2011, Florence, Italy, ISCA.
Improving articulatory feature and phoneme recognition using multitask learning
Rasipuram Ramya, Magimai.-Doss Mathew (2011), Improving articulatory feature and phoneme recognition using multitask learning, in Proceedings of International Conference on Artificial Neural Networks (ICANN) 2011, Espoo, Finland.
Integrating articulatory features using Kullback-Leibler divergence based acoustic model for phoneme recognition
Rasipuram Ramya, Magimai.-Doss Mathew (2011), Integrating articulatory features using Kullback-Leibler divergence based acoustic model for phoneme recognition, in Proceedings of IEEE International Conference on Acoustics Speech Signal Processing (ICASSP) 2011, Prague, Czech Republic, IEEE.
Multitask learning to improve articulatory feature estimation and phoneme recognition
Rasipuram Ramya, Magimai.-Doss Mathew (2011), Multitask learning to improve articulatory feature estimation and phoneme recognition, Idiap Research Report Idiap-RR-21-2011, Martigny.

Collaboration

Group / person Country
Types of collaboration
The Centre for Speech Technology Research, University of Edinburgh Great Britain and Northern Ireland (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication

Scientific events

Active participation

Title Type of contribution Title of article or contribution Date Place Persons involved
IEEE Workshop Automatic Speech Recognition and Understanding Poster PROBABILISTIC LEXICAL MODELING AND UNSUPERVISED TRAINING FOR ZERO-RESOURCED ASR 08.12.2013 Olomouc, Czech Republic Rasipuram Ramya;
IEEE International Conference on Acoustic, Speech Signal Processing (ICASSP) Poster GRAPHEME AND MULTILINGUAL POSTERIOR FEATURES FOR UNDER-RESOURCED SPEECH RECOGNITION: A STUDY ON SCOTTISH GAELIC 29.05.2013 Vancouver, Canada Rasipuram Ramya;
EPFL EDEE Scientific Exchange Day (EDEE SED) Poster Talked about her PhD thesis work 17.04.2013 EPFL, Lausanne, Switzerland Rasipuram Ramya;
IM2 Summer Institute 2012 Poster Acoustic Data-Driven Grapheme-to-Phoneme Conversion using KL-HMM 04.09.2012 Martigny, Switzerland Rasipuram Ramya;
IEEE International Conference on Acoustic, Speech Signal Processing (ICASSP) Poster ACOUSTIC DATA-DRIVEN GRAPHEME-TO-PHONEME CONVERSION USING KL-HMM 30.03.2012 Kyoto, Japan Rasipuram Ramya;
IM2 Summer Institute 2011 Poster Grapheme-based Automatic Speech Recognition using KL-HMM 02.09.2011 Martigny, Switzerland Rasipuram Ramya; Magimai-Doss Mathew;
International Conference on Artificial Neural Networks (ICANN) Talk given at a conference Improving Articulatory Feature and Phoneme Recognition using Multitask Learning 17.06.2011 Espoo, Finland Magimai-Doss Mathew;
IM2 Summer Institute 2010 Poster On joint Modelling of Grapheme and Phoneme Information using KL-HMM for ASR 14.09.2010 Saanenmöser, Gstaad, Switzerland Rasipuram Ramya; Magimai-Doss Mathew;


Associated projects

Number Title Start Funding scheme
146229 Flexible Grapheme-Based Automatic Speech Recognition (FlexASR) 01.05.2013 Project funding (Div. I-III)

Abstract

Current state-of-the-art automatic speech recognition (ASR) systems commonly use hidden Markov models (HMMs), where phonemes (phones) are assumed to be the intermediate subword units. Given the high (speaker and contextual) variability of these elementary units, state-of-the-art systems have to rely on complex statistical modelling (multidimensional Gaussian mixture models with a large number of mixture components). Such systems also require some minimum phonetic expertise, since every word to be recognized has to be explicitly modelled in terms of a Markov model capturing its official phonetic transcription (usually found in a dictionary), as well as its pronunciation variants. In spite of its relative success, this approach remains quite cumbersome when dealing with (unavoidable) new words or when deploying new languages.

Given this, there has always been interest in directly using the grapheme (orthographic) transcription of the word, without explicit phonetic modelling. However, while this limits the variability at the word-representation level, the link to the acoustic waveform becomes weaker (depending on the language), since the standard acoustic features characterize phonemes. Most recent attempts were based on mapping the orthography of the words onto HMM states using phonetic information, or on extending conventional HMM-based ASR systems by improving context-dependent modelling for grapheme units.

The goal of the present project is to exploit new statistical models recently developed at Idiap that are potentially better suited to dealing with the grapheme representation of the lexicon words, and to exploit in a principled way both the grapheme representation and phoneme information.
This will be done by extending a novel acoustic modelling approach referred to as KL-HMM (Kullback-Leibler divergence based HMM), which has recently been shown to be much simpler and more flexible while yielding state-of-the-art performance (on phoneme-based ASR systems), and which opens up multiple opportunities for further development and research. In a KL-HMM system, acoustic features are replaced by posterior probability distributions over elementary units (e.g., phonemes), and HMM states are modelled through multinomial distributions in that posterior space. We believe this can be generalized to grapheme-based systems. Also, while working in posterior probability spaces, it is much easier to combine multiple pieces of evidence coming from multiple sources of information.

The present project proposal is thus particularly well suited as a PhD project, since it will allow:
1. Building upon a strong PhD thesis (1) and extending a new and very promising approach towards flexible speech recognition systems.
2. Investigating further its generalization properties towards new types of models based on grapheme word representations.

(1) Guillermo Aradilla, "Acoustic Models for Posterior Features in Speech Recognition", PhD Thesis, No. 4164, École Polytechnique Fédérale de Lausanne, 2008.
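To make the KL-HMM idea above concrete, here is a minimal Python sketch of the local score it replaces the Gaussian-mixture likelihood with: each HMM state holds a multinomial (categorical) distribution over elementary units, each acoustic frame is a posterior probability vector (e.g., from a phoneme classifier), and the local score is the Kullback-Leibler divergence between the two. Function and variable names are illustrative, not from the project's actual code.

```python
import numpy as np

def kl_local_score(state_multinomial, frame_posterior, eps=1e-10):
    """KL(y_s || z_t) = sum_k y_s[k] * log(y_s[k] / z_t[k]).

    state_multinomial: the multinomial distribution attached to an HMM state.
    frame_posterior: the per-frame posterior probability vector used as the
    feature observation. A small eps avoids log(0); both vectors are
    renormalized so they remain proper distributions.
    """
    y = np.asarray(state_multinomial, dtype=float) + eps
    z = np.asarray(frame_posterior, dtype=float) + eps
    y /= y.sum()
    z /= z.sum()
    return float(np.sum(y * np.log(y / z)))

# Illustration with 3 unit classes: a state tuned to class 0 diverges
# less from a frame dominated by class 0 than from one dominated by class 2.
state = [0.8, 0.1, 0.1]
matching_frame = [0.7, 0.2, 0.1]
mismatched_frame = [0.1, 0.1, 0.8]
assert kl_local_score(state, matching_frame) < kl_local_score(state, mismatched_frame)
```

In decoding, this divergence plays the role of a (negated) log-likelihood per state and frame; because states are just multinomials over posteriors, the same machinery applies whether the units are phonemes or graphemes, which is the flexibility the project builds on.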