Project

Interactive Cognitive Systems (ICS)

English title: Interactive Cognitive Systems (ICS)
Applicant: Bourlard Hervé
Number: 132619
Funding scheme: Project funding (Div. I-III)
Research institution: IDIAP Institut de Recherche
Institution of higher education: Idiap Research Institute - IDIAP
Main discipline: Information Technology
Start/End: 01.10.2010 - 30.09.2012
Approved amount: 278'458.00

Keywords (8)

audio; speech and language processing; computer vision; speaker clustering and diarization; robot localization; online learning; semantic place modeling; advanced human-computer interaction

Lay Summary (English)

Lay summary
This project encompasses fundamental research aimed at developing advanced techniques for Interactive Cognitive Systems (computers and robots) that process and interpret cognitive audio and visual scenes. Although oriented towards fundamental research, its core objective is the study of methods applied to the domains of activity of the Idiap Research Institute.

In the present proposal, we briefly describe four research projects that address some of the challenges involved:

ICS-1: Robust privacy-sensitive audio features for interaction modeling. On the one hand, advances in cognitive systems raise more and more privacy issues. On the other hand, it is interesting to see how much information about human-computer and human-human interaction can be extracted using audio features that fully preserve the privacy of the users (typically by avoiding the extraction of lexical and identity information). This sub-project therefore investigates how to detect and model interaction, and how it relates to other aspects of natural human behaviour, based on privacy-preserving features only.

ICS-2: Multilingual speech recognition. The goal of this sub-project is to investigate extensively how to extend Idiap's leading edge in (English) speech recognition to multiple languages, including at least the Swiss national languages. In this context, we are looking for principled approaches to the definition and training of shared multilingual phone sets, fast adaptation of monolingual systems, or composition of multiple (monolingual) systems.

ICS-3: Learning semantic spatial concepts for mobile robots. In this sub-project, we investigate how a robot can adapt itself to a possibly changing environment. Rather than sticking to static outdoor environments, we focus on indoor home or office environments, where furniture and people move around. Although we initially focus on the computer vision modality, the work has the potential to branch out into audio-based cognition.

ICS-4: Conversation analysis based on speaker diarization. Idiap has always been at the leading edge of speaker diarization ("Who spoke when?"). ICS-4 proposes a novel speaker diarization approach that adapts to its context, taking cues not only from the speakers themselves but also from the higher-level semantic context available from dialogue and turn-taking.

The above sub-projects span the traditional cognitive spectrum of audio and video, but also include the emerging field of social cognition, and should provide potential for strong interactions. These interactions will be encouraged through the use of common tasks, databases, and software.
Last update: 21.02.2013

Publications

Boosting under-resourced speech recognizers by exploiting out of language data - Case study on Afrikaans
Imseng David, Bourlard Hervé, Garner Philip N. (2012), Boosting under-resourced speech recognizers by exploiting out of language data - Case study on Afrikaans, in Spoken Languages Technologies for Under-resourced Languages.
Comparing different acoustic modeling techniques for multilingual boosting
Imseng David, Dines John, Motlicek Petr, Garner Philip N., Bourlard Hervé (2012), Comparing different acoustic modeling techniques for multilingual boosting, in INTERSPEECH.
MediaParl: Bilingual mixed language accented speech database
Imseng David, Bourlard Hervé, Caesar Holger, Garner Philip N., Lecorvé Gwénolé, Nanchen Alexandre (2012), MediaParl: Bilingual mixed language accented speech database, in Spoken Language Technology.
The ICSI RT-09 Speaker Diarization System
Friedland Gerald, Janin Adam, Imseng David, Anguera Xavier, Gottlieb Luke, Huijbregts Marijn, Knox Mary Tai, Vinyals Oriol (2012), The ICSI RT-09 Speaker Diarization System, in IEEE Transactions on Audio, Speech, and Language Processing, 20(2), 371-381.
Using KL-divergence and multilingual information to improve ASR for under-resourced languages
Imseng David, Bourlard Hervé, Garner Philip N. (2012), Using KL-divergence and multilingual information to improve ASR for under-resourced languages, in IEEE International Conference on Acoustics, Speech and Signal Processing.
An Information Theoretic Combination of MFCC and TDOA Features for Speaker Diarization
Vijayasenan Deepu, Valente Fabio, Bourlard Hervé (2011), An Information Theoretic Combination of MFCC and TDOA Features for Speaker Diarization, in IEEE Transactions on Audio, Speech, and Language Processing, 19(2), 431-438.
Current trends in multilingual speech processing
Bourlard Hervé, Dines John, Magimai.-Doss Mathew, Garner Philip N., Imseng David, Motlicek Petr, Liang Hui, Saheer Lakshmi, Valente Fabio (2011), Current trends in multilingual speech processing, in Sadhana, 36(5), 885-915.
Fast and flexible Kullback-Leibler divergence based acoustic modeling for non-native speech recognition
Imseng David, Rasipuram Ramya, Magimai.-Doss Mathew (2011), Fast and flexible Kullback-Leibler divergence based acoustic modeling for non-native speech recognition, in IEEE workshop on Automatic Speech Recognition and Understanding.
Improving non-native ASR through stochastic multilingual phoneme space transformations
Imseng David, Bourlard Hervé, Dines John, Garner Philip N., Magimai.-Doss Mathew (2011), Improving non-native ASR through stochastic multilingual phoneme space transformations, in INTERSPEECH.
Language dependent universal phoneme posterior estimation for mixed language speech recognition
Imseng David, Bourlard Hervé, Magimai.-Doss Mathew, Dines John (2011), Language dependent universal phoneme posterior estimation for mixed language speech recognition, in IEEE International Conference on Acoustics, Speech and Signal Processing.
LP Residual Features for Robust, Privacy-Sensitive Speaker Diarization
Parthasarathi Sree Hari Krishnan, Bourlard Hervé, Gatica-Perez Daniel (2011), LP Residual Features for Robust, Privacy-Sensitive Speaker Diarization, in INTERSPEECH, Proceedings of Interspeech, Florence, IT.
A Multi Cue Discriminative Approach to Semantic Place Classification
Fornoni Marco, Martinez-Gomez Jesus, Caputo Barbara (2010), A Multi Cue Discriminative Approach to Semantic Place Classification, in CLEF 2010 Notebook Papers/LABs/Workshops, Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation, Amsterdam, NL.
Advances in Fast Multistream Diarization based on the Information Bottleneck Framework
Vijayasenan Deepu, Valente Fabio, Bourlard Hervé (2010), Advances in Fast Multistream Diarization based on the Information Bottleneck Framework, in INTERSPEECH, Proceedings of Interspeech, Makuhari, JP.
An Adaptive Initialization Method for Speaker Diarization based on Prosodic Features
Imseng David, Friedland Gerald (2010), An Adaptive Initialization Method for Speaker Diarization based on Prosodic Features, in IEEE International Conference on Acoustics, Speech and Signal Processing, (Idiap-RR-2).
An Information Theoretic Approach to Speaker Diarization of Meeting Recordings
Vijayasenan Deepu (2010), An Information Theoretic Approach to Speaker Diarization of Meeting Recordings, Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland.
Evaluating the Robustness of Privacy-Sensitive Audio Features for Speech Detection in Personal Audio Log Scenarios
Parthasarathi Sree Hari Krishnan, Magimai-Doss Mathew, Bourlard Hervé, Gatica-Perez Daniel (2010), Evaluating the Robustness of Privacy-Sensitive Audio Features for Speech Detection in Personal Audio Log Scenarios, in IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Dallas, USA.
Hierarchical Multilayer Perceptron based Language Identification
Imseng David, Magimai.-Doss Mathew, Bourlard Hervé (2010), Hierarchical Multilayer Perceptron based Language Identification, in INTERSPEECH.
Multistream Speaker Diarization beyond Two Acoustic Feature Streams
Vijayasenan Deepu, Valente Fabio, Bourlard Hervé (2010), Multistream Speaker Diarization beyond Two Acoustic Feature Streams, in IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Dallas, USA.
OM-2: An Online Multi-class Multi-kernel Learning Algorithm
Luo Jie, Orabona Francesco, Fornoni Marco, Caputo Barbara, Cesa-Bianchi Nicolo (2010), OM-2: An Online Multi-class Multi-kernel Learning Algorithm, in CVPR 2010, IEEE Online Learning for Computer Vision Workshop.
Towards mixed language speech recognition systems
Imseng David, Bourlard Hervé, Magimai.-Doss Mathew (2010), Towards mixed language speech recognition systems, in INTERSPEECH.
Tuning-Robust Initialization Methods for Speaker Diarization
Imseng David, Friedland Gerald (2010), Tuning-Robust Initialization Methods for Speaker Diarization, in IEEE Transactions on Audio, Speech, and Language Processing, 18(8), 2028-2037.
Variational Bayesian Speaker Diarization on Meeting Recordings
Valente Fabio, Motlicek Petr, Vijayasenan Deepu (2010), Variational Bayesian Speaker Diarization on Meeting Recordings, in IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Dallas, USA.
Robust Speaker Diarization for Short Speech Recordings
Imseng David, Friedland Gerald (2009), Robust Speaker Diarization for Short Speech Recordings, in IEEE workshop on Automatic Speech Recognition and Understanding.
Privacy-Sensitive Audio Features for Speech/Nonspeech Detection
Parthasarathi Sree Hari Krishnan, Gatica-Perez Daniel, Bourlard Hervé, Magimai-Doss Mathew, Privacy-Sensitive Audio Features for Speech/Nonspeech Detection, in IEEE Transactions on Audio, Speech, and Language Processing.

Associated projects

Number | Title | Start | Funding scheme
144281 | Adaptive Multilingual Speech Processing | 01.10.2012 | Project funding (Div. I-III)
141903 | SIWIS: Spoken Interaction with Interpretation in Switzerland | 01.12.2012 | Sinergia
122062 | MULTI: Multimodal Interaction and Multimedia Data Mining | 01.10.2008 | Project funding (Div. I-III)
146411 | Interactive Cognitive Systems, Indoor Scene Recognition for Intelligent Systems | 01.04.2013 | Project funding (Div. I-III)

Abstract

Cognitive systems have always been a strong theme of the computing research community in general, and of Idiap in particular. Until the last decade, such research tended to be placed under headings such as speech recognition or image processing. The scenarios were typically unimodal, with tight constraints on how a human could behave with respect to the computer. As time progresses, however, our systems produce better and better results on evaluation databases, and we are both able and obliged to move the goalposts. For example, speech recognition has to deal with spontaneity, background noise, adaptation to the environment and task, and multilingual aspects (too often underestimated, with the main emphasis on English only). In robotic vision, also covered by the present project, computers have to adapt to changing environments and extract relevant semantic information.

This project thus encompasses fundamental research aimed at developing advanced techniques for Interactive Cognitive Systems (computers and robots) that process and interpret cognitive audio and visual scenes. Although oriented towards fundamental research, its core objective is the study of methods applied to the domains of activity of the Idiap Research Institute.

In the present proposal, we briefly describe four research projects that embody some of the challenges described above:

ICS-1: Robust privacy-sensitive audio features for interaction modeling. On the one hand, advances in cognitive systems raise more and more privacy issues. On the other hand, it is interesting to see how much information about human-computer and human-human interaction can be extracted using audio features that fully preserve the privacy of the users (typically by avoiding the extraction of lexical and identity information). This sub-project therefore investigates how to detect and model interaction, and how it relates to other aspects of natural human behaviour, based on privacy-preserving features only.

ICS-2: Multilingual speech recognition. The goal of this sub-project is to investigate extensively how to extend Idiap’s leading edge in (English) speech recognition to multiple languages, including at least the Swiss national languages. In this context, we are looking for principled approaches to the definition and training of shared multilingual phone sets, fast adaptation of monolingual systems, or composition of multiple (monolingual) systems.

ICS-3: Learning semantic spatial concepts for mobile robots. In this sub-project, we investigate how a robot can adapt itself to a possibly changing environment. Rather than sticking to static outdoor environments, we focus on indoor home or office environments, where furniture and people move around. Although we initially focus on the computer vision modality, the work has the potential to branch out into audio-based cognition.

ICS-4: Conversation analysis based on speaker diarization. Idiap has always been at the leading edge of speaker diarization (“Who spoke when?”). ICS-4 proposes a novel speaker diarization approach that adapts to its context, taking cues not only from the speakers themselves but also from the higher-level semantic context available from dialogue and turn-taking.

The above sub-projects span the traditional cognitive spectrum of audio and video, but also include the emerging field of social cognition, and should provide potential for strong interactions. These interactions will be encouraged through the use of common tasks, databases, and software.