Project


Vision-supported speech-based human machine interaction

Applicant Pfister Beat
Number 130224
Funding scheme Project funding (Div. I-III)
Research institution Institut für Technische Informatik und Kommunikationsnetze ETH Zürich
Institution of higher education ETH Zurich - ETHZ
Main discipline Information Technology
Start/End 01.04.2010 - 31.03.2014
Approved amount 426'376.00

All Disciplines (2)

Discipline
Information Technology
Other disciplines of Engineering Sciences

Keywords (6)

speech-based human machine interaction; face tracking; lip-reading; microphone array; audio-visual speech recognition; audio-visual speech

Lay Summary (English)

Speech-controlled machines can successfully be deployed in rather quiet locations. However, they fail in situations with high background noise, and in particular in the presence of other voices. In these situations the error rate of the speech recognition component of such systems rises drastically and, even worse, the system cannot distinguish between the spoken commands of the user and other background speech.

The aim of this project is to improve the noise robustness of speech-based human machine interaction (HMI) by means of information from the visual channel. For example, by observing the mouth of the user, distinguishing user speech from background speech may become much more reliable. The basic idea is to exploit helpful information from the visual channel and combine it with the information from the audio channel. This principle can be applied to various tasks such as voice activity detection, speech recognition and user verification, all of which are particularly important in speech-based HMI.
Last update: 21.02.2013

Responsible applicant and co-applicants

Employees

Publications

Publication
An efficient method to estimate pronunciation from multiple utterances
Naghibi Tofigh, Hoffmann Sarah, Pfister Beat (2013), An efficient method to estimate pronunciation from multiple utterances, in Interspeech, Lyon (France).
Convex approximation of the NP-hard search problem in feature subset selection
Naghibi Tofigh, Hoffmann Sarah, Pfister Beat (2013), Convex approximation of the NP-hard search problem in feature subset selection, in ICASSP 2013, Vancouver (Canada).
Human Pose Estimation using Body Parts Dependent Joint Regressors
Dantone Matthias, Gall Jürgen, Leistner Christian, Van Gool Luc (2013), Human Pose Estimation using Body Parts Dependent Joint Regressors, in Conference on Computer Vision and Pattern Recognition.
Real time 3D face alignment with random forests-based active appearance models
Fanelli Gabriele, Dantone Matthias, Van Gool Luc (2013), Real time 3D face alignment with random forests-based active appearance models, in Automatic Face and Gesture Recognition.
An Approach to Prevent Adaptive Beamformers from Cancelling the Desired Signal
Naghibi Tofigh, Pfister Beat (2012), An Approach to Prevent Adaptive Beamformers from Cancelling the Desired Signal, in ICASSP 2012, Kyoto (Japan).
Beamformer Design For Nonstationary Signals by Means of Interfrequency Correlations
Naghibi Tofigh, Pfister Beat (2012), Beamformer Design For Nonstationary Signals by Means of Interfrequency Correlations, in SAM 2012, Hoboken, NJ (USA).
Random forests for real time 3D face analysis
Fanelli Gabriele, Dantone Matthias, Gall Jürgen, Fossati Andrea, Van Gool Luc (2012), Random forests for real time 3D face analysis, in International Journal of Computer Vision, 101(3), 437-458.
Real time 3D head pose estimation: recent achievements and future challenges
Fanelli Gabriele, Gall Jürgen (2012), Real time 3D head pose estimation: recent achievements and future challenges, in International Symposium on Communications, Control and Signal Processing, Rome.
Real-time facial feature detection using conditional regression forests
Dantone Matthias, Gall Jürgen, Fanelli Gabriele, Van Gool Luc (2012), Real-time facial feature detection using conditional regression forests, in Computer Vision and Pattern Recognition, Rhode Island.
Real Time Head Pose Estimation from Consumer Depth Cameras
Fanelli Gabriele, Weise T., Gall Jürgen, Van Gool Luc (2011), Real Time Head Pose Estimation from Consumer Depth Cameras, in Pattern Recognition (DAGM 2011).

Abstract

Speech-controlled machines can successfully be deployed in rather quiet locations. However, they fail in situations with high background noise, and in particular in the presence of other voices. In these situations the error rate of the speech recognition component of such systems rises drastically and, even worse, the system cannot distinguish between the spoken commands of the user and other background speech.

The aim of this project is to improve the noise robustness of speech-based human machine interaction (HMI) by means of information from the visual channel. For example, by observing the mouth of the user, distinguishing user speech from background speech may become much more reliable. The basic idea is to exploit helpful information from the visual channel and combine it with the information from the audio channel. This principle can be applied to various tasks such as voice activity detection, speech recognition and user verification, all of which are particularly important in speech-based HMI.

There is currently substantial research activity in the area of multimodal approaches. Most of this research focuses on the processing of multimodal data recorded in meeting rooms. In contrast to the meeting scenario, where processing is done off-line, HMI requires on-line processing. Algorithms for such a task must therefore not only work as correctly as possible, but also be efficient.

For the development of such algorithms, large amounts of audio-visual data recorded from HMI are needed. As such data are currently not available, the project is split into several cycles of development and testing. After an initial development phase, the algorithms will be integrated into a demonstrator that users will try out. The data recorded from these trials will be used to evaluate the algorithms and to guide development in the next cycle.
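The fusion principle described in the abstract — combining evidence from the audio and visual channels — can be illustrated with a toy frame-wise voice activity detector. This is a hedged sketch, not the project's actual system: the scoring functions, the fusion weight `alpha` and the `threshold` are all hypothetical placeholders for the project's trained models.

```python
# Illustrative sketch of audio-visual late fusion for voice activity
# detection (VAD). All functions and parameters are hypothetical, chosen
# only to show the principle of combining the two modalities.

def audio_score(frame):
    """Short-time energy of an audio frame, a simple acoustic VAD cue."""
    return sum(s * s for s in frame) / len(frame)

def visual_score(mouth_heights):
    """Mean absolute change in mouth opening over a frame window,
    a simple proxy for lip motion."""
    diffs = [abs(b - a) for a, b in zip(mouth_heights, mouth_heights[1:])]
    return sum(diffs) / len(diffs)

def fused_vad(frame, mouth_heights, alpha=0.5, threshold=0.1):
    """Weighted late fusion: declare speech only when the combined
    audio-visual evidence exceeds the threshold."""
    score = alpha * audio_score(frame) + (1 - alpha) * visual_score(mouth_heights)
    return score > threshold

# Background speech scenario: acoustic energy is high, but the user's
# lips are still, so the fused detector rejects the frame.
noisy_frame = [0.4, -0.5, 0.45, -0.4]      # high acoustic energy
still_lips = [1.0, 1.0, 1.0, 1.0]          # no mouth movement
print(fused_vad(noisy_frame, still_lips))  # → False
```

With moving lips the same acoustic frame would be accepted, which is exactly the behaviour the project aims for: background voices without matching lip motion are no longer mistaken for user commands.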