speech-based human machine interaction; face tracking; lip-reading; microphone array; audio-visual speech recognition; audio-visual speech
Naghibi Tofigh, Hoffmann Sarah, Pfister Beat (2013), An efficient method to estimate pronunciation from multiple utterances, in Interspeech 2013, Lyon (France).
Naghibi Tofigh, Hoffmann Sarah, Pfister Beat (2013), Convex approximation of the NP-hard search problem in feature subset selection, in ICASSP 2013, Vancouver (Canada).
Dantone Matthias, Gall Jürgen, Leistner Christian, Van Gool Luc (2013), Human Pose Estimation using Body Parts Dependent Joint Regressors, in Conference on Computer Vision and Pattern Recognition
Fanelli Gabriele, Dantone Matthias, Van Gool Luc (2013), Real time 3D face alignment with random forests-based active appearance models, in Automatic Face and Gesture Recognition
Naghibi T., Pfister B. (2012), An Approach to Prevent Adaptive Beamformers from Cancelling the Desired Signal, in ICASSP 2012, Kyoto (Japan).
Naghibi T., Pfister B. (2012), Beamformer Design For Nonstationary Signals by Means of Interfrequency Correlations, in SAM 2012, Hoboken, NJ (USA).
Fanelli Gabriele, Dantone Matthias, Gall Jürgen, Fossati Andrea, Van Gool Luc (2012), Random forests for real time 3D face analysis, in International Journal of Computer Vision, 101(3), 437-458.
Fanelli G., Gall J. (2012), Real time 3D head pose estimation: recent achievements and future challenges, in International Symposium on Communications, Control and Signal Processing
Dantone Matthias, Gall Jürgen, Fanelli Gabriele, Van Gool Luc (2012), Real-time facial feature detection using conditional regression forests, in Computer Vision and Pattern Recognition, Rhode Island.
Fanelli G., Weise T., Gall J., Van Gool L. (2011), Real Time Head Pose Estimation from Consumer Depth Cameras.
Speech-controlled machines can successfully be deployed in rather quiet locations. However, they fail in situations with high background noise and, in particular, in the presence of other voices. In these situations the error rate of the speech recognition component of such systems rises drastically and, even worse, the system cannot distinguish between the spoken commands of the user and other background speech.

The aim of this project is to improve the noise robustness of speech-based human-machine interaction (HMI) by means of information from the visual channel. For example, by observing the mouth of the user, distinguishing user speech from background speech may become much more reliable. The basic idea is to exploit helpful information from the visual channel and combine it with the information from the audio channel. This principle can be applied to various tasks such as voice activity detection, speech recognition and user verification, all of which are particularly important in speech-based HMI.

There is currently substantial research activity in the area of multimodal approaches. Most of this research focuses on the processing of multimodal data recorded in meeting rooms. In contrast to the meeting scenario, where processing is done off-line, HMI requires on-line processing. Algorithms for such a task therefore not only need to work as correctly as possible, but also have to be efficient.

For the development of such algorithms, large amounts of audio-visual data recorded from HMI are needed. As such data are currently not available, the project is split into several cycles of development and testing. After an initial development phase, the algorithms will be integrated into a demonstrator, which users will then try out. The data recorded from these trials will be used to evaluate the algorithms and to guide development within the next cycle.
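The idea of combining audio and visual evidence can be illustrated with a minimal late-fusion sketch for voice activity detection: each modality produces a score in [0, 1], and a weighted sum is compared against a threshold. All function names, score formulas and weights here are illustrative assumptions, not the project's actual algorithms.

```python
import numpy as np

# Illustrative late-fusion sketch for audio-visual voice activity detection.
# The scoring functions and weights below are assumptions for this example,
# not the methods developed in the project.

def audio_score(frame):
    """Log-energy of an audio frame, squashed to [0, 1] via a sigmoid.
    The offset of 6.0 is an arbitrary operating point for this sketch."""
    energy = float(np.mean(np.asarray(frame) ** 2))
    return 1.0 / (1.0 + np.exp(-(np.log(energy + 1e-12) + 6.0)))

def visual_score(mouth_heights):
    """Variability of the mouth opening over the frame window, squashed
    to [0, 1]. High variability suggests the observed user is articulating."""
    return float(np.tanh(10.0 * np.std(np.asarray(mouth_heights))))

def av_vad(frame, mouth_heights, w_audio=0.5, w_visual=0.5, threshold=0.5):
    """Late fusion: weighted sum of per-modality scores vs. a threshold.
    With w_audio = 0.5, loud background speech alone cannot trigger a
    detection, because the audio score is bounded by 1."""
    score = w_audio * audio_score(frame) + w_visual * visual_score(mouth_heights)
    return score >= threshold

# Toy example: loud audio with a moving mouth is accepted as user speech,
# while the same audio with a static mouth (background speaker) is rejected.
rng = np.random.default_rng(0)
loud_audio = 0.3 * rng.standard_normal(160)          # one 10 ms frame at 16 kHz
moving_mouth = np.array([0.1, 0.5, 0.2, 0.6, 0.1])   # mouth opening per video frame
static_mouth = np.array([0.1, 0.1, 0.1, 0.1, 0.1])

print(av_vad(loud_audio, moving_mouth))  # user is speaking
print(av_vad(loud_audio, static_mouth))  # background speech, rejected
```

The design choice shown here (late fusion of independent per-modality scores) is only one option; feature-level fusion or joint audio-visual models are equally plausible under the project description.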