Project


Robust face tracking, feature extraction and multimodal fusion for audio-visual speech recognition and visual attention modeling in complex environment

English title Robust face tracking, feature extraction and multimodal fusion for audio-visual speech recognition and visual attention modeling in complex environment
Applicant Thiran Jean-Philippe
Number 130152
Funding scheme Project funding (Div. I-III)
Research institution Laboratoire de traitement des signaux 5 EPFL - STI - IEL - LTS5
Institution of higher education EPF Lausanne - EPFL
Main discipline Information Technology
Start/End 01.04.2010 - 31.03.2014
Approved amount 287'400.00

Keywords (8)

3D face tracking; multi-view; audio-visual speech recognition; gaze; multimodal signal processing; visual focus of attention; Face tracking; multimodal

Lay Summary (English)

Human communication is a combination of speech and non-verbal behavior. A significant part of the non-verbal information is contained in face movements and expressions. Therefore, a major step in the automatic analysis of human communication is the location and tracking of human faces. In this project, we will first tackle the problem of robust face tracking, that is, the continuous estimation of the head pose and of the facial animations in video sequences. Based on this first development, two subsequent workpackages will address important building blocks towards the automatic analysis of natural scenes, namely automatic audio-visual speech recognition and Visual Focus of Attention (VFOA) analysis. Both of them strongly rely on robust face tracking and will therefore directly exploit and benefit from the results of the first workpackage.

Our research in face tracking will rely on 3D deformable models learned from training data, which have proven effective at modeling individual face shapes and expressions and at handling self-occlusions. We will address recurrent issues in the domain, such as strong illumination variations, tracking near profile views, and automatic initialization and reinitialization. (A schematic sketch of such a deformable model is given after this summary.)

Human speech perception is bimodal in nature: we unconsciously combine audio and visual information to decide what has been spoken. Therefore, in the second workpackage we consider both the audio and the visual dimensions of the problem and develop techniques exploiting the two modalities and their interaction. Building on our previous work, we focus here on visual feature extraction and audio-visual integration in realistic situations.

In the third workpackage, gaze is recognized as one of the most important aspects of non-verbal communication and social interaction, with functions such as establishing relationships through mutual gaze, regulating the course of interaction, and expressing intimacy or social control. Exploiting again the results of the first workpackage, we will develop probabilistic models mapping visual information such as head pose or orientation into gazing directions.

In summary, this project addresses three fundamental technical components towards automatic human-to-human communication analysis. It will be an important technical contribution to both the emerging field of social signal processing, which aims at the development of computational models for machine understanding of communicative and social behavior, and human computing, which seeks to design human-centered interfaces capable of seamless interaction with people.
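To make the face-model idea above concrete, here is a minimal, illustrative sketch of a linear 3D deformable face model of the kind referred to in the summary: a mean shape deformed by identity and expression bases and then placed in the scene with the tracked head pose. All names and array shapes are hypothetical assumptions, not the project's actual implementation.

```python
import numpy as np

def deform_and_pose(mean_shape, id_basis, expr_basis, alpha, delta, R, t):
    """mean_shape: (3N,) stacked mean vertices; id_basis: (3N, K_id) identity basis;
    expr_basis: (3N, K_expr) expression basis; alpha, delta: coefficient vectors;
    R: (3, 3) head rotation; t: (3,) head translation. (Hypothetical shapes.)"""
    shape = mean_shape + id_basis @ alpha + expr_basis @ delta  # non-rigid deformation
    verts = shape.reshape(-1, 3)                                # (N, 3) vertex positions
    return verts @ R.T + t                                      # apply the rigid head pose

```

In this kind of model, tracking a face in a video frame amounts to estimating the coefficients alpha and delta together with the pose R, t that best explain the observed image features.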
Last update: 21.02.2013

Responsible applicant and co-applicants

Employees

Publications

3D head pose and gaze tracking and their application to diverse multimodal tasks
Funes Mora Kenneth Alberto, 3D head pose and gaze tracking and their application to diverse multimodal tasks, in ICMI Doctoral consortium.
A semi-automated system for accurate gaze coding in natural dyadic interactions
Funes Mora Kenneth Alberto, Nguyen Laurent Son, Gatica-Pérez Daniel, Odobez Jean Marc, A semi-automated system for accurate gaze coding in natural dyadic interactions, in ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction, Sydney.
Efficient Algorithm for Level Set Method Preserving Distance Function
Estellers V, Zosso D, Lai RJ, Osher S, Thiran JP, Efficient Algorithm for Level Set Method Preserving Distance Function, in IEEE Transactions on Image Processing, 21(12), 4722-4734.
Enhanced Compressed Sensing Recovery with Level Set Normals
Estellers Casas Virginia, Thiran Jean-Philippe, Bresson Xavier, Enhanced Compressed Sensing Recovery with Level Set Normals, in IEEE Trans. on Image Processing, 22(7), 2611-2626.
EYEDIAP: A Database for the Development and Evaluation of Gaze Estimation Algorithms from RGB and RGB-D Cameras
Funes Mora Kenneth Alberto, Monay Florent, Odobez Jean-Marc, EYEDIAP: A Database for the Development and Evaluation of Gaze Estimation Algorithms from RGB and RGB-D Cameras, in Proceedings of the ACM Symposium on Eye Tracking Research and Applications, ACM.
Gaze estimation from multimodal Kinect data
Funes Mora Kenneth Alberto, Odobez Jean-Marc, Gaze estimation from multimodal Kinect data, in CVPR Workshop on Gesture Recognition.
Harmonic Active Contours
Estellers Casas V., Zosso D., Bresson X., Thiran J.-Ph., Harmonic Active Contours, in IEEE Transactions on Image Processing, 23(1), 69-82.
Harmonic Active Contours for multichannel image segmentation
Estellers Casas Virginia, Zosso Dominique, Bresson Xavier, Thiran Jean-Philippe, Harmonic Active Contours for multichannel image segmentation, in IEEE International Conference on Image Processing, Brussels, Belgium.
Multipose Audio-Visual Speech Recognition
Estellers Virginia, Thiran Jean-Philippe, Multipose Audio-Visual Speech Recognition, in European Signal Processing Conference, Barcelona, August 2011.
Multi-pose lipreading and audio-visual speech recognition
Estellers Casas Virginia, Thiran Jean-Philippe, Multi-pose lipreading and audio-visual speech recognition, in EURASIP Journal on Advances in Signal Processing, 51, 1-39.
On dynamic stream weighting for Audio-Visual Speech Recognition
Estellers Casas Virginia, Gurban Mihai, Thiran Jean-Philippe, On dynamic stream weighting for Audio-Visual Speech Recognition, in IEEE Transactions on Audio, Speech, and Language Processing, 20(4), 1145-1157.
Overcoming Asynchrony in Audio-Visual Speech Recognition
Estellers Casas Virginia, Thiran Jean-Philippe, Overcoming Asynchrony in Audio-Visual Speech Recognition, in IEEE International Workshop on Multimedia Signal Processing, Saint Malo, France.
Person independent 3D gaze estimation from remote RGB-D cameras
Funes Mora Kenneth Alberto, Odobez Jean-Marc, Person independent 3D gaze estimation from remote RGB-D cameras, in International Conference on Image Processing, Melbourne, Australia.

Scientific events

Active participation

Title Type of contribution Title of article or contribution Date Place Persons involved
ACM Symposium on Eye Tracking Research and Applications (ETRA) Poster EYEDIAP: A Database for the Development and Evaluation of Gaze Estimation Algorithms from RGB and RGB-D Cameras 26.03.2014 Safety Harbor, United States of America Funes Mora Kenneth;
2013 International Conference on Multimodal Interaction, ICMI '13, Sydney, NSW, Australia, December 9-13, 2013 Talk given at a conference 3D head pose and gaze tracking and their application to diverse multimodal tasks 09.12.2013 Sydney, Australia Funes Mora Kenneth;
2013 International Conference on Multimodal Interaction, ICMI '13, Sydney, NSW, Australia, December 9-13, 2013 Poster A semi-automated system for accurate gaze coding in natural dyadic interactions 09.12.2013 Sydney, Australia Funes Mora Kenneth;
International Conference on Image Processing Talk given at a conference Person independent 3D gaze estimation from remote RGB-D cameras 15.09.2013 Melbourne, Australia Funes Mora Kenneth;
CVPR Workshop on Kinect and Gesture Recognition Talk given at a conference Gaze estimation from multimodal Kinect data 16.06.2012 Providence, United States of America Funes Mora Kenneth;
IEEE International Conference on Image Processing Talk given at a conference Harmonic Active Contours for multichannel image segmentation 11.09.2011 Brussels, Belgium Thiran Jean-Philippe; Estellers Casas Virginia;
European Signal Processing Conference Talk given at a conference Multipose Audio-Visual Speech Recognition 29.08.2011 Barcelona, Spain Thiran Jean-Philippe; Estellers Casas Virginia;
European Signal Processing Conference Talk given at a conference Class-specific classifiers in audio-visual speech recognition 23.08.2010 Aalborg, Denmark Estellers Casas Virginia; Thiran Jean-Philippe;


Awards

Title Year
Best student paper award for the work "Gaze estimation from multimodal Kinect data", Kenneth Funes, Jean-Marc Odobez, in Proc. of CVPR Workshop on Gesture Recognition, Rhode Island, US, June 2012. 2012

Associated projects

Number Title Start Funding scheme
153085 G3E: Geometric Generative Gaze Estimation model 01.04.2014 Project funding (Div. I-III)
136811 Multimodal Computational Modeling of Nonverbal Social Behavior in Face to Face Interaction 01.12.2011 Ambizione

Abstract

Human communication is a combination of speech and non-verbal behavior. A significant part of the non-verbal information is contained in face movements and expressions. Therefore, a major step in the automatic analysis of human communication is the location and tracking of human faces. In this project, we will first tackle the problem of robust face tracking, that is, the continuous estimation of the head pose and of the facial animations in video sequences. Based on this first development, two subsequent workpackages will address important building blocks towards the automatic analysis of natural scenes, namely automatic audio-visual speech recognition and Visual Focus of Attention (VFOA) analysis. Both of them strongly rely on robust face tracking and will therefore directly exploit and benefit from the results of the first workpackage.

Our research in face tracking will rely on 3D deformable models learned from training data, which have proven effective at modeling individual face shapes and expressions and at handling self-occlusions. We will address recurrent issues in the domain (strong illumination variations, tracking near profile views, automatic initialization and reinitialization) by investigating three main points: memory-based appearance learning, which aims at building face-state-dependent mixture appearance models from past tracking observations; a multi-feature face representation, combining stable semantic structural points located around facial attributes (eyes, mouth), opportunistic sparse texture and interest points distributed throughout the face, in particular on regions with less predictable appearance (head sides), and dynamic features (head profiles); and a hybrid fitting scheme combining discriminant approaches for fast feature localization, matching for distant 3D (rigid) registration, and iterative approaches for precise model estimation.

Human speech perception is bimodal in nature: we unconsciously combine audio and visual information to decide what has been spoken. Therefore, in the second workpackage we consider both the audio and the visual dimensions of the problem and develop techniques exploiting the two modalities and their interaction. Building on our previous work, we focus here on visual feature extraction and audio-visual integration in realistic situations. The work is divided into three main tasks: exploitation of single-view video sequences, exploitation of multiple-view sequences, and application to a real-world task. We will first address the problem of non-ideal lighting conditions and of image sequences in which people suddenly move, turn their heads, or occlude their mouths; our work will cover the extraction of optimal visual features, the estimation of their reliability, and their dynamic combination with the audio stream for speech recognition. The second task involves multi-view sequences to extract more robust and reliable visual features. Finally, the developed techniques will be applied to audio-visual speech recognition in cars in different real situations.
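The dynamic combination of the audio and visual streams mentioned above (studied in the project's publication on dynamic stream weighting) can be illustrated with the following minimal sketch, which fuses per-frame audio and visual log-likelihoods with a frame-dependent weight. The function and variable names are hypothetical, and the actual reliability estimation used in the project may differ.

```python
import numpy as np

def fused_log_likelihoods(audio_ll, visual_ll, audio_weight):
    """audio_ll, visual_ll: (T, Q) per-frame log-likelihoods over Q recognizer states;
    audio_weight: (T,) values in [0, 1], e.g. derived from an audio reliability estimate."""
    lam = np.clip(audio_weight, 0.0, 1.0)[:, None]     # frame-dependent audio stream weight
    return lam * audio_ll + (1.0 - lam) * visual_ll    # exponent-weighted combination of streams

```

When the audio is clean the weight approaches 1 and recognition is driven by the acoustic stream; in noise the weight drops and the visual (lip-reading) stream contributes more.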
In the third workpackage, gaze is recognized as one of the most important aspects of non-verbal communication and social interaction, with functions such as establishing relationships through mutual gaze, regulating the course of interaction, and expressing intimacy or social control. Exploiting again the results of the first workpackage, we will develop probabilistic models mapping visual information such as head pose or orientation into gazing directions. Two main research threads will be explored. The first will rely on computer vision techniques to obtain gaze measurements from the eye region in addition to the head pose measurements, and to infer their contribution to the estimation of the gazing direction. The second will investigate gaze models of the coordination between head and gaze orientations, exploiting empirical findings from behavioral studies of alert monkeys and humans on the contribution of head and eye movements to gaze shifts. Building on our previous work, the gaze system will be exploited to identify different gazing gestures and human attitudes in dynamic human-human communication settings, such as establishing a relationship through eye contact or averting gaze.
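As a purely geometric illustration of how a tracked head pose can be combined with eye measurements to obtain a gazing direction, the following hypothetical sketch composes a world gaze direction from a head rotation and an eye-in-head direction; the project's probabilistic models go beyond this simple composition, e.g. by modeling head-eye coordination.

```python
import numpy as np

def world_gaze_direction(R_head, eye_in_head):
    """R_head: (3, 3) head rotation from the face tracker;
    eye_in_head: (3,) gaze direction in the head frame, e.g. regressed from the eye image."""
    g = R_head @ np.asarray(eye_in_head, dtype=float)   # rotate the eye direction into world coordinates
    return g / np.linalg.norm(g)                         # unit-length gaze direction

```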
In summary, this project addresses three fundamental technical components towards automatic human-to-human communication analysis. It will be an important technical contribution to both the emerging field of social signal processing, which aims at the development of computational models for machine understanding of communicative and social behavior, and human computing, which seeks to design human-centered interfaces capable of seamless interaction with people.