3D face tracking; multi-view; audio-visual speech recognition; gaze; multimodal signal processing; visual focus of attention
Funes Mora Kenneth Alberto, 3D head pose and gaze tracking and their application to diverse multimodal tasks, in ICMI Doctoral Consortium.
Funes Mora Kenneth Alberto, Nguyen Laurent Son, Gatica-Perez Daniel, Odobez Jean-Marc, A semi-automated system for accurate gaze coding in natural dyadic interactions, in ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction.
Estellers V., Zosso D., Lai R. J., Osher S., Thiran J.-Ph., Bresson X., Efficient Algorithm for Level Set Method Preserving Distance Function, in IEEE Transactions on Image Processing, 21(12), 4722-4734.
Estellers Casas Virginia, Thiran Jean-Philippe, Bresson Xavier, Enhanced Compressed Sensing Recovery with Level Set Normals, in IEEE Trans. on Image Processing
, 22(7), 2611-2626.
Funes Mora Kenneth Alberto, Monay Florent, Odobez Jean-Marc, EYEDIAP: A Database for the Development and Evaluation of Gaze Estimation Algorithms from RGB and RGB-D Cameras, in Proceedings of the ACM Symposium on Eye Tracking Research and Applications.
Funes Mora Kenneth Alberto, Odobez Jean-Marc, Gaze estimation from multimodal Kinect data, in CVPR Workshop on Gesture Recognition.
Estellers Casas V., Zosso D., Bresson X., Thiran J.-Ph., Harmonic Active Contours, in IEEE Transactions on Image Processing, 23(1), 69-82.
Estellers Casas Virginia, Zosso Dominique, Bresson Xavier, Thiran Jean-Philippe, Harmonic Active Contours for multichannel image segmentation, in IEEE International Conference on Image Processing, Brussels, Belgium, 2011.
Estellers Virginia, Thiran Jean-Philippe, Multipose Audio-Visual Speech Recognition, in European Signal Processing Conference, Barcelona, August 2011.
Estellers Casas Virginia, Thiran Jean-Philippe, Multi-pose lipreading and audio-visual speech recognition, in EURASIP Journal on Advances in Signal Processing, 51, 1-39.
Estellers Casas Virginia, Gurban Mihai, Thiran Jean-Philippe, On dynamic stream weighting for Audio-Visual Speech Recognition, in IEEE Transactions on Audio, Speech, and Language Processing, 20(4), 1145-1157.
Estellers Casas Virginia, Thiran Jean-Philippe, Overcoming Asynchrony in Audio-Visual Speech Recognition, in IEEE International Workshop on Multimedia Signal Processing, Saint Malo, France.
Funes Mora Kenneth Alberto, Odobez Jean-Marc, Person independent 3D gaze estimation from remote RGB-D cameras, in International Conference on Image Processing, Melbourne, Australia.
Human communication is a combination of speech and non-verbal behavior. A significant part of the non-verbal information is conveyed by face movements and expressions; a major step in the automatic analysis of human communication is therefore the localization and tracking of human faces. In this project, we will first tackle the problem of robust face tracking, that is, the continuous estimation of the head pose and of the facial animations in video sequences. Building on this first development, two subsequent workpackages will address important building blocks towards the automatic analysis of natural scenes, namely automatic audio-visual speech recognition and Visual Focus of Attention (VFOA) analysis. Both strongly rely on robust face tracking and will therefore directly exploit and benefit from the results of the first workpackage.

Our research in face tracking will rely on 3D deformable models learned from training data, which have proven effective at modeling individual face shapes and expressions and at handling self-occlusions. We will address recurrent issues in the domain (strong illumination variations, tracking near profile views, automatic initialization and reinitialization) by investigating three main points: memory-based appearance learning, which aims at building face-state-dependent mixture appearance models from past tracking observations; a multi-feature face representation, combining stable semantic structural points located around facial attributes (eyes, mouth), opportunistic sparse texture and interest points distributed throughout the face, in particular on regions with less predictable appearance (head sides), and dynamic features (head profiles); and a hybrid fitting scheme combining discriminative approaches for fast feature localization, matching for distant 3D (rigid) registration, and iterative approaches for precise model estimation.
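To illustrate one component of such a hybrid fitting scheme, the distant 3D rigid registration step can be solved in closed form given point correspondences, for instance with the Kabsch algorithm. The sketch below is illustrative only and not the project's implementation; the point sets stand in for matched 3D model and landmark positions.

```python
import numpy as np

def rigid_register(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst
    (Kabsch algorithm on corresponding 3D points)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t
```

In a tracker, such a closed-form alignment of detected facial landmarks to the 3D face model yields a coarse pose that the iterative fitting stage can then refine.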
Human speech perception is bimodal in nature: we unconsciously combine audio and visual information to decide what has been spoken. In the second workpackage we therefore consider both dimensions of the problem and develop techniques exploiting the two modalities and their interaction. Building on our previous work, we focus on visual feature extraction and audio-visual integration in realistic situations. The work is divided into three main tasks: exploitation of single-view video sequences, exploitation of multiple-view sequences, and application to a real-world task. We will first address non-ideal lighting conditions and image sequences where people suddenly move, turn their heads or occlude their mouths; this work covers the extraction of optimal visual features, the estimation of their reliability, and their dynamic combination with the audio stream for speech recognition. The second task exploits multi-view sequences to extract more robust and reliable visual features. Finally, the developed techniques will be applied to audio-visual speech recognition in cars in different real situations.

In the third workpackage, we build on the fact that gaze is recognized as one of the most important aspects of non-verbal communication and social interaction, with functions such as establishing relationships through mutual gaze, regulating the course of interaction, and expressing intimacy or social control. Exploiting again the results of the first workpackage, we will develop probabilistic models mapping visual information such as head pose or orientation into gazing directions. Two main research threads will be explored. The first will rely on computer vision techniques to obtain gaze measurements from the eye region in addition to the head pose measurements, and will infer their contribution to the estimation of the gazing direction.
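The dynamic audio-visual stream combination described for the second workpackage can be sketched minimally: per-class log-likelihoods of the two streams are mixed with a weight driven by an audio-reliability estimate. The sigmoid mapping from an SNR estimate to the audio weight, and its slope, are hypothetical stand-ins for a learned reliability measure, not the project's method.

```python
import numpy as np

def fuse_streams(ll_audio, ll_video, snr_db, slope=0.5):
    """Combine per-class log-likelihoods of the audio and visual
    streams with a reliability-driven weight lam in (0, 1)."""
    lam = 1.0 / (1.0 + np.exp(-slope * snr_db))   # assumed reliability mapping
    fused = lam * np.asarray(ll_audio) + (1.0 - lam) * np.asarray(ll_video)
    return int(np.argmax(fused)), lam
```

With clean audio (high SNR) the decision follows the acoustic model; as the SNR estimate drops, the visual stream (lipreading) progressively takes over.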
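For the first gaze-research thread, combining head pose with eye-in-head measurements amounts, at its simplest, to composing a rotation with a direction vector. The sketch below assumes an already-estimated head rotation matrix and eye-region direction; it is a geometric illustration, not the probabilistic model the project will develop.

```python
import numpy as np

def rot_y(angle):
    """Rotation about the vertical (y) axis, e.g. a head pan."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def gaze_direction(head_rotation, eye_dir_in_head):
    """World gaze direction: the eye-in-head unit vector (from the
    eye-region measurements) rotated by the head pose."""
    g = head_rotation @ np.asarray(eye_dir_in_head, dtype=float)
    return g / np.linalg.norm(g)
```

A frontal head with averted eyes and a panned head with straight eyes can thus produce the same gazing direction, which is why head pose alone is an ambiguous gaze cue.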
The second thread will investigate gaze models involving the coordination between head and gaze orientations, exploiting empirical findings from behavioral studies of alert monkeys and humans that describe the contribution of head and eye movements to gaze shifts. Building on our previous work, the gaze system will be exploited to identify different gazing gestures and human attitudes in dynamic human-human communication settings, such as establishing a relation through eye contact or averting gaze.

In summary, this project addresses three fundamental technical components towards automatic human-to-human communication analysis. It will be an important technical contribution both to the emerging field of social signal processing, which aims at the development of computational models for machine understanding of communicative and social behavior, and to human computing, which seeks to design human-centered interfaces capable of seamless interaction with people.
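Returning to the head-gaze coordination models of the third workpackage, the empirical finding that small gaze shifts are made with the eyes alone while the head contributes an increasing share beyond a threshold can be caricatured by a simple rule. The gain and threshold values below are hypothetical, not fitted parameters from the behavioral literature.

```python
def split_gaze_shift(amplitude_deg, head_gain=0.65, threshold_deg=20.0):
    """Illustrative decomposition of a horizontal gaze shift (degrees)
    into head and eye contributions: eyes only below the threshold,
    a growing head share above it."""
    head = max(0.0, head_gain * (amplitude_deg - threshold_deg))
    head = min(head, amplitude_deg)   # the head never overshoots the shift
    eye = amplitude_deg - head
    return head, eye
```

A model of this shape lets the system explain an observed head rotation as part of a larger gaze shift, rather than reading head pose as the gaze direction itself.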