Project

Motion Models for Monocular People Tracking

English title Motion Models for Monocular People Tracking
Applicant Fua Pascal
Number 144318
Funding scheme Project funding (Div. I-III)
Research institution Laboratoire de vision par ordinateur EPFL - IC - ISIM - CVLAB
Institution of higher education EPF Lausanne - EPFL
Main discipline Information Technology
Start/End 01.09.2013 - 31.08.2015
Approved amount 115'020.00

Keywords (3)

Motion Tracking; Video; Computer Vision

Lay Summary (French)

Lead
Modeling the human body and its movements remains difficult for several reasons: humans have a complex geometry; their clothing, as it deforms, makes motion analysis harder; and the different body parts often occlude one another. This is the problem we intend to work on. Potential applications include athletic training, surveillance, entertainment, and electronic publishing.
Lay summary

In earlier work, we showed that, in a multi-camera context, people tracking could be formulated as a Linear Programming problem. The trajectories are computed as the global optimum of a well-defined convex objective function, which makes the process both robust and fast. In this project, we intend to demonstrate that this approach remains applicable when only a single camera is used.

To this end, we propose a two-step approach. First, we will detect individuals and their 3D poses in each image independently, while taking into account the occlusions produced by other people present in the scene. Then, we will select among all these detections those that result in coherent motion conforming to an appropriate motion model.

In essence, in our earlier work, ambiguities were resolved by using multiple cameras. Here, we intend to show that they can be resolved by taking temporal consistency into account, which will make the approach more generic and easier to deploy.

Last update: 15.08.2013

Lay Summary (English)

Lead
Today, there is great interest in capturing complex motions solely by analyzing video sequences because of the many potential applications. These include athletic training, surveillance, and entertainment. However, existing techniques remain fairly brittle for several reasons: humans have a complex articulated geometry overlaid with deformable tissues, skin, and loosely attached clothing, and their motion is often complex and self-occluding. These are the problems we address in this work.
Lay summary
Most early approaches to monocular people tracking, including some we developed earlier under SNSF funding, involved recursive frame-to-frame tracking and were found to be brittle, due, among other things, to distractions and occlusions from other people or scene objects. Since then, the focus has shifted to "tracking by detection," which involves detecting people more or less independently in every frame and then linking the detections across frames. This is much more robust to algorithmic failures in isolated frames.

In earlier work, we showed that, in a multi-camera context, this process could be formulated as an Integer Programming (IP) problem whose solution is a set of trajectories. These trajectories are therefore found as the global optimum of a well-defined convex objective function, which makes the process both robust and fast. Since single-camera solutions are easier to deploy and therefore more attractive, we have been exploring for the past year how such an approach can be made to work monocularly.
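
As a hedged illustration of what such a formulation can look like, the linking problem is commonly written as a flow problem on a graph whose nodes are detections and whose edges connect detections in consecutive frames. All symbols below are assumptions introduced for this sketch, not notation taken from the project: $V$ is the set of detections, $s$ and $t$ are virtual source and sink nodes where trajectories start and end, $E$ is the set of allowed transitions, $c_{ij}$ is the cost of linking detection $i$ to detection $j$, and $f_{ij}$ indicates whether a trajectory uses that link.

```latex
\begin{aligned}
\min_{f} \quad & \sum_{(i,j) \in E} c_{ij}\, f_{ij} \\
\text{s.t.} \quad
  & \sum_{j : (i,j) \in E} f_{ij} \;=\; \sum_{k : (k,i) \in E} f_{ki}
    && \forall i \in V \setminus \{s, t\}, \\
  & \sum_{j : (i,j) \in E} f_{ij} \;\le\; 1
    && \forall i \in V \setminus \{s, t\}, \\
  & f_{ij} \in \{0, 1\}
    && \forall (i,j) \in E .
\end{aligned}
```

The first constraint says that every trajectory entering a detection must also leave it, and the second that a detection belongs to at most one trajectory. Relaxing $f_{ij} \in \{0,1\}$ to $0 \le f_{ij} \le 1$ gives a linear program, which is convex; for network-flow constraints of this kind the relaxation typically has an integral optimal solution, which is one way to read the statement about a global optimum.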

The framework we propose includes two main steps. First, we detect people and their poses in individual frames while taking occlusions into account. Occlusions are handled by an iterative scheme that selects the most likely people detections given what is known about potential occlusions produced by other people, recomputes the likely occluded areas, and iterates. Second, we select among all these detections those that are most consistent across time, according to an appropriate motion model.
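
The iterative occlusion scheme of the first step can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: people are reduced to 1-D intervals on the image plane, the raw detector score is taken to be roughly proportional to the visible part of a person, and the names `Candidate`, `visible_fraction`, `select_detections`, `threshold`, and `min_visible` are hypothetical. It is not the project's detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    x0: float     # left edge of a 1-D image interval (stand-in for a box)
    x1: float     # right edge
    depth: float  # distance to the camera; smaller means closer
    score: float  # raw detector evidence, assumed to scale with visibility


def visible_fraction(c, selected):
    """Fraction of c not hidden by selected candidates closer to the camera."""
    covers = sorted(
        (max(c.x0, o.x0), min(c.x1, o.x1))
        for o in selected
        if o.depth < c.depth
    )
    covered, cursor = 0.0, c.x0
    for a, b in covers:
        a = max(a, cursor)      # skip the part already counted
        if b > a:
            covered += b - a
            cursor = b
    return 1.0 - covered / (c.x1 - c.x0)


def select_detections(cands, threshold=0.5, min_visible=0.2, max_iters=10):
    """Alternate between selecting detections and recomputing occlusions."""
    selected = frozenset()
    for _ in range(max_iters):
        new = set()
        for c in cands:
            vis = visible_fraction(c, selected)
            if vis < min_visible:
                continue        # too hidden to be judged reliably
            # Compensate the raw score for the part known to be occluded.
            if c.score / vis >= threshold:
                new.add(c)
        new = frozenset(new)
        if new == selected:     # occlusion estimate is stable: stop
            break
        selected = new
    return sorted(selected, key=lambda c: c.depth)


candidates = [
    Candidate("A", 0.0, 2.0, depth=1.0, score=0.90),  # fully visible person
    Candidate("B", 1.0, 3.0, depth=2.0, score=0.35),  # half hidden behind A
    Candidate("C", 5.0, 6.0, depth=3.0, score=0.20),  # weak, unoccluded
]
print([c.name for c in select_detections(candidates)])
```

In this toy example, B is rejected in the first pass because its raw score is low, then accepted once A is known to hide half of it. C stays rejected because no occlusion explains its weak score.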

During the first phase of this project, we focused on the first step. We have also run some preliminary experiments involving the second step and will now turn them into a solid, well-formalized algorithm during the proposed project continuation. This will involve:

 - casting the linking of detected poses at individual time steps into an IP framework,

 - developing algorithms for computing the transition probabilities between poses so as to reliably obtain the desired result when solving the IP problem,

 - developing a feedback mechanism from the second step to the first to correct potential mistakes.
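
The first two items can be sketched for the simplest possible case. The example below is an assumption-laden toy, not the project's algorithm: it tracks a single person, assumes at least one detection in every frame, uses a Gaussian on the pose change as a stand-in transition model, and replaces the IP solver with dynamic programming, which finds the same optimum when only one trajectory is sought. The names `transition_log_prob`, `link_single_track`, and `sigma` are hypothetical.

```python
import math


def transition_log_prob(pose_a, pose_b, sigma=1.0):
    """Log of an unnormalised Gaussian on the pose change between frames."""
    d2 = sum((a - b) ** 2 for a, b in zip(pose_a, pose_b))
    return -d2 / (2.0 * sigma ** 2)


def link_single_track(frames, sigma=1.0):
    """Pick one detection per frame maximising the total log score.

    frames: list (one entry per frame) of lists of (pose, detection_prob).
    Returns the index of the chosen detection in each frame.
    """
    # best[t][j]: best score of any path ending at detection j of frame t.
    best = [[math.log(p) for _, p in frames[0]]]
    back = []  # back[t-1][j]: best predecessor of detection j of frame t
    for t in range(1, len(frames)):
        row, ptr = [], []
        for pose, p in frames[t]:
            scores = [
                best[-1][i] + transition_log_prob(prev_pose, pose, sigma)
                for i, (prev_pose, _) in enumerate(frames[t - 1])
            ]
            i_best = max(range(len(scores)), key=scores.__getitem__)
            row.append(scores[i_best] + math.log(p))
            ptr.append(i_best)
        best.append(row)
        back.append(ptr)

    # Backtrack from the best final detection.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return path[::-1]


# Toy poses are 2-D points; real poses would be higher-dimensional.
frames = [
    [((0.0, 0.0), 0.90), ((5.0, 5.0), 0.60)],
    [((5.2, 5.1), 0.95), ((0.3, 0.1), 0.70)],
    [((0.6, 0.2), 0.80), ((9.0, 9.0), 0.90)],
]
print(link_single_track(frames))  # [0, 1, 0]
```

Note that the second frame's highest-scoring detection is not chosen: the transition term makes the smoothly moving, lower-scoring detection the better link, which is the kind of ambiguity that temporal consistency is meant to resolve.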

In essence, in our earlier work, ambiguities were resolved by using multiple cameras. Here, we intend to show that they can be handled just as well by enforcing continuity on the detected poses. We will endeavor to show that this yields improved performance and results in a truly automated system that can handle generic behaviors in real-world environments, using video acquired with a single camera and in potentially adverse conditions.

Last update: 15.08.2013

Publications

Publication
Probability Occupancy Maps for Occluded Depth Images
Bagautdinov Timur, Fua Pascal, Fleuret François (2015), Probability Occupancy Maps for Occluded Depth Images, in Conference on Computer Vision and Pattern Recognition, Boston, MA.

Collaboration

Group / person Country
Types of collaboration
IDIAP Switzerland (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication

Associated projects

Number Title Start Funding scheme
129495 Motion Models for Monocular People Tracking 01.10.2010 Project funding (Div. I-III)
159248 Motion Models for Monocular People Tracking 01.09.2015 Project funding (Div. I-III)
147693 Tracking in the Wild 01.01.2014 Sinergia

Abstract

Modeling the human body and its movements is one of the most challenging problems in Computer Vision. Today, there is great interest in capturing complex motions solely by analyzing video sequences. Potential applications include athletic training, surveillance, entertainment, and electronic publishing.

Most early approaches to monocular people tracking, including some we developed earlier under SNSF funding, involved recursive frame-to-frame tracking and were found to be brittle, due, among other things, to distractions and occlusions from other people or scene objects. Since then, the focus has shifted to "tracking by detection," which involves detecting people more or less independently in every frame and then linking the detections across frames. This is much more robust to algorithmic failures in isolated frames.

In earlier work, we showed that, in a multi-camera context, this process could be formulated as an Integer Programming (IP) problem whose solution is a set of trajectories. These trajectories are therefore found as the global optimum of a well-defined convex objective function, which makes the process both robust and fast. Since single-camera solutions are easier to deploy and therefore more attractive, we have been exploring for the past year how such an approach can be made to work monocularly.

The framework we propose includes two main steps. First, we detect people and their poses in individual frames while taking occlusions into account. Occlusions are handled by an iterative scheme that selects the most likely people detections given what is known about potential occlusions produced by other people, recomputes the likely occluded areas, and iterates. Second, we select among all these detections those that are most consistent across time, according to an appropriate motion model.

During the first phase of this project, we focused on the first step. We have also run some preliminary experiments involving the second step and will now turn them into a solid, well-formalized algorithm during the proposed project continuation. This will involve:

 - casting the linking of detected poses at individual time steps into an IP framework,

 - developing algorithms for computing the transition probabilities between poses so as to reliably obtain the desired result when solving the IP problem,

 - developing a feedback mechanism from the second step to the first to correct potential mistakes.

In essence, in our earlier work, ambiguities were resolved by using multiple cameras. Here, we intend to show that they can be handled just as well by enforcing continuity on the detected poses. We will endeavor to show that this yields improved performance and results in a truly automated system that can handle generic behaviors in real-world environments, using video acquired with a single camera and in potentially adverse conditions.