Action recognition is key for many tasks such as automatic annotation of videos, improved human-computer interaction and guidance in monitoring public spaces. As the amount of available videos from different sources (from raw personal videos to more professional content) has dramatically increased in the last few years, new methodologies are needed to organize these datasets.
Most recent state-of-the-art recent techniques for action recognition in naturalistic and unconstrained video documents rely on Bag-of-Word (BoW) representations built from Spatio-Temporal interest point (STIP) descriptors and collected over video segments. Such methods, however, often suffer from two severe and related drawbacks:
(i) the time information is discarded, although actions are often characterized by strong temporal components; alternatively, fixed temporal grid schemes are used, assuming that the video clip is already temporally segmented.
(ii) activities in the same video segments are mixed in the representation, and plagues recognition algorithms that are based on these.
To address these issues, we will investigate novel techniques relying on principled probablistic techniques (topic models) and symbolic pattern mining to capture information lying in the temporal relationships between recognized ``action'' units. To this end we will extend our previous work on the automatic extraction of temporal motifs from word x time documents, which not only captures the co-occurrence between words, but also the order in which they occur, and can handle interleaved activites. Investigated techniques will be focused around three main axes.
A. Motif representation, will investigate models with a hierarchical structure relying on recurring sequences of lower-level temporal motifs, and improving the robustness of the motif representation to handle the usually small amount of annotated data available in supervised action classification.
B. Action recognition in unconstrained video documents, which investigate the use of Motifs extracted from STIP BoW representations, leveraging on our modeling to identify meaningful and interleaved temporal patterns with longer temporal support than those of STIP, and addressing corresponding challenges (generative vs discriminative modeling, vocabulary size, complexity).
C. Joint temporal and spatial action learning and recognition will address the learning of action motifs while jointly infering where these motifs occur in the images in addition to when they occur as currently performed by our model, allowing to address weakly supervised action recognition tasks.
Evaluation on standard human action, movie, and sports databases from the litterature will be conducted to assess the performances of our algorithms.
|