
Second-Order Layers in Deep Networks for Visual Recognition

Applicant Salzmann Mathieu
Number 165648
Funding scheme Project funding (Div. I-III)
Research institution Laboratoire de vision par ordinateur EPFL - IC - ISIM - CVLAB
Institution of higher education EPF Lausanne - EPFL
Main discipline Information Technology
Start/End 01.09.2017 - 30.04.2021
Approved amount 185'918.00

Keywords (6)

Deep learning; Visual recognition; Riemannian geometry; Convolutional neural networks; Computer vision; Region covariance descriptors

Lay Summary (translated from French)

In Computer Vision, enormous progress has recently been made thanks to artificial neural networks. These networks consist of a series of successive layers, typically performing linear operations, such as convolutions, alternated with pooling operations and non-linear operations. The parameters of these layers are learned automatically during a training phase. For image recognition, these networks extract a vector representation of the input image, and this representation is passed to a classifier that returns the class of the object in the image. Before the recent popularity of artificial neural networks, matrix representations corresponding to second-order statistics of the pixels, such as covariance matrices, had been shown to improve recognition rates over vector representations.

The goal of this project is to develop algorithms for computing covariance matrices inside artificial neural networks, so as to improve their image recognition rates. These covariance matrices will be an integral part of the network, in the sense that they will be optimized during training. In particular, we will study several configurations, such as non-parametric and parametric covariances, and covariances that are global for the whole image or local. This project will therefore result in a new type of artificial neural network architecture for image recognition.


Last update: 24.01.2017

Responsible applicant and co-applicants



Associated projects

Number Title Start Funding scheme
175581 Deep Structured Representation Learning for Visual Recognition 01.09.2018 Project funding (Div. I-III)


Abstract

Automatically recognizing objects and people in images could have an impact in a wide variety of domains, such as security and autonomous navigation. Unsurprisingly, visual recognition has therefore been one of the fundamental goals of Computer Vision since its beginnings. Over the years, great progress has been made, attributable roughly equally to advances in Machine Learning and to the development of new image features. With the advent of Big Data and the renewed popularity of Deep Learning, this progress has recently accelerated significantly.

In particular, Convolutional Neural Networks (CNNs), which jointly learn features and classifiers, have played a crucial role in this acceleration. CNNs are multi-layer architectures in which the output of the previous layer is convolved with a series of filters learned from data. In essence, this convolution computes a weighted sum of the values output by the previous layer, either locally for convolutional layers or globally for fully-connected ones. By performing such a linear operation, each filter can be thought of as computing first-order statistics of the previous layer's output.

Psychophysics experiments have shown, however, that to discriminate different textures, humans strongly rely on second-order statistics, such as covariance, which depend quadratically on the input measurements. This observation was exploited in Computer Vision by Region Covariance Descriptors (RCDs), which represent an image region by a covariance matrix computed from local image features. In previous work, including our own, these RCDs were shown to outperform standard, first-order features in many visual recognition tasks. However, these second-order descriptors have always been extracted from handcrafted features, and thus currently cannot deliver the performance achieved by deep networks.

Our goal is therefore to introduce second-order statistics into Deep Learning architectures.
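As a rough illustration (not code from the proposal), a region covariance descriptor as described above is simply the sample covariance of the local feature vectors gathered over an image region, e.g. CNN activations at each spatial location. A minimal NumPy sketch, with a small diagonal term added to keep the matrix strictly positive definite:

```python
import numpy as np

def region_covariance(features, eps=1e-6):
    """Region Covariance Descriptor (RCD) of a set of local features.

    features: (N, d) array of d-dimensional features sampled at N
    locations in an image region (e.g. CNN activations).
    Returns the (d, d) sample covariance, regularized so that it is
    strictly symmetric positive definite (SPD).
    """
    mu = features.mean(axis=0)
    centered = features - mu
    cov = centered.T @ centered / (features.shape[0] - 1)
    return cov + eps * np.eye(features.shape[1])

# Hypothetical example: 64 spatial locations, 8 feature channels.
rng = np.random.default_rng(0)
f = rng.standard_normal((64, 8))
C = region_covariance(f)
```

Note that the descriptor's size depends only on the feature dimension d, not on the region size, which is one reason RCDs are convenient image-region representations.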
In particular, we intend to focus on CNNs, whose inner layers provide local representations from which covariance matrices can be computed. To this end, we will investigate two different scenarios:

- One covariance descriptor as the final image representation. The lower layers of the CNN will remain unchanged, in that they will still compute first-order statistics, but the final layer will encode a single covariance matrix, which will be fed to a classifier.
- Local covariance descriptors in the inner network layers. Second-order statistics will be computed in the lower layers of the network. These second-order layers will then act as input either to first-order layers or to other second-order ones.

To interface our second-order layers with either the classifier or first-order layers, we will account for the fact that covariance matrices, which are symmetric positive definite (SPD), form a Riemannian manifold and thus cannot be studied under Euclidean geometry. This has proven crucial in previous work on RCDs built from handcrafted features, including our own. We will therefore leverage our expertise with the SPD manifold to develop a generic and effective framework for incorporating second-order statistics in deep networks. In this framework, the parameters of all layers and of the final classifier will be learned jointly, combining the benefits of deep architectures and of second-order statistics, and thus yielding even more powerful models.

We are confident that our framework will further improve visual recognition performance, to the point of making it practical for real-world applications. Furthermore, we expect that our approach will have repercussions beyond the Computer Vision community, and will motivate others to go beyond standard, first-order Deep Learning architectures.
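One standard way to interface an SPD matrix with Euclidean machinery — offered here as a generic sketch, not as the project's actual method — is the log-Euclidean map: apply the matrix logarithm, which flattens the SPD manifold into the vector space of symmetric matrices, then vectorize the result for a conventional classifier or first-order layer:

```python
import numpy as np

def spd_log_map(C):
    """Matrix logarithm of an SPD matrix via eigendecomposition:
    for C = U diag(s) U^T, log(C) = U diag(log s) U^T.
    The result is a symmetric matrix living in a flat (Euclidean)
    space, suitable as input to a standard classifier."""
    s, U = np.linalg.eigh(C)          # s > 0 since C is SPD
    return (U * np.log(s)) @ U.T      # scales column j of U by log(s_j)

def vectorize_sym(M):
    """Half-vectorization: keep the upper triangle of a symmetric matrix."""
    return M[np.triu_indices(M.shape[0])]

# Hypothetical example: build an SPD matrix and map it to a vector.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
C = A @ A.T + 1e-3 * np.eye(5)        # SPD by construction
v = vectorize_sym(spd_log_map(C))     # Euclidean feature vector
```

The log map is a simple choice; Riemannian approaches to SPD data, such as those the proposal alludes to, also use other metrics (e.g. affine-invariant) whose operations are more involved.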