Project


Multi-modal speech sensing based on 2D and 3D optical and acoustic signals for identity recognition and authentication

English title Multi-modal speech sensing based on 2D and 3D optical and acoustic signals for identity recognition and authentication
Applicant Rey Julien
Number 190424
Funding scheme Spark
Research institution Applied Complex Systems, Institute of Applied Mathematics and Physics, University of Zurich
Institution of higher education Zurich University of Applied Sciences - ZHAW
Main discipline Other disciplines of Physics
Start/End 01.02.2020 - 31.03.2021
Approved amount 110'129.00

All Disciplines (2)

Discipline
Other disciplines of Physics
Microelectronics. Optoelectronics

Keywords (4)

Authentication; Identity-recognition; Speech-sensing; 3D sensing

Lay Summary (translated from French)

Lead
Human speech perception relies on both auditory and visual cues, whereas most automatic spoken-language recognition methods focus on only one of these aspects. When a familiar person speaks to us, we are generally able to recognise both the content of the message and the identity of the speaker. Automatic, computer-based speech analysis aims not only to identify the content of a message but also to identify its author.
Lay summary

This project aims to exploit both the acoustic information and the visual information carried by facial movements in order to achieve better speaker identification. To this end, a system using 2D and 3D cameras will be developed to record the speaker's facial movements. The audio and optical data will be combined to improve speaker identification. These data will also pave the way to a better understanding of the relationships between facial movements and the acoustic signal. This could have implications for various technologies that use spoken language in human-machine interfaces, such as person identification and authentication, automatic speech recognition, and virtual-speaker modelling.

Last update: 17.01.2020

Responsible applicant and co-applicants

Employees

Abstract

Human speech perception is a process that takes into account both acoustic and visual speech information, yet most automatic recognition systems rely on only one of the two modes. In this project we will combine the strong between-speaker differences in acoustic dynamics and in facial-movement dynamics to build more reliable person identification systems. Facial movements are usually extracted from 2D frontal facial images; visual 3D facial features obtained with 3D cameras improve the accuracy of facial feature extraction, especially for non-frontal facial images.

One goal of this project is to develop a sensing platform that collects both the 2D and 3D optical speech characteristics and the acoustic signals. Dynamic information, including 3D facial movements and speech rhythm, will be used to improve speaker recognition and authentication. Compared with previous identity recognition methods, which usually use either the 2D face or the voice of an individual, the scheme proposed here will be more robust, since it is based on simultaneous visual (2D and 3D) and acoustic dynamic speech sensing.

Another goal is to investigate the relationship between the acoustic dynamics of speech and 3D facial dynamics, and thus pave the way to predicting voices from faces and faces from voices. The resulting deeper understanding of the acoustic and facial dynamics of speech will have an impact on various speech technologies, including automatic speech recognition, lip modelling for speaking-face synthesis, real-time human-computer interface applications, and speech-based password authentication.
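To illustrate the kind of multimodal combination the abstract describes, the following is a minimal sketch of score-level fusion for speaker identification. It is not the project's actual method: the speaker names, embedding dimension, random feature vectors, and the equal-weight fusion rule are all illustrative stand-ins for the acoustic and 3D facial-dynamics features the platform would extract.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical enrolled templates: one acoustic embedding and one
# 3D facial-dynamics embedding per speaker (random placeholders here).
enrolled = {
    name: {"audio": rng.normal(size=64), "face3d": rng.normal(size=64)}
    for name in ["alice", "bob", "carol"]
}

def identify(audio_probe, face_probe, w_audio=0.5):
    """Score-level fusion: weighted sum of per-modality similarities.

    Each modality is scored independently against every enrolled
    template; the fused score decides the identity.
    """
    scores = {
        name: w_audio * cosine(audio_probe, t["audio"])
        + (1.0 - w_audio) * cosine(face_probe, t["face3d"])
        for name, t in enrolled.items()
    }
    return max(scores, key=scores.get), scores

# A noisy probe of "bob" in both modalities.
probe_audio = enrolled["bob"]["audio"] + 0.1 * rng.normal(size=64)
probe_face = enrolled["bob"]["face3d"] + 0.1 * rng.normal(size=64)
who, scores = identify(probe_audio, probe_face)
print(who)
```

The weight `w_audio` is where the claimed robustness would come from: if one modality degrades (e.g. a non-frontal face or a noisy microphone), the other modality's score still dominates the fused decision.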