The dynamics of indexical information in speech and its role in speech communication and speaker recognition

English title The dynamics of indexical information in speech and its role in speech communication and speaker recognition
Applicant Dellwo Volker
Number 185399
Funding scheme Project funding (Div. I-III)
Research institution Phonetisches Laboratorium Institut für Computerlinguistik Universität Zürich
Institution of higher education University of Zurich - ZH
Main discipline Other languages and literature
Start/End 01.12.2019 - 30.11.2023
Approved amount 896'518.00

Keywords (6)

speaking styles; speech processing in computers; human voice processing; automatic speaker recognition; speech processing in humans; indexical information in speech

Lay Summary (translated from German)

Voices are individual, and people use this individuality to recognize particular persons. This is essential in human social interaction for recognizing and identifying individuals in a dialogue situation. Voice recognition also has practical applications, e.g. in access systems that recognize users by their voice, or in forensics for identifying voices that have come to attention in a criminal context.
Lay summary
In the past, voice recognition has been studied almost exclusively from the recipient's side, i.e. by asking how well human listeners or machines recognize voices under particular conditions. The present project investigates whether, and if so how, speakers modify their voice to make it easier or harder to recognize. In a first step we will build a large database of approx. 500 speakers from the greater Zurich area, in which speakers talk under various situational conditions (e.g. speaking with children, with older people, or in a formal situation). This database serves as the reference. Following data collection, we will test in various experiments how speakers change their voice when they want to make themselves more recognizable and when they want to conceal their vocal identity. We assume that speakers orient themselves to the reference when adapting their voices, i.e. that they move their vocal characteristics toward the vocal average in order not to be recognized, and deviate from the average in order to be recognizable.
The results are relevant both for understanding the overall process of voice recognition in humans and for improving the recognition performance of computer systems.
Last update: 01.12.2019

Associated projects

Number | Title | Start | Funding scheme
159350 | Acoustic Characteristics of Voice in Music and Straight Theatre, and Related Aspects of Production and Perception | 01.09.2015 | Project funding (Div. I-III)
183152 | "Voice Theft": Chances and risks of digital voice technology | 01.12.2018 | Digital Lives
135287 | Speaker Identification Based on Speech Temporal Information: A Forensic Phonetic Study of Speech Rhythm and Timing in Swiss Standard German | 01.09.2011 | Project funding (Div. I-III)
165544 | Voice Identification in Infants | 01.12.2015 | International short research visits


Humans are highly skilled at recognizing speakers by their voice, an ability that is deeply rooted in the evolution of human behaviour. To date, the processes underlying voice recognition, how it is acquired, what role it plays in human communication, and why evolution has equipped humans with voice identification skills are only poorly understood. Here, we argue that such knowledge is essential for understanding human speech communication and for improving applied areas in which knowledge about cues to human individuality is crucial, such as automatic speaker recognition or forensic speaker comparison. Vocal cues to individuality (indexical information) have typically been viewed as static attributes of a speaker that are a by-product of the human articulation process. However, there is now strong evidence that this view is incorrect. The dynamic nature of indexical information has been pointed out repeatedly in diverse fields of speech technology, forensic sciences and linguistics. In our own previous work we found that situational voice alterations (speaking styles) have an asymmetrical effect on speaker recognition. For example, learning a speaker's voice from infant-directed speech (IDS) gives an advantage for recognition in adult-directed conversational speech, but not vice versa. It thus seems plausible that a speaking style like IDS evolved a specific combination of indexical cues as a technique for mothers to make themselves more recognizable to their offspring. To date it is unknown, however, how such recognition advantages are achieved. The major theoretical aim of the present project is to understand the dynamics of indexical information by examining speech from human interaction. To this end we will investigate how indexical information varies (a) within utterances and (b) across speaking styles, and (c) what effects this variability has on speaker recognition.
The results will show to what degree speakers have control over indexical information, either to reveal or to suppress their individuality in speech communication situations. To reach these aims we will sample a large homogeneous population of human voices (~500 speakers), in which individuals will be recorded varying their speaking styles in situations where individuality plays a role (e.g. charismatic, deceptive, clear, or computer-directed speech). Subsequently, these data will be analysed with computer modelling to understand how indexical information varies between speaking styles within individuals, relative to the population mean. Using machine and human speaker recognition experiments, we will test the effects that indexical variability has on the recognizability of speakers’ voices under certain speaking styles. The results will be crucial for understanding the role of indexical information in human-human and human-machine speech communication. They will affect the way we think about the processing of linguistic and speaker-specific information alike, and will have an impact on areas of psychology, behavioural biology and neuroscience that aim to understand how humans perceive and process speech, as well as on speech technology and forensic phonetic applications, where the samples being compared typically differ in speaking style. Finally, knowledge about the role of individuality information in speech will be important for understanding evolutionary processes of speech communication, a research area of increasing strength at UZH and across Switzerland. The project also continues work funded by SNF Digital Lives seed money, in which we study the impact of voice manipulation on human and computer speaker recognition (grant: 183152).
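
The idea of describing a speaker's style-dependent variability relative to the population mean can be illustrated with a minimal sketch. It is not the project's actual modelling approach, which the record does not specify. The two features (mean f0 and speech rate), all numeric values, and the function names `population_stats` and `distance_from_mean` are assumptions made purely for illustration. The sketch measures how far a voice sample lies from the reference population's average, in units of the population's standard deviation per feature.

```python
import math
from statistics import mean, pstdev


def population_stats(reference):
    """Return per-feature means and standard deviations of the reference speakers."""
    columns = list(zip(*reference))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]
    return means, stdevs


def distance_from_mean(features, means, stdevs):
    """Euclidean distance from the population mean, in z-score units.

    Assumes every feature has non-zero spread in the reference population.
    """
    return math.sqrt(
        sum(((x - m) / s) ** 2 for x, m, s in zip(features, means, stdevs))
    )


# Hypothetical reference population: [mean f0 in Hz, speech rate in syllables/s].
# These values are invented for illustration only.
reference = [
    [110.0, 4.8],
    [125.0, 5.2],
    [140.0, 5.0],
    [205.0, 5.5],
    [220.0, 4.5],
]
means, stdevs = population_stats(reference)

# One hypothetical speaker recorded in three conditions.
neutral = [205.0, 5.5]
disguised = [165.0, 5.1]   # features moved toward the population average
emphasised = [240.0, 6.0]  # features moved away from the population average

d_neutral = distance_from_mean(neutral, means, stdevs)
d_disguised = distance_from_mean(disguised, means, stdevs)
d_emphasised = distance_from_mean(emphasised, means, stdevs)

print(f"neutral:    {d_neutral:.2f}")
print(f"disguised:  {d_disguised:.2f}")
print(f"emphasised: {d_emphasised:.2f}")
```

Under the hypothesis stated in the lay summary, a speaker trying to conceal their identity would show a smaller distance than in their neutral style, and a speaker trying to be recognized would show a larger one. Real analyses would use many more features and would need to account for correlations between them, which this sketch ignores.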