Project

Non-randomness in Morphological Diversity: A Computational Approach Based on Multilingual Corpora

English title Non-randomness in Morphological Diversity: A Computational Approach Based on Multilingual Corpora
Applicant Samardzic Tanja
Number 176305
Funding scheme Project funding (Div. I-III)
Research institution UFSP Sprache und Raum Universität Zürich
Institution of higher education University of Zurich - ZH
Main discipline Other languages and literature
Start/End 01.09.2018 - 31.08.2022
Approved amount 621'653.00

All Disciplines (2)

Discipline
Other languages and literature
Information Technology

Keywords (14)

word frequency distribution; data science; Zipf's law; derivation; natural language processing; inflection; corpus-based typology; vocabulary diversity; morphology; multilingual corpora; language adaptation; linguistic diversity; morphological segmentation; Shannon entropy

Lay Summary (translated from German)

Lead
Investigating the spatial distribution of morphological features using massively multilingual quantitative text analysis
Lay summary

Statistical modelling of word frequency distributions in corpora across languages is a systematic and reproducible method for inferring the nature and cross-linguistic variation of the underlying linguistic structures. Corpus measures such as Shannon entropy can be used in the study of linguistic diversity as general indicators of the size of a language's morphological inventory. The rationale behind this method is as follows: more morphological categories (e.g. gender marking, case, tense, aspect) → more potential word types → lower probability of individual word types → higher Shannon entropy. Early successes in corpus-based language comparison show the relevance of word frequency distributions to the study of linguistic diversity. For example, it has been shown that such measures are related to language contact. However, two major obstacles hinder the full integration of findings from quantitative text analysis into the linguistic description of language variation and change. First, the corpus measures used so far can describe morphological diversity only at an aggregated level, without distinguishing between vocabulary diversity, inflection and derivation. Such comparisons are difficult to relate directly to traditional analysis. Second, corpus-based findings depend on the particular selection of text samples from the corpora, which calls generalising assumptions into question. The project proposed here aims at a well-founded text analysis for text-based language comparison. To this end, we will apply methods and tools (lemmatisation, segmentation and morphological analysis) that are developing rapidly in the field of application-oriented natural language processing.

Last update: 03.07.2018

Abstract

Linguistic diversity has inspired many linguistic theories from the development of the discipline in the nineteenth century to the present day. The factors involved in language diversification are complex and often defy scientific explanation. As a consequence, linguistic research has shifted its focus towards, and away from, explaining linguistic diversity several times in its history. The recent development of large digital data sets and computational methods creates an opportunity for a new take on this fascinating topic: an interdisciplinary approach combining linguistic expert knowledge with general data science.

In this context, an old view of language as an adaptive system has re-emerged as an extension of linguistic typology. This framework aims specifically to explain the non-randomness of linguistic diversity, relying on statistical modelling to test the dependence between elements of the linguistic environment (e.g. demography, location) and elements of linguistic structure (e.g. sound inventory, morphological complexity). In these studies, information about the relevant linguistic structures is commonly extracted from linguistic databases.

The project proposed here intends to extend the study of language adaptation to linguistic data extracted from multilingual corpora. Statistical modelling of word frequency distributions in corpora across languages is a systematic and reproducible means of inferring the nature and cross-linguistic variation of the underlying linguistic structures. Corpus measures such as the Shannon entropy of word forms, for instance, have been employed in the study of linguistic diversity as general indicators of the size of the morphological inventory of a language. The rationale behind this method is the following causal chain: more morphological categories (e.g. gender marking, case, tense, aspect) → more potential word types → lower probability of individual word types → higher Shannon entropy.

Such approaches to corpus-based language comparison demonstrate the relevance of word frequency distributions to the study of linguistic diversity. However, there are two important obstacles to a full integration of the evidence from quantitative text analysis into linguistic accounts of language variation and change.

First, the corpus measures used so far can describe morphological diversity only at a fully aggregated level, without discriminating between traditional categories such as vocabulary diversity, inflection, and derivation. As a consequence, such comparisons are hard to relate directly to the traditional analysis incorporated in existing linguistic descriptions. Second, corpus-based findings depend on the particular sample of texts contained in the corpora, which calls generalised claims into question.

We propose to overcome these obstacles by applying the knowledge, methods, and tools that are rapidly developing in the domain of use-inspired natural language processing. In order to take a step towards a deeper text analysis, we will discriminate between the main lexical and morphological factors behind the observed word frequency distributions: vocabulary diversity, inflection, and derivation. To this end, we will process language corpora automatically using state-of-the-art systems for lemmatisation, segmentation and morphological analysis. We will address the issue of corpus representativeness by breaking down corpus-based language representations into a set of features whose values are collected from texts of different size, genre and topic.

Similarly, we will address language adaptation at a more fine-grained level by focusing on a set of geographical features extracted using geographical information systems and by testing the relation between these features and the structural linguistic features inferred from quantitative corpus analysis. As language adaptation and historical change are closely related topics, we aim to add a temporal dimension to our study by including in the analysis data from corpora of texts created in different historical periods.

In addition to the theoretical findings, the project will provide a major methodological contribution facilitating the future use of corpus-based computational methods in scientific approaches to linguistic diversity and change.
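
The entropy rationale in the abstract, and the idea of separating inflection from vocabulary diversity through lemmatisation, can be illustrated with a minimal sketch. It is not the project's method: the token list, the `TOY_LEMMAS` mapping and the function name `shannon_entropy` are hypothetical and invented here, and a real study would use an automatic lemmatiser rather than a hand-written dictionary.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical distribution of `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Sum of p * log2(1/p) over all observed types.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical toy corpus: eight tokens, six distinct word forms.
tokens = ["walk", "walks", "walked", "walking", "dog", "dogs", "dog", "walked"]

# Hypothetical hand-written lemma map standing in for an automatic lemmatiser.
TOY_LEMMAS = {
    "walk": "walk", "walks": "walk", "walked": "walk", "walking": "walk",
    "dog": "dog", "dogs": "dog",
}
lemmas = [TOY_LEMMAS[t] for t in tokens]

form_entropy = shannon_entropy(tokens)    # inflected word forms
lemma_entropy = shannon_entropy(lemmas)   # inflection collapsed to lemmas

print(f"word-form entropy: {form_entropy:.3f} bits")
print(f"lemma entropy:     {lemma_entropy:.3f} bits")
print(f"difference:        {form_entropy - lemma_entropy:.3f} bits")
```

Because each word form maps to exactly one lemma, the lemma entropy can never exceed the word-form entropy. In this toy setting the difference gives a rough indication of how much of the word-form entropy comes from inflection rather than from vocabulary diversity.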