Project


Regional Linguistic Data Initiative

English title Regional Linguistic Data Initiative
Applicant Samardzic Tanja
Number 160501
Funding scheme SCOPES
Research institution Seminar für Allgemeine Sprachwissenschaft Universität Zürich
Institution of higher education University of Zurich - ZH
Main discipline Other languages and literature
Start/End 01.05.2015 - 30.06.2017
Approved amount 178'586.00

All Disciplines (2)

Discipline
Other languages and literature
Information Technology

Keywords (5)

automatic language processing; Croatian, Serbian; language corpora; distance learning; empirical methods

Lay Summary (translated from German)

Lead
The institutional partnership between Switzerland (partnership coordinator), Serbia (Partner A) and Croatia (Partner B), made possible by the SCOPES project, aims to establish a regional initiative for improving resources and research methods in the field of empirical language research. The close ties between Croatia and Serbia are to be put to use in the process. The collaboration centres on (1) language data and the associated tools and resources, and (2) research design and methods in corpus-based language research; statistical methods are applied in both areas. The interlinked strands of the planned activities comprise (1) the creation of a data repository and (2) pedagogical activities.
Lay summary

The proposed partnership is relevant to various areas of language research. These range from theoretical linguistics to applied linguistics (language acquisition, translation, lexicology), the digital humanities in general and natural language processing in particular.

The collaboration between Switzerland, Serbia and Croatia in this project has significant potential for long-term impact: the central data repository will allow the project partners to sustain and coordinate the further development of resources. Tools and applications for Croatian and Serbian will be developed and made available, exploiting the similarity of the two languages. Until now, this development has been hampered by the difficult political relations between Serbia and Croatia. In addition, the pedagogical activities will close an important gap in the training of researchers: they will not only enable researchers to try out and apply new methods and resources in their linguistic research, but also equip them to train other scholars and their students accordingly. The project will help make the handling of language data in linguistic research in this region more effective and more economical in the future. Networking opportunities between university members and industry are also expected to emerge, which may in turn bring about societal effects in the partner countries.

Last update: 21.04.2015

Responsible applicant and co-applicants

Employees

Publications

Publication
Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text
Ljubešić Nikola, Erjavec Tomaž, Fišer Darja (2017), Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text, in Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing.
Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages
Samardžić Tanja, Starović Mirjana, Agić Željko, Ljubešić Nikola (2017), Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages, in Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing.
A Framework for automatic acquisition of Croatian and Serbian verb aspect from corpora
Samardžić Tanja, Miličević Maja (2016), A Framework for automatic acquisition of Croatian and Serbian verb aspect from corpora, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).
Corpus vs. lexicon supervision in morphosyntactic tagging: the case of Slovene
Ljubešić Nikola, Erjavec Tomaž (2016), Corpus vs. lexicon supervision in morphosyntactic tagging: the case of Slovene, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).
Corpus-based diacritic restoration for South Slavic languages
Ljubešić Nikola, Erjavec Tomaž, Fišer Darja (2016), Corpus-based diacritic restoration for South Slavic languages, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), Paris, France.
Easily accessible language technologies for Slovene, Croatian and Serbian
Ljubešić Nikola, Erjavec Tomaž, Fišer Darja, Samardžić Tanja, Miličević Maja, Klubička Filip, Petkovski Filip (2016), Easily accessible language technologies for Slovene, Croatian and Serbian, in Proceedings of the Conference on Language Technologies & Digital Humanities.
New inflectional lexicons and training corpora for improved morphosyntactic annotation of Croatian and Serbian
Ljubešić Nikola, Klubička Filip, Agić Željko, Jazbec Ivo-Pavao (2016), New inflectional lexicons and training corpora for improved morphosyntactic annotation of Croatian and Serbian, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), Paris, France.
Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets
Miličević Maja, Ljubešić Nikola (2016), Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets, in Slovenščina 2.0, 4(2), 156-188.
Using machine learning for language and structure annotation in an 18th century dictionary
Bago Petra, Ljubesic Nikola (2015), Using machine learning for language and structure annotation in an 18th century dictionary, in Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of, Trojina, Institute for Applied Slovene Studies/Lexical Computing Ltd., Ljubljana/Brighton.
Comparing the nonstandard language of Slovene, Croatian and Serbian tweets
Fišer Darja, Erjavec Tomaž, Ljubešić Nikola, Miličević Maja (2015), Comparing the nonstandard language of Slovene, Croatian and Serbian tweets, in Simpozij Obdobja 34. Slovnica in slovar - aktualni jezikovni opis (1. del), Filozofska fakulteta, Ljubljana.
Regional Linguistic Data Initiative
Samardzic Tanja, Ljubesic Nikola, Milicevic Maja (2015), Regional Linguistic Data Initiative, in Proceedings of the 5th Workshop on Balto-Slavic Natural Language Processing.

Collaboration

Group / person Country
Types of collaboration
University of Ljubljana, JANES project Slovenia (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
Jožef Stefan Institute, Ljubljana, CLARIN-SI Slovenia (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
- Research Infrastructure
Belgrade Center for Digital Humanities Serbia (Europe)
- in-depth/constructive exchanges on approaches, methods or results
Department of General Linguistics, Faculty of Philology, University of Belgrade, Maja Milicevic Serbia (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Research Infrastructure
- Exchange of personnel
Abu-MaTran project Ireland (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
Department of Information and Communication Sciences, University of Zagreb, Nikola Ljubesic Croatia (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Research Infrastructure
- Exchange of personnel
University of Copenhagen, LOWLANDS Denmark (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication

Scientific events

Active participation

Title | Type of contribution | Title of article or contribution | Date | Place | Persons involved
Bilingualism vs. monolingualism: a new perspective on limitations to L2 acquisition | Talk given at a conference | Can translation lead to L1 attrition? Anaphora resolution by English>Croatian translators | 19.06.2017 | Toulouse, France | Milicevic Maja
Workshop at the Zadar Linguistic Forum | Talk given at a conference | Using language technologies for South Slavic languages for linguistic research | 08.06.2017 | Zadar, Croatia | Ljubešic Nikola
NoDaLiDa Workshop on Universal Dependencies (UDW2017) | Poster | Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages | 22.05.2017 | Gothenburg, Sweden | Ljubešic Nikola; Samardzic Tanja
Spatial Boundaries and Transitions in Language and Interaction | Talk given at a conference | Establishing borders between states vs. languages: Twitter data to the rescue | 24.04.2017 | Monte Verità, Switzerland | Samardzic Tanja; Milicevic Maja; Ljubešic Nikola
Workshop for the FRAMNAT project | Talk given at a conference | Language technologies for South Slavic languages | 08.04.2017 | Rijeka, Croatia | Ljubešic Nikola
Die Bibliothek vernetzt. Infrastrukturen für Forschungsdaten in den Geisteswissenschaften | Talk given at a conference | Infrastrukturen und Services für linguistische Projekte | 09.02.2017 | Zurich, Switzerland | Samardzic Tanja
Language Technologies & Digital Humanities 2016 | Talk given at a conference | Analysing spatial distribution of linguistic variables in geoencoded tweets from Croatia, Bosnia, Montenegro and Serbia | 29.09.2016 | Ljubljana, Slovenia | Milicevic Maja; Samardzic Tanja; Ljubešic Nikola
4th Novi Sad workshop on Psycholinguistic, Neurolinguistic and Clinical Linguistic Research | Talk given at a conference | ReLDI (Regional Linguistic Data Initiative): Project presentation (SNSF SCOPES Project 160501) | 16.04.2016 | Novi Sad, Serbia | Milicevic Maja
Variation in space and time: clausal complementation in South Slavic | Talk given at a conference | ReLDI resources for Croatian and Serbian | 17.03.2016 | Zurich, Switzerland | Samardzic Tanja
Janes Ekspres | Talk given at a conference | Annotating non-standard elements in Serbian tweets - linguistic guidelines | 10.12.2015 | Belgrade, Serbia | Milicevic Maja
Janes Ekspres | Talk given at a conference | Annotating non-standard elements in Croatian tweets - linguistic guidelines | 04.12.2015 | Zagreb, Croatia | Ljubešic Nikola
Slovene on the Internet and in New Media | Talk given at a conference | Beyond example extraction: Quantitative analysis of the JANES corpus | 25.11.2015 | Ljubljana, Slovenia | Milicevic Maja
5th Workshop on Balto-Slavic Natural Language Processing | Talk given at a conference | Regional Linguistic Data Initiative (ReLDI) | 10.09.2015 | Hissar, Bulgaria | Samardzic Tanja; Ljubešic Nikola
eLex conference 2015 (Electronic lexicography in the 21st century: Linking lexical data in the digital age) | Talk given at a conference | Using machine learning for language and structure annotation in an 18th century dictionary | 11.08.2015 | Sussex, Great Britain and Northern Ireland | Ljubešic Nikola


Knowledge transfer events

Active participation

Title | Type of contribution | Title of article or contribution | Date | Place | Persons involved
angesprochen. Der Linguistik-Podcast, ZüKL | Performances, exhibitions (e.g. for education institutions) | | 29.10.2016 | Zurich, Switzerland |
Data Science Monetization | Talk | | 13.04.2016 | Zagreb, Croatia |


Abstract

An important global trend in the study of human language in the past decades has been a growing reliance on empirical data. This is due to the rapid development and wider accessibility of large collections of machine-readable text and tools needed for its processing, as well as to a rise in standards of experimental research on language, often criticised in the past for lack of solid experimental design. With the advent of larger amounts of data, statistics has also become integral for linguists, and mastering the methodology of language research has turned into an increasingly complex task. At the same time, more attention has started being paid to the issues of replicability and reproducibility of language research and exciting steps have been made towards establishing common standards for sharing data and resources, as evidenced by the rising number of initiatives focused on creating dedicated repositories.

Due to the difficulties brought by the transition period following the breakup of former Yugoslavia, Serbia and Croatia lag behind most other European countries in the implementation of these trends. The countries’ isolation during the 1990s, insufficient institutional support and the continuing lack of funds needed for joining some of the wider European initiatives have largely left language researchers to their own devices, leading to a situation in which a number of individuals and groups do conduct internationally visible research, but no impetus is given to the wider linguistic community. The developed resources and experimental instruments are often unavailable to external users, and the training of language specialists remains largely traditional, preventing them from using even those resources and instruments that are available to them. In addition, the split of the once shared Serbo-Croatian into two distinct official languages has led to a separate development of resources, with very little transfer and with many missed joint effort opportunities.

With the above in mind, the main objective of the proposed institutional partnership between Switzerland, Croatia and Serbia is to establish a regional initiative that will improve resources and research methodology in the domain of language studies based on empirical data, focusing on Croatian and Serbian and fully exploiting their close relatedness. The two core areas of interest are (1) language corpora and related tools and resources, and (2) experimental design in language research and related instruments; statistical analysis will be covered in both areas. Two (intertwined) strands of planned activity are (1) the creation of a resource repository, and (2) pedagogical activities. The resources will be integrated within a web platform comprising portions dedicated to corpus-based and experimental research, as well as an e-learning environment for web courses and tutorials dedicated to resource use. The pedagogical activities will also include traditional live seminars, serving as both training and networking events.

The proposed partnership is relevant for different sub-fields of language study, ranging from theoretical linguistics to the applied areas of language acquisition, translation and lexicology, as well as for information retrieval and digital humanities. It has significant potential for long-term impact, as the resource repository will allow the project partners to sustain and synchronise the development of language resources and instruments for Croatian and Serbian, making future resource management more effective and more economical, while pedagogical activities will fill an important lacuna in the researchers’ training, enabling them not only to use new methods in their research, but also to train others. Lastly, networking opportunities involving both members of academia and industry are expected to bring about a wider societal effect.