automatic language processing; Croatian, Serbian; language corpora; distance learning; empirical methods
Ljubešić Nikola, Miličević Petrović Maja, Samardžić Tanja (2019), Jezična akomodacija na Twitteru: Primjer Srbije, in
Slavistična revija, 67(1), 87-106.
Ljubešić Nikola, Miličević Petrović Maja, Samardžić Tanja (2018), Borders and boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter data to the rescue, in
Journal of Linguistic Geography, 6(2), 100-124.
Batanović Vuk (2018), SETimes.SR – A reference training corpus of Serbian, in
Conference on Language Technologies & Digital Humanities, Slovenian Language Technologies Society, Ljubljana.
Ljubešić Nikola, Erjavec Tomaž, Fišer Darja (2017), Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text, in
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, Association for Computational Linguistics, Valencia, Spain.
Samardžić Tanja, Starović Mirjana, Agić Željko, Ljubešić Nikola (2017), Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages, in
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, Association for Computational Linguistics, Valencia, Spain.
Samardžić Tanja, Miličević Maja (2016), A Framework for automatic acquisition of Croatian and Serbian verb aspect from corpora, in
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), Paris, France.
Ljubešić Nikola, Erjavec Tomaž (2016), Corpus vs. lexicon supervision in morphosyntactic tagging: the case of Slovene, in
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), Paris, France.
Ljubešić Nikola, Erjavec Tomaž, Fišer Darja (2016), Corpus-based diacritic restoration for South Slavic languages, in
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), Paris, France.
Ljubešić Nikola, Erjavec Tomaž, Fišer Darja, Samardžić Tanja, Miličević Maja, Klubička Filip, Petkovski Filip (2016), Easily accessible language technologies for Slovene, Croatian and Serbian, in
Proceedings of the Conference on Language Technologies & Digital Humanities, Ljubljana University Press, Ljubljana.
Ljubešić Nikola, Klubička Filip, Agić Željko, Jazbec Ivo-Pavao (2016), New inflectional lexicons and training corpora for improved morphosyntactic annotation of Croatian and Serbian, in
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), Paris, France.
Miličević Maja, Ljubešić Nikola (2016), Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets, in
Slovenščina 2.0, 4(2), 156-188.
Bago Petra, Ljubešić Nikola (2015), Using machine learning for language and structure annotation in an 18th century dictionary, in
Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of, Trojina, Institute for Applied Slovene Studies/Lexical Computing Ltd., Ljubljana/Brighton.
Fišer Darja, Erjavec Tomaž, Ljubešić Nikola, Miličević Maja (2015), Comparing the nonstandard language of Slovene, Croatian and Serbian tweets, in
Simpozij Obdobja 34. Slovnica in slovar - aktualni jezikovni opis (1. del), Filozofska fakulteta, Ljubljana.
Samardžić Tanja, Ljubešić Nikola, Miličević Maja (2015), Regional Linguistic Data Initiative, in
Proceedings of the 5th Workshop on Balto-Slavic Natural Language Processing, Association for Computational Linguistics, Hissar, Bulgaria.
An important global trend in the study of human language in the past decades has been a growing reliance on empirical data. This is due to the rapid development and wider accessibility of large collections of machine-readable text and the tools needed for its processing, as well as to a rise in standards of experimental research on language, often criticised in the past for lack of solid experimental design. With the advent of larger amounts of data, statistics has also become integral to linguists’ work, and mastering the methodology of language research has turned into an increasingly complex task. At the same time, more attention is being paid to the replicability and reproducibility of language research, and exciting steps have been taken towards establishing common standards for sharing data and resources, as evidenced by the rising number of initiatives focused on creating dedicated repositories.
Due to the difficulties brought by the transition period following the breakup of the former Yugoslavia, Serbia and Croatia lag behind most other European countries in the implementation of these trends. The countries’ isolation during the 1990s, insufficient institutional support and the continuing lack of funds needed for joining some of the wider European initiatives have largely left language researchers to their own devices. As a result, a number of individuals and groups do conduct internationally visible research, but no impetus is given to the wider linguistic community. The developed resources and experimental instruments are often unavailable to external users, and the training of language specialists remains largely traditional, preventing them from using even those resources and instruments that are available to them. In addition, the split of the once shared Serbo-Croatian into two distinct official languages has led to a separate development of resources, with very little transfer and many missed opportunities for joint effort.
With the above in mind, the main objective of the proposed institutional partnership between Switzerland, Croatia and Serbia is to establish a regional initiative that will improve resources and research methodology in the domain of language studies based on empirical data, focusing on Croatian and Serbian and fully exploiting their close relatedness. The two core areas of interest are (1) language corpora and related tools and resources, and (2) experimental design in language research and related instruments; statistical analysis will be covered in both areas. The two (intertwined) strands of planned activity are (1) the creation of a resource repository, and (2) pedagogical activities. The resources will be integrated within a web platform comprising portions dedicated to corpus-based and experimental research, as well as an e-learning environment for web courses and tutorials dedicated to resource use. The pedagogical activities will also include traditional live seminars, serving as both training and networking events.
The proposed partnership is relevant for different sub-fields of language study, ranging from theoretical linguistics to the applied areas of language acquisition, translation and lexicology, as well as for information retrieval and digital humanities. It has significant potential for long-term impact: the resource repository will allow the project partners to sustain and synchronise the development of language resources and instruments for Croatian and Serbian, making future resource management more effective and more economical, while the pedagogical activities will fill an important lacuna in the researchers’ training, enabling them not only to use new methods in their research, but also to train others. Lastly, networking opportunities involving members of both academia and industry are expected to bring about a wider societal effect.