Parallel Concordancing; Corpus Linguistics; Articles; Linguistic Variation; Parallel Corpora
Volk Martin, Graën Johannes (2017), Multi-word Adverbs – How well are they handled in Parsing and Machine Translation?, in The 3rd Workshop on Multi-word Units in Machine Translation and Translation Technology, Europhras, London.
Graën Johannes, Bless Christof (2017), Exploring Properties of Intralingual and Interlingual Association Measures Visually, in Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa
Graën Johannes, Sandoz Dominique, Volk Martin (2017), Multilingwis2 – Explore Your Parallel Corpus, in Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa
Graën Johannes, Schneider Gerold (2017), Crossing the Border Twice: Reimporting Prepositions to Alleviate L1-Specific Transfer Errors, in 6th Workshop on NLP for Computer Assisted Language Learning, University of Gothenburg, Gothenburg.
Translated documents in multiple languages (parallel corpora) are valuable resources for various tasks in natural language processing and linguistic research. Parallel corpora are useful for tasks as diverse as word sense disambiguation, machine translation and contrastive corpus linguistics. Their usefulness for contrastive linguistics, in particular, has increased tremendously now that the texts can be aligned automatically at the word and phrase level.

We work on the automatic annotation and alignment of large parallel corpora from Europarl (the transcriptions of the debates in the European Parliament), with a focus on English and German, but moving beyond these to include French, Finnish, Italian, Polish, and Spanish. Linguistic annotations include part-of-speech tags, lemmas and syntactic dependencies. The early languages of the EU (DE, EN, ES, FR, IT) each have around 50 million words in the corpus, Finnish and Polish somewhat fewer. In total, this amounts to several hundred million entries in the database. To this, we have to add the cross-lingual alignment links, which are of the same order of magnitude.

This amount of data poses challenges for storage and efficient retrieval. We work on a powerful query language that will allow a linguist to access and view the linguistic data in a user-friendly fashion. However, the massive parallelism of the texts also offers interesting options for improving the annotation and alignments. It is thus one of the main aims of our project to investigate the advantages of multi-parallel corpora for improving the quality of word alignments.

On the linguistic side, we will use a data-driven approach to modelling variation in English article use. Previously, it has been difficult to retrieve noun phrases without an article (so-called 'bare NPs') from electronic corpora. Since German makes use of articles to a greater extent than English, retrieving NPs that have an article in German but no article or other determiner in the aligned English NPs allows us to systematically target bare NPs in English. A similar approach is possible for the language pairs Italian-English and Spanish-English, since the Romance languages, again, make greater use of articles than English. For languages without articles, such as Polish, using English as a starting point and retrieving English NPs with articles will allow us to model the strategies that these languages employ to mark syntactic categories such as 'definiteness' and 'indefiniteness'. Finally, since our parallel corpus contains both original texts and translations, we will use the materials to study the impact of typological differences (i.e. article vs. no-article language) in language contact. The focus for this will be on the language pair English-Finnish.

This is a proposal for a one-year extension of our SNF project "Large-Scale Annotation and Alignment of PARallel Corpora for the Investigation of LINGuistic Variation", which started in the fall of 2013 and will end in 2016. The PhD students in this project have made remarkable progress. Additional supervision and support from computational linguist Simon Clematide played an important role in this. We now ask for funding for a fourth year so that we can ensure the sustainability of the corpus query system and allow the PhD students to conclude their research and finish their theses.