Corpus Linguistics; Parallel Corpora; Linguistic Variation; Parallel Concordancing; Articles
Graën Johannes, Clematide Simon, Volk Martin (2016), Efficient Exploration of Translation Variants in Large Multiparallel Corpora Using a Relational Database, in Proceedings of the 4th Workshop on the Challenges in the Management of Large Corpora, Portoroz. ELRA, Paris.
Graën Johannes, Clematide Simon (2015), Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora, in Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-3), Lancaster. Online, Lancaster.
Clematide Simon, Graën Johannes, Volk Martin (2015), Multilingwis -- A Multilingual Search Tool for Multi-word Units in Multiparallel Corpora, in Proceedings of EUROPHRAS, Malaga. Online, Malaga.
Grigonyte Gintare (ed.) (2015), Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015, Linköping University Electronic Press, Linköping, Sweden.
Clematide Simon (2015), Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora, in Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015, Vilnius, Lithuania. Linköping University Electronic Press, Linköping.
Volk Martin, Graën Johannes, Callegaro Elena (2014), Innovations in Parallel Corpus Search Tools, in Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik. ELRA, Paris.
Clematide Simon, Graën Johannes, Lehner Stéphanie, Volk Martin, A Multilingual Gold Standard for Translation Spotting of German Compounds and their corresponding Multi-Word Units in English, Spanish, French and Italian, in Monti Johanna (ed.), Benjamins, Amsterdam.
Volk Martin, Clematide Simon, Graën Johannes, Ströbel Phillip, Bi-particle Adverbs, PoS-Tagging and the Recognition of German Separable Prefix Verbs, in Proceedings of KONVENS, Bochum. Online, Bochum.
Graën Johannes, Volk Martin, Batnic Dolores, Cleaning the Europarl Corpus for Linguistic Applications, in Proceedings of KONVENS, Hildesheim. Online, Hildesheim.
Documents translated into multiple languages (here: parallel documents) are valuable resources for many tasks in natural language processing and linguistic research. Parallel corpora are useful for tasks as diverse as word sense disambiguation, terminology extraction and contrastive corpus linguistics. Their usefulness for contrastive linguistics, in particular, has increased tremendously now that texts can be aligned automatically not only at the sentence level but also at the word level.

We propose to align (sentence alignment and word alignment) and annotate (PoS tagging and parsing) a large parallel corpus for the language pairs English-German, English-French and English-Spanish. For this purpose we will use two large parallel corpora: the Europarl corpus and the UN corpus. In addition, we will align the corpus for the language pairs English-Russian and English-Finnish. The latter pair enables us to include comparisons with a non-Indo-European language in the linguistic research.

Linguistic variation at times involves the choice between using an element and omitting it (articles, relativizers, pronouns etc.). Modelling such zero or null contexts in a corpus-driven approach is particularly challenging because variable zero elements are difficult to retrieve even from annotated monolingual corpora: while it is possible to extract noun phrases without an article from a parsed corpus, such algorithms have poor precision because a vast number of the retrieved instances do not allow a definite or indefinite article at all. Therefore, we propose to use parallel, annotated and word-aligned corpora, which will enable us to target constructions with variable optional elements in one of the languages. Our goal is to demonstrate the usefulness of such a large aligned and annotated corpus (LAAC) for the investigation of linguistic variation. As a case in point, we will investigate variable article use in these languages, and zero articles in English in particular. We believe that such a LAAC offers a number of advantages for gaining new insights in these areas.

Studying articles in English is of interest and importance because of the growing influence of non-native speakers of English whose first languages either have no articles or use articles in ways clearly different from English. With the help of our LAAC we will retrieve instances of zero articles for those language pairs in the corpus in which both languages have definite and indefinite articles (English, French, German and Spanish). The language pairs English-Russian and English-Finnish will serve to model variable article use in English against the background of typological differences.

In addition, we will build a rich database that models the factors influencing variable article choice in the language pair English-German. These factors include lexical variation in the head noun, the internal structure of the noun phrase (pre- and postmodification), the syntactic function of the noun phrase, and discourse-pragmatic functions (given vs. new). The aim of the linguistic part of the project is to arrive at a detailed description of variable article use in English and German using multivariate analysis. This will prove useful for language teaching and machine translation. Because of the element of lexico-grammatical variation, large parallel corpora are a prerequisite for this kind of research.

The challenge for computational linguistics lies in the high-quality alignment and annotation of large corpora and in the construction of an efficient and powerful corpus query tool that is able to handle these corpora. Efficient query tools for large monolingual corpora exist, but the development of such a tool for parallel corpora is highly innovative.
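
The retrieval idea described above, using word alignment to find English noun phrases without an article whose aligned counterpart does carry one, might be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual tooling: the `NounPhrase` structure, the tiny article lists, the function names and the toy sentences are all hypothetical, and a real pipeline would rely on PoS tags, parser output and automatically computed alignments instead of word lists and hand-written links.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Hypothetical, heavily simplified article inventories (assumption:
# a real system would use PoS tags from the annotated corpus instead).
ARTICLES: Dict[str, Set[str]] = {
    "en": {"the", "a", "an"},
    "de": {"der", "die", "das", "den", "dem", "des",
           "ein", "eine", "einen", "einem", "einer", "eines"},
}


@dataclass(frozen=True)
class NounPhrase:
    lang: str                 # language code, e.g. "en" or "de"
    tokens: Tuple[str, ...]   # surface tokens of the noun phrase
    head: int                 # sentence position of the head noun


def has_article(np: NounPhrase) -> bool:
    """True if any token of the noun phrase is a listed article."""
    return any(tok.lower() in ARTICLES[np.lang] for tok in np.tokens)


def zero_article_candidates(
    en_nps: List[NounPhrase],
    de_nps: List[NounPhrase],
    head_alignment: Set[Tuple[int, int]],
) -> List[Tuple[NounPhrase, NounPhrase]]:
    """Pair article-less English NPs with aligned German NPs that have one.

    head_alignment holds (english_head_position, german_head_position)
    word-alignment links between head nouns.
    """
    de_by_head = {np.head: np for np in de_nps}
    hits = []
    for en_np in en_nps:
        if has_article(en_np):
            continue  # only article-less English NPs are of interest
        for en_head, de_head in sorted(head_alignment):
            if en_head != en_np.head:
                continue
            de_np = de_by_head.get(de_head)
            if de_np is not None and has_article(de_np):
                hits.append((en_np, de_np))
    return hits


# Toy sentence pair 1: "Life is short ." / "Das Leben ist kurz ."
en1 = [NounPhrase("en", ("Life",), 0)]
de1 = [NounPhrase("de", ("Das", "Leben"), 1)]
hits = zero_article_candidates(en1, de1, {(0, 1)})

# Toy sentence pair 2: "He drinks water ." / "Er trinkt Wasser ."
en2 = [NounPhrase("en", ("water",), 2)]
de2 = [NounPhrase("de", ("Wasser",), 2)]
no_hits = zero_article_candidates(en2, de2, {(2, 2)})

print(len(hits), len(no_hits))
```

In the first toy pair the English noun phrase lacks an article while its German counterpart has one, so it is kept as a candidate for variable article use. In the second pair neither side has an article, so the instance is filtered out, which is the kind of precision gain over purely monolingual extraction that the text argues for.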