Europarl is a large collection of transcribed proceedings of the European Parliament and their translations. We automatically analyze these texts, annotate them with speaker nationality, word class and grammatical function (for individual languages), and align the sentences and words across languages. The result is a valuable resource for automatic language processing and linguistic research. The computational linguists in the project process and store millions of words in different languages and make them accessible to complex queries. The linguists use this enriched resource to study variation in linguistic patterns. In particular, we are interested in the differences in article usage between English and other languages (e.g. German, Italian, Polish).

Lay summary

Translated documents in multiple languages are valuable for various tasks in natural language processing and linguistic research. Their usefulness for contrastive language studies has increased tremendously now that the texts can be aligned automatically on different levels, down to single words. This means that we can automatically compute which word in English has been translated as which word in German (e.g. "goal" being translated as "Tor" or "Ziel").
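As a rough illustration of what word alignment makes possible, the sketch below counts which German words an English word is linked to. It assumes alignments are stored as "i-j" index pairs (English token index, German token index), a notation commonly produced by word aligners; the two sentence pairs, their alignments and the function names are invented for illustration and are not taken from the project's data or tools.

```python
from collections import defaultdict


def parse_alignment(text):
    """Turn a string such as '0-0 1-1' into a list of (source, target) index pairs."""
    return [tuple(int(k) for k in pair.split("-")) for pair in text.split()]


def translation_counts(corpus):
    """Count, for every source word, the target words it is aligned to."""
    counts = defaultdict(lambda: defaultdict(int))
    for source, target, alignment in corpus:
        source_tokens, target_tokens = source.split(), target.split()
        for i, j in parse_alignment(alignment):
            counts[source_tokens[i].lower()][target_tokens[j]] += 1
    return counts


# Hypothetical toy data: (English sentence, German sentence, word alignment).
corpus = [
    ("our goal is clear", "unser Ziel ist klar", "0-0 1-1 2-2 3-3"),
    ("the goal was disallowed", "das Tor wurde aberkannt", "0-0 1-1 2-2 3-3"),
]

counts = translation_counts(corpus)
print(dict(counts["goal"]))  # {'Ziel': 1, 'Tor': 1}
```

On a corpus of realistic size, the same counts would show how often each translation is chosen, which is the kind of evidence a contrastive study can build on.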

Linguistic variation at times involves the choice between the use of an element and its omission. An omitted element leaves no trace in the text, however, so it cannot be searched for directly. We therefore use parallel corpora to target constructions in which an element is optional in one language but expressed in the translation. As a case in point we will investigate variable article use in the languages of the corpus, and, in particular, zero articles in English (for instance, English "She’s at university" vs. German "Sie besucht die Universität").
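The idea of locating an omitted element through its translation can be sketched as follows: a German article that is not linked to any English token marks a candidate zero article on the English side. This is a minimal sketch under stated assumptions. The tokenization and the alignment links for the example sentence are hand-made, the function name is hypothetical, and the article list is a simplification (forms such as "die" can also be pronouns, so an annotated corpus would rely on word-class tags instead).

```python
# Simplified list of German article forms (assumption: a real query would use POS tags).
GERMAN_ARTICLES = {
    "der", "die", "das", "den", "dem", "des",
    "ein", "eine", "einen", "einem", "einer", "eines",
}


def unaligned_articles(german_tokens, links):
    """Return (index, token) for German articles with no aligned English token.

    `links` is a list of (english_index, german_index) pairs.
    """
    aligned_german = {j for _, j in links}
    return [
        (j, token)
        for j, token in enumerate(german_tokens)
        if token.lower() in GERMAN_ARTICLES and j not in aligned_german
    ]


# Hand-made example: English tokens are shown only to make the links readable.
english = ["She", "'s", "at", "university"]
german = ["Sie", "besucht", "die", "Universität"]
links = [(0, 0), (1, 1), (2, 1), (3, 3)]  # "die" (index 2) has no English counterpart

hits = unaligned_articles(german, links)
print(hits)  # [(2, 'die')]
```

Each hit points back to an English sentence in which the noun aligned to the neighbouring German noun may occur without an article, so the candidates can be inspected on the English side.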

Studying articles in English is important because of the growing influence of non-native English speakers whose first languages do not have articles or use them differently. The aim of the project is a detailed description of variable article use, which will prove useful for language teaching and machine translation. We will approach this goal with corpus-driven methods, in which the processing of large amounts of text leads to new research hypotheses.

The challenge for computational linguistics lies in high-quality alignment and annotation of large corpora. We exploit translations in multiple languages to improve the annotation of the texts in the various languages and the cross-language alignments. We also work on the construction of efficient and powerful corpus query tools. Many such tools for monolingual corpora exist, but the development of query and exploration tools for large multi-parallel corpora is highly innovative.