Project

Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation

English title: Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation
Applicant: Hundt Marianne
Number: 146781
Funding scheme: Project funding (Div. I-III)
Research institution: Englisches Seminar Universität Zürich
Institution of higher education: University of Zurich - ZH
Main discipline: German and English languages and literature
Start/End: 01.09.2013 - 30.09.2016
Approved amount: 366'487.00

All Disciplines (3)

Discipline
German and English languages and literature
Information Technology
Other languages and literature

Keywords (5)

Corpus Linguistics; Parallel Corpora; Linguistic Variation; Parallel Concordancing; Articles

Lay Summary (German)

Lead
The aim of the project is to annotate and align (at the sentence and word level) a large parallel text corpus. Such corpora are an important resource for studying linguistic variation across different language pairs. As a case study, the project focuses on variable article use. It exploits the fact that an article may be used in one language but not in the translation, in order to retrieve zero articles in particular.
Lay summary

Translated documents in multilingual contexts are a valuable resource both for natural language processing and for linguistic studies. Their usefulness for contrastive linguistic questions is increased considerably by automatic alignment at the sentence and word level. The aim of the project is therefore to align and automatically annotate (part of speech and sentence structure) a large multilingual corpus.

Even in related languages such as German and English, an element in one language may correspond to a zero element in the other. Such zero elements can hardly be retrieved in studies of individual languages. The aim of the project is to investigate variable article use (including zero contexts) in the corpus, with a particular focus on zero articles in English.

Variable article use in English is relevant because English is increasingly acquired as a second language, including by speakers whose native language has no articles or uses articles differently. The aim of the project is a detailed description of article use, and thus important basic research for language teaching and machine translation.

The computational linguistic contribution of the project lies in the automatic alignment and annotation and in the development of a suitable query system. While efficient query software already exists for monolingual corpora, developing efficient query tools for large word-aligned parallel corpora is a challenge.


Last update: 12.07.2013

Lay Summary (English)

Lead
We aim to develop large-Scale PARallel Corpora to study LINGuistic variation (SPARCLING). These can be used to study variation in linguistic patterns across different languages. The focus in the linguistic part of the project is on variable article use. We use contexts where one language in a language pair uses an article to retrieve instances where the other language does not require an article. Such zero contexts are notoriously difficult to access in monolingual corpora.
Lay summary

Translated documents in multiple languages are a valuable resource for various tasks in natural language processing and linguistic research. Their usefulness for contrastive linguistics, in particular, has increased tremendously with the possibility of automatically aligning the texts at both the sentence and the word level. We will align and annotate (PoS tagging and parsing) a large parallel corpus for several language pairs.

Linguistic variation at times involves the choice between the use of an element and its omission. Zero elements are difficult to retrieve, however. We use parallel corpora to target constructions with variable optional elements in one of the languages. As a case in point, we will investigate variable article use in these languages and, in particular, zero articles in English.

Studying articles in English is of interest and importance because of the growing influence of non-native English speakers whose first languages do not have articles or use them differently. The aim of the project is to arrive at a detailed description of variable article use. This will prove useful for purposes of language teaching and machine translation.

The challenge for computational linguistics lies in the high-quality alignment and annotation of large corpora and in the construction of an efficient and powerful corpus query tool that is able to handle these corpora. Efficient query tools for large monolingual corpora exist, but the development of such a tool for parallel corpora is highly innovative.


Last update: 12.07.2013

Publications

Graën Johannes, Clematide Simon, Volk Martin (2016), Efficient Exploration of Translation Variants in Large Multiparallel Corpora Using a Relational Database, in Proceedings of the 4th Workshop on the Challenges in the Management of Large Corpora, Portoroz, ELRA, Paris.
Graën Johannes, Clematide Simon (2015), Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora, in Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-3), Lancaster, Online, Lancaster.
Clematide Simon, Graën Johannes, Volk Martin (2015), Multilingwis -- A Multilingual Search Tool for Multi-word Units in Multiparallel Corpora, in Proceedings of EUROPHRAS, Malaga, Online, Malaga.
Grigonyte Gintare (ed.) (2015), Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015, Linköping University Electronic Press, Linköping, Sweden.
Clematide Simon (2015), Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora, in Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015, Vilnius, Lithuania, Linköping University Electronic Press, Linköping.
Volk Martin, Graën Johannes, Callegaro Elena (2014), Innovations in Parallel Corpus Search Tools, in Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, ELRA, Paris.
Clematide Simon, Graën Johannes, Lehner Stéphanie, Volk Martin, A Multilingual Gold Standard for Translation Spotting of German Compounds and their corresponding Multi-Word Units in English, Spanish, French and Italian, in Monti Johanna (ed.), Benjamins, Amsterdam.
Volk Martin, Clematide Simon, Graën Johannes, Ströbel Phillip, Bi-particle Adverbs, PoS-Tagging and the Recognition of German Separable Prefix Verbs, in Proceedings of KONVENS, Bochum, Online, Bochum.
Graën Johannes, Volk Martin, Batnic Dolores, Cleaning the Europarl Corpus for Linguistic Applications, in Proceedings of KONVENS, Hildesheim, Online, Hildesheim.

Collaboration

Group / person | Country | Types of collaboration
Institut für Linguistik der Universität Stockholm | Sweden (Europe) | in-depth/constructive exchanges on approaches, methods or results; Publication
Center for Computational Linguistics in Kaunas | Lithuania (Europe) | in-depth/constructive exchanges on approaches, methods or results; Publication
Ruta Marcinkeviciene, Vytautas Magnus Universität, Kaunas | Lithuania (Europe) | in-depth/constructive exchanges on approaches, methods or results

Scientific events

Active participation

Title | Type of contribution | Title of article or contribution | Date | Place | Persons involved
Critical Link 8 - Future-proofing interpreting and translating | Talk given at a conference | Keynote talk on Machine Translation for Media Accessibility | 01.07.2016 | Edinburgh, Great Britain and Northern Ireland | Volk Martin
Parseme EU COST Action Meeting | Poster | Searching Multi-words Simultaneously in Multiparallel Corpora | 08.04.2016 | Struga, Macedonia | Volk Martin
Jahrestagung des Instituts für Deutsche Sprache | Talk given at a conference | Multilingwis - Ein Sprachspiegel zur Untersuchung sprachübergreifender Variationen | 09.03.2016 | Mannheim, Germany | Volk Martin


Associated projects

Number | Title | Start | Funding scheme
165819 | SPARCLING: Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation | 01.09.2016 | Project funding (Div. I-III)

Abstract

Translated documents in multiple languages (here: parallel documents) are regarded as valuable resources for various tasks in natural language processing and linguistic research. Parallel corpora are useful for tasks as diverse as word sense disambiguation, terminology extraction and contrastive corpus linguistics. The usefulness of these resources for contrastive linguistics, in particular, has increased tremendously with the possibility of automatically aligning the texts not only at the sentence level but also at the word level.

We propose to align (sentence alignment and word alignment) and annotate (PoS tagging and parsing) a large parallel corpus for the language pairs English-German, English-French, and English-Spanish. For this purpose we will use two large parallel corpora: the Europarl corpus and the UN corpus. In addition, we will align the corpus for the language pairs English-Russian and English-Finnish. The latter language pair enables us to include comparisons with a non-Indo-European language in the linguistic research.

Linguistic variation at times involves the choice between the use of an element and its omission (articles, relativizers, pronouns etc.). Modelling such zero or null contexts in a corpus-driven approach is particularly challenging because variable zero elements are difficult to retrieve even from annotated monolingual corpora: while it is possible to extract noun phrases without an article from a parsed corpus, such algorithms have poor precision because a vast number of instances will not allow for a definite or indefinite article. Therefore, we propose to use parallel annotated and word-aligned corpora that will enable us to target constructions with variable optional elements in one of the languages. Our goal is to demonstrate the usefulness of such a large aligned and annotated corpus (LAAC) for the investigation of linguistic variation. As a case in point, we will investigate variable article use in these languages and, in particular, zero articles in English. We believe that such a LAAC provides a number of advantages for new insights in these areas.

Studying articles in English is of interest and importance because of the growing influence of non-native English speakers whose first languages do not have articles or use articles in ways clearly different from English. With the help of our LAAC we will retrieve instances of zero articles for those language pairs in the corpus in which both languages have definite and indefinite articles (English, French, German and Spanish). The language pairs English-Russian and English-Finnish will serve to model variable article use in English against the background of typological differences.

In addition, we will build a rich database that models factors influencing variable article choice in the language pair English-German. These factors include lexical variation in the head noun, the internal structure of the noun phrase (pre- and postmodification), and the syntactic function of the noun phrase, but also discourse-pragmatic functions (given vs. new). The aim for the linguistics part of the project is to arrive at a detailed description of variable article use in English and German using a multivariate analysis. This will prove useful for purposes of language teaching and machine translation. Because of the element of lexico-grammatical variation, large parallel corpora are a prerequisite for this kind of research.

The challenge for computational linguistics lies in the high-quality alignment and annotation of large corpora and in the construction of an efficient and powerful corpus query tool that is able to handle these corpora. Efficient query tools for large monolingual corpora exist, but the development of such a tool for parallel corpora is highly innovative.
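
The retrieval idea described in the abstract, using an overt article in one language to locate a zero article in its translation, can be illustrated with a minimal Python sketch. This is not the project's tooling: the `Token` and `AlignedPair` structures, the `zero_article_candidates` function, the coarse `DET`/`NOUN` tags, and the "next noun to the right" head heuristic are all assumptions made for illustration, and the two sentence pairs are invented examples rather than corpus data.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    pos: str  # hypothetical coarse tag set; only "DET" and "NOUN" matter here


@dataclass
class AlignedPair:
    source: list[Token]          # e.g. the German sentence
    target: list[Token]          # e.g. the English sentence
    links: set[tuple[int, int]]  # word alignment as (source index, target index)


def zero_article_candidates(pair: AlignedPair) -> list[tuple[str, str, str]]:
    """Return (source article, source noun, target noun) triples where the
    source article has no aligned target determiner but its noun is aligned
    to a target noun. Such hits are candidates for a zero article in the
    target language and would still need manual or parser-based checking."""
    hits = []
    for i, tok in enumerate(pair.source):
        if tok.pos != "DET":
            continue
        # Skip articles that are aligned to an overt determiner in the target.
        if any(s == i and pair.target[t].pos == "DET" for s, t in pair.links):
            continue
        # Crude head heuristic (assumption): first noun to the right of the article.
        noun_idx = next(
            (j for j in range(i + 1, len(pair.source)) if pair.source[j].pos == "NOUN"),
            None,
        )
        if noun_idx is None:
            continue
        for s, t in sorted(pair.links):
            if s == noun_idx and pair.target[t].pos == "NOUN":
                hits.append((tok.form, pair.source[noun_idx].form, pair.target[t].form))
    return hits


# Invented example: German article with no English counterpart.
pair_zero = AlignedPair(
    source=[Token("Die", "DET"), Token("Inflation", "NOUN"),
            Token("ist", "AUX"), Token("gestiegen", "VERB")],
    target=[Token("Inflation", "NOUN"), Token("has", "AUX"), Token("risen", "VERB")],
    links={(1, 0), (2, 1), (3, 2)},
)

# Invented example: article present and aligned in both languages.
pair_overt = AlignedPair(
    source=[Token("Das", "DET"), Token("Haus", "NOUN"),
            Token("ist", "AUX"), Token("alt", "ADJ")],
    target=[Token("The", "DET"), Token("house", "NOUN"),
            Token("is", "AUX"), Token("old", "ADJ")],
    links={(0, 0), (1, 1), (2, 2), (3, 3)},
)

print(zero_article_candidates(pair_zero))   # [('Die', 'Inflation', 'Inflation')]
print(zero_article_candidates(pair_overt))  # []
```

Anchoring the search on the language that has the overt article sidesteps the precision problem the abstract mentions for monolingual retrieval, because only noun phrases whose translation actually carries an article are returned.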