Project


Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation

English title: Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation
Applicant: Hundt Marianne
Number: 146781
Funding scheme: Projects
Research institution: English Department (Englisches Seminar), University of Zurich
Institution of higher education: University of Zurich - ZH
Main discipline: German and English languages and literature
Start/End: 01.09.2013 - 30.09.2016
Approved amount: 366'487.00

All disciplines (3)

Discipline
German and English languages and literature
Computer science
Other languages

Keywords (5)

Corpus Linguistics; Parallel Corpora; Linguistic Variation; Parallel Concordancing; Articles

Lay Summary (German, translated)

Lead
The aim of the project is to annotate and align (at sentence and word level) a large parallel text corpus. Such corpora are an important resource for investigating linguistic variation across different language pairs. As a case study, the project focuses on variable article use. It exploits the fact that one language may use an article where the translation does not, in order to retrieve zero articles in particular.
Lay summary

Translated documents in multilingual contexts are a valuable resource both for natural language processing and for linguistic studies. Their usefulness for contrastive linguistic questions is increased substantially by automatic alignment at sentence and word level. The aim of the project is therefore to align and automatically annotate (part of speech and sentence structure) a large multilingual corpus.

Even in related languages such as German and English, an element in one language may correspond to a zero element in the other. Such zero elements can hardly be retrieved in studies of individual languages. The aim of the project is to investigate variable article use (including zero contexts) in the corpus, with a particular focus on zero articles in English.

Variable article use in English is relevant because English is increasingly acquired as a second language, including by speakers whose native language has no articles or uses articles differently. The aim of the project is a detailed description of article use, which provides important basic research for language teaching and machine translation.

The computational linguistics contribution of the project lies in automatic alignment and annotation and in the development of a suitable query system. While efficient query software for monolingual corpora already exists, developing efficient query tools for large word-aligned parallel corpora is a challenge.


Last updated: 12.07.2013

Lay Summary (English)

Lead
We aim to develop large-Scale PARallel Corpora to study LINGuistic variation (SPARCLING). These can be used to study variation in linguistic patterns across different languages. The focus in the linguistic part of the project is on variable article use. We use contexts where one language in a language pair uses an article to retrieve instances where the other language does not require an article. Such zero contexts are notoriously difficult to access in monolingual corpora.
Lay summary

Translated documents in multiple languages are a valuable resource for various tasks in natural language processing and linguistic research. Their usefulness for contrastive linguistics, in particular, has increased tremendously with the possibility to automatically align the texts on both the sentence and the word level. We will align and annotate (PoS-Tagging and Parsing) a large parallel corpus for several language pairs.

Linguistic variation at times involves the choice between the use of an element and its omission. Zero elements are difficult to retrieve, however. We use parallel corpora to target constructions with variable optional elements in one of the languages. As a case in point we will investigate variable article use in these languages, and zero articles in English, in particular.

Studying articles in English is of interest and importance because of the growing influence of non-native English speakers whose first languages do not have articles or use them differently. The aim of the project is to arrive at a detailed description of variable article use. This will prove useful for purposes of language teaching and machine translation.

The challenge for computational linguistics lies in the high-quality alignment and annotation of large corpora and the construction of an efficient and powerful corpus query tool that is able to handle these corpora. Efficient query tools for large monolingual corpora exist but the development of such a tool for parallel corpora is highly innovative.


Last updated: 12.07.2013

Publications

Graën Johannes, Clematide Simon, Volk Martin (2016), Efficient Exploration of Translation Variants in Large Multiparallel Corpora Using a Relational Database, in Proceedings of the 4th Workshop on the Challenges in the Management of Large Corpora, Portoroz, ELRA, Paris.
Graën Johannes, Clematide Simon (2015), Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora, in Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-3), Lancaster, Online, Lancaster.
Clematide Simon, Graën Johannes, Volk Martin (2015), Multilingwis -- A Multilingual Search Tool for Multi-word Units in Multiparallel Corpora, in Proceedings of EUROPHRAS, Malaga, Online, Malaga.
Grigonyte Gintare (ed.) (2015), Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015, Linköping University Electronic Press, Linköping, Sweden.
Clematide Simon (2015), Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora, in Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015, Vilnius, Lithuania, Linköping University Electronic Press, Linköping.
Volk Martin, Graën Johannes, Callegaro Elena (2014), Innovations in Parallel Corpus Search Tools, in Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, ELRA, Paris.
Clematide Simon, Graën Johannes, Lehner Stéphanie, Volk Martin, A Multilingual Gold Standard for Translation Spotting of German Compounds and their corresponding Multi-Word Units in English, Spanish, French and Italian, in Monti Johanna (ed.), Benjamins, Amsterdam.
Volk Martin, Clematide Simon, Graën Johannes, Ströbel Phillip, Bi-particle Adverbs, PoS-Tagging and the Recognition of German Separable Prefix Verbs, in Proceedings of KONVENS, Bochum, Online, Bochum.
Graën Johannes, Volk Martin, Batnic Dolores, Cleaning the Europarl Corpus for Linguistic Applications, in Proceedings of KONVENS, Hildesheim, Online, Hildesheim.

Collaboration

Group / person, country
Types of collaboration
Institute of Linguistics, Stockholm University, Sweden (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
Center for Computational Linguistics in Kaunas, Lithuania (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
Ruta Marcinkeviciene, Vytautas Magnus University, Kaunas, Lithuania (Europe)
- in-depth/constructive exchanges on approaches, methods or results

Scientific events

Active participation

Title | Type of contribution | Title of article or contribution | Date | Place | Persons involved
Critical Link 8 - Future-proofing interpreting and translating | Talk given at a conference | Keynote talk on Machine Translation for Media Accessibility | 01.07.2016 | Edinburgh, Great Britain and Northern Ireland | Volk Martin
Parseme EU COST Action Meeting | Poster | Searching Multi-words Simultaneously in Multiparallel Corpora | 08.04.2016 | Struga, Macedonia | Volk Martin
Annual conference of the Institut für Deutsche Sprache | Talk given at a conference | Multilingwis - Ein Sprachspiegel zur Untersuchung sprachübergreifender Variationen | 09.03.2016 | Mannheim, Germany | Volk Martin


Associated projects

Number | Title | Start | Funding scheme
165819 | SPARCLING: Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation | 01.09.2016 | Project funding (Div. I-III)

Abstract

Translated documents in multiple languages (here: parallel documents) are highly regarded as valuable resources for various tasks in natural language processing and linguistic research. Parallel corpora are useful for tasks as diverse as word sense disambiguation, terminology extraction and contrastive corpus linguistics. The usefulness of these resources for contrastive linguistics, in particular, has increased tremendously with the possibility to automatically align the texts not only on the sentence level but also on the word level.

We propose to align (sentence alignment and word alignment) and annotate (PoS-Tagging and Parsing) a large parallel corpus for the language pairs English-German, English-French, and English-Spanish. For this purpose we will use two large parallel corpora: the Europarl corpus and the UN-Corpus. In addition, we will align the corpus for the language pairs English-Russian and English-Finnish. The latter language pair enables us to include comparisons with a non-Indo-European language in the linguistic research.

Linguistic variation at times involves the choice between the use of an element and its omission (articles, relativizers, pronouns etc.). Modelling such zero or null contexts in a corpus-driven approach is particularly challenging because variable zero elements are difficult to retrieve even from annotated monolingual corpora: while it is possible to extract noun phrases without an article from a parsed corpus, such algorithms have poor precision because a vast number of instances will not allow for a definite or indefinite article. Therefore, we propose to use parallel annotated and word-aligned corpora that will enable us to target constructions with variable optional elements in one of the languages. Our goal is to prove the usefulness of such a large aligned and annotated corpus (LAAC) for the investigation of linguistic variation.

As a case in point we will investigate variable article use in these languages, and zero articles in English in particular. We believe that such a LAAC provides a number of advantages for new insights in these areas.

Studying articles in English is of interest and importance because of the growing influence of non-native English speakers whose first languages do not have articles or use articles in ways clearly different from English. With the help of our LAAC we will retrieve instances of zero articles for the language pairs in the corpus where both languages have definite and indefinite articles (English, French, German and Spanish). The language pairs English-Russian and English-Finnish will serve to model variable article use in English against the background of typological differences.

In addition, we will build a rich database that models factors influencing variable article choice in the language pair English-German. These factors include lexical variation in the head noun, the internal structure of the noun phrase (pre- and postmodification), the syntactic function of the noun phrase, and also discourse-pragmatic functions (given vs. new). The aim for the linguistics part of the project is to arrive at a detailed description of variable article use in English and German using a multivariate analysis. This will prove useful for purposes of language teaching and machine translation. Because of the element of lexico-grammatical variation, large parallel corpora are a prerequisite for this kind of research.

The challenge for computational linguistics lies in the high-quality alignment and annotation of large corpora and the construction of an efficient and powerful corpus query tool that is able to handle these corpora. Efficient query tools for large monolingual corpora exist, but the development of such a tool for parallel corpora is highly innovative.
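
The retrieval idea in the abstract (use an article in one language to locate a zero article in the aligned translation) can be illustrated with a small sketch. It is not the project's implementation. The `Token` class, the tag labels `DET` and `NOUN`, the "next noun to the right" head heuristic, and the example sentence pair with its hand-made word alignment are all assumptions made for illustration. A real `DET` tag would also cover non-article determiners, which would need filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str
    pos: str  # coarse PoS tag; "DET" and "NOUN" are assumed labels


def zero_article_candidates(source, target, alignment):
    """Find source articles with no article counterpart in the target.

    source, target: lists of Token for one aligned sentence pair.
    alignment: set of (source_index, target_index) word-alignment links.
    Returns (source_article, source_noun, target_noun) triples.
    """
    candidates = []
    for i, tok in enumerate(source):
        if tok.pos != "DET":
            continue
        linked = [t for s, t in alignment if s == i]
        if any(target[t].pos == "DET" for t in linked):
            continue  # the article is rendered by an article in the target
        # Assumed heuristic: the head noun is the next NOUN to the right.
        head = next(
            (j for j in range(i + 1, len(source)) if source[j].pos == "NOUN"),
            None,
        )
        if head is None:
            continue
        for s, t in sorted(alignment):
            if s == head and target[t].pos == "NOUN":
                candidates.append((tok.form, source[head].form, target[t].form))
    return candidates


# Hypothetical German-English pair with a hand-made alignment.
de = [Token("Die", "DET"), Token("Kinder", "NOUN"), Token("lieben", "VERB"),
      Token("die", "DET"), Token("Musik", "NOUN")]
en = [Token("Children", "NOUN"), Token("love", "VERB"), Token("music", "NOUN")]
links = {(1, 0), (2, 1), (4, 2)}

for hit in zero_article_candidates(de, en, links):
    print(hit)
```

Each reported triple marks an English noun whose German counterpart carries an article, so the English noun phrase is a candidate zero-article context. In a corpus-scale setting the same condition would more plausibly be expressed as a query over stored alignment links than as a per-sentence loop.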