Project

Exploiting Parallel Treebanks for Hybrid Machine Translation

English title: Exploiting Parallel Treebanks for Hybrid Machine Translation
Applicant: Volk Martin
Number: 132219
Funding scheme: Project funding (Div. I-III)
Research institution: Institut für Computerlinguistik, Universität Zürich
Institution of higher education: Universität Zürich - ZH
Main discipline: Other languages
Start/End: 01.01.2011 - 31.07.2014
Approved amount: 345'233.00

All disciplines (2)

Discipline
Other languages
Computer science

Keywords (6)

Machine Translation; Hybrid Translation System; Corpus Annotation; Parallel Treebank; Under-resourced language; Quechua

Lay Summary (English)

Machine translation research has been dominated by the statistical approach in recent years; Google Translate is the most prominent example. But this approach requires large collections of human-translated text as training material for automatically building a translation system. For many language pairs (e.g. Spanish-Quechua) only small amounts of translated text exist. It is therefore worthwhile to explore alternative paths toward hybrid machine translation systems that combine the rule-based approach with statistical methods.
We have chosen to investigate automatic translation from Spanish to Quechua and from Spanish to German. This allows us to study one pair of languages that are typologically clearly different (Spanish-Quechua) and to contrast it with a typologically closer pair (Spanish-German), for which large linguistic resources are available (bilingual corpora, bilingual dictionaries, language analysis and generation tools). This contrast will shed new light on the development of machine translation systems under very different conditions. Using Spanish as the source language in both cases is advantageous from both a theoretical and a practical perspective: it allows close comparisons and profits from the availability of open-source grammar analysis modules for Spanish.
Quechua is an indigenous language of South America spoken by 10 million people, mostly in Bolivia, Ecuador and Peru. Despite the large number of speakers, Quechua is losing ground to Spanish, which is the majority language and dominates administration and education. One of our project goals is to strengthen Quechua's reputation by making it easier to translate Spanish into Quechua. We envision that Spanish-language newspapers in the Quechua area will be interested in translating some of the local news into Quechua in order to make their papers more attractive to the local communities.
The central scientific goal is to develop methods for exploiting parallel treebanks for machine translation. Parallel treebanks are collections of grammatically analyzed sentences. They derive their name from the fact that the grammatical structure of a sentence, with functional labels for e.g. subject, predicate and object, is often represented as a syntax tree. We will use medium-size manually built treebanks and large automatically built treebanks and compare their respective merits.
Last update: 21.02.2013

Publications

Publication
Building a Spanish-German Dictionary for Hybrid MT
Göhring Anne (2014), Building a Spanish-German Dictionary for Hybrid MT, in Proceedings of the 3rd Workshop on Hybrid Approaches to Translation (HyTra), Göteborg.
Enhancing a Rule-Based MT System with Cross-Lingual WSD
Rudnick Alex, Rios Annette, Gasser Michael (2014), Enhancing a Rule-Based MT System with Cross-Lingual WSD, in SaLTMiL Workshop on free/open-source language resources for the machine translation of less-resource, Reykjavik.
Morphological Disambiguation and Text Normalization for Southern Quechua Varieties
Rios Annette, Castro Mamani Richard (2014), Morphological Disambiguation and Text Normalization for Southern Quechua Varieties, in COLING Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial), Dublin.
A tree is a Baum is an árbol is a sach’a: Creating a trilingual treebank
Rios Annette, Göhring Anne (2012), A tree is a Baum is an árbol is a sach’a: Creating a trilingual treebank, in Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul.
Parallel Treebanking Spanish-Quechua: How and how well do they align?
Rios Annette, Göhring Anne, Volk Martin (2012), Parallel Treebanking Spanish-Quechua: How and how well do they align?, in The 10th International Workshop on Treebanks and Linguistic Theories (TLT10), CSLI Publications, Stanford.
Building and Querying Parallel Treebanks
Volk Martin, Marek Torsten, Samuelsson Yvonne (2011), Building and Querying Parallel Treebanks, in Translation: Computation, Corpora, Cognition (Special Issue on Parallel Corpora: Annotation, Exploit, 1(1), 7-28.
From Multilingual Web-Archives to Parallel Treebanks in Five Minutes
Killer Markus, Sennrich Rico, Volk Martin (2011), From Multilingual Web-Archives to Parallel Treebanks in Five Minutes, in Multilingual Resources and Multilingual Applications. Proceedings of the Conference of the GSCL, GSCL, Hamburg.
Spell checking an agglutinative language: Quechua
Rios Annette (2011), Spell checking an agglutinative language: Quechua, in Proceedings of the 5th Language & Technology Conference (LTC'11), Poznan, Poland, University of Poznan, Poznan.
Word-aligned Parallel Text. A new Resource for Contrastive Language Studies
Volk Martin, Göhring Anne, Lehner Stéphanie, Rios Annette, Sennrich Rico, Uibo Heli (2011), Word-aligned Parallel Text. A new Resource for Contrastive Language Studies, in Proceedings of SDH 2011 - Supporting Digital Humanities: Answering the Unaskable, University of Copenhagen, Copenhagen.
Machine Learning applied to Rule-Based Machine Translation
Rios Annette, Göhring Anne, Machine Learning applied to Rule-Based Machine Translation, in Marta R. Costa-jussà, Reinhard Rapp, Patrik Lambert (eds.), Springer, Heidelberg.

Collaboration

Group / person | Country
Types of collaboration
University of the Basque Country | Spain (Europe)
- in-depth/constructive exchanges on approaches, methods or results
University of Cuzco | Peru (South America)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
- Research infrastructure
- Exchange of personnel

Scientific events

Self-organised

Title | Date | Place
Lecture series on machine translation in Peru (Martin Volk) | 22.06.2013 | Cuzco, Peru
Annotation and Alignment of Parallel Corpora for Linguistic Research | 22.01.2013 | Dagstuhl, Germany
Seminar "Sprachtechnologie für wenig beachtete Sprachen" (Language technology for neglected languages) | 21.02.2012 | Universität Zürich, Switzerland

Communication with the public

Communication | Title | Media | Place | Year
Media relations: print media, online media | Die Übersetzungsmaschinisten | Horizonte - Das Schweizer Forschungsmagazin | Western Switzerland, German-speaking Switzerland | 2013

Use-inspired outputs

Software

Name | Year
Spellchecker for Quechua | 2013

Associated projects

Number | Title | Start | Funding scheme
126999 | Domain-specific Statistical Machine Translation | 01.01.2010 | Projects
137766 | Domain-specific Statistical Machine Translation | 01.04.2012 | Projects
149841 | Hybrid Machine Translation for Morphologically Rich Languages | 01.01.2014 | Project funding (Div. I-III)

Abstract

Machine translation (MT) research has been dominated by the statistical approach in recent years. But this approach requires large parallel corpora as training material. For many language pairs there exist only small amounts of translated written texts. Therefore it is worthwhile to explore alternative paths that allow the development of hybrid machine translation systems that combine the rule-based approach with statistical methods. We propose to investigate the role of parallel treebanks for building and tuning hybrid MT systems. We derive weighted transfer rules from parallel treebanks. These transfer rules will constitute the core of a transfer component and help to generate multiple translation hypotheses in the target language. These hypotheses will be ranked with the help of a statistical language model. We have chosen to investigate the automatic translation from Spanish to Quechua and from Spanish to German. This allows us to study one pair of languages which are typologically clearly different (Spanish - Quechua). We will contrast this language pair with a typologically closer language pair (Spanish - German). We believe that this contrast will shed new light on the development of MT systems under very different conditions.
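
The pipeline outlined in the abstract (weighted transfer rules derived from a parallel treebank, several target-language hypotheses, ranking with a statistical language model) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the project's actual implementation: the toy aligned fragments, the relative-frequency weighting, the add-one-smoothed bigram model, and all names (`ALIGNED_FRAGMENTS`, `extract_weighted_rules`, `BigramLM`, `rank`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical aligned fragment pairs, as they might be read off a parallel
# treebank: (source-side tag pattern, target-side tag pattern).
ALIGNED_FRAGMENTS = [
    (("N", "ADJ"), ("ADJ", "N")),
    (("N", "ADJ"), ("ADJ", "N")),
    (("N", "ADJ"), ("ADJ", "N")),
    (("N", "ADJ"), ("N", "ADJ")),
]


def extract_weighted_rules(fragments):
    """Weight each transfer rule by its relative frequency per source pattern
    (an assumed weighting scheme)."""
    counts = defaultdict(Counter)
    for src, tgt in fragments:
        counts[src][tgt] += 1
    return {
        src: {tgt: n / sum(tgts.values()) for tgt, n in tgts.items()}
        for src, tgts in counts.items()
    }


def generate_hypotheses(tagged_words, rules):
    """Apply every rule that matches the tag pattern of the input.

    tagged_words holds (tag, target-language word) pairs, i.e. lexical
    transfer is assumed to be done already; tags are assumed to be unique.
    """
    src_pattern = tuple(tag for tag, _ in tagged_words)
    by_tag = dict(tagged_words)
    return [
        ([by_tag[tag] for tag in tgt_pattern], weight)
        for tgt_pattern, weight in rules.get(src_pattern, {}).items()
    ]


class BigramLM:
    """Bigram language model with add-one smoothing (toy stand-in)."""

    def __init__(self, sentences):
        self.context = Counter()
        self.bigrams = Counter()
        vocab = set()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            vocab.update(tokens)
            self.context.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab)

    def logprob(self, sent):
        tokens = ["<s>"] + sent + ["</s>"]
        return sum(
            math.log((self.bigrams[(a, b)] + 1)
                     / (self.context[a] + self.vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )


def rank(hypotheses, lm):
    """Sort hypotheses by log rule weight plus LM log-probability, best first."""
    scored = [(words, math.log(w) + lm.logprob(words)) for words, w in hypotheses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


rules = extract_weighted_rules(ALIGNED_FRAGMENTS)
lm = BigramLM([["das", "rote", "Haus"], ["ein", "rotes", "Haus"], ["ein", "altes", "Haus"]])
# Spanish "casa roja" after (assumed) lexical transfer into German word forms.
hypotheses = generate_hypotheses([("N", "Haus"), ("ADJ", "rotes")], rules)
ranked = rank(hypotheses, lm)
best_words = ranked[0][0]
```

In this toy setting the adjective-noun order is preferred both by the rule weight and by the language model. A real system would need rules over full syntactic structures, morphological generation, and a language model trained on a large target-language corpus.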