Project

Domain-specific Statistical Machine Translation

English title: Domain-specific Statistical Machine Translation
Applicant: Volk Martin
Number: 126999
Funding scheme: Projects
Research institution: Institut für Computerlinguistik, Universität Zürich
Institution of higher education: Universität Zürich - ZH
Main discipline: German and English languages and literature
Start/End: 01.01.2010 - 31.05.2012
Approved amount: 219'235.00

All disciplines (2)

Discipline
German and English languages and literature
Computer science

Keywords (6)

Machine Translation; Corpus Annotation; Named Entity Classification; Parsing; Investigation of Translated Documents; Alpine Texts

Lay Summary (English)

Automatic translation from one language to another is an old dream, one that has driven research on machine translation (MT) since the start of the computer age. It soon became clear that languages are complex systems that pose hard problems for computer processing on all levels (words, grammar, meaning). Until the 1980s the dominant paradigm relied on human labor, mostly spent on compiling large bilingual dictionaries and large collections of grammar and transfer rules. This approach has led to a number of impressive MT systems, the most famous of which is arguably Systran. Their output is not a perfect translation, but it is useful for draft translations. Specially tailored versions of rule-based MT systems are in everyday use in large international organizations such as the European Union.
However, the development of such translation systems was limited by the huge manual effort required. This situation changed dramatically with the advent of Statistical MT in the 1990s. Its input is large amounts of human-translated text (i.e. parallel texts in source and target language). Based on these parallel texts, the computer derives the bilingual dictionary automatically, cuts the parallel sentences into pieces and reassembles the pieces when translating a new sentence.
Within this new paradigm, translation systems can be built within a few weeks when enough high-quality texts are available for the desired language pair. Our experiments indicate that a collection of 10 million words of translated texts is a good starting point. If more text is available, the translation quality will improve. Google Translate with its many language pairs is an example of this new MT paradigm.
MT systems work better when they are tuned for specific textual domains. For example, a system trained on technical user manuals will work best on such manuals. The goal of our project is to build a Statistical MT system for alpine texts.
We will first focus on reports of mountaineering expeditions. Our input data come from the yearbooks of the Swiss Alpine Club (SAC), which has published translated articles in French and German since 1957. We are currently digitizing these yearbooks (www.textberg.ch) and will be the first to investigate this parallel text collection. We estimate that we will be able to extract around 5 million translated words plus 30 million words of monolingual texts. The challenge thus lies in combining the translated and untranslated parts in innovative ways in order to create a high-quality translation system.
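The step in which the computer derives a bilingual dictionary from parallel texts can be illustrated with a minimal sketch. It assumes an IBM Model 1 style expectation-maximization procedure, which the summary does not name, and a made-up four-sentence German-French corpus; the names `train_ibm_model1`, `best_translation` and `TOY_CORPUS` are hypothetical and not taken from the project. Real systems work on millions of words and add phrase extraction and reordering on top.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus (German source, French target), not project data.
TOY_CORPUS = [
    ("der berg".split(), "la montagne".split()),
    ("ein berg".split(), "une montagne".split()),
    ("der gletscher".split(), "le glacier".split()),
    ("ein gletscher".split(), "un glacier".split()),
]


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(target, source)] = P(target word | source word) with EM."""
    target_vocab = {w for _, tgt in pairs for w in tgt}
    # Start from a uniform distribution over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (target, source) pairs
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            for tw in tgt:
                # E-step: spread each target word over the source words
                # of the same sentence pair, in proportion to t.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(
            float, {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
        )
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    table = train_ibm_model1(TOY_CORPUS)
    for word in ("berg", "gletscher"):
        print(word, "->", best_translation(table, word))
```

Because "berg" co-occurs with "montagne" in two sentence pairs but with each article only once, the expected counts concentrate on that pairing over the iterations. No dictionary entry is supplied by hand, which is the point the summary makes.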
Last update: 21.02.2013

Publications

Publication
Abdul-Rauf S, Fishel M, Lambert P, Noubours S, Sennrich R (2012), Extrinsic Evaluation of Sentence Alignment Systems, in Proceedings of CREDISLAS 2012, LREC, Istanbul.
Sennrich Rico (2012), Mixture-Modeling with Unsupervised Clusters for Domain Adaptation in Statistical Machine Translation, in EAMT-2012: the 16th Annual Conference of the European Association for Machine Translation.
Plamada Magdalena, Volk Martin (2012), Towards a Wikipedia-extracted Alpine Corpus, in Proceedings of BUCC 2012, Istanbul.
Sennrich Rico (2012), Perplexity minimization for translation model domain adaptation in statistical machine translation, in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, Association for Computational Linguistics.
Ebling S, Sennrich R, Klaper D, Volk M (2011), Digging for names in the mountains: Combined person name recognition and reference resolution for German alpine texts, in 5th Language & Technology Conference, Poznan.
Jitca M, Sennrich R, Volk M (2011), From historic books to annotated XML: Building a large multilingual diachronic corpus, in Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011, Hamburg, Universität Hamburg, (96).
Sennrich R (2011), The UZH system combination system for WMT 2011, in Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, Association for Computational Linguistics.
Sennrich R (2011), Combining multi-engine machine translation and online learning through dynamic phrase tables, in EAMT-2011: the 15th Annual Conference of the European Association for Machine Translation, Leuven.
Sennrich R, Volk M (2011), Iterative, MT-based sentence alignment of parallel texts, in NODALIDA 2011, Nordic Conference of Computational Linguistics, Riga.
Volk Martin, Furrer Lenz, Sennrich Rico (2011), Strategies for reducing and correcting OCR error, in Zervanou Kalliopi, Sporleder Caroline, van den Bosch Antal (eds.), Springer, Berlin, 3-22.

Scientific events

Active participation

Title | Date | Place
BUCC 2012, the Fifth Workshop on Building and Using Comparable Corpora | 22.05.2012 | Istanbul
EAMT 2012, the 16th Annual Conference of the European Association for Machine Translation | 15.05.2012 | Trento
EACL 2012, the 13th Conference of the European Chapter of the Association for Computational Linguistics | 15.04.2012 | Avignon
GSCL 2011, Conference of the German Society for Computational Linguistics and Language Technology | 01.10.2011 | Hamburg
WMT 2011, the 6th Workshop on Statistical Machine Translation | 01.07.2011 | Edinburgh
EAMT 2011, the 15th Annual Conference of the European Association for Machine Translation | 31.05.2011 | Leuven
NODALIDA 2011, the 18th Nordic Conference of Computational Linguistics | 15.05.2011 | Riga


Associated projects

Number | Title | Start | Funding scheme
132219 | Exploiting Parallel Treebanks for Hybrid Machine Translation | 01.01.2011 | Project funding (Div. I-III)
147653 | MODERN: Modeling discourse entities and relations for coherent machine translation | 01.08.2013 | Sinergia
137766 | Domain-specific Statistical Machine Translation | 01.04.2012 | Projects
149841 | Hybrid Machine Translation for Morphologically Rich Languages | 01.01.2014 | Project funding (Div. I-III)
169888 | Rich Context in Neural Machine Translation | 01.01.2017 | Projects

Abstract

Statistical Machine Translation (SMT) systems can be built quickly when sufficient training material is available. Our experiments indicate that a parallel corpus of 10 million words per language serves as a good training corpus for SMT systems. But for many language pairs and application areas only smaller amounts of translated text exist. For example, when we wanted to build an SMT system for the automobile industry, our industry partner had "only" 1 million words of domain-specific translated texts at its disposal. This situation is very common in SMT application scenarios. The general question is then how we can best combine limited domain-specific corpora with, on the one hand, large, freely available parallel corpora and, on the other hand, large domain-specific monolingual corpora in the respective source and target languages.
We will address these issues for a textual domain that is at the heart of Swiss identity: alpine texts, i.e. documents on the nature and culture of the Alps, mountaineering and travel reports, ethnology and geology articles and the like. We want to investigate how best to build SMT systems for translating alpine texts between English, German and French. Here is our rationale in brief.
1. Most research on SMT focuses on a few freely available corpora. In particular, there is a lot of research based on the written transcripts of the European Parliament (Europarl) and on the EU legislative texts (Acquis Communautaire). It is known that the translation quality of SMT systems decreases when they are used outside the textual domain of the training data. SMT systems that are built on EU texts work best on EU texts. Little is known about how to build SMT systems for domains where few or no translated texts are available. This is the focus of the current project.
2. We want to work on the textual domain "Alpinism", which includes mountaineering, mountain sports, mountain culture, flora, fauna and geology. These topics are obviously of central interest in Switzerland and in the neighboring alpine countries.
3. We want to build SMT systems that translate alpine texts between English, German and French. This means that we focus on the main languages for publication in Switzerland and the alpine region.
4. Our starting point for building the baseline systems is the set of freely available corpora: Europarl (DE-EN-FR), Acquis Communautaire (DE-EN-FR), and the Swiss Federal Laws (DE-FR).
5. We have a sizable corpus of alpine texts at our disposal (the yearbooks of the Swiss Alpine Club from 1864 until today). Part of this corpus is parallel DE-FR (about 5 million tokens from the last 50 years); the rest constitutes a comparable corpus, which can be used in combination or as monolingual training material for the creation of target language models. We will add comparable English texts from the British Alpine Club.
6. We will use NLP systems (Name Classifiers, PoS-Taggers, Parsers) for English, French and German, which have been developed in our institute, in order to investigate when and how linguistic information improves SMT systems.
7. Part of the effort will go into building a domain-specific French-German parallel treebank, which will help us to automatically identify specific translation problems. In the past we have built English-German-Swedish parallel treebanks. We will reuse our methodology and our tools (in particular the TreeAligner) for the efficient creation of a parallel treebank of alpine texts with 1000 sentences in the two languages.
In summary, we propose to build domain-specific DE-EN-FR Statistical Machine Translation systems for alpine texts. We have collected a unique corpus of alpine texts from the Swiss Alpine Club in French and German. Part of the corpus is parallel and will be exploited to train SMT systems; the rest of the corpus will serve as a comparable corpus and help us to fill lexical gaps and to tune the system to the domain. We will add English texts from the British Alpine Club, part of which consists of translations from or into our French and German texts.
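The question of how to combine a small domain-specific corpus with a large out-of-domain corpus can be made concrete with a toy sketch. It shows one common option, linear interpolation of two language models with the mixing weight chosen by minimizing perplexity on a small in-domain development set. This is an illustration under stated assumptions, not the project's actual setup: the models are add-one smoothed unigram models, the three token lists are invented, and the names `unigram_model`, `perplexity` and `best_weight` are hypothetical.

```python
import math
from collections import Counter

# Invented toy data: a small alpine corpus, a larger EU-style corpus, a dev set.
IN_DOMAIN = "der gletscher am berg der gipfel am grat".split()
OUT_DOMAIN = (
    "das parlament der union und das gesetz der kommission und der rat".split()
)
DEV = "der gipfel und der gletscher".split()
VOCAB = set(IN_DOMAIN) | set(OUT_DOMAIN) | set(DEV)


def unigram_model(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def perplexity(dev_tokens, in_model, out_model, weight):
    """Perplexity of the mixture weight*in_model + (1 - weight)*out_model."""
    log_sum = 0.0
    for w in dev_tokens:
        p = weight * in_model[w] + (1.0 - weight) * out_model[w]
        log_sum += math.log(p)
    return math.exp(-log_sum / len(dev_tokens))


def best_weight(dev_tokens, in_model, out_model, steps=20):
    """Grid search over [0, 1] for the weight with the lowest dev perplexity."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda lam: perplexity(dev_tokens, in_model, out_model, lam),
    )


IN_MODEL = unigram_model(IN_DOMAIN, VOCAB)
OUT_MODEL = unigram_model(OUT_DOMAIN, VOCAB)

if __name__ == "__main__":
    lam = best_weight(DEV, IN_MODEL, OUT_MODEL)
    print("in-domain weight:", lam)
    print("mixture perplexity:", perplexity(DEV, IN_MODEL, OUT_MODEL, lam))
```

On this toy data the mixture beats either model alone, because the development set contains both alpine vocabulary and general function words. A production system would use higher-order n-gram models and would apply similar weighting to the translation model as well.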