Project

Domain-specific Statistical Machine Translation

English title: Domain-specific Statistical Machine Translation
Applicant: Volk Martin
Number: 137766
Funding scheme: Projects
Research institution: Institut für Computerlinguistik, Universität Zürich
Institution of higher education: University of Zurich - ZH
Main discipline: German and English languages and literature
Start/End: 01.04.2012 - 31.07.2013
Approved amount: 106'704.00

All disciplines (2)

Discipline
German and English languages and literature
Computer science

Keywords (4)

Machine Translation; Domain Adaptation; Natural Language Processing; Parallel Corpora

Lay Summary (English)

Automatic translation from one language to another is an old dream, and it has driven research on machine translation (MT) since the start of the computer age. It soon became clear that languages are complex systems that pose hard problems for computer processing on all levels (words, grammar, meaning). Until the 1980s the dominant paradigm relied on human labor, mostly spent on compiling large bilingual dictionaries and large collections of grammar and transfer rules. This approach led to a number of impressive MT systems, the most famous of which is arguably Systran. Their output is not a perfect translation, but it is useful as a draft. Specially tailored versions of rule-based MT systems are in everyday use in large international organizations such as the European Union. However, the development of such translation systems was limited by the huge manual effort required.

This situation changed dramatically with the arrival of Statistical MT in the 1990s. Its input is a large amount of human-translated text (i.e. parallel texts in the source and target language). From these parallel texts the computer derives the bilingual dictionary automatically, cuts the parallel sentences into pieces, and re-assembles the pieces when translating a new sentence. Within this paradigm, a new translation system can be built in a few weeks, provided enough high-quality text is available for the desired language pair. A collection of 10 million words of translated text is a good starting point, and translation quality improves as more text becomes available. Google Translate, with its many language pairs, is an example of this new MT paradigm.
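
To make the idea of deriving a bilingual dictionary from parallel text more concrete, the following is a minimal sketch of IBM Model 1, a classic word-based alignment model trained with expectation-maximization. It is shown only as an illustration and is not claimed to be the method used in this project. The three German-French sentence pairs and the names `train_ibm_model1`, `best_translation`, and `TOY_CORPUS` are invented for the example.

```python
from collections import defaultdict


def train_ibm_model1(sentence_pairs, iterations=10):
    """Estimate lexical translation probabilities t(target | source) with EM.

    sentence_pairs: list of (source_tokens, target_tokens) tuples.
    Returns a dict mapping (target_word, source_word) to a probability.
    """
    target_vocab = {word for _, tgt in sentence_pairs for word in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start from a uniform distribution: no word pair is preferred yet.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word
        # E-step: distribute each target word over the source words
        # of the same sentence, in proportion to the current probabilities.
        for src, tgt in sentence_pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float)
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)


def best_translation(table, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {f: p for (f, e), p in table.items() if e == source_word}
    return max(candidates, key=candidates.get)


# Toy parallel corpus (German source, French target), invented for illustration.
TOY_CORPUS = [
    (["das", "haus"], ["la", "maison"]),
    (["das", "buch"], ["le", "livre"]),
    (["ein", "buch"], ["un", "livre"]),
]

if __name__ == "__main__":
    table = train_ibm_model1(TOY_CORPUS)
    print(best_translation(table, "buch"))
```

Even on three sentence pairs, the word that co-occurs most consistently with "buch" ends up with the highest probability. Phrase-based systems build on word alignments of this kind when they cut sentences into reusable pieces.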

MT systems work better when they are tuned to a specific textual domain. For example, a system built from technical user manuals will work best on such manuals. The goal of our project is to build a Statistical MT system for Alpine texts, with a focus on reports of mountaineering expeditions. Our input data come from the yearbooks of the Swiss Alpine Club (SAC), which has published translated articles in French and German since 1957. We have digitized these yearbooks (www.textberg.ch) and are the first to investigate this parallel text collection. We have extracted around 4.5 million translated words plus 30 million words of monolingual text. The challenge lies in combining the translated and untranslated parts in innovative ways in order to create a high-quality translation system.


Last updated: 21.02.2013

Publications

Plamada Magdalena, Volk Martin (2012), Using parallel treebanks for machine translation evaluation, in Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, Lisbon, Portugal, Edições Colibri.
Sennrich Rico, Schwenk Holger, Aransa Walid, A Multi-Domain Translation Model Framework for Statistical Machine Translation, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, Association for Computational Linguistics.
Smith Jason, Saint-Amand Herve, Plamada Magdalena, Koehn Philipp, Callison-Burch Chris, Lopez Adam, Dirt Cheap Web-Scale Parallel Text from the Common Crawl, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria.
Sennrich Rico, Volk Martin, Schneider Gerold, Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis, in Proceedings of the International Conference Recent Advances in Natural Language Processing, Hissar, Bulgaria.
Plamada Magdalena, Volk Martin, Mining for Domain-specific Parallel Text from Wikipedia, in Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, Sofia, Bulgaria.
Sennrich Rico, Promoting Flexible Translations in Statistical Machine Translation, in Proceedings of Machine Translation Summit XIV, Nice, France.

Collaboration

Group / Person; Country
Types of collaboration
Eckhard Bick / Southern Denmark University; Denmark (Europe)
- in-depth/continuing exchange of approaches, methods or results
Joakim Nivre / Uppsala University; Sweden (Europe)
- in-depth/continuing exchange of approaches, methods or results
Université du Maine; France (Europe)
- in-depth/continuing exchange of approaches, methods or results
- publication
- research infrastructure
- exchange of personnel
Finnova AG; Switzerland (Europe)
- in-depth/continuing exchange of approaches, methods or results
- industry/business/other application-oriented collaboration

Scientific events

Active participation

Title; Date; Place
51st Annual Meeting of the Association for Computational Linguistics; 04.07.2013; Sofia, Bulgaria
11th International Workshop on Treebanks and Linguistic Theories; 30.11.2012; Lisbon, Portugal
Seventh MT Marathon 2012; 03.09.2012; Edinburgh, UK

Communication with the public

Communication; Title; Media; Place; Date
Media relations: print media, online media; Von Nogo, Tricks und guten Übersetzungen; Journal. Die Zeitung der Universität Zürich; German-speaking Switzerland; 03.05.2013

Awards

Title; Year
Mercator Award of the University of Zurich; 2013

Associated projects

Number; Title; Start; Funding scheme
132219; Exploiting Parallel Treebanks for Hybrid Machine Translation; 01.01.2011; Project funding (Div. I-III)
126999; Domain-specific Statistical Machine Translation; 01.01.2010; Projects

Abstract

Statistical Machine Translation (SMT) systems can be built quickly when sufficient training material is available. But for many language pairs and application areas only small amounts of high-quality human-translated text exist. The general question in this project is how best to adapt an SMT system to a particular application domain when only a limited amount of in-domain human-translated text is available as training data.

In the first months of the project we prepared a domain-specific parallel corpus of Alpine texts in German and French as the result of a digitization project on the yearbooks of the Swiss Alpine Club (SAC). Since 1957 the SAC has published its yearbooks in parallel French and German versions. We have turned this material into a unique corpus of more than 4 million words in both languages.

We have run an extensive series of SMT experiments on this corpus. They confirmed our expectation that even a limited domain-specific corpus yields much better in-domain translation quality than an out-of-domain corpus that is an order of magnitude larger, or even a general MT system like Google Translate.

In this proposal for an extension of the project we first survey recent developments in SMT in general and in domain adaptation for SMT in particular. We then describe our accomplishments in the first year of the project and restate our tasks for the second year. We have followed the project schedule closely, and our publications testify to the recognition of our work.

We apply for a project extension of two years. In this period we will continue our work in a number of directions. We have realized that exploiting comparable domain-specific corpora to enlarge the training corpora is important for improving SMT quality. Recent work on extracting parallel sentences from Wikipedia pages confirms this point. This will be our first line of activities.

Second, we will compare and contrast our work with competing approaches. This means that we need to run controlled experiments on the methods proposed by other researchers, for instance on domain adaptation through caching the recent history, through corpus identifiers in factored SMT, and through creating synthetic corpora by automatically translating in-domain monolingual corpora.

Third, we observe a tendency to include linguistic information in SMT systems and to combine rule-based and statistical MT in hybrid systems. We will include linguistic information ranging from named entities through specific word reordering to dependency parsing. Our German dependency parser is partly rule-based and will thus be able to account for properties of the source language. We propose to combine these hybrid systems with domain adaptation methods.

Finally, we plan to enrich our parallel French-German treebank, which we built in the first year of this project, with a shallow semantic layer of geographical tags. The parallel treebank and the annotated parallel corpus will be showcases of our work through a combined web-based Alignment Search System and machine translation system for Alpine texts.
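
Of the competing approaches named in the abstract, corpus identifiers in factored SMT are the easiest to picture: every token carries an extra factor naming the corpus it came from, so that a factored model can condition on the domain. The sketch below only illustrates that annotation step. It assumes the Moses-style convention of separating factors with `|` and escaping a literal `|` as `&#124;`. The corpus labels `alpine` and `general`, the example sentences, and all function names are hypothetical and do not come from the project.

```python
def escape_token(token):
    """Escape the factor separator so a literal '|' cannot be read as one."""
    return token.replace("|", "&#124;")


def add_corpus_factor(sentence, corpus_id):
    """Append corpus_id as a second factor to every whitespace-separated token."""
    return " ".join(
        f"{escape_token(token)}|{corpus_id}" for token in sentence.split()
    )


def tag_corpora(corpora):
    """Merge several corpora into one training set with per-token corpus factors.

    corpora: dict mapping a corpus label to a list of tokenized sentences.
    """
    tagged = []
    for corpus_id, sentences in corpora.items():
        tagged.extend(add_corpus_factor(s, corpus_id) for s in sentences)
    return tagged


if __name__ == "__main__":
    # Hypothetical in-domain and out-of-domain sentences.
    training_data = tag_corpora({
        "alpine": ["der Gipfel ist erreicht"],
        "general": ["das Parlament tagt heute"],
    })
    for line in training_data:
        print(line)
```

With this annotation, in-domain and out-of-domain material can be pooled into one training set while the model can still tell the two apart.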