Machine Translation; Domain Adaptation; Natural Language Processing; Parallel Corpora
Using parallel treebanks for machine translation evaluation (2012), in Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, Lisbon, Portugal.
A Multi-Domain Translation Model Framework for Statistical Machine Translation, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria.
Dirt Cheap Web-Scale Parallel Text from the Common Crawl, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria.
Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis, in Proceedings of the International Conference Recent Advances in Natural Language Processing, Hissar, Bulgaria.
Mining for Domain-specific Parallel Text from Wikipedia, in Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, Sofia, Bulgaria.
Promoting Flexible Translations in Statistical Machine Translation, in Proceedings of Machine Translation Summit XIV, Nice, France.
Statistical Machine Translation (SMT) systems can be built quickly when sufficient training material is available. For many language pairs and application areas, however, only small amounts of high-quality human-translated text exist. The general question in this project is how best to adapt an SMT system to a particular application domain when only a limited amount of in-domain human-translated text is available as training data.

In the first months of the project we prepared a domain-specific parallel corpus of Alpine texts in German and French, the result of a digitization project on the yearbooks of the Swiss Alpine Club (SAC). Since 1957 the SAC has published its yearbooks in parallel French and German versions. We have turned this material into a unique corpus of more than 4 million words in both languages.

We have run an extensive series of SMT experiments on this corpus. They confirmed our expectation that even a limited domain-specific corpus yields much better in-domain translation quality than an out-of-domain corpus an order of magnitude larger, or even a general MT system such as Google Translate.

In this proposal for an extension of the project we first survey recent developments in SMT in general and in domain adaptation for SMT in particular. We then describe our accomplishments in the first year of the project and restate our tasks for the second year. We have followed the project schedule closely, and our publications testify to the recognition of our work.

We apply for a project extension of two years, in which we will continue our work in several directions. First, we have found that exploiting comparable domain-specific corpora to enlarge the training corpora is important for improving SMT quality. Recent work on extracting parallel sentences from Wikipedia pages confirms this point. This will be our first line of activity.

Second, we will compare and contrast our work with competing approaches. This means running controlled experiments on methods proposed by other researchers, for instance domain adaptation through caching the recent history, through corpus identifiers in factored SMT, and through creating synthetic corpora by automatically translating in-domain monolingual corpora.

Third, we observe a tendency to include linguistic information in SMT systems and to combine rule-based and statistical MT in hybrid systems. We will include linguistic information ranging from named entities through specific word reordering to dependency parsing. Our German dependency parser is partly rule-based and will thus be able to account for properties of the source language. We propose to combine these hybrid systems with domain adaptation methods.

Finally, we plan to enrich our parallel French-German treebank, which we built in the first year of this project, with a shallow semantic layer of geographical tags. The parallel treebank and the annotated parallel corpus will showcase our work through a combined web-based Alignment Search System and machine translation system for Alpine texts.
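Of the competing approaches named above, corpus identifiers in factored SMT are the easiest to illustrate in a few lines. The following is a minimal sketch, not the project's actual pipeline: it assumes a Moses-style factored input in which the factors of a token are joined by "|", and it appends a corpus label to every token so that in-domain and out-of-domain training sentences can be told apart. The function names, the labels "alpine" and "general", the example sentences, and the escape sequence for a literal "|" are all assumptions made for illustration.

```python
from typing import Iterable, List

FACTOR_SEP = "|"        # factor delimiter in Moses-style factored input
ESCAPED_SEP = "&#124;"  # assumed escape for a literal "|" inside a token


def add_corpus_factor(sentence: str, corpus_id: str) -> str:
    """Append corpus_id as an extra factor to every whitespace-separated token."""
    tagged = []
    for token in sentence.split():
        # Protect literal separators so they are not read as factor boundaries.
        token = token.replace(FACTOR_SEP, ESCAPED_SEP)
        tagged.append(f"{token}{FACTOR_SEP}{corpus_id}")
    return " ".join(tagged)


def tag_corpus(sentences: Iterable[str], corpus_id: str) -> List[str]:
    """Tag every sentence of one corpus with the same corpus identifier."""
    return [add_corpus_factor(s, corpus_id) for s in sentences]


if __name__ == "__main__":
    # Hypothetical one-sentence corpora, for illustration only.
    in_domain = ["der Gipfel ist erreicht"]
    out_of_domain = ["das Parlament tagt heute"]

    training = tag_corpus(in_domain, "alpine") + tag_corpus(out_of_domain, "general")
    for line in training:
        print(line)
```

In a setup of this kind, the tagged sentences from both corpora would be concatenated into a single training file, leaving it to the factored model to learn how much weight each corpus label deserves.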