Machine Translation; Corpus Annotation; Named Entity Classification; Parsing; Investigation of Translated Documents; Alpine Texts
(2012), Extrinsic Evaluation of Sentence Alignment Systems, in Proceedings of CREDISLAS 2012, LREC
(2012), Mixture-Modeling with Unsupervised Clusters for Domain Adaptation in Statistical Machine Translation, in EAMT-2012: the 16th Annual Conference of the European Association for Machine Translation
(2012), Towards a Wikipedia-extracted Alpine Corpus, in Proceedings of BUCC 2012
(2012), Perplexity minimization for translation model domain adaptation in statistical machine translation, in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Ling
(2011), Digging for names in the mountains: Combined person name recognition and reference resolution for German alpine texts, in 5th Language & Technology Conference
(2011), From historic books to annotated XML: Building a large multilingual diachronic corpus, in Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011, Hamburg (96).
(2011), The UZH system combination system for WMT 2011, in Proceedings of the Sixth Workshop on Statistical Machine Translation
(2011), Combining multi-engine machine translation and online learning through dynamic phrase tables, in EAMT-2011: the 15th Annual Conference of the European Association for Machine Translation
(2011), Iterative, MT-based sentence alignment of parallel texts, in NODALIDA 2011, Nordic Conference of Computational Linguistics
(2011), Strategies for reducing and correcting OCR error, 3-22.
Statistical Machine Translation (SMT) systems can be built quickly when sufficient training material is available. Our experiments indicate that a parallel corpus of 10 million words per language serves as a good training corpus for SMT systems. But for many language pairs and application areas, only smaller amounts of translated text exist. For example, when we wanted to build an SMT system for the automobile industry, our industry partner had "only" 1 million words of domain-specific translated texts at its disposal. This situation is very common in SMT application scenarios. The general question is then how we can best combine limited domain-specific corpora with, on the one hand, large, freely available parallel corpora and, on the other hand, large domain-specific monolingual corpora in the respective source and target languages.

We will address these issues for a textual domain that is at the heart of Swiss identity: alpine texts, i.e. documents on the nature and culture of the Alps, mountaineering and travel reports, ethnology and geology articles and the like. We want to investigate how best to build SMT systems for translating alpine texts between English, German and French. Here is our rationale in brief.

1. Most research on SMT focuses on a few freely available corpora. In particular, there is a lot of research based on the written transcripts of the European Parliament (Europarl) and on the EU legislative texts (Acquis Communautaire). It is known that the translation quality of SMT systems decreases when they are used outside the textual domain of the training data: SMT systems that are built on EU texts work best on EU texts. Little is known about how to build SMT systems for domains where few or no translated texts are available. This is the focus of the current project.
2. We want to work on the textual domain "Alpinism", which includes mountaineering, mountain sports, mountain culture, flora, fauna and geology. These topics are of central interest in Switzerland and in the neighboring alpine countries.
3. We want to build SMT systems that translate alpine texts between English, German and French. This means that we focus on the main languages for publication in Switzerland and the alpine region.
4. Our starting point for building our baseline systems is the set of freely available corpora: Europarl (DE-EN-FR), Acquis Communautaire (DE-EN-FR), and the Swiss Federal Laws (DE-FR).
5. We have a sizable corpus of alpine texts at our disposal (the yearbooks of the Swiss Alpine Club from 1864 until today). Part of this corpus is parallel DE-FR (about 5 million tokens from the last 50 years); the rest constitutes a comparable corpus, which can be used in combination or as monolingual training material for the creation of target language models. We will add comparable English texts from the British Alpine Club.
6. We will use NLP systems (Name Classifiers, PoS-Taggers, Parsers) for English, French and German, which have been developed in our institute, in order to investigate when and how linguistic information improves SMT systems.
7. Part of the effort will go into building a domain-specific French-German parallel treebank, which will help us to automatically identify specific translation problems. In the past we have built English-German-Swedish parallel treebanks. We will reuse our methodology and our tools (in particular the TreeAligner) for the efficient creation of a parallel treebank of alpine texts with 1000 sentences in the two languages.

We propose to build domain-specific DE-EN-FR Statistical Machine Translation systems for alpine texts. We have collected a unique corpus of alpine texts from the Swiss Alpine Club in French and German. Part of the corpus is parallel and will be exploited to train SMT systems; the rest of the corpus will serve as a comparable corpus and help us to fill lexical gaps and to tune the system to the domain. We will add English texts from the British Alpine Club, part of which consists of translations from or to our French and German texts.