Lead


Lay summary
Automatic translation from one language to another is an old dream, and it has driven research on machine translation (MT) since the start of the computer age. It soon became clear that languages are complex systems that pose hard problems for computer processing on all levels (words, grammar, meaning). Until the 1980s the dominant paradigm relied on human labor, mostly spent on compiling large bilingual dictionaries and large collections of grammar and transfer rules. This approach led to a number of impressive MT systems, the most famous of which is arguably Systran. Their output is not a perfect translation, but it is useful as a draft. Specially tailored versions of rule-based MT systems are in everyday use in large international organizations such as the European Union.
However, the development of such translation systems was limited by the huge manual effort required. This situation changed dramatically with the advent of Statistical MT in the 1990s. Its input is a large amount of human-translated text (i.e. parallel texts in the source and target languages). From these parallel texts the computer derives the bilingual dictionary automatically, cuts the parallel sentences into pieces, and re-assembles the pieces when translating a new sentence.
Within this new paradigm, a new translation system can be built within a few weeks, provided enough high-quality text is available for the desired language pair. Our experiments indicate that a collection of 10 million words of translated text is a good starting point. If more text is available, the translation quality will improve. Google Translate with its many language pairs is an example of this new MT paradigm.
MT systems work better when they are tuned for specific textual domains. For example, if the system is trained on technical user manuals, it will work best on such manuals. The goal of our project is to build a Statistical MT system for alpine texts.
We will first focus on reports of mountaineering expeditions. Our input data come from the yearbooks of the Swiss Alpine Club (SAC), which has published translated articles in French and German since 1957. We are currently digitizing these yearbooks (www.textberg.ch) and will be the first to investigate this parallel text collection. We estimate that we will be able to extract around 5 million translated words plus 30 million words of monolingual text. The challenge thus lies in combining the translated and untranslated parts in innovative ways in order to create a high-quality translation system.