
Efficient Exploration of Translation Variants in Large Multiparallel Corpora Using a Relational Database

Type of publication Peer-reviewed
Publication form Proceedings (peer-reviewed)
Publication date 2016
Author Graën Johannes, Clematide Simon, Volk Martin
Project Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation

Proceedings (peer-reviewed)

Title of proceedings Proceedings of the 4th Workshop on the Challenges in the Management of Large Corpora
Place Portoroz

Open Access

Type of Open Access Repository (Green Open Access)


We present an approach for searching and exploring translation variants of multi-word units in large multiparallel corpora, based on a relational database management system. Our web-based application Multilingwis allows multilingual lookups of phrases and words in English, French, German, Italian and Spanish. It is of interest to anybody who wants to quickly compare expressions across several languages, such as language learners without linguistic knowledge. In this paper, we focus on the technical aspects of how to represent and efficiently retrieve all occurrences that match the user’s query in any one of the five languages, together with their translations into the other four languages. To identify such translations in our corpus of 220 million tokens in total, we use statistical sentence and word alignment. By using materialized views, composite indexes, and pre-planned search functions, our relational database management system handles large result sets with only moderate demands on the underlying hardware. As our systematic evaluation on 200 search terms per language shows, we achieve retrieval times below 1 second in 75 % of the cases for multi-word expressions.
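The combination of precomputed results and composite indexes mentioned in the abstract can be illustrated with a minimal sketch. It is not the Multilingwis schema: the table names (`token`, `alignment`, `variant`), the column names, and the sample data are all hypothetical. It uses SQLite, which has no materialized views, so a table built once with `CREATE TABLE ... AS SELECT` stands in for one.

```python
import sqlite3


def build_demo_db():
    """Build a tiny in-memory corpus (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE token (
            id    INTEGER PRIMARY KEY,
            lang  TEXT NOT NULL,
            lemma TEXT NOT NULL
        );
        CREATE TABLE alignment (
            src_id INTEGER NOT NULL REFERENCES token(id),
            tgt_id INTEGER NOT NULL REFERENCES token(id)
        );
    """)
    tokens = [
        (1, "en", "house"), (2, "de", "Haus"), (3, "fr", "maison"),
        (4, "en", "house"), (5, "de", "Haus"), (6, "fr", "domicile"),
    ]
    # Word alignment links from an English token to its translations.
    links = [(1, 2), (1, 3), (4, 5), (4, 6)]
    conn.executemany("INSERT INTO token VALUES (?, ?, ?)", tokens)
    conn.executemany("INSERT INTO alignment VALUES (?, ?)", links)

    # Stand-in for a materialized view: the join and aggregation are
    # computed once and stored, so lookups avoid repeating them.
    conn.executescript("""
        CREATE TABLE variant AS
        SELECT s.lang AS src_lang, s.lemma AS src_lemma,
               t.lang AS tgt_lang, t.lemma AS tgt_lemma,
               COUNT(*) AS freq
        FROM alignment a
        JOIN token s ON s.id = a.src_id
        JOIN token t ON t.id = a.tgt_id
        GROUP BY s.lang, s.lemma, t.lang, t.lemma;

        -- Composite index covering the columns used by the lookup below.
        CREATE INDEX variant_lookup
            ON variant (src_lang, src_lemma, tgt_lang);
    """)
    return conn


def translation_variants(conn, lang, lemma):
    """Return (target language, target lemma, frequency) rows for a query."""
    return conn.execute(
        "SELECT tgt_lang, tgt_lemma, freq FROM variant "
        "WHERE src_lang = ? AND src_lemma = ? "
        "ORDER BY tgt_lang, freq DESC, tgt_lemma",
        (lang, lemma),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for row in translation_variants(db, "en", "house"):
        print(row)
```

In a database system with native materialized views, the `variant` table would be declared as such and refreshed when the corpus changes. The index column order follows the lookup's equality predicates, which is the usual reason to prefer a composite index over several single-column ones.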