Publication
Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora
Type of publication
Peer-reviewed
Form of publication
Proceedings (peer-reviewed)
Publication date
2015
Author
Graën Johannes, Clematide Simon
Project
Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation
Proceedings (peer-reviewed)
Title of proceedings
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-3)
Place
Lancaster
Open Access
URL
http://www.zora.uzh.ch/111877/
Type of Open Access
Repository (Green Open Access)
Abstract
The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, if we can exploit the data properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, lemmatisation, chunking, and dependency parsing facilitates precise querying of linguistic properties and can be used to extend word alignment to sub-sentential groups. Such highly inter-connected data is stored in a relational database to allow for efficient retrieval and linguistic data mining, which may include the statistics-based selection of good example sentences. The varying information needs of contrastive linguists require a flexible linguistic query language for ad hoc searches. Such queries in the format of generalised treebank query languages will be automatically translated into SQL queries.
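To illustrate the kind of relational storage and SQL retrieval the abstract describes, here is a minimal sketch in Python using the standard `sqlite3` module. The schema (`token`, `word_alignment`), the sample tokens, and the helper `aligned_lemmas` are assumptions made for this illustration only; they are not taken from the publication and do not reflect the authors' actual database design or query translation.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical token/alignment schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE token (
            id          INTEGER PRIMARY KEY,
            sentence_id INTEGER NOT NULL,
            lang        TEXT NOT NULL,
            position    INTEGER NOT NULL,
            form        TEXT NOT NULL,
            lemma       TEXT NOT NULL,
            pos         TEXT NOT NULL
        );
        -- One row per word alignment link between two tokens.
        CREATE TABLE word_alignment (
            src_token INTEGER NOT NULL REFERENCES token(id),
            tgt_token INTEGER NOT NULL REFERENCES token(id)
        );
    """)
    # Invented sample data: one aligned sentence in three languages.
    tokens = [
        (1, 1, "en", 1, "The", "the", "DET"),
        (2, 1, "en", 2, "houses", "house", "NOUN"),
        (3, 2, "de", 1, "Die", "die", "DET"),
        (4, 2, "de", 2, "Häuser", "Haus", "NOUN"),
        (5, 3, "fr", 1, "Les", "le", "DET"),
        (6, 3, "fr", 2, "maisons", "maison", "NOUN"),
    ]
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?, ?, ?, ?)", tokens)
    alignments = [(1, 3), (2, 4), (1, 5), (2, 6)]
    conn.executemany("INSERT INTO word_alignment VALUES (?, ?)", alignments)
    return conn


def aligned_lemmas(conn, src_lang, src_lemma, src_pos, tgt_lang):
    """Return target-language lemmas aligned to a source lemma with a given POS tag."""
    sql = """
        SELECT t.lemma
        FROM token AS s
        JOIN word_alignment AS a ON a.src_token = s.id
        JOIN token AS t ON t.id = a.tgt_token
        WHERE s.lang = ? AND s.lemma = ? AND s.pos = ? AND t.lang = ?
        ORDER BY t.id
    """
    rows = conn.execute(sql, (src_lang, src_lemma, src_pos, tgt_lang))
    return [row[0] for row in rows]


conn = build_demo_db()
print(aligned_lemmas(conn, "en", "house", "NOUN", "de"))
```

In this sketch, combining a linguistic constraint (lemma and part of speech) with an alignment constraint becomes a pair of joins over the token table, which is the kind of SQL a translator from a treebank-style query language might plausibly emit.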