Publication
Cleaning the Europarl Corpus for Linguistic Applications
Type of publication
Peer-reviewed
Form of publication
Proceedings (peer-reviewed)
Author
Graën Johannes, Volk Martin, Batnic Dolores
Project
Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation
Proceedings (peer-reviewed)
Title of proceedings
Proceedings of KONVENS
Place
Hildesheim
Open Access
URL
http://www.zora.uzh.ch/99005/
Type of Open Access
Repository (Green Open Access)
Abstract
We discovered several recurring errors in the current version of the Europarl Corpus originating both from the web site of the European Parliament and the corpus compilation based thereon. The most frequent error was incompletely extracted metadata leaving non-textual fragments within the textual parts of the corpus files. This is, on average, the case for every second speaker change. We not only cleaned the Europarl Corpus by correcting several kinds of errors, but also aligned the speakers’ contributions of all available languages and compiled everything into a new XML-structured corpus. This facilitates a more sophisticated selection of data, e.g. querying the corpus for speeches by speakers of a particular political group or in particular language combinations.
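To illustrate the kind of selection the abstract mentions, here is a minimal sketch in Python. The XML layout is an assumption made for illustration only: the element names (`session`, `turn`, `text`), the attribute names (`speaker`, `group`, `language`), the sample content, and the helper `speeches` are all hypothetical and are not taken from the released corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: element and attribute names are illustrative
# assumptions, not the schema of the released corpus.
SAMPLE = """\
<session date="2000-01-17">
  <turn speaker="Speaker A" group="PPE-DE">
    <text language="en">Thank you, Madam President.</text>
    <text language="de">Vielen Dank, Frau Präsidentin.</text>
  </turn>
  <turn speaker="Speaker B" group="PSE">
    <text language="en">I support the proposal.</text>
  </turn>
</session>
"""


def speeches(root, group, languages):
    """Yield (speaker, {language: text}) for each turn by a speaker of
    `group` that is available in every language in `languages`."""
    for turn in root.iter("turn"):
        if turn.get("group") != group:
            continue
        # Collect the aligned versions of this contribution by language.
        texts = {t.get("language"): (t.text or "").strip()
                 for t in turn.iter("text")}
        if all(lang in texts for lang in languages):
            yield turn.get("speaker"), {lang: texts[lang] for lang in languages}


root = ET.fromstring(SAMPLE)
result = list(speeches(root, "PPE-DE", ["en", "de"]))
```

Because each speaker turn in this assumed layout groups its aligned language versions together, filtering by political group and by language combination reduces to checking attributes on a single element.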