Project

Improving the coherence of machine translation output by modeling intersentential relations

English title: Improving the coherence of machine translation output by modeling intersentential relations
Applicant: Popescu-Belis Andrei
Number: 127510
Funding scheme: Sinergia
Research institution: IDIAP Institut de Recherche
Institution of higher education: Idiap Research Institute - IDIAP
Main discipline: Computer science
Start/End: 01.03.2010 - 31.07.2013
Approved amount: 869'019.00

All disciplines (2)

Discipline
Computer science
Other languages

Keywords (8)

machine translation; statistical language processing; textual cohesion; synchronous parsing; intersentential relations; discourse; corpus linguistics; statistical parsing

Lay Summary (English)

Lead
Lay summary
The goal of COMTIS is to extend the current statistical machine translation paradigm by modeling intersentential relations. COMTIS involves researchers in human language technology, machine learning, linguistics, and system evaluation, coming from three different groups with extensive contributions to the relevant fields.
Machine translation (MT) has made significant progress in the past decade, but its focus has remained on the translation of sentences considered individually. However, in order to ensure overall coherence throughout a translated text, an MT system must also consider and correctly render the items that depend on intersentential relations. The perceived coherence of a translated text, and therefore its overall quality, are mainly influenced by the following markers: pronouns, verb tense/mode/aspect, discourse connectives, and politeness/style/register. None of these markers can be reliably translated on a purely sentence-by-sentence basis.
In COMTIS, linguistic theory and corpus studies will provide the ground for a detailed study of a number of cohesion markers. Methods from corpus linguistics will be used to assess which cohesion markers have the most impact on the perceived coherence of a translated text. This will provide information about their most suitable representation, the most robust features for automatic identification, as well as their translation (English/French). Monolingual and parallel corpora will be prepared, to be used as training data or as test suites.
Automatic labeling modules will identify intersentential relations, using surface features and labels inspired by the linguistic studies of cohesion markers, as well as features obtained from joint syntactic parsing and semantic role analysis. New SMT models will be developed and trained over parallel corpora enriched with the labels defined above, based on state-of-the-art phrase-based SMT models extended to exploit intersentential relations.
Metrics that assess the improvement in the coherence of MT output will be designed in a principled way. The performance of past systems and of those resulting from COMTIS will be assessed using the new metrics and current sentence-specific ones.
Last updated: 21.02.2013

Responsible applicant and co-applicants

Employees

Publications

Publication
Proceedings of the ACL Workshop on Discourse in Machine Translation (DiscoMT 2013)
Bonnie Webber, Katja Markert, Joerg Tiedemann, Andrei Popescu-Belis (ed.) (2013), Proceedings of the ACL Workshop on Discourse in Machine Translation (DiscoMT 2013), Association for Computational Linguistics, Sofia, Bulgaria.
Annotating the meaning of discourse connectives by looking at their translation: The translation-spotting technique
Cartoni B., Zufferey S., Meyer T. (2013), Annotating the meaning of discourse connectives by looking at their translation: The translation-spotting technique, in Dialogue & Discourse, (Beyond sem), 65-86.
"Jusqu'où les temps verbaux sont-ils procéduraux?"
Grisot C., Moeschler J., Cartoni B. (2012), "Jusqu'où les temps verbaux sont-ils procéduraux?", in Nouveaux cahiers de linguistique française, 232-250.
"Une description bilingue des temps verbaux: étude contrastive en corpus"
Grisot C., Cartoni B. (2012), "Une description bilingue des temps verbaux: étude contrastive en corpus", in Nouveaux cahiers de linguistique française, 119-139.
Using Sense-labeled Discourse Connectives for Statistical Machine Translation
Meyer Thomas, Popescu-Belis Andrei (2012), Using Sense-labeled Discourse Connectives for Statistical Machine Translation, in EACL 2012 Workshop on Hybrid Approaches to Machine Translation (HyTra), Proceedings of the EACL-2012 Workshop on Hybrid Approaches to Machine Translation (HyTra), Avignon, FR.
Negation and lexical morphology across languages: insights from a trilingual translation corpus
Cartoni Bruno, Lefer Marie-Aude (2012), Negation and lexical morphology across languages: insights from a trilingual translation corpus, in Poznan Studies in Contemporary Linguistics, Special Issue on English Word-Formation in Contras, 795-843.
Improving MT coherence through text-level processing of input texts: the COMTIS project
Cartoni Bruno, Gesmundo Andrea, Henderson James, Grisot Cristina, Merlo Paola, Meyer Thomas, Moeschler Jacques, Zufferey Sandrine, Popescu-Belis Andrei (2011), Improving MT coherence through text-level processing of input texts: the COMTIS project, in Proceedings of Tralogy, Session 6, Traduction et traitement automatique des langues (TAL), online, Proceedings of the Tralogy conference 2011, Paris, FR.
A contrastive analysis of English and French causal connectives
Zufferey Sandrine, Cartoni Bruno, Meyer Thomas (2011), A contrastive analysis of English and French causal connectives, in LPTS 2011 (2nd Int. Conf. on Linguistic and Psycholinguistic Approaches to Text Structuring), Proceedings of the 2nd Int. Conf. on Linguistic and Psycholinguistic Approaches to Text Structuring, Louvain-la-Neuve, BE.
"Car, Parce Que, Puisque" Revisited: Three Empirical Studies on French Causal Connectives
Zufferey Sandrine (2011), "Car, Parce Que, Puisque" Revisited: Three Empirical Studies on French Causal Connectives, in Journal of Pragmatics, 44(2), 138-153.
A Corpus-based Contrastive Analysis for Defining Minimal Semantics of Inter-sentential Dependencies for Machine Translation
Meyer Thomas, Popescu-Belis Andrei, Liyanapathirana Jeevanthi, Cartoni Bruno (2011), A Corpus-based Contrastive Analysis for Defining Minimal Semantics of Inter-sentential Dependencies for Machine Translation, in GSCL-2011 Workshop "Contrastive Linguistics - Translation Studies - Machine Translation - What can w, GSCL-2011 Workshop "Contrastive Linguistics - Translation Studies - Machine Translation", Hamburg, DE.
Heuristic Search for Non-Bottom-Up Tree Structure Prediction
Gesmundo Andrea, Henderson James (2011), Heuristic Search for Non-Bottom-Up Tree Structure Prediction, in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP), Edinburgh, UK.
Building 'directional corpora' for unbiased contrastive analysis
Cartoni Bruno, Meyer Thomas (2011), Building 'directional corpora' for unbiased contrastive analysis, in Corpus Linguistics 2011, Proceedings of the Corpus Linguistics conference 2011, Birmingham, UK.
Disambiguating discourse connectives using parallel corpora: senses vs. translations
Meyer Thomas, Roze Charlotte, Cartoni Bruno, Danlos Laurence, Zufferey Sandrine, Popescu-Belis Andrei (2011), Disambiguating discourse connectives using parallel corpora: senses vs. translations, in Corpus Linguistics 2011, Proceedings of the Corpus Linguistics conference 2011, Birmingham, UK.
How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives
Cartoni Bruno, Zufferey Sandrine, Meyer Thomas, Popescu-Belis Andrei (2011), How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives, in 4th Workshop on Building and Using Comparable Corpora, Proceedings of 4th Workshop on Building and Using Comparable Corpora, Portland, OR.
Disambiguating Temporal-Contrastive Discourse Connectives for Machine Translation
Meyer Thomas (2011), Disambiguating Temporal-Contrastive Discourse Connectives for Machine Translation, in Proceedings of ACL-HLT 2011 Student Session, Portland, OR.
Multilingual Annotation and Disambiguation of Discourse Connectives for Machine Translation
Meyer Thomas, Popescu-Belis Andrei, Zufferey Sandrine, Cartoni Bruno (2011), Multilingual Annotation and Disambiguation of Discourse Connectives for Machine Translation, in 12th SIGdial Meeting on Discourse and Dialogue, Proceedings of 12th SIGdial Meeting on Discourse and Dialogue, Portland, OR.
A Corpus-Based Multilingual Model of Semantic-Pragmatic Description of Verb Tenses for the Improvement of MT
Grisot Cristina (2011), A Corpus-Based Multilingual Model of Semantic-Pragmatic Description of Verb Tenses for the Improvement of MT, in Chronos 10 (10th International Conference on Tense, Aspect, Modality and Evidentiality), Proceedings of the 10th Int. Conf. on Tense, Aspect, Modality and Evidentiality (Chronos 10), Birmingham, UK.
How much are verbal tenses conceptual or procedural?
Moeschler Jacques, Grisot Cristina, Cartoni Bruno (2011), How much are verbal tenses conceptual or procedural?, in Chronos 10 (10th International Conference on Tense, Aspect, Modality and Evidentiality), Proceedings of the 10th Int. Conf. on Tense, Aspect, Modality and Evidentiality (Chronos 10), Birmingham, UK.
Faster Cube Pruning
Gesmundo Andrea, Henderson James (2010), Faster Cube Pruning, in Proceedings of the 7th International Workshop on Spoken Language Translation (IWSLT 2010), Paris, FR.
Detecting Narrativity to Improve English to French Translation of Simple Past Verbs
Meyer T., Grisot C., Popescu-Belis A., Detecting Narrativity to Improve English to French Translation of Simple Past Verbs, in Proceedings of the 1st DiscoMT Workshop at ACL 2013 (51st Annual Meeting of the Association for Comp.
Heuristic Cube Pruning in Linear Time
Gesmundo A., Satta G., Henderson J., Heuristic Cube Pruning in Linear Time, in Proceedings of ACL 2012.
Lemmatising Serbian as Category Tagging with Bidirectional Sequence Classification
Gesmundo A., Samardzic T., Lemmatising Serbian as Category Tagging with Bidirectional Sequence Classification, in Proceedings of LREC 2012.
A multifactorial analysis of explicitation in translation
Zufferey S., Cartoni B., A multifactorial analysis of explicitation in translation, in Target.
Annotating the meaning of discourse connectives in multilingual corpora
Zufferey S., Degand L., Annotating the meaning of discourse connectives in multilingual corpora, in Corpus Linguistics and Linguistic Theory.
Are ACT’s scores increasing with better translation quality?
Hajlaoui N., Are ACT’s scores increasing with better translation quality?, in Proceedings of the 8th Workshop on Statistical Machine Translation (WMT) at ACL 2013.
Assessing the Accuracy of Discourse Connective Translations: Validation of an Automatic Metric.
Hajlaoui N., Popescu-Belis A., Assessing the Accuracy of Discourse Connective Translations: Validation of an Automatic Metric., in Proceedings of the 1st DiscoMT Workshop at ACL 2013 (51st Annual Meeting of the Association for Com.
Cross-linguistic variation in verb tenses: conceptual and procedural information
Grisot C., Cross-linguistic variation in verb tenses: conceptual and procedural information, in 19th International Congress of Linguistics.
Disambiguation of Tenses for Machine Translation: A referential and feature-based approach on the conceptual/procedural distinction
Grisot C., Moeschler J., Cartoni B., Disambiguation of Tenses for Machine Translation: A referential and feature-based approach on the conceptual/procedural distinction, in EPICS V Pragmatics Symposium.
Discourse-level Annotation over Europarl for Machine Translation: Connectives and Pronouns
Popescu-Belis Andrei, Meyer Thomas, Liyanapathirana Jeevanthi, Cartoni Bruno, Zufferey Sandrine, Discourse-level Annotation over Europarl for Machine Translation: Connectives and Pronouns, in LREC 2012, Proceedings of the eighth international conference on Language Resources and Evaluation (LREC), Istanbul, TR.
Empirical validations of multilingual annotation schemes for discourse relations
Zufferey S., Degand L., Popescu-Belis A., Sanders T., Empirical validations of multilingual annotation schemes for discourse relations, in Proceedings of ISA-8 (8th Workshop on Interoperable Semantic Annotation).
English and French causal connectives in contrast
Zufferey S., Cartoni B., English and French causal connectives in contrast, in Languages in Contrast.
Extracting Directional and Comparable Corpora from a Multilingual Corpus for Translation Studies
Cartoni Bruno, Meyer Thomas, Extracting Directional and Comparable Corpora from a Multilingual Corpus for Translation Studies, in LREC 2012, Proceedings of the eighth international conference on Language Resources and Evaluation (LREC), Istanbul, TR.
HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce
Gesmundo A., Tomeh N., HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce, in Proceedings of EACL 2012.
Implicitation of Discourse Connectives in (Machine) Translation
Meyer T., Webber B., Implicitation of Discourse Connectives in (Machine) Translation, in Proceedings of the 1st DiscoMT Workshop at ACL 2013 (51st Annual Meeting of the Association for Comp.
Lemmatisation as a Tagging Task
Gesmundo A., Samardzic T., Lemmatisation as a Tagging Task, in Proceedings of ACL 2012.
Machine Translation of Labeled Discourse Connectives
Meyer T., Popescu-Belis A., Hajlaoui N., Gesmundo A., Machine Translation of Labeled Discourse Connectives, in Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Ameri.
Machine Translation with Many Manually Labeled Discourse Connectives
Meyer T., Polakova L., Machine Translation with Many Manually Labeled Discourse Connectives, in Proceedings of the 1st DiscoMT Workshop at ACL 2013 (51st Annual Meeting of the Association for Comp.
The ALLEGRA corpus: a trilingual resource for Romansh, an under-represented language of Switzerland
Scherrer Yves, Cartoni Bruno, The ALLEGRA corpus: a trilingual resource for Romansh, an under-represented language of Switzerland, in Proceedings of LREC 2012, Proceedings of the eighth international conference on Language Resources and Evaluation (LREC), Istanbul, TR.
Translating English Discourse Connectives into Arabic: a Corpus-based Analysis and an Evaluation Metric
Hajlaoui N., Popescu-Belis A., Translating English Discourse Connectives into Arabic: a Corpus-based Analysis and an Evaluation Metric, in Proceedings of the CAASL4 Workshop at AMTA 2012 (Fourth Workshop on Computational Approaches to Arab.
Using the Europarl corpus for linguistic research
Cartoni B., Zufferey S., Meyer T., Using the Europarl corpus for linguistic research, in Belgian Journal of Linguistics.

Collaboration

Group / person Country
Types of collaboration
University of Edinburgh Great Britain and Northern Ireland (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
- Exchange of personnel

Scientific events

Active participation

Title Type of contribution Title of article or contribution Date Place Persons involved
EPICS V Pragmatics Symposium 15.03.2012 Sevilla

Self-organised

Title Date Place

Awards

Title Year
SNSF Ambizione grant for the MULDIS project 2013
Marie Curie IEF grant for the DISCOM project 2011

Associated projects

Number Title Start Funding scheme
113382 Pragmatique lexicale et non-lexicale de la causalité en français: aspects descriptifs, théoriques et expérimentaux 01.04.2007 Project funding (Div. I-III)
148024 MULDIS - Discourse Connectives: From Multiple Languages to Multilingual Minds 01.04.2014 Ambizione
138140 Inducing Semantic Representations from Multiple Data Sources 01.10.2011 Project funding (Div. I-III)
122643 Corpus-based Explorations of Cross-linguistic Syntactic and Semantic Role Parallelism 01.12.2008 Project funding (Div. I-III)
141903 SIWIS: Spoken Interaction with Interpretation in Switzerland 01.12.2012 Sinergia
103318 Quality models and resources for the evaluation of machine translation 01.10.2004 Project funding (Div. I-III)
137380 Towards a multilingual database of connectives 01.08.2011 International Exploratory Workshops
147653 MODERN: Modeling discourse entities and relations for coherent machine translation 01.08.2013 Sinergia

Abstract

Machine translation (MT) has made significant progress in the past decade, but its focus has remained on the translation of sentences considered individually. However, in order to ensure overall coherence throughout a translated text, an MT system must also consider and correctly render the items that depend on intersentential relations. The perceived coherence of a translated text, and therefore its overall quality, are mainly influenced by the following markers: pronouns, verb tense/mode/aspect, discourse connectives, and politeness/style/register. None of these markers can be reliably translated on a purely sentence-by-sentence basis. This project aims at extending the current statistical MT (SMT) approach by modeling these intersentential dependencies (ISDs), along the following five themes.
Theme 1: Linguistic analysis. Linguistic theory and empirical studies based on corpora will provide the ground for a detailed study of the above-mentioned linguistic items. The study will provide clues about the most suitable representations for these dependencies and the most robust features for their automatic identification, as well as their translation, focusing mainly on the English/French pair. Methods from corpus linguistics will be used to assess which dependencies have the most impact on the perceived coherence of a translated text.
Theme 2: Corpus data, annotation and test suites. Monolingual and parallel corpora to support the empirical work in Theme 1 will be prepared by semi-automatically extracting and annotating examples containing the cohesion markers under study. Annotated corpora will be used as training data (in Themes 3 and 4) and as test suites (in Theme 5) to evaluate the systems resulting from the project.
Theme 3: Automatic identification of intersentential dependencies. In order to enrich the current SMT approach with information about ISDs, embodied in cohesion markers, we will implement automatic labeling modules that disambiguate the items studied under Theme 1, using features and labels inspired by the linguistic studies. These modules will use mostly surface features, as well as features obtained from joint syntactic parsing and semantic role analysis, to insert labels into the text to be translated, thus disambiguating certain lexical items in preparation for an SMT engine (Theme 4). The intersentential processing methods will be evaluated here in terms of their intrinsic performance.
Theme 4: Statistical machine translation for ISD-labeled texts. Research under this theme will develop new SMT models that are trained over parallel corpora enriched with the labels defined in Theme 1 and produced in Themes 2 and 3. The first objective of this theme will be the empirical evaluation of the usefulness of ISD labeling in machine translation. For this objective, we will extend state-of-the-art phrase-based SMT models to exploit ISD annotations. The second objective is to develop novel SMT methods which better incorporate linguistic generalizations about ISDs, using synchronous parsing techniques.
Theme 5: Evaluation methods for MT coherence and their application. The main objective under this theme is to design and apply metrics that assess the improvement in the coherence of MT output, in a principled way, focusing on intrinsic quality and on usefulness in specific contexts. The performance of past systems will be assessed both using the new metrics and current sentence-specific ones. Test suites will also be used to automate evaluation.
The proposed project involves researchers in human language technology, machine learning, linguistics, and system evaluation, coming from three different groups with extensive contributions to the relevant fields. Their collaboration is grounded in several previous joint achievements, and will lead to the design of a robust, operational system. The project will thus significantly boost the dynamics of Swiss research in MT and will help position it more firmly within the European and international community.