On-demand Knowledge for Document-level Machine Translation (DOMAT)

Applicant Popescu-Belis Andrei
Number 175693
Funding scheme Project funding (Div. I-III)
Research institution IICT informatique et télécommunication HEIG-VD
Institution of higher education University of Applied Sciences and Arts Western Switzerland - HES-SO
Main discipline Information Technology
Start/End 01.09.2018 - 31.08.2022
Approved amount CHF 552'127.00

All Disciplines (2)

Discipline
Information Technology
Applied linguistics

Keywords (6)

neural networks; coreference resolution; machine translation; discourse features; deep learning; text analysis

Lay Summary (translated from French)

Lead
Machine translation systems have reached appreciable levels of quality, thanks first to statistical word n-gram models and then to deep neural networks trained on very large amounts of data (previously translated texts). However, the quality of machine translation of complete texts still falls short of human translation, largely because of the local analysis strategies these systems use. Integrating heterogeneous modules is a considerable challenge for current approaches, which model texts in a uniform way.
Lay summary

This project aims to integrate algorithms that take into account dependencies of varying range. We will study an approach that applies, on demand during the machine translation process, several types of constraints at various levels. In doing so, we aim to integrate results from our previous work, which produced processing modules specific to certain types of inter-sentence relations and which improve the translation of discourse connectives, pronouns, and verb tenses, as well as lexical coherence. We will develop a strategy to invoke these modules whenever the translation produced by the statistical core appears questionable according to quality-estimation metrics.

The solutions we develop will provide a principled approach to correcting errors caused by insufficient analysis of long-range dependencies between sentences. They will show how complex linguistic phenomena can be learned by such systems and used simultaneously, without side effects, to improve the quality of a text. These solutions can meet the growing needs of the translation and localization industry.

Last update: 18.05.2018

Associated projects

Number Title Start Funding scheme
127510 Improving the coherence of machine translation output by modeling intersentential relations 01.03.2010 Sinergia
147653 MODERN: Modeling discourse entities and relations for coherent machine translation 01.08.2013 Sinergia

Abstract

Statistical and neural machine translation systems (in short, SMT and NMT) have reached significant levels of quality and speed. Such systems use large amounts of monolingual and bilingual data to train their models and tune their meta-parameters. As a result, translating a sentence from a language unknown to a user can be done with acceptable quality, and hence with a clearly perceived utility. However, the translation of complete texts is still far from publishable and requires substantial post-editing by humans.

One reason for this difference is that certain linguistic constraints cannot be reliably translated using only local information, especially when they apply across different sentences. The homogeneous models used by SMT or NMT systems (very large translation tables or connection weights) are a strength for robust and quick sentence translation, but impose a strong limitation when constraints of different ranges must be taken into account to translate a document.

In recent years, I have pioneered methods to address document-level problems that degrade MT quality, drawing on specific linguistic knowledge as required by each problem. These methods improved the translation of discourse connectives or verb tenses based on sentence-level or document-level semantic features, or constrained the choice of referring expressions such as pronouns and noun phrases. For implementation, the methods took advantage of existing approaches, such as factored models, to integrate linguistic knowledge with SMT. However, integrating several solutions for document-level constraints into a unified system is not tractable with current approaches, because the knowledge sources are heterogeneous and are not tightly coupled with the MT systems. Moreover, the quality improvements brought by leveraging several distinct knowledge sources may not add up, because of interactions between them. Finally, the need to compute all features for all words or sentences of a document raises strong efficiency issues.

In the DOMAT project, we aim to design a novel approach for providing on-demand linguistic knowledge to statistical or neural MT systems. Both types of systems will be considered, to provide terms of comparison, as each has strengths and weaknesses. The linguistic knowledge will be learned by specific processing modules, which will extract and output features in a format usable by SMT and NMT systems. To make this architecture operational, we will explore strategies to trigger the modules, for instance based on quality estimation or translation confidence. To populate the architecture, we will build several modules that extract document-level features relevant to translation, principally document structure (discourse relations) and coreference (including pronominal anaphora). The starting points for these modules will be our previous achievements in document-level SMT.

The DOMAT project will mainly support two PhD theses at Idiap/EPFL: one on designing and comparing statistical and neural architectures for integrating and triggering on-demand knowledge sources in MT, and the other on designing such knowledge sources, which learn specific text-level constraints and output data structures suitable for NMT.

The solutions developed in DOMAT will make the adequate, fluent and efficient translation of large documents tractable, and will result in a principled approach to learning high-level linguistic knowledge that improves translation quality.
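To make the proposed control flow concrete, the following Python sketch illustrates one possible reading of the on-demand architecture: a baseline sentence-level system translates each sentence, a quality-estimation score flags unreliable outputs, and only for those sentences are document-level knowledge modules (coreference, discourse relations) invoked before re-translation. All class names, methods and the threshold below are hypothetical placeholders for illustration, not components of the DOMAT systems.

# Conceptual sketch (not the project's implementation) of on-demand
# document-level knowledge for MT: knowledge modules are invoked only
# for sentences whose baseline translation looks unreliable according
# to a quality-estimation score. BaselineMT, QualityEstimator, the
# modules and QE_THRESHOLD are illustrative assumptions.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sentence:
    source: str
    translation: str = ""
    # Document-level features added on demand (e.g. coreference chains,
    # discourse relations), keyed by the module that produced them.
    features: Dict[str, object] = field(default_factory=dict)


class BaselineMT:
    """Placeholder for a sentence-level SMT/NMT engine."""
    def translate(self, sentence: Sentence, context: List[Sentence]) -> str:
        return sentence.source  # dummy pass-through for the sketch


class QualityEstimator:
    """Placeholder for a reference-free quality-estimation model."""
    def score(self, sentence: Sentence) -> float:
        return 0.5  # dummy constant score


class CoreferenceModule:
    name = "coreference"
    def annotate(self, doc: List[Sentence], i: int) -> object:
        return {"antecedents": []}  # dummy annotation


class DiscourseModule:
    name = "discourse"
    def annotate(self, doc: List[Sentence], i: int) -> object:
        return {"relation": None}  # dummy annotation


QE_THRESHOLD = 0.7  # assumed: below this score, extra knowledge is requested


def translate_document(doc: List[Sentence]) -> List[Sentence]:
    mt, qe = BaselineMT(), QualityEstimator()
    modules = [CoreferenceModule(), DiscourseModule()]
    for i, sent in enumerate(doc):
        sent.translation = mt.translate(sent, doc[:i])
        if qe.score(sent) < QE_THRESHOLD:
            # Low confidence: gather document-level features on demand
            # and re-translate with the extra constraints attached.
            for module in modules:
                sent.features[module.name] = module.annotate(doc, i)
            sent.translation = mt.translate(sent, doc[:i])
    return doc

The interesting design choice this sketch highlights is that document-level analysis is not run uniformly over the whole text; it is requested only where the baseline output is judged unreliable, which is what keeps the approach efficient.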