From parallel corpora to multilingual exercises - Making use of large text collections and crowdsourcing techniques for innovative autonomous language learning applications

Applicant Graën Johannes
Number 184212
Funding scheme Early Postdoc.Mobility
Research institution Grup de Recerca en Aprenentatge i Ensenyamen Departament de Traducció i Ciències del Llen Universitat Pompeu Fabra
Institution of higher education Institution abroad - IACH
Main discipline Applied linguistics
Start/End 01.04.2019 - 30.09.2020
Applied linguistics
Information Technology

NLP; crowdsourcing; language learning; parallel corpora; concordancers; word alignment; ICALL; second language acquisition

Lay Summary (translated from German)

Alongside regular language instruction, computer programs for self-directed language learning can make a valuable contribution. Our project deals with the selection and processing of translations for language learning applications.
Lay summary

Content and goal of the research project

Over recent decades, language teaching has shifted from a focus on learning grammar and vocabulary towards a more intuitive, example-based engagement with the foreign language. Large collections of translated texts, which are increasingly freely available, offer language learners a multitude of examples of word usage and grammar. The project aims to find out which of these translations are best suited to self-directed language learning.

To this end, we will combine findings from other fields with the judgments of language learners and teachers. We will use the properties that a good example should exhibit to automatically locate suitable translations in large text collections and to generate exercises from them.

Scientific and societal context of the research project

Language learning is of interest to society as a whole, both at the international level and, as in the case of Switzerland, at the national level. Our project contributes to the understanding of contrastive language learning and provides methods and data collections for language learning applications and further research.

Zanetti Arianna, Volodina Elena, Graën Johannes (2021), Automatic Generation of Exercises for Second Language Learning from Parallel Corpus Data, in International Journal of TESOL Studies.
Graën Johannes, Alfter David, Schneider Gerold (2020), Using Multilingual Resources to Evaluate CEFRLex for Learner Applications, in Proceedings of the 12th Language Resources and Evaluation Conference (LREC), Marseille, European Language Resources Association (ELRA), Paris, 346-355.
Alfter David, Graën Johannes (2019), Interconnecting lexical resources and word alignment: How do learners get on with particle verbs?, in Proceedings of the 22nd Nordic Conference on Computational Linguistics, Turku, Linköping University Electronic Press, Linköping, 321-326.

Workshop on NLP for Computer Assisted Language Learning. Talk given at a conference: Parallel Corpora as a resource for Data-driven Language Learning. 25.11.2020, online, Sweden. Graën Johannes.
Texttechnologie-Kolloquium FS 2020. Individual talk: Using Corpora for Language Learning. 26.05.2020, online, Switzerland. Graën Johannes.
Research Seminar of the Gr@el Group. Individual talk: De los corpus paralelos a las tareas de aprendizaje de idiomas. 12.12.2019, Barcelona, Spain. Graën Johannes.
Research Seminar of the Language Technology Group at University of Helsinki. Individual talk: Language learning on large parallel corpora: Aggregation and classification. 19.11.2019, Helsinki, Finland. Graën Johannes.
Nordic Conference on Computational Linguistics (NoDaLiDa). Talk given at a conference: Interconnecting lexical resources and word alignment: How do learners get on with particle verbs? 01.10.2019, Turku, Finland. Graën Johannes.
Corpus Textuais: Teoría e Práctica. Individual talk: La utilidad de los corpus paralelos para aprendices de lenguas extranjeras. 01.07.2019, Santiago de Compostela, Spain. Graën Johannes.
XI Congreso Internacional de Lingüística del Corpus. Talk given at a conference: Parallel Corpus Examples for Language Learning Applications. 15.05.2019, Valencia, Spain. Graën Johannes.
CLT (Centre for Language Technology) Retreat 2019. Individual talk: From parallel corpora to multilingual exercises. 07.05.2019, Bohusgården, Sweden. Graën Johannes.
3rd Annual Meeting of the European Network for Combining Language Learning with Crowdsourcing Techniques. Poster: From parallel corpora to bilingual language learning exercises. 01.04.2019, Lisbon, Portugal. Graën Johannes.


Corpus linguistics has a long history; even before computers were invented, people analyzed written texts to gain insights into language use. Today, corpora are increasingly used in language learning: Language teachers and lexicographers search for examples that, for instance, best illustrate the use of a particular expression while being self-contained, using plain language and not containing elements that draw attention away from the expression in question. Language learners, once instructed in corpus search techniques, can autonomously explore numerous authentic language examples and confirm or refute their assumptions. This technique is often referred to as data-driven learning (DDL).

Recent research is concerned with the automatic conversion of corpus examples into exercises in computer-assisted language learning (CALL) applications. State-of-the-art natural language processing (NLP) methods provide a linguistic analysis of the corpus material. Research on technology use in second language acquisition (SLA) has shown that the use of corpora in language learning improves learning efficiency.

Most work on corpus use in language learning applications focuses on monolingual corpora. Very few parallel corpora with standard language content (in contrast to translated legal texts etc.) are publicly available. Among those, the recently curated OpenSubtitles2016 corpus stands out in terms of its size and the number of available languages. The goal of the proposed project is to exploit this corpus (and possibly other corpora) to extract those translations that benefit language learners most.

To this end, I will apply NLP methods to extract linguistic structures, such as syntactic relations, semantic relations, in particular word embeddings (Karlgren and Sahlgren 2001), and correspondences between words and multiword units of the respective languages by combining word alignment with other annotation layers.
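As a rough illustration of how word alignment can yield correspondences between words and multiword units, the following minimal sketch groups alignment links into connected components. It assumes that alignments are given as "i-j" index pairs (source token index, target token index), a format commonly produced by word aligners; the function names `parse_alignment` and `correspondences` and the example sentence pair are hypothetical and not taken from the project.

```python
from collections import defaultdict


def parse_alignment(pharaoh: str) -> list[tuple[int, int]]:
    """Parse space-separated 'i-j' links (source index, target index)."""
    links = []
    for link in pharaoh.split():
        i, j = link.split("-")
        links.append((int(i), int(j)))
    return links


def correspondences(src_tokens, tgt_tokens, links):
    """Group links into connected components, so that one-to-many and
    many-to-many links yield multiword correspondences."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Source and target positions are distinct nodes: ("s", i) and ("t", j).
    for i, j in links:
        parent[find(("s", i))] = find(("t", j))

    groups = defaultdict(lambda: (set(), set()))
    for i, j in links:
        src, tgt = groups[find(("s", i))]
        src.add(i)
        tgt.add(j)

    # Order correspondences by their first source position.
    ordered = sorted(groups.values(), key=lambda g: min(g[0]))
    return [
        (
            " ".join(src_tokens[i] for i in sorted(src)),
            " ".join(tgt_tokens[j] for j in sorted(tgt)),
        )
        for src, tgt in ordered
    ]


# Hypothetical example: an English particle verb and its German counterpart.
en = ["She", "gave", "up", "smoking"]
de = ["Sie", "hörte", "mit", "dem", "Rauchen", "auf"]
pairs = correspondences(en, de, parse_alignment("0-0 1-1 1-5 2-5 3-4"))
```

In this toy example, the discontinuous German verb and particle end up in one correspondence with the English particle verb, while unaligned tokens are left out. A real pipeline would combine such groups with further annotation layers (for example part-of-speech tags or syntactic relations), which this sketch does not attempt.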
I expect that, in line with existing monolingual approaches, features such as lexical complexity and syntactic structures in both languages, as well as the kind of correspondence at the lexical and syntactic level, alongside simpler features such as sentence length and the use of anaphora, need to be taken into account to determine the suitability of a corpus example for language learning purposes. Surveys of language teachers will be used to adjust the ranking of corpus examples.

The generation of automatically evaluated language learning exercises poses a major problem: How can we reliably judge a given answer to be wrong? A common approach is not to allow free-form answers, but to provide a choice of possible answers: one that is known to be correct, and several so-called distractors, chosen to be plausible while guaranteed to be incompatible with the given context. I will adapt this kind of exercise (among others) to parallel, word-aligned data, which opens up numerous new prospects in terms of answers and distractors (e.g., parallel answers consisting of the correct answer in one language and a distractor in the other). The exercises will be automatically generated by an application prototype that allows its users to give explicit feedback on each exercise. In addition, the application will log all user interactions. The observations crowdsourced in this way will allow us to study user interaction with our application and help to improve it.

The project comprises the curation of a parallel corpus for several languages and the extraction of parallel (if possible also multiparallel) corpus examples that are suitable for language learning purposes. Both data sets will be made available so that other applications in corpus linguistics and CALL can make use of them.

I expect that the crowdsourcing data will help to tailor example selection and exercise generation to the needs of language learners and provide a basis for the further development of CALL applications.
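The multiple-choice idea described above can be sketched as follows. This is only a simplified illustration under stated assumptions: the sentence in the language the learner already knows serves as a cue, one aligned target-language token is blanked out, and distractors are sampled from a hand-supplied candidate list. The names `Exercise` and `make_exercise`, the `acceptable` parameter and the example data are hypothetical. In particular, the sketch only excludes candidates that are known to be acceptable; it does not by itself guarantee that a distractor is incompatible with the context.

```python
import random
from dataclasses import dataclass


@dataclass
class Exercise:
    cue: str            # sentence in the language the learner already knows
    gapped: str         # target-language sentence with the answer blanked out
    options: list[str]  # shuffled answer options (answer plus distractors)
    answer: str


def make_exercise(src_tokens, tgt_tokens, tgt_index, candidates,
                  acceptable=(), n_distractors=3, rng=None):
    """Build a gap-fill multiple-choice exercise from one aligned sentence pair.

    `candidates` is a pool of possible distractors; `acceptable` lists further
    answers that would also fit the gap and therefore must not be offered
    as distractors (both are assumptions of this sketch).
    """
    rng = rng or random.Random()
    answer = tgt_tokens[tgt_index]
    excluded = {answer.lower()} | {a.lower() for a in acceptable}
    pool = sorted({c for c in candidates if c.lower() not in excluded})
    if len(pool) < n_distractors:
        raise ValueError("not enough distractor candidates")
    options = rng.sample(pool, n_distractors) + [answer]
    rng.shuffle(options)
    gapped = " ".join(
        "____" if k == tgt_index else tok for k, tok in enumerate(tgt_tokens)
    )
    return Exercise(" ".join(src_tokens), gapped, options, answer)


# Hypothetical example: English cue, German sentence with a gap for "Rauchen".
exercise = make_exercise(
    ["She", "gave", "up", "smoking"],
    ["Sie", "hörte", "mit", "dem", "Rauchen", "auf"],
    tgt_index=4,
    candidates=["Trinken", "Rauchen", "Laufen", "Lesen", "Essen"],
    rng=random.Random(0),
)
```

Passing a seeded `random.Random` makes the generated options reproducible, which may be convenient when logged user interactions are later compared across learners. Parallel answers that pair a correct answer in one language with a distractor in the other would need an extended option type and are not covered here.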