Project

From parallel corpora to multilingual exercises - Making use of large text collections and crowdsourcing techniques for innovative autonomous language learning applications

Applicant: Graën Johannes
Number: 184212
Funding scheme: Early Postdoc.Mobility
Research institution: Grup de Recerca en Aprenentatge i Ensenyamen Departament de Traducció i Ciències del Llen Universitat Pompeu Fabra
Institution of higher education: Institution abroad - IACH
Main discipline: Applied linguistics
Start/End: 01.04.2019 - 30.09.2020

All Disciplines (2)

Discipline
Applied linguistics
Information Technology

Keywords (8)

NLP; crowdsourcing; language learning; parallel corpora; concordancers; word alignment; ICALL; second language acquisition

Lay Summary (translated from German)

Lead
Alongside regular language instruction, computer programs for self-directed language learning can make a valuable contribution. Our project deals with the selection and processing of translations for language learning applications.
Lay summary

Content and aim of the research project

Over recent decades, language teaching has shifted from a focus on learning grammar and vocabulary towards a more intuitive, example-based approach to the foreign language. Large collections of translated texts, which are increasingly freely available, offer language learners a wealth of examples of word usage and grammar. The project aims to find out which of these translations are best suited to self-directed language learning.

To this end, we will combine findings from other fields with the judgments of language learners and teachers. We will use the properties that a good example should have to automatically locate suitable translations in large text collections and to generate exercises from them.

Scientific and societal context of the research project

Language learning is of interest to society as a whole, both internationally and, as in the case of Switzerland, nationally. Our project contributes to the understanding of contrastive language learning and provides methods and data collections for language learning applications and further research.

Last update: 09.01.2019

Abstract

Corpus linguistics has a long history; even before computers were invented, people analyzed written texts to gain insights into language use. Today, corpora are increasingly used in language learning: language teachers and lexicographers search for examples that, for instance, best illustrate the use of a particular expression while being self-contained, using plain language, and not containing elements that draw attention away from the expression in question. Language learners, once instructed in corpus search techniques, can autonomously explore numerous authentic language examples and confirm or refute their assumptions. This technique is often referred to as data-driven learning (DDL).
Recent research is concerned with the automatic conversion of corpus examples into exercises in computer-assisted language learning (CALL) applications. State-of-the-art natural language processing (NLP) methods provide a linguistic analysis of the corpus material. Research on technology use in second language acquisition (SLA) has shown that the use of corpora in language learning improves learning efficiency.
Most work on corpus use in language learning applications focuses on monolingual corpora. Very few parallel corpora with standard language content (in contrast to translated legal texts etc.) are publicly available. Among those, the recently curated OpenSubtitles2016 corpus stands out in terms of its size and the number of available languages. The goal of the proposed project is to exploit this corpus (and possibly other corpora) to extract those translations that benefit language learners most.
To this end, I will apply NLP methods to extract linguistic structures, such as syntactic relations, semantic relations (in particular word embeddings, Karlgren and Sahlgren 2001), and correspondences between words and multiword units of the respective languages by combining word alignment with other annotation layers.
I expect that, in line with existing monolingual approaches, features such as lexical complexity and syntactic structure in both languages and the kind of correspondence at the lexical and syntactic levels, alongside simpler features such as sentence length and anaphora use, need to be taken into account to determine the suitability of a corpus example for language learning purposes. Surveys of language teachers will be used to adjust the ranking of corpus examples.
The generation of automatically evaluated language learning exercises poses a major problem: how can we reliably judge a given answer to be wrong? A common approach is not to allow free-form answers but to provide a choice of possible answers: one that is known to be correct and several so-called distractors, chosen to be plausible yet guaranteed to be incompatible with the given context. I will adapt this kind of exercise (among others) to parallel, word-aligned data, which opens up numerous new possibilities in terms of answers and distractors (e.g., parallel answers consisting of the correct answer in one language and a distractor in the other). The exercises will be automatically generated by an application prototype that allows its users to give explicit feedback on each exercise. In addition, the application will log all user interactions. The observations crowdsourced in this way will allow us to study user interaction with our application and help to improve it.
The project comprises the curation of a parallel corpus for several languages and the extraction of parallel (if possible also multiparallel) corpus examples that are suitable for language learning purposes. Both data sets will be made available so that other applications in corpus linguistics and CALL can make use of them.
I expect that the crowdsourcing data will help to tailor example selection and exercise generation to the needs of language learners and provide a basis for further development of CALL applications.
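The multiple-choice exercise on word-aligned data described in the abstract could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the names `AlignedPair`, `Exercise` and `make_gap_exercise` are hypothetical, the alignment is assumed to be given as index pairs, and the distractor filter only excludes words already present in the sentence. It does not verify that a distractor is actually incompatible with the context, which the abstract names as a requirement.

```python
import random
from dataclasses import dataclass


@dataclass
class AlignedPair:
    """A sentence pair with word alignment given as (source_index, target_index) links."""
    source: list[str]
    target: list[str]
    links: list[tuple[int, int]]


@dataclass
class Exercise:
    prompt: str          # target sentence with one token replaced by a gap
    hint: str            # full source sentence shown to the learner
    options: list[str]   # shuffled answer choices (correct answer plus distractors)
    answer: str          # the correct choice


def make_gap_exercise(pair, source_index, distractor_pool, n_distractors=3, rng=None):
    """Build a gap-fill exercise for the target word aligned to one source word."""
    rng = rng or random.Random()
    # Target positions aligned to the chosen source token.
    aligned = [t for s, t in pair.links if s == source_index]
    if len(aligned) != 1:
        raise ValueError("this sketch only handles one-to-one alignments")
    gap = aligned[0]
    answer = pair.target[gap]
    # Naive filter (assumption): drop candidates that already occur in the
    # target sentence, which also excludes the correct answer itself.
    used = {tok.lower() for tok in pair.target}
    candidates = sorted({w for w in distractor_pool if w.lower() not in used})
    if len(candidates) < n_distractors:
        raise ValueError("not enough distractor candidates")
    options = rng.sample(candidates, n_distractors) + [answer]
    rng.shuffle(options)
    prompt = " ".join("____" if i == gap else tok for i, tok in enumerate(pair.target))
    return Exercise(prompt, " ".join(pair.source), options, answer)


# Toy example with made-up data (not taken from any corpus).
pair = AlignedPair(
    source=["Der", "Hund", "schläft"],
    target=["The", "dog", "sleeps"],
    links=[(0, 0), (1, 1), (2, 2)],
)
exercise = make_gap_exercise(pair, 1, ["cat", "house", "tree", "dog"], rng=random.Random(0))
print(exercise.hint)     # Der Hund schläft
print(exercise.prompt)   # The ____ sleeps
print(exercise.options)
```

In a realistic setting, the distractor pool would have to be filtered much more carefully, for example by part of speech or by semantic similarity, and many-to-many alignments such as multiword units would need separate handling.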