Project

Multilingual and Domain-Specific Information Retrieval

English title: Multilingual and Domain-Specific Information Retrieval
Applicant: Savoy Jacques
Number: 129535
Funding scheme: Project funding (Div. I-III)
Research institution: Institut d'informatique, Université de Neuchâtel
Institution of higher education: University of Neuchatel - NE
Main discipline: Information Technology
Start/End: 01.03.2011 - 28.02.2014
Approved amount: 171'510.00

Keywords (8)

Information retrieval; multilingual information retrieval; domain-specific IR; contextual retrieval; cross-lingual IR (CLIR); digital library; Information retrieval (IR); multilingual IR (MLIR)

Lay Summary (English)

In information retrieval (IR), the English language has been studied for many years, and various linguistic tools have been suggested and evaluated for it. In this research proposal we target three main objectives. Our first objective is to design, implement and evaluate IR systems that work with various non-English languages (monolingual IR). More specifically, we want to begin with less frequently used languages that are new from an IR perspective, such as Persian, Hindi, Marathi and other Indian languages. This set of languages covers various branches of the Indo-European family, but we also tackle a Turkic language (Turkish) as well as Dravidian languages (Tamil, Telugu), in order to provide a basis of comparison for our tests. Translating the expression of a user's need is clearly less expensive than translating the entire corpus into a common language. Thus, as a second objective, we want to design, evaluate and improve translation procedures used in query formulation (with English as one language of each pair). As a third and most important objective, we want to continue investigating domain-specific IR systems used to retrieve information in a given field of knowledge (e.g., intellectual property (IP) or patents), instead of evaluating IR systems using newspaper test collections. We thus want to investigate how retrieval effectiveness can be improved when considering only a specific domain. In this case, we may make use of a specialized thesaurus (e.g., as in the GIRT track at CLEF). We also want to improve search quality by analyzing general document structure (e.g., a patent is usually divided into an abstract, a disclosure section, claims, drawings and references, each section having varying importance from an IR point of view).
We also want to investigate and evaluate the impact of orthographic and vocabulary variations, both within a given language (e.g., Telugu) and across languages (proper-name variations). Finally, in our efforts to further enhance retrieval effectiveness, extra-document information (e.g., document contexts, tables and references inside a patent, links between documents) may also be analyzed. Moreover, we would like to propose an IR system capable of carrying out its computations automatically using publicly available resources. In doing so we could exclude any retrieval systems requiring extensive manual work.
Last update: 21.02.2013

Publications

A Quantitative Evaluation of Query Expansion in Domain Specific Information Retrieval
Akasereh Mitra (2013), A Quantitative Evaluation of Query Expansion in Domain Specific Information Retrieval, in REthinking the Information Boundaries, ASIST 2013, Montreal.
Ad Hoc Retrieval with Marathi Language
Akasereh Mitra, Savoy Jacques (2013), Ad Hoc Retrieval with Marathi Language, in P. Majumder, M. Mitra, P. Bhattacharyya, L. Subramaniam, D. Contractor & P. Rosso (ed.), Springer-Verlag, Berlin, 23-37.
Classification avec style : Une application aux discours gouvernementaux
Savoy Jacques (2013), Classification avec style : Une application aux discours gouvernementaux, in Actes SNDI’2014, Nancy.
Information Retrieval with Hindi, Bengali, and Marathi Languages
Savoy Jacques, Dolamic Ljiljana, Akasereh Mitra (2013), Information Retrieval with Hindi, Bengali, and Marathi Languages, in P. Majumder, M. Mitra, P. Bhattacharyya, L. Subramaniam, D. Contractor & P. Rosso (ed.), Springer-Verlag, Berlin, 334-352.
La voix du Président américain (1934-2013).
Savoy Jacques (2013), La voix du Président américain (1934-2013), in Proceedings Statistical Analysis of Textual Data JADT 2014, Paris.
Multilinguality, Multimodality, and Visualization
Petras Vivien, Bogers Toine, Toms Elaine, Hall Mark, Savoy Jacques, Malak Piotr, Pawlowski Adam (2013), Multilinguality, Multimodality, and Visualization, in Forner P., Müller H., Paredes R., Rosso P., Stein B. (ed.), Springer, Heidelberg, 192-211.
UniNE at CLEF 2013
Akasereh Mitra, Naji Nada, Savoy Jacques (2013), UniNE at CLEF 2013.
Retrieval Effectiveness Study with Farsi Language
Akasereh Mitra, Savoy Jacques (2012), Retrieval Effectiveness Study with Farsi Language, in CORIA 2012, CORIA - CIFED, Bordeaux.
UniNE at CLEF 2012
Akasereh Mitra, Naji Nada, Savoy Jacques (2012), UniNE at CLEF 2012.

Scientific events

Active participation

Title | Type of contribution | Title of article or contribution | Date | Place | Persons involved
REthinking Information Boundaries, ASIST - 2013 | Talk given at a conference | A Quantitative Evaluation of Query Expansion in Domain Specific Information Retrieval | 01.11.2013 | Montreal, Canada | Akasereh Mitra; Savoy Jacques
Cross-Lingual Evaluation Forum 2013 | Talk given at a conference | UniNE at CLEF 2013 | 23.09.2013 | Valencia, Spain | Savoy Jacques
Promise Winter School 2013 | Poster | Cultural Heritage in CLEF | 04.02.2013 | Bressanone, Italy | Akasereh Mitra
Cross-Lingual Evaluation Forum 2012 | Talk given at a conference | UniNE at CLEF 2012 | 17.09.2012 | Rome, Italy | Savoy Jacques; Akasereh Mitra
CORIA 2012 | Talk given at a conference | Retrieval Effectiveness Study with Farsi Language | 21.03.2012 | Bordeaux, France | Akasereh Mitra; Savoy Jacques


Self-organised

Title | Date | Place
CORIA 2013 | 03.04.2013 | Neuchatel, Switzerland

Associated projects

Number | Title | Start | Funding scheme
113273 | Multilingual and Contextual Information Retrieval | 01.01.2007 | Project funding (Div. I-III)

Abstract

In information retrieval (IR), the English language has been studied for many years, and various linguistic tools have been suggested and evaluated for it. In this research proposal (one PhD student over a three-year period) we target three main objectives. Our first objective is to design, implement and evaluate IR systems that work with various non-English languages (monolingual IR). More specifically, we want to begin with less frequently used languages that are new from an IR perspective, such as Persian, Turkish, Polish, Hindi, Marathi, Bengali and other Indian languages (e.g., Punjabi, Tamil, Telugu). This set of languages covers various branches of the Indo-European family, but we also tackle a Turkic language (Turkish) as well as Dravidian languages (Tamil, Telugu), in order to provide a basis of comparison in our tests. Translating the expression of a user's need is clearly less expensive than translating the entire corpus into a common language. Thus, as a second objective, we want to design, evaluate and improve translation procedures used in query formulation (with English as one language of each pair). In the current project, we want to better integrate translation (with its underlying uncertainty) into the search process, using existing translation tools (bilingual dictionaries, MT systems, statistical models). As a third and most important objective, we want to continue investigating domain-specific IR systems used to retrieve information in a given field of knowledge (e.g., intellectual property (IP) or patents), instead of evaluating IR systems using newspaper test collections. We thus want to investigate how retrieval effectiveness can be improved when considering only a specific domain. In this case, we may make use of a specialized thesaurus (e.g., as in the GIRT track at CLEF).
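As an illustration only, the idea of carrying translation uncertainty into the search process can be sketched as a weighted query: each source term spreads its weight over its candidate translations instead of committing to a single one. The lexicon, its probabilities, and the function names below are hypothetical and are not taken from the project; the scoring is a plain weighted term-frequency sum, chosen for brevity rather than effectiveness.

```python
from collections import Counter, defaultdict

# Hypothetical bilingual lexicon (French -> English) with made-up
# translation probabilities; a real system would derive these from
# dictionaries, MT output, or statistical translation models.
LEXICON = {
    "brevet": [("patent", 0.9), ("licence", 0.1)],
    "invention": [("invention", 1.0)],
}


def translate_query(terms, lexicon):
    """Build a weighted target-language query.

    Each source term contributes a total weight of 1.0, split across its
    candidate translations in proportion to their probabilities.
    """
    weighted = defaultdict(float)
    for term in terms:
        # Terms missing from the lexicon (e.g., proper names) are kept as-is.
        candidates = lexicon.get(term, [(term, 1.0)])
        total = sum(prob for _, prob in candidates)
        for target, prob in candidates:
            weighted[target] += prob / total
    return dict(weighted)


def score(doc_tokens, weighted_query):
    """Weighted term-frequency score of one tokenized document."""
    counts = Counter(doc_tokens)
    return sum(weight * counts[term] for term, weight in weighted_query.items())
```

With this sketch, a document mentioning "patent" scores higher than one mentioning only "licence" for the source query term "brevet", yet the less likely translation still contributes instead of being discarded.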
We also want to improve search quality by analyzing general document structure (e.g., a patent is usually divided into an abstract, a disclosure section, claims, drawings and references, each section having varying importance from an IR point of view). We also want to investigate and evaluate the impact of orthographic and vocabulary variations, both within a given language (e.g., Telugu) and across languages (proper-name variations). Finally, in our efforts to further enhance retrieval effectiveness, extra-document information (e.g., document contexts, tables and references inside a patent, links between documents) may also be analyzed. Moreover, we would like to propose an IR system capable of carrying out its computations automatically using publicly available resources. In so doing we could exclude any retrieval systems requiring extensive manual work (e.g., building a semantic network or large ontology, specialized multilingual thesaurus, etc.). This research proposal is based in part on our previous research projects, in which we suggested different approaches for working with various European languages and in various domain-specific IR settings (SNSF Grant #200021-113273, ending in March 2010). This research grant will cover UniNE’s participation in the next CLEF campaigns together with our own participation in the forthcoming FIRE evaluation campaigns.
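To make the remark about document structure concrete, here is a minimal sketch of section-weighted matching for a patent-like document. The section names follow the abstract's description of a patent; the weights, the function name, and the whitespace tokenization are assumptions made for illustration, not values or methods reported by the project.

```python
# Hypothetical section weights; in practice they would be tuned on a
# test collection rather than fixed by hand.
SECTION_WEIGHTS = {
    "abstract": 2.0,
    "claims": 1.5,
    "disclosure": 1.0,
    "references": 0.5,
}


def field_weighted_score(query_terms, document, weights=SECTION_WEIGHTS):
    """Sum query-term matches per section, scaled by the section weight.

    `document` maps section names to raw text. Sections without a
    weight (e.g., drawings) are ignored.
    """
    total = 0.0
    for section, text in document.items():
        tokens = text.lower().split()
        matches = sum(tokens.count(term) for term in query_terms)
        total += weights.get(section, 0.0) * matches
    return total


patent = {
    "abstract": "solar cell with improved efficiency",
    "claims": "a solar panel comprising the cell",
    "drawings": "solar",
}
print(field_weighted_score(["solar"], patent))
```

A match in the abstract counts more than the same match in the claims, and text in unweighted sections does not contribute at all.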