Project
Multilingual and Domain-Specific Information Retrieval
English title |
Multilingual and Domain-Specific Information Retrieval |
Applicant |
Savoy Jacques
|
Number |
129535 |
Funding scheme |
Project funding (Div. I-III)
|
Research institution |
Institut d'informatique Université de Neuchâtel
|
Institution of higher education |
University of Neuchatel - NE |
Main discipline |
Information Technology |
Start/End |
01.03.2011 - 28.02.2014 |
Approved amount |
171'510.00 |
Keywords (8)
Information retrieval; multilingual information retrieval; domain-specific IR; contextual retrieval; cross-lingual IR (CLIR); digital library; Information retrieval (IR); multilingual IR (MLIR)
Lay Summary (English)
Lead
|
|
Lay summary
|
In information retrieval (IR), the English language has been studied for many years, and various linguistic tools have been suggested and evaluated for this language. In this research proposal we are targeting three main objectives. Our first objective is to design, implement and evaluate IR systems that work with various non-English languages (monolingual IR). More specifically, in this part we want to begin with less frequently used languages (and new from an IR perspective), such as Persian, Hindi, Marathi and other Indian languages. This set of languages covers various branches of the Indo-European family, but we also tackle the Turkic languages (Turkish) as well as the Dravidian languages (Tamil, Telugu), in order to provide a basis of comparison for our tests. Translating an expression of a user need is clearly a less expensive translation strategy than translating the entire corpus into a common language. Thus, as a second objective, we want to design, evaluate and improve translation procedures used in query formulation (with one of the languages in each pair being English). As a third and most important objective, we want to continue investigating domain-specific IR systems used to retrieve information in a given field of knowledge (e.g., intellectual property (IP) or patents), instead of evaluating IR systems using newspaper test-collections. We thus want to investigate how we can improve retrieval effectiveness when considering only a specific domain. In this case, we may make use of a specialized thesaurus (e.g., as in the GIRT track at CLEF). We also want to improve search quality by analyzing general document structure (e.g., a patent is usually divided into an abstract, a disclosure section, claims, drawings and references, each section having, from an IR point of view, varying importance).
We also want to investigate and evaluate the impact of orthographic and vocabulary variations (both within a given language (e.g., Telugu) and proper name variations between different languages). Finally, in our efforts to further enhance retrieval effectiveness, extra-document information (e.g., document contexts, tables and references inside a patent, links between documents) may also be analyzed. Moreover, we would like to suggest an IR system capable of automatically carrying out computations using publicly available resources. In doing so we could exclude any retrieval systems requiring extensive manual work.
|
Responsible applicant and co-applicants
Employees
Publications
Akasereh Mitra (2013), A Quantitative Evaluation of Query Expansion in Domain Specific Information Retrieval, in
REthinking the Information Boundaries, ASIST 2013, Montreal.
Akasereh Mitra Savoy Jacques (2013), Ad Hoc Retrieval with Marathi Language, in P. Majumder M. Mitra P. Bhattacharyya L. Subramaniam D. Contractor & P. Rosso (ed.), Springer-Verlag, Berlin, 23-37.
Savoy Jacques (2013), Classification avec style : Une application aux discours gouvernementaux, in
Actes SNDI’2014, Nancy.
Savoy Jacques Dolamic Ljiljana Akasereh Mitra (2013), Information Retrieval with Hindi, Bengali, and Marathi Languages, in P. Majumder M. Mitra P. Bhattacharyya L. Subramaniam D. Contractor & P. Rosso (ed.), Springer-Verlag, Berlin, 334-352.
Savoy Jacques (2013), La voix du Président américain (1934-2013), in
Proceedings Statistical Analysis of Textual Data JADT 2014, Paris.
Petras Vivien Bogers Toine Toms Elaine Hall Mark Savoy Jacques Malak Piotr Pawlowski Adam (2013), Multilinguality, Multimodality, and Visualization, in Forner P. Müller H. Paredes R. Rosso P. Stein B. (ed.), Springer, Heidelberg, 192-211.
Akasereh Mitra Naji Nada Savoy Jacques (2013),
UniNE at CLEF 2013.
Akasereh Mitra Savoy Jacques (2012), Retrieval Effectiveness Study with Farsi Language, in
CORIA 2012, CORIA - CIFED, Bordeaux.
Akasereh Mitra Naji Nada Savoy Jacques (2012),
UniNE at CLEF 2012.
Scientific events
Active participation
Title |
Type of contribution |
Title of article or contribution |
Date |
Place |
Persons involved |
Self-organised
Associated projects
Number |
Title |
Start |
Funding scheme |
113273
|
Multilingual and Contextual Information Retrieval |
01.01.2007 |
Project funding (Div. I-III) |
Abstract
In information retrieval (IR), the English language has been studied for many years, and various linguistic tools have been suggested and evaluated for this language. In this research proposal (one PhD student over a three-year period) we are targeting three main objectives. Our first objective is to design, implement and evaluate IR systems that work with various non-English languages (monolingual IR). More specifically, in this part we want to begin with less frequently used languages (and new from an IR perspective), such as Persian, Turkish, Polish, Hindi, Marathi, Bengali and other Indian languages (e.g., Punjabi, Tamil, Telugu). This set of languages covers various branches of the Indo-European family, but we also tackle the Turkic languages (Turkish) as well as the Dravidian languages (Tamil, Telugu), in order to provide a basis of comparison in our tests. Translating a user need expression is clearly a less expensive translation strategy than translating the entire corpus into a common language. Thus, as a second objective, we want to design, evaluate and improve translation procedures used in query formulation (with one of the languages in each pair being English). In the current project, we want to better integrate translation (with its underlying uncertainty) into the search process, using existing translation tools (bilingual dictionaries, MT systems, statistical models). As a third and most important objective, we want to continue investigating domain-specific IR systems used to retrieve information in a given field of knowledge (e.g., intellectual property (IP) or patents), instead of evaluating IR systems using newspaper test-collections. We thus want to investigate how we can improve retrieval effectiveness when considering only a specific domain. In this case, we may make use of a specialized thesaurus (e.g., as in the GIRT track at CLEF).
We also want to improve search quality by analyzing general document structure (e.g., a patent is usually divided into an abstract, a disclosure section, claims, drawings and references, each section having, from an IR point of view, varying importance). We also want to investigate and evaluate the impact of orthographic and vocabulary variations (both within a given language (e.g., Telugu) and proper name variations between different languages). Finally, in our efforts to further enhance retrieval effectiveness, extra-document information (e.g., document contexts, tables and references inside a patent, links between documents) may also be analyzed. Moreover, we would like to suggest an IR system capable of automatically carrying out computations using publicly available resources. In so doing we could exclude any retrieval systems requiring extensive manual work (e.g., building a semantic network or large ontology, specialized multilingual thesaurus, etc.). This research proposal is based in part on our previous research projects in which we suggested different approaches to working with various European languages and in various domain-specific IR settings (SNSF Grant #200021-113273, ending in March 2010). This research grant will cover UniNE’s participation in the next CLEF campaigns together with our own participation in the forthcoming FIRE evaluation campaigns.
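The idea that the sections of a patent carry varying importance for retrieval can be illustrated with a minimal sketch. It is not the project's actual method: the function name weighted_score, the section weights, and the two toy patents are all hypothetical, and a real system would use a proper ranking model rather than raw term counts.

```python
from collections import Counter

# Hypothetical per-section weights; the project description gives no values.
FIELD_WEIGHTS = {"abstract": 2.0, "claims": 1.5, "disclosure": 1.0}


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stopword removal)."""
    return text.lower().split()


def weighted_score(query, document, weights=FIELD_WEIGHTS):
    """Sum over sections of (section weight * number of query-term occurrences)."""
    query_terms = set(tokenize(query))
    score = 0.0
    for field, text in document.items():
        counts = Counter(tokenize(text))
        matches = sum(counts[term] for term in query_terms)
        # Sections without an assigned weight contribute nothing.
        score += weights.get(field, 0.0) * matches
    return score


# Toy documents (illustrative only).
patent_a = {
    "abstract": "wireless charging coil",
    "claims": "a coil for charging",
    "disclosure": "background on batteries",
}
patent_b = {
    "abstract": "battery housing",
    "claims": "a housing",
    "disclosure": "wireless charging coil is mentioned",
}

ranking = sorted(
    [("patent_a", patent_a), ("patent_b", patent_b)],
    key=lambda item: weighted_score("wireless charging", item[1]),
    reverse=True,
)
print([name for name, _ in ranking])
```

With these assumed weights, a match in the abstract counts twice as much as a match in the disclosure section, so patent_a ranks ahead of patent_b even though both contain the query terms.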