
Detection of biological interactions from biomedical literature

English title: Detection of biological interactions from biomedical literature
Applicant: Hess Michael
Number: 118396
Funding instrument: Project funding (Div. I-III)
Research institution: Institut für Computerlinguistik Universität Zürich
Institution of higher education: Universität Zürich - ZH
Main discipline: Other languages
Start/End: 01.04.2008 - 31.12.2009
Approved amount: 114'046.00

All disciplines (2)

Other languages
Medical statistics

Keywords (4)

text mining; biomedical literature; genomics;

Lay Summary (English)

We are living in an age in which unprecedented amounts of information are available to almost everyone. However, finding and absorbing reliable information relevant to a specific problem has become increasingly difficult. While a general solution to this problem is probably still remote, in specific domains it is becoming tractable thanks to novel techniques from computational linguistics and text mining.
In the domain of biomedicine, for instance, research scientists and companies increasingly face the problem of efficiently locating, in the vast amount of published scientific results, the critical pieces of information needed to assess current and future research investment.
The project "Detection of Biological Interactions from Biomedical Literature" (SNF grant 100015_118396) has been very successful in demonstrating the potential of advanced computational linguistic techniques for addressing the information overload problem described above.
The main goal of the project, as described in the original proposal, was the following: "The proposed project aims at developing and refining automatic and semi-automatic methods for the discovery of interactions between biological entities from the scientific literature."
The choice of this specific domain was also motivated by the existence of resources such as terminological databases and other knowledge repositories, which can support the process of literature-based discovery. This goal is currently attracting a significant amount of research and public funding.
Systems currently used by biologists to search the literature are based on traditional information retrieval techniques and therefore typically deliver ranked lists of documents. Our approach aims instead at automatically extracting the most relevant information from the documents themselves.
Our method is based on a deep linguistic analysis of the literature using publicly available terminological resources (e.g. SwissProt, the protein knowledgebase developed at the Swiss Institute of Bioinformatics) and a full syntactic analyzer (developed at the University of Zurich).
The main result of the project is a system for the detection of domain-specific relationships mentioned in the scientific literature, in particular protein-protein interactions. The technologies developed so far have been validated through participation in two international shared evaluations (BioNLP event detection task, BioCreative). In the BioCreative challenge, our system obtained the best results among all participants in the task of detecting mentions of protein-protein interactions in the scientific literature. The results of the project have been described in eleven peer-reviewed publications (including a paper in the prestigious journal "Genome Biology").
The project was also successful in establishing an academic-industrial collaboration with NITAS, the text mining unit at Novartis, Basel. Their support has been essential in providing the project with the necessary domain expertise. Moreover, additional funding from NITAS has allowed us to expand the capacity of the project by one additional research position, which has helped to significantly increase its scientific output.
Last update: 21.02.2013

Responsible applicant and co-applicants

Associated projects

Number Title Start Funding instrument
130558 Semi-automated semantic enrichment of biomedical literature 01.08.2010 Projects


After the successful sequencing of the human genome and the identification of its genes, the study of the structure and functions of proteins within the cellular environment (proteomics) is seen as the next major goal of molecular biology. The recently coined term interactomics refers to the study of protein-protein interactions, which play a major role in this enterprise. Research scientists and companies working in this domain increasingly face the problem of efficiently locating, in the vast amount of published scientific results, the critical pieces of information needed to assess current and future research investment. The research community is therefore keen to adopt novel Text Mining solutions, which have the potential to support such a discovery process [1]. While there is broad consensus on the need for Text Mining, much ongoing research aims at establishing which of the many possible approaches is best suited to each specific research target. Most existing systems are based on traditional information retrieval techniques, possibly tailored to the characteristics of the domain (for example, making use of resources such as the Unified Medical Language System or the Gene Ontology), and typically deliver a ranked list of documents. While in some cases it might be possible to read all documents retrieved, in many others, due to time pressure, it would be preferable if the system could pinpoint just the relevant information: a paragraph, a sentence, or even just the mention of a specific fact (e.g. a protein-protein interaction). The proposed project aims at developing and refining (semi-)automatic methods for the discovery of interactions between biological entities from the scientific literature. Our methodology is centered on the dependency-based linguistic analysis of scientific articles.
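As a rough illustration of what a dependency-based analysis makes possible, the following minimal Python sketch reads the two arguments of an interaction verb off a set of dependency relations. It is not the project's parser or relation-mining system: the `Dependency` type, the label names (`subj`, `obj`, `prep`, `pobj`), the verb list, and the placeholder protein names are all assumptions made for this example, and the dependency analysis is written by hand rather than produced by a parser.

```python
from typing import List, NamedTuple, Set, Tuple


class Dependency(NamedTuple):
    head: str       # governing word
    label: str      # dependency label (hypothetical label set)
    dependent: str  # governed word


# Hypothetical trigger verbs; a real system would use a curated lexicon.
INTERACTION_VERBS = {"binds", "interacts", "phosphorylates", "activates"}


def extract_interactions(
    deps: List[Dependency], proteins: Set[str]
) -> List[Tuple[str, str, str]]:
    """Return (subject, verb, object) triples whose arguments are known proteins.

    Words are treated as unique tokens, which is only adequate for a toy
    sentence; a real analysis would use token positions.
    """
    results = []
    verbs = sorted({d.head for d in deps if d.head in INTERACTION_VERBS})
    for verb in verbs:
        subjects = [d.dependent for d in deps
                    if d.head == verb and d.label == "subj"]
        # Direct objects of the verb.
        objects = [d.dependent for d in deps
                   if d.head == verb and d.label == "obj"]
        # Objects reached through a preposition: verb -> prep -> noun.
        preps = [d.dependent for d in deps
                 if d.head == verb and d.label == "prep"]
        objects += [d.dependent for d in deps
                    if d.head in preps and d.label == "pobj"]
        for subj in subjects:
            for obj in objects:
                if subj in proteins and obj in proteins:
                    results.append((subj, verb, obj))
    return results


# Hand-written analysis of the made-up sentence "ProtA interacts with ProtB".
example_deps = [
    Dependency("interacts", "subj", "ProtA"),
    Dependency("interacts", "prep", "with"),
    Dependency("with", "pobj", "ProtB"),
]
example_proteins = {"ProtA", "ProtB"}  # stand-in for a terminology lookup

print(extract_interactions(example_deps, example_proteins))
```

The point of the sketch is that, once the syntactic relations are explicit, the two participants of an interaction are found by following a few labelled links rather than by scanning the surface word order.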
One of the advantages of dependency-based syntactic representations is that they can be easily mapped into a semantic representation. Alternatively, by application of simple transformations, they can be used directly to match candidate answers with given queries, allowing easy identification of the arguments of complex relations. Full linguistic analysis, even of complex, highly technical language, is now becoming possible due to recent developments in parsing technology [2]. A parser (PRO3GRES), which integrates a number of these developments in a hybrid way, has been developed at the University of Zurich [3]. The parser is the core component of a text mining system developed at the University of Zurich (ONTOGENE Relation Miner) [4]. BioCreAtIvE (a critical assessment of text mining methods in molecular biology) is a community-wide effort for evaluating information extraction and text mining developments in the biological domain. It represents the most relevant international evaluation forum in this domain. Our team took part in the most recent edition, participating in two tasks relevant to the detection of protein-protein interactions from scientific literature. While complete results have not yet been released by the organizers, the partial results allow us to conclude that our system is the best in one of the tasks, and is certainly among the best in the second task.
We ask the SNF for support in order to continue our research activities in this area, in particular to extend our Relation Miner system and to participate in the next edition of BioCreAtIvE. More precisely, our goals are:
1. to turn the ONTOGENE Relation Miner from a demonstrator system into a fully fledged prototype, which could be used directly by molecular biologists (through a web interface)
2. to improve our results in the detection of protein interactions from biomedical literature, with the aim of establishing a lead in this area, to be verified by participation in the next edition of BioCreAtIvE.
The results obtained so far with limited resources show that we are on the right track, and that we will be capable of turning the resources granted by the SNF into high-quality research output as well as tools which could benefit the wider molecular biology community. The support of the Text Mining Unit at Novartis provides the project with the necessary domain expertise, as well as a real-world environment for testing the tools developed within the project.
[1] Martin Krallinger and Alfonso Valencia. Text-mining and information-retrieval services for molecular biology. Genome Biology, 6(7):224, 2005.
[2] Andrew B Clegg and Adrian J Shepherd. Benchmarking natural-language parsers for biological applications using dependency graphs. BMC Bioinformatics, 8:24, 2007.
[3] Gerold Schneider. Hybrid Long-Distance Functional Dependency Parsing. Doctoral Thesis, Institute of Computational Linguistics, University of Zurich, 2007.
[4] Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Michael Hess, and Martin Romacker. An Environment for Relation Mining over Richly Annotated Corpora: the case of GENIA. BMC Bioinformatics, 7(Suppl 3):S3, 2006.