Project


Significant Pattern Mining

English title Significant Pattern Mining
Applicant Borgwardt Karsten
Number 155913
Funding scheme SNSF Starting Grants
Research institution Computational Systems Biology Department of Biosystems, D-BSSE ETH Zürich
Institution of higher education ETH Zurich - ETHZ
Main discipline Information Technology
Start/End 01.05.2015 - 30.04.2020
Approved amount 1'420'850.00

Keywords (7)

Graph Mining; Correlation Search; Pattern Mining; Big Data; Computer Science; Multiple Testing; Data Mining

Lay Summary

Lead
Our society places great hopes in 'Big Data': the analysis of large amounts of data is expected to uncover previously unknown relationships that help improve almost all areas of life, from transportation to healthcare. The search for patterns in Big Data, however, faces a serious recurring problem: which patterns in these gigantic “mountains of data” were produced by chance, and which represent statistically significant observations? There is a lack of statistically sound approaches for making this distinction efficiently on large datasets. This project is dedicated to developing and investigating such approaches.
Lay summary

Content and goal of the research project

Our society places great hopes in Big Data: the analysis of large amounts of data is expected to uncover previously unknown relationships that help improve almost all areas of life, from transportation to healthcare.

The search for patterns in Big Data, however, faces a serious recurring problem: which patterns in these gigantic “mountains of data” were produced by chance, and which represent statistically significant observations? There is a lack of statistically sound approaches for making this distinction efficiently on large datasets.

In this project we aim to develop new algorithms that can discover statistically significant patterns in large datasets. The key to success will be new algorithms that are particularly efficient to compute, avoid expensive intermediate steps, reduce the number of candidate patterns early on, and cleverly exploit dependencies between patterns to speed up the necessary computations.

Scientific and societal context of the research project

Our work will produce new algorithms for detecting patterns in Big Data and for generating new knowledge about the underlying systems from them. It is therefore relevant to a wide range of disciplines that use Big Data, from logistics and finance to healthcare. At the same time, our project complements national initiatives to strengthen research on “Big Data” in Switzerland, such as the National Research Programme “Big Data” of the State Secretariat for Education, Research and Innovation (SBFI).
Last update: 26.08.2015

Publications

Publication
Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping
Höllerer Simon, Papaxanthos Laetitia, Gumpinger Anja Cathrin, Fischer Katrin, Beisel Christian, Borgwardt Karsten, Benenson Yaakov, Jeschek Markus (2020), Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping, in Nature Communications, 11(1), 3551-3551.
Prediction of cancer driver genes through network-based moment propagation of mutation scores
Gumpinger Anja C, Lage Kasper, Horn Heiko, Borgwardt Karsten (2020), Prediction of cancer driver genes through network-based moment propagation of mutation scores, in Bioinformatics, 36(Supplement), i508-i515.
Network-guided search for genetic heterogeneity between gene pairs
Gumpinger Anja C, Rieck Bastian, Grimm Dominik G, Int. Headache Geneti, Borgwardt Karsten (2020), Network-guided search for genetic heterogeneity between gene pairs, in Bioinformatics, 0.
Early prediction of circulatory failure in the intensive care unit using machine learning
Hyland Stephanie L., Faltys Martin, Hüser Matthias, Lyu Xinrui, Gumbsch Thomas, Esteban Cristóbal, Bock Christian, Horn Max, Moor Michael, Rieck Bastian, Zimmermann Marc, Bodenham Dean, Borgwardt Karsten, Rätsch Gunnar, Merz Tobias M. (2020), Early prediction of circulatory failure in the intensive care unit using machine learning, in Nature Medicine, 26(3), 364-373.
CASMAP: detection of statistically significant combinations of SNPs in association mapping
Llinares-López Felipe, Papaxanthos Laetitia, Roqueiro Damian, Bodenham Dean, Borgwardt Karsten (2019), CASMAP: detection of statistically significant combinations of SNPs in association mapping, in Bioinformatics, 35(15), 2680-2682.
Finding Statistically Significant Interactions between Continuous Features
Sugiyama Mahito, Borgwardt Karsten (2019), Finding Statistically Significant Interactions between Continuous Features, in Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, IJCAI, China.
Machine Learning for Biomarker Discovery: Significant Pattern Mining
Llinares-López Felipe, Borgwardt Karsten (2019), Machine Learning for Biomarker Discovery: Significant Pattern Mining, in Pržulj Nataša (ed.), Cambridge University Press, Cambridge, 313-368.
Association mapping in biomedical time series via statistically significant shapelet mining
Bock Christian, Gumbsch Thomas, Moor Michael, Rieck Bastian, Roqueiro Damian, Borgwardt Karsten (2018), Association mapping in biomedical time series via statistically significant shapelet mining, in Bioinformatics, 34(13), i438-i446.
Significant Pattern Mining for Biomarker Discovery
Llinares-López Felipe (2018), Significant Pattern Mining for Biomarker Discovery, ETH Zurich, Zurich.
Genome-wide genetic heterogeneity discovery with categorical covariates
Llinares-López Felipe, Papaxanthos Laetitia, Bodenham Dean, Roqueiro Damian, Borgwardt Karsten (2017), Genome-wide genetic heterogeneity discovery with categorical covariates, in Bioinformatics, 33(12), 1820-1828.
Finding significant combinations of features in the presence of categorical covariates
Papaxanthos Laetitia, Llinares-López Felipe, Bodenham Dean, Borgwardt Karsten (2016), Finding significant combinations of features in the presence of categorical covariates, in Advances in Neural Information Processing Systems 29, Barcelona, Curran Associates, Inc., Red Hook, NY.
Halting in Random Walk Kernels
Sugiyama Mahito, Borgwardt Karsten (2015), Halting in Random Walk Kernels, in Advances in Neural Information Processing Systems 28, Montréal, Canada, Curran Associates, Inc., Red Hook, NY.
Significant Subgraph Mining with Multiple Testing Correction
Sugiyama Mahito, López Felipe Llinares, Kasenburg Niklas, Borgwardt Karsten M. (2015), Significant Subgraph Mining with Multiple Testing Correction, in Proceedings of the 2015 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, Philadelphia, PA.
Genome-wide detection of intervals of genetic heterogeneity associated with complex traits
Llinares-López Felipe, Grimm Dominik G., Bodenham Dean A., Gieraths Udo, Sugiyama Mahito, Rowan Beth, Borgwardt Karsten (2015), Genome-wide detection of intervals of genetic heterogeneity associated with complex traits, in Bioinformatics, 31(12), i240-i249.
Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing
Llinares-López Felipe, Sugiyama Mahito, Papaxanthos Laetitia, Borgwardt Karsten (2015), Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing, in the 21th ACM SIGKDD International Conference, Sydney, NSW, Australia, ACM, New York.

Collaboration

Group / person Country
Types of collaboration
Kobi Benenson, ETH Zürich Switzerland (Europe)
- Publication
Tobias Merz, Inselspital Switzerland (Europe)
- Publication
Markus Jeschek, ETH Zürich Switzerland (Europe)
- Publication
Kasper Lage, Broad Institute of MIT and Harvard United States of America (North America)
- Publication
Gunnar Rätsch, ETH Zürich Switzerland (Europe)
- Publication
Mahito Sugiyama, National Institute of Informatics Japan (Asia)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
Dominik Grimm, Technical University of Munich, TUM Campus Straubing Germany (Europe)
- Publication
Heiko Horn, Broad Institute of MIT and Harvard United States of America (North America)
- Publication

Scientific events

Active participation

Title Type of contribution Title of article or contribution Date Place Persons involved
ELLIS Health Workshop Talk given at a conference Deep learning enables accurate predictions of ribosome binding site activity 21.10.2020 online, Germany Papaxanthos Laetitia;
Intelligent Systems for Molecular Biology (ISMB) 2020 Talk given at a conference Prediction of cancer driver genes through network-based moment propagation of mutation scores 13.07.2020 online, Canada Gumpinger Anja Cathrin;
ISCVID Symposium Talk given at a conference Machine Learning for Personalized Medicine 04.06.2019 Lausanne, Switzerland Borgwardt Karsten;
'15th Current Topics in Bioinformatics' symposium at MDC Berlin Talk given at a conference Machine Learning for Biomarker Discovery in Clinical Time Series 20.05.2019 Berlin, Germany Borgwardt Karsten;
26th International Conference on Intelligent Systems for Molecular Biology Talk given at a conference Tutorial AM2: Machine learning methods in the analysis of genomic and clinical data 06.07.2018 Chicago, United States of America Llinares Lopez Felipe;
Conference on Intelligent Systems for Molecular Biology (ISMB) 2018 Talk given at a conference Association Mapping in Biomedical Time Series via Statistically Significant Shapelet Mining 06.07.2018 Chicago, United States of America Bock Christian;
Personalized Health Technologies and Translational Research Conference Talk given at a conference Association Mapping in Biomedical Time Series via Statistically Significant Shapelet Mining 18.06.2018 Zürich, Switzerland Bock Christian;
Seminar Series: Special Topics in Computer Graphics and Visualisation Individual talk Statistically Significant Shapelet Mining for Biomedical Time Series 11.06.2018 Heidelberg, Germany Rieck Bastian Alexander;
SFB/TRR 209 seminar at the University Hospital Tübingen Individual talk Machine Learning for Biomarker Discovery: Combinatorial Association Mapping 16.04.2018 Tübingen, Germany Borgwardt Karsten;
Seminar series 'Software Trends' at Hochschule Esslingen Individual talk Die 'Daten-Medizin' 13.04.2018 Esslingen, Germany Borgwardt Karsten;
Fassberg Seminar Series at MPI Göttingen Individual talk Data Mining in the Life Sciences: Combinatorial Association Mapping 13.03.2018 Göttingen, Germany Borgwardt Karsten;
Workshop on Secure, Privacy-Conscious Data Sharing Talk given at a conference Personalized Swiss Sepsis Study 15.02.2018 Lausanne, Switzerland Borgwardt Karsten;
SIB Virtual Computational Biology Seminar Series Individual talk Significant Pattern Mining for Combinatorial Association Mapping 20.09.2017 online, Switzerland Borgwardt Karsten;
ISMB/ECCB 2017 Poster Genome-wide genetic heterogeneity discovery with categorical covariates 21.07.2017 Prague, Czech Republic Papaxanthos Laetitia; Llinares Lopez Felipe;
10th International Conference on Multiple Comparison Procedures Individual talk Accounting for a categorical covariate in significant pattern mining 21.06.2017 Riverside, United States of America Llinares Lopez Felipe;
Distinguished Speaker Series at the Center for Bioinformatics Individual talk Combinatorial Association Mapping 10.05.2017 Saarbrücken, Germany Borgwardt Karsten;
IBT seminar at the Institute for Biomedical Engineering at ETH Zürich Individual talk Network Mining in Biology and Medicine 25.04.2017 Zürich, Switzerland Borgwardt Karsten;
Alfried Krupp-Symposium "From Machine Learning to Personalized Medicine" Talk given at a conference Significant Pattern Mining for Biomarker Discovery 21.10.2016 München, Germany Llinares Lopez Felipe;
Felix Klein Conference "Mathematical Methods in Big Data" at the Fraunhofer Institute for Industrial Mathematics ITWM Talk given at a conference Machine Learning for Personalized Medicine 30.09.2016 Kaiserslautern, Germany Borgwardt Karsten;
ECCB workshop on "Complex Network Analysis for Precision Medicine" Talk given at a conference Network Mining for Personalized Medicine 03.09.2016 The Hague, Netherlands Borgwardt Karsten;
Latsis Symposium on Personalized Medicine – Challenges and Opportunities Talk given at a conference Genome-wide genetic heterogeneity discovery with categorical covariates 27.06.2016 Zürich, Switzerland Papaxanthos Laetitia;
Latsis Symposium on Personalized Medicine – Challenges and Opportunities Talk given at a conference Network-guided search for genetic heterogeneity between gene pairs 27.06.2016 Zürich, Switzerland Gumpinger Anja Cathrin;
WG2 COST Training School on Interactions in Complex Disease Analysis Talk given at a conference Machine Learning to Uncover Biological Interactions 20.05.2016 Antwerp, Belgium Papaxanthos Laetitia;
WG2 COST Training School on Interactions in Complex Disease Analysis Talk given at a conference Network-guided search for genetic heterogeneity between gene pairs 27.04.2016 Antwerp, Belgium Gumpinger Anja Cathrin;
Computational Biology (BC2) seminar at the Biozentrum at the University of Basel Individual talk Machine Learning for Personalized Medicine 25.04.2016 Basel, Switzerland Llinares Lopez Felipe;
Computer Science Colloquium of the University of Basel Individual talk Significant Pattern Mining 21.04.2016 Basel, Switzerland Borgwardt Karsten;
NIPS 2015 Workshop on Machine Learning and Computational Biology Talk given at a conference Detecting significant high-order associations between genotype and phenotype while conditioning on covariates 12.12.2015 Montréal, Canada Papaxanthos Laetitia;
Seminar at TU Dortmund Individual talk Significant Pattern Mining 12.11.2015 Dortmund, Germany Borgwardt Karsten;
Meeting of the Competence Center for Personalized Medicine of ETH Zürich & the University of Zürich at Kartause Ittingen Talk given at a conference Machine Learning for Personalized Medicine 02.11.2015 Ittingen, Switzerland Borgwardt Karsten;
International Workshop on Data Mining in Bioinformatics (BIOKDD'15), in conjunction with KDD2015 Talk given at a conference Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing 10.08.2015 Sydney, Australia Papaxanthos Laetitia;


Knowledge transfer events

Active participation

Title Type of contribution Date Place Persons involved
Huawei-ETH workshop Talk 25.05.2018 Zürich, Switzerland Borgwardt Karsten;
Seminar at Roche Basel Talk 18.04.2018 Basel, Switzerland Borgwardt Karsten;
Seminar at Google Research Zürich Talk 27.02.2018 Zürich, Switzerland Borgwardt Karsten;


Communication with the public

Communication Title Media Place Year
New media (web, blogs, podcasts, news feeds etc.) Circulatory failure is predictable/Kreislaufversagen vorhersagen ETH press release German-speaking Switzerland International 2020
Media relations: print media, online media Algorithmus soll im Spital vor Kreislaufversagen warnen Wiener Zeitung International 2020
Media relations: print media, online media Digitaler Doktor: Algorithmus sagt 90 Prozent kritischer Kreislaufversagen voraus Kleine Zeitung International 2020
Media relations: print media, online media Kreislaufversagen ist präzise vorhersagbar Pressetext International 2020
Media relations: print media, online media 25 Persons for the Next 25 Years Focus International 2018
Media relations: print media, online media Augmented Science SNF Journal "Horizonte" German-speaking Switzerland Italian-speaking Switzerland Rhaeto-Romanic Switzerland Western Switzerland 2017
Media relations: print media, online media Der Schatzsucher Tagesanzeiger German-speaking Switzerland 2017
Media relations: print media, online media ETH Zürich ernennt Professor für Data Mining Netzwoche German-speaking Switzerland 2017
Media relations: print media, online media Medicine is awash in data Globe German-speaking Switzerland International 2017

Awards

Title Year
ELLIS Fellow & ELLIS Faculty Member 2019
ETH medal for outstanding PhD thesis 2019
25 Persons for the Next 25 Years by German news magazine "Focus" 2018
One of the "Top 40 unter 40" in "State and Society" in Germany, according to business journal Capital 2016
One of the "Top 40 unter 40" in "State and Society" in Germany, according to business journal Capital 2015

Use-inspired outputs

Software

Name Year
CAsMap 2018


Abstract

Data Mining, the search for new knowledge in the form of statistical dependencies and patterns in big data sets, is omnipresent in modern society, in science and technology as much as in industry and finance. One of its most important branches is Pattern Mining, that is, finding groups of co-occurring elements in a collection of sets. For instance, keywords that co-occur in many documents may form a pattern, as may groups of atoms that reoccur in molecules with a particular biological function. Data Mining has brought about a huge body of literature on how to efficiently discover such patterns, even in very large datasets.

An unresolved question, however, is how to decide whether a given pattern is not only frequent but also statistically significantly enriched in a particular dataset or class of objects. This question is of essential relevance to all application domains of pattern mining, in particular the life sciences, as they are interested in selecting patterns for further experimental investigation and validation. It is our goal in this project to give an answer to this open problem of significant pattern mining.

The reason why this important question has remained unanswered so far is the multiple hypothesis testing problem: when assessing statistical significance, one has to account for the enormous number of hypotheses that were tested in the discovery process. While Statistics has developed numerous approaches to multiple hypothesis correction, their application is extremely difficult in Pattern Mining. This is because even simple statistics, such as the number of tests, may be challenging to compute, and because correcting for the huge number of tests performed may result in a loss of statistical power to detect true patterns.

In this project, we propose strategies for Pattern Mining with multiple testing correction that preserve statistical power. Key to this breakthrough will be novel algorithms that avoid computing expensive intermediate results, exclude non-testable hypotheses and exploit dependencies between tests. In this manner we plan to solve one of the big open problems in Data Mining.
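
The idea of excluding non-testable hypotheses can be illustrated with a small sketch. The abstract does not name a specific procedure, so the code below assumes one well-known option, Tarone's correction, combined with a one-sided Fisher's exact test for enrichment in the positive class. The function names (min_attainable_pvalue, tarone_threshold) and the toy numbers are hypothetical and chosen for illustration only; the sketch is not the project's implementation.

```python
from math import comb


def min_attainable_pvalue(support, n_pos, n_total):
    """Smallest p-value a one-sided Fisher's exact test can return for a
    pattern that occurs in `support` of `n_total` samples, `n_pos` of which
    belong to the positive class (assumed test, see lead-in)."""
    # Most extreme table: as many occurrences as possible in the positive class.
    a = min(support, n_pos)
    return comb(n_pos, a) * comb(n_total - n_pos, support - a) / comb(n_total, support)


def tarone_threshold(supports, n_pos, n_total, alpha=0.05):
    """Tarone's correction: find the smallest k such that at most k patterns
    can reach a p-value of alpha / k at all. Only those patterns are testable,
    and alpha / k is the corrected significance threshold."""
    p_mins = [min_attainable_pvalue(s, n_pos, n_total) for s in supports]
    k = 1
    while sum(p <= alpha / k for p in p_mins) > k:
        k += 1
    threshold = alpha / k
    testable = [i for i, p in enumerate(p_mins) if p <= threshold]
    return k, threshold, testable


if __name__ == "__main__":
    # Toy data: 20 samples, 10 positive, 8 candidate patterns with these supports.
    supports = [1, 1, 2, 2, 3, 8, 10, 10]
    k, threshold, testable = tarone_threshold(supports, n_pos=10, n_total=20)
    print("Bonferroni threshold:", 0.05 / len(supports))
    print("Tarone threshold:", threshold, "with k =", k)
    print("Testable pattern indices:", testable)
```

In this toy setting the rare patterns can never become significant, so they do not need to count towards the correction factor: the threshold is alpha divided by 3 rather than by 8, which is how such a procedure can preserve statistical power.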