Graph Mining, Correlation Search, Pattern Mining, Big Data, Computer Science , Multiple Testing, Data Mining
Sugiyama Mahito, Borgwardt Karsten (2015), Halting in Random Walk Kernels, in Advances in Neural Information Processing Systems 28
, Montréal, CanadaCurran Associates, Inc., Red Hook, NY.
Papaxanthos Laetitia, Llenares-López Felipe, Bodenham Dean, Borgwardt Karsten (2016), Finding significant combinations of features in the presence of categorical covariates, in Advances in Neural Information Processing Systems 29
, BarcelonaCurran Associates, Inc., Red Hook, NY.
Llinares-López Felipe, Sugiyama Mahito, Papaxanthos Laetitia, Borgwardt Karsten (2015), Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing, in the 21th ACM SIGKDD International Conference
, Sydney, NSW, AustraliaACM, New York.
Llinares-López Felipe, Grimm Dominik G., Bodenham Dean A., Gieraths Udo, Sugiyama Mahito, Rowan Beth, Borgwardt Karsten (2015), Genome-wide detection of intervals of genetic heterogeneity associated with complex traits, in Bioinformatics
, 31(12), i240-i249.
Sugiyama Mahito, López Felipe Llinares, Kasenburg Niklas, Borgwardt Karsten M. (2015), Significant Subgraph Mining with Multiple Testing Correction, in Proceedings of the 2015 SIAM International Conference on Data Mining
, Society for Industrial and Applied Mathematics, Philadelphia, PA.
Llinares-López Felipe, Papaxanthos Laetitia, Bodenham Dean, Roqueiro Damian, Borgwardt Karsten (accepted), Genome-wide genetic heterogeneity discovery with categorical covariates, in Bioinformatics
Data Mining, the search for new knowledge in form of statistical depedencies and patterns in big data sets, is omnipresent in modern society, in science and technology as much as in industry and finance. One of its most important branches is Pattern Mining, that is finding groups of co-occuring elements in a collection of sets. For instance, keywords that co-occur in many documents may form a pattern, or groups of atoms that reoccur in molecules with a particular biological function. Data Mining has brought about a huge body of literature on how to efficiently discover such patterns, even in very large datasets.
An unresolved open question is, however, to decide whether a given pattern is not only frequent, but statistically significantly enriched in a particular dataset or class of objects. This question is of essential relevance to all application domains of pattern mining, in particular the life sciences, as they are interested in selecting patterns for further experimental investigation and validation. It is our goal in this project to give an answer to this open problem of significant pattern mining.
The reason why this important question remains unanswered so far is the multiple hypothesis testing problem: when assessing statistical significance, one has to account for the enormous number of hypotheses that were tested in the discovery process. While Statistics has developed numerous approaches to multiple hypotheses correction, their application is extremely difficult in Pattern Mining. This is due to the fact that even simple statistics, such as the number of tests, may be challenging to compute and that correcting for the huge number of tests performed may result in loss of statistical power in detecting true patterns.
In this project, we propose strategies for Pattern Mining with multiple testing correction that preserve statistical power. Key to this breakthrough will be novel algorithms that avoid to compute expensive intermediate results, exclude non-testable hypotheses and exploit dependencies between tests. In this manner we plan to solve one of the big open problems in Data Mining.