This project will address a problem that has recently been highlighted as a major research challenge in data mining: the fact that the output of many data mining algorithms is typically (and with the present technology often unavoidably) too large for human understanding and interpretation, and therefore of little use for many practical purposes. This is the case in particular for the output of the important and broad class of frequent pattern mining algorithms, which are in the focal point of this proposal. (Such patterns can be subsets in sets, subgraphs in graphs, or subsequences, substrings, or approximate substrings in strings, and much more.)To address this challenge, we will develop new data mining approaches for effective database summarization and understanding. We will achieve this by combining, integrating, and developing state-of-the art ideas from data mining, statistical modeling, and optimization theory.The solution of this problem is likely to boost the impact of all frequent pattern mining techniques, of which the development required a considerable community effort, but of which the uptake in applied research and industry has remained limited so far. Therefore, this project may have a significantly non-linear effect on the research community: it has the potential to unleash a large amount of data mining expertise for practical application.To underline the impact of our theoretical results, and to ensure that they are used in practice, in a second part of this project we will be the first ones to apply our results to a number of case studies. We have intentionally chosen these case studies from three diverse domains: bioinformatics, text mining, and marketing. Each of these domains has a large user base, all of whom will become beneficiaries of this project. By developing these applications, we will be able to demonstrate the use of our newly developed methods to these user groups. The bioinformatics application we will tackle will be the search for transcriptional modules of genes, which are regulated by the same regulatory program under certain conditions. The data driven nature of bioinformatics makes data mining approaches very well suited, and our new methodologies will ensure their successful application with a minimal amount of expert knowledge required. Note that this is just one example of a bioinformatics application -- frequent pattern mining (subsets, subgraphs, subsequences, substrings...) is of use in many other branches of bioinformatics.In the text mining application we will search for interesting sets of words, sequences of words, or approximate sequences (strings) of words, occurring in a corpus of text documents. These text documents will be sets of news articles over a certain time span. We should point out that we are currently already gathering thousands of news articles every day for subsequent text mining analyses, and this in the context of another project being developed together with Prof. Nello Cristianini in the same department. The integration of this result with that project will provide our new methodology with a unique showcase demonstration.For the marketing application, the most obvious application would be the search for items in a supermarket store that are often sold together. Marketers are interested in this information, for example to allow them to optimize their promotion strategies. For example, they may choose to reduce the price of one product, and increase the margin on other products that are strongly associated with the former. Our collaboration with Unilever, who have such retail transaction data available for use in this project, will enable us to successfully complete this application.
|