A Framework for Objective Interestingness Measure Selection in Association Rule Mining
Thesis type 


Student  Bong Kok Keong 
Status  Finished 
Submitted in  2012 
Proposal on  31. Jan 2012 14:15 
Proposal room  Seminarraum I5 
Presentation on  22. May 2012 16:00 
Presentation room  Seminarraum I5 
Supervisor(s)  
Advisor(s) 

Association mining is the process of mining frequent patterns that occur in a transaction database. Initially used as a market-basket analysis solution for retail businesses, it has grown to cover many other fields such as medicine, traffic estimation, and anomaly detection. The two most well-known algorithms in association mining are Apriori and FP-Growth. Both produce all association rules that satisfy the user-defined minimum support and confidence thresholds. This leads to an inherent problem: the algorithms generate huge numbers of rules, because each rule is a permutation of the items in a set. For example, if a set of 3 items fulfils the minimum support, it will generate 6 rules with 2 items and another 6 rules with 3 items, i.e. 12 possible rules from a single itemset. To further limit the number of rules produced, data miners can opt to use more complex formulas to calculate whether a rule is "interesting" or not; the support and confidence measures mentioned earlier are two examples. These formulas are referred to as Objective Interestingness Measures (OIMs).
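The combinatorial blow-up described above can be made concrete with a short sketch (the helper name and item labels are illustrative, not from the thesis) that enumerates every rule derivable from one frequent itemset:

```python
from itertools import combinations

def rules_from_itemset(itemset):
    """Enumerate every rule A -> B where A and B are non-empty,
    disjoint subsets of the itemset. For a 3-item set this yields
    6 two-item rules and 6 three-item rules, 12 in total."""
    items = sorted(itemset)
    rules = []
    for r in range(1, len(items)):  # antecedent size
        for antecedent in combinations(items, r):
            remaining = [i for i in items if i not in antecedent]
            for s in range(1, len(remaining) + 1):  # consequent size
                for consequent in combinations(remaining, s):
                    rules.append((set(antecedent), set(consequent)))
    return rules

# hypothetical 3-itemset; yields 12 candidate rules
rules = rules_from_itemset({"bread", "milk", "butter"})
```

Each of these 12 rules must then be scored, which is why pruning by an interestingness measure matters.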
Over the years, many OIMs have been devised and studied. Within the OIM taxonomy they are separated by their origin, which varies from probability theory to information theory to other sources such as paradox detection. There are also absolute and relative OIMs available to data miners. Researchers have further defined the mathematical properties of these OIMs.
Data miners are left to decide which OIMs to use based on their experience and their assumptions about the association rules they would like to find. There are many ways to select appropriate OIMs: manually examining the OIMs, using decision aids, looking for specific properties in the dataset, clustering the OIMs based on experiments, employing meta-learning mechanisms to predict user selection, or even applying brute-force analysis. Apart from choosing which OIMs to use, some researchers follow the path of aggregating the values of multiple OIMs into a single value, mining rules that satisfy multiple OIMs simultaneously, or even using evolutionary algorithms to derive their own optimal formula.
Our proposal is confined to the question of choosing an appropriate OIM. Since the distribution of data differs from one dataset to another, we would like to automate the selection so that it can be applied with ease. Each association rule is mathematically defined by a contingency table (between the antecedent and consequent of the rule), and every OIM derives its final value from the contingency table's cells. We therefore hypothesize that an appropriate OIM can be selected based on the distribution of values in the contingency tables. In addition, we can improve the selection method's performance by shrinking the ruleset to be evaluated while maintaining the correlations between the respective OIMs. We then take the largest cluster of OIMs and select the one that is most discriminant. Discriminance matters because the rules will eventually be ranked before being presented to the user: if the selected OIM does not assign mostly unique values, the user will need to evaluate more rules. The context of this thesis is the application of OIMs in a recommender-system setting.
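To illustrate how OIMs derive their values from a rule's contingency table, here is a minimal sketch (function and cell names are our own; the 2x2 layout counts transactions by presence/absence of antecedent A and consequent B) computing a few common measures:

```python
def oims_from_contingency(n_ab, n_a_notb, n_nota_b, n_nota_notb):
    """Compute example OIMs for a rule A -> B from its 2x2
    contingency table: counts of transactions containing
    (A and B), (A, not B), (not A, B), (neither)."""
    n = n_ab + n_a_notb + n_nota_b + n_nota_notb
    p_ab = n_ab / n                # P(A and B)
    p_a = (n_ab + n_a_notb) / n    # P(A)
    p_b = (n_ab + n_nota_b) / n    # P(B)
    return {
        "support": p_ab,
        "confidence": p_ab / p_a,
        "lift": p_ab / (p_a * p_b),
        "leverage": p_ab - p_a * p_b,
    }

# hypothetical counts for 100 transactions
measures = oims_from_contingency(20, 5, 10, 65)
```

Every OIM in the taxonomy can be written as some function of these four cells, which is what makes the distribution of contingency-table values a plausible basis for automatic selection.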
There are many ways to evaluate the effectiveness of OIM usage. These methods can be grouped into several categories: classification accuracy, verification by expert users, reduction in rule count, comparison against other measures, and comparison against content-based similarity. Just as the usage of OIMs should be tailored to each domain, the preceding evaluation methods are also tailored to their application context. Our method of choice is to show that the selected OIM improves the recommender system's precision-recall. We also show that the rule-sampling method we use to speed up the selection performs better than simplified methods and random sampling.
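In the recommender-system setting, precision and recall reduce to comparing the recommended items against a held-out ground truth. A minimal sketch (names are illustrative, not the thesis's evaluation code):

```python
def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against
    the set of items the user actually found relevant."""
    rec, rel = set(recommended), set(relevant)
    hits = len(rec & rel)
    precision = hits / len(rec) if rec else 0.0
    recall = hits / len(rel) if rel else 0.0
    return precision, recall

# hypothetical evaluation of one user's top-4 recommendations
p, r = precision_recall(["a", "b", "c", "d"], ["b", "d", "e"])
```

A better-chosen OIM should rank genuinely relevant rules higher, which shows up as improved precision at a fixed recommendation-list length.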
In conclusion, we have presented a method for automatic OIM selection that has lower computational requirements (due to the sampling method used) and performs better than the conventional support-confidence framework (especially when many OIMs fail to assign mostly unique values to the rules). We share the common bottom-line opinion that OIMs do not necessarily reflect user preference, because ultimately "interestingness" differs from one individual to another; we cannot equate the rules that interest a credit-card fraud investigator with those that interest a person doing grocery shopping. Instead, we position OIMs as a post-processing tool that enables users and data miners alike to find the rules and knowledge they desire. Thus, as future work, we could pursue the direction of "personalizing" OIMs for different purposes. The sampling process could also be improved by experimenting with different clustering parameters and algorithms.