Optimal and Robust Rule Set Generation

Author: Jiuyong Li

Abstract: The rapidly growing volume and complexity of modern databases make technologies that describe and summarise the information they contain increasingly important. Data mining is the process of extracting implicit, previously unknown and potentially useful patterns and relationships from data, and it is widely used in industry and business applications.

Rules characterise relationships among patterns in databases, and rule mining is one of the central tasks in data mining. There are fundamentally two categories of rules: association rules and classification rules. Traditionally, association rules are connected with transaction databases for market basket problems, and classification rules are associated with relational databases for predictions. In this thesis, we focus mainly on the use of association rules for predictions.

An optimal rule set is a rule set that satisfies given optimality criteria. In this thesis we study two types of optimal rule sets, the informative association rule set and the optimal class association rule set. The informative association rule set is used for market basket predictions, and the class association rule set is used for classification. A robust classification rule set is a rule set that is capable of providing more correct predictions than a traditional classification rule set on incomplete test data.

Mining transaction databases for association rules usually generates a large number of rules, most of which are unnecessary when used for subsequent prediction. We define a rule set for a given transaction database that is significantly smaller than an association rule set but makes the same predictions as the complete association rule set. We call this rule set the informative rule set. The informative rule set is not constrained to particular target items, and it is smaller than the non-redundant association rule set.
We characterise the relationships between the informative rule set and the non-redundant association rule set. We present an algorithm that generates the informative rule set directly, without first generating all frequent itemsets, and that accesses the database less often than other direct methods. We show experimentally that the informative rule set is much smaller than both the association rule set and the non-redundant association rule set for a given database, and that it can be generated more efficiently. In addition, we discuss a new unsupervised discretization method for handling numerical attributes in general association rule mining without target specification. Based on an analysis of the strengths and weaknesses of two commonly used unsupervised discretization methods for numerical attributes, we present an adaptive numerical attribute merging algorithm that is shown to be better than both methods in general association rule mining.

Relational databases are usually denser than transaction databases, so mining them for class association rules, which are association rules whose consequents are classes, may be difficult due to combinatorial explosion. Based on an analysis of the prediction mechanism, we define an optimal class association rule set to be a subset of the complete class association rule set containing all potentially predictive rules.

Using this rule set instead of the complete class association rule set, we can avoid redundant computation that would otherwise be required for mining predictive association rules, and hence improve the efficiency of the mining process significantly. We present an efficient algorithm for mining optimal class association rule sets that uses upward closure properties to prune weak rules before they are actually generated.
We show theoretically that the efficiency of the proposed algorithm will be greater than that of Apriori on dense databases, and confirm experimentally that it generates an optimal class association rule set, which is very much smaller than a complete class association rule set, in significantly less time than Apriori takes to generate the complete class association rule set.

Traditional classification rule sets perform badly on test data that are not as complete as the training data. We study the problem of discovering more robust rule sets, where a rule set is said to be more robust than another rule set if it is able to make more accurate predictions on test data with missing attribute values. We reveal a hierarchy of k-optimal rule sets in which a k-optimal rule set with a larger k is more robust, and k-optimal rule sets are more robust than a traditional classification rule set. We introduce two methods to find k-optimal rule sets: an optimal association rule mining approach and a heuristic approximate approach. We show experimentally that a k-optimal rule set generated by the optimal association rule mining approach performs better than one generated by the heuristic approximate approach, and that both rule sets perform significantly better than a typical classification rule set (C4.5Rules) on incomplete test data.

Finally, we summarise the work discussed in this thesis and suggest some future research directions.
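
The idea of a smaller rule set that makes the same predictions as the complete association rule set can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the rules are enumerated by brute force on a toy database, and two things are assumptions made only for this example, namely the pruning criterion (drop a rule when a more general rule with the same consequent has confidence at least as high) and the predictor (choose the consequent of the highest-confidence matching rule). The function names `mine_rules`, `prune_specific_rules` and `predict` are hypothetical.

```python
from itertools import combinations


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def mine_rules(transactions, min_sup, min_conf):
    """Brute-force all single-consequent rules (suitable for toy data only)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = {}  # (antecedent, consequent) -> confidence
    for size in range(1, len(items)):
        for antecedent in map(frozenset, combinations(items, size)):
            n_ante = count(antecedent, transactions)
            if n_ante == 0:
                continue
            for consequent in items:
                if consequent in antecedent:
                    continue
                n_rule = count(antecedent | {consequent}, transactions)
                if n_rule / n >= min_sup and n_rule / n_ante >= min_conf:
                    rules[(antecedent, consequent)] = n_rule / n_ante
    return rules


def prune_specific_rules(rules):
    """Assumed criterion: drop a rule if a strictly more general rule with
    the same consequent has confidence at least as high."""
    kept = {}
    for (ante, cons), conf in rules.items():
        dominated = any(
            other_cons == cons and other_ante < ante and other_conf >= conf
            for (other_ante, other_cons), other_conf in rules.items()
        )
        if not dominated:
            kept[(ante, cons)] = conf
    return kept


def predict(rules, basket):
    """Assumed predictor: consequent of the highest-confidence matching rule
    (ties broken by consequent name), or None if no rule matches."""
    matches = [
        (conf, cons)
        for (ante, cons), conf in rules.items()
        if ante <= basket and cons not in basket
    ]
    return max(matches)[1] if matches else None


# Toy transaction database (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

all_rules = mine_rules(transactions, min_sup=0.3, min_conf=0.5)
small_rules = prune_specific_rules(all_rules)
```

On this toy data the complete set holds nine rules and the pruned set six, and `predict` returns the same item from either set for any basket. This holds because every dropped rule is matched, with at least the same confidence, by a more general rule that remains.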
