Authors: Hoang Thanh Lam, Fabian Mörchen, Dmitriy Fradkin, Toon Calders
DOI: 10.1002/SAM.11192
Keywords: Compression ratio, Minimum description length, Sequence database, Data mining, Data compression, Interpretability, Empirical research, Computer science, Data sequences, Redundancy (information theory), Pattern recognition, Artificial intelligence
Abstract: Pattern mining based on data compression has been successfully applied in many data mining tasks. For itemset data, the Krimp algorithm, based on the minimum description length (MDL) principle, was shown to be very effective in solving the redundancy issue in descriptive pattern mining. However, for sequence data, the redundancy issue of the set of frequent sequential patterns is not fully addressed in the literature. In this article, we study MDL-based algorithms for mining non-redundant sets of sequential patterns from a sequence database. First, we propose an encoding scheme for compressing sequence data with sequential patterns. Second, we formulate the problem of mining the most compressing sequential patterns from a sequence database. We show that this problem is intractable and belongs to the class of inapproximable problems. Therefore, we propose two heuristic algorithms. The first of these uses a two-phase approach similar to Krimp for itemset data. To overcome performance issues in candidate generation, we also propose GoKrimp, an algorithm that directly mines compressing patterns by greedily extending a pattern until there is no additional compression benefit from adding the extension into the dictionary. Since checks for the additional compression benefit of an extension are computationally expensive, we propose a dependency test which only chooses related events for extending a given pattern. This technique improves the efficiency of the GoKrimp algorithm significantly while it still preserves the quality of the set of patterns. We conduct an empirical study on eight datasets to show the effectiveness of our approach in comparison to the state-of-the-art algorithms in terms of interpretability of the extracted patterns, run time, compression ratio, and classification accuracy using the discovered patterns as features for different classifiers. © 2013 Wiley Periodicals, Inc. Statistical Analysis and Data Mining.