Mining Compressing Sequential Patterns

作者: Hoang Thanh Lam , Fabian Mörchen , Dmitriy Fradkin , Toon Calders

DOI: 10.1002/SAM.11192

关键词: Compression ratioMinimum description lengthSequence databaseData miningData compressionInterpretabilityEmpirical researchComputer scienceData sequencesRedundancy (information theory)Pattern recognitionArtificial intelligence

摘要: Pattern mining based on data compression has been successfully applied in many tasks. For itemset data, the Krimp algorithm minimumdescription length MDL principle was shown to be very effective solving redundancy issue descriptive pattern mining. However, for sequence of set frequent sequential patterns is not fully addressed literature. In this article, we study MDL-based algorithms non-redundant sets from a database. First, propose an encoding scheme compressing with patterns. Second, formulate problem most We show that intractable and belongs class inapproximable problems. Therefore, two heuristic algorithms. The first these uses two-phase approach similar data. To overcome performance issues candidate generation, also GoKrimp, directly mines by greedily extending until no additional benefit adding extension into dictionary. Since checks are computationally expensive dependency test which only chooses related events given pattern. This technique improves efficiency GoKrimp significantly while it still preserves quality conduct empirical eight datasets effectiveness our comparison state-of-the-art terms interpretability extracted patterns, run time, ratio, classification accuracy using discovered as features different classifiers. © 2013 Wiley Periodicals, Inc. Statistical Analysis Data Mining,

参考文章(37)
Lawrence B. Holder, Diane J. Cook, Surnjani Djoko, Substructure discovery in the SUBDUE system knowledge discovery and data mining. pp. 169- 180 ,(1994)
Fabian Mörchen, Dmitriy Fradkin, Robust Mining of Time Intervals with Semi-interval Partial Order Patterns. siam international conference on data mining. pp. 315- 326 ,(2010)
Kai Puolamäki, Gemma C. Garriga, Sami Hanhijärvi, Randomization Techniques for Graphs siam international conference on data mining. pp. 780- 791 ,(2009)
Matthijs van Leeuwen, Arno Siebes, StreamKrimp: Detecting Change in Data Streams european conference on machine learning. pp. 672- 687 ,(2008) , 10.1007/978-3-540-87479-9_62
Aristides Gionis, Heikki Mannila, Taneli Mielikäinen, Panayiotis Tsaparas, Assessing data mining results via swap randomization ACM Transactions on Knowledge Discovery From Data. ,vol. 1, pp. 14- ,(2007) , 10.1145/1297332.1297338
Christos Faloutsos, Vasileios Megalooikonomou, On data mining, compression, and Kolmogorov complexity Data Mining and Knowledge Discovery. ,vol. 15, pp. 3- 20 ,(2007) , 10.1007/S10618-006-0057-3
Christoph Ambühl, Monaldo Mastrolilli, Ola Svensson, Inapproximability Results for Maximum Edge Biclique, Minimum Linear Arrangement, and Sparsest Cut SIAM Journal on Computing. ,vol. 40, pp. 567- 596 ,(2011) , 10.1137/080729256
James A. Storer, Thomas G. Szymanski, Data compression via textual substitution Journal of the ACM. ,vol. 29, pp. 928- 951 ,(1982) , 10.1145/322344.322346
Fabian Moerchen, Michael Thies, Alfred Ultsch, Efficient mining of all margin-closed itemsets with applications in temporal knowledge discovery and classification by compression Knowledge and Information Systems. ,vol. 29, pp. 55- 80 ,(2011) , 10.1007/S10115-010-0329-5