Zips: mining compressing sequential patterns in streams

作者: Hoang Thanh Lam , Toon Calders , Jie Yang , Fabian Mörchen , Dmitriy Fradkin

DOI: 10.1145/2501511.2501520

关键词:

摘要: We propose a streaming algorithm, based on the minimal description length (MDL) principle, for extracting non-redundant sequential patterns. For static databases, MDL-based approach that selects patterns their capacity to compress data rather than frequency, was shown be remarkably effective meaningful and solving redundancy issue in frequent itemset sequence mining. The existing algorithms, however, either start from seed set of patterns, or require multiple passes through data. As such, approaches scale poorly are unsuitable large datasets. Therefore, our main contribution is proposal new, called Zips, does not requires only one scan over we extended Lempel-Ziv (LZ) compression algorithm three ways: first, whereas LZ assigns codes uniformly as it builds up its dictionary while scanning input, Zips codewords according usage words; more heaviliy used words get shorter code-lengths. Secondly, exploits also non-consecutive occurences compression. And, third, well-known space-saving evict unpromising dictionary. Experiments synthetic two real-world large-scale datasets show extracts compressing with similar quality state-of-the-art multi-pass algorithms proposed databases sequences. Moreover, scales linearly size streams all do not.

参考文章(16)
Lawrence B. Holder, Diane J. Cook, Surnjani Djoko, Substructure discovery in the SUBDUE system knowledge discovery and data mining. pp. 169- 180 ,(1994)
TH Cormen, RL Rivest, CE Leiserson, C Stein, Introduction to Algorithms, 2nd edition. ,(2001)
Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams Database Theory - ICDT 2005. pp. 398- 412 ,(2004) , 10.1007/978-3-540-30570-5_27
James A. Storer, Thomas G. Szymanski, Data compression via textual substitution Journal of the ACM. ,vol. 29, pp. 928- 951 ,(1982) , 10.1145/322344.322346
Thomas M. Cover, Joy A. Thomas, Elements of information theory ,(1991)
Peter D. Grünwald, The Minimum Description Length Principle ,(2007)
Hong Cheng, Xifeng Yan, Jiawei Han, Philip S. Yu, Direct Discriminative Pattern Mining for Effective Classification 2008 IEEE 24th International Conference on Data Engineering. pp. 169- 178 ,(2008) , 10.1109/ICDE.2008.4497425
Jilles Vreeken, Matthijs van Leeuwen, Arno Siebes, Krimp: mining itemsets that compress Data Mining and Knowledge Discovery. ,vol. 23, pp. 169- 214 ,(2011) , 10.1007/S10618-010-0202-X
Nikolaj Tatti, Jilles Vreeken, The long and the short of it Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '12. pp. 462- 470 ,(2012) , 10.1145/2339530.2339606