作者: Hoang Thanh Lam , Toon Calders , Jie Yang , Fabian Mörchen , Dmitriy Fradkin
关键词:
摘要: We propose a streaming algorithm, based on the minimal description length (MDL) principle, for extracting non-redundant sequential patterns. For static databases, MDL-based approach that selects patterns their capacity to compress data rather than frequency, was shown be remarkably effective meaningful and solving redundancy issue in frequent itemset sequence mining. The existing algorithms, however, either start from seed set of patterns, or require multiple passes through data. As such, approaches scale poorly are unsuitable large datasets. Therefore, our main contribution is proposal new, called Zips, does not requires only one scan over we extended Lempel-Ziv (LZ) compression algorithm three ways: first, whereas LZ assigns codes uniformly as it builds up its dictionary while scanning input, Zips codewords according usage words; more heaviliy used words get shorter code-lengths. Secondly, exploits also non-consecutive occurences compression. And, third, well-known space-saving evict unpromising dictionary. Experiments synthetic two real-world large-scale datasets show extracts compressing with similar quality state-of-the-art multi-pass algorithms proposed databases sequences. Moreover, scales linearly size streams all do not.