Approximately detecting duplicates for streaming data using stable bloom filters

作者: Fan Deng , Davood Rafiei

DOI: 10.1145/1142473.1142477

关键词:

摘要: Traditional duplicate elimination techniques are not applicable to many data stream applications. In general, precisely eliminating duplicates in an unbounded is feasible streaming scenarios. Therefore, we target at approximately environments given a limited space. Based on well-known bitmap sketch, introduce structure, Stable Bloom Filter, and novel simple algorithm. The basic idea as follows: since there no way store the whole history of stream, SBF continuously evicts stale information so that has room for those more recent elements. After finding some properties analytically, show tight upper bound false positive rates guaranteed. our empirical study, compare alternative methods. results method superior terms both accuracy time effciency when fixed small space acceptable rate given.

参考文章(34)
David Maier, Tim Sheard, Peter A. Tucker, Applying Punctuation Schemes to Queries Over Continuous Data Streams. IEEE Data(base) Engineering Bulletin. ,vol. 26, pp. 33- 40 ,(2003)
Nesime Tatbul, Uğur Çetintemel, Stan Zdonik, Mitch Cherniack, Michael Stonebraker, Load shedding in a data stream manager very large data bases. pp. 309- 320 ,(2003) , 10.1016/B978-012722442-8/50035-5
Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti, Eliminating fuzzy duplicates in data warehouses very large data bases. pp. 586- 597 ,(2002) , 10.1016/B978-155860869-6/50058-5
Hector Garcia-Molina, Jennifer Widom, Jeffrey D. Ullman, Database System Implementation ,(1999)
Allan Heydon, Marc Najork, Mercator: A scalable, extensible Web crawler World Wide Web. ,vol. 2, pp. 219- 229 ,(1999) , 10.1023/A:1019213109274
Guy M. Lohman, Lothar F. Mackert, R* Optimizer Validation and Performance Evaluation for Distributed Queries very large data bases. pp. 537- 547 ,(1994)
Andrew S. Tanenbaum, Modern Operating Systems ,(1992)
Andrei Z. Broder, Marc Najork, Janet L. Wiener, Efficient URL caching for world wide web crawling Proceedings of the twelfth international conference on World Wide Web - WWW '03. pp. 679- 689 ,(2003) , 10.1145/775152.775247
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom, Models and issues in data stream systems symposium on principles of database systems. pp. 1- 16 ,(2002) , 10.1145/543613.543615
Kyu-Young Whang, Brad T. Vander-Zanden, Howard M. Taylor, A linear-time probabilistic counting algorithm for database applications ACM Transactions on Database Systems. ,vol. 15, pp. 208- 229 ,(1990) , 10.1145/78922.78925