Efficient Computation of Frequent and Top-k Elements in Data Streams

作者: Ahmed Metwally , Divyakant Agrawal , Amr El Abbadi

DOI: 10.1007/978-3-540-30570-5_27

关键词: ComputationOrder (group theory)Set (abstract data type)Computer scienceData stream miningAlgorithmData streamIntegrated approachSpace (mathematics)Distribution (mathematics)

摘要: We propose an integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large alphabet case, our solution is space efficient and reports both top-k and frequent elements with tight guarantees on errors. For general data distributions, our top-k algorithm can return a set of k′ elements, where k′≈ k, which are guaranteed to be the top-k'elements; and we use …

参考文章(48)
Prosenjit Bose, Pat Morin, Yihui Tang, Evangelos Kranakis, Bounds for Frequency Estimation of Packet Streams. SIROCCO. pp. 33- 42 ,(2003)
Gurmeet Singh Manku, Rajeev Motwani, Chapter 31 – Approximate Frequency Counts over Data Streams very large data bases. pp. 346- 357 ,(2002) , 10.1016/B978-155860869-6/50038-X
Jeffrey Scott Vitter, Yossi Matias, Min Wang, Dynamic Maintenance of Wavelet-Based Histograms very large data bases. pp. 101- 110 ,(2000)
Philippe Bonnet, Johannes Gehrke, Praveen Seshadri, Towards Sensor Database Systems mobile data management. pp. 3- 14 ,(2001) , 10.1007/3-540-44498-X_1
Moses Charikar, Kevin Chen, Martin Farach-Colton, Finding Frequent Items in Data Streams international colloquium on automata languages and programming. ,vol. 312, pp. 693- 703 ,(2002) , 10.1016/S0304-3975(03)00400-6
Arvind Arasu, Shivnath Babu, Jennifer Widom, CQL: A Language for Continuous Queries over Streams and Relations database programming languages. pp. 1- 19 ,(2003) , 10.1007/978-3-540-24607-7_1
Lynne Stokes, Jeffrey F. Naughton, Peter J. Haas, S. Seshadri, Sampling-Based Estimation of the Number of Distinct Values of an Attribute very large data bases. pp. 311- 322 ,(1995)
Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro, Frequency Estimation of Internet Packet Streams with Limited Space european symposium on algorithms. pp. 348- 360 ,(2002) , 10.1007/3-540-45749-6_33
C. A. R. Hoare, Algorithm 65: find Communications of the ACM. ,vol. 4, pp. 321- 322 ,(1961) , 10.1145/366622.366647
Yunyue Zhu, Dennis Shasha, StatStream: statistical monitoring of thousands of data streams in real time very large data bases. pp. 358- 369 ,(2002) , 10.1016/B978-155860869-6/50039-1