Comparing data streams using Hamming norms (how to zero in)

作者: Piotr Indyk , Mayur Datar , Graham Cormode , S. Muthukrishnan

DOI:

关键词: Data stream miningComputer scienceNorm (mathematics)Hamming codeWireless sensor networkTheoretical computer scienceSimilarity (geometry)Data streamData miningScale (descriptive set theory)

摘要: Massive data streams are now fundamental to many processing applications. For example, Internet routers produce large scale diagnostic streams. Such rarely stored in traditional databases, and instead must be processed "on the fly" as they produced. Similarly, sensor networks multiple of observations from their sensors. There is growing focus on manipulating streams, hence, there a need identify basic operations interest managing support them efficiently. We propose computation Hamming norm operation interest. The formalises ideas that used throughout processing. When applied single stream, gives number distinct items present which statistic great databases. pair an important measure (dis)similarity: unequal item counts two norms have uses comparing streams. We novel approximation technique for estimating massive streams; this relies what we call "l0 sketch" prove its accuracy. We test our method quantity synthetic real stream data, show estimation accurate within few percentage points.

参考文章(30)
Piotr Indyk, Nick Koudas, S. Muthukrishnan, Identifying Representative Trends in Massive Time Series Data Sets Using Sketches very large data bases. pp. 363- 372 ,(2000)
Geoffrey M. Voelker, Stefan Savage, David Moore, Inferring internet denial-of-service activity usenix security symposium. pp. 2- 2 ,(2001)
Lynne Stokes, Jeffrey F. Naughton, Peter J. Haas, S. Seshadri, Sampling-Based Estimation of the Number of Distinct Values of an Attribute very large data bases. pp. 311- 322 ,(1995)
Jennifer Widom, Shivnath Babu, Lakshminarayan Subramanian, A Data Stream Management System for Network Traffic Management ,(2001)
Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk, Mining database structure; or, how to build a data quality browser Proceedings of the 2002 ACM SIGMOD international conference on Management of data - SIGMOD '02. pp. 240- 251 ,(2002) , 10.1145/564691.564719
A. Feldmann, A. Greenberg, C. Lund, N. Reingold, J. Rexford, NetScope: traffic engineering for IP networks IEEE Network. ,vol. 14, pp. 11- 19 ,(2000) , 10.1109/65.826367
Geoff Hulten, Laurie Spencer, Pedro Domingos, Mining time-changing data streams knowledge discovery and data mining. pp. 97- 106 ,(2001) , 10.1145/502512.502529
J. Bunge, M. Fitzpatrick, Estimating the Number of Species: A Review Journal of the American Statistical Association. ,vol. 88, pp. 364- 373 ,(1993) , 10.1080/01621459.1993.10594330
Philippe Flajolet, G. Nigel Martin, Probabilistic counting algorithms for data base applications Journal of Computer and System Sciences. ,vol. 31, pp. 182- 209 ,(1985) , 10.1016/0022-0000(85)90041-8
Chuck Cranor, Yuan Gao, Theodore Johnson, Vlaidslav Shkapenyuk, Oliver Spatscheck, Gigascope Proceedings of the 2002 ACM SIGMOD international conference on Management of data - SIGMOD '02. pp. 623- 623 ,(2002) , 10.1145/564691.564777