Authors: Piotr Indyk, Mayur Datar, Graham Cormode, S. Muthukrishnan
DOI:
Keywords: Data stream mining, Computer science, Norm (mathematics), Hamming code, Wireless sensor network, Theoretical computer science, Similarity (geometry), Data stream, Data mining, Scale (descriptive set theory)
Abstract: Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases, and instead must be processed "on the fly" as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams, and hence there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalises ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the "l0 sketch", and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
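To make the two uses of the Hamming norm described in the abstract concrete, the following is a minimal Python sketch that computes it exactly. It is not the "l0 sketch" approximation the authors propose: it keeps one counter per item, which is the memory cost a streaming sketch is meant to avoid. The function names and the `(item, delta)` update format are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter


def stream_counts(stream):
    """Accumulate (item, delta) updates into net per-item counts.

    The (item, delta) update format is an assumption for this example.
    """
    counts = Counter()
    for item, delta in stream:
        counts[item] += delta
    return counts


def hamming_norm(counts):
    """Number of items whose net count is nonzero."""
    return sum(1 for c in counts.values() if c != 0)


def hamming_distance(stream_a, stream_b):
    """Number of items whose net counts differ between two streams."""
    a = stream_counts(stream_a)
    b = stream_counts(stream_b)
    # Counter returns 0 for missing keys, so absent items compare as zero.
    diff = {item: a[item] - b[item] for item in set(a) | set(b)}
    return hamming_norm(diff)


# Single stream: "z" is inserted and then deleted, so it does not count.
stream_a = [("x", 1), ("y", 2), ("x", 1), ("z", 1), ("z", -1)]
# Second stream agrees with the first on "x" only.
stream_b = [("x", 2), ("y", 1), ("w", 3)]

distinct_in_a = hamming_norm(stream_counts(stream_a))  # items "x" and "y"
unequal_counts = hamming_distance(stream_a, stream_b)  # items "y" and "w"
```

Here the single-stream norm counts distinct items with nonzero net count, and the pairwise norm counts items whose counts disagree. Both cases reduce to counting nonzero entries of a vector.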