Authors: S. Muthukrishnan, Yihua Wu
DOI: 10.7282/T33T9HM0
Keywords: Data stream, Statistical model, Data stream mining, Change detection, Statistical parameter, Data mining, Statistical assumption, Sequential probability ratio test, Online model, Computer science
Abstract: Streaming is an important paradigm for handling high-speed data sets that are too large to fit in main memory. Prior work on data streams has shown how to estimate simple statistical parameters, such as histograms, heavy hitters, and frequency moments, on streams. This dissertation focuses on a number of more sophisticated statistical analyses that are performed in near real-time, using limited resources. First, we present how to model stream data parametrically, in particular with hierarchical (binomial multifractal) and non-hierarchical (Pareto) power-law models on a data stream. This yields algorithms that are fast, space-efficient, and provide accuracy guarantees. We also design fast methods to perform online model validation at streaming speeds. The second contribution of this dissertation addresses the problem of modeling an individual’s behaviors via “signatures” for nodes in communication graphs. We develop a formal framework for the usage of signatures on communication graphs and identify fundamental properties that are natural to signature schemes. We justify these properties by showing how they impact a set of applications. We then explore several signature schemes in our framework and evaluate them on real data in terms of these properties. This provides insights into suitable signature schemes for desired applications. Finally, this dissertation studies the detection of changes in models on data streams with unknown distributions. We adapt the sound statistical method of the sequential probability ratio test to the online streaming case, without an independence assumption. The resulting algorithm works seamlessly without the window limitations inherent in prior work, and is highly effective at detecting changes quickly. Furthermore, we formulate and extend our solution to the local change detection problem, which has not been addressed earlier. As concrete applications of our techniques, we complement our analytic and algorithmic results with experiments on network traffic data to demonstrate the practicality of our methods at line speeds, and the potential power of streaming techniques for statistical modeling in data mining.