Streaming techniques for statistical modeling

作者: S. Muthukrishnan , Yihua Wu

DOI: 10.7282/T33T9HM0

关键词: Data streamStatistical modelData stream miningChange detectionStatistical parameterData miningStatistical assumptionSequential probability ratio testOnline modelComputer science

摘要: Streaming is an important paradigm for handling high-speed data sets that are too large to fit in main memory. Prior work streams has shown how estimate simple statistical parameters, such as histograms, heavy hitters, frequent moments, etc., on streams. This dissertation focuses a number of more sophisticated analyses performed near real-time, using limited resources. First, we present model stream parametrically; particular, hierarchical (binomial multifractal) and non-hierarchical (Pareto) power-law models stream. It yields algorithms fast, space-efficient, provide accuracy guarantees. We also design fast methods perform online validation at streaming speeds. The second contribution this addresses the problem modeling individual’s behaviors via “signature” nodes communication graphs. develop formal framework usage signatures graphs identify fundamental properties natural signature schemes. justify these by showing they impact set applications. then explore several schemes our evaluate them real terms properties. provides insights into suitable desired Finally, studies detection changes with unknown distributions. adapt sound method sequential probability ratio test case, without independence assumption. The resulting algorithm works seamlessly window limitations inherent prior work, highly effective detecting quickly. Furthermore, formulate extend solution local change not been addressed earlier. As concrete applications techniques, complement analytic algorithmic results experiments network traffic demonstrate practicality line speeds, potential power techniques mining.

参考文章(80)
W. Willinger, V. Paxson, WHERE MATHEMATICS MEETS THE INTERNET Notices of the American Mathematical Society. ,vol. 45, pp. 961- 970 ,(1998)
Benoit B. Mandelbrot, Fractals and Scaling in Finance Springer New York. ,(1997) , 10.1007/978-1-4757-2763-0
Graham Cormode, S. Muthukrishnan, Summarizing and mining Skewed data streams siam international conference on data mining. pp. 44- 55 ,(2005)
Alberto Belussi, Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension very large data bases. pp. 299- 310 ,(1995)
Moses Charikar, Kevin Chen, Martin Farach-Colton, Finding Frequent Items in Data Streams international colloquium on automata languages and programming. ,vol. 312, pp. 693- 703 ,(2002) , 10.1016/S0304-3975(03)00400-6
Peter L. Bartlett, Shai Ben-David, Sanjeev R. Kulkarni, Learning Changing Concepts by Exploiting the Structure of Change Machine Learning. ,vol. 41, pp. 153- 174 ,(2000) , 10.1023/A:1007604202679
Paul Barford, Azer Bestavros, Adam Bradley, Mark Crovella, Changes in Web client access patterns: Characteristics and caching implications World Wide Web. ,vol. 2, pp. 15- 28 ,(1999) , 10.1023/A:1019236319752
Jeffrey Baumes, M Goldberg, Mykola Hayvanovych, Malik Magdon-Ismail, William Wallace, Mohammed Zaki, Finding Hidden Group Structure in a Stream of Communications Intelligence and Security Informatics. pp. 201- 212 ,(2006) , 10.1007/11760146_18
João Gama, Pedro Medas, Gladys Castillo, Pedro Rodrigues, Learning with Drift Detection Advances in Artificial Intelligence – SBIA 2004. pp. 286- 295 ,(2004) , 10.1007/978-3-540-28645-5_29
Corinna Cortes, Daryl Pregibon, Signature-Based Methods for Data Streams Data Mining and Knowledge Discovery. ,vol. 5, pp. 167- 182 ,(2001) , 10.1023/A:1011464915332