Declaring independence via the sketching of sketches

作者: Piotr Indyk , Andrew McGregor

DOI:

关键词:

摘要: We consider the problem of identifying correlations in data streams. Surprisingly, our work seems to be first this natural problem. In centralized model, we a stream pairs (i,j) ∈ [n]2 whose frequencies define joint distribution (X,Y). distributed each coordinate pair may appear separately stream. present range algorithms for approximating what extent X and Y are independent, i.e., how close is product marginals. various measures closeness including e1, e2, mutual information between Y. Our based on "sketching sketches", composing small-space linear synopses distributions. Perhaps ironically, biggest technical challenges that arise relate ensuring different components estimates sufficiently independent.

参考文章(27)
T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, P. White, Testing random variables for independence and identity international conference on cluster computing. pp. 442- 451 ,(2001) , 10.1109/SFCS.2001.959920
Sudipto Guha, Andrew McGregor, Lower Bounds for Quantile Estimation in Random-Order and Multi-pass Streaming Automata, Languages and Programming. pp. 704- 715 ,(2007) , 10.1007/978-3-540-73420-8_61
Christopher D. Manning, Hinrich Schütze, Foundations of Statistical Natural Language Processing ,(1999)
Sudipto Guha, Piotr Indyk, Andrew McGregor, Sketching information divergences conference on learning theory. pp. 424- 438 ,(2007) , 10.1007/978-3-540-72927-3_31
Kevin L. Chang, Ravi Kannan, The space complexity of pass-efficient algorithms for clustering symposium on discrete algorithms. pp. 1157- 1166 ,(2006) , 10.5555/1109557.1109685
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom, Models and issues in data stream systems symposium on principles of database systems. pp. 1- 16 ,(2002) , 10.1145/543613.543615
Ziv Bar-Yossef, D. Sivakumar, Ravi Kumar, Reductions in streaming algorithms, with an application to counting triangles in graphs symposium on discrete algorithms. pp. 623- 632 ,(2002) , 10.5555/545381.545464
Moses Charikar, Chandra Chekuri, Tomas Feder, Rajeev Motwani, Incremental Clustering and Dynamic Information Retrieval SIAM Journal on Computing. ,vol. 33, pp. 1417- 1440 ,(2004) , 10.1137/S0097539702418498
Siddharth Suri, Andrew McGregor, Sampath Kannan, Joan Feigenbaum, Jian Zhang, Graph distances in the streaming model: the value of space symposium on discrete algorithms. pp. 745- 754 ,(2005) , 10.5555/1070432.1070537