作者: Piotr Indyk , Andrew McGregor
DOI:
关键词:
摘要: We consider the problem of identifying correlations in data streams. Surprisingly, our work seems to be first this natural problem. In centralized model, we a stream pairs (i,j) ∈ [n]2 whose frequencies define joint distribution (X,Y). distributed each coordinate pair may appear separately stream. present range algorithms for approximating what extent X and Y are independent, i.e., how close is product marginals. various measures closeness including e1, e2, mutual information between Y. Our based on "sketching sketches", composing small-space linear synopses distributions. Perhaps ironically, biggest technical challenges that arise relate ensuring different components estimates sufficiently independent.