Streaming and sublinear approximation of entropy and information distances

作者: Suresh Venkatasubramanian , Andrew McGregor , Sudipto Guha

DOI: 10.5555/1109557.1109637

关键词: Bounded functionHellinger distanceSublinear functionMathematicsBinary logarithmUpper and lower boundsEntropy (information theory)Discrete mathematicsProperty testingRandom permutation

摘要: In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard lp norms. this paper we design streaming and sublinear time property testing algorithms for entropy various distances.Batu et al posed the problem of with respect to Jensen-Shannon distance. We present optimal estimating bounded, symmetric f-divergences (including divergence Hellinger distance) between distributions in frameworks. Along way, close a (log n)/H gap upper lower bounds H, yielding an algorithm over all values entropy. data stream setting (sublinear space), give first distribution. Our runs polylogarithmic space yields asymptotic constant factor approximation scheme. An integral part is interesting use F0 (the number distinct elements set) estimation algorithm; also provide other results along space/time/approximation tradeoff curve.Our have structural implications that connect constrained algorithms. The mediating model random order model, assumes input permutation multiset was considered by Munro Paterson 1980. show any combined oracle calculating invariant functions can be simulated single pass. This addresses question raised Feigenbaum regarding relationship Further, polylog-space PTAS one pass stream. bound cannot achieved (generalized testing) model.

参考文章(32)
Christos H. Papadimitriou, Ziv Bar-Yossef, The complexity of massive data set computations University of California at Berkeley. ,(2002)
N. Čencov, Statistical Decision Rules and Optimal Inference Translations of Mathematical#N# Monographs. ,(2000) , 10.1090/MMONO/053
S. M. Ali, S. D. Silvey, A General Class of Coefficients of Divergence of One Distribution from Another Journal of the Royal Statistical Society: Series B (Methodological). ,vol. 28, pp. 131- 142 ,(1966) , 10.1111/J.2517-6161.1966.TB00626.X
Igor Vajda, Friedrich Liese, Convex statistical distances ,(1987)
Moses Charikar, Kevin Chen, Martin Farach-Colton, Finding Frequent Items in Data Streams international colloquium on automata languages and programming. ,vol. 312, pp. 693- 703 ,(2002) , 10.1016/S0304-3975(03)00400-6
Bernard Chazelle, Ronitt Rubinfeld, Luca Trevisan, Approximating the Minimum Spanning Tree Weight in Sublinear Time international colloquium on automata languages and programming. pp. 190- 200 ,(2001) , 10.1007/3-540-48224-5_16
Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro, Frequency Estimation of Internet Packet Streams with Limited Space european symposium on algorithms. pp. 348- 360 ,(2002) , 10.1007/3-540-45749-6_33
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, Luca Trevisan, Counting Distinct Elements in a Data Stream randomization and approximation techniques in computer science. pp. 1- 10 ,(2002) , 10.1007/3-540-45726-7_1