Algorithmic approaches to statistical questions

作者: Gregory John Valiant , Christos Papadimitriou

DOI:

关键词:

摘要: In this dissertation, we apply the computational perspective to three basic statistical questions which underlie and abstract several of challenges encountered in analysis today's large datasets. Estimating Statistical Properties Given a sample drawn from an unknown distribution, specific property distribution that hope estimate, how should one compute what size is necessary guarantee with high probability, computed estimate accurate? We focus on natural class properties, includes number distinct elements, entropy, distance metrics between pairs distributions, including total variational (also known as or e1 distance). Such properties are easy if comparison complexity underlying but can be done given relatively few samples? We show sublinear estimation possible: for any constant e > 0, above tasks accomplished using samples O nlogn , probability success 1 – o(1). Additionally, prove some results about algorithmic structure optimal estimators, provide experimental evidence suggesting our estimators perform very well practice. Complementing these positive results, matching information theoretic lower bounds, establishing up factors. Previously, no explicit had been described tasks. Finding Correlations Identifying Relevant Variables: Perhaps most type present dataset correlation. How much computation required find correlated variables? One certainly brute-force search through all variables, each pair, correlation estimated efficiently. But there sub-quadratic time algorithm finding More generally, suppose has data set where label function small variables. If have n perhaps number, k = 3, 4, 5,..., relevant variables used predict labels. termed k-junta . quickly As above, could simply over possible subsets k, taking roughly O(nk). Can significantly more efficiently? Learning Mixtures Gaussians: A mixture model (with, example, two components) generated via following process: point, w1, point p1, remaining 1-w1 second p2. Supposing such efficiently deduce components, p1 p2 mixture? accurately cluster points according they originated? special case component, p 2 Gaussian problem learning model, is, perhaps, (and practically relevant) starting tackling question recovering mixtures general families distributions. obtain handle problem, describe which, GMM provably returns accurate estimates runtime polynomial parameters—the dimension space, inverse desired accuracy recovered components. was known, even univariate just (Abstract shortened by UMI.)

参考文章(136)
B. Efron, C. Stein, The Jackknife Estimate of Variance Annals of Statistics. ,vol. 9, pp. 586- 596 ,(1981) , 10.1214/AOS/1176345462
Zvika Brakerski, Vinod Vaikuntanathan, Efficient Fully Homomorphic Encryption from (Standard) LWE 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science. pp. 97- 106 ,(2011) , 10.1109/FOCS.2011.12
Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, Rainer Gemulla, On synopses for distinct-value estimation under multiset operations Proceedings of the 2007 ACM SIGMOD international conference on Management of data - SIGMOD '07. pp. 199- 210 ,(2007) , 10.1145/1247480.1247504
F. Gotze, On the Rate of Convergence in the Multivariate CLT Annals of Probability. ,vol. 19, pp. 724- 739 ,(1991) , 10.1214/AOP/1176990448
Michael Kearns, Efficient noise-tolerant learning from statistical queries Journal of the ACM. ,vol. 45, pp. 983- 1006 ,(1998) , 10.1145/293347.293351
S. Meiser, Point Location in Arrangements of Hyperplanes Information & Computation. ,vol. 106, pp. 286- 303 ,(1993) , 10.1006/INCO.1993.1057
Bero Roos, On the Rate of Multivariate Poisson Convergence Journal of Multivariate Analysis. ,vol. 69, pp. 120- 134 ,(1999) , 10.1006/JMVA.1998.1789
L. Paninski, Estimating entropy on m bins given fewer than m samples IEEE Transactions on Information Theory. ,vol. 50, pp. 2200- 2203 ,(2004) , 10.1109/TIT.2004.833360
D. J. Thompson, D. G. Horvitz, A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. ,vol. 47, pp. 663- 685 ,(1952) , 10.2307/2280784
Paul Valiant, Testing Symmetric Properties of Distributions SIAM Journal on Computing. ,vol. 40, pp. 1927- 1968 ,(2011) , 10.1137/080734066