Authors: Gregory John Valiant, Christos Papadimitriou
DOI:
Keywords:
Abstract: In this dissertation, we apply the computational perspective to three basic statistical questions which underlie and abstract several of the challenges encountered in the analysis of today's large datasets.

Estimating Statistical Properties: Given a sample drawn from an unknown distribution, and a specific property of the distribution that we hope to estimate, how should one compute the estimate, and what sample size is necessary to guarantee that, with high probability, the computed estimate is accurate? We focus on a natural class of properties, which includes the number of distinct elements, the entropy, and distance metrics between pairs of distributions, including the total variation distance (also known as statistical distance or ℓ1 distance). Such properties are easy to estimate if the sample size is large in comparison to the complexity of the underlying distribution, but what can be done given relatively few samples? We show that sublinear estimation is possible: for any constant ε > 0, the above tasks can be accomplished to within accuracy ε using a sample of size O(n / log n), with probability of success 1 − o(1). Additionally, we prove some results about the algorithmic structure of optimal estimators, and provide experimental evidence suggesting that our estimators perform very well in practice. Complementing these positive results, we prove matching information-theoretic lower bounds, establishing the sample complexity of these tasks up to constant factors. Previously, no explicit sublinear-sample estimators had been described for these tasks.

Finding Correlations and Identifying Relevant Variables: Perhaps the most basic type of structure that can be present in a dataset is correlation. How much computation is required to find correlated variables? One can certainly brute-force search through all pairs of variables, since for each pair the correlation can be estimated efficiently. But is there a sub-quadratic time algorithm for finding correlated variables? More generally, suppose one has a dataset in which the label of each data point is given as a function of a small number of the variables. If there are n variables, perhaps only a small number of them, k = 3, 4, 5, ..., are relevant and used to predict the labels; such a labeling function is termed a k-junta. How quickly can the relevant variables be found? As above, one could simply search over all possible subsets of size k, taking time roughly O(n^k). Can this be done significantly more efficiently?

Learning Mixtures of Gaussians: A mixture model (with, for example, two components) generates each data point via the following process: with probability w1, the point is drawn from the first component distribution p1, and with the remaining probability 1 − w1 it is drawn from the second component distribution p2.
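The two-component generative process described above can be sketched in a few lines. This is an illustrative sketch only, not code from the dissertation; the function and parameter names (sample_mixture, w1, p1, p2) are chosen to mirror the abstract's notation.

```python
import random

def sample_mixture(w1, p1, p2, m, seed=0):
    """Draw m points from a two-component mixture: with probability w1
    draw a point from component p1, otherwise from component p2.
    p1 and p2 are zero-argument callables returning one sample each."""
    rng = random.Random(seed)  # coin flips selecting the component
    return [p1() if rng.random() < w1 else p2() for _ in range(m)]

# Example: a univariate mixture of two Gaussians, N(0, 1) and N(5, 1),
# with mixing weight w1 = 0.3 on the first component.
comp_rng = random.Random(1)
data = sample_mixture(0.3,
                      lambda: comp_rng.gauss(0.0, 1.0),
                      lambda: comp_rng.gauss(5.0, 1.0),
                      1000)
```

With well-separated components, as here, roughly a w1 fraction of the sample falls near the first component's mean; the hard case studied in the dissertation is when the components overlap and no such clustering is visible.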
Given a sample generated by such a process, can one efficiently deduce the components p1 and p2 of the mixture? Can one accurately cluster the sample points according to the component from which they originated? In the special case in which each component p1, p2 is a Gaussian, this is the problem of learning a Gaussian mixture model (GMM), which is, perhaps, the most natural (and practically relevant) starting point for tackling the general question of recovering mixtures of general families of distributions. To obtain a handle on this problem, we describe an algorithm which, given a sample from a GMM, provably returns accurate estimates of the components, with runtime polynomial in the relevant parameters: the dimension of the space and the inverse of the desired accuracy of the recovered components. Previously, no such algorithm was known, even for univariate mixtures with just two components. (Abstract shortened by UMI.)
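As context for the correlation question above, the quadratic brute-force baseline (estimate the correlation of every pair of variables and keep the best pair) can be sketched as follows. This is an illustrative sketch of the baseline the abstract contrasts against, not the dissertation's sub-quadratic algorithm; the function name most_correlated_pair is hypothetical.

```python
def most_correlated_pair(X):
    """Scan all O(n^2) pairs of variables (columns of the sample matrix X,
    a list of rows) and return the pair with the largest absolute Pearson
    correlation. Assumes no column is constant."""
    m = len(X)        # number of sample points
    n = len(X[0])     # number of variables

    def corr(i, j):
        xi = [row[i] for row in X]
        xj = [row[j] for row in X]
        mi, mj = sum(xi) / m, sum(xj) / m
        cov = sum((a - mi) * (b - mj) for a, b in zip(xi, xj))
        si = sum((a - mi) ** 2 for a in xi) ** 0.5
        sj = sum((b - mj) ** 2 for b in xj) ** 0.5
        return cov / (si * sj)

    return max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda pair: abs(corr(*pair)))

# Example: column 2 is exactly twice column 0, so (0, 2) is the best pair.
X = [[1, 5, 2], [2, 1, 4], [3, 9, 6], [4, 2, 8]]
best = most_correlated_pair(X)
```

Each pairwise correlation costs O(m), so the whole scan is O(n^2 · m); the question posed in the abstract is whether the n^2 factor can be beaten.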