Authors: Vladimir Braverman, Brian Tagiku, Rafail Ostrovsky, Alan Roytman, Adam Meyerson
Keywords:
Abstract: One of the central problems in data analysis is k-means clustering. In recent years, considerable attention in the literature has been devoted to the streaming variant of this problem, culminating in a series of results (Har-Peled and Mazumdar; Frahling and Sohler; Frahling, Monemizadeh, Chen) that produced a (1 + ε)-approximation for k-means clustering in the streaming setting. Unfortunately, since optimizing the k-means objective is Max-SNP hard, all algorithms that achieve a (1 + ε)-approximation must take time exponential in k unless P = NP. Thus, to avoid an exponential dependence on k, some additional assumptions must be made to guarantee a high-quality approximation in polynomial running time. A paper of Ostrovsky, Rabani, Schulman, and Swamy (FOCS 2006) introduced the very natural assumption of data separability: this assumption closely reflects how k-means is used in practice, and it allowed the authors to obtain a high-quality approximation in the non-streaming setting even for large values of k. Their work left open an important question: are similar results possible in the streaming setting? This is the question we answer in this paper, albeit using substantially different techniques. We show a near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space using sublinear memory and a single pass, under the same separability assumption. Our algorithm offers significant improvements in both space and running time over previous work, while yielding asymptotically best-possible performance (assuming the running time must be fully polynomial and P ≠ NP). The novel techniques we develop along the way imply a number of additional results: we provide a high-probability guarantee for online facility location (in contrast, Meyerson's FOCS 2001 algorithm gave bounds only in expectation); we obtain a constant-approximation method for a general class of semi-metric clustering problems; we improve (even without σ-separability) by a logarithmic factor the space requirements of a streaming constant-approximation for k-median; and finally we design a "re-sampling method" that, in a streaming setting, converts any constant approximation for clustering into a [1 + O(σ²)]-approximation for σ-separable data.
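As a small illustration (not from the paper itself), the separability assumption of Ostrovsky et al. can be stated as: the optimal k-means cost Δ_k is much smaller than the optimal (k-1)-means cost, i.e. Δ_k ≤ σ² Δ_{k-1} for small σ. The sketch below checks this condition exactly on tiny one-dimensional data, exploiting the fact that in 1-D the optimal clusters are contiguous intervals of the sorted points; the function names are illustrative, and brute force over split positions is used only because the toy input is small.

```python
# Sketch of the sigma-separability condition (assumed formulation:
# Delta_k <= sigma^2 * Delta_{k-1}), checked exactly on tiny 1-D data.
from itertools import combinations


def interval_cost(pts):
    # k-means cost of a single cluster: sum of squared distances to its mean.
    m = sum(pts) / len(pts)
    return sum((p - m) ** 2 for p in pts)


def optimal_kmeans_cost(points, k):
    # In 1-D, optimal k-means clusters are contiguous intervals of the
    # sorted points, so we brute-force all k-1 split positions.
    pts = sorted(points)
    n = len(pts)
    best = float("inf")
    for splits in combinations(range(1, n), k - 1):
        bounds = (0,) + splits + (n,)
        cost = sum(interval_cost(pts[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, cost)
    return best


if __name__ == "__main__":
    # Two well-separated groups: Delta_2 is tiny relative to Delta_1,
    # so the data is sigma-separable for a small sigma.
    data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
    delta_2 = optimal_kmeans_cost(data, 2)
    delta_1 = optimal_kmeans_cost(data, 1)
    print(delta_2 / delta_1)  # ratio sigma^2, far below 1 here
```

On such well-clusterable inputs the ratio Δ_2/Δ_1 is tiny, which is exactly the regime in which the paper's guarantees apply.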