Streaming k-means on well-clusterable data

作者: Vladimir Braverman , Brian Tagiku , Rafail Ostrovsky , Alan Roytman , Adam Meyerson

DOI: 10.5555/2133036.2133039

关键词:

摘要: One of the central problems in data-analysis is k-means clustering. In recent years, considerable attention literature addressed streaming variant this problem, culminating a series results (Har-Peled and Mazumdar; Frahling Sohler; Frahling, Monemizadeh, Chen) that produced (1 + e)-approximation for clustering setting. Unfortunately, since optimizing objective Max-SNP hard, all algorithms achieve must take time exponential k unless P=NP.Thus, to avoid dependence on k, some additional assumptions be made guarantee high quality approximation polynomial running time. A paper Ostrovsky, Rabani, Schulman, Swamy (FOCS 2006) introduced very natural assumption data separability: closely reflects how used practice allowed authors create high-quality non-streaming setting with even large values k. Their work left open important question: are similar possible setting? This question we answer paper, albeit using substantially different techniques.We show near-optimal algorithm high-dimensional Euclidean space sublinear memory single pass, under same separability assumption. Our offers significant improvements both over previous while yielding asymptotically best-possible performance (assuming fully P ≠ NP).The novel techniques develop along way imply number results: provide high-probability online facility location (in contrast, Meyerson's FOCS 2001 gave bounds only expectation); constant method general class semi-metric problems; improve (even without σ-separability) by logarithmic factor requirements constant-approximation k-median; finally design "re-sampling method" convert any [1 O(σ2)]-approximation σ-separable data.

参考文章(36)
C. Chatfield, J. H. Durran, Statistics and Probability Applied statistics. ,vol. 20, pp. 94- 95 ,(1971) , 10.2307/2346637
Geoffrey H. Ball, David J. Hall, ISODATA, A NOVEL METHOD OF DATA ANALYSIS AND PATTERN CLASSIFICATION Stanford Research Institute. ,(1965)
Steven J. Phillips, Acceleration of K-Means and Related Clustering Algorithms algorithm engineering and experimentation. pp. 166- 177 ,(2002) , 10.1007/3-540-45643-0_13
Avrim Blum, Anupam Gupta, Maria-Florina Balcan, Approximate clustering without the approximation symposium on discrete algorithms. pp. 1068- 1077 ,(2009) , 10.5555/1496770.1496886
Khaled Alsabti, Sanjay Ranka, Vineet Singh, An efficient k-means clustering algorithm ,(1997)
Allen Gersho, Robert M. Gray, Vector Quantization and Signal Compression ,(1991)
Khaled AlSabti, Sanjay Ranka, Vineet Singh, An Efficient Space-Partitioning Based Algorithm for the K-Means Clustering pacific asia conference on knowledge discovery and data mining. pp. 355- 359 ,(1999) , 10.1007/3-540-48912-6_47
A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review ACM Computing Surveys. ,vol. 31, pp. 264- 323 ,(1999) , 10.1145/331499.331504
Walter D. Fisher, On Grouping for Maximum Homogeneity Journal of the American Statistical Association. ,vol. 53, pp. 789- 798 ,(1958) , 10.1080/01621459.1958.10501479