Dimensionality reduction for k-means clustering

作者: Cameron N. Musco

DOI:

关键词: k-means clusteringHeuristicsPrincipal component analysisDimensionality reductionCluster analysisRandom projectionTheoretical computer scienceData scienceSupervisorSketchComputer science

摘要: In this thesis we study dimensionality reduction techniques for approximate k-means clustering. Given a large dataset, consider how to quickly compress smaller dataset (a sketch), such that solving the clustering problem on sketch will give an approximately optimal solution original dataset. First, provide exposition of technical results [CEM15], which show provably accurate is possible using common as principal component analysis, random projection, and sampling. We next present empirical evaluations supplement our theoretical results. algorithms, along with heuristics based these indeed perform well in practice. Finally, discuss extensions work neurally plausible algorithms reduction. This joint Michael Cohen, Samuel Elder, Nancy Lynch, Christopher Musco, Madalina Persu. Thesis Supervisor: A. Lynch Title: Professor Electrical Engineering Computer Science

参考文章(81)
Dimitris Papailiopoulos, Alexandros Dimakis, Megasthenis Asteris, Nonnegative Sparse PCA with Provable Guarantees international conference on machine learning. pp. 1728- 1736 ,(2014)
Sid Ray, Rose H Turi, Determination of Number of Clusters in K-Means Clustering and Application in Colour Image Segmentation international conference on advances in pattern recognition. pp. 137- 143 ,(2000)
Michael B. Cohen, Jelani Nelson, David P. Woodruff, Optimal approximate matrix product in terms of stable rank arXiv: Data Structures and Algorithms. ,(2015)
Jan-Philipp W. Kappmeier, Daniel R. Schmidt, Melanie Schmidt, Solving k-means on High-Dimensional Big Data symposium on experimental and efficient algorithms. pp. 259- 270 ,(2015) , 10.1007/978-3-319-20086-6_20
Mina Ghashami, Edo Liberty, Jeff M. Phillips, David P. Woodruff, Frequent Directions: Simple and Deterministic Matrix Sketching SIAM Journal on Computing. ,vol. 45, pp. 1762- 1792 ,(2016) , 10.1137/15M1009718
Rosa I. Arriaga, Santosh Vempala, An algorithmic theory of learning: robust concepts and random projection foundations of computer science. ,vol. 63, pp. 161- 182 ,(1999) , 10.1007/S10994-006-6265-7
Venkatesan Guruswami, Ali Kemal Sinop, Optimal column-based low-rank matrix reconstruction symposium on discrete algorithms. ,vol. 2012, pp. 1207- 1214 ,(2012) , 10.5555/2095116.2095211
Andrea Vattani, Kamalika Chaudhuri, Sanjoy Dasgupta, Learning Mixtures of Gaussians Using the k-Means Algorithm arXiv: Learning. ,(2009)
Lloyd N. Trefethen, David Bau, Numerical Linear Algebra ,(1997)