Estimating the number of clusters in a data set via the gap statistic

Authors: Robert Tibshirani, Guenther Walther, Trevor Hastie

DOI: 10.1111/1467-9868.00293

Keywords:

Abstract: We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal, and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
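The abstract does not spell out the computation, so the following is only a minimal sketch of the comparison it describes, under several assumptions that are not taken from the text above: K-means (a plain Lloyd's iteration with random restarts) as the clustering algorithm, a uniform distribution over the data's bounding box as the reference null, Gap(k) taken as the mean reference log-dispersion minus the observed log-dispersion, and the commonly cited selection rule of picking the smallest k with Gap(k) >= Gap(k+1) - s(k+1). All function and parameter names (`kmeans`, `log_dispersion`, `best_log_w`, `gap_statistic`, `n_ref`, `n_init`) are hypothetical.

```python
import math
import random


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def log_dispersion(points, labels, k):
    """log W_k: log of the pooled squared distance to each cluster mean."""
    w = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            mean = tuple(sum(c) / len(members) for c in zip(*members))
            w += sum(math.dist(p, mean) ** 2 for p in members)
    return math.log(w)


def best_log_w(points, k, rng, n_init=10):
    """Lowest log W_k over several random restarts of K-means."""
    return min(log_dispersion(points, kmeans(points, k, rng), k)
               for _ in range(n_init))


def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Return (estimated number of clusters, [Gap(1), ..., Gap(k_max)])."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = best_log_w(points, k, rng)
        ref = []
        for _ in range(n_ref):
            # Assumed reference null: uniform over the bounding box of the data.
            sample = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                      for _ in points]
            ref.append(best_log_w(sample, k, rng))
        mean_ref = sum(ref) / n_ref
        sd = math.sqrt(sum((r - mean_ref) ** 2 for r in ref) / n_ref)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_ref))
    for k in range(1, k_max):
        # gaps[k - 1] is Gap(k); gaps[k] and errs[k] belong to k + 1.
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps
```

A call such as `gap_statistic(data, k_max=5)`, where `data` is a list of equal-length coordinate tuples, returns the estimated number of clusters together with the gap values for k = 1 to 5. The sketch uses only the standard library so that it stays self-contained; in practice any clustering routine could be substituted for `kmeans`, since the comparison only needs the resulting labels.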

References (16)
A. D. Gordon, Null Models in Cluster Validation. Springer, Berlin, Heidelberg, pp. 32-44 (1996), 10.1007/978-3-642-79999-0_3
Richard A Olshen, Charles J Stone, Leo Breiman, Jerome H Friedman, Classification and regression trees (1983)
Glenn W. Milligan, Martha C. Cooper, An examination of procedures for determining the number of clusters in a data set. Psychometrika, vol. 50, pp. 159-179 (1985), 10.1007/BF02294245
Wai Chan, Sudhakar Dharmadhikari, Kumar Joag-Dev, Unimodality, convexity, and applications. Journal of the American Statistical Association, vol. 84, pp. 846- (1989), 10.2307/2289694
Douglas T Ross, Uwe Scherf, Michael B Eisen, Charles M Perou, Christian Rees, Paul Spellman, Vishwanath Iyer, Stefanie S Jeffrey, Matt Van de Rijn, Mark Waltham, Alexander Pergamenschikov, JC Lee, Deval Lashkari, Dari Shalon, Timothy G Myers, John N Weinstein, David Botstein, Patrick O Brown, Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, vol. 24, pp. 227-235 (2000), 10.1038/73432
Kathryn Roeder, A Graphical Technique for Determining the Number of Components in a Mixture of Normals. Journal of the American Statistical Association, vol. 89, pp. 487-495 (1994), 10.1080/01621459.1994.10476772
Chris Fraley, Adrian E Raftery, How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis. The Computer Journal, vol. 41, pp. 578-588 (1998), 10.1093/COMJNL/41.8.578
F. H. C. Marriott, Practical problems in a method of cluster analysis. Biometrics, vol. 27, pp. 501-514 (1971), 10.2307/2528592
T. Calinski, J. Harabasz, A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, vol. 3, pp. 1-27 (1974), 10.1080/03610927408827101
A. J. Scott, M. J. Symons, Clustering methods based on likelihood ratio criteria. Biometrics, vol. 27, pp. 387- (1971), 10.2307/2529003