reval: a Python package to determine the best number of clusters with stability-based relative clustering validation.

Authors: Michael V. Lombardo, Isotta Landi, Veronica Mandelli

Abstract: Determining the number of clusters that best partitions a dataset can be a challenging task because of 1) the lack of a priori information within an unsupervised learning framework; and 2) the absence of a unique clustering validation approach to evaluate clustering solutions. Here we present reval: a Python package that leverages stability-based relative clustering validation methods to determine the best clustering solutions. Statistical software, both in R and Python, usually relies on internal validation metrics, such as the silhouette index, to select the number of clusters that best fits the data. Meanwhile, open-source software solutions that easily implement relative clustering techniques are lacking. Internal validation methods exploit characteristics of the data itself to produce a result, whereas relative approaches attempt to leverage the unknown underlying distribution of data points, looking for a replicable and generalizable solution. The implementation of relative validation methods can further the theory of clustering by enriching the already available methods that can be used to investigate clustering results in different situations and for different data distributions. This work aims at contributing to this effort by developing a stability-based method that selects the best clustering solution as the one that replicates, via supervised learning, on unseen subsets of data. The package works with multiple clustering and classification algorithms, hence allowing the assessment of the stability of different clustering mechanisms.
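
The idea of choosing the solution that replicates, via supervised learning, on unseen data can be illustrated with a minimal sketch. This is not reval's API or its exact algorithm: it uses only the Python standard library, and every name in it (`kmeans`, `stability_error`, the toy data) is hypothetical. It assumes a single train/test split, a plain k-means clusterer with farthest-first initialisation, and a nearest-centroid classifier, and it omits refinements such as repeated cross-validation or normalisation against random labelings that a full implementation would likely need.

```python
from itertools import permutations
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, iters=100):
    """Lloyd's k-means with deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Keep the old centroid if a cluster lost all its members.
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return [nearest(p, centroids) for p in points], centroids


def misclassification(predicted, clustered, k):
    """Smallest fraction of disagreeing labels over all relabelings of k clusters."""
    best = 1.0
    for perm in permutations(range(k)):
        errors = sum(perm[c] != p for p, c in zip(predicted, clustered))
        best = min(best, errors / len(predicted))
    return best


def stability_error(train, test, k):
    """How badly the training-set clustering fails to replicate on the test set."""
    _, train_centroids = kmeans(train, k)
    # Supervised step: a nearest-centroid classifier whose class means are the
    # training clusters' centroids predicts labels for the unseen points.
    predicted = [nearest(p, train_centroids) for p in test]
    # Independent clustering of the unseen points.
    test_labels, _ = kmeans(test, k)
    return misclassification(predicted, test_labels, k)


# Toy data (an assumption for illustration): three well-separated square blobs.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
    for cx, cy in centers
    for _ in range(30)
]
train, test = points[0::2], points[1::2]  # alternate points between the splits

errors = {k: stability_error(train, test, k) for k in range(2, 6)}
best_k = min(errors, key=errors.get)  # lowest error; ties go to the smallest k
print(errors, best_k)
```

A lower error means the partition found on the training half is reproduced on the held-out half. On easy data, more than one value of k may reach a low error, which is one reason a single split without further normalisation should be read as an illustration only.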
