Clustering categorical data: A stability analysis framework

Authors: I.H. Jarman, T.A. Etchells, P.J.G. Lisboa, C.M. Beynon, J.D. Martin-Guerrero

DOI: 10.1109/CIDM.2011.5949452

Abstract: Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but k-means is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which creates the difficulty of selecting the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect of the mean. This is often the case with real-world datasets, for instance in the domain of Public Health, resulting in solutions that can be radically different depending on the initialization and therefore lead to different interpretations. This paper presents two methodologies. The first addresses sensitivity to initializations using a generic landscape mapping of k-modes solutions. The second methodology utilizes the landscape map to stabilize the discrete solutions by drawing a consensus sample in order to separate signal from noise components. Results are presented for the benchmark soybean disease dataset, an artificially generated dataset and a case study involving Public Health data.
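
To make the initialization sensitivity mentioned in the abstract concrete, the following is a minimal sketch of plain k-modes with Hamming distance, run from several random seeds. It is not the authors' landscape mapping or consensus method. The function names (hamming, mode_of, k_modes), the toy records, and the choice of k = 2 with 20 seeds are all assumptions made for illustration.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(records):
    """Attribute-wise mode of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(data, k, seed, max_iter=100):
    """Plain k-modes; returns (labels, total within-cluster Hamming cost)."""
    rng = random.Random(seed)
    modes = rng.sample(data, k)  # initial prototypes drawn from the data
    labels = None
    for _ in range(max_iter):
        # Assign each record to its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: hamming(r, modes[j])) for r in data
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each mode from its current members.
        for j in range(k):
            members = [r for r, lab in zip(data, labels) if lab == j]
            if members:  # keep the old mode if a cluster empties
                modes[j] = mode_of(members)
    cost = sum(hamming(r, modes[lab]) for r, lab in zip(data, labels))
    return labels, cost


# Hypothetical toy data: two loose groups plus two ambiguous records.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"), ("a", "x", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("b", "w", "r"), ("b", "z", "r"),
    ("a", "z", "r"), ("b", "x", "p"),
]

# Tally the final cost reached from each of 20 different initializations.
costs = Counter(k_modes(data, k=2, seed=s)[1] for s in range(20))
print(costs)
```

If the tally contains more than one distinct cost, different initializations have converged to different local solutions, which is the difficulty the abstract describes. Because the mode is a discrete statistic, a single changed record can flip an attribute of a prototype outright, whereas a mean would shift only slightly.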
