CURE

作者: Sudipto Guha , Rajeev Rastogi , Kyuseok Shim

DOI: 10.1145/276304.276312

关键词: Data stream clusteringCURE data clustering algorithmCluster (physics)k-medians clusteringCluster analysisOutlierComputer scienceCorrelation clusteringSingle-linkage clusteringDatabase

摘要: Clustering, in data mining, is useful for discovering groups and identifying interesting distributions the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes similar sizes, or are very fragile presence of outliers. We propose a new algorithm called CURE that more robust to outliers, identifies having non-spherical wide variances size. achieves this by representing each cluster certain fixed number points generated selecting well scattered from then shrinking them toward center specified fraction. Having than one representative point per allows adjust geometry helps dampen effects To handle large databases, employs combination random sampling partitioning. A sample drawn set first partitioned partition partially clustered. The partial clustered second pass yield desired clusters. Our experimental results confirm quality produced much better those found existing algorithms. Furthermore, they demonstrate partitioning enable not only outperform but also scale databases without sacrificing quality.

参考文章(14)
Raymond T. Ng, Jiawei Han, Efficient and Effective Clustering Methods for Spatial Data Mining very large data bases. pp. 144- 155 ,(1994)
Hannu Toivonen, Sampling Large Databases for Association Rules very large data bases. pp. 134- 145 ,(1996)
Richard C. Dubes, Anil K. Jain, Algorithms for clustering data ,(1988)
Jerome H. Friedman, Jon Louis Bentley, Raphael Ari Finkel, An Algorithm for Finding Best Matches in Logarithmic Expected Time ACM Transactions on Mathematical Software. ,vol. 3, pp. 209- 226 ,(1977) , 10.1145/355744.355745
Clark F. Olson, Parallel algorithms for hierarchical clustering parallel computing. ,vol. 21, pp. 1313- 1325 ,(1995) , 10.1016/0167-8191(95)00017-I
Tian Zhang, Raghu Ramakrishnan, Miron Livny, BIRCH: an efficient data clustering method for very large databases international conference on management of data. ,vol. 25, pp. 103- 114 ,(1996) , 10.1145/233269.233324
Hans-Peter Kriegel, Martin Ester, Xiaowei Xu, A database interface for clustering in large spatial databases knowledge discovery and data mining. pp. 94- 99 ,(1995)
Jeffrey S. Vitter, Random sampling with a reservoir ACM Transactions on Mathematical Software. ,vol. 11, pp. 37- 57 ,(1985) , 10.1145/3147.3165
Nick Roussopoulos, Timos K. Sellis, Christos Faloutsos, The R+-Tree: A Dynamic Index for Multi-Dimensional Objects very large data bases. pp. 507- 518 ,(1987) , 10.1184/R1/6610748.V1