CURE

作者： Sudipto Guha , Rajeev Rastogi , Kyuseok Shim

关键词: Data stream clustering 、 CURE data clustering algorithm 、 Cluster (physics) 、 k-medians clustering 、 Cluster analysis 、 Outlier 、 Computer science 、 Correlation clustering 、 Single-linkage clustering 、 Database

摘要: Clustering, in data mining, is useful for discovering groups and identifying interesting distributions the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes similar sizes, or are very fragile presence of outliers. We propose a new algorithm called CURE that more robust to outliers, identifies having non-spherical wide variances size. achieves this by representing each cluster certain fixed number points generated selecting well scattered from then shrinking them toward center specified fraction. Having than one representative point per allows adjust geometry helps dampen effects To handle large databases, employs combination random sampling partitioning. A sample drawn set first partitioned partition partially clustered. The partial clustered second pass yield desired clusters. Our experimental results confirm quality produced much better those found existing algorithms. Furthermore, they demonstrate partitioning enable not only outperform but also scale databases without sacrificing quality.

参考文章(14)

Raymond T. Ng, Jiawei Han, Efficient and Effective Clustering Methods for Spatial Data Mining very large data bases. pp. 144- 155 ,(1994)

Hannu Toivonen, Sampling Large Databases for Association Rules very large data bases. pp. 134- 145 ,(1996)

Richard C. Dubes, Anil K. Jain, Algorithms for clustering data ,(1988)

Jerome H. Friedman, Jon Louis Bentley, Raphael Ari Finkel, An Algorithm for Finding Best Matches in Logarithmic Expected Time ACM Transactions on Mathematical Software. ,vol. 3, pp. 209- 226 ,(1977) , 10.1145/355744.355745

Clark F. Olson, Parallel algorithms for hierarchical clustering parallel computing. ,vol. 21, pp. 1313- 1325 ,(1995) , 10.1016/0167-8191(95)00017-I

Hanan Samet, The design and analysis of spatial data structures ,(1989)

Tian Zhang, Raghu Ramakrishnan, Miron Livny, BIRCH: an efficient data clustering method for very large databases international conference on management of data. ,vol. 25, pp. 103- 114 ,(1996) , 10.1145/233269.233324

Hans-Peter Kriegel, Martin Ester, Xiaowei Xu, A database interface for clustering in large spatial databases knowledge discovery and data mining. pp. 94- 99 ,(1995)

Jeffrey S. Vitter, Random sampling with a reservoir ACM Transactions on Mathematical Software. ,vol. 11, pp. 37- 57 ,(1985) , 10.1145/3147.3165

10.

Nick Roussopoulos, Timos K. Sellis, Christos Faloutsos, The R+-Tree: A Dynamic Index for Multi-Dimensional Objects very large data bases. pp. 507- 518 ,(1987) , 10.1184/R1/6610748.V1

CURE

来源期刊

我的账户

CURE

来源期刊

相似文章 10

我的账户