Authors: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim
Keywords: Data stream clustering, CURE data clustering algorithm, Cluster (physics), k-medians clustering, Cluster analysis, Outlier, Computer science, Correlation clustering, Single-linkage clustering, Database
Abstract: Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned, and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.
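The representative-point idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`scattered_points`, `shrink`, `representatives`, `cluster_distance`), the parameter names `c` (number of representatives) and `alpha` (shrink fraction), and the greedy farthest-point heuristic used to pick "well-scattered" points are all assumptions made for illustration. The abstract itself only states that scattered points are selected and then shrunk toward the cluster center.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def scattered_points(points, c):
    """Pick up to c well-scattered points (assumed greedy farthest-point rule).

    The first pick is the point farthest from the centroid; each later pick
    is the point whose nearest already-chosen point is farthest away.
    """
    mean = centroid(points)
    chosen = []
    for _ in range(min(c, len(points))):
        best, best_dist = None, -1.0
        for p in points:
            if p in chosen:
                continue
            if chosen:
                d = min(math.dist(p, q) for q in chosen)
            else:
                d = math.dist(p, mean)
            if d > best_dist:
                best, best_dist = p, d
        if best is None:  # only duplicates of chosen points remain
            break
        chosen.append(best)
    return chosen

def shrink(reps, center, alpha):
    """Move each representative a fraction alpha of the way to the center."""
    return [tuple(x + alpha * (m - x) for x, m in zip(p, center)) for p in reps]

def representatives(points, c, alpha):
    """Scattered points of a cluster, shrunk toward its centroid."""
    return shrink(scattered_points(points, c), centroid(points), alpha)

def cluster_distance(reps_a, reps_b):
    """Distance between clusters, taken here as the closest pair of representatives."""
    return min(math.dist(p, q) for p in reps_a for q in reps_b)

if __name__ == "__main__":
    cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
    print(representatives(cluster, c=4, alpha=0.5))
```

With `alpha = 0` the representatives stay on the cluster boundary, and with `alpha = 1` they all collapse onto the centroid, which is one way to see why an intermediate fraction can follow non-spherical shapes while pulling outlying points inward.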