Criterion functions for document clustering

Authors: George Karypis, Ying Zhao, Ding-Zhu Du

DOI:

Keywords:

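Abstract: Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In this thesis, we focus on a class of clustering algorithms that treat the clustering problem as an optimization process which seeks to maximize or minimize a particular criterion function defined over the entire clustering solution. We present a comprehensive study of the desirable characteristics and feasibility of various criterion functions under the different clustering requirements raised by real-world applications. In particular, we study seven global criterion functions for clustering large document datasets, three of which are introduced by us. The first part of the thesis consists of a detailed experimental evaluation using 15 datasets and three partitional clustering approaches, followed by a theoretical analysis of the criterion functions. Our analysis shows that the criterion functions that are more robust to differences in cluster tightness and that produce more balanced clusters tend to perform well. Our new criterion functions are among the ones achieving the best overall results. We further discuss how the criterion functions perform on hierarchical and soft clustering solutions. We evaluate six partitional and nine agglomerative methods on twelve datasets. A new class of agglomerative algorithms, the constrained agglomerative algorithm, is also proposed and achieves [...]. For four of the criterion functions, we derive their soft-clustering extensions, involving [...], and analyze their characteristics. Finally, we extend the criterion functions to incorporate prior knowledge of the natural topics existing in the datasets. Specifically, we define the problem of topic-driven clustering, which organizes a document collection according to a given set of topics. We propose topic-driven schemes that consider the similarity between documents and topics and the relationship among the documents themselves simultaneously. Our experimental results show that the proposed schemes are efficient and effective with topic prototypes of different levels of specificity.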

References (40)
Isidore Rigoutsos, Dennis Shasha, Kaizhong Zhang, Bruce Shapiro, Xiong Wang, Jason T. L. Wang, Sitaram Dikshitulu. Automated discovery of active motifs in three dimensional molecules. Knowledge Discovery and Data Mining, pp. 89-95 (1997)
Xiaofeng He, Chris Ding, Ming Gu, Hongyuan Zha, Horst Simon. Spectral min-max cut for graph partitioning and data clustering (2001)
Anthony K. H. Tung, Jiawei Han, Micheline Kamber. Spatial clustering methods in data mining: A survey. Geographic Data Mining and Knowledge Discovery, pp. 188-217 (2001)
George Karypis, Bamshad Mobasher, Eui-Hong Han, Vipin Kumar. Hypergraph Based Clustering in High-Dimensional Data Sets: A Summary of Results. IEEE Data(base) Engineering Bulletin, vol. 21, pp. 15-22 (1998)
R. C. T. Lee. Clustering Analysis and Its Applications. Springer, Boston, MA, pp. 169-292 (1981). 10.1007/978-1-4613-9883-7_4
Daniel Boley. Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery, vol. 2, pp. 325-344 (1998). 10.1023/A:1009740529316
Daniel Boley, Maria Gini, Robert Gross, Eui-Hong Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, Jerome Moore. Document Categorization and Query Generation on the World Wide Web Using WebACE. Artificial Intelligence Review, vol. 13, pp. 365-391 (1999). 10.1023/A:1006592405320
Alexander Strehl, Joydeep Ghosh. A Scalable Approach to Balanced, High-Dimensional Clustering of Market-Baskets. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 525-536 (2000). 10.1007/3-540-44467-X_48
John Stutz, Peter Cheeseman. Bayesian classification (AutoClass): theory and results. Knowledge Discovery and Data Mining, pp. 153-180 (1996)