Authors: George Karypis, Ying Zhao, Ding-Zhu Du
DOI:
Keywords:
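Abstract: Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In this thesis, we focus on a class of clustering algorithms that treat the clustering problem as an optimization process which seeks to maximize or minimize a particular criterion function defined over the entire clustering solution. We present a comprehensive study of the desirable characteristics and feasibility of various criterion functions under the different clustering requirements raised by real-world applications. In particular, we focus on seven global criterion functions for clustering large document datasets, three of which are introduced by us. The first part of the thesis consists of a detailed experimental evaluation using 15 datasets and partitional clustering approaches, followed by a theoretical analysis of the criterion functions. Our analysis shows that criterion functions that are more robust to differences in cluster tightness and that produce more balanced clusters tend to perform well. Our new criterion functions are among the ones achieving the best overall results. We further discuss how the criterion functions perform for hierarchical and soft clustering solutions. We evaluate six partitional and nine agglomerative methods on twelve datasets. A new class of agglomerative algorithms, the constrained agglomerative algorithm, is also proposed and achieves the best results. For four of the criterion functions, we derive their soft-clustering extensions, present an experimental evaluation involving them, and analyze their characteristics. Finally, we extend the criterion functions to incorporate prior knowledge of the natural topics existing in the datasets. Specifically, we define the problem of topic-driven clustering, which organizes a document collection according to a given set of topics. We propose topic-driven schemes that consider the similarity between documents and topics and the relationship among the documents themselves simultaneously. Our experimental results show that the proposed schemes are efficient and effective with topic prototypes of different levels of specificity.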