Authors: Raymond J. Mooney, Sugato Basu
DOI:
Keywords:
Abstract: Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. The focus of our research is on semi-supervised clustering, where we study how prior knowledge, gathered either from automated information sources or human supervision, can be incorporated into clustering algorithms. In this thesis, we present probabilistic models for semi-supervised clustering, develop algorithms based on these models, and empirically validate their performance by extensive experiments on data sets from different domains, e.g., text analysis, hand-written character recognition, and bioinformatics. In many domains where clustering is applied, some prior knowledge is available either in the form of labeled data (specifying the category to which an instance belongs) or pairwise constraints on instances (specifying whether two instances should be in the same or different clusters). We first analyze effective methods of incorporating labeled supervision into prototype-based clustering algorithms, and propose variants of the well-known KMeans algorithm that improve its performance with limited labeled data. We then focus on the problem of semi-supervised clustering with constraints and show how this problem can be studied in the framework of a well-defined probabilistic generative model, the Hidden Markov Random Field (HMRF). We derive an efficient KMeans-type iterative algorithm, HMRF-KMeans, for optimizing a semi-supervised clustering objective function defined on the HMRF model. We also give convergence guarantees for our algorithm for a large class of clustering distortion measures (e.g., squared Euclidean distance, KL divergence, and cosine distance). Finally, we develop an active learning algorithm for acquiring maximally informative pairwise constraints in an interactive query-driven framework for semi-supervised clustering with constraints. Other interesting problems of semi-supervised clustering that we discuss in this thesis include (1) semi-supervised graph-based clustering using kernels, (2) using prior knowledge in overlapping clustering of data, (3) integration of constraint-based and distance-based semi-supervised clustering methods using the HMRF model, and (4) model selection techniques that use the available supervision to automatically select the right number of clusters.