Semi-supervised clustering: probabilistic models, algorithms and experiments

Authors: Raymond J. Mooney, Sugato Basu

Abstract: Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. The focus of our research is on semi-supervised clustering, where we study how prior knowledge, gathered either from automated information sources or human supervision, can be incorporated into clustering algorithms. In this thesis, we present probabilistic models for semi-supervised clustering, develop algorithms based on these models, and empirically validate their performances by extensive experiments on data sets from different domains, e.g., text analysis, hand-written character recognition, and bioinformatics. In many domains where clustering is applied, some prior knowledge is available in the form of labeled data (specifying the category to which an instance belongs) or pairwise constraints on instances (specifying whether two instances should be in the same or different clusters). We first analyze effective methods of incorporating labeled supervision into prototype-based clustering algorithms, and propose variants of the well-known KMeans algorithm that improve performance with limited labeled data. We then focus on the problem of semi-supervised clustering with constraints and show how it can be studied in the framework of a well-defined probabilistic generative model, a Hidden Markov Random Field (HMRF). We derive an efficient KMeans-type iterative algorithm, HMRF-KMeans, for optimizing a semi-supervised clustering objective function defined on the HMRF model. We also give convergence guarantees for a large class of clustering distortion measures (e.g., squared Euclidean distance, KL divergence, and cosine distance). Finally, we develop an active learning algorithm for acquiring maximally informative pairwise constraints in an interactive query-driven framework. Other interesting problems of semi-supervised clustering that we discuss in this thesis include (1) semi-supervised graph-based clustering using kernels, (2) using prior knowledge in overlapping clustering of data, (3) integration of constraint-based and distance-based semi-supervised clustering methods using the HMRF model, and (4) model selection techniques that use the available supervision to automatically select the right number of clusters.
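
The abstract names a KMeans-type algorithm that uses pairwise constraints but does not state its objective function. As a rough illustration of the general idea only, the sketch below assumes squared Euclidean distortion plus one uniform penalty (`weight`) for each violated must-link or cannot-link pair, with a greedy point-by-point assignment step. The function `penalized_kmeans`, its parameters, and the uniform penalty are all assumptions made for this example. It is not the HMRF-KMeans algorithm from the thesis.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Pair = Tuple[int, int]  # indices of two constrained points


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def penalized_kmeans(
    points: List[Point],
    centroids: List[List[float]],
    must_link: List[Pair],
    cannot_link: List[Pair],
    weight: float = 1.0,
    max_iters: int = 50,
) -> List[int]:
    """Return one cluster index per point (hypothetical helper, illustration only)."""
    k = len(centroids)
    centroids = [list(c) for c in centroids]  # copy so the caller's list is untouched
    # Start from the plain nearest-centroid assignment.
    labels = [min(range(k), key=lambda h: sq_dist(p, centroids[h])) for p in points]

    def cost(i: int, h: int) -> float:
        # Distortion of placing point i in cluster h, plus an assumed fixed
        # penalty for every constraint on i that this choice would violate.
        c = sq_dist(points[i], centroids[h])
        for a, b in must_link:
            if i in (a, b):
                other = b if a == i else a
                if labels[other] != h:
                    c += weight
        for a, b in cannot_link:
            if i in (a, b):
                other = b if a == i else a
                if labels[other] == h:
                    c += weight
        return c

    for _ in range(max_iters):
        changed = False
        # Assignment step: greedy, one point at a time, all others held fixed.
        for i in range(len(points)):
            best = min(range(k), key=lambda h: cost(i, h))
            if best != labels[i]:
                labels[i] = best
                changed = True
        # Update step: each centroid becomes the mean of its current members.
        for h in range(k):
            members = [p for p, lab in zip(points, labels) if lab == h]
            if members:
                centroids[h] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:
            break
    return labels
```

For example, with the one-dimensional points `[[0.0], [1.0], [2.4], [5.0], [6.0]]` and initial centroids `[[0.5], [5.5]]`, the point at 2.4 joins the left cluster when no constraints are given. Adding the must-link pair `(2, 3)` with `weight=10.0` pulls it into the right cluster instead, because the penalty outweighs the extra distortion.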
