The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data

Author: Thomas Hofmann

Keywords: Unsupervised learning; Artificial intelligence; Natural language processing; Automatic summarization; Word (computer architecture); Computer science; Abstraction (linguistics); Text mining; Latent class model; Machine learning; Information access

Abstract: This paper presents a novel statistical latent class model for text mining and interactive information access. The described learning architecture, called the Cluster-Abstraction Model (CAM), is purely data driven and utilizes context-specific word occurrence statistics. In an intertwined fashion, the CAM extracts hierarchical relations between groups of documents as well as an abstractive organization of keywords. An annealed version of the Expectation-Maximization (EM) algorithm for maximum likelihood estimation of the parameters is derived. The benefits of the CAM for retrieval and automated cluster summarization are investigated experimentally.

References (9)
Thomas Hofmann, Jan Puzicha. Statistical Models for Co-occurrence Data. Massachusetts Institute of Technology (1998).
F. Jelinek. Interpolated estimation of Markov source parameters from sparse data. Proc. Workshop on Pattern Recognition in Practice, pp. 381-397 (1980).
W. Bruce Croft. Clustering large files of documents using the single-link method. Journal of the Association for Information Science and Technology, vol. 28, pp. 341-344 (1977). DOI: 10.1002/ASI.4630280606
Kenneth Rose, Eitan Gurewitz, Geoffrey C. Fox. Statistical mechanics and phase transitions in clustering. Physical Review Letters, vol. 65, pp. 945-948 (1990). DOI: 10.1103/PHYSREVLETT.65.945
N. Jardine, C.J. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, vol. 7, pp. 217-240 (1971). DOI: 10.1016/0020-0271(71)90051-9
A. P. Dempster, N. M. Laird, D. B. Rubin. Maximum Likelihood from Incomplete Data Via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, pp. 1-22 (1977). DOI: 10.1111/J.2517-6161.1977.TB01600.X
L. Douglas Baker, Andrew Kachites McCallum. Distributional clustering of words for text classification. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 96-103 (1998). DOI: 10.1145/290941.290970
Fernando Pereira, Naftali Tishby, Lillian Lee. Distributional clustering of English words. Proceedings of the 31st Annual Meeting on Association for Computational Linguistics, pp. 183-190 (1993). DOI: 10.3115/981574.981598
Peter Willett. Recent trends in hierarchic document clustering: a critical review. Information Processing and Management, vol. 24, pp. 577-597 (1988). DOI: 10.1016/0306-4573(88)90027-1