Enhanced cross-domain document clustering with a semantically enhanced text stemmer SETS

Authors: Ivan Stankov, Diman Todorov, Rossitza Setchi

DOI: 10.3233/KES-130267

Abstract: The aim of document clustering is to produce coherent clusters of similar documents. Clustering algorithms rely on text normalisation techniques to represent and cluster documents. Although most clustering algorithms perform well in specific knowledge domains, processing cross-domain document repositories is still a challenge. This paper attempts to address this challenge. It investigates the performance of the sk-means clustering algorithm across domains by comparing the coherence of the clusters produced with semantic-based and traditional TF-IDF-based document representations. The evaluation is conducted on 20 different generic sub-domains of a thousand documents, each randomly selected from the Reuters21578 corpus. The experimental results obtained demonstrate improved coherence of the clusters produced using the semantically enhanced text stemmer SETS, when compared to the Porter stemmer. In addition, SETS is shown to be resistant to noise, which is often introduced in the index aggregation stage, the stage that acquires features
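The TF-IDF baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "sk-means" denotes spherical k-means (k-means on unit-length vectors with cosine similarity) and that documents arrive already tokenised and stemmed. It does not reproduce SETS or the Porter stemmer. The function names, the farthest-first seeding and the toy documents are hypothetical choices made for this illustration, not details taken from the paper.

```python
import math
from collections import Counter

def normalise(vec):
    """Scale a sparse vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)

def tfidf_vectors(docs):
    """Build unit-length TF-IDF vectors from lists of (already stemmed) tokens."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [normalise({t: c * math.log(n / df[t]) for t, c in Counter(doc).items()})
            for doc in docs]

def cosine(a, b):
    """Dot product of two sparse vectors; equals cosine for unit vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def farthest_first(vectors, k):
    """Deterministic seeding (an assumption of this sketch, not of the paper)."""
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        nxt = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(dict(nxt))
    return centroids

def spherical_kmeans(vectors, k, iterations=20):
    """Assign each vector to the most similar centroid, then re-normalise means."""
    centroids = farthest_first(vectors, k)
    labels = [-1] * len(vectors)
    for _ in range(iterations):
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            total = {}
            for v, lab in zip(vectors, labels):
                if lab == j:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # keep the old centroid if a cluster is empty
                centroids[j] = normalise(total)
    return labels

# Toy documents from two unrelated topics (made up for illustration).
docs = [
    ["oil", "crude", "barrel"],
    ["oil", "barrel", "opec"],
    ["crude", "opec", "oil"],
    ["wheat", "grain", "harvest"],
    ["grain", "harvest", "corn"],
    ["wheat", "corn", "grain"],
]
labels = spherical_kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

Swapping the stemmer changes only the token lists passed to tfidf_vectors, so a comparison between two text normalisation schemes can reuse the same clustering code.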

References (31)
Benjamin C. M. Fung, Ke Wang, Martin Ester. Hierarchical Document Clustering. Encyclopedia of Data Warehousing and Mining, pp. 555-559 (2009). DOI: 10.4018/978-1-59140-557-3.CH105
Edward Grefenstette. Analysing Document Similarity Measures. Department of Computer Science (2009).
Rossitza Setchi, Qiao Tang, Carole Bouchard. Ontology-Based Concept Indexing of Images. Knowledge-Based and Intelligent Information and Engineering Systems, pp. 293-300 (2009). DOI: 10.1007/978-3-642-04595-0_36
David D. Lewis. Text representation for intelligent text retrieval: a classification-oriented view. Text-based intelligent systems, pp. 179-197 (1992).
Gerard Salton, Michael J. McGill. Introduction to Modern Information Retrieval (1983).
Wessel Kraaij, Renée Pohlmann. Viewing stemming as recall enhancement. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 40-48 (1996). DOI: 10.1145/243199.243209
Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65 (1987). DOI: 10.1016/0377-0427(87)90125-7
A. K. Jain, M. N. Murty, P. J. Flynn. Data clustering: a review. ACM Computing Surveys, vol. 31, pp. 264-323 (1999). DOI: 10.1145/331499.331504
Igor V. Cadez, Scott Gaffney, Padhraic Smyth. A general probabilistic framework for clustering individuals and objects. Knowledge Discovery and Data Mining, pp. 140-149 (2000). DOI: 10.1145/347090.347119