Enhanced cross-domain document clustering with a semantically enhanced text stemmer SETS

Authors: Ivan Stankov, Diman Todorov, Rossitza Setchi


Abstract: The aim of document clustering is to produce coherent clusters of similar documents. Clustering algorithms rely on text normalisation techniques to represent and cluster documents. Although most clustering algorithms perform well in specific knowledge domains, processing cross-domain document repositories is still a challenge. This paper attempts to address this challenge. It investigates the performance of the sk-means clustering algorithm across domains by comparing the coherence of the clusters produced with semantic-based and traditional TF-IDF-based document representations. The evaluation is conducted on 20 different generic sub-domains of a thousand documents, each randomly selected from the Reuters21578 corpus. The experimental results obtained demonstrate improved coherence of the clusters produced using the semantically enhanced text stemmer SETS, when compared to the Porter stemmer. In addition, the stemmer is shown to be resistant to noise, which is often introduced in the index aggregation stage, the stage that acquires features
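The pipeline the abstract describes (a TF-IDF-based document representation, clustered with sk-means, then scored for cluster coherence) can be illustrated with a minimal sketch. It assumes that sk-means refers to spherical k-means, that documents arrive already tokenised and stemmed (the SETS and Porter stemmers themselves are out of scope here), and that coherence can be approximated by the mean cosine similarity to the cluster centroid. The function names, the toy documents, and the deterministic seeding are hypothetical choices made for illustration, not the authors' implementation or evaluation measure.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse vector (term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Build unit-length TF-IDF vectors from lists of (already stemmed) tokens."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        normalise({t: c * (math.log(n / df[t]) + 1.0) for t, c in Counter(doc).items()})
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def spherical_kmeans(vectors, k, max_iter=20):
    """Cluster unit vectors by cosine similarity; returns (labels, centroids)."""
    # Assumption: seed with the first k documents so the run is reproducible.
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == j:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:
                centroids[j] = normalise(total)  # re-project onto the unit sphere
    return labels, centroids


def mean_cohesion(vectors, labels, centroids):
    """Average cosine similarity between each document and its own centroid."""
    return sum(cosine(v, centroids[l]) for v, l in zip(vectors, labels)) / len(vectors)


# Hypothetical toy corpus: two finance-style topics, tokens already stemmed.
docs = [
    ["oil", "price", "crude", "barrel"],
    ["bank", "interest", "rate", "loan"],
    ["crude", "oil", "export", "barrel"],
    ["loan", "bank", "credit", "rate"],
]
vectors = tfidf_vectors(docs)
labels, centroids = spherical_kmeans(vectors, k=2)
print(labels, round(mean_cohesion(vectors, labels, centroids), 3))
```

Under these assumptions, comparing two stemmers would amount to running the same clustering on each stemmer's token output and comparing the resulting cohesion scores; a production setup would more likely use random restarts and an established validity index.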

