Authors: Ivan Stankov, Diman Todorov, Rossitza Setchi
DOI: 10.3233/KES-130267
Keywords:
Abstract: The aim of document clustering is to produce coherent clusters of similar documents. Clustering algorithms rely on text normalisation techniques to represent and cluster documents. Although most clustering algorithms perform well in specific knowledge domains, processing cross-domain document repositories is still a challenge. This paper attempts to address this challenge. It investigates the performance of the sk-means clustering algorithm across domains by comparing the coherence of the clusters produced with semantic-based and traditional TF-IDF-based document representations. The evaluation is conducted on 20 different generic sub-domains of a thousand documents, each randomly selected from the Reuters21578 corpus. The experimental results obtained demonstrate improved cluster coherence using the semantically enhanced text stemmer SETS, when compared with the Porter stemmer. In addition, SETS is shown to be resistant to noise, which is often introduced in the index aggregation stage, the stage that acquires features