Authors: Jian Hu, Lujun Fang, Yang Cao, Hua-Jun Zeng, Hua Li
Keywords:
Abstract: Most traditional text clustering methods are based on the "bag of words" (BOW) representation, which relies on frequency statistics over a set of documents. BOW, however, ignores important information about the semantic relationships between key terms. To overcome this problem, several methods have been proposed in the past to enrich the text representation with external resources such as WordNet. However, many of these approaches suffer from some limitations: 1) WordNet has limited coverage and lacks an effective word-sense disambiguation ability; 2) most of the enrichment strategies, which append or replace document terms with their hypernyms and synonyms, are overly simple. In this paper, to overcome these deficiencies, we first propose a way to build a concept thesaurus based on the semantic relations (synonym, hypernym, and associative relation) extracted from Wikipedia. Then, we develop a unified framework to leverage these semantic relations in order to enhance the traditional content similarity measure for text clustering. The experimental results on the Reuters and OHSUMED datasets show that, with the help of the Wikipedia thesaurus, the clustering performance of our method is improved compared to previous methods. In addition, when the weights for hypernym, synonym, and associative concepts are optimized using a few labeled data provided by users, the clustering performance can be further improved.
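
As an illustration of the idea described in the abstract, the following minimal Python sketch adds weighted concept-level similarities (synonym, hypernym, associative) to a plain BOW cosine similarity. It is not the paper's actual formulation: the toy `THESAURUS`, the additive combination in `enriched_similarity`, and the example weights are all assumptions made for illustration, and a real thesaurus would be extracted from Wikipedia rather than written by hand.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus; in the paper the relations come from Wikipedia.
THESAURUS = {
    "car":  {"synonym": ["automobile"], "hypernym": ["vehicle"], "associative": ["road"]},
    "auto": {"synonym": ["automobile"], "hypernym": ["vehicle"], "associative": ["engine"]},
    "puma": {"synonym": ["cougar"], "hypernym": ["felidae"], "associative": ["mountain"]},
}
RELATIONS = ("synonym", "hypernym", "associative")


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def relation_vector(tokens, relation):
    """Count the concepts reached from the tokens through one relation type."""
    vec = Counter()
    for token in tokens:
        for concept in THESAURUS.get(token, {}).get(relation, []):
            vec[concept] += 1
    return vec


def enriched_similarity(doc_a, doc_b, weights):
    """BOW cosine plus weighted cosines over concept vectors (assumed form)."""
    tokens_a = doc_a.lower().split()
    tokens_b = doc_b.lower().split()
    score = cosine(Counter(tokens_a), Counter(tokens_b))
    for relation in RELATIONS:
        score += weights.get(relation, 0.0) * cosine(
            relation_vector(tokens_a, relation),
            relation_vector(tokens_b, relation),
        )
    return score


# Example weights (assumed); the abstract says such weights can be tuned
# with a few labeled examples.
WEIGHTS = {"synonym": 0.5, "hypernym": 0.3, "associative": 0.2}
```

In this sketch, the documents "car" and "auto" share no terms, so their BOW similarity is zero, but they still receive a positive score through their shared synonym and hypernym concepts. The resulting pairwise similarities could then be passed to any standard clustering algorithm.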