Authors: Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, Xiaohua Zhou
Keywords:
Abstract: In traditional text clustering methods, documents are represented as "bags of words" without considering the semantic information of each document. For instance, if two documents use different collections of core words to represent the same topic, they may be falsely assigned to different clusters due to the lack of shared core words, although the core words they use are probably synonyms or semantically associated in other forms. The most common way to solve this problem is to enrich the document representation with background knowledge from an ontology. There are two major issues for this approach: (1) the coverage of the ontology is limited, even for WordNet or MeSH, and (2) using ontology terms as replacement or additional features may cause information loss or introduce noise. In this paper, we present a novel text clustering method to address these two issues by enriching the document representation with Wikipedia concept and category information. We develop two approaches, exact-match and relatedness-match, to map text documents to Wikipedia concepts, and further to Wikipedia categories. Then the text documents are clustered based on a similarity metric which combines document content information with concept information as well as category information. The experimental results using the proposed clustering framework on three datasets (20-newsgroup, TDT2, and LA Times) show that clustering performance improves significantly by enriching the document representation with Wikipedia concepts and categories.
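The abstract describes a similarity metric that combines word content, Wikipedia concept, and Wikipedia category information, but it does not give the formula. The following is only a minimal sketch of one plausible form: a weighted sum of cosine similarities over three sparse vectors. The function names, the weights `alpha` and `beta`, and the hand-assigned concepts and categories in the toy documents are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(doc1, doc2, alpha=0.5, beta=0.5):
    """Word similarity plus weighted concept and category similarity.

    alpha and beta are hypothetical weights chosen for this sketch;
    the abstract does not state how the three parts are combined.
    """
    return (
        cosine(doc1["words"], doc2["words"])
        + alpha * cosine(doc1["concepts"], doc2["concepts"])
        + beta * cosine(doc1["categories"], doc2["categories"])
    )


# Toy documents with no shared words but the same concept and category.
# The concept and category labels are assigned by hand here; in the paper
# they come from exact-match or relatedness-match against Wikipedia.
doc_a = {
    "words": Counter(["car", "engine"]),
    "concepts": Counter(["Automobile"]),
    "categories": Counter(["Vehicles"]),
}
doc_b = {
    "words": Counter(["auto", "motor"]),
    "concepts": Counter(["Automobile"]),
    "categories": Counter(["Vehicles"]),
}

word_only = cosine(doc_a["words"], doc_b["words"])
enriched = combined_similarity(doc_a, doc_b)
```

In this toy case the bag-of-words similarity is zero because the two documents share no words, while the enriched score is positive because both map to the same concept and category, which is the situation the abstract uses to motivate the method.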