Enhancing text clustering by leveraging Wikipedia semantics

Authors: Jian Hu, Lujun Fang, Yang Cao, Hua-Jun Zeng, Hua Li

DOI: 10.1145/1390334.1390367

Abstract: Most traditional text clustering methods are based on a "bag of words" (BOW) representation built from frequency statistics in a set of documents. BOW, however, ignores important information about the semantic relationships between key terms. To overcome this problem, several methods have been proposed in the past to enrich the text representation with external resources such as WordNet. However, many of these approaches suffer from limitations: 1) WordNet has limited coverage and lacks effective word-sense disambiguation ability; 2) most text representation enrichment strategies, which append or replace document terms with their hypernyms and synonyms, are overly simple. In this paper, to overcome these deficiencies, we first propose a way to build a concept thesaurus based on the semantic relations (synonym, hypernym, and associative relation) extracted from Wikipedia. Then, we develop a unified framework to leverage these semantic relations in order to enhance the traditional content similarity measure for text clustering. The experimental results on the Reuters and OHSUMED datasets show that, with the help of the Wikipedia thesaurus, the clustering performance of our method is improved compared to previous methods. In addition, with optimized weights for hypernym, synonym, and associative concepts, tuned with the help of a few labeled data provided by users, the clustering performance can be further improved.
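The idea of enhancing a content similarity measure with weighted synonym, hypernym, and associative concepts can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the toy `THESAURUS`, the helper names, the linear combination, and the weight values are hypothetical and are not taken from the paper, whose exact formulation is not given in the abstract.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: term -> related concepts per relation type.
# A real thesaurus would be extracted from Wikipedia; these entries are made up.
THESAURUS = {
    "car": {"synonym": ["automobile"], "hypernym": ["vehicle"], "associative": ["road"]},
    "automobile": {"synonym": ["automobile"], "hypernym": ["vehicle"], "associative": ["engine"]},
}

RELATIONS = ("synonym", "hypernym", "associative")


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def concept_vector(tokens, relation):
    """Count the concepts reachable from the tokens through one relation type."""
    vec = Counter()
    for token in tokens:
        for concept in THESAURUS.get(token, {}).get(relation, []):
            vec[concept] += 1
    return vec


def enriched_similarity(doc_a, doc_b, weights):
    """Assumed linear mix of BOW similarity and per-relation concept similarity."""
    tokens_a, tokens_b = doc_a.lower().split(), doc_b.lower().split()
    score = weights["content"] * cosine(Counter(tokens_a), Counter(tokens_b))
    for relation in RELATIONS:
        score += weights[relation] * cosine(
            concept_vector(tokens_a, relation), concept_vector(tokens_b, relation)
        )
    return score


# Illustrative weights only; the abstract says such weights can be tuned
# with a few labeled examples, but gives no values.
WEIGHTS = {"content": 0.4, "synonym": 0.2, "hypernym": 0.2, "associative": 0.2}
print(enriched_similarity("car", "automobile", WEIGHTS))
```

In this toy case the two documents share no words, so plain BOW similarity is zero, while the synonym and hypernym concept vectors still overlap and contribute to the combined score. The resulting pairwise similarity could then be passed to any standard clustering algorithm.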
