Exploiting Wikipedia as external knowledge for document clustering

Authors: Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, Xiaohua Zhou

DOI: 10.1145/1557019.1557066

Abstract: In traditional text clustering methods, documents are represented as "bags of words" without considering the semantic information of each document. For instance, if two documents use different collections of core words to represent the same topic, they may be falsely assigned to different clusters due to the lack of shared core words, although the core words they use are probably synonyms or semantically associated in other forms. The most common way to solve this problem is to enrich the document representation with background knowledge in an ontology. There are two major issues for this approach: (1) the coverage of the ontology is limited, even for WordNet or Mesh, and (2) using ontology terms as replacement or additional features may cause information loss or introduce noise. In this paper, we present a novel text clustering method to address these two issues by enriching document representation with Wikipedia concept and category information. We develop two approaches, exact match and relatedness-match, to map text documents to Wikipedia concepts, and further to Wikipedia categories. Then the text documents are clustered based on a similarity metric which combines document content information, concept information, as well as category information. The experimental results using the proposed clustering framework on three datasets (20-newsgroup, TDT2, and LA Times) show that clustering performance improves significantly by enriching document representation with Wikipedia concepts and categories.
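The abstract describes a similarity metric that combines content, concept, and category information but does not give its form. The following is a minimal sketch, assuming a weighted linear combination of three cosine similarities. The function names, the dictionary layout of a document, and the weights `alpha` and `beta` are illustrative assumptions, not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(d1, d2, alpha=0.5, beta=0.5):
    """Word similarity plus weighted concept and category similarity.

    Assumption: each document is a dict with "words", "concepts" and
    "categories" vectors; alpha and beta are hypothetical weights.
    """
    return (
        cosine(d1["words"], d2["words"])
        + alpha * cosine(d1["concepts"], d2["concepts"])
        + beta * cosine(d1["categories"], d2["categories"])
    )


# Two toy documents with no shared words but the same concept and category.
doc_a = {
    "words": {"car": 1.0, "engine": 1.0},
    "concepts": {"Automobile": 1.0},
    "categories": {"Vehicles": 1.0},
}
doc_b = {
    "words": {"automobile": 1.0, "motor": 1.0},
    "concepts": {"Automobile": 1.0},
    "categories": {"Vehicles": 1.0},
}

word_only = cosine(doc_a["words"], doc_b["words"])
enriched = combined_similarity(doc_a, doc_b)
print(word_only, enriched)
```

In this toy case the bag-of-words similarity is zero because the documents share no terms, while the enriched score is positive because both map to the same concept and category. This is the failure mode the abstract attributes to plain bag-of-words clustering.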
