Building semantic kernels for text classification using Wikipedia

Authors: Pu Wang, Carlotta Domeniconi

DOI: 10.1145/1401890.1401976

Abstract: Document classification presents difficult challenges due to the sparsity and high dimensionality of text data, and to the complex semantics of natural language. The traditional document representation is a word-based vector (Bag of Words, or BOW), where each dimension is associated with a term of the dictionary containing all the words that appear in the corpus. Although simple and commonly used, this representation has several limitations. It is essential to embed semantic information and conceptual patterns in order to enhance the prediction capabilities of classification algorithms. In this paper, we overcome the shortages of the BOW approach by embedding background knowledge derived from Wikipedia into a semantic kernel, which is then used to enrich the representation of documents. Our empirical evaluation with real data sets demonstrates that our approach successfully achieves improved classification accuracy with respect to the BOW technique, and to other recently developed methods.
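The abstract does not spell out how the kernel is built, so the following is only a minimal sketch of the general semantic-kernel idea, k(d1, d2) = (d1 S) · (d2 S), in which BOW vectors are multiplied by a term-proximity matrix S before the inner product is taken. The vocabulary, the example documents, the matrix values, and the function names `bow_vector`, `mat_vec`, and `semantic_kernel` are all assumptions made for illustration. In particular, S is hand-written here and is not derived from Wikipedia as in the paper.

```python
from collections import Counter


def bow_vector(text, vocabulary):
    """Map a document to term counts over a fixed vocabulary (a BOW vector)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def mat_vec(matrix, vector):
    """Compute the row-vector product vector @ matrix."""
    n_cols = len(matrix[0])
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(n_cols)
    ]


def semantic_kernel(d1, d2, proximity):
    """k(d1, d2) = (d1 S) . (d2 S) for a term-proximity matrix S."""
    e1 = mat_vec(proximity, d1)
    e2 = mat_vec(proximity, d2)
    return sum(a * b for a, b in zip(e1, e2))


vocabulary = ["puma", "cougar", "car"]

# Hypothetical proximity matrix: "puma" and "cougar" are treated as related.
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# With the identity matrix the kernel reduces to the plain BOW inner product.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

doc_a = bow_vector("puma puma", vocabulary)
doc_b = bow_vector("cougar", vocabulary)

bow_similarity = semantic_kernel(doc_a, doc_b, identity)
enriched_similarity = semantic_kernel(doc_a, doc_b, S)
print(bow_similarity, enriched_similarity)
```

The two documents share no terms, so the plain BOW inner product is zero, while the proximity matrix lets the related terms contribute a positive similarity. This is the kind of limitation of BOW that the abstract refers to.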

References (24)
Evgeniy Gabrilovich, Shaul Markovitch. Feature generation for text categorization using world knowledge. International Joint Conference on Artificial Intelligence, pp. 1048-1053 (2005).
Andreas Hotho, Steffen Staab, Gerd Stumme. WordNet improves text document clustering. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 541- (2003).
Ken Lang. NewsWeeder: Learning to Filter Netnews. Machine Learning Proceedings 1995, pp. 331-339 (1995). DOI: 10.1016/B978-1-55860-377-6.50048-7
Nello Cristianini, John Shawe-Taylor. Kernel Methods for Pattern Analysis (2004).
Razvan C. Bunescu, Marius Pasca. Using Encyclopedic Knowledge for Named Entity Disambiguation. Conference of the European Chapter of the Association for Computational Linguistics (2006).
L Alfonso Urena-López, Manuel Buenaga, Jose M Gomez. Integrating Linguistic Resources in TC through WSD. Computers and the Humanities, vol. 35, pp. 215-230 (2001). DOI: 10.1023/A:1002632712378
Belén Díaz-Agudo, Manuel de Buenaga Rodríguez, José María Gómez Hidalgo. Using WordNet to Complement Training Information in Text Categorization. arXiv: Computation and Language, pp. 353- (1997).
S. K. M. Wong, Wojciech Ziarko, Patrick C. N. Wong. Generalized vector spaces model in information retrieval. Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '85), pp. 18-25 (1985). DOI: 10.1145/253495.253506