A multi-layer text classification framework based on two-level representation model

Authors: Jiali Yun, Liping Jing, Jian Yu, Houkuan Huang

DOI: 10.1016/J.ESWA.2011.08.027

Abstract: Text categorization is one of the most common themes in data mining and machine learning. Unlike structured data, unstructured text is more difficult to analyze because it contains complicated syntactic and semantic information. In this paper, we propose a two-level representation model (2RM) to represent text data: one level represents syntactic information and the other represents semantic information. At the syntactic level, each document is represented as a term vector in which the value of each component is the term frequency-inverse document frequency (TF-IDF) weight. At the semantic level, the document is represented by the Wikipedia concepts related to the terms of the syntactic level. We also design a multi-layer classification framework (MLCLA) to make use of the syntactic and semantic information represented in the 2RM model. The MLCLA framework contains three classifiers. Two of them are applied to the syntactic level and the semantic level in parallel. The outputs of these two classifiers are combined and fed as input to the third classifier, which produces the final results. Experimental results on benchmark data sets (20Newsgroups, Reuters-21578 and Classic3) show that the proposed 2RM model plus the MLCLA framework improves text classification performance compared with existing flat text representation models (Term-based VSM, Term Semantic Kernel Model, Concept-based VSM, Concept Semantic Kernel Model and Term+Concept VSM) plus existing classification methods.
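The two-level, three-classifier layout described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `concept_of` dictionary is a hypothetical stand-in for the Wikipedia term-to-concept mapping, a simple cosine nearest-centroid classifier replaces whatever base classifiers the authors used, the smoothed IDF formula is one common variant, and the names `fit_idf`, `vectorize`, `CentroidClassifier` and `TwoLevelStack` are invented. The third classifier is also trained on base-classifier scores for the same training documents, which a real system would replace with held-out scores.

```python
import math
from collections import Counter, defaultdict

def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}

def vectorize(doc, idf):
    """Sparse TF-IDF vector (dict); tokens unseen during fitting are dropped."""
    return {t: c * idf[t] for t, c in Counter(doc).items() if t in idf}

def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class CentroidClassifier:
    """Toy base classifier: one summed vector per class, scored by cosine."""
    def fit(self, vectors, labels):
        sums = defaultdict(lambda: defaultdict(float))
        for vec, y in zip(vectors, labels):
            for k, v in vec.items():
                sums[y][k] += v
        self.centroids = {y: dict(s) for y, s in sums.items()}
        self.classes = sorted(self.centroids)
        return self

    def scores(self, vec):
        return [cosine(vec, self.centroids[c]) for c in self.classes]

    def predict(self, vec):
        s = self.scores(vec)
        return self.classes[s.index(max(s))]

class TwoLevelStack:
    """Term-level and concept-level classifiers feeding a third classifier."""
    def __init__(self, concept_of):
        self.concept_of = concept_of  # hypothetical term -> concept mapping

    def _levels(self, doc):
        concepts = [self.concept_of[t] for t in doc if t in self.concept_of]
        return vectorize(doc, self.term_idf), vectorize(concepts, self.concept_idf)

    def _meta(self, doc):
        # Combine the outputs of the two parallel classifiers into one vector.
        term_vec, concept_vec = self._levels(doc)
        meta = {}
        for c, s in zip(self.term_clf.classes, self.term_clf.scores(term_vec)):
            meta["term:" + c] = s
        for c, s in zip(self.concept_clf.classes, self.concept_clf.scores(concept_vec)):
            meta["concept:" + c] = s
        return meta

    def fit(self, docs, labels):
        concept_docs = [[self.concept_of[t] for t in d if t in self.concept_of]
                        for d in docs]
        self.term_idf = fit_idf(docs)
        self.concept_idf = fit_idf(concept_docs)
        self.term_clf = CentroidClassifier().fit(
            [vectorize(d, self.term_idf) for d in docs], labels)
        self.concept_clf = CentroidClassifier().fit(
            [vectorize(d, self.concept_idf) for d in concept_docs], labels)
        # Simplification: the third classifier sees scores on the training docs.
        self.meta_clf = CentroidClassifier().fit(
            [self._meta(d) for d in docs], labels)
        return self

    def predict(self, doc):
        return self.meta_clf.predict(self._meta(doc))

# Toy data, invented for illustration only.
concept_of = {
    "python": "Programming language", "java": "Programming language",
    "code": "Source code", "bug": "Software bug", "compile": "Compiler",
    "goal": "Association football", "match": "Association football",
    "score": "Association football", "team": "Sports team", "coach": "Sports team",
}
train_docs = [["python", "code", "bug"], ["java", "code", "compile"],
              ["goal", "match", "team"], ["team", "coach", "score"]]
labels = ["tech", "tech", "sport", "sport"]

model = TwoLevelStack(concept_of).fit(train_docs, labels)
print(model.predict(["java", "bug"]))
print(model.predict(["coach", "goal"]))
```

The point of the sketch is the data flow rather than the accuracy: the term-level and concept-level classifiers never see each other's features, and only their per-class scores reach the third classifier.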
