A General Framework of Feature Selection for Text Categorization

Authors: Hongfang Jing, Bin Wang, Yahui Yang, Yan Xu

DOI: 10.1007/978-3-642-03070-3_49

Keywords: DBFS, Pattern recognition, Text categorization, Feature selection, Artificial intelligence, Machine learning, Computer science, Information gain

Abstract: Many feature selection methods have been proposed for text categorization. However, their performance is usually verified by experiments, so the results rely on the corpora used and may not be accurate. This paper proposes a novel feature selection framework called Distribution-Based Feature Selection (DBFS), based on the distribution difference of features. This framework generalizes most state-of-the-art feature selection methods, including OCFS, MI, ECE, IG, CHI and OR. The performance of many feature selection methods can be estimated by theoretical analysis using the components of this framework. Besides, DBFS sheds light on the merits and drawbacks of many existing feature selection methods. In addition, this framework helps to select suitable feature selection methods for specific domains. Moreover, a weighted model based on DBFS is given, so that suitable feature selection methods for unbalanced datasets can be derived. The experimental results show that they are more effective than CHI, IG and OCFS on both balanced and unbalanced datasets.
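For orientation, two of the baseline criteria named in the abstract, CHI and IG, can be sketched from a 2x2 table of document counts for one term and one class. This is a minimal sketch using the common textbook definitions, not the DBFS framework or the paper's own formulation; the function names `chi_square` and `information_gain` and the count names `a`, `b`, `c`, `d` are assumptions introduced here for illustration.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square (CHI) score of a term for one class.

    Hypothetical count names (not from the paper):
      a: documents in the class that contain the term
      b: documents outside the class that contain the term
      c: documents in the class that do not contain the term
      d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no signal.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def _entropy(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Information gain (IG) of term presence about class membership."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_class = _entropy([a + c, b + d])
    # Class entropy after splitting documents by term presence or absence.
    h_given_term = ((a + b) / n) * _entropy([a, b]) \
        + ((c + d) / n) * _entropy([c, d])
    return h_class - h_given_term
```

In this sketch, a term that appears in every document of the class and in no other document gets the highest scores, while a term spread evenly across classes scores zero under both criteria.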
