High-performing feature selection for text classification

Authors: Monica Rogati, Yiming Yang

DOI: 10.1145/584792.584911

Abstract: This paper reports a controlled study on a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian (NB) approach, a Rocchio-style classifier, a k-nearest neighbor (kNN) method and a Support Vector Machine (SVM) system. Two benchmark collections were chosen as the testbeds: Reuters-21578 and a small portion of Reuters Corpus Version 1 (RCV1), making the new results comparable to published results. We found that feature selection methods based on chi-square (χ²) statistics consistently outperformed those based on other criteria (including information gain) for all four classifiers and both data collections, and that a further increase in performance was obtained by combining uncorrelated and high-performing feature selection methods. The results we obtained using only 3% of the available features are among the best reported, including results obtained with the full feature set.
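As a rough illustration of the kind of filter method the abstract refers to, the sketch below scores each term with the standard chi-square statistic computed from a 2x2 term/category contingency table, takes the maximum over categories, and keeps a fixed fraction of the vocabulary. It is not the authors' implementation: the abstract does not say which of the 100+ variants performed best, so the max-over-categories aggregation, the function names `chi2_score` and `select_features`, and the `keep_fraction` parameter are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def chi2_score(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, keep_fraction=0.03):
    """Rank terms by their maximum chi-square over categories.

    docs is a list of token lists, labels the matching category names.
    keep_fraction (hypothetical parameter) is the share of the
    vocabulary to keep; 0.03 mirrors the 3% figure in the abstract.
    Returns (selected_terms, scores).
    """
    n_docs = len(docs)
    cat_size = Counter(labels)
    term_df = Counter()                  # document frequency per term
    term_cat_df = defaultdict(Counter)   # document frequency per term and category
    for tokens, label in zip(docs, labels):
        for term in set(tokens):         # count each term once per document
            term_df[term] += 1
            term_cat_df[term][label] += 1

    scores = {}
    for term, df in term_df.items():
        best = 0.0
        for cat, size in cat_size.items():
            a = term_cat_df[term][cat]
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi2_score(a, b, c, d))
        scores[term] = best

    k = max(1, math.ceil(keep_fraction * len(scores)))
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores
```

Calling `select_features(docs, labels)` on a tokenized, single-label corpus returns the retained terms together with the full score table, which can then be used to restrict the vocabulary before training any of the classifiers named above. Multi-label collections such as Reuters-21578 would need per-category membership sets instead of a single label per document.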
