Abstract: This paper reports a controlled study on a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian (NB) approach, a Rocchio-style classifier, a k-nearest neighbor (kNN) method, and a Support Vector Machine (SVM) system. Two benchmark collections were chosen as the testbeds: Reuters-21578 and a small portion of Reuters Corpus Version 1 (RCV1), making the new results comparable to published results. We found that feature selection methods based on chi-square (χ²) statistics consistently outperformed those based on other criteria (including information gain) for all classifiers and both data collections, and that a further increase in performance was obtained by combining uncorrelated, high-performing feature selection methods. The results we obtained using only 3% of the available features are among the best reported, including results obtained with the full feature set.
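To make the central idea concrete, the following is a minimal sketch of chi-square filter feature selection for a single category, using the standard 2x2 term/category contingency table over document frequencies. It is an illustration only and not necessarily any of the variants evaluated in the study. The function names `chi2_score` and `select_features`, the toy documents, and the `fraction` parameter are assumptions introduced here.

```python
from collections import Counter


def chi2_score(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: in-category documents containing the term
    b: out-of-category documents containing the term
    c: in-category documents lacking the term
    d: out-of-category documents lacking the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. the term occurs in every document).
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, category, fraction=0.03):
    """Rank terms by chi-square against one category; keep the top fraction.

    docs is a list of token lists and labels holds one label per document.
    Returns (kept_terms, scores). Ties are broken alphabetically.
    """
    n_docs = len(docs)
    n_pos = sum(1 for y in labels if y == category)
    df_all = Counter()  # document frequency over all documents
    df_pos = Counter()  # document frequency within the category
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            df_all[term] += 1
            if y == category:
                df_pos[term] += 1

    scores = {}
    for term, df in df_all.items():
        a = df_pos[term]
        b = df - a
        c = n_pos - a
        d = n_docs - n_pos - b
        scores[term] = chi2_score(a, b, c, d)

    keep = max(1, round(fraction * len(scores)))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:keep], scores


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = [["wheat", "price"], ["wheat", "export"],
            ["oil", "price"], ["oil", "export"]]
    labels = ["grain", "grain", "crude", "crude"]
    kept, scores = select_features(docs, labels, "grain", fraction=0.5)
    print(kept, scores)
```

In this toy corpus, "wheat" and "oil" separate the two categories perfectly and receive the highest score, while "price" and "export" occur equally often in both and score zero. A multi-category setting needs an extra step that merges per-category scores (for example by taking the maximum or the average), which this sketch leaves out.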