Choose Your Words Carefully: An Empirical Study of Feature Selection Metrics for Text Classification

Author: George Forman

DOI: 10.1007/3-540-45681-3_13

Abstract: Good feature selection is essential for text classification, both to make it tractable for machine learning and to improve classification performance. This study benchmarks the performance of twelve feature selection metrics across 229 text classification problems drawn from Reuters, OHSUMED, TREC, etc., using Support Vector Machines. The results are analyzed for various objectives. For best accuracy, F-measure, or recall, the findings reveal an outstanding new feature selection metric, "Bi-Normal Separation" (BNS). For precision alone, however, Information Gain (IG) was superior. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner who seeks to choose one or two metrics to try that are most likely to have the best performance for the single dataset at hand. This analysis determined, for example, that Information Gain and Chi-Squared have correlated failures for precision, and that IG paired with BNS is a better choice.
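
The abstract names Bi-Normal Separation without defining it. As a minimal sketch, assuming the definition given in Forman's longer Journal of Machine Learning Research study listed in the references (the absolute difference between the inverse standard normal CDF of a word's true positive rate and that of its false positive rate, with both rates kept away from 0 and 1), the score for a single word could be computed as follows. The function name `bns`, its parameters, and the default clipping value are illustrative choices, not taken from this record.

```python
from statistics import NormalDist


def bns(tp, fp, pos, neg, eps=0.0005):
    """Sketch of a Bi-Normal Separation score for one word.

    tp  -- number of positive-class documents containing the word
    fp  -- number of negative-class documents containing the word
    pos -- total number of positive-class documents
    neg -- total number of negative-class documents
    eps -- assumed clipping bound that keeps the rates away from 0 and 1,
           where the inverse normal CDF is undefined
    """
    inv_cdf = NormalDist().inv_cdf  # inverse CDF of the standard normal
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inv_cdf(tpr) - inv_cdf(fpr))


if __name__ == "__main__":
    # A word seen in 80 of 100 positives and 10 of 100 negatives
    # should score higher than one seen in 60 and 40 respectively.
    print(bns(80, 10, 100, 100))
    print(bns(60, 40, 100, 100))
```

Because the score takes an absolute value, a word that occurs mostly in the negative class is rated as highly as one that occurs mostly in the positive class, so ranking words by this score and keeping the top few would retain both kinds of feature.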

References (7)
Marko Grobelnik, Dunja Mladenic. Feature Selection for Unbalanced Class Distribution and Naive Bayes. International Conference on Machine Learning, pp. 258-267 (1999).
Eui-Hong Han, George Karypis. Centroid-Based Document Classification: Analysis and Experimental Results. European Conference on Principles of Data Mining and Knowledge Discovery, pp. 424-431 (2000). DOI: 10.1007/3-540-45372-5_46
James A. Hanley. The Robustness of the "Binormal" Assumptions Used in Fitting ROC Curves. Medical Decision Making, vol. 8, pp. 197-203 (1988). DOI: 10.1177/0272989X8800800308
Yiming Yang, Xin Liu. A Re-examination of Text Categorization Methods. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42-49 (1999). DOI: 10.1145/312624.312647
George Forman. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, vol. 3, pp. 1289-1305 (2003).
Yiming Yang, Jan O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. International Conference on Machine Learning, pp. 412-420 (1997).
George Forman. Choose Your Words Carefully: An Empirical Study of Feature Selection Metrics for Text Classification. European Conference on Principles of Data Mining and Knowledge Discovery, pp. 150-162 (2002). DOI: 10.1007/3-540-45681-3_13