Feature Selection for Text Categorisation

Author: Øystein Løhre Garnes

DOI:

Keywords:

Abstract: Text categorization is the task of discovering the category or class a text document belongs to, in other words spotting the correct topic for a document. While many machine learning schemes for building automatic classifiers exist today, these are typically resource demanding and do not always achieve the best results when given the whole contents of the documents. A popular solution to these problems is called feature selection. The features (e.g. terms) of a document collection are given weights based on a simple scheme and then ranked by these weights. Next, each document is represented using only the top-ranked features, typically only a few percent of the features. The classifier is then built in considerably less time, and accuracy might even improve. In situations where a document can belong to one of a series of categories, one can either build a multi-class classifier that uses one feature set for all categories, or split the problem into a series of binary categorization tasks (deciding whether a document belongs to a category or not) and create one feature subset per category/classifier. Many feature selection metrics have been suggested over the last decades, including supervised methods that make use of manually pre-categorized training documents, and unsupervised methods that only need training documents of the same type as those to be categorized. While many of these look promising, there has been a lack of large-scale comparison experiments. Also, several new methods have been proposed in the last two years. Moreover, most evaluations are conducted on binary tasks instead of multi-class tasks, as this often gives better results, although multi-class categorization with a joint feature set is what is used in operational environments. In this report, we present results from experiments with 16 feature selection methods (in addition to random selection) using various feature set sizes. Of these, 5 were unsupervised and 11 supervised. All were tested with both a Naive Bayes (NB) and a Support Vector Machine (SVM) classifier. We ran multi-class experiments on a collection of 20 non-overlapping categories, where each method produced feature sets common to all categories. We also combined feature selection methods and evaluated their joint efforts. We found that the classical supervised methods had the best performance, among them Chi Square, Information Gain and Mutual Information. The Chi Square variant GSS coefficient was also among the top performers. Odds Ratio showed excellent performance with NB, but not with SVM. The three unsupervised methods Collection Frequency, Collection Frequency Inverse Document Frequency and Term Frequency Document Frequency had performances close to the supervised group. The Bi-Normal Separation metric performed best on the smallest feature subsets. The weirdness factor performed several times better than random selection, without reaching the top performing group. Some combinations achieved better performance than each method alone, but the majority did not. The top performers Chi Square and GSS coefficient classified more documents correctly together than alone. Four of the five combinations that gave a performance increase included the BNS metric.
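The pipeline the abstract describes (weight every term with a metric, rank the terms, keep the top k) can be illustrated with a short sketch. The Python code below is not from the thesis; it is a minimal illustration using the standard two-by-two contingency form of the Chi Square metric, chi2(t, c) = N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), applied to a hypothetical toy collection. The function and variable names are our own.

def chi_square(term, category, docs):
    """Chi Square score of `term` for `category`.

    `docs` is a list of (set_of_terms, label) pairs. A, B, C, D form the
    usual 2x2 contingency table of term presence vs. category membership.
    """
    A = B = C = D = 0
    for terms, label in docs:
        if term in terms and label == category:
            A += 1
        elif term in terms:
            B += 1
        elif label == category:
            C += 1
        else:
            D += 1
    n = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return n * (A * D - C * B) ** 2 / denom if denom else 0.0

def select_top_features(docs, category, k):
    """Rank every term in the collection by its score and keep the top k."""
    vocab = set().union(*(terms for terms, _ in docs))
    ranked = sorted(vocab, key=lambda t: chi_square(t, category, docs), reverse=True)
    return ranked[:k]

# Hypothetical toy collection for the binary task "sports or not".
docs = [
    ({"goal", "match", "team"}, "sports"),
    ({"match", "league", "goal"}, "sports"),
    ({"election", "vote", "party"}, "politics"),
    ({"vote", "debate"}, "politics"),
]
# Note: Chi Square measures dependence in either direction, so strong
# *negative* indicators of the category (here "vote") also rank high.
print(select_top_features(docs, "sports", k=3))

In the binary setup described in the abstract, select_top_features would be run once per category to build one subset per classifier; a joint multi-class feature set, as used in the thesis experiments, would instead aggregate or share scores across all categories.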
