Author: Øystein Løhre Garnes
DOI:
Keywords:
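Abstract: Text categorization is the task of discovering the category or class a text document belongs to, in other words spotting the correct topic for text documents. While many machine learning schemes exist today for building automatic classifiers, these are typically resource demanding and do not always achieve the best results when given the whole contents of the documents. A popular solution to these problems is called feature selection. The features (e.g. terms) in a document collection are given weights based on a simple scheme, and then ranked by these weights. Next, each document is represented using only the top ranked features, typically only a few percent of the features. The classifier is then built in considerably less time, and accuracy might even improve. In situations where the documents can belong to one of a series of categories, one can either build a multi-class classifier and use one feature set for all categories, or split the problem into a series of binary categorization tasks (deciding if documents belong to a category or not) and create one ranked feature subset for each category/classifier. Many feature selection metrics have been suggested over the last decades, including supervised methods that make use of a manually pre-categorized set of training documents, and unsupervised methods that need only training documents of the same type or collection that is to be categorized. While many of these look promising, there has been a lack of large-scale comparison experiments. Also, several methods have been proposed in the last two years. Moreover, most evaluations are conducted on a set of binary tasks instead of a multi-class task, as this often gives better results, although multi-class categorization with a joint feature set is often used in operational environments. In this report, we present results from the comparison of 16 feature selection methods (in addition to random selection) using various feature subset sizes. Of these, 5 were unsupervised and 11 were supervised. All methods were tested on both a Naive Bayes (NB) classifier and a Support Vector Machine (SVM) classifier. We conducted multi-class experiments using a collection with 20 non-overlapping categories, and each feature selection method produced feature sets common to all the categories. We also combined feature selection methods and evaluated their joint efforts. We found that the classical supervised methods had the best performance, including Chi Square, Information Gain and Mutual Information. The Chi Square variant GSS coefficient was also among the top performers. Odds Ratio showed excellent performance for NB, but not for SVM. The three unsupervised methods Collection Frequency, Collection Frequency Inverse Document Frequency and Term Frequency Document Frequency all showed performances close to the top group. The Bi-Normal Separation (BNS) metric produced excellent results for the smallest subsets. The weirdness factor performed several times better than random selection, but was not among the top performing methods. Some combinations of methods achieved better results than the two methods alone, but the majority did not. The top performers Chi Square and GSS coefficient classified more documents when used together than alone. Four of the five combinations that showed an increase in performance included the BNS metric.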
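The weight, rank and keep-the-top procedure described in the abstract can be illustrated with a minimal Python sketch. It scores terms with the Chi Square statistic and builds one joint feature set for all categories. Everything in it is an assumption made for illustration and is not taken from the report: the function names (`chi_square_scores`, `select_top_features`, `project`), the use of binary term presence per document, and in particular the choice to take the maximum score over categories when forming the joint ranking.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by its largest chi-square statistic over all categories.

    docs: list of token lists; labels: one category label per document.
    Assumption of this sketch: binary term presence, max over categories.
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    cat_count = Counter(labels)                             # documents per category
    term_count = Counter(t for s in doc_sets for t in s)    # documents per term
    joint = Counter((t, c) for s, c in zip(doc_sets, labels) for t in s)

    scores = {}
    for term, n_t in term_count.items():
        best = 0.0
        for cat, n_c in cat_count.items():
            a = joint[(term, cat)]      # term present, document in category
            b = n_t - a                 # term present, document elsewhere
            c = n_c - a                 # term absent, document in category
            d = n - n_t - c             # term absent, document elsewhere
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:                   # skip degenerate 2x2 tables
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_features(docs, labels, k):
    """Rank terms by score (ties broken alphabetically) and keep the top k."""
    scores = chi_square_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def project(doc, features):
    """Represent a document using only the selected features."""
    keep = set(features)
    return [t for t in doc if t in keep]


if __name__ == "__main__":
    docs = [
        ["goal", "match", "the"],
        ["goal", "team", "the"],
        ["vote", "election", "the"],
        ["vote", "party", "the"],
    ]
    labels = ["sport", "sport", "politics", "politics"]
    top = select_top_features(docs, labels, 2)
    print(top)
    print(project(["goal", "the", "vote"], top))
```

In this toy collection the term "the" occurs in every document, so it carries no information about the category and receives a score of zero, while "goal" and "vote" each separate the two categories perfectly and are ranked first. A classifier would then be trained on the projected documents only.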