Evaluation of Feature Selection Approaches for Urdu Text Categorization

Authors: Tehseen Zia, Qaiser Abbas, Muhammad Pervez Akhtar

DOI: 10.5815/IJISA.2015.06.03

Keywords: Artificial intelligence, Feature selection, Information gain ratio, Support vector machine, Feature (computer vision), Model selection, Statistical classification, C4.5 algorithm, Computer science, Pattern recognition, Decision tree

Abstract: Efficient feature selection is an important phase in designing an effective text categorization system. Various methods have been proposed for selecting dissimilar feature sets. It is often essential to evaluate which method is more effective for a given task and what size of feature set is an effective model selection choice. The aim of this paper is to answer these questions for Urdu text categorization. Five widely used feature selection methods were examined using six well-known classification algorithms: naive Bayes (NB), k-nearest neighbor (KNN), support vector machines (SVM) with linear, polynomial, and radial basis kernels, and a decision tree (i.e. J48). The study was conducted over two test collections: the EMILLE collection and a second collection. We observed that three feature selection methods, i.e. information gain, Chi statistics, and symmetrical uncertainty, performed uniformly in most of the cases, if not all. Moreover, we found that no single feature selection method is best for all classifiers. While gain ratio outperformed the others with J48, [...] has shown top performance with KNN and SVM kernels. Overall, linear SVM with any of these methods, including information gain, Chi statistics, or symmetrical uncertainty, turned out to be the first choice across other combinations of classifiers and feature selection methods on the moderate-sized collection. On the other hand, [...] has shown its advantage on the small-sized corpus.
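
The abstract names information gain, gain ratio, and symmetrical uncertainty but does not give their formulas. The following is a minimal sketch of how such scores could be computed for a single binary term feature, assuming the standard entropy-based textbook definitions. It is not the authors' implementation; the function names (`entropy`, `term_scores`) and the toy labels are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def term_scores(term_present, labels):
    """Score one binary term feature against the document class labels.

    term_present: 1/0 per document (does the term occur in it?)
    labels:       class label per document
    """
    h_class = entropy(labels)
    h_term = entropy(term_present)

    # Conditional entropy H(class | term): entropy of the labels inside each
    # term-value group, weighted by the group's share of the documents.
    h_cond = 0.0
    for value in set(term_present):
        group = [c for t, c in zip(term_present, labels) if t == value]
        h_cond += len(group) / len(labels) * entropy(group)

    info_gain = h_class - h_cond
    denom = h_class + h_term
    return {
        "information_gain": info_gain,
        # Gain ratio normalizes information gain by the term's own entropy.
        "gain_ratio": info_gain / h_term if h_term > 0 else 0.0,
        # Symmetrical uncertainty normalizes by both entropies.
        "symmetrical_uncertainty": 2 * info_gain / denom if denom > 0 else 0.0,
    }


# Toy data (hypothetical): four documents, two classes.
labels = ["sports", "sports", "politics", "politics"]
perfect_term = [1, 1, 0, 0]   # occurs only in "sports" documents
useless_term = [1, 0, 1, 0]   # occurs equally often in both classes

perfect = term_scores(perfect_term, labels)
useless = term_scores(useless_term, labels)
```

On this toy data the term that separates the classes perfectly receives the highest possible score under all three measures, while the evenly spread term receives zero information gain. In a feature selection setting, terms would be ranked by one of these scores and only the top-ranked ones passed to the classifier.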
