Text Categorization Using an Automatically Generated Labelled Dataset: An Evaluation Study

Authors: Dengya Zhu, Kok Wai Wong

DOI: 10.1007/978-3-319-12637-1_60

Keywords: Benchmark (computing); Set (abstract data type); Feature (machine learning); Computer science; Boosting methods for object categorization; Naive Bayes classifier; Text categorization; Feature selection; AdaBoost; Machine learning; Artificial intelligence

Abstract: Naive Bayes (NB), kNN and AdaBoost are three commonly used text classifiers. Evaluating these classifiers involves a variety of factors, including the benchmark used, feature selection, the parameter settings of the algorithms, and the measurement criteria employed. Researchers have demonstrated that some algorithms outperform others on some corpora; however, corpus labelling and bias are two concerns in text categorization. This paper focuses on evaluating these classifiers by using an automatically generated document set that is labelled by a group of experts to alleviate the subjectiveness of labelling, and at the same time examines how performance is influenced by feature selection and the number of features selected.
