Discretizing Continuous Attributes in AdaBoost for Text Categorization

Authors: Pio Nardiello, Fabrizio Sebastiani, Alessandro Sperduti

DOI: 10.1007/3-540-36618-0_23

Keywords: Text categorization; Discretization; Weighting; Boosting (machine learning); Artificial intelligence; Pattern recognition; Computer science; AdaBoost; Machine learning; Categorization; Boosting methods for object categorization; Entropy (information theory)

Abstract: We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, AdaBoost.MH and AdaBoost.MHKR. While the former is a realization of the well-known AdaBoost algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating the presence or absence of terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks and that provide a much more significant rendition of a document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of AdaBoost-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 collection, showing that the version with discretized attributes outperforms the version with traditional binary representations.
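To make the idea concrete, the following is a minimal sketch (not the authors' exact procedure) of supervised, entropy-based discretization in the spirit of Fayyad and Irani: for a single term, choose the cut point on its continuous (e.g., tf-idf) weights that minimizes the class entropy of the induced two-bin partition, so that documents can then be binarized before boosting. The function names (`entropy`, `best_cut`) and the toy data are illustrative assumptions, and the MDL stopping criterion used for multi-interval splitting is omitted.

```python
# Sketch of single-cut, entropy-based discretization for one term.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a multiset of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(weights, labels):
    """Return the threshold on `weights` minimizing the weighted average
    class entropy of the two resulting partitions; candidate cuts are
    midpoints between consecutive distinct sorted values."""
    pairs = sorted(zip(weights, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no boundary between equal weight values
        t = (pairs[i][0] + pairs[i - 1][0]) / 2.0
        left = [l for w, l in pairs[:i]]
        right = [l for w, l in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if e < best_e:
            best_t, best_e = t, e
    return best_t

# Toy example: tf-idf weights of one term across six documents with
# binary category labels; documents whose weight exceeds the cut would
# be mapped to 1, the others to 0, before running an AdaBoost-style learner.
w = [0.02, 0.05, 0.31, 0.40, 0.44, 0.70]
y = [0, 0, 1, 1, 1, 1]
print(best_cut(w, y))  # cut falls between 0.05 and 0.31
```

In this sketch a single threshold reduces each weighted attribute back to a binary one, but the threshold is chosen from the class distribution rather than by mere presence/absence, which is the kind of information the binary representation discards.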
