Learning when training data are costly: the effect of class distribution on tree induction

Authors: G. M. Weiss, F. Provost

DOI: 10.1613/JAIR.1199

Abstract: For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best class distribution for learning. The naturally occurring class distribution is shown to generally perform well when the classifier is evaluated using undifferentiated error rate (0/1 loss). However, when the area under the ROC curve is used to evaluate classifier performance, a balanced class distribution is shown to perform well. Since neither of these choices always generates the best-performing classifier, we introduce a "budget-sensitive" progressive sampling algorithm for selecting training examples based on the class associated with each example. An empirical analysis of this algorithm shows that the resulting class distribution yields classifiers with good (nearly-optimal) performance.
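The abstract only names the budget-sensitive progressive sampling procedure; the Python sketch below illustrates one way such a scheme could work, and is not the authors' exact algorithm. The function names, the geometric growth factor, the candidate minority-class fractions, and the AUC-based selection rule are all assumptions made for illustration: the sampler grows the procured training set geometrically, re-estimates which class distribution currently performs best by AUC on a validation set, and steers the next procurement batch toward that distribution without exceeding the example budget.

```python
# Illustrative sketch only (hypothetical names and parameters), not the
# algorithm from Weiss & Provost: budget-sensitive progressive sampling
# that procures examples class-by-class under a fixed budget.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score


def _subsample_at(X, y, frac, rng):
    """Largest subsample of (X, y) whose positive-class fraction is `frac`."""
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    n = min(int(len(pos) / frac), int(len(neg) / (1.0 - frac)))
    if n < 10:                                # too few examples to evaluate
        return None
    k_pos = int(round(frac * n))
    idx = np.concatenate([rng.choice(pos, k_pos, replace=False),
                          rng.choice(neg, n - k_pos, replace=False)])
    return X[idx], y[idx]


def budget_sensitive_sampling(pool_X, pool_y, val_X, val_y,
                              budget=1000, init=100, growth=2.0,
                              candidate_fracs=(0.3, 0.5, 0.7)):
    """Procure at most `budget` examples from a two-class pool (labels 0/1),
    steering each stage toward the class distribution that currently gives
    the best AUC on the validation set (val_X, val_y)."""
    rng = np.random.default_rng(0)
    idx_by_class = {c: list(np.flatnonzero(pool_y == c)) for c in (0, 1)}
    taken_X, taken_y = [], []
    n_target, best_frac = init, 0.5           # start with a balanced guess
    while len(taken_y) < budget:
        n_target = min(int(n_target), budget)
        # Procure new examples of each class up to the target size at the
        # currently-best positive-class fraction.
        k_pos = int(round(best_frac * n_target))
        want = {1: k_pos, 0: n_target - k_pos}
        before = len(taken_y)
        for c in (0, 1):
            have = sum(1 for yy in taken_y if yy == c)
            for _ in range(max(0, want[c] - have)):
                if not idx_by_class[c]:
                    break                     # this class's pool is empty
                i = idx_by_class[c].pop(int(rng.integers(len(idx_by_class[c]))))
                taken_X.append(pool_X[i])
                taken_y.append(pool_y[i])
        if len(taken_y) == before:
            break                             # pool exhausted; stop early
        X, y = np.asarray(taken_X), np.asarray(taken_y)
        # Re-score each candidate class distribution on the data in hand
        # and keep the AUC-best one for the next procurement stage.
        scores = {}
        for f in candidate_fracs:
            sub = _subsample_at(X, y, f, rng)
            if sub is not None:
                clf = DecisionTreeClassifier(random_state=0).fit(*sub)
                scores[f] = roc_auc_score(val_y,
                                          clf.predict_proba(val_X)[:, 1])
        if scores:
            best_frac = max(scores, key=scores.get)
        n_target *= growth                    # geometric sampling schedule
    return np.asarray(taken_X), np.asarray(taken_y)
```

The geometric schedule mirrors standard progressive sampling: each stage roughly doubles the procured set, so the cost of re-evaluating candidate distributions stays small relative to the final training run, while the budget cap ensures procurement costs are never exceeded.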
