Authors: G. M. Weiss, F. Provost
DOI: 10.1613/JAIR.1199
Keywords:
Abstract: For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best class distribution for learning. The naturally occurring class distribution is shown generally to perform well when the classifier is evaluated using undifferentiated error rate (0/1 loss). However, when the area under the ROC curve is used to evaluate classifier performance, a balanced class distribution is shown to perform well. Since neither of these choices always generates the best-performing classifier, we introduce a "budget-sensitive" progressive sampling algorithm for selecting training examples based on the class associated with each example. An empirical analysis of this algorithm shows that the resulting training set yields classifiers with good (nearly-optimal) classification performance.
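The following is a minimal sketch, not the authors' code, of the core comparison the abstract describes: for a fixed training budget n, train a classification tree on the naturally occurring class distribution and on a balanced distribution, then score each by error rate (0/1 loss) and by area under the ROC curve. The synthetic dataset, the budget n = 1000, and the sampling helper are illustrative assumptions; the paper's budget-sensitive progressive sampling algorithm is more elaborate than this fixed two-way comparison.

```python
# Sketch: fixed-budget training under two class distributions, evaluated by
# error rate and ROC AUC. Dataset, budget, and helper names are stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.RandomState(0)

# Imbalanced synthetic pool (about 10% positives) standing in for a real task.
X, y = make_classification(n_samples=20000, weights=[0.9, 0.1], random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

n = 1000  # fixed training budget: only n examples may be selected

def sample_with_distribution(X, y, n, pos_fraction, rng):
    """Draw n examples containing the requested fraction of positives."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_pos = int(round(n * pos_fraction))
    take = np.concatenate([
        rng.choice(pos_idx, n_pos, replace=False),
        rng.choice(neg_idx, n - n_pos, replace=False),
    ])
    return X[take], y[take]

natural_frac = y_pool.mean()  # naturally occurring positive fraction
for name, frac in [("natural", natural_frac), ("balanced", 0.5)]:
    X_tr, y_tr = sample_with_distribution(X_pool, y_pool, n, frac, rng)
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    err = 1.0 - accuracy_score(y_test, tree.predict(X_test))
    auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
    print(f"{name:>8}: error rate = {err:.3f}, AUC = {auc:.3f}")
```

On imbalanced data this setup will typically show the pattern the abstract reports: the natural distribution tends to do well on error rate, while the balanced distribution tends to do well on AUC, which is why neither choice dominates and a budget-sensitive search over distributions is proposed.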