Authors: Randall Wald, Taghi M. Khoshgoftaar, Alireza Fazelpour
Keywords: Naive Bayes classifier, Data modeling, Statistics, Class (biology), Bioinformatics, Computer science, Support vector machine, Value (computer science), Field (computer science), Random forest, Sample size determination
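Abstract: A major challenge facing data-mining practitioners in the field of bioinformatics is class imbalance, which occurs when instances of one class (called the majority class) vastly outnumber instances of the other (minority) classes. This can result in models with increased bias towards the majority class (minority-class instances predicted as being in the majority class). Data sampling, a process which changes the dataset through removing or adding instances to improve the class balance, can be used to improve the performance of such models on imbalanced data. However, it is not clear what target class balance level should be used for the data, and what influence class imbalance alone has on classification performance (compared to other issues such as the difficulty of learning from the data and the dataset size). To resolve this, we propose the Balance-Aware Subsampling technique, which allows researchers to directly compare different class balance levels while keeping all other factors (such as dataset size and the actual dataset in question) constant. Thus, any changes in classification performance can be attributed solely to the chosen class balance level. We demonstrate this technique using six datasets from the field of bioinformatics, and we also consider three subsample dataset sizes (that is, the size of the dataset used for building the model) so that we can observe the effect of this parameter on classification performance. Our results show that within each class balance level, the average AUC value increases as the subsample dataset size increases. The key exception is the 20:80 (minority:majority) class balance level, where performance decreases from the 80 to the 120 subsample dataset size. We also find that, within each subsample dataset size, classification performance increases as the minority distribution increases, although this does not completely hold for the 40 subsample dataset size (in this case, the Naive Bayes and Random Forest learners have greater performance at 35:65 than at 50:50), and in general there is not a significant improvement between the 35:65 and 50:50 class balance levels. Overall, by using Balance-Aware Subsampling, we are able to show how class balance level affects classification performance isolated from other factors.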
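The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of the general idea, under stated assumptions: draw a subsample of fixed total size with a chosen minority:majority ratio from a single dataset, so that balance level and subsample size can be varied independently. The function name `balance_aware_subsample`, its parameters, and the toy labels are hypothetical and are not taken from the paper; sampling without replacement is also an assumption.

```python
import random
from collections import Counter


def balance_aware_subsample(labels, minority_label, size, minority_fraction, seed=0):
    """Return indices of a subsample with a fixed size and class balance.

    Hypothetical helper: samples without replacement from each class so the
    result has exactly `size` instances, of which round(size * minority_fraction)
    belong to the minority class.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]

    n_minority = round(size * minority_fraction)
    n_majority = size - n_minority
    if n_minority > len(minority_idx) or n_majority > len(majority_idx):
        raise ValueError("not enough instances for the requested size and balance")

    chosen = rng.sample(minority_idx, n_minority) + rng.sample(majority_idx, n_majority)
    rng.shuffle(chosen)  # avoid ordering the subsample by class
    return chosen


# Toy labels (made up for illustration): 60 minority (1), 140 majority (0).
labels = [1] * 60 + [0] * 140

# Sizes and balance levels named in the abstract: 40/80/120 and 20:80, 35:65, 50:50.
for size in (40, 80, 120):
    for fraction in (0.20, 0.35, 0.50):
        idx = balance_aware_subsample(labels, 1, size, fraction)
        counts = Counter(labels[i] for i in idx)
        print(size, fraction, counts[1], counts[0])
```

Because every subsample comes from the same source dataset and has the same total size within a comparison, differences in a downstream metric such as AUC can be read as the effect of balance level alone, which is the property the abstract emphasizes. In practice one would likely repeat the draw with several seeds and average the results.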