The use of balance-aware subsampling for bioinformatics datasets

Authors: Randall Wald, Taghi M. Khoshgoftaar, Alireza Fazelpour

DOI: 10.1109/IRI.2013.6642489

Keywords: Naive Bayes classifier, Data modeling, Statistics, Class (biology), Bioinformatics, Computer science, Support vector machine, Value (computer science), Field (computer science), Random forest, Sample size determination

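Abstract: A major challenge facing data-mining practitioners in the field of bioinformatics is class imbalance, which occurs when instances of one class (called the majority class) vastly outnumber instances of the other (minority) classes. This can result in models with increased bias towards the majority class (minority-class instances predicted as being in the majority class). Data sampling, a process which changes the dataset through removing or adding instances to improve class balance, can be used to improve the performance of such models on imbalanced data. However, it is not clear what target balance level should be used for the data, or what influence class imbalance alone has on classification performance (compared to other issues such as the difficulty of learning from the data and dataset size). To resolve this, we propose the Balance-Aware Subsampling technique, which allows researchers to directly compare different class balance levels while keeping all other factors (such as dataset size and the actual dataset in question) constant. Thus, any changes in classification performance can be attributed solely to the chosen balance level. We demonstrate this technique using six datasets from the field of bioinformatics, and also consider three subsample sizes (that is, the number of instances used for building the model) so that we can observe the effect of this parameter on classification performance. Our results show that within each class balance level, the average AUC value increases as the subsample size increases. The key exception is the 20:80 (minority:majority) balance level, where performance decreases from subsample size 80 to 120. We also find that within each subsample size, the average AUC value increases as the minority distribution increases, although this does not completely hold for subsample size 40 (in this case, the Naive Bayes and Random Forest learners have greater performance at 35:65 than at 50:50), and in general there [...] significant improvement between [...] 50:50 levels. Overall, by using Balance-Aware Subsampling, we are able to see how class balance affects classification performance isolated from other factors.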
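The abstract does not give the exact procedure, so the following is only a minimal sketch of the idea it describes: draw subsamples that share one fixed size and differ only in their minority:majority ratio, so that size and source dataset stay constant while balance varies. The function name `balance_aware_subsample`, its parameters, sampling without replacement, and the toy labels are all assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter


def balance_aware_subsample(labels, size, minority_fraction, minority_label, seed=0):
    """Return indices of a subsample with a fixed total size and class ratio.

    Hypothetical helper: assumes sampling without replacement, which is not
    stated in the abstract.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]

    # Split the fixed subsample size between the two classes.
    n_min = round(size * minority_fraction)
    n_maj = size - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("not enough instances for the requested size and balance")

    chosen = rng.sample(minority, n_min) + rng.sample(majority, n_maj)
    rng.shuffle(chosen)
    return chosen


# Toy labels (made up): 60 minority instances (1) and 140 majority instances (0).
labels = [1] * 60 + [0] * 140

# Same subsample size, three balance levels mentioned in the abstract.
for frac in (0.20, 0.35, 0.50):
    idx = balance_aware_subsample(labels, size=40, minority_fraction=frac, minority_label=1)
    counts = Counter(labels[i] for i in idx)
    print(f"{frac:.2f} -> minority={counts[1]}, majority={counts[0]}")
```

Because every subsample has the same size and comes from the same dataset, a model trained on each one can be compared across balance levels; repeating the loop for the other subsample sizes (80 and 120 in the abstract) would cover the second factor studied.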
