Bias correction for selecting the minimal-error classifier from many machine learning models

Authors: Ying Ding, Shaowu Tang, Serena G. Liao, Jia Jia, Steffi Oesterreich

DOI: 10.1093/BIOINFORMATICS/BTU520

Keywords: Machine learning, Weighted arithmetic mean, Data mining, Sample size determination, Selection bias, Training set, Test data, Computer science, Classifier (UML), Artificial intelligence, Generalization error, Curve fitting, Word error rate

Abstract: Motivation: Supervised machine learning is commonly applied in genomic research to construct a classifier from the training data that is generalizable to predict independent testing data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no universally best method exists in general. It has been common practice to apply many machine learning methods and report the method that produces the smallest cross-validation error rate. Theoretically, such a procedure produces a selection bias. Consequently, many clinical studies with moderate sample sizes (e.g. n = 30–60) risk reporting a falsely small cross-validation error rate that could not be validated later in independent cohorts. Results: In this article, we illustrated the probabilistic framework of the problem and explored the statistical and asymptotic properties. We proposed a new bias correction method based on learning curve fitting by inverse power law (IPL) and compared it with three existing methods: nested cross-validation, weighted mean correction and the Tibshirani-Tibshirani procedure. All methods were compared in simulation datasets, five moderate size real datasets and two large breast cancer datasets. The result showed that IPL outperforms the other methods in bias correction with smaller variance, and it has an additional advantage to extrapolate error estimates for larger sample sizes, a practical feature to recommend whether more samples should be recruited to improve the classifier and accuracy. An R package ‘MLbias’ and all source files are publicly available. Availability and implementation: tsenglab.biostat.pitt.edu/software.htm. Contact: ctseng@pitt.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
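
The learning-curve idea in the abstract can be illustrated with a minimal sketch. It assumes the commonly used inverse power law form error(n) = a * n^(-alpha) + b, where b is the asymptotic error rate; the abstract does not state the exact parameterization or fitting procedure, so the grid search below is a stand-in and not the authors' method. The function names `ipl` and `fit_ipl` are hypothetical and are not part of the 'MLbias' package, and the numbers in the usage section are synthetic.

```python
import numpy as np


def ipl(n, a, alpha, b):
    """Assumed inverse power law learning curve: a * n**(-alpha) + b."""
    return a * np.power(n, -alpha) + b


def fit_ipl(sizes, errors, n_grid=400):
    """Fit (a, alpha, b) to positive error rates by a grid search over b.

    For a fixed asymptote b below the smallest observed error, the model
    is linear in log space: log(error - b) = log(a) - alpha * log(n).
    Each candidate b is fitted by least squares in log space, and the
    candidate with the smallest squared error on the original scale wins.
    """
    sizes = np.asarray(sizes, dtype=float)
    errors = np.asarray(errors, dtype=float)
    best = None
    # Candidate asymptotes from 0 up to just below the smallest error,
    # so that errors - b stays strictly positive.
    for b in np.linspace(0.0, errors.min(), n_grid, endpoint=False):
        slope, intercept = np.polyfit(np.log(sizes), np.log(errors - b), 1)
        a, alpha = np.exp(intercept), -slope
        sse = np.sum((ipl(sizes, a, alpha, b) - errors) ** 2)
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    # Synthetic cross-validation error rates at several training-set
    # sizes (illustrative values only, not taken from the paper).
    sizes = np.array([20, 30, 40, 50, 60])
    errors = ipl(sizes, a=0.8, alpha=0.6, b=0.1)

    a, alpha, b = fit_ipl(sizes, errors)
    # Extrapolate the fitted curve to a larger hypothetical cohort.
    print("estimated asymptotic error:", round(float(b), 3))
    print("extrapolated error at n=200:", round(float(ipl(200, a, alpha, b)), 3))
```

Fitting in log space for each candidate asymptote keeps the sketch dependent on NumPy alone; a nonlinear least-squares routine would be a natural alternative. The extrapolation step corresponds to the practical use mentioned in the abstract, namely judging whether recruiting more samples is likely to reduce the error rate.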

References (16)
Christoph Bernau, Thomas Augustin, Anne-Laure Boulesteix, Correcting the optimal resampling-based error rate by estimating the error rate of wrapper algorithms. Biometrics, vol. 69, pp. 693-702 (2013), 10.1111/BIOM.12041
Bradley Efron, Empirical Bayes Estimates for Large-Scale Prediction Problems. Journal of the American Statistical Association, vol. 104, pp. 1015-1028 (2009), 10.1198/JASA.2009.TM08523
Christina Curtis, Sohrab P Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M Rueda, Mark J Dunning, Doug Speed, Andy G Lynch, Shamith Samarajiwa, Yinyin Yuan, Stefan Gräf, Gavin Ha, Gholamreza Haffari, Ali Bashashati, Roslin Russell, Steven McKinney, METABRIC Group, The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, vol. 486, pp. 346-352 (2012), 10.1038/NATURE10983
David B. Allison, Xiangqin Cui, Grier P. Page, Mahyar Sabripour, Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics, vol. 7, pp. 55-65 (2006), 10.1038/NRG1749
M Slawski, M Daumer, A-L Boulesteix, CMA – a comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics, vol. 9, p. 439 (2008), 10.1186/1471-2105-9-439
Ryan J. Tibshirani, Robert Tibshirani, A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics, vol. 3, pp. 822-829 (2009), 10.1214/08-AOAS224
Sayan Mukherjee, Pablo Tamayo, Simon Rogers, Ryan Rifkin, Anna Engle, Colin Campbell, Todd R. Golub, Jill P. Mesirov, Estimating dataset size requirements for classifying DNA microarray data. Journal of Computational Biology, vol. 10, pp. 119-142 (2003), 10.1089/106652703321825928
M. R. Yousefi, J. Hua, C. Sima, E. R. Dougherty, Reporting bias when using real data sets to analyze classification performance. Bioinformatics, vol. 26, pp. 68-76 (2010), 10.1093/BIOINFORMATICS/BTP605
D. Berrar, I. Bradbury, W. Dubitzky, Avoiding model selection bias in small-sample genomic datasets. Bioinformatics, vol. 22, pp. 1245-1250 (2006), 10.1093/BIOINFORMATICS/BTL066