Authors: Ying Ding, Shaowu Tang, Serena G. Liao, Jia Jia, Steffi Oesterreich
DOI: 10.1093/BIOINFORMATICS/BTU520
Keywords: Machine learning, Weighted arithmetic mean, Data mining, Sample size determination, Selection bias, Training set, Test data, Computer science, Classifier (UML), Artificial intelligence, Generalization error, Curve fitting, Word error rate
Abstract:
Motivation: Supervised machine learning is commonly applied in genomic research to construct a classifier from the training data that generalizes to predict independent test data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no universally best method exists in general. It has been common practice to apply many machine learning methods and report the one that produces the smallest cross-validation error rate. Theoretically, such a procedure introduces a selection bias. Consequently, clinical studies with moderate sample sizes (e.g. n = 30–60) risk reporting a falsely small error rate that could not be validated in later independent cohorts.
Results: In this article, we illustrated the probabilistic framework of the problem and explored its statistical and asymptotic properties. We proposed a new bias correction based on learning-curve fitting by the inverse power law (IPL) and compared it with three existing methods: nested cross-validation, weighted mean correction and the Tibshirani-Tibshirani procedure. All methods were compared on simulated datasets, five moderate-size real datasets and two large breast cancer datasets. The results showed that IPL outperforms the other methods in bias correction with smaller variance, and it has the additional advantage of extrapolating error estimates to larger sample sizes, a practical feature for recommending whether more samples should be recruited to improve classification accuracy. An R package 'MLbias' and all source files are publicly available.
Availability and implementation: tsenglab.biostat.pitt.edu/software.htm
Contact: ctseng@pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
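The bias correction described in the Results section rests on fitting a learning curve with an inverse power law, commonly parameterized as e(n) = a + b·n^(−c), where e(n) is the expected error rate at training-set size n. The sketch below is written in R (the language of the MLbias package, though this is not the package's API; the sample sizes, error rates and starting values are hypothetical) and shows how such a curve can be fit to cross-validation error estimates and then extrapolated to larger sample sizes.

```r
## Minimal sketch (not the MLbias API): fit an inverse power law (IPL)
## learning curve, e(n) = a + b * n^(-c), to cross-validation error
## estimates obtained at several training-set sizes, then extrapolate
## the expected error rate at larger sizes. All numbers are hypothetical.

n_train  <- c(20, 30, 40, 50, 60)                 # training-set sizes used for CV
cv_error <- c(0.302, 0.268, 0.249, 0.241, 0.232)  # hypothetical CV error rates

## Nonlinear least-squares fit of the IPL curve; starting values are rough guesses.
fit <- nls(cv_error ~ a + b * n_train^(-c),
           start = list(a = 0.2, b = 1, c = 0.8))

coef(fit)   # fitted parameters a, b, c

## Extrapolate the expected error rate at larger sample sizes (e.g. n = 100, 150),
## the feature the abstract highlights for deciding whether recruiting more
## samples is likely to improve classification accuracy.
predict(fit, newdata = data.frame(n_train = c(100, 150)))
```

In this parameterization, the intercept a can be read as the error rate the classifier would approach with unlimited training data, while b and c govern how quickly the cross-validation error decays as more samples are added.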