Benchmarking 50 classification algorithms on 50 gene-expression datasets

Authors: Stephen R. Piccolo, Nathan P. Golightly, Dustin B. Miller, Avery Mecham, Jérémie L. Johnson

DOI: 10.1101/2021.05.07.442940

Abstract: By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Diverse types of biomarkers have been proposed for assigning patients to subgroups. For example, DNA variants in tumors show promise as biomarkers; however, tumors exhibit considerable genomic heterogeneity. As an alternative, transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist, and most support diverse hyperparameters, so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 50 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection in nested cross-validation folds. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
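
The abstract mentions tuning hyperparameters and selecting features inside nested cross-validation folds. A minimal sketch of that general idea follows; it is not the authors' pipeline. Everything here is an assumption made for illustration: the synthetic data, the nearest-centroid classifier, the mean-difference gene ranking, the grid of feature counts, and all function names. The point it tries to show is that both feature ranking and the choice of how many features to keep are made from training data only, so the outer test fold never influences them.

```python
import random
from statistics import mean

def make_data(n_samples=60, n_genes=40, n_informative=4, seed=0):
    """Synthetic stand-in for an expression matrix (hypothetical data):
    only the first n_informative genes differ between the two classes."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        X.append([rng.gauss(2.0 * label if g < n_informative else 0.0, 1.0)
                  for g in range(n_genes)])
        y.append(label)
    return X, y

def rank_genes(X, y):
    """Univariate ranking: absolute difference between the class means of each gene."""
    scores = []
    for g in range(len(X[0])):
        m0 = mean(row[g] for row, lab in zip(X, y) if lab == 0)
        m1 = mean(row[g] for row, lab in zip(X, y) if lab == 1)
        scores.append((abs(m1 - m0), g))
    return [g for _, g in sorted(scores, reverse=True)]

def fit_predict(X_tr, y_tr, X_te, k):
    """Pick the top-k genes using training data only, then classify by nearest centroid."""
    genes = rank_genes(X_tr, y_tr)[:k]
    centroids = {}
    for lab in (0, 1):
        rows = [r for r, l in zip(X_tr, y_tr) if l == lab]
        centroids[lab] = [mean(r[g] for r in rows) for g in genes]
    preds = []
    for r in X_te:
        dist = {lab: sum((r[g] - c) ** 2 for g, c in zip(genes, cen))
                for lab, cen in centroids.items()}
        preds.append(min(dist, key=dist.get))
    return preds

def folds(indices, n_folds):
    """Split a list of sample indices into n_folds interleaved test folds."""
    return [indices[i::n_folds] for i in range(n_folds)]

def count_correct(X, y, train, test, k):
    """Train on `train`, predict `test`, and return the number of correct predictions."""
    preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                        [X[i] for i in test], k)
    return sum(p == y[i] for p, i in zip(preds, test))

def cv_accuracy(X, y, idx, k, n_folds):
    """Plain cross-validated accuracy over the samples in idx for a fixed k."""
    correct = 0
    for test in folds(idx, n_folds):
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        correct += count_correct(X, y, train, test, k)
    return correct / len(idx)

def nested_cv(X, y, k_grid=(1, 4, 16), outer=5, inner=3, seed=1):
    """Outer folds estimate performance; inner folds choose k from the outer training set."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    correct, chosen = 0, []
    for test in folds(idx, outer):
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        # Hyperparameter choice sees only the outer training samples.
        best_k = max(k_grid, key=lambda k: cv_accuracy(X, y, train, k, inner))
        chosen.append(best_k)
        correct += count_correct(X, y, train, test, best_k)
    return correct / len(idx), chosen

X, y = make_data()
accuracy, chosen_k = nested_cv(X, y)
print("outer-fold accuracy:", accuracy)
print("k chosen per outer fold:", chosen_k)
```

Ranking genes on the full dataset before splitting would leak test information into the model, which is the failure mode that nesting is meant to avoid. A real study would more likely use an established library's cross-validation and feature-selection utilities than hand-written loops like these.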
