Probability estimation with machine learning methods for dichotomous and multicategory outcome: Applications

Authors: Jochen Kruppa, Yufeng Liu, Hans-Christian Diener, Theresa Holste, Christian Weimar

DOI: 10.1002/BIMJ.201300077

Abstract: Machine learning methods are applied to three different large datasets, all dealing with probability estimation problems for dichotomous or multicategory data. Specifically, we investigate k-nearest neighbors, bagged nearest neighbors, random forests for probability estimation trees, and support vector machines with the kernels of Bessel, linear, Laplacian, and radial basis type. Comparisons are made with logistic regression. The dataset from the German Stroke Study Collaboration, with dichotomous and three-category outcome variables, allows, in particular, for temporal and external validation. The other two datasets are freely available from the UCI repository and provide dichotomous outcome variables. One of them, the Cleveland Clinic Foundation Heart Disease dataset, uses data from one clinic for training and data from other clinics for external validation, while the other, the thyroid disease dataset, allows for temporal validation by separating the data into training and test data by date of recruitment into the study. For dichotomous outcome variables, we use receiver operating characteristics, areas under the curve values with bootstrapped 95% confidence intervals, and Hosmer-Lemeshow-type figures as comparison criteria. For dichotomous and multicategory outcomes, we calculated bootstrap Brier scores with 95% confidence intervals and also compared them through bootstrapping. In a supplement, we provide R code for performing the analyses and for random forest analyses in Random Jungle, version 2.1.0. The learning machines show promising performance over all constructed models. They are simple to apply and serve as an alternative approach to logistic or multinomial logistic regression analysis.
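
The Brier score with a bootstrapped 95% confidence interval, named in the abstract as a comparison criterion, can be illustrated with a minimal Python sketch. This is not the authors' R supplement. The function names `brier_score` and `bootstrap_ci`, the toy probabilities, the sum-over-categories form of the score, and the percentile bootstrap are all assumptions made for illustration; the paper may use a different scaling or resampling scheme.

```python
import random


def brier_score(probs, labels):
    """Mean multicategory Brier score (assumed form): squared distance between
    each predicted probability vector and the one-hot encoded outcome."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2
                     for k, p_k in enumerate(p))
    return total / len(labels)


def bootstrap_ci(probs, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Brier score (an assumed scheme):
    resample observations with replacement and take empirical quantiles."""
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(brier_score([probs[i] for i in idx],
                                  [labels[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical three-category predictions (rows sum to 1) and observed classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5],
         [0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
labels = [0, 1, 2, 1, 1, 2]

score = brier_score(probs, labels)
lower, upper = bootstrap_ci(probs, labels)
print(f"Brier score {score:.3f}, 95% bootstrap CI [{lower:.3f}, {upper:.3f}]")
```

Lower scores indicate probability estimates closer to the observed outcomes. With this form of the score a perfect forecast yields 0 and the worst possible forecast yields 2, regardless of the number of categories.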
