Probability estimation with machine learning methods for dichotomous and multicategory outcome: Theory

Authors: Jochen Kruppa, Yufeng Liu, Gérard Biau, Michael Kohler, Inke R. König

DOI: 10.1002/BIMJ.201300068

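Abstract: Probability estimation for binary and multicategory outcome using logistic and multinomial logistic regression has a long-standing tradition in biostatistics. However, biases may occur if the model is misspecified. In contrast, outcome probabilities for individuals can be estimated consistently with machine learning approaches, including k-nearest neighbors (k-NN), bagged nearest neighbors (b-NN), random forests (RF), and support vector machines (SVM). Because machine learning methods are rarely used by applied biostatisticians, the primary goal of this paper is to explain the concept of probability estimation with these methods and to summarize recent theoretical findings. Probability estimation in k-NN, b-NN, and RF can be embedded into the class of nonparametric regression learning machines; therefore, we start with the construction of nonparametric regression estimates and review results on consistency and rates of convergence. In SVMs, outcome probabilities for individuals are estimated consistently by repeatedly solving classification problems. For SVMs we review the classification problem and then dichotomous probability estimation. Next we extend the algorithms for estimating probabilities using k-NN, b-NN, and RF to multicategory outcomes and discuss approaches for the multicategory probability estimation problem using SVM. In simulation studies for dichotomous and multicategory dependent variables we demonstrate the general validity of the machine learning methods and compare it with logistic regression. However, each method fails in at least one simulation scenario. We conclude with a discussion of the failures and give recommendations for selecting and tuning the methods. Applications to real data and example code are provided in a companion article (doi:10.1002/bimj.201300077).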
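
The idea that k-NN probability estimation is a nonparametric regression on the class indicator can be shown with a small sketch. It is not the authors' code (their examples are in the companion article). The function name `knn_class_probabilities`, the Euclidean distance, and the toy data are assumptions made here for illustration. The estimate of P(Y = c | X = x) is simply the share of class c among the k training points closest to x, which covers both the dichotomous and the multicategory case.

```python
import math
from collections import Counter


def knn_class_probabilities(train_x, train_y, query, k):
    """Estimate P(Y = c | X = query) as the share of class c
    among the k nearest training points (Euclidean distance)."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training indices by distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    neighbour_labels = [train_y[i] for i in order[:k]]
    counts = Counter(neighbour_labels)
    # Every class seen in training gets an estimate, possibly 0.
    return {label: counts.get(label, 0) / k for label in sorted(set(train_y))}


# Toy data, made up for illustration: one feature, binary outcome.
train_x = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
train_y = [0, 0, 1, 0, 1, 1]
probs = knn_class_probabilities(train_x, train_y, query=(2.2,), k=3)
print(probs)

# The same function handles more than two classes.
multi_y = ["a", "a", "b", "b", "c", "c"]
multi_probs = knn_class_probabilities(train_x, multi_y, query=(2.2,), k=3)
print(multi_probs)
```

In this sketch k is fixed by hand. The consistency results that the abstract refers to concern how k should grow with the sample size, which a toy example like this does not address.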

References (89)
Daniel Enache, Gerhard Arminger. Statistical Models and Artificial Neural Networks. Data Analysis and Information Systems, pp. 243-260 (1996). doi:10.1007/978-3-642-80098-6_21
László Györfi, Michael Kohler, Adam Krzyżak, Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, New York, NY (2002). doi:10.1007/B97848
Foster Provost, Pedro Domingos. Tree Induction for Probability-Based Ranking. Machine Learning, vol. 52, pp. 199-215 (2003). doi:10.1023/A:1024099825458
Ting-Fan Wu, Chih-Jen Lin, Ruby Weng. Probability Estimates for Multi-class Classification by Pairwise Coupling. Journal of Machine Learning Research, vol. 5, pp. 975-1005 (2004). doi:10.5555/1005332.1016791
Ramón Díaz-Uriarte, Sara Alvarez de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, vol. 7, p. 3 (2006). doi:10.1186/1471-2105-7-3
Marc G. Genton. Classes of kernels for machine learning: a statistics perspective. International Conference on Artificial Intelligence and Statistics, vol. 2, pp. 299-312 (2002). doi:10.5555/944790.944815
Nicolai Meinshausen. Quantile Regression Forests. Journal of Machine Learning Research, vol. 7, pp. 983-999 (2006).
Michael Kohler. Universal consistency of local polynomial kernel regression estimates. Annals of the Institute of Statistical Mathematics, vol. 54, pp. 879-899 (2002). doi:10.1023/A:1022427805425
Yi Lin. Support Vector Machines and the Bayes Rule in Classification. Data Mining and Knowledge Discovery, vol. 6, pp. 259-275 (2002). doi:10.1023/A:1015469627679
Jochen Kruppa, Yufeng Liu, Hans-Christian Diener, Theresa Holste, Christian Weimar, Inke R. König, Andreas Ziegler. Probability estimation with machine learning methods for dichotomous and multicategory outcome: Applications. Biometrical Journal, vol. 56, pp. 564-583 (2014). doi:10.1002/BIMJ.201300077