Authors: Jochen Kruppa, Yufeng Liu, Hans-Christian Diener, Theresa Holste, Christian Weimar
Keywords:
Abstract: Machine learning methods are applied to three different large datasets, all dealing with probability estimation problems for dichotomous or multicategory data. Specifically, we investigate k-nearest neighbors, bagged nearest neighbors, random forests for probability estimation trees, and support vector machines with the kernels of Bessel, linear, Laplacian, and radial basis type. Comparisons are made with logistic regression. The dataset from the German Stroke Study Collaboration, with dichotomous and three-category outcome variables, allows, in particular, for temporal and external validation. The other two datasets are freely available from the UCI repository and provide dichotomous outcome variables. One of them, the Cleveland Clinic Foundation Heart Disease dataset, uses data from one clinic for training and data from other clinics for external validation, while the other, the thyroid disease dataset, allows for temporal validation by separating the data into training and test data by date of recruitment into the study. For dichotomous outcome variables, we use receiver operating characteristics, areas under the curve values with bootstrapped 95% confidence intervals, and Hosmer-Lemeshow-type figures as comparison criteria. For dichotomous and multicategory outcomes, we calculated bootstrap Brier scores with 95% confidence intervals and also compared them through bootstrapping. In a supplement, we provide R code for performing the analyses and for random forest analyses in Random Jungle, version 2.1.0. The learning machines show promising performance over all constructed models. They are simple to apply and serve as an alternative approach to logistic or multinomial logistic regression analysis.
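As an illustration of the Brier score criterion named in the abstract, the following is a minimal Python sketch of a multicategory Brier score with a percentile bootstrap 95% confidence interval. It is not the authors' R code. The function names `brier_score` and `bootstrap_ci`, the percentile method, the number of resamples, and the toy predictions are all assumptions made for this example. The score here sums squared errors over all categories, so for a dichotomous outcome it is twice the single-probability form that is also in common use.

```python
import random


def brier_score(probs, labels):
    """Mean squared distance between predicted probability vectors
    and one-hot encoded outcomes (multicategory Brier score)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2
                     for k, p_k in enumerate(p))
    return total / len(labels)


def bootstrap_ci(probs, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Brier score
    (assumed method; the paper's exact procedure may differ)."""
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(n_boot):
        # Resample observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(brier_score([probs[i] for i in idx],
                                  [labels[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy, made-up predictions for a three-category outcome.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
labels = [0, 1, 2, 1]

score = brier_score(probs, labels)
lo, hi = bootstrap_ci(probs, labels)
print(round(score, 4), (round(lo, 4), round(hi, 4)))
```

Lower scores indicate better probability estimates. To compare two models, one could bootstrap the difference of their Brier scores on the same resampled indices.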