Missing data imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values

Authors: Pedro J. García-Laencina, Pedro Henriques Abreu, Miguel Henriques Abreu, Noémia Afonso

DOI: 10.1016/J.COMPBIOMED.2015.02.006

Abstract: Breast cancer is the most frequently diagnosed cancer in women. Using historical patient information stored in clinical datasets, data mining and machine learning approaches can be applied to predict the survival of breast cancer patients. A common drawback is the absence of information, i.e., missing data, in certain clinical trials. However, most standard prediction methods are not able to handle incomplete samples, so missing data imputation is a widely applied approach for solving this inconvenience. Therefore, taking into account the characteristics of each breast cancer dataset, a detailed analysis is required to determine the most appropriate imputation and prediction methods in each clinical environment. This research work analyzes a real breast cancer dataset from the Portuguese Institute of Oncology of Porto with a high percentage of unknown categorical information (most clinical data of the patients are incomplete), which is a challenge in terms of complexity. Four scenarios are evaluated: (I) 5-year survival prediction without imputation, and 5-year survival prediction from the cleaned dataset with (II) Mode imputation, (III) Expectation-Maximization imputation and (IV) K-Nearest Neighbors imputation. Prediction models for breast cancer survivability are constructed using four different methods: K-Nearest Neighbors, Classification Trees, Logistic Regression and Support Vector Machines. Experiments are performed in a nested ten-fold cross-validation procedure and, according to the obtained results, the best results are provided by the K-Nearest Neighbors algorithm: more than 81% accuracy and more than 0.78 area under the Receiver Operator Characteristic curve, which constitutes very good results in this complex scenario.

Highlights: A 5-year survival prediction model is performed in the breast cancer context. The dataset presents a high complexity due to its missing data ratio. Several representative imputation and decision methods are analyzed. The obtained results are interesting and accurate for this dataset.
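As a rough illustration of two of the imputation scenarios named in the abstract, Mode imputation (II) and K-Nearest Neighbors imputation (IV), the following is a minimal sketch for purely categorical data. It is not the authors' implementation: the `None` missing-value marker, the Hamming-style distance, the choice of `k`, the function names and the toy records are all assumptions made for the example.

```python
import math
from collections import Counter

MISSING = None  # assumed marker for an unknown categorical value


def mode_impute(rows):
    """Replace each missing entry with the most frequent observed value of its column."""
    modes = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        # most_common breaks ties in favour of the value seen first.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else MISSING)
    return [[modes[j] if v is MISSING else v for j, v in enumerate(r)] for r in rows]


def distance(a, b):
    """Fraction of mismatches over features observed in both rows (assumed metric)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return math.inf
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(rows, k=3):
    """Fill each missing entry with the majority value among the k nearest donor rows."""
    fallback = mode_impute(rows)  # used only when no comparable donor exists
    result = []
    for i, row in enumerate(rows):
        filled = list(row)
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            # Donors are other rows in which feature j is observed.
            donors = sorted(
                (distance(row, other), idx)
                for idx, other in enumerate(rows)
                if idx != i and other[j] is not MISSING
            )
            votes = [rows[idx][j] for d, idx in donors[:k] if d != math.inf]
            filled[j] = Counter(votes).most_common(1)[0][0] if votes else fallback[i][j]
        result.append(filled)
    return result


# Hypothetical toy records (type, stage, receptor status); not data from the paper.
toy = [
    ["ductal", "II", "positive"],
    ["ductal", None, "positive"],
    ["lobular", "III", "negative"],
    ["lobular", "III", None],
    ["ductal", "II", None],
]

by_mode = mode_impute(toy)
by_knn = knn_impute(toy, k=1)
print(by_mode[3], by_knn[3])
```

On these toy records the two strategies disagree for the fourth row: mode imputation assigns the globally most frequent receptor status, while the nearest-neighbor variant copies it from the most similar complete record. Differences of this kind are what a comparison of imputation scenarios is meant to expose.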

References (47)
Pedro Henriques Abreu, Hugo Amaro, Daniel Castro Silva, Penousal Machado, Miguel Henriques Abreu, Noémia Afonso, António Dourado. Overall Survival Prediction for Women Breast Cancer Using Ensemble Methods and Incomplete Clinical Data. Springer, Cham, pp. 1366-1369 (2014). DOI: 10.1007/978-3-319-00846-2_338
Pedro Henriques Abreu, Hugo Amaro, Daniel Castro Silva, Penousal Machado, Miguel Henriques Abreu. Personalizing Breast Cancer Patients with Heterogeneous Data. Springer, Cham, pp. 39-42 (2014). DOI: 10.1007/978-3-319-03005-0_11
Steven L. Salzberg, Alberto Segre. Programs for Machine Learning (1994).
Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc. (2006).
Bernhard Schölkopf, Alexander J. Smola. Learning with Kernels. The MIT Press, pp. 626- (2018). DOI: 10.7551/MITPRESS/4175.001.0001
J.A.K. Suykens, J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters, vol. 9, pp. 293-300 (1999). DOI: 10.1023/A:1018628609742
Joseph A. Cruz, David S. Wishart. Applications of Machine Learning in Cancer Prediction and Prognosis. Cancer Informatics, vol. 2, pp. 59-77 (2006). DOI: 10.1177/117693510600200030
Christopher M. Bishop. Pattern Recognition and Machine Learning (2006).