Authors: Pedro J. García-Laencina, Pedro Henriques Abreu, Miguel Henriques Abreu, Noémia Afonso
DOI: 10.1016/J.COMPBIOMED.2015.02.006
Keywords:
Abstract: Breast cancer is the most frequently diagnosed cancer in women. Using historical patient information stored in clinical datasets, data mining and machine learning approaches can be applied to predict the survival of breast cancer patients. A common drawback is the absence of information, i.e., missing data, in certain clinical trials. However, most standard prediction methods are not able to handle incomplete samples, so missing data imputation is a widely applied approach for solving this problem. Therefore, taking into account the characteristics of each breast cancer dataset, a detailed analysis is required to determine the most appropriate imputation and prediction methods in each clinical environment. This research work analyzes a real breast cancer dataset from the Portuguese Institute of Oncology of Porto with a high percentage of unknown categorical information (most clinical data of the patients are incomplete), which is a challenge in terms of complexity. Four scenarios are evaluated: (I) 5-year survival prediction without imputation, and 5-year survival prediction from the cleaned dataset with (II) Mode imputation, (III) Expectation-Maximization imputation and (IV) K-Nearest Neighbors imputation. Prediction models for breast cancer survivability are constructed using four different methods: K-Nearest Neighbors, Classification Trees, Logistic Regression and Support Vector Machines. Experiments are performed in a nested ten-fold cross-validation procedure and, according to the obtained results, the best results are provided by the K-Nearest Neighbors algorithm: more than 81% accuracy and more than 0.78 area under the Receiver Operating Characteristic curve, which constitutes very good results in this complex scenario.
Highlights: A 5-year survival prediction model is constructed in a breast cancer context. The complexity of the dataset is due to its high missing data ratio. Several representative imputation and decision models are analyzed. Obtained results are very interesting and accurate for this complex dataset.
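As an illustration of scenario (II), the following is a minimal sketch of mode imputation for categorical records, not the authors' implementation. The function names `fit_modes` and `impute`, the use of `None` to encode unknown values, and the toy records are all assumptions made for this example. The modes are learned from one set of rows and then applied, which is how the step would typically be used inside a cross-validation procedure (learn on the training folds, apply to the held-out fold).

```python
from collections import Counter

MISSING = None  # assumption: unknown categorical values are encoded as None


def fit_modes(rows):
    """Return the most frequent observed value of each column."""
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not MISSING]
        # A column with no observed values keeps MISSING as its fill value.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else MISSING)
    return modes


def impute(rows, modes):
    """Replace every missing entry with the mode of its column."""
    return [
        [modes[j] if value is MISSING else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy records; the columns are illustrative, not the study's variables.
train = [
    ["positive", "II", MISSING],
    ["negative", MISSING, "yes"],
    ["positive", "II", "yes"],
    [MISSING, "III", "no"],
]

modes = fit_modes(train)
completed = impute(train, modes)
```

Learning the modes separately from applying them keeps information from held-out samples out of the imputation step, which matters when accuracy is estimated with nested cross-validation.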