Data Analysis in Python: Anonymized Features and Imbalanced Data Target

Author: Emanuel Rocha Woiski

DOI: 10.1007/978-3-319-55852-3_10

Abstract: Remaining useful life (RUL) of an equipment or system is a prognostic value that depends on data gathered from multiple and diverse sources. Moreover, assumed for the sake of the present study as a binary classification problem, the probability of failure of any system is usually very much smaller than the probability of the same system being in normal operating conditions. The imbalanced outcome (largely more ‘normal’ than ‘failure’ states) at any time results from the combined values of a large set of features, some related to one another, some redundant, and most quite noisy. Previewing the development requirements of a robust framework, it is advocated that, by using Python libraries, those difficulties can be dealt with. In this Chapter, DOROTHEA, a dataset from the UCI library with a hundred thousand sparse anonymized (i.e. unrecognizable labels) features and imbalanced classes, is analyzed. For that, in an ipython (jupyter) notebook, pandas is used to import the data set, then exploratory analysis and feature engineering are performed, and several estimators (classifiers) obtained from scikit-learn are applied. It is demonstrated that global accuracy does not work for this case, since the minority class is easily overlooked by the algorithms. Therefore, receiver operating characteristic (ROC) and Precision-Recall curves and their respective areas under the curve (AUCs) are evaluated for each estimator or ensemble, as well as some simple statistics, and three hybrid methods, that is, mixes of filter, embedded and wrapper feature selection strategies, were compared.
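
The point about global accuracy can be illustrated with a minimal sketch. It does not use DOROTHEA or reproduce the chapter's notebook. It assumes a synthetic data set from scikit-learn's `make_classification` with roughly 10% minority samples, a majority-class baseline, and one hypothetical pipeline (a univariate filter followed by a random forest) chosen only for illustration. The sample sizes, the number of kept features and the helper name `evaluate` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in (assumption, NOT the DOROTHEA data): many noisy and
# redundant features, with about 10% of samples in the minority class.
X, y = make_classification(
    n_samples=2000, n_features=500, n_informative=10, n_redundant=20,
    weights=[0.9, 0.1], class_sep=1.5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def evaluate(model):
    """Fit the model and return accuracy, ROC AUC and PR AUC on the test split."""
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]  # estimated P(minority class)
    return {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "roc_auc": roc_auc_score(y_test, scores),
        # Average precision summarizes the Precision-Recall curve.
        "pr_auc": average_precision_score(y_test, scores),
    }


# Baseline that always predicts the majority class: its accuracy is high,
# but its constant scores give a ROC AUC of 0.5 (no ranking ability).
baseline = evaluate(DummyClassifier(strategy="most_frequent"))

# Hypothetical pipeline: univariate filter selection, then a random forest
# with class weights to counter the imbalance.
forest = evaluate(
    make_pipeline(
        SelectKBest(f_classif, k=50),
        RandomForestClassifier(
            n_estimators=200, class_weight="balanced", random_state=0
        ),
    )
)

for name, metrics in (("baseline", baseline), ("filter + forest", forest)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The baseline never predicts the minority class yet still scores a high accuracy, which is why the threshold-free ROC and Precision-Recall summaries are the more informative comparison here.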
