Class noise vs. attribute noise: a quantitative study of their impacts

Authors: Xingquan Zhu, Xindong Wu

DOI: 10.1007/S10462-004-0751-8

Keywords: Data quality, Classifier (UML), Data mining, Learning abilities, Artificial intelligence, Machine learning, Preprocessor, Noise measurement, Computer science

Abstract: Real-world data is never perfect and can often suffer from corruptions (noise) that may impact interpretations of the data, models created from the data, and decisions made based on the data. Noise can reduce system performance in terms of classification accuracy, the time needed to build a classifier, and the size of the classifier. Accordingly, most existing learning algorithms have integrated various approaches to enhance their learning abilities in noisy environments, but the existence of noise can still introduce serious negative impacts. A more reasonable solution might be to employ some preprocessing mechanisms to handle noisy instances before a learner is formed. Unfortunately, little research has been conducted to systematically explore the impact of noise, especially from the noise handling point of view. This has made various noise processing techniques less significant, specifically when dealing with noise that is introduced in attributes. In this paper, we present a systematic evaluation of the effect of noise in machine learning. Instead of taking any unified theory of noise to evaluate its impacts, we differentiate noise into two categories, class noise and attribute noise, and analyze their impacts on system performance separately. Because class noise has been widely addressed in existing research efforts, we concentrate on attribute noise. We investigate the relationship between attribute noise and classification accuracy, the impact of noise at different attributes, and possible solutions for handling attribute noise. Our conclusions can be used to guide interested readers in enhancing data quality by designing various noise handling mechanisms.
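
The distinction the abstract draws between class noise and attribute noise can be illustrated with a small sketch. The Python below is a minimal illustration, not the authors' experimental procedure: the function names `add_class_noise` and `add_attribute_noise`, the `rate` parameter, and the toy weather data are all assumptions made for this example. It corrupts the labels in one case and a single attribute column in the other, each with a given probability.

```python
import random


def add_class_noise(labels, classes, rate, rng):
    """Return a copy of `labels` where each label is flipped to a
    different class with probability `rate` (hypothetical helper)."""
    noisy = []
    for y in labels:
        if rng.random() < rate:
            # Pick any class other than the current one.
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy


def add_attribute_noise(rows, column, values, rate, rng):
    """Return a copy of `rows` where, with probability `rate`, the value
    in `column` is replaced by a value drawn uniformly from `values`
    (the draw may coincide with the original value)."""
    noisy = []
    for row in rows:
        row = list(row)  # copy so the input is not modified
        if rng.random() < rate:
            row[column] = rng.choice(values)
        noisy.append(row)
    return noisy


# Toy data, invented for this example only.
rng = random.Random(0)
rows = [["sunny", "hot"], ["rainy", "mild"], ["sunny", "mild"], ["rainy", "hot"]]
labels = ["no", "yes", "yes", "no"]

noisy_labels = add_class_noise(labels, ["yes", "no"], 0.5, rng)
noisy_rows = add_attribute_noise(rows, 0, ["sunny", "rainy", "overcast"], 0.5, rng)
```

Training the same learner on `(rows, noisy_labels)` and on `(noisy_rows, labels)` and comparing the resulting accuracy is one simple way to look at the two kinds of noise separately, in the spirit of the analysis the abstract describes.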
