Similarity measures for categorical data: A comparative evaluation

Authors: Shyam Boriah, Varun Chandola, Vipin Kumar

DOI: 10.1137/1.9781611972788.22

Abstract: Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates others for all types of problems, some measures are able to have consistently high performance.
