Internet traffic classification demystified: on the sources of the discriminative power

Authors: Yeon-sup Lim, Hyun-chul Kim, Jiwoong Jeong, Chong-kwon Kim, Ted "Taekyoung" Kwon

DOI: 10.1145/1921168.1921180

Keywords: Discriminative model, Data mining, Artificial intelligence, Discretization, Network packet, Machine learning, Entropy (information theory), Traffic classification, Minimum description length, Computer science, The Internet, Statistical classification

Abstract: Recent research on Internet traffic classification has yielded a number of data mining techniques for distinguishing types of traffic, but no systematic analysis of "why" some algorithms achieve high accuracies. In pursuit of empirically grounded answers to the "why" question, which is critical in understanding and establishing a scientific ground for traffic classification research, this paper reveals three sources of the discriminative power in classifying Internet application traffic: (i) ports, (ii) the sizes of the first one-two (for UDP flows) or four-five (for TCP flows) packets, and (iii) discretization of those features. We find that C4.5 performs the best under any circumstances, as well as the reason why: the algorithm discretizes input features during classification operations. We also find that entropy-based Minimum Description Length discretization of ports and packet size features substantially improves the classification accuracy of every machine learning algorithm tested (by as much as 59.8%!) and makes all of them achieve >93% accuracy on average without any algorithm-specific tuning processes. Our results indicate that dealing with the ports and packet size features as discrete nominal intervals, not as continuous numbers, is the essential basis for accurate traffic classification (i.e., the features should be discretized first), regardless of the classification algorithm in use.
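The discretization step named in the abstract can be illustrated with a minimal Python sketch of entropy-based MDL discretization, here using the stopping criterion commonly attributed to Fayyad and Irani. This is not the authors' implementation: the function names (`entropy`, `best_cut`, `mdl_accepts`, `mdl_cut_points`, `discretize`) are hypothetical, and the sketch only assumes a single numeric feature (for example, the size of a flow's first packet) paired with application labels.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(pairs):
    """Find the boundary with the highest information gain.

    `pairs` is a list of (value, label) sorted by value. Returns
    (gain, index) or None when all values are identical.
    """
    labels = [y for _, y in pairs]
    n = len(pairs)
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a cut is only possible between distinct values
        left, right = labels[:i], labels[i:]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, i)
    return best


def mdl_accepts(pairs, gain, i):
    """MDL stopping rule: accept the cut only if its gain pays for its cost."""
    labels = [y for _, y in pairs]
    left, right = labels[:i], labels[i:]
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (math.log2(n - 1) + delta) / n


def mdl_cut_points(values, labels):
    """Recursively split a numeric feature into intervals; return the cut points."""
    cuts = []

    def recurse(segment):
        if len(segment) < 2:
            return
        found = best_cut(segment)
        if found is None:
            return
        gain, i = found
        if not mdl_accepts(segment, gain, i):
            return
        # Place the cut midway between the two neighbouring values.
        cuts.append((segment[i - 1][0] + segment[i][0]) / 2)
        recurse(segment[:i])
        recurse(segment[i:])

    recurse(sorted(zip(values, labels)))
    return sorted(cuts)


def discretize(value, cuts):
    """Map a numeric value to the index of the interval it falls into."""
    return sum(value > c for c in cuts)
```

As a usage illustration on synthetic data (not data from the paper), calling `mdl_cut_points(list(range(40, 50)) + list(range(300, 310)), ["dns"] * 10 + ["p2p"] * 10)` yields a single cut at 174.5, after which `discretize` turns each packet size into a nominal interval index that any classifier can consume. When the labels carry no information about the feature, the MDL rule rejects every candidate cut and the feature collapses into one interval.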
