EPLogCleaner: Improving Data Quality of Enterprise Proxy Logs for Efficient Web Usage Mining

Authors: Hongzhou Sha, Tingwen Liu, Peng Qin, Yong Sun, Qingyun Liu

DOI: 10.1016/J.PROCS.2013.05.104

Abstract: Data cleaning is an important step performed in the preprocessing stage of web usage mining, and is widely used in many data mining systems. Despite many efforts on data cleaning for web server logs, it is still an open question for enterprise proxy logs. With unlimited accesses to websites, enterprise proxy logs trace web requests from multiple clients to multiple web servers, which makes them quite different from web server logs in both location and content. Therefore, many irrelevant items, such as software updating requests, cannot be filtered out by traditional data cleaning methods. In this paper, we propose the first method, named EPLogCleaner, that can filter out plenty of irrelevant items based on the common prefix of their URLs. We evaluate EPLogCleaner with a real network traffic trace captured from one enterprise proxy. Experimental results show that EPLogCleaner can improve the data quality of enterprise proxy logs by further filtering out more than 30% of URL requests compared with traditional data cleaning methods.
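
The abstract only states that irrelevant items are filtered based on the common prefix of their URLs; it does not give the algorithm. The following Python sketch illustrates that general idea under stated assumptions and is not the authors' implementation. The function names `common_url_prefix` and `filter_requests`, the plain character-wise prefix comparison, and the example URLs are all hypothetical.

```python
from os.path import commonprefix


def common_url_prefix(urls):
    """Return the longest character-wise prefix shared by all given URLs.

    Assumption: a group of known irrelevant requests (for example,
    software update checks) is summarized by its shared URL prefix.
    """
    return commonprefix(list(urls))


def filter_requests(urls, irrelevant_prefixes):
    """Keep only the URLs that do not start with any irrelevant prefix.

    Empty prefixes are ignored, since every URL starts with "".
    """
    prefixes = tuple(p for p in irrelevant_prefixes if p)
    return [u for u in urls if not u.startswith(prefixes)]


# Hypothetical example data, not taken from the paper.
update_requests = [
    "http://update.example.com/v1/patch?id=1",
    "http://update.example.com/v1/patch?id=2",
]
proxy_log = [
    "http://update.example.com/v1/patch?id=3",
    "http://news.example.org/index.html",
]

prefix = common_url_prefix(update_requests)
cleaned = filter_requests(proxy_log, [prefix])
```

Here `cleaned` keeps only the news request, because the third update request shares the prefix derived from the first two. A real cleaner would also need a rule for deciding which prefixes count as irrelevant, which the abstract does not describe.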

References (10)
L. Masinter, T. Berners-Lee, M. McCahill. Uniform Resource Locators (URL). RFC 1738, pp. 1-25 (1994).
Tasawar Hussain, Sohail Asghar, Nayyer Masood. Web usage mining: A survey on preprocessing of web log file. International Conference on Information and Emerging Technologies, pp. 1-6 (2010). DOI: 10.1109/ICIET.2010.5625730
Yu Zhang, Li Dai, Zhi-Jie Zhou. A New Perspective of Web Usage Mining: Using Enterprise Proxy Log. Web Information Systems Modeling, vol. 1, pp. 38-42 (2010). DOI: 10.1109/WISM.2010.20
M. A. Torsello, A. M. Fanelli, G. Castellano. LODAP: a log data preprocessor for mining web browsing patterns. AIKED'07: Proceedings of the 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, Volume 6, pp. 12-17 (2007).
D. Tanasa, B. Trousse. Advanced data preprocessing for intersites Web usage mining. IEEE Intelligent Systems, vol. 19, pp. 59-65 (2004). DOI: 10.1109/MIS.2004.1274912
Brijendra Singh, Hemant Kumar Singh. Web Data Mining research: A survey. International Conference on Computational Intelligence and Computing Research, pp. 1-10 (2010). DOI: 10.1109/ICCIC.2010.5705856