Learning to crawl: Comparing classification schemes

Authors: Gautam Pant, Padmini Srinivasan

DOI: 10.1145/1095872.1095875

Abstract: Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler, thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with Support Vector Machine or Neural Network. Further, the weak performance of Naive Bayes can be partly explained by the extreme skewness of the posterior probabilities generated by it. We also observe that, despite similar performances, different topical crawlers cover subspaces on the Web with low overlap.
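
The best-first search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `best_first_crawl`, the toy graph `TOY_WEB`, and the scores in `TOY_SCORES` are hypothetical stand-ins for a real fetcher and a trained classifier. The sketch also assumes that each unvisited link inherits the classifier score of the page it was found on, and it runs a single sequential frontier rather than the parallel search the abstract mentions.

```python
import heapq

# Hypothetical toy data standing in for the Web graph and a trained classifier.
TOY_WEB = {"seed": ["a", "b"], "a": ["a1"], "b": ["b1"], "a1": [], "b1": []}
TOY_SCORES = {"seed": 0.5, "a": 0.9, "b": 0.1, "a1": 0.8, "b1": 0.2}


def best_first_crawl(seeds, fetch_links, score, max_pages):
    """Visit up to max_pages pages, expanding the highest-priority URL first.

    fetch_links(url) returns the out-links of a page; score(url) returns the
    classifier's relevance estimate for a fetched page (both are assumptions).
    """
    # heapq is a min-heap, so priorities are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        page_score = score(url)  # relevance of the page just fetched
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                # Assumption: a link's priority is the score of its parent page.
                heapq.heappush(frontier, (-page_score, link))
    return visited


if __name__ == "__main__":
    order = best_first_crawl(
        ["seed"], TOY_WEB.__getitem__, TOY_SCORES.__getitem__, max_pages=4
    )
    print(order)
```

In this toy run the link found on the high-scoring page `a` is visited before `b`, which is the biasing effect the abstract attributes to the classifier. Swapping in a different `score` function is how one would compare classification schemes under this sketch.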
