Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Authors: Dennis Fetterly, Mark Manasse, Marc Najork

DOI: 10.1145/1017074.1017077

Keywords: Spamdexing, Web page, Web crawler, Spambot, Forum spam, Information retrieval, TrustRank, Content farm, Web search engine, World Wide Web, Computer science

Abstract: The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index. We propose that some spam web pages can be identified through statistical analysis: certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam. This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.

References (11)
Hector Garcia-Molina, Junghoo Cho. The Evolution of the Web and Implications for an Incremental Crawler. Very Large Data Bases, pp. 200-209 (2000).
Rajeev Motwani, Terry Winograd, Lawrence Page, Sergey Brin. The PageRank Citation Ranking: Bringing Order to the Web. The Web Conference, vol. 98, pp. 161-172 (1999).
Monika R. Henzinger, Rajeev Motwani, Craig Silverstein. Challenges in web search engines. International ACM SIGIR Conference on Research and Development in Information Retrieval, vol. 36, pp. 11-22 (2002). DOI: 10.1145/792550.792553
Andrei Z. Broder, Marc Najork, Janet L. Wiener. Efficient URL caching for world wide web crawling. Proceedings of the Twelfth International Conference on World Wide Web (WWW '03), pp. 679-689 (2003). DOI: 10.1145/775152.775247
Einat Amitay, David Carmel, Adam Darlow, Ronny Lempel, Aya Soffer. The connectivity sonar: detecting site functionality by structural patterns. ACM Conference on Hypertext, pp. 38-47 (2003). DOI: 10.1145/900051.900060
D. Fetterly, M. Manasse, M. Najork. On the evolution of clusters of near-duplicate Web pages. Latin American Web Congress (LA-WEB), pp. 37-45 (2003). DOI: 10.1109/LAWEB.2003.1250280
K. Bharat, Bay-Wei Chang, M. Henzinger, M. Ruhl. Who links to whom: mining linkage between Web sites. International Conference on Data Mining, pp. 51-58 (2001). DOI: 10.1109/ICDM.2001.989500
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig. Syntactic clustering of the Web. The Web Conference, vol. 29, pp. 1157-1166 (1997). DOI: 10.1016/S0169-7552(97)00031-7
Dennis Fetterly, Mark Manasse, Marc Najork, Janet Wiener. A large-scale study of the evolution of web pages. Proceedings of the Twelfth International Conference on World Wide Web (WWW '03), pp. 669-678 (2003). DOI: 10.1145/775152.775246