Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Authors: Dennis Fetterly, Mark Manasse, Marc Najork

Keywords: Spamdexing, Web page, Web crawler, Spambot, Forum spam, Information retrieval, TrustRank, Content farm, Web search engine, World Wide Web, Computer science

Abstract: The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index. We propose that some spam web pages can be identified through statistical analysis: certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam. This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
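The general idea of treating statistical outliers in a page property as spam candidates can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not say which outlier test was used, so the example below assumes a generic robust rule (the modified z-score based on the median absolute deviation), and the function name `flag_outliers`, the `threshold` default, and the sample out-degree values are all hypothetical.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values that lie far from the bulk of the data.

    Assumption: a modified z-score based on the median absolute
    deviation (MAD) stands in for whatever test the paper used.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; flag everything else.
        return [i for i, v in enumerate(values) if v != med]
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical per-page out-degrees; the last page links out far more
# than the rest and would be a candidate for manual inspection.
out_degrees = [10, 12, 9, 11, 13, 10, 12, 950]
suspects = flag_outliers(out_degrees)
```

A flagged page is only a candidate: per the abstract, outliers are highly likely, not certain, to be spam, so a rule like this would feed a review step or a classifier rather than remove pages directly.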

