Link-Based Characterization and Detection of Web Spam

Authors: Ricardo A. Baeza-Yates, Luca Becchetti, Carlos Castillo, Stefano Leonardi, Debora Donato

DOI:

Keywords:

Abstract: We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build automatic web spam classifiers. This paper presents the performance of each of these classifiers alone, as well as their combined performance. Using this approach we are able to detect 80.4% of the Web spam in our sample, with only 1.1% false positives.

References (27)
Károly Csalogány, András A. Benczúr, Tamás Sarlós, Máté Uher. SpamRank -- Fully Automatic Link Spam Detection. Adversarial Information Retrieval on the Web, pp. 25-38 (2005).
R. Baeza-Yates, L. Becchetti, C. Castillo, S. Leonardi, D. Donato. Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection (2006).
J. F. Naughton, R. J. Lipton. Estimating the Size of Generalized Transitive Closures. Very Large Data Bases, pp. 165-171 (1989).
Andrew Tomkins, David Gibson, Ravi Kumar. Discovering Large Dense Subgraphs in Massive Graphs. Very Large Data Bases, pp. 721-732 (2005).
R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval (1999).
Sridhar Rajagopalan, Prabhakar Raghavan, Monika R. Henzinger. Computing on Data Streams. External Memory Algorithms, pp. 107-118 (1999).
Hui Zhang, Ashish Goel, Ramesh Govindan, Kahn Mason, Benjamin Van Roy. Making Eigenvector-Based Reputation Systems Robust to Collusion. Workshop on Algorithms and Models for the Web Graph, pp. 92-104 (2004). DOI: 10.1007/978-3-540-30216-2_8
Zoltán Gyöngyi, Hector Garcia-Molina, Jan Pedersen. Combating Web Spam with TrustRank. Very Large Data Bases, pp. 576-587 (2004). DOI: 10.1016/B978-012088469-8.50052-8
Rajeev Motwani, Terry Winograd, Lawrence Page, Sergey Brin. The PageRank Citation Ranking: Bringing Order to the Web. The Web Conference, vol. 98, pp. 161-172 (1999).
Dennis Fetterly, Mark Manasse, Marc Najork. Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages. International Workshop on the Web and Databases, pp. 1-6 (2004). DOI: 10.1145/1017074.1017077