Authors: Dennis Fetterly, Mark Manasse, Marc Najork
Keywords: Spamdexing, Web page, Web crawler, Spambot, Forum spam, Information retrieval, TrustRank, Content farm, Web search engine, World Wide Web, Computer science
Abstract: The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index. We propose that some spam web pages can be identified through statistical analysis: certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam. This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
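
The idea of flagging pages whose properties are outliers in a distribution can be illustrated with a minimal sketch. This is not the authors' procedure: the robust rule used here (modified z-score based on the median absolute deviation), the threshold of 3.5, the function name `flag_outliers`, and the sample host names and out-degree values are all assumptions chosen for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the keys whose value lies far from the bulk of the distribution.

    `values` maps a page or host name to one numeric property, for example
    its out-degree. The modified z-score rule is an assumed stand-in, not
    the test used in the paper.
    """
    data = list(values.values())
    med = median(data)
    # Median absolute deviation: a spread estimate that extreme values barely move.
    mad = median([abs(v - med) for v in data])
    if mad == 0:
        # At least half of the values equal the median; this sketch then flags nothing.
        return []
    return [
        key
        for key, v in values.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical out-degrees for six hosts; the last one is far in the tail.
out_degree = {
    "a.example": 12,
    "b.example": 9,
    "c.example": 15,
    "d.example": 11,
    "e.example": 10,
    "f.example": 4800,
}
print(flag_outliers(out_degree))  # ['f.example']
```

A flagged page is only a candidate: the abstract claims that outliers are highly likely to be spam, not that every outlier is, so a manual check or a second property would normally follow.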