Author: Gordon V. Cormack
DOI:
Keywords: Information extraction, Computer science, Communication source, Email spam, Information retrieval, Forum spam, Focus (computing), Filter (signal processing), Spambot, Component (UML)
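Abstract: Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet current spam filters are remarkably effective: more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam? We survey current and proposed spam filtering techniques, with particular emphasis on how well they work. Our primary focus is spam filtering in email. Similarities and differences with spam filtering in other communication and storage media, such as instant messaging and the Web, are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements, and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results, and their limitations. In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.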