Email Spam Filtering: A Systematic Review

Author: Gordon V. Cormack

DOI:

Keywords: Information extraction, Computer science, Communication source, Email spam, Information retrieval, Forum spam, Focus (computing), Filter (signal processing), Spambot, Component (UML)

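Abstract: Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, make them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet current spam filters are remarkably effective, more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam? We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; similarities and differences with spam filtering in other communication and storage media, such as instant messaging and the Web, are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements, and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results, and their limitations. In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.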

References (139)
William Arnold, Ted Markowitz, Richard Segal. Fast Uncertainty Sampling for Labeling Large E-mail Corpora. Conference on Email and Anti-Spam (2006).
Jeffrey O. Kephart, Jason Crawford, Barry Leiba, Richard Segal. SpamGuru: An Enterprise Anti-Spam Filtering System. Conference on Email and Anti-Spam (2004).
Gabriel Wachman, David Sculley. Relaxed Online SVMs in the TREC Spam Filtering Track. Text Retrieval Conference (2007).
Ion Androutsopoulos, Eirinaios Michelakis, Georgios Paliouras. Learning to Filter Unsolicited Commercial E-Mail (2006).
Wilfried N. Gansterer, Andreas G. K. Janecek, Robert Neumayer. Spam Filtering Based on Latent Semantic Indexing. Springer, London, pp. 165-183 (2008). DOI: 10.1007/978-1-84800-046-9_9
William S. Yerazunis. Seven Hypothesis about Spam Filtering. Text Retrieval Conference (2006).
Shyhtsun Felix Wu, Gregory L. Wittel. On Attacking Statistical Spam Filters. Conference on Email and Anti-Spam (2004).