Authors: Andrew Y. Ng, Honglak Lee
DOI:
Keywords:
Abstract: To circumvent spam filters, many spammers attempt to obfuscate their emails by deliberately misspelling words or introducing other errors into the text. For example, "viagra" may be written "vigra", or "mortgage" written "m0rt gage". Even though humans have little difficulty reading obfuscated emails, most content-based filters are unable to recognize these words. In this paper, we present a hidden Markov model for deobfuscating emails. We empirically demonstrate that our model is robust to many types of obfuscation, including misspellings, incorrect segmentations (adding/removing spaces), and substitutions/insertions of non-alphabetic characters.
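
To make the idea concrete, the following is a minimal sketch of Viterbi-style decoding over a toy left-to-right model, in which each lexicon word is a chain of hidden letters and the obfuscated token is the observed character sequence. It is not the model described in the paper: the lexicon, the look-alike table, the four weights, and the function names (`emit_logp`, `viterbi_score`, `deobfuscate`) are all hypothetical choices made for illustration, and the weights are uncalibrated scores rather than a properly normalized distribution.

```python
import math

# Hypothetical toy parameters, chosen for illustration only (not from the paper).
LEXICON = ["viagra", "mortgage", "money"]
LOOKALIKE = {"0": "o", "1": "l", "@": "a", "3": "e", "$": "s"}
P_MATCH, P_SUB, P_INS, P_DEL = 0.90, 0.04, 0.03, 0.03


def emit_logp(hidden, observed):
    """Log-score for the hidden letter `hidden` being written as `observed`."""
    if observed == hidden or LOOKALIKE.get(observed) == hidden:
        return math.log(P_MATCH)
    return math.log(P_SUB)


def viterbi_score(word, observed):
    """Best-path log-score of `observed` under a left-to-right chain over `word`."""
    n, m = len(word), len(observed)
    neg_inf = float("-inf")
    # best[i][j]: best log-score after consuming i hidden letters and j observed chars
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = best[i][j]
            if cur == neg_inf:
                continue
            if i < n and j < m:  # next hidden letter emits one observed character
                step = cur + emit_logp(word[i], observed[j])
                best[i + 1][j + 1] = max(best[i + 1][j + 1], step)
            if i < n:  # hidden letter was dropped by the writer
                best[i + 1][j] = max(best[i + 1][j], cur + math.log(P_DEL))
            if j < m:  # spurious character (e.g. a space) was inserted
                best[i][j + 1] = max(best[i][j + 1], cur + math.log(P_INS))
    return best[n][m]


def deobfuscate(token):
    """Return the lexicon word whose best path explains `token` with the highest score."""
    observed = token.lower()
    return max(LEXICON, key=lambda w: viterbi_score(w, observed))


if __name__ == "__main__":
    for token in ["vigra", "m0rt gage"]:
        print(token, "->", deobfuscate(token))
```

In this sketch a dropped letter ("vigra"), a digit substitution, and an inserted space ("m0rt gage") are each handled by one transition type, which mirrors the three kinds of obfuscation the abstract lists. A realistic system would learn the weights from data and decode over whole messages rather than isolated tokens.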