Automatically generated spam detection based on sentence-level topic information

Authors: Yoshihiko Suhara, Hiroyuki Toda, Shuichi Nishioka, Seiji Susaki

DOI: 10.1145/2487788.2488140

Abstract: Spammers use a wide range of content generation techniques, producing low-quality pages known as spam, to achieve their goals. We argue that this spam must be tackled using content features. In this paper, we propose novel sentence-level diversity features based on a probabilistic topic model. We combine them with other features to build a classifier. Our experiments show that our method outperforms conventional methods.
