PCA document reconstruction for email classification

Authors: Juan Carlos Gomez, Marie-Francine Moens

DOI: 10.1016/J.CSDA.2011.09.023

Abstract: This paper presents a document classifier based on text content features and its application to email classification. We test the validity of a classifier which uses Principal Component Analysis Document Reconstruction (PCADR), where the idea is that principal component analysis (PCA) can compress optimally only the kind of documents (in our experiments, email classes) that are used to compute the principal components (PCs), and that for other kinds of documents the compression will not perform well using only a few components. Thus, the classifier computes the PCA separately for each document class, and when a new instance arrives to be classified, this new example is projected onto each set of computed PCs corresponding to each class, and then is reconstructed using the same PCs. The reconstruction error is computed, and the classifier assigns the instance to the class with the smallest error or divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g. spam from ham, or phishing from ham). The experiments show that PCADR is able to obtain very good results with the different validation datasets employed, reaching a better performance than the popular Support Vector Machine classifier.
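
The classification rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `PCADRSketch`, the NumPy SVD used to obtain the PCs, the Euclidean norm used as the reconstruction error, and the toy four-feature vectors are all assumptions made for illustration. The abstract does not specify the feature weighting, the number of components, or the exact error measure.

```python
import numpy as np


class PCADRSketch:
    """Per-class PCA reconstruction classifier (hypothetical illustration)."""

    def __init__(self, n_components=2):
        self.n_components = n_components
        self.models = {}  # class label -> (class mean, principal components)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        for label in np.unique(y):
            Xc = X[y == label]
            mean = Xc.mean(axis=0)
            # Rows of vt are the principal directions of the centered class data.
            _, _, vt = np.linalg.svd(Xc - mean, full_matrices=False)
            self.models[label] = (mean, vt[: self.n_components])
        return self

    def reconstruction_errors(self, x):
        x = np.asarray(x, dtype=float)
        errors = {}
        for label, (mean, pcs) in self.models.items():
            coords = pcs @ (x - mean)      # project onto this class's PCs
            x_hat = mean + pcs.T @ coords  # reconstruct from the same PCs
            # Assumption: Euclidean distance as the reconstruction error.
            errors[label] = float(np.linalg.norm(x - x_hat))
        return errors

    def predict(self, x):
        errors = self.reconstruction_errors(x)
        # Assign the class whose PCs reconstruct the instance best.
        return min(errors, key=errors.get)


# Toy data (assumed): "spam" varies only in features 0-1, "ham" only in 2-3.
rng = np.random.default_rng(0)
spam = np.hstack([rng.random((20, 2)), np.zeros((20, 2))])
ham = np.hstack([np.zeros((20, 2)), rng.random((20, 2))])
X = np.vstack([spam, ham])
y = np.array(["spam"] * 20 + ["ham"] * 20)

clf = PCADRSketch(n_components=2).fit(X, y)
print(clf.predict([0.9, 0.4, 0.0, 0.0]))
print(clf.reconstruction_errors([0.9, 0.4, 0.0, 0.0]))
```

In this toy setup each class occupies its own two-dimensional subspace, so an instance is reconstructed almost exactly by the PCs of its own class and poorly by the PCs of the other class. Real email features would overlap far more, and the number of retained components would need tuning.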
