Authors: Juan Carlos Gomez, Marie-Francine Moens
DOI: 10.1016/J.CSDA.2011.09.023
Keywords:
Abstract: This paper presents a document classifier based on text content features and its application to email classification. We test the validity of a classifier that uses Principal Component Analysis Document Reconstruction (PCADR), where the idea is that principal component analysis (PCA) can optimally compress only the kind of documents (in our experiments, email classes) that are used to compute the principal components (PCs), and that for other kinds of documents the compression will not perform well using only a few components. Thus, the classifier computes the PCA separately for each document class, and when a new instance arrives to be classified, this new example is projected onto each set of computed PCs corresponding to each class and then reconstructed using the same PCs. The reconstruction error is computed, and the classifier assigns the instance to the class with the smallest error or divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g. spam from ham, or phishing from ham). The experiments show that PCADR is able to obtain very good results with the different validation datasets employed, reaching better performance than the popular Support Vector Machine classifier.
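The classification rule described in the abstract (one PCA per class, project and reconstruct, pick the smallest reconstruction error) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`fit_pca`, `reconstruction_error`, `classify`), the three-feature toy vectors, the mean-centering step, the Euclidean error measure, and the use of power iteration with deflation to find the components (chosen only to keep the example dependency-free) are all assumptions made for illustration.

```python
import math
import random


def fit_pca(rows, k, iters=200, seed=0):
    """Return (mean, components) with up to k principal directions of rows.

    Components are found by power iteration with deflation, an assumption
    made here to avoid external libraries.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, mean)] for row in rows]
    # d x d covariance matrix of the centered data
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                # No variance left: return the components found so far.
                return mean, components
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(d) for j in range(d))
        components.append(v)
        # Deflate: remove the direction just found from the covariance.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mean, components


def reconstruction_error(x, mean, components):
    """Euclidean distance between x and its reconstruction from the PCs."""
    centered = [a - m for a, m in zip(x, mean)]
    recon = [0.0] * len(centered)
    for v in components:
        score = sum(a * b for a, b in zip(centered, v))  # projection
        recon = [r + score * b for r, b in zip(recon, v)]
    return math.sqrt(sum((a - r) ** 2 for a, r in zip(centered, recon)))


def classify(x, models):
    """Assign x to the class whose PCs reconstruct it with the least error."""
    return min(models, key=lambda label: reconstruction_error(x, *models[label]))


# Hypothetical toy feature vectors (e.g. counts of three terms per message).
training = {
    "spam": [[2, 0, 0], [4, 1, 0], [6, 0, 0], [8, 1, 0]],
    "ham": [[0, 0, 2], [0, 1, 4], [0, 0, 6], [0, 1, 8]],
}
# One PCA model per class, each keeping a single component.
models = {label: fit_pca(rows, k=1) for label, rows in training.items()}

print(classify([5, 0, 0], models))
print(classify([0, 1, 5], models))
```

In this toy setup each class varies mostly along its own axis, so a single component reconstructs in-class messages well and out-of-class messages poorly. Real email data would need a proper text feature extraction step and a choice of the number of components, neither of which is specified in the abstract.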