Exploiting structural information for semi-structured document categorization

Authors: Andrej Bratko, Bogdan Filipič

DOI: 10.1016/J.IPM.2005.06.003

Abstract: This paper examines several different approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorization of documents consisting of a collection of fields, or of arbitrary tree-structured documents that can be adequately modeled with such a flat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes, specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. A further improvement of this method is achieved by including the flat text model in the final prediction.
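The stacking scheme described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the base learner (a per-field multinomial naive Bayes with Laplace smoothing), the meta classifier (a small logistic regression trained by gradient steps), the use of leave-one-out predictions as meta-level training data, the toy email-like documents, and all function and field names. The "flat" pseudo-field stands in for the flat text model that the abstract says is included in the final prediction.

```python
import math
from collections import Counter

def tokens(doc, field):
    return doc.get(field, "").lower().split()

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

class FieldNB:
    """Multinomial naive Bayes over one field (binary labels 0/1)."""
    def __init__(self, field):
        self.field = field

    def fit(self, docs, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.prior = Counter(labels)
        for doc, y in zip(docs, labels):
            self.counts[y].update(tokens(doc, self.field))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def log_odds(self, doc):
        """log P(y=1 | field) - log P(y=0 | field), Laplace-smoothed."""
        n = sum(self.prior.values())
        score = {}
        for y in (0, 1):
            total = sum(self.counts[y].values()) + len(self.vocab) + 1
            s = math.log((self.prior[y] + 1) / (n + 2))
            for tok in tokens(doc, self.field):
                s += math.log((self.counts[y][tok] + 1) / total)
            score[y] = s
        return score[1] - score[0]

def with_flat(doc):
    # Hypothetical "flat" pseudo-field: all fields concatenated.
    return {**doc, "flat": " ".join(doc.values())}

def base_scores(train_docs, train_labels, docs, fields):
    models = [FieldNB(f).fit(train_docs, train_labels) for f in fields]
    return [[m.log_odds(d) for m in models] for d in docs]

def held_out_scores(docs, labels, fields):
    # Leave-one-out: each document is scored by models that never saw it.
    rows = []
    for i, doc in enumerate(docs):
        rest_docs = docs[:i] + docs[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        rows.append(base_scores(rest_docs, rest_labels, [doc], fields)[0])
    return rows

def fit_meta(X, y, epochs=200, lr=0.1):
    # Logistic regression as the meta classifier; w[0] is the bias.
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            w[0] += lr * (t - p)
            for j, xj in enumerate(x):
                w[j + 1] += lr * (t - p) * xj
    return w

def stacked_predict(train_docs, train_labels, new_docs, fields):
    train = [with_flat(d) for d in train_docs]
    new = [with_flat(d) for d in new_docs]
    all_fields = list(fields) + ["flat"]
    w = fit_meta(held_out_scores(train, train_labels, all_fields), train_labels)
    rows = base_scores(train, train_labels, new, all_fields)
    return [int(sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) > 0.5)
            for x in rows]

# Toy data (label 1 = spam, 0 = legitimate), invented for this sketch.
TRAIN = [
    {"subject": "cheap pills", "body": "buy cheap pills now"},
    {"subject": "win money", "body": "win prize money now"},
    {"subject": "cheap money", "body": "buy now win money"},
    {"subject": "win pills", "body": "cheap pills prize now"},
    {"subject": "project meeting", "body": "please review meeting agenda"},
    {"subject": "project report", "body": "please review project report"},
    {"subject": "meeting agenda", "body": "report agenda for lunch meeting"},
    {"subject": "lunch report", "body": "please review lunch agenda"},
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]
NEW = [
    {"subject": "cheap prize", "body": "buy pills and win money"},
    {"subject": "project agenda", "body": "please review the report"},
]
predictions = stacked_predict(TRAIN, LABELS, NEW, ["subject", "body"])
print(predictions)
```

Training the meta classifier on held-out scores rather than on scores from models that saw the same documents is the usual reason for the extra leave-one-out pass: otherwise the meta level learns from overconfident base predictions. On larger collections, k-fold cross-validation would be the more practical choice.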

References (29)
David H. Wolpert. Stacked generalization. Neural Networks, vol. 5, pp. 241-259 (1992). DOI: 10.1016/S0893-6080(05)80023-1
Susana Eyheramendy, David Madigan, David D. Lewis. On the Naive Bayes Model for Text Categorization. International Conference on Artificial Intelligence and Statistics (2003).
Yiming Yang, Seán Slattery, Rayid Ghani. A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems, vol. 18, pp. 219-241 (2002). DOI: 10.1023/A:1013685612819
George Forman, Ira Cohen. Learning from Little: Comparison of Classifiers Given Little Training. European Conference on Principles of Data Mining and Knowledge Discovery, pp. 161-172 (2004). DOI: 10.1007/978-3-540-30116-5_17
Kamal Nigam, Andrew McCallum. A Comparison of Event Models for Naive Bayes Text Classification. National Conference on Artificial Intelligence, pp. 41-48 (1998).
Christopher Meek, Jake D. Brutlag. Challenges of the Email Domain for Text Classification. International Conference on Machine Learning, pp. 103-110 (2000).
Thorsten Joachims. Making Large-Scale Support Vector Machine Learning Practical. Advances in Kernel Methods, pp. 169-184 (1999).
Seán Slattery, Rayid Ghani, Yiming Yang. Hypertext Categorization using Hyperlink Patterns and Meta Data. International Conference on Machine Learning, pp. 178-185 (2001).
Bryan Klimt, Yiming Yang. The Enron Corpus: A New Dataset for Email Classification Research. European Conference on Machine Learning, pp. 217-226 (2004). DOI: 10.1007/978-3-540-30115-8_22