Evaluation of the Document Classification Approaches

Authors: Michal Hrala, Pavel Král

DOI: 10.1007/978-3-319-00969-8_86

Keywords: Czech, Artificial intelligence, Feature selection, Naive Bayes classifier, Computer science, Feature vector, Support vector machine, Document classification, Natural language processing, Class (biology), Principle of maximum entropy

Abstract: This paper deals with one-class automatic document classification. Five feature selection methods and three classifiers are evaluated on a Czech corpus in order to build an efficient classification system. Lemmatization and POS tagging are used for a precise representation of the documents. We demonstrated that tag filtering is very important, while lemmatization plays a marginal role in classification. We also showed that Maximum Entropy and Support Vector Machines are robust to vector size and significantly outperform the Naive Bayes classifier in terms of accuracy. The best accuracy is about 90%, which is sufficient for an application at the News Agency, our commercial partner.
