An analysis of the relative hardness of Reuters-21578 subsets: Research Articles

Authors: Franca Debole, Fabrizio Sebastiani

DOI: 10.1002/ASI.V56:6

Abstract: The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have somehow been limited by the fact that different researchers have “carved” different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have been, or will be, tested on these different subsets. © 2005 Wiley Periodicals, Inc.
