Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization

Authors: Luigi Galavotti, Fabrizio Sebastiani, Maria Simi

DOI: 10.1007/3-540-45268-0_6

Abstract: We tackle two different problems of text categorization (TC), namely feature selection and classifier induction. Feature selection (FS) refers to the activity of selecting, from the set of r distinct features (i.e. words) occurring in the collection, the subset of r′ ≪ r features that are most useful for compactly representing the meaning of the documents. We propose a novel FS technique, based on a simplified variant of the χ² statistics. Classifier induction refers instead to the problem of automatically building a text classifier by learning from a set of documents pre-classified under the categories of interest. We propose a novel variant, based on the exploitation of negative evidence, of the well-known k-NN method. We report the results of systematic experimentation of these two methods performed on the standard REUTERS-21578 benchmark.
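
As an illustration of the feature selection step described in the abstract, the following is a minimal sketch that ranks words by the standard χ² statistic for one category and keeps the top r′. It is not the simplified variant proposed in the paper, which the abstract does not specify. The function names `chi_square` and `select_features`, the tokenized-document input format, and the alphabetical tie-breaking are assumptions made for this example.

```python
from collections import Counter


def chi_square(n_tc, n_t, n_c, n):
    """Standard chi-square score from document counts (not the paper's variant).

    n_tc: documents in the category that contain the term
    n_t:  documents that contain the term
    n_c:  documents in the category
    n:    total number of documents
    """
    a = n_tc                    # in category, term present
    b = n_t - n_tc              # not in category, term present
    c = n_c - n_tc              # in category, term absent
    d = n - n_t - n_c + n_tc    # not in category, term absent
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, labels, category, r_prime):
    """Return the r_prime words with the highest chi-square score for category.

    docs is a list of token lists; labels holds one category label per document.
    Ties are broken alphabetically (an assumption of this sketch).
    """
    n = len(docs)
    n_c = sum(1 for y in labels if y == category)
    df = Counter()     # document frequency of each word
    df_c = Counter()   # document frequency within the category
    for words, y in zip(docs, labels):
        for w in set(words):
            df[w] += 1
            if y == category:
                df_c[w] += 1
    scores = {w: chi_square(df_c[w], df[w], n_c, n) for w in df}
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:r_prime]
```

Because the standard statistic squares the difference term, a word that is strongly associated with the absence of the category scores as high as one associated with its presence.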
