Feature selection on hierarchy of web documents

Authors: Dunja Mladenić, Marko Grobelnik

DOI: 10.1016/S0167-9236(02)00097-0

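Abstract: The paper describes feature subset selection used in learning on text data (text learning) and gives a brief overview of feature subset selection as commonly used in machine learning. Several known and some new feature scoring measures appropriate for feature subset selection on large text data are described and related to each other. An experimental comparison of the described measures is given on real-world data collected from the Web. Machine learning techniques are used on data collected from Yahoo, a large text hierarchy of Web documents. Our approach includes some original ideas for handling a large number of features, categories and documents. The high number of features is reduced by feature subset selection and additionally by using a 'stop-list', pruning low-frequency features and using a short description of each document given in the hierarchy instead of the document itself. Documents are represented as feature-vectors that include word sequences instead of including only single words, as is commonly done when learning on text data. An efficient approach to generating word sequences is proposed. Based on the hierarchical structure, we propose a way of dividing the problem into subproblems, each representing one of the categories included in the Yahoo hierarchy. In our experiments, for each of the subproblems, a naive Bayesian classifier was used on text data. The result of learning is a set of independent classifiers, each used to predict the probability that a new example is a member of the corresponding category. Experimental evaluation on real-world data shows that the proposed approach gives good results. The best performance was achieved by feature selection based on a feature scoring measure known from information retrieval called Odds ratio and using a relatively small number of features.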
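
The Odds ratio scoring mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard information-retrieval definition, log[P(w|pos)(1 - P(w|neg)) / ((1 - P(w|pos)) P(w|neg))], computed per category from document frequencies. The function names, the add-one smoothing and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter

def odds_ratio_scores(docs, labels):
    """Score each word by the log odds ratio between positive and negative documents.

    docs:   list of token lists (one per document)
    labels: list of bools, True if the document belongs to the target category
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative document")

    # Document frequency of each word within each class.
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos_df if y else neg_df).update(set(tokens))

    scores = {}
    for w in set(pos_df) | set(neg_df):
        # Add-one smoothing (an assumption of this sketch) keeps both
        # probabilities strictly between 0 and 1, so the log is defined.
        p_pos = (pos_df[w] + 1) / (n_pos + 2)
        p_neg = (neg_df[w] + 1) / (n_neg + 2)
        scores[w] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores

def top_k_features(scores, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Toy data: two documents in the target category, two outside it.
docs = [["football", "score"], ["football", "match"],
        ["stock", "market"], ["stock", "score"]]
labels = [True, True, False, False]
scores = odds_ratio_scores(docs, labels)
selected = top_k_features(scores, 2)
```

In a one-classifier-per-category setup like the one the abstract describes, a scoring of this kind would be run once per category, with that category's documents as positives and all others as negatives. Keeping only the top-scoring words favours features characteristic of the positive class.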
