Authors: Dunja Mladenić, Marko Grobelnik
DOI: 10.1016/S0167-9236(02)00097-0
Keywords:
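Abstract: The paper describes feature subset selection used in learning on text data (text learning) and gives a brief overview of feature subset selection commonly used in machine learning. Several known and some new feature scoring measures appropriate for feature subset selection on large text data are described and related to each other. An experimental comparison of the described measures is given on real-world data collected from the Web. Machine learning techniques are used on data collected from Yahoo, a large text hierarchy of Web documents. Our approach includes some original ideas for handling a large number of features, categories and documents. The high number of features is reduced by feature subset selection and additionally by using a 'stop-list', pruning low-frequency features and using a short description of each document given in the hierarchy instead of using the document itself. Documents are represented as feature-vectors that include word sequences instead of including only single words, as commonly used when learning on text data. An efficient approach to generating word sequences is proposed. Based on the hierarchical structure, we propose a way of dividing the problem into subproblems, each representing one of the categories included in the Yahoo hierarchy. In our learning experiments, for each of the subproblems, a naive Bayesian classifier was used on text data. The result of learning is a set of independent classifiers, each used to predict the probability that a new example is a member of the corresponding category. Our experimental evaluation on real-world data shows that the proposed approach gives good results. The best performance is achieved by feature selection based on a feature scoring measure known from information retrieval called Odds ratio and using a relatively small number of features.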
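The Odds ratio measure named in the abstract can be illustrated with a minimal sketch. It uses the standard log-odds form, log[P(W|pos)(1 - P(W|neg)) / ((1 - P(W|pos)) P(W|neg))], estimated from document frequencies. The function names, the token-list input format, and the add-one smoothing are assumptions made for illustration and are not taken from the paper.

```python
import math


def odds_ratio_scores(pos_docs, neg_docs):
    """Score each word by the log odds ratio between two document sets.

    pos_docs, neg_docs: lists of token lists (hypothetical input format).
    Probabilities are estimated as the fraction of documents containing
    the word, with add-one smoothing (an assumption) so they stay in (0, 1).
    """
    def doc_freq(docs):
        freq = {}
        for tokens in docs:
            for word in set(tokens):  # count each word once per document
                freq[word] = freq.get(word, 0) + 1
        return freq

    pos_freq, neg_freq = doc_freq(pos_docs), doc_freq(neg_docs)
    scores = {}
    for word in set(pos_freq) | set(neg_freq):
        p_pos = (pos_freq.get(word, 0) + 1) / (len(pos_docs) + 2)
        p_neg = (neg_freq.get(word, 0) + 1) / (len(neg_docs) + 2)
        scores[word] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top_features(scores, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

Words that occur mostly in documents of the target category receive high positive scores, so keeping only the top-scoring words yields the kind of small feature set the abstract refers to.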