Clustering based two-stage text classification requiring minimal training data

Authors: Xue Zhang, Wang-xin Xiao

DOI: 10.1109/ICSAI.2012.6223496

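Abstract: Clustering-aided classification methods are based on the assumption that the clusters learned under the guidance of initial training data can somewhat characterize the underlying distribution of the data set. However, our experiments show that whether such an assumption holds depends on both the separability of the considered data set and the size of the training data set. It is often violated on data sets of bad separability, especially when the initial training data are too few. In this case, clustering-based methods would perform worse. In this paper, we propose a clustering-based two-stage text classification approach to address the above problem. In the first stage, labeled and unlabeled data are first clustered with the guidance of the labeled data. Then a self-training style clustering strategy is used to iteratively expand the training data under the guidance of an oracle or expert. At the second stage, discriminative classifiers can subsequently be trained with the expanded labeled data set. Unlike other clustering-based methods, the proposed clustering strategy can effectively cope with data of bad separability. Furthermore, the proposed framework converts the problem of sparsely labeled text classification into a supervised one; therefore, supervised classification models, e.g. SVM, can be applied, and techniques proposed for supervised learning can be used to further improve the classification accuracy, such as feature selection, sampling methods, and data editing or noise filtering. Our experimental results demonstrated the effectiveness of the proposed approach, especially when the size of the training data set is very small.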
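The two-stage idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the rule for choosing which documents the oracle labels, or the classifier settings, so everything below is an assumption made for illustration: seeded nearest-centroid clustering, one oracle query per cluster per round (the unlabeled point closest to the centroid), and a plain perceptron standing in for a discriminative model such as an SVM. All function names and the toy 2-D data are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def seeded_clusters(data, labeled):
    """Assumed stage-1 clustering: one centroid per class, built from the
    currently labeled points; every point joins its nearest centroid."""
    classes = sorted(set(labeled.values()))
    cents = {c: centroid([data[i] for i, y in labeled.items() if y == c])
             for c in classes}
    assign = {i: min(classes, key=lambda c: math.dist(x, cents[c]))
              for i, x in enumerate(data)}
    return cents, assign


def expand_training_set(data, labeled, oracle, rounds=2):
    """Self-training style loop: re-cluster, then ask the oracle to label the
    unlabeled point nearest each centroid (an assumed query rule)."""
    labeled = dict(labeled)
    for _ in range(rounds):
        cents, assign = seeded_clusters(data, labeled)
        for c, cent in cents.items():
            members = [i for i in range(len(data))
                       if i not in labeled and assign[i] == c]
            if members:
                pick = min(members, key=lambda i: math.dist(data[i], cent))
                labeled[pick] = oracle(pick)  # oracle may correct the cluster
    return labeled


def train_perceptron(X, y, epochs=20):
    """Stage 2 stand-in: a linear perceptron for labels in {-1, +1}."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            if t * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b += t
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Toy data: two groups of 2-D "documents"; only one seed label per class.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
truth = [-1, -1, -1, -1, 1, 1, 1, 1]
seeds = {0: -1, 4: 1}


def oracle(i):
    """Hypothetical expert: returns the true label of document i."""
    return truth[i]


expanded = expand_training_set(data, seeds, oracle, rounds=2)
idx = sorted(expanded)
model = train_perceptron([data[i] for i in idx], [expanded[i] for i in idx])
```

Here two seed labels grow to six oracle-confirmed labels after two rounds, and the second-stage classifier is trained only on that expanded set. In a real text setting the tuples would be replaced by document feature vectors and the perceptron by the supervised model of choice.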
