Text classification without negative examples revisit

Authors: Gabriel Pui Cheong Fung, J.X. Yu, Hongjun Lu, P.S. Yu

DOI: 10.1109/TKDE.2006.16

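Abstract: Traditionally, building a classifier requires two sets of examples: positive examples and negative examples. This paper studies the problem of building a text classifier using positive examples (P) and unlabeled examples (U). The unlabeled examples are mixed with both positive and negative examples. Since no negative example is given explicitly, the task of building a reliable text classifier becomes far more challenging. Simply treating all of the unlabeled examples as negative examples and building a classifier thereafter is undoubtedly a poor approach to tackling this problem. Generally speaking, most studies solved this problem by a two-step heuristic: first, extract negative examples (N) from U. Second, build a classifier based on P and N. Surprisingly, most studies did not try to extract positive examples from U. Intuitively, enlarging P by P' (positive examples extracted from U) and building a classifier thereafter should enhance the effectiveness of the classifier. Throughout our study, we find that extracting P' is very difficult. A document in U that possesses the features exhibited in P does not necessarily mean that it is a positive example, and vice versa. The very large size of U and the very high diversity in U also contribute to the difficulties of extracting P'. In this paper, we propose a labeling heuristic called PNLH to tackle this problem. PNLH aims at extracting high quality positive examples and negative examples from U and can be used on top of any existing classifiers. Extensive experiments based on several benchmarks are conducted. The results indicated that PNLH is highly feasible, especially in the situation where |P| is extremely small.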
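The generic two-step heuristic mentioned in the abstract (extract N from U, then build a classifier on P and N) can be illustrated with a minimal sketch. This is not PNLH and not the authors' implementation: the bag-of-words vectors, the cosine-to-centroid ranking used to pick negatives, the nearest-centroid classifier, the `fraction` parameter, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average a list of sparse vectors into one prototype vector."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def extract_negatives(positive, unlabeled, fraction=0.5):
    """Step 1 (assumed rule): take the unlabeled documents least similar to P as N."""
    proto = centroid(positive)
    ranked = sorted(unlabeled, key=lambda vec: cosine(vec, proto))
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def build_classifier(positive, negative):
    """Step 2: build a nearest-centroid classifier from P and N."""
    pos_proto, neg_proto = centroid(positive), centroid(negative)

    def classify(text):
        vec = vectorize(text)
        # True means the document is closer to the positive prototype.
        return cosine(vec, pos_proto) > cosine(vec, neg_proto)

    return classify


# Toy data, invented for illustration only.
P = [vectorize(t) for t in [
    "stock market shares rise",
    "market shares fall on stock news",
]]
U = [vectorize(t) for t in [
    "stock prices and market trends",
    "football team wins the match",
    "tennis match ends in upset",
    "shares rally as market opens",
]]

N = extract_negatives(P, U, fraction=0.5)
classify = build_classifier(P, N)
print(classify("market shares drop"))
print(classify("team loses football match"))
```

In this sketch the positive-looking documents left in U are simply ignored. The abstract's point is that most two-step approaches stop here, whereas PNLH also tries to extract positive examples (P') from U, which this example does not attempt.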
