Using unlabeled data to improve text classification

Authors: Tom M. Mitchell, Kamal Paul Nigam

Abstract: One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data, labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
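The core procedure the abstract describes is semi-supervised EM over a parametric generative model. Below is a minimal sketch of that idea in Python, assuming a multinomial naive Bayes model (the generative model used in this line of work) with dense term-count matrices and a Laplace/Dirichlet smoothing prior; all names (train_semisupervised_nb, X_l, X_u, alpha, etc.) are illustrative, not taken from the dissertation.

```python
import numpy as np

def train_semisupervised_nb(X_l, y_l, X_u, n_classes, n_iters=10, alpha=1.0):
    """EM for multinomial naive Bayes over labeled + unlabeled documents.

    X_l: (n_labeled, vocab) term-count matrix; y_l: int labels in [0, n_classes)
    X_u: (n_unlabeled, vocab) term-count matrix.
    Returns log class priors and log word probabilities (MAP with smoothing alpha).
    """
    def m_step(X, R):
        # R: (n_docs, n_classes) class responsibilities (one-hot for labeled docs).
        class_counts = R.sum(axis=0)            # expected number of docs per class
        word_counts = R.T @ X                   # expected word counts per class
        log_prior = np.log((class_counts + alpha)
                           / (class_counts.sum() + alpha * n_classes))
        log_phi = np.log((word_counts + alpha)
                         / (word_counts.sum(axis=1, keepdims=True)
                            + alpha * X.shape[1]))
        return log_prior, log_phi

    def e_step(X, log_prior, log_phi):
        log_post = X @ log_phi.T + log_prior    # unnormalized log P(class | doc)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    R_l = np.eye(n_classes)[y_l]                # labeled docs keep fixed labels
    log_prior, log_phi = m_step(X_l, R_l)       # initialize from labeled data only
    for _ in range(n_iters):
        R_u = e_step(X_u, log_prior, log_phi)   # E: probabilistically label unlabeled docs
        X_all = np.vstack([X_l, X_u])
        R_all = np.vstack([R_l, R_u])
        log_prior, log_phi = m_step(X_all, R_all)  # M: refit MAP parameters on all data
    return log_prior, log_phi

# Prediction: pick the class with the highest posterior under the learned model,
# e.g. y_hat = np.argmax(X_test @ log_phi.T + log_prior, axis=1)
```

Each iteration alternates an E-step that assigns soft class labels to the unlabeled documents and an M-step that refits the generative model on labeled and (soft-labeled) unlabeled documents together; the labeled documents' responsibilities stay fixed at their one-hot labels throughout.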
