Using unlabeled data to improve text classification

Authors: Tom M. Mitchell, Kamal Paul Nigam

Abstract: One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data, labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
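The core procedure the abstract describes is semi-supervised EM over a parametric generative model. Below is a minimal sketch of that idea in Python, assuming a multinomial naive Bayes model (the generative model used in this line of work) with dense term-count matrices and a Laplace/Dirichlet smoothing prior; all names (train_semisupervised_nb, X_l, X_u, alpha, etc.) are illustrative, not taken from the dissertation.

```python
import numpy as np

def train_semisupervised_nb(X_l, y_l, X_u, n_classes, n_iters=10, alpha=1.0):
    """EM for multinomial naive Bayes over labeled + unlabeled documents.

    X_l: (n_labeled, vocab) term-count matrix; y_l: int labels in [0, n_classes)
    X_u: (n_unlabeled, vocab) term-count matrix.
    Returns log class priors and log word probabilities (MAP with smoothing alpha).
    """
    def m_step(X, R):
        # R: (n_docs, n_classes) class responsibilities (one-hot for labeled docs).
        class_counts = R.sum(axis=0)            # expected number of docs per class
        word_counts = R.T @ X                   # expected word counts per class
        log_prior = np.log((class_counts + alpha)
                           / (class_counts.sum() + alpha * n_classes))
        log_phi = np.log((word_counts + alpha)
                         / (word_counts.sum(axis=1, keepdims=True)
                            + alpha * X.shape[1]))
        return log_prior, log_phi

    def e_step(X, log_prior, log_phi):
        log_post = X @ log_phi.T + log_prior    # unnormalized log P(class | doc)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    R_l = np.eye(n_classes)[y_l]                # labeled docs keep fixed labels
    log_prior, log_phi = m_step(X_l, R_l)       # initialize from labeled data only
    for _ in range(n_iters):
        R_u = e_step(X_u, log_prior, log_phi)   # E: probabilistically label unlabeled docs
        X_all = np.vstack([X_l, X_u])
        R_all = np.vstack([R_l, R_u])
        log_prior, log_phi = m_step(X_all, R_all)  # M: refit MAP parameters on all data
    return log_prior, log_phi

# Prediction: pick the class with the highest posterior under the learned model,
# e.g. y_hat = np.argmax(X_test @ log_phi.T + log_prior, axis=1)
```

Each iteration alternates an E-step that assigns soft class labels to the unlabeled documents and an M-step that refits the generative model on labeled and (soft-labeled) unlabeled documents together; the labeled documents' responsibilities stay fixed at their one-hot labels throughout.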
